Lost in Hex Translation: Scraping ЮТФ-8

Aug. 21, 2019, 10:38 a.m.



As with my previous Russian scraping project, I found a site that offered dual-language books for study and enjoyment. I spent a lunch break browsing through them and noting which ones I'd like to read. There were lots of options, and I narrowed them down to about 50.

This project thus involved some prep work in getting the data, since I only wanted a fraction of the authors and books offered on the web site. In hindsight, I could have scraped the entire list and cherry-picked from there. At the time I didn't know I was even going to attempt to scrape, hence the mish-mash of operations.

Once I had all the links, I read them from a file, pushed them into a list, and processed each one. I built a small tool with the requests library but realized I was dealing with a POST form when it came to actually downloading the book. I thought I would have to switch to Selenium to simulate the "click" that initiates the download. However, these two lines were enough to reproduce the POST request that triggers the download:

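import requests

# Form field the download button submits
keys = {'download': 'booksnippet'}
r = requests.post(book, data=keys)

With that I could save the book to HTML:

# Write the raw response bytes so the page's own encoding is preserved
with open(bookTitle + '.html', 'wb') as file:
    file.write(r.content)
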
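
The download step as a whole can be sketched roughly as follows. This is a minimal sketch, not the exact tool: the helper names read_links and download_book, the links file, and the out_dir parameter are made up for illustration, and the post argument is assumed to be requests.post (passing it in as a parameter keeps the sketch runnable without a network connection).

```python
from pathlib import Path


def read_links(path):
    """Return the non-empty, stripped lines of a links file (one URL per line)."""
    with open(path, encoding="utf-8") as fh:
        return [line.strip() for line in fh if line.strip()]


def download_book(post, url, title, out_dir="."):
    """POST the download form and save the response body as <title>.html.

    `post` is assumed to behave like requests.post: it takes a URL and a
    `data` dict and returns an object with a bytes `content` attribute.
    """
    response = post(url, data={"download": "booksnippet"})
    target = Path(out_dir) / (title + ".html")
    target.write_bytes(response.content)
    return target
```

With requests installed, a call would look like download_book(requests.post, link, bookTitle) inside a loop over read_links("links.txt"), where links.txt is a hypothetical file name.
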
There was also an option to save as PDF, with a slightly different link format. I later decided PDF would be the better option, so I re-ran the scraper with an alternative line that downloads the PDF instead of the HTML.

I kept each book's snippet id and inserted it into a static link in-line, with two different variables available: one for HTML, the other for PDF. The two download types used different paths, so switching between them only meant inserting the book snippet into the appropriate spot of the matching link.
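
As a rough illustration of that idea, the two link formats can be kept as templates with a slot for the snippet. The paths below are hypothetical placeholders (the real site's link layouts are not given here), as are the names LINK_TEMPLATES and build_link.

```python
# Hypothetical link layouts: the real site's paths are not given in the post.
LINK_TEMPLATES = {
    "html": "https://website.com/path/dir/to/{snippet}.html",
    "pdf": "https://website.com/pdf/{snippet}.pdf",
}


def build_link(kind, snippet):
    """Insert the book snippet into the template for the chosen format."""
    return LINK_TEMPLATES[kind].format(snippet=snippet)


snippet = "123-идиот"  # follows the "###-book-title-in-russian" shape
html_link = build_link("html", snippet)
pdf_link = build_link("pdf", snippet)
print(html_link)
print(pdf_link)
```
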

The snippet had the format "###-book-title-in-russian". The formatting was a bit tricky because the URL was a mix of English and Russian: the host part was in English but the snippet was in Russian. I ended up getting bogged down in trying to figure out the encoding. I thought dealing with UTF-8 would be enough. Little did I realize that a string inside the Python interpreter is Unicode text rather than UTF-8 bytes, and whichever layer finally turns it into bytes does not necessarily use UTF-8.

Here was the error I was getting when Python tried reading the Cyrillic book snippet text:

"UnicodeEncodeError: 'ascii' codec can't encode characters in position 40-44: ordinal not in range(128)"

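An error of this kind can be reproduced without any network access. The sketch below assumes the failure came from some layer encoding the URL text with the ASCII codec; the traceback is not shown above, so which layer that was is not known. The URL is the same placeholder used in the debugging example further down.

```python
url = "https://website.com/path/dir/to/book-123-Идиот.html"

# Encoding Unicode text with the ASCII codec fails on the first
# character above code point 127, which here is the Cyrillic title.
try:
    url.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = exc
    print(exc)

# The UTF-8 codec handles the same text without complaint.
utf8_bytes = url.encode("utf-8")
print(utf8_bytes)
```

The string itself is fine; the failure only appears at the moment it is converted to bytes with a codec that has no Cyrillic characters.
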
To begin debugging this issue, I added some print statements to see what was happening behind the scenes. Using the urllib.parse library, I split the URL up and paid attention to the path. The Cyrillic portion parsed fine in the first two print statements. However, when I ran parse.quote on the path, the Cyrillic was converted to percent-encoded hexadecimal, which looked to me as if it was losing the UTF-8 encoding I thought was already in there.

import urllib.parse

url = "https://website.com/path/dir/to/book-123-Идиот.html"
url = urllib.parse.urlsplit(url)
print(url)

>> SplitResult(scheme='https', netloc='website.com', path='/path/dir/to/book-123-Идиот.html', query='', fragment='')

url = list(url)
print(url)

>>
['https', 'website.com', '/path/dir/to/book-123-Идиот.html', '', '']

url[2] = urllib.parse.quote(url[2])
url = urllib.parse.urlunsplit(url)
print(url)

>>
https://website.com/path/dir/to/book-123-%D0%98%D0%B4%D0%B8%D0%BE%D1%82.html

So the Cyrillic was being converted to percent-encoded hexadecimal, as both of these web sites confirmed when I pasted in '%D0%98%D0%B4%D0%B8%D0%BE%D1%82':

https://onlineutf8tools.com/url-encode-utf8
https://www.branah.com/unicode-converter
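
The same check can be done locally. This short sketch shows that the percent-encoded string is just the UTF-8 bytes of the title written out in hexadecimal, and that unquote reverses it; the variable names are only for illustration.

```python
from urllib.parse import quote, unquote

title = "Идиот"

# Each Cyrillic letter here takes two bytes in UTF-8.
utf8_bytes = title.encode("utf-8")

# Writing every byte as %XX by hand gives the same result as quote().
manual = "".join(f"%{byte:02X}" for byte in utf8_bytes)
quoted = quote(title)

print(manual)
print(quoted)
print(unquote(quoted))  # back to the original Cyrillic text
```
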

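After stackoverflowing for a while, I eventually solved this problem by wrapping the Cyrillic book title in "quote(bookTitle)". Far from discarding the UTF-8, the percent-encoded form is the UTF-8: each %XX pair is one byte of the title's UTF-8 encoding, written in hexadecimal. The resulting URL is plain ASCII, so nothing downstream has to guess at an encoding.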
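
Packaged as a helper, the fix might look like the sketch below. The function name encode_url_path is made up for illustration, and the URL is the placeholder from the debugging example. It also shows one thing to watch for: quoting an already quoted URL encodes the percent signs a second time.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def encode_url_path(url):
    """Percent-encode only the path of a URL, leaving scheme and host alone."""
    parts = urlsplit(url)
    return urlunsplit(parts._replace(path=quote(parts.path)))


raw = "https://website.com/path/dir/to/book-123-Идиот.html"
encoded = encode_url_path(raw)
print(encoded)

# The encoded URL is pure ASCII, so this no longer raises UnicodeEncodeError.
encoded.encode("ascii")

# Caution: calling the helper twice turns every "%" into "%25".
twice = encode_url_path(encoded)
print(twice)
```

Because of the double-encoding behavior, it is safest to quote each title exactly once, at the point where the link is built.
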

** Update **
I re-ran this example in 2020, both with and without quote(bookTitle), and there now seems to be no issue with Cyrillic being passed in directly. I don't understand what has changed with my Python interpreter since then; I hadn't updated anything. I ran the original code and then the original without quote(), and both work exactly as intended. I simply no longer have that problem when dealing with Cyrillic and English mixed together.

Once I solved the problem of loading multi-language URLs into the requests library, the results were as expected. I was able to save a number of files whose names stayed in Cyrillic rather than hexadecimal, using requests.post with the proper dictionary value and wrapping the Cyrillic in quote() for the URL.

While I could have done all this manually on the website by clicking around for a while, it felt nice to be able to build this and watch the console window do my job for me, yet again.