Web scraping my way to the Kremlin
Aug. 13, 2019, 11:18 p.m.
I like the idea of getting into programming "because of something". I have a real problem that I want solved, and I can use programming to help me with it. To me this is more pragmatic and helpful than just learning data structures, algorithms and syntax, though I understand their place.
Getting my hands dirty in the code, documentation and dead ends helped me see the beauty of these data structures, but having a priori knowledge of them didn't inspire me or call me to action. There is no avoiding the gauntlet of getting muddy. For this reason I love Python, because it helps me focus on producing output rather than worrying about <C>::yntax (though, as I'll write about later, C++ is not as bad as it's made out to be). I had an interesting problem that I knew web scraping could solve, and I was itching to work my scraping muscles.
A news site had an opinions column that helped English speakers learn Russian-language idioms. The writer has been writing for over 20 years, and sadly many of her older articles are no longer available online. Even though I only discovered her 2 years ago, I panicked. Echoes of my 5th grade Pokemon obsession came back: "Gotta scrape them all!"
In Soviet Russia Pokemon catch YOU!
"What if they get shut down? It's like Alexandria all over again!" Ok maybe it wasn't that dramatic, but it did make me seek out all the articles having to do with learning Russian. I wasn't interested in the English only or political stuff so I had to do some prepwork and pick only links of interest.
So now I had a task on my hands:
1) Create a list of articles to save (~70 for my purposes)
2) Inspect the HTML logic of a few samples, ensuring the formatting is uniform
3) Identify the exact parameters to scrape
4) Let Python work its magic: write some lines with the Python Requests library (see the sketch below)
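To make step 4 concrete, here is a minimal sketch of the fetch-and-save loop, assuming the links of interest were collected by hand into a links.txt file (one URL per line); the filenames are my own placeholders, not anything dictated by the site.

```python
# Minimal sketch of the scrape loop; links.txt and the output filenames
# are placeholders for my own setup, not anything from the site itself.
import requests

with open("links.txt", encoding="utf-8") as f:
    links = [line.strip() for line in f if line.strip()]

for i, url in enumerate(links, start=1):
    response = requests.get(url, timeout=30)
    response.raise_for_status()            # fail loudly if a link has gone dead
    with open(f"article_{i:02d}.html", "w", encoding="utf-8") as out:
        out.write(response.text)           # utf-8 keeps the Cyrillic intact
```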
I had already done a few scraping projects to save pages as HTML. Those pages had pictures, animations and other goodies that would make having just the text a bit bland. Saving the entire page was actually easier with Python; filtering based on certain criteria proved more challenging. While saving HTML helped me flex my scraping muscles as well, I ultimately found those examples trivial: once you have your parameters (the links) set, the scraping runs itself. It's funny how mundane and trivial the solution looks once you reach it.
My code scraped the pages just fine, outputting the 70-odd files I expected and even preserving the Cyrillic (I would later have a problem using Cyrillic as input, but that's for another post), but I couldn't get the formatting to work. There were lots of ['\r\n '] fragments. I began playing around with regular expressions. I wanted to preserve the paragraphs and headers while removing null characters and stray HTML formatting, lest my collection of articles need more work just to be legible.
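My first pass looked roughly like the sketch below; the patterns are illustrative, not the exact ones from my script.

```python
# Rough sketch of the regex whack-a-mole approach; these patterns are
# illustrative, not the exact ones I ended up fighting with.
import re

def clean_with_regex(raw_html):
    text = re.sub(r"<[^>]+>", "", raw_html)    # knock out HTML tags
    text = text.replace("\r\n", "\n")          # normalise Windows line endings
    text = re.sub(r"[ \t]+", " ", text)        # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)     # keep paragraph breaks, drop extras
    return text.strip()
```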
While I initially worked with re.sub() to parse out any garbage HTML formatting, I ultimately found that calling text_content() on the element returned by .xpath(), combined with a quick clean-up with .strip(), did a much better job than trying to manually whack-a-mole every iteration of garbage characters.
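Roughly, the version that stuck looks like this; the XPath selector is a stand-in for whatever actually matches the article body on the real page.

```python
# Sketch of the lxml approach; the XPath selector below is hypothetical,
# standing in for whatever matches the article body on the real page.
import lxml.html

def extract_article_text(raw_html):
    tree = lxml.html.fromstring(raw_html)
    nodes = tree.xpath('//div[@class="article__text"]')  # hypothetical selector
    if not nodes:
        return ""
    # text_content() drops the tags and returns plain text;
    # strip() trims the stray whitespace left at either end.
    return nodes[0].text_content().strip()
```

One call per saved file replaces the whole pile of substitutions.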
Whack-a-mole seems fun at first, but is not necessary when you have beautiful methods to whack 'em for you!
I didn't regret my attempt at using regular expressions; in fact, it worked fine, and I would go on to use it in a different project where it made more sense.
However, in this situation .strip() gave me exactly what I needed - removal of all the garbage before and after my body of text, with the \n sequences coming out as standard line breaks instead of lingering as literal characters.
So now I had 70+ articles I could read even if the website takes them down in the future. A bit of formatting exploration gave me the output I wanted, even though the data first came back in a garbled format. I would go on to flex my scraping muscles on both Russian and non-Russian characters.