Boomerang: Accounting => Programming <= Marketing

July 19, 2019, 1:49 p.m.



I assisted my colleague at work with a web scraping project he wanted to demo for the marketing department. They were manually checking a website's public pricing several times a week for market research but were tired of doing everything manually. I asked him what he wanted exactly. So we set up a meeting and he demonstrated how he accesses the web site and gets the info he needs. He them showed me a sketch of how he wanted everything presented.

I thought this would be a perfect candidate for some web scraping automation. I initially wanted to make the project in Python but one of the front-end developers ended up creating the skeleton in JavaScript. I'm not opposed to JavaScript but the number of "gotchas" made me hesitant.

I worked with the front-ender and explained to him what marketing wanted. We analyzed the web site's elements in the browser, and found AJAX calls outputting to JSON data. Once we isolated the parameters marketing needed, he knocked out some JavaScript and jQuery functions in a few hours. He handed the script off to me to tinker with. The data we wanted was coming in, but there were still some formatting issues. Since I had been automating some accounting reports and procedures, I thought it would be pretty easy to work with.

I analyzed his 7 functions and felt overwhelmed at first. Here I am, in the accounting department, working with a front-end developer on a JavaScript project for marketing. It seemed like so much code to scrape a site that gives you what you want in just a few clicks. Then I realized, we weren't scraping a simple website. The AJAX calls and dynamic data gathering meant it wasn't as simple as finding an xpath and looping through some pages.

I decided to print out the script and go home and try and re-write something in Python. I had several options to work with. Scrapy, which seemed very good at easily accessible data elements, and Selenium which allured me with its ability to be a "virtual browser". I got caught up in the deprecated PhantomJS library and Selenium clicking around. While I was close to getting the desired output, Selenium had its own set of bugs that I had to work out. Not to mention that it was slow and memory consuming, while the JavaScript version was relatively fast.

I spent a few more days tinkering around in Python, happy with the testing and discoveries I had made despite not getting the desired solution. I went back to my front-ender colleague and asked him a few questions about his code. I seemed to be able to follow it but had issues identifying at which line the data actually gets saved. We were taking the JSON data, cleaning it up a bit and then dumping it into a CSV file to be saved. This required a JavaScript library known as PapaParse which I wasn't familiar with.

Once he showed me the little details that eluded me, I know I could focus my energy on JavaScript matrix manipulations. Luckily the heavy lifting with AJAX was done and spreadsheet manipulation became second nature after so long in accounting.

Transposing in Excel is as easy as right clicking and tapping 'T', but JavaScript makes you work for it. I thought about the same issue with Selenium automating everything for you while urllib or scrapy make you do some extra work as well.

I thought about writing a VBA macro that would automatically transpose the matrix once saved but felt it was more of a "hack" than a true understanding, even though it would give me the same results. The deadline wasn't imminent so I didn't need to produce something fast and dirty. I could afford to put some time perfecting it.

After realizing my front-ender colleague checked stack overflow for some of his JavaScript scraping code, I figured I didn't have to re-invent the wheel as much as edit it to my needs. Arrays are usually the first data structures taught in programming courses as they're the building blocks of the rest. I was dealing with a multi-dimensional array, a little more complex but fundamentally still the same. The nature of the JSON data with our custom headers meant I'd be transposing a 4x14 array, with each row except the first containing parameters. The first had the headers we needed. For my colleague's presentation purposes, the headers needed to be column 1 with each cluster of parameters acting as its own column.

After breaking down my front-ender colleague's 7 functions and becoming familiar with what every part did at every part of the scraping process, I was ready to insert my final touches for transposing this matrix.

I found a compact JavaScript function that would help me transpose using keys and maps:

function transpose(a) {
return Object.keys(a[0]).map(function(c) {
return a.map(function(r) { return r[c]; });
});
}
At first I thought it would be easy to transpose an array of arrays. But upon closer inspection of the browser console, I realized what I had was actually an array of objects. The map function would only work on arrays, not objects. Thus my isolated testing of this function worked fine when the input was:

[[1,2,3],
[4,5,6],
[7,8,9]]
But

[{1,2,3},
{4,5,6},
{7,8,9}]
would not work. Thus JavaScript unshifting was necessary. I had to loop through the array of objects and bring every element into a new variable: an array of arrays.

Some well placed console.log statements helped me pinpoint exactly where the transposition should happen (the 7 functions were structured all over the place and didn't allow for standard top-down scanning). What made it all the more tricky was the use of PapaParse.js, which is what handled the conversion of JSON to CSV. I had to time the transposition to occur after the JSON was generated but before it hit PapaParse. This was done by creating a new copy of the list and then transposing it, thus allowing Papa to unparse it and save it as a CSV file. We tested between 1 and 7 different parameter combinations and everything seemed to be going smoothly.

It felt good to be able to learn the hard way and implement a matrix transposition of an array of objects without resorting to a VBA macro quickfix. Marketing was pleased, my colleague got his foot through the door to his next big project, and my front-ender colleague moved on to the start-up world.

*** Update as of January 2020 ***

While this code worked fine throughout the year, the web site has changed dramatically, rendering all that code useless. As our front-ender left, I ended up implementing a new tool from scratch.