If Songs Could Make People Code Things
June 26, 2020, 11:14 p.m.
My first Django course featured a word counter app. I took an idea I had to aggregate all of Frank Sinatra's songs and count how many times the word "love" was said. I decided to make it more general and find out the top words of any artist's songs based on a lyric site.
By this point I had become comfortable with my scraping routine of loading a list from an outside source into an array for which to loop through and scrape what was needed.
Step one was to to gather all song links for an artist. I had the Python console ask for the name of the artist, where any spaces and instances of the word "the" would be replaced by blanks. I double checked the site and any "The Band" type names were simplified to "band" for their artist URLs, which was great for my purposes.
By using URLopen on the artist's static main page (I was lucky for this site), I was able to gather all song links at once. With print statements I made sure the song title and lyrics were printing properly to console first.
While I was getting the data I needed, I found the lyrics had a lot of formatting issues due to the HTML being captured as text. The issue was mostly with stray " \ \n \r breaks. I tried playing with the xpath and some string parsing methods but I still wasn't getting the formatting I wanted (lyrics only with single line breaks after each line and double for each stanza.)
Whatever artist name entered would generate the URL (which has a standard format so simply involves substituting the slug) and then ping that page. A text file would then generate called "artist_song_list.txt" which would save all anchor tagged links from the artist's page. At this point if the artist typed in isn't found, the program will gracefully exit.
If the text file successfully generates, Python will clean up the text file to pattern match only lyrics URLs
# Save only valid song title linksThe next step involved reading this cleaned up text file, loading it into the Py array, then looping through each link. Once the lyrics are collected, they would append to a "artist_lyrics.txt" file while the link appended to a "done" file. I implemented it this way because depending on how many songs an
with open(linkSave, 'w') as fileSave:
for line in filedata:
if "https://www.website.com/lyrics/" in line:
fileSave.write(line + "\n")
artist has, the site may temp ban you and your program will be stuck with an incomplete list. Having a record of what's still left makes it easy to pick up where the program left off. All this is despite enabling a time.sleep function to not hit the database hard.
Each song would have its title and lyrics identified by xpath and then cleaned up of all extraneous HTML characters.
song_title=str(song_title).strip("(\[\'\"|\"\']|\\)")The lyrics were a bit trickier as they had multiple lines. After a few trial and error runs, I found that it was best to clean them up using .strip() and re.sub to fix up the text.
# Clean LyricsFirst I wanted the obvious garbage leftovers from the xpath/HTML stripping such as brackets, carriage returns removed. Once that was done, certain patterns were consistently leftover that would help me identify where the line breaks were removed and needed to be re-integrated.
song_lyrics=str(song_lyrics)
lyricsReplace = re.sub(r"(\['|\']|\"]|\\n|\\t|\\x..|\\r|'\r\n)", "", song_lyrics)
lyricsReplace2 = re.sub(r"(\", '|\", \"|\', \"|\', ')", "\n", lyricsReplace)
song_lyrics = str(lyricsReplace2).strip()
Any combination of \", \' or a " '" (space apostrophe) that was replaced by a line break \n gave me the exact formatting I was looking for.
Once the lyrics format was confirmed, the lyrics would append to the lyrics file, the current link saved to a done file and removed from the initial songList file.
As a temporary fix for the captcha issue, I made a second .py file which asks you to input the artist's name you were working on. It will then continue from where you left off in the original file without having to get all the links from scratch.
From my last testing, the Django Word Counter is limited to 4094 characters, which makes getting an accurate word count on the entire lyrical sheet difficult. In the future I may implement a text upload of the entire lyrical sheet and find a method that allows for a much higher character limit.
While I didn't 100% achieve the goal I wanted, it was nice to be able to build yet another scraping tool, this time for an artist's lyrics with an interactive console and text file interplay.