7.11.2004

mp3 blog bash snarf script

I was recently pointed to this blog post in which Jeffrey Veen gives advice on using wget to automatically download songs from your favorite mp3 blogs.

The downside is that you have to keep every song you download, because wget will fetch any deleted file again for as long as it still appears on the mp3 blogs. I tried it and quickly racked up many megabytes of songs. After listening to them I decided to keep about 5, but deleting the rest would just make wget download them again.

Enter this bash script I jotted down, which doesn't require keeping the files. It keeps a list of what you've already downloaded, cruises the blogs to build a list of what's available for download, compares the two, and downloads only the new links.
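The comparison step is the heart of it, so here is a minimal sketch of just that part with no network involved. The helper name new_links, the made-up example.com URLs and the temp file names are my own placeholders, not something the script itself uses.

```shell
# Sketch only: print every line of the "available" list ($1)
# that is not already in the "downloaded" list ($2).
new_links() {
    # Make sure the history file exists so grep has something to read.
    touch "$2"
    # -F literal strings, -x whole-line matches only,
    # -v keep the lines that do NOT match, -f read patterns from a file.
    grep -vxFf "$2" "$1"
}

# Made-up example data in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'http://example.com/a.mp3' 'http://example.com/b.mp3' > "$workdir/available"
printf '%s\n' 'http://example.com/a.mp3' > "$workdir/downloaded"

# a.mp3 is already in the history, so only b.mp3 is printed.
new_links "$workdir/available" "$workdir/downloaded"
```

Matching whole lines as literal strings means a dot or question mark in a URL can't accidentally act as a regular expression.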

I suggest making an mp3 directory, putting the script in there, and running it from that directory with a leading "./". The only modification you need to make is filling in the bloglist with your favorite mp3 blogs.

It's not foolproof. If the blog is up but the link to an mp3 is dead when the script visits the blog, the mp3 will still be put on the list as "downloaded". Hey, you get what you pay for.
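If the dead-link problem bugs you, one possible workaround is to download the links one at a time and only record the ones that actually succeed. This is a rough sketch and not what the script below does; fetch_and_record and the DOWNLOADER variable are names I made up for illustration, and it assumes your wget returns a nonzero exit status when a download fails.

```shell
# Sketch only: download each URL listed in $1 and append it to the
# history file $2 only if the download command succeeds.
# DOWNLOADER is a hypothetical override; it defaults to "wget -N".
fetch_and_record() {
    while IFS= read -r url; do
        # Skip blank lines.
        [ -n "$url" ] || continue
        if ${DOWNLOADER:-wget -N} "$url"; then
            printf '%s\n' "$url" >> "$2"
        else
            # Not recorded, so the next run will try this link again.
            echo "failed, will retry next run: $url" >&2
        fi
    done < "$1"
}

# Possible usage inside the script, replacing the wget and cat steps:
#   fetch_and_record ./.tmp2 ./.downloaded
```

The tradeoff is that a link that stays dead gets retried on every run until it drops off the blog's index page.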

(edited for cosmetic script changes)



#!/bin/bash

# This shell script is designed to automatically
# download songs from your favorite mp3 blogs
# It ain't pretty but it works

echo Initializing wage\'s mp3 blog snarfer - http://villa-straylight.blogspot.com/
echo 'Visiting blogs.....'
# It takes zip files because some blogs zip their mp3s,
# If no sites you visit use zips for mp3s but have zips for something else
# and it's screwing you up, replace the "\.mp3\|\.zip" below with "\.mp3"

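function fetch {
# keep the mp3/zip lines of the page dump, drop the leading link
# number, and trim whitespace from both ends of each link
lynx -dump "$1" | grep "\.mp3\|\.zip" | awk '{$1=""; print }' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' >> .tmp
}

# ^^ the function above visits a blog and appends
# a list of all its mp3 and .zip links to a hidden file named '.tmp'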

# list your favorite sites below here, make sure to
# add "fetch" before each URL, e.g. 'fetch http://my.favorite.blog.com/'

##### ---------------bloglist--------------- #####
fetch http://newflux.blogspot.com/
fetch http://tofuhut.blogspot.com/
fetch http://www.mysticalbeast.blogspot.com/
fetch http://www.tangmonkey.com/blogs/music/
fetch http://thenewpink.net/womenfolk/
fetch http://amillionlovesongs.blogspot.com/
fetch http://www.londonlee.com/blog.html/
##### ---------------bloglist--------------- #####

echo 'Checking for new music.....'

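touch ./.downloaded;
# Makes sure the list of downloaded songs exists on the first run,
# otherwise grep errors out and nothing gets downloaded

sed 's/[[:space:]]*$//' ./.tmp | grep -v -x -F -f ./.downloaded > .tmp2;
# Filters '.tmp' through list of already
# downloaded songs (matched as exact, literal lines)
# and outputs it to '.tmp2'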

rm ./.tmp;
# Removes the .tmp file

wget -N -i ./.tmp2;
# downloads all songs in 'tmp2'

cat ./.tmp2 >> ./.downloaded;
# adds these files to the hidden 'downloaded' file

rm ./.tmp2
# remove '.tmp2'

# Note, the list of downloaded songs will get larger over time
# though probably not enough to really notice.
# Since we only get songs off the index page of the blogs,
# it should be safe to delete older entries that no longer appear on the blogs
# the older entries will be at the top of the file

#
# TODO maybe:
# make self contained by appending downloaded list to this script
# add proxy option for wget and lynx
# add dialogue for adding new blogs?
# add scheme for remembering which blog the mp3 came from
# hassle bloggers to put the band name in the filenames, dammit
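
The note above about trimming the list of downloaded songs could be automated along these lines. This is only a sketch: prune_history is a name I made up, and it assumes you hand it a list of the links currently on the blogs, which in the script means running it before '.tmp' is removed.

```shell
# Sketch only: keep just the history entries ($2) that still appear
# in the list of links currently on the blogs ($1).
prune_history() {
    # If the current list is empty (say the blogs were unreachable),
    # leave the history alone rather than wiping it.
    [ -s "$1" ] || return 0
    # -x whole lines, -F literal strings, -f patterns from a file.
    grep -xFf "$1" "$2" > "$2.new"
    mv "$2.new" "$2"
}

# Possible usage inside the script, before the 'rm ./.tmp' step:
#   prune_history ./.tmp ./.downloaded
```

Anything that has scrolled off the index pages is dropped from the history, which is safe for the reason given in the note: the script never sees those links again.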


-------------------------------


(short version)

#!/bin/bash
echo Initializing wage\'s mp3 blog snarfer - http://villa-straylight.blogspot.com/
echo 'Visiting blogs.....'
function fetch {
lynx -dump "$1" | grep "\.mp3\|\.zip" | awk '{$1=""; print }' | sed 's/^[[:space:]]*//;s/[[:space:]]*$//' >> .tmp
}
##### ---------------bloglist--------------- #####
# add blogs one per line with "fetch" before them like this:
fetch http://some-blog.com
##### ---------------bloglist--------------- #####
echo 'Checking for new music.....'
touch ./.downloaded; sed 's/[[:space:]]*$//' ./.tmp | grep -v -x -F -f ./.downloaded > .tmp2; rm ./.tmp; wget -N -i ./.tmp2; cat ./.tmp2 >> ./.downloaded; rm ./.tmp2


