For the last few years I have listened to the podcasts from SXSW while travelling to work or, more recently, while at the gym.
Each year the number of podcasts has increased, to the point where this year I am too lazy to click through to every page and DownThemAll!. Besides, it’s more fun to write a bash script to do the heavy lifting.
You will need …
- cURL
- grep
- uniq
- wget
- bash version 4 or above. If you don’t have bash 4, there are slightly different instructions for you.
Except for cURL, and maybe wget, everything else should come standard with your Linux distro. To check which version of bash you’re running enter the following at the command line:
echo $BASH_VERSION
Consult your distro’s documentation on how to install anything you might be missing and continue to the next section.
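If something is missing, the exact install command depends on your distro. On a Debian or Ubuntu style system (an assumption on my part, adjust for your package manager) it would look something like this:

# Install cURL and wget; grep, uniq and bash should already be there
sudo apt-get install curl wget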
The script
Open your favorite text editor and copy and save the following into a new file (I called mine getpodcasts.sh):
#!/bin/bash
for i in {0..13..1}; do
curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=$i | grep -o 'http.*\.mp3' | uniq | wget -w 2 -i -
done
Let’s go through the script line by line:
#!/bin/bash
The first line tells the system which interpreter to use for this script. You could replace /bin/bash with /usr/bin/python or /usr/bin/php if you had written the script in Python or PHP.
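For example, if this had been a Python or PHP script, the first line might look like one of these instead (assuming the interpreters live in their usual locations; check your own system):

#!/usr/bin/python
#!/usr/bin/php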
for i in {0..13..1}; do
This starts a bash for loop using the {start..end..increment} range syntax. This form only works in bash v4 and above; if you are using an older version of bash, you’ll need a different kind of for loop (see the sketch below). I got the number 13 from the SXSW site: there are 13 pages.
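If your bash is older than version 4, a C-style for loop is a drop-in replacement that does the same thing (a rough sketch, the rest of the script is unchanged):

# C-style loop, works on older versions of bash as well
for ((i=0; i<=13; i++)); do
curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=$i | grep -o 'http.*\.mp3' | uniq | wget -w 2 -i -
done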
curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=$i
The next line will be broken up into its components. You can try each component from the command line to watch how the text output gets transformed as we add components to the command. First is the cURL instruction; here we’re using it to download and spit out the SXSW HTML pages. cURL normally prints out its progress, but the -s tells it to be silent. The URL ends with $i, which is incremented with each pass of the loop. Essentially the for loop and cURL are downloading podcast pages one page at a time.
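For instance, to see the raw HTML that cURL fetches for the first page, run it on its own and page through the output:

# Fetch page 0 and page through the raw HTML
curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=0 | less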
grep -o 'http.*\.mp3'
Grep searches each line of the downloaded pages for anything that looks like an MP3 URL; the -o flag tells it to print out only the matching part of each line.
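Adding the grep to the cURL command should leave you with just the MP3 URLs (try it on page 0; the duplicates are still there at this point):

# Just the MP3 URLs from page 0, duplicates and all
curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=0 | grep -o 'http.*\.mp3'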
uniq
Removes duplicates. The grep matches at least two copies of each podcast URL on every page, and uniq condenses these down to one. Ordinarily I would use sort -u to first sort and then remove duplicates, but since the returned data is already sorted we can just use uniq.
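If you'd rather not rely on the pages already being sorted, the sort -u variant of the pipeline would look like this:

# Same pipeline, but sorting and de-duplicating in one step
curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=$i | grep -o 'http.*\.mp3' | sort -u | wget -w 2 -i -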
wget -w 2 -i -
Finally we start downloading the podcasts with wget. The -w 2 tells wget to wait 2 seconds between each download, just to be friendly to the servers. The -i - tells wget to use stdin (the text piped through from uniq) as the input file of download links.
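wget has a couple of other flags that can be handy here, though they are optional and not part of the script above: -nc skips files you have already downloaded, and -P drops everything into a separate directory.

# Optional extras: skip existing files and save into a podcasts/ directory
wget -w 2 -nc -P podcasts -i -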
Finally
Save your script and chmod it so that it can be executed:
chmod u+x getpodcasts.sh
Run it:
./getpodcasts.sh
It may take a moment while cURL downloads the initial web page (and then every 4 or so podcasts it will pause again), but soon you should be up to your ears in podcasts.
P.S.
If all you want is a list of podcast URLs in a file, there are two things you need to do. First, change the third line of the script (the curl line) to:
echo `curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=$i | grep -o 'http.*\.mp3' | uniq`
i.e. drop the wget, add an echo at the front and wrap the whole curl command in backticks (`).
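Putting that together, the modified script would look something like this:

#!/bin/bash
for i in {0..13..1}; do
echo `curl -s http://2009.sxsw.com/interactive/news/videos_and_podcasts/more?page=$i | grep -o 'http.*\.mp3' | uniq`
done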
Secondly run the script like so to save the data in a file called podcasts.txt:
./getpodcasts.sh > podcasts.txt