Scraping artists bios off of Wikipedia

I’ve been hacking away at BrightonSound.com and I’ve been looking for a way of automatically sourcing biographical information from artists, so that visitors are presented with more information on the event.

The Songbird media player plugin ‘mashTape’ draws upon a number of web services to grab artist bio, event listings, youtube vidoes and flickr pictures of the currently playing artist. I was reading through the mashTape code, and then found this posting by its developer, which helpfully provided the exact method I needed.

I then hacked up two versions of the code, a PHP version using simpleXML:

<?php
function grabwiki($band){
    $band = urlencode($band);
    $yahoourl = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?".
"appi   d=YahooDemo&query=%22$band%22%20music&site=wikipedia.org";
    $x = file_get_contents($yahoourl);
    $s = new SimpleXMLElement($x);
    $ar = split('/',$s->Result->Url);
    if($ar[2] == 'en.wikipedia.org'){
      $wikikey = $ar[4]; // more than likely to be the wikipedia page
    }else{
      return ""; // nothing on wikipediea
    }
    $url = "http://dbpedia.org/data/$wikikey";
    $x = file_get_contents($url);
    $s = new SimpleXMLElement($x);
    $b = $s->xpath("//p:abstract[@xml:lang='en']");
     return $b[0];
 }
?>

and a pythonic version using the amara XML library (has to be installed seperately):

import amara
import urllib2
from urllib import urlencode

def getwikikey(band):
  url = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=%22"+band+"%22&site=wikipedia.org";
  print url
  c=urllib2.urlopen(url)
  f=c.read()
  doc = amara.parse(f)
  url = str(doc.ResultSet.Result[0].Url)
  return url.split('/')[4]

def uurlencode(text):
   """single URL-encode a given 'text'.  Do not return the 'variablename=' portion."""
   blah = urlencode({'u':text})
   blah = blah[2:]
   return blah

def getwikibio(key):
  url = "http://dbpedia.org/data/"+str(key);
  print url
  try:
    c=urllib2.urlopen(url)
    f=c.read()
  except Exception, e:
    return ''
  doc = amara.parse(f)
  b = doc.xml_xpath("//p:abstract[@xml:lang='en']")
  try:
    r = str(b[0])
  except Exception, e:
    return ''
  return r

def scrapewiki(band):
  try:
    key = getwikikey(uurlencode(band))
  except Exception, e:
    return ''
  return getwikibio(key)

  #unit test
  #print scrapewiki('guns n bombs')
  #print scrapewiki('diana ross')

There we go, artist bio scraping from wikipedia.

Leave a Reply