Scraping artist bios off Wikipedia

June 18th, 2008 by David Craddock

I’ve been hacking away at BrightonSound.com and looking for a way to automatically source biographical information about artists, so that visitors are presented with more information on each event.

The Songbird media player plugin ‘mashTape’ draws upon a number of web services to grab the artist bio, event listings, YouTube videos and Flickr pictures for the currently playing artist. I was reading through the mashTape code and then found this posting by its developer, which helpfully provided the exact method I needed.

I then hacked up two versions of the code, a PHP version using SimpleXML:

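<?php
// Look up an artist's English Wikipedia abstract via Yahoo search and DBpedia.
function grabwiki($band){
    $band = urlencode($band);
    $yahoourl = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?".
        "appid=YahooDemo&query=%22$band%22%20music&site=wikipedia.org";
    $x = file_get_contents($yahoourl);
    if($x === false){
      return ""; // search request failed
    }
    $s = new SimpleXMLElement($x);
    // explode() replaces the deprecated split(); the URL is scheme://host/wiki/Key
    $ar = explode('/', (string) $s->Result->Url);
    if(isset($ar[4]) && $ar[2] == 'en.wikipedia.org'){
      $wikikey = $ar[4]; // more than likely to be the Wikipedia page
    }else{
      return ""; // nothing on Wikipedia
    }
    $url = "http://dbpedia.org/data/$wikikey";
    $x = file_get_contents($url);
    if($x === false){
      return ""; // no DBpedia data for this key
    }
    $s = new SimpleXMLElement($x);
    // assumes the returned document binds the "p" prefix used in this query
    $b = $s->xpath("//p:abstract[@xml:lang='en']");
    return $b ? (string) $b[0] : ""; // empty string when no English abstract is found
}
?>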

and a Python version using the amara XML library (which has to be installed separately):

import amara
import urllib2
from urllib import urlencode
 
def getwikikey(band):
  url = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=%22"+band+"%22&site=wikipedia.org"
  print url
  c=urllib2.urlopen(url)
  f=c.read()
  doc = amara.parse(f)
  url = str(doc.ResultSet.Result[0].Url)
  return url.split('/')[4]
 
def uurlencode(text):
   """single URL-encode a given 'text'.  Do not return the 'variablename=' portion."""
   blah = urlencode({'u':text})
   blah = blah[2:]
   return blah
 
def getwikibio(key):
  url = "http://dbpedia.org/data/"+str(key)
  print url
  try:
    c=urllib2.urlopen(url)
    f=c.read()
  except Exception, e:
    return ''
  doc = amara.parse(f)
  b = doc.xml_xpath("//p:abstract[@xml:lang='en']")
  try:
    r = str(b[0])
  except Exception, e:
    return ''
  return r
 
def scrapewiki(band):
  try:
    key = getwikikey(uurlencode(band))
  except Exception, e:
    return ''
  return getwikibio(key)
 
#unit test
#print scrapewiki('guns n bombs')
#print scrapewiki('diana ross')
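
Both versions rely on the same two string-handling steps: quoting the band name into the search query, and pulling the page key out of the first result's URL. A minimal Python 3 sketch of just those steps, with no network calls, might look like this. The function names `build_query` and `wiki_key_from_url` are made up for illustration, and the standard library's `quote_plus` stands in for the hand-rolled `uurlencode` helper above.

```python
from urllib.parse import quote_plus, urlsplit


def build_query(band):
    """Return the query-string part of the search URL for a band name.

    Hypothetical helper: wraps the name in double quotes, appends the
    word "music" (as the PHP version does) and URL-encodes the result.
    """
    return "query=" + quote_plus('"%s" music' % band) + "&site=wikipedia.org"


def wiki_key_from_url(url):
    """Return the page key from an English Wikipedia URL, or '' otherwise.

    Hypothetical helper mirroring the split('/') logic in the listings:
    for http://en.wikipedia.org/wiki/Diana_Ross the key is 'Diana_Ross'.
    """
    parts = urlsplit(url)
    if parts.netloc != "en.wikipedia.org":
        return ""
    # A path of "/wiki/Diana_Ross" splits into ["", "wiki", "Diana_Ross"].
    segments = parts.path.split("/")
    if len(segments) < 3 or segments[1] != "wiki":
        return ""
    return segments[2]


if __name__ == "__main__":
    print(build_query("diana ross"))
    print(wiki_key_from_url("http://en.wikipedia.org/wiki/Diana_Ross"))
```

Like the original listings, this keeps only the first path segment after `/wiki/`, so a page title that itself contains a slash would be cut short.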

There we go, artist bio scraping from Wikipedia.
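
The XPath expression in both versions only matches if the `p` prefix is bound to the namespace the document uses for its abstract elements. As a rough illustration of the same lookup with the namespace spelled out, here is a Python 3 sketch using the standard library's ElementTree rather than amara. The sample document, its `http://example.org/property/` namespace and the function name `abstract_for_lang` are all invented for the example; the real namespace should be read from the actual DBpedia response.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is always bound to this namespace, so xml:lang
# appears under this qualified attribute name after parsing.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample data, not a real DBpedia response.
SAMPLE_NS = "http://example.org/property/"
SAMPLE_DOC = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:p="http://example.org/property/">
  <rdf:Description rdf:about="http://example.org/resource/Some_Band">
    <p:abstract xml:lang="de">Eine Band aus Brighton.</p:abstract>
    <p:abstract xml:lang="en">A band from Brighton.</p:abstract>
  </rdf:Description>
</rdf:RDF>"""


def abstract_for_lang(xml_text, property_ns, lang="en"):
    """Return the text of the first abstract element in the given language.

    Equivalent in spirit to //p:abstract[@xml:lang='en'], but the
    namespace is passed in explicitly instead of relying on a prefix.
    Returns '' when no matching element exists.
    """
    root = ET.fromstring(xml_text)
    for node in root.iter("{%s}abstract" % property_ns):
        if node.get(XML_LANG) == lang:
            return node.text or ""
    return ""


if __name__ == "__main__":
    print(abstract_for_lang(SAMPLE_DOC, SAMPLE_NS))
```

Passing the namespace in explicitly means the lookup does not depend on which prefix the server happens to choose for it.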


3 Responses

  1. James Says:

    I’m looking to use the PHP code with a form element, but I’m having trouble going about this.
    I was trying to work with the following form element because it originated from a script using the lyricwiki api.

    <form action="" method="get">
    Artist:
    <input type="text" name="artist" value="" id="artist" />
    Songname:
    <input type="text" name="song" value="" id="song"/>

    If you could assist me I would greatly appreciate it.
    Thanks,
    James

  2. James Says:

    I think the yahoo api reference is outdated.

  3. James Says:

    I got the form element to work, but I cannot find what’s wrong with this line of code: $b = $s->xpath("//p:abstract[@xml:lang='en']");
    It constantly returns an error.
