Scraping artist bios off Wikipedia
June 18th, 2008 by
David Craddock
I’ve been hacking away at BrightonSound.com and looking for a way to automatically source biographical information about artists, so that visitors are presented with more information about each event.
The Songbird media player plugin ‘mashTape’ draws upon a number of web services to grab the artist bio, event listings, YouTube videos and Flickr pictures for the currently playing artist. I was reading through the mashTape code, and then found this posting by its developer, which helpfully provided the exact method I needed.
I then hacked up two versions of the code, a PHP version using SimpleXML:
<?php
// Look up a band's English abstract on DBpedia via a Wikipedia search hit.
function grabwiki($band){
    $band = urlencode($band);
    $yahoourl = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?".
        "appid=YahooDemo&query=%22$band%22%20music&site=wikipedia.org";
    $x = file_get_contents($yahoourl);
    $s = new SimpleXMLElement($x);
    // explode() replaces split(), which newer PHP versions no longer provide
    $ar = explode('/', (string) $s->Result->Url);
    if(isset($ar[4]) && $ar[2] == 'en.wikipedia.org'){
        $wikikey = $ar[4]; // more than likely to be the wikipedia page
    }else{
        return ""; // nothing on wikipedia
    }
    $url = "http://dbpedia.org/data/$wikikey";
    $x = file_get_contents($url);
    $s = new SimpleXMLElement($x);
    $b = $s->xpath("//p:abstract[@xml:lang='en']");
    return $b ? (string) $b[0] : ""; // empty string if no English abstract
}
?>
and a Python version using the amara XML library (which has to be installed separately):
import amara
import urllib2
from urllib import urlencode

def getwikikey(band):
    # ask Yahoo for the top wikipedia.org hit and take the page name from its URL
    url = "http://api.search.yahoo.com/WebSearchService/V1/webSearch?appid=YahooDemo&query=%22"+band+"%22&site=wikipedia.org"
    print url
    c=urllib2.urlopen(url)
    f=c.read()
    doc = amara.parse(f)
    url = str(doc.ResultSet.Result[0].Url)
    return url.split('/')[4]

def uurlencode(text):
    """single URL-encode a given 'text'.  Do not return the 'variablename=' portion."""
    blah = urlencode({'u':text})
    blah = blah[2:]
    return blah

def getwikibio(key):
    # fetch the DBpedia data for the page and pull out the English abstract
    url = "http://dbpedia.org/data/"+str(key)
    print url
    try:
        c=urllib2.urlopen(url)
        f=c.read()
    except Exception, e:
        return ''
    doc = amara.parse(f)
    b = doc.xml_xpath("//p:abstract[@xml:lang='en']")
    try:
        r = str(b[0])
    except Exception, e:
        return ''
    return r

def scrapewiki(band):
    try:
        key = getwikikey(uurlencode(band))
    except Exception, e:
        return ''
    return getwikibio(key)

#unit test
#print scrapewiki('guns n bombs')
#print scrapewiki('diana ross')
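Both versions get the page key by splitting the search result URL on '/' and taking the fifth piece. As a rough sketch of that step on its own, here is a Python 3 variant (not the Python 2 used above) that also checks the host, as the PHP version does. The function names `wiki_key_from_url` and `dbpedia_data_url` are made up for illustration, and the `/wiki/` path prefix is an assumption about how Wikipedia article URLs look.

```python
from urllib.parse import urlsplit

def wiki_key_from_url(url, host="en.wikipedia.org"):
    """Return the article key from a Wikipedia URL, or None if it is not one."""
    parts = urlsplit(url)
    # Assumption: article URLs look like http://en.wikipedia.org/wiki/<key>
    if parts.netloc != host or not parts.path.startswith("/wiki/"):
        return None
    key = parts.path[len("/wiki/"):]
    return key or None

def dbpedia_data_url(key):
    """Build the DBpedia data URL used in the post for a given page key."""
    return "http://dbpedia.org/data/" + key

print(wiki_key_from_url("http://en.wikipedia.org/wiki/Diana_Ross"))  # Diana_Ross
```

Returning None for anything that is not an English Wikipedia article lets the caller bail out early instead of requesting a DBpedia page that cannot exist.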
There we go: artist bio scraping from Wikipedia.
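The XPath expression `//p:abstract[@xml:lang='en']` only works if the `p` prefix resolves to a namespace. Here is a small Python 3 sketch of the same selection using only the standard library, with the namespace spelled out explicitly. The namespace URI for `p` and the sample XML are assumptions made up for illustration, so check the `xmlns:p` declaration in the real response before relying on them. `english_abstract` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the "p" prefix; verify against the real data.
P_NS = "http://dbpedia.org/property/"
# The xml: prefix is always bound to this URI by the XML specification.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample, shaped like an RDF/XML response.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:p="http://dbpedia.org/property/">
  <rdf:Description rdf:about="http://dbpedia.org/resource/Example_Band">
    <p:abstract xml:lang="de">Beispieltext.</p:abstract>
    <p:abstract xml:lang="en">Example abstract text.</p:abstract>
  </rdf:Description>
</rdf:RDF>
"""

def english_abstract(xml_text, lang="en"):
    """Return the first abstract in the given language, or '' if none."""
    root = ET.fromstring(xml_text)
    # ElementTree names namespaced tags as "{uri}localname".
    for node in root.iter("{%s}abstract" % P_NS):
        if node.get(XML_LANG) == lang:
            return (node.text or "").strip()
    return ""

print(english_abstract(SAMPLE))  # Example abstract text.
```

Matching on the full namespace URI rather than the prefix means the lookup still works if the document happens to bind the same namespace to a different prefix.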
August 15th, 2008 at 2:36 pm
I’m looking to use the PHP code with a form element, but I’m having trouble going about it.
I was trying to work with the following form element because it originated from a script using the LyricWiki API.
<form action="" method="get">
Artist:
<input type="text" name="artist" value="" id="artist" />
Songname:
<input type="text" name="song" value="" id="song"/>
If you could assist me I would greatly appreciate it.
Thanks,
James
August 15th, 2008 at 3:28 pm
I think the Yahoo API reference is outdated.
August 16th, 2008 at 12:31 pm
I got the form element to work, but I cannot find what’s wrong with this line of code: $b = $s->xpath("//p:abstract[@xml:lang='en']");
It constantly returns an error.