<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DavidCraddock.net &#187; python</title>
	<atom:link href="http://www.davidcraddock.net/tag/python/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.davidcraddock.net</link>
	<description>My Technology Site</description>
	<lastBuildDate>Tue, 22 Nov 2011 13:45:33 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Scraping Gumtree Property Adverts with Python and BeautifulSoup</title>
		<link>http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/</link>
		<comments>http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/#comments</comments>
		<pubDate>Sun, 01 May 2011 14:07:02 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[property adverts]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[scraping Gumtree]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=886</guid>
		<description><![CDATA[I am moving to Manchester soon, and so I thought I&#8217;d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once in the database, I could do things like find the average monthly price for a 2 bedroom flat in an area, and spot [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/05/soup.jpg"><img style="border: none" src="http://www.davidcraddock.net/wp-content/uploads/2011/05/soup-300x199.jpg" alt="" title="soup" width="300" height="199" class="aligncenter size-medium wp-image-897" /></a></p>
<p>I am moving to Manchester soon, so I thought I&#8217;d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once the adverts were in the database, I could run simple SQL queries via <a href="http://www.phpmyadmin.net/home_page/index.php">phpMyAdmin</a> to find the average monthly price for a 2 bedroom flat in an area, and to spot bargains by checking how many standard deviations a price falls below the mean.</p>
<p>I really like the Python library <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> for writing scrapers; there is also a similar Java library called <a href="http://jsoup.org/">jsoup</a>. BeautifulSoup does a really good job of tolerating markup mistakes in the input data, and transforms a page into a tree structure that is easy to work with.</p>
<p>I chose the following layout for the program:</p>
<p><strong>advert.py</strong> &#8211; Stores all information about each property advert, with a &#8216;save&#8217; method that inserts the data into the MySQL database<br />
<strong>listing.py</strong> &#8211; Stores all the information on each listing page, which is broken down into links for specific adverts, and also the link to the next listing page in the sequence (i.e. the &#8216;next page&#8217; link)<br />
<strong>scrapeAdvert.py</strong> &#8211; When given an advert URL, this creates and populates an advert object<br />
<strong>scrapeListing.py</strong> &#8211; When given a listing URL, this creates and populates a listing object<br />
<strong>scrapeSequence.py</strong> &#8211; This walks through a series of listings, calling scrapeListing and scrapeAdvert for all of them, and finishes when there are no more listings in the sequence to scrape</p>
<p>Here is the MySQL table I created for this project (which you will have to set up if you want to run the scraper):</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #808080; font-style: italic;">--</span>
<span style="color: #808080; font-style: italic;">-- Database: `manchester`</span>
<span style="color: #808080; font-style: italic;">--</span>
&nbsp;
<span style="color: #808080; font-style: italic;">-- --------------------------------------------------------</span>
&nbsp;
<span style="color: #808080; font-style: italic;">--</span>
<span style="color: #808080; font-style: italic;">-- Table structure for table `adverts`</span>
<span style="color: #808080; font-style: italic;">--</span>
&nbsp;
<span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> <span style="color: #993333; font-weight: bold;">IF</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">EXISTS</span> <span style="color: #ff0000;">`adverts`</span> <span style="color: #66cc66;">&#40;</span>
  <span style="color: #ff0000;">`url`</span> <span style="color: #993333; font-weight: bold;">VARCHAR</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">255</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`title`</span> text <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`pricePW`</span> <span style="color: #993333; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">10</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">UNSIGNED</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`pricePCM`</span> <span style="color: #993333; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">11</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`location`</span> text <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`dateAvailable`</span> <span style="color: #993333; font-weight: bold;">DATE</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`propertyType`</span> text <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`bedroomNumber`</span> <span style="color: #993333; font-weight: bold;">INT</span><span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">11</span><span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #ff0000;">`description`</span> text <span style="color: #993333; font-weight: bold;">NOT</span> <span style="color: #993333; font-weight: bold;">NULL</span><span style="color: #66cc66;">,</span>
  <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span> <span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">`url`</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>MyISAM <span style="color: #993333; font-weight: bold;">DEFAULT</span> CHARSET<span style="color: #66cc66;">=</span>latin1;</pre></div></div>

<p>pricePCM is the price per calendar month and pricePW is the price per week. Each advert will usually have one or the other specified.</p>
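
<p>As a rough illustration of the kind of analysis this table makes possible, here is a minimal Python 3 sketch that flags adverts priced well below the mean. It is not part of the scraper: the sample rows, the monthly_price and find_bargains helpers and the one-standard-deviation threshold are all assumptions made for the example. Weekly prices are converted to monthly ones by spreading 52 weeks over 12 months.</p>

```python
import statistics


def monthly_price(price_pcm, price_pw):
    # Adverts usually give one price or the other; 0 means "not specified".
    if price_pcm:
        return float(price_pcm)
    return price_pw * 52 / 12.0


def find_bargains(rows, threshold=1.0):
    # rows: (url, pricePCM, pricePW) tuples, e.g. the result of a SELECT on adverts.
    priced = [(url, monthly_price(pcm, pw)) for url, pcm, pw in rows if pcm or pw]
    prices = [price for _, price in priced]
    if not prices:
        return []
    mean = statistics.mean(prices)
    stdev = statistics.pstdev(prices)
    # A "bargain" here is anything more than `threshold` standard deviations below the mean.
    cutoff = mean - threshold * stdev
    return [url for url, price in priced if cutoff > price]


# Hypothetical sample data: three normally priced flats, one cheap one, one with no price.
sample_rows = [("a", 600, 0), ("b", 650, 0), ("c", 0, 150), ("d", 300, 0), ("e", 0, 0)]
bargains = find_bargains(sample_rows)
```

<p>In practice you would probably filter the rows by location and bedroomNumber first, so that each advert is only compared against similar properties.</p>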
<p><b>advert.py:</b></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">import</span> MySQLdb
<span style="color: #ff7700;font-weight:bold;">import</span> chardet
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">sys</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> advert:
&nbsp;
        url = <span style="color: #483d8b;">&quot;&quot;</span>
        title = <span style="color: #483d8b;">&quot;&quot;</span>
        pricePW = <span style="color: #ff4500;">0</span>
        pricePCM = <span style="color: #ff4500;">0</span>
        location = <span style="color: #483d8b;">&quot;&quot;</span>
        dateAvailable = <span style="color: #483d8b;">&quot;&quot;</span>
        propertyType = <span style="color: #483d8b;">&quot;&quot;</span>
        bedroomNumber = <span style="color: #ff4500;">0</span>
        description = <span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> save<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
                <span style="color: #808080; font-style: italic;"># you will need to change the following to match your mysql credentials:</span>
                db=MySQLdb.<span style="color: black;">connect</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;localhost&quot;</span>,<span style="color: #483d8b;">&quot;root&quot;</span>,<span style="color: #483d8b;">&quot;secret&quot;</span>,<span style="color: #483d8b;">&quot;manchester&quot;</span><span style="color: black;">&#41;</span>
                c=db.<span style="color: black;">cursor</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #008000;">self</span>.<span style="color: black;">description</span> = <span style="color: #008000;">unicode</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">description</span>, errors=<span style="color: #483d8b;">'replace'</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">description</span> = <span style="color: #008000;">self</span>.<span style="color: black;">description</span>.<span style="color: black;">encode</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'ascii'</span>,<span style="color: #483d8b;">'ignore'</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># TODO: might need to convert the other strings in the advert if there are any unicode conversion errors</span>
&nbsp;
                sql = <span style="color: #483d8b;">&quot;INSERT INTO adverts (url,title,pricePCM,pricePW,location,dateAvailable,propertyType,bedroomNumber,description) VALUES('&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">url</span>+<span style="color: #483d8b;">&quot;','&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">title</span>+<span style="color: #483d8b;">&quot;',&quot;</span>+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">pricePCM</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot;,&quot;</span>+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">pricePW</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot;,'&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">location</span>+<span style="color: #483d8b;">&quot;','&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">dateAvailable</span>+<span style="color: #483d8b;">&quot;','&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">propertyType</span>+<span style="color: #483d8b;">&quot;',&quot;</span>+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span><span style="color: #008000;">self</span>.<span style="color: black;">bedroomNumber</span><span style="color: black;">&#41;</span>+<span style="color: #483d8b;">&quot;,'&quot;</span>+<span style="color: #008000;">self</span>.<span style="color: black;">description</span>+<span style="color: #483d8b;">&quot;' )&quot;</span>
&nbsp;
                c.<span style="color: black;">execute</span><span style="color: black;">&#40;</span>sql<span style="color: black;">&#41;</span></pre></div></div>

<p>In advert.py we convert the Unicode output that BeautifulSoup gives us into plain ASCII so that we can put it in the MySQL database without any problems. I could have used Unicode in the database as well, but the chances of really needing Unicode to represent Gumtree ads are quite slim. If you intend to use this code, you will also need to enter the MySQL credentials for your own database.</p>
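
<p>For readers on Python 3, where the unicode built-in no longer exists, the same lossy conversion can be sketched roughly as follows. The to_ascii helper is a hypothetical name used only for this example, and decoding byte strings as UTF-8 is an assumption. It replaces undecodable bytes first and then drops every non-ASCII character, much as the save method above does for the description.</p>

```python
def to_ascii(text):
    # Accept either bytes or str; undecodable bytes become U+FFFD first.
    # Assumption: byte input is (mostly) UTF-8.
    if isinstance(text, bytes):
        text = text.decode("utf-8", errors="replace")
    # Drop anything outside ASCII (including U+FFFD), then return a str again.
    return text.encode("ascii", "ignore").decode("ascii")


# Hypothetical advert text containing an accented letter and a pound sign.
cleaned = to_ascii("caf\u00e9 flat, \u00a3500 pcm")
```

<p>Characters such as the pound sign are silently discarded by this approach, so it only suits fields where losing them does not matter.</p>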
<p><b>listing.py:</b></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">class</span> listing:
&nbsp;
        url=<span style="color: #483d8b;">&quot;&quot;</span>
        adverturls=<span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
        nextLink=<span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> addAdvertURL<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>,url<span style="color: black;">&#41;</span>:
&nbsp;
                <span style="color: #008000;">self</span>.<span style="color: black;">adverturls</span>.<span style="color: black;">append</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span></pre></div></div>

<p><b>scrapeAdvert.py:</b></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> BeautifulSoup <span style="color: #ff7700;font-weight:bold;">import</span> BeautifulSoup          <span style="color: #808080; font-style: italic;"># For processing HTML</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span>
<span style="color: #ff7700;font-weight:bold;">from</span> advert <span style="color: #ff7700;font-weight:bold;">import</span> advert
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">time</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> scrapeAdvert:
&nbsp;
        page = <span style="color: #483d8b;">&quot;&quot;</span>
        soup = <span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> scrape<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>,advertURL<span style="color: black;">&#41;</span>:
&nbsp;
                <span style="color: #808080; font-style: italic;"># give it a bit of time so gumtree doesn't</span>
                <span style="color: #808080; font-style: italic;"># ban us</span>
                <span style="color: #dc143c;">time</span>.<span style="color: black;">sleep</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#41;</span>
&nbsp;
                url = advertURL
                <span style="color: #808080; font-style: italic;"># print &quot;-- scraping &quot;+url+&quot; --&quot;</span>
                page = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">soup</span> = BeautifulSoup<span style="color: black;">&#40;</span>page<span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span> = advert<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">url</span> = url
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">title</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractTitle</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">pricePW</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractPricePW</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">pricePCM</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractPricePCM</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">location</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractLocation</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">dateAvailable</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractDateAvailable</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">propertyType</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractPropertyType</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">bedroomNumber</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractBedroomNumber</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">anAd</span>.<span style="color: black;">description</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractDescription</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractTitle<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                location = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'h1'</span><span style="color: black;">&#41;</span>
                <span style="color: #dc143c;">string</span> = location.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + stripped + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> stripped
&nbsp;
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractPricePCM<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                location = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'span'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;class&quot;</span> : <span style="color: #483d8b;">&quot;price&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        <span style="color: #dc143c;">string</span> = location.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                        <span style="color: #dc143c;">string</span>.<span style="color: black;">index</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'pcm'</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">AttributeError</span>: <span style="color: #808080; font-style: italic;"># for ads with no prices set</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">ValueError</span>: <span style="color: #808080; font-style: italic;"># for ads with pw specified</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
&nbsp;
                stripped = <span style="color: #dc143c;">string</span>.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'&amp;pound;'</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'pcm'</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">','</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>stripped.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + stripped + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">int</span><span style="color: black;">&#40;</span>stripped<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractPricePW<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                location = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'span'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;class&quot;</span> : <span style="color: #483d8b;">&quot;price&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        <span style="color: #dc143c;">string</span> = location.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                        <span style="color: #dc143c;">string</span>.<span style="color: black;">index</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'pw'</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">AttributeError</span>: <span style="color: #808080; font-style: italic;"># for ads with no prices set</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">ValueError</span>: <span style="color: #808080; font-style: italic;"># for ads with pcm specified</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
                stripped = <span style="color: #dc143c;">string</span>.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'&amp;pound;'</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'pw'</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">','</span>,<span style="color: #483d8b;">''</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span>stripped.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + stripped + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #008000;">int</span><span style="color: black;">&#40;</span>stripped<span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractLocation<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                location = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'span'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;class&quot;</span> : <span style="color: #483d8b;">&quot;location&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                <span style="color: #dc143c;">string</span> = location.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + stripped + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> stripped
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractDateAvailable<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
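                <span style="color: #808080; font-style: italic;"># adverts only give day/month, so assume the date falls in the current year</span>
                current_year = <span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'%Y'</span><span style="color: black;">&#41;</span>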
&nbsp;
                ul = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'ul'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;id&quot;</span> : <span style="color: #483d8b;">&quot;ad-details&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                firstP = ul.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'p'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                <span style="color: #dc143c;">string</span> = firstP.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                date_to_convert = stripped + <span style="color: #483d8b;">'/'</span>+current_year
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        date_object = <span style="color: #dc143c;">time</span>.<span style="color: black;">strptime</span><span style="color: black;">&#40;</span>date_to_convert, <span style="color: #483d8b;">&quot;%d/%m/%Y&quot;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">ValueError</span>: <span style="color: #808080; font-style: italic;"># for adverts with no date available</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
                full_date = <span style="color: #dc143c;">time</span>.<span style="color: black;">strftime</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'%Y-%m-%d %H:%M:%S'</span>, date_object<span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + full_date + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> full_date
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractPropertyType<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                ul = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'ul'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;id&quot;</span> : <span style="color: #483d8b;">&quot;ad-details&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        secondP = ul.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'p'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">1</span><span style="color: black;">&#93;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">IndexError</span>: <span style="color: #808080; font-style: italic;"># for properties with no type</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&quot;</span>
                <span style="color: #dc143c;">string</span> = secondP.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + stripped + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> stripped
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractBedroomNumber<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                ul = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'ul'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;id&quot;</span> : <span style="color: #483d8b;">&quot;ad-details&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        thirdP = ul.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'p'</span><span style="color: black;">&#41;</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">2</span><span style="color: black;">&#93;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">IndexError</span>: <span style="color: #808080; font-style: italic;"># for properties with no bedroom number</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #ff4500;">0</span>
                <span style="color: #dc143c;">string</span> = thirdP.<span style="color: black;">contents</span><span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span>
                stripped = <span style="color: #483d8b;">' '</span>.<span style="color: black;">join</span><span style="color: black;">&#40;</span><span style="color: #dc143c;">string</span>.<span style="color: black;">split</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span><span style="color: black;">&#41;</span>
                stripped = stripped.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + stripped + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> stripped
&nbsp;
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractDescription<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                div = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'div'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;id&quot;</span> : <span style="color: #483d8b;">&quot;description&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                description = div.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'p'</span><span style="color: black;">&#41;</span>
                contents = description.<span style="color: black;">renderContents</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                contents = contents.<span style="color: black;">replace</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;'&quot;</span>,<span style="color: #483d8b;">'&amp;quot;'</span><span style="color: black;">&#41;</span>
                <span style="color: #808080; font-style: italic;"># print '|' + contents + '|'</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> contents</pre></div></div>

<p>In scrapeAdvert.py there are a lot of string manipulation statements to pull out any unwanted characters, such as the &#8216;pw&#8217; characters (short for per week) found in the price string, which we need to remove in order to store the property price per week as an integer.</p>
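
<p>As a minimal sketch of that kind of clean-up (this is not the code from scrapeAdvert.py itself): the helper below, whose name price_per_week is hypothetical, assumes the price arrives as a string such as &#8216;&#163;150pw&#8217; and keeps only the first run of digits. It is written for Python 3, whereas the scraper in this post targets Python 2.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import re

def price_per_week(raw):
    """Return the weekly price as an int, or None if no digits are found.

    Hypothetical helper: assumes input such as '£150pw' or '£1,200 pw'.
    """
    # Take the first run of digits, allowing thousands separators.
    match = re.search(r"[0-9][0-9,]*", raw)
    if match is None:
        return None
    # Drop the commas so int() accepts the value.
    return int(match.group(0).replace(",", ""))

print(price_per_week("£150pw"))</pre></div></div>

<p>Returning None for a string with no digits keeps a missing price distinct from a price of zero.</p>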
<p>Using BeautifulSoup to pull out elements is quite easy, for example:</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">ul = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">find</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'ul'</span>,attrs=<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;id&quot;</span> : <span style="color: #483d8b;">&quot;ad-details&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span></pre></div></div>

<p>That returns the first &lt;ul id=&#8221;ad-details&#8221;&gt; element in the page. Its child elements (the list items) can then be searched with further find or findAll calls. More detail can be found in the <a href="http://www.crummy.com/software/BeautifulSoup/documentation.html">Beautiful Soup documentation</a>, which is very good.</p>
<p><b>scrapeListing.py:</b></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> BeautifulSoup <span style="color: #ff7700;font-weight:bold;">import</span> BeautifulSoup          <span style="color: #808080; font-style: italic;"># For processing HTML</span>
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">urllib2</span>
<span style="color: #ff7700;font-weight:bold;">from</span> listing <span style="color: #ff7700;font-weight:bold;">import</span> listing
<span style="color: #ff7700;font-weight:bold;">import</span> <span style="color: #dc143c;">time</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">class</span> scrapeListing:
&nbsp;
        soup = <span style="color: #483d8b;">&quot;&quot;</span>
        url = <span style="color: #483d8b;">&quot;&quot;</span>
        aListing = <span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> scrape<span style="color: black;">&#40;</span><span style="color: #008000;">self</span>,url<span style="color: black;">&#41;</span>:
                <span style="color: #808080; font-style: italic;"># give it a bit of time so gumtree doesn't</span>
                <span style="color: #808080; font-style: italic;"># ban us</span>
                <span style="color: #dc143c;">time</span>.<span style="color: black;">sleep</span><span style="color: black;">&#40;</span><span style="color: #ff4500;">3</span><span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;scraping url = &quot;</span>+<span style="color: #008000;">str</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>
&nbsp;
                page = <span style="color: #dc143c;">urllib2</span>.<span style="color: black;">urlopen</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">soup</span> = BeautifulSoup<span style="color: black;">&#40;</span>page<span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #008000;">self</span>.<span style="color: black;">aListing</span> = listing<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">aListing</span>.<span style="color: black;">url</span> = url
                <span style="color: #008000;">self</span>.<span style="color: black;">aListing</span>.<span style="color: black;">adverturls</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractAdvertURLs</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #008000;">self</span>.<span style="color: black;">aListing</span>.<span style="color: black;">nextLink</span> = <span style="color: #008000;">self</span>.<span style="color: black;">extractNextLink</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractAdvertURLs<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                toReturn = <span style="color: black;">&#91;</span><span style="color: black;">&#93;</span>
                h3s = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;h3&quot;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">for</span> h3 <span style="color: #ff7700;font-weight:bold;">in</span> h3s:
                        links = h3.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">'a'</span>,<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;class&quot;</span>:<span style="color: #483d8b;">&quot;summary&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                        <span style="color: #ff7700;font-weight:bold;">for</span> link <span style="color: #ff7700;font-weight:bold;">in</span> links:
                                <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;|&quot;</span>+link<span style="color: black;">&#91;</span><span style="color: #483d8b;">'href'</span><span style="color: black;">&#93;</span>+<span style="color: #483d8b;">&quot;|&quot;</span>
                                toReturn.<span style="color: black;">append</span><span style="color: black;">&#40;</span>link<span style="color: black;">&#91;</span><span style="color: #483d8b;">'href'</span><span style="color: black;">&#93;</span><span style="color: black;">&#41;</span>
&nbsp;
                <span style="color: #ff7700;font-weight:bold;">return</span> toReturn
&nbsp;
        <span style="color: #ff7700;font-weight:bold;">def</span> extractNextLink<span style="color: black;">&#40;</span><span style="color: #008000;">self</span><span style="color: black;">&#41;</span>:
&nbsp;
                links = <span style="color: #008000;">self</span>.<span style="color: black;">soup</span>.<span style="color: black;">findAll</span><span style="color: black;">&#40;</span><span style="color: #483d8b;">&quot;a&quot;</span>,<span style="color: black;">&#123;</span><span style="color: #483d8b;">&quot;class&quot;</span>:<span style="color: #483d8b;">&quot;next&quot;</span><span style="color: black;">&#125;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;&gt;&quot;</span>+links<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'href'</span><span style="color: black;">&#93;</span>+<span style="color: #483d8b;">&quot;&gt;&quot;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> <span style="color: #008000;">IndexError</span>: <span style="color: #808080; font-style: italic;"># if there is no 'next' link found..</span>
                        <span style="color: #ff7700;font-weight:bold;">return</span> <span style="color: #483d8b;">&quot;&quot;</span>
                <span style="color: #ff7700;font-weight:bold;">return</span> links<span style="color: black;">&#91;</span><span style="color: #ff4500;">0</span><span style="color: black;">&#93;</span><span style="color: black;">&#91;</span><span style="color: #483d8b;">'href'</span><span style="color: black;">&#93;</span></pre></div></div>

<p>The extractNextLink method extracts the pagination &#8216;next&#8217; link, which points to the next page of listings, and returns an empty string when there is no such link. We use it to step through the &#8216;sequence&#8217; of listing pages.</p>
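
<p>The idea of walking a chain of &#8216;next&#8217; links can be sketched without touching the network. In the Python 3 example below, FAKE_PAGES and walk_listing_pages are hypothetical names, and the URLs are made up; the dictionary simply stands in for what scrapeListing would extract from each page. The extra check for already visited pages is an assumption of this sketch, not something the scraper does.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"># Stand-in for the site: each listing URL maps to (advert URLs, next link).
# An empty string means there is no next page, mirroring extractNextLink.
FAKE_PAGES = {
    "/listing/page1": (["/ad/1", "/ad/2"], "/listing/page2"),
    "/listing/page2": (["/ad/3"], "/listing/page3"),
    "/listing/page3": (["/ad/4"], ""),
}

def walk_listing_pages(start_url, pages):
    """Follow 'next' links from start_url and collect every advert URL."""
    advert_urls = []
    seen = set()
    url = start_url
    # Stop on an empty next link, or on a page that was already visited.
    while url and url not in seen:
        seen.add(url)
        adverts, next_link = pages[url]
        advert_urls.extend(adverts)
        url = next_link
    return advert_urls

print(walk_listing_pages("/listing/page1", FAKE_PAGES))</pre></div></div>

<p>Tracking visited pages means the loop still ends if a site ever links its last page back to an earlier one.</p>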
<p><b>scrapeSequence.py:</b></p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;"><span style="color: #ff7700;font-weight:bold;">from</span> scrapeListing <span style="color: #ff7700;font-weight:bold;">import</span> scrapeListing
<span style="color: #ff7700;font-weight:bold;">from</span> scrapeAdvert <span style="color: #ff7700;font-weight:bold;">import</span> scrapeAdvert
<span style="color: #ff7700;font-weight:bold;">from</span> listing <span style="color: #ff7700;font-weight:bold;">import</span> listing
<span style="color: #ff7700;font-weight:bold;">from</span> advert <span style="color: #ff7700;font-weight:bold;">import</span> advert
<span style="color: #ff7700;font-weight:bold;">import</span> MySQLdb
<span style="color: #ff7700;font-weight:bold;">import</span> _mysql_exceptions
&nbsp;
<span style="color: #808080; font-style: italic;"># change this to the gumtree page you want to start scraping from</span>
url = <span style="color: #483d8b;">&quot;http://www.gumtree.com/flats-and-houses-for-rent/salford-quays&quot;</span>
&nbsp;
<span style="color: #ff7700;font-weight:bold;">while</span> url <span style="color: #66cc66;">!</span>= <span style="color: #008000;">None</span>:
        <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;scraping URL = &quot;</span>+url
        sl = <span style="color: #483d8b;">&quot;&quot;</span>
        sl = scrapeListing<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
        sl.<span style="color: black;">scrape</span><span style="color: black;">&#40;</span>url<span style="color: black;">&#41;</span>
        <span style="color: #ff7700;font-weight:bold;">for</span> advertURL <span style="color: #ff7700;font-weight:bold;">in</span> sl.<span style="color: black;">aListing</span>.<span style="color: black;">adverturls</span>:
                sa = <span style="color: #483d8b;">&quot;&quot;</span>
                sa = scrapeAdvert<span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                sa.<span style="color: black;">scrape</span><span style="color: black;">&#40;</span>advertURL<span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">try</span>:
                        sa.<span style="color: black;">anAd</span>.<span style="color: black;">save</span><span style="color: black;">&#40;</span><span style="color: black;">&#41;</span>
                <span style="color: #ff7700;font-weight:bold;">except</span> _mysql_exceptions.<span style="color: black;">IntegrityError</span>:
                        <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;** Advert &quot;</span> + sa.<span style="color: black;">anAd</span>.<span style="color: black;">url</span> + <span style="color: #483d8b;">&quot; already saved **&quot;</span>
                sa.<span style="color: black;">anAd</span> = <span style="color: #483d8b;">&quot;&quot;</span>
&nbsp;
        url = <span style="color: #483d8b;">&quot;&quot;</span>
        <span style="color: #ff7700;font-weight:bold;">if</span> sl.<span style="color: black;">aListing</span>.<span style="color: black;">nextLink</span>:
                <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">&quot;nextLink = &quot;</span>+sl.<span style="color: black;">aListing</span>.<span style="color: black;">nextLink</span>
                url = sl.<span style="color: black;">aListing</span>.<span style="color: black;">nextLink</span>
        <span style="color: #ff7700;font-weight:bold;">else</span>:
                <span style="color: #ff7700;font-weight:bold;">print</span> <span style="color: #483d8b;">'all done.'</span>
                <span style="color: #ff7700;font-weight:bold;">break</span></pre></div></div>

<p>This is the file you run to kick off the scrape. It wraps each save in a try/except block that catches the MySQL IntegrityError raised when an advert has already been entered into the database. The error occurs because the URL of the advert is the primary key in the database, and no two records can have the same primary key.</p>
<p>The url variable near the top of the script sets the page from which the scrape starts.</p>
<p>The above code worked well for scraping several hundred Manchester Gumtree ads into a database, from which point I was able to use a combination of phpMyAdmin and OpenOffice Spreadsheet to analyse the data and find out useful statistics about the property market in said area.</p>
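
<p>The duplicate check in scrapeSequence.py relies on the database rejecting a second row with the same primary key. Here is a self-contained Python 3 sketch of the same idea, using the standard library&#8217;s sqlite3 module in place of MySQLdb so that it runs without a database server. The adverts table, its columns and the save_advert function are hypothetical and are not taken from the scraper&#8217;s source.</p>

<div class="wp_syntax"><div class="code"><pre class="python" style="font-family:monospace;">import sqlite3

def save_advert(conn, url, price):
    """Insert one advert; return False if its URL is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO adverts (url, price) VALUES (?, ?)",
                (url, price),
            )
    except sqlite3.IntegrityError:
        # The primary key (the URL) already exists, so skip this advert.
        return False
    return True

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE adverts (url TEXT PRIMARY KEY, price INTEGER)")
print(save_advert(conn, "http://example.com/ad/1", 150))  # first save
print(save_advert(conn, "http://example.com/ad/1", 150))  # duplicate URL</pre></div></div>

<p>The ? placeholders leave the quoting of values to the database driver, so the advert text does not need its quote characters replaced before it is saved.</p>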
<p><center><a href="http://www.davidcraddock.net/uploads/gumtree-scraper.tgz">Download the scraper source code in a tar.gz archive</a></center></p>
<p>Note: Due to the nature of web scraping, if &#8211; or more accurately, when &#8211; Gumtree changes its user interface, the scraper I have written will need to be tweaked accordingly to find the right data. This is meant to be an informative tutorial, not a finished product.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MicroKORG + Python = MIDI fun!</title>
		<link>http://www.davidcraddock.net/2009/03/30/microkorg-python-midi-fun/</link>
		<comments>http://www.davidcraddock.net/2009/03/30/microkorg-python-midi-fun/#comments</comments>
		<pubDate>Mon, 30 Mar 2009 00:14:10 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Tips]]></category>
		<category><![CDATA[microKorg]]></category>
		<category><![CDATA[music]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=81</guid>
		<description><![CDATA[So, about a month ago I got a second-hand microKORG from Ebay. Fiddling around with the preset patches, and creating new patches is great fun, even though I only know a few chords. Recently I plugged it in to my PC via my M-Audio Uno USB->MIDI interface, and soon was using Ableton Live to program [...]]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.davidcraddock.net/wp-content/uploads/2009/03/microkorg.jpg" alt="microKORG and cat" title="microKORG and cat" width="400" height="300" class="alignleft size-full wp-image-82" /></p>
<p>So, about a month ago I got a second-hand <a href="http://en.wikipedia.org/wiki/MicroKORG">microKORG</a> from eBay. Fiddling around with the preset patches, and creating new patches, is great fun, even though I only know a few chords. Recently I plugged it into my PC via my <a href="http://www.dolphinmusic.co.uk/product/1773-m-audio-uno-usb.html">M-Audio Uno USB->MIDI interface</a>, and soon was using Ableton Live to program drums in time with the microKORG&#8217;s arp.</p>
<p>I thought I&#8217;d experiment with the music libraries available in Python, and see if I could send notes to the synth via MIDI. It turns out that the M-Audio Uno is supported under Ubuntu; all you have to do is install the <code>midisport-firmware</code> package.  With the help of <a href="http://trac2.assembla.com/pkaudio/wiki/pyrtmidi">pyrtmidi</a>, a set of Python wrappers around the C++ audio library rtmidi, I was able to receive MIDI signals in realtime from the microKORG, and to send them in realtime too. With the help of <a href="http://www.davidcraddock.net/images/midilib.py">this</a> old MIDI file reader/writer library that I found posted to a Python mailing list, I&#8217;ve made some progress in writing a simple MIDI file player that sends notes to the &#8216;KORG.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2009/03/30/microkorg-python-midi-fun/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Eclipse 3.4.2 + Pydev + Eclim = win</title>
		<link>http://www.davidcraddock.net/2009/03/27/eclipse-342-pydev-eclim-win/</link>
		<comments>http://www.davidcraddock.net/2009/03/27/eclipse-342-pydev-eclim-win/#comments</comments>
		<pubDate>Fri, 27 Mar 2009 23:47:59 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Tips]]></category>
		<category><![CDATA[agile]]></category>
		<category><![CDATA[eclipse]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[vim]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=74</guid>
		<description><![CDATA[So, after saying all that stuff about how vimplugin and EasyEclipse was great, I actually started to use the setup heavily, and it started to annoy me. For one, EE is not a recent build of eclipse, nor does it come with a full set of recent plugins. This makes it annoyingly difficult to use [...]]]></description>
			<content:encoded><![CDATA[<p><img alt="" src="http://www.davidcraddock.net/images/computer-anger.jpg" title="Computer Rage" class="alignnone" width="350" height="262" /></p>
<p>So, after saying all that stuff about how vimplugin and EasyEclipse were great, I actually started to use the setup heavily, and it started to annoy me.</p>
<p>For one, EE is not a recent build of eclipse, nor does it come with a full set of recent plugins. This makes it annoyingly difficult to use when you want to use more than the set of plugins it packages for you. As far as vimplugin goes, it does not provide the vim integration I thought it might from embedded vim. Not really even close.</p>
<p>What I use now, after lots of trial and error, and at least 4 reinstalls of Eclipse, is a combination of Eclipse 3.4.2, <a href="http://eclim.sourceforge.net/">Eclim</a> (which is the most mature of the free vi-binding plugins around, and actually includes an improved version of the vimplugin previously mentioned), and the latest PyDev, Mylyn and Subclipse.</p>
<p>I&#8217;m using it now to refactor a largeish python project, and I&#8217;m really appreciating the help it gives me. Definitely worth trying an Eclipse setup similar to this if you&#8217;re writing any python apps that are more than small-scale.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2009/03/27/eclipse-342-pydev-eclim-win/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

