<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>DavidCraddock.net</title>
	<atom:link href="http://www.davidcraddock.net/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.davidcraddock.net</link>
	<description></description>
	<lastBuildDate>Thu, 12 Apr 2012 15:45:10 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>Converting a single M2V frame into JPEG under OSX</title>
		<link>http://www.davidcraddock.net/2012/04/12/converting-a-single-m2v-frame-into-jpeg/</link>
		<comments>http://www.davidcraddock.net/2012/04/12/converting-a-single-m2v-frame-into-jpeg/#comments</comments>
		<pubDate>Thu, 12 Apr 2012 15:20:23 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Solutions to a Specific Problem]]></category>
		<category><![CDATA[converting frame to jpeg]]></category>
		<category><![CDATA[FFmpeg]]></category>
		<category><![CDATA[M2V]]></category>
		<category><![CDATA[macports]]></category>
		<category><![CDATA[osx]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=976</guid>
		<description><![CDATA[I needed to view a single frame of an m2v file that had been encoded by our designers for playing out on TV. The file name was .mpg but in actuality it was a single .m2v frame renamed to be &#8230; <a href="http://www.davidcraddock.net/2012/04/12/converting-a-single-m2v-frame-into-jpeg/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2012/04/Stainless_Steel_Number_Plate_Frame_Square.jpg"><img src="http://www.davidcraddock.net/wp-content/uploads/2012/04/Stainless_Steel_Number_Plate_Frame_Square.jpg" alt="" title="Frame" width="400" height="295" class="aligncenter size-full wp-image-977" /></a></p>
<p>I needed to view a single frame of an m2v file that had been encoded by our designers for playing out on TV. The file was named .mpg, but it was actually a single .m2v frame renamed to .mpg. Media Player Classic used to display the frame fine when I opened the file under Windows XP. Now that I have switched to a Mac, however, I found that both QuickTime and VLC refused to display the single frame, and I couldn&#8217;t find any video player that would open it. So I resorted to the command-line version of ffmpeg, which I installed via MacPorts, to convert the frame to a JPEG that I could view as normal. This line worked a treat:<br />
<code><br />
ffmpeg -i north.mpg -ss 00:00:00 -t 00:00:1 -s 1024x768 -r 1 -f mjpeg north.jpg<br />
</code></p>
<p>Where &#8216;north.mpg&#8217; was the m2v file, and &#8216;north.jpg&#8217; was the output jpeg.</p>
<p>And this:<br />
<code><br />
find . -name '*.mpg' -exec ffmpeg -i {} -ss 00:00:00 -t 00:00:1 -s 1024x768 -r 1 -f mjpeg {}.jpg \;<br />
</code></p>
<p>Will go through all the .mpg files in the current directory and below, and create a single-frame JPEG for each one, e.g. for north.mpg it will create north.mpg.jpg.</p>
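<p>The same batch conversion can be sketched in Python, which makes it easier to control the output names. This is only a rough equivalent: the function names are made up for illustration, the ffmpeg flags are copied from the command above, and it writes north.jpg rather than north.mpg.jpg.</p>

```python
import os
import subprocess


def build_ffmpeg_command(source_path):
    """Return the ffmpeg argument list that extracts one JPEG from source_path."""
    root, _extension = os.path.splitext(source_path)
    return [
        "ffmpeg", "-i", source_path,
        "-ss", "00:00:00", "-t", "00:00:01",
        "-s", "1024x768", "-r", "1",
        "-f", "mjpeg", root + ".jpg",
    ]


def find_mpg_files(top="."):
    """Yield every .mpg file in 'top' and the directories below it."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in sorted(filenames):
            if name.lower().endswith(".mpg"):
                yield os.path.join(dirpath, name)


def convert_all(top="."):
    """Run ffmpeg once per .mpg file; raises CalledProcessError if ffmpeg fails."""
    for path in find_mpg_files(top):
        subprocess.check_call(build_ffmpeg_command(path))
```

<p>Calling convert_all(".") from the directory that holds the frames would then do the whole batch, assuming ffmpeg is on your PATH.</p>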
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2012/04/12/converting-a-single-m2v-frame-into-jpeg/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Java 1.6 on RHEL4</title>
		<link>http://www.davidcraddock.net/2012/02/11/java-1-6-on-rhel4/</link>
		<comments>http://www.davidcraddock.net/2012/02/11/java-1-6-on-rhel4/#comments</comments>
		<pubDate>Sat, 11 Feb 2012 01:36:52 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[Java 1.6]]></category>
		<category><![CDATA[Redhat Enterprise Linux]]></category>
		<category><![CDATA[RHEL]]></category>
		<category><![CDATA[RHEL4]]></category>
		<category><![CDATA[Sun]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=967</guid>
		<description><![CDATA[After I wrote a Java application in JDK 1.6, I was stuck for a while when I realised that the target deployment machine was Red Hat Enterprise Linux 4. RHEL4 does not support Java 1.6 in its default configuration. Luckily &#8230; <a href="http://www.davidcraddock.net/2012/02/11/java-1-6-on-rhel4/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2012/02/red-hat-theme-party.jpg"><img src="http://www.davidcraddock.net/wp-content/uploads/2012/02/red-hat-theme-party.jpg" alt="" title="Red Hat" width="230" height="175" class="aligncenter size-full wp-image-968" /></a></p>
<p>After I wrote a Java application in JDK 1.6, I was stuck for a while when I realised that the target deployment machine was Red Hat Enterprise Linux 4. RHEL4 does not support Java 1.6 in its default configuration.</p>
<p>Luckily I found this article on the CentOS wiki which included instructions on how to install Java 1.6 on CentOS 4. Remembering that RHEL4 and CentOS 4 are almost identical, I tried the method supplied, and it worked. This is the page with the method:</p>
<p><a href="http://wiki.centos.org/HowTos/JavaOnCentOS">http://wiki.centos.org/HowTos/JavaOnCentOS</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2012/02/11/java-1-6-on-rhel4/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Test Driven Systems Development with Nagios</title>
		<link>http://www.davidcraddock.net/2012/02/07/test-driven-systems-development-with-nagios/</link>
		<comments>http://www.davidcraddock.net/2012/02/07/test-driven-systems-development-with-nagios/#comments</comments>
		<pubDate>Tue, 07 Feb 2012 14:42:50 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=956</guid>
		<description><![CDATA[Nagios can be seen as an automated test tool for systems, just as you would have automated tests for software projects. In test driven development (TDD), you write the tests first, and then use those tests to build up a &#8230; <a href="http://www.davidcraddock.net/2012/02/07/test-driven-systems-development-with-nagios/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2012/02/nagios.jpg"><img src="http://www.davidcraddock.net/wp-content/uploads/2012/02/nagios.jpg" alt="" title="nagios" width="260" height="194" class="aligncenter size-full wp-image-958" /></a></p>
<p>Nagios can be seen as an automated test tool for systems, just as you would have automated tests for software projects. In test driven development (TDD), you write the tests first, and then use those tests to build up a software project that you can be confident works. We can use the same method to build up systems, or networks of systems. Plan out which services and processes should be running on your new systems, and then implement a Nagios check for every one. You can then follow the progress of your build by looking at Nagios. I have been doing this at the BBC. It is a simple idea, but one that seems to work.</p>
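<p>To make the idea concrete, here is a minimal sketch of what one such &#8216;test&#8217; could look like as a custom Nagios plugin written in Python. Nagios plugins report their result through the exit code (0 for OK, 1 for WARNING, 2 for CRITICAL, 3 for UNKNOWN) plus a line of text. The function name check_tcp_port is invented for this example, and Nagios already ships a check_tcp plugin that does the same job properly.</p>

```python
import socket

# Exit codes that Nagios expects from a plugin.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_tcp_port(host, port, timeout=5.0):
    """Return (exit_code, message) saying whether host:port accepts connections."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return OK, "OK - %s:%d is accepting connections" % (host, port)
    except OSError as error:
        # Covers refused connections, timeouts and name resolution failures.
        return CRITICAL, "CRITICAL - %s:%d unreachable (%s)" % (host, port, error)
```

<p>A wrapper script would print the message and call sys.exit() with the code. Written before the service exists, the check starts out CRITICAL and turns OK once that part of the build is done, which is the test-first cycle described above.</p>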
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2012/02/07/test-driven-systems-development-with-nagios/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>JSoup Method for Page Scraping</title>
		<link>http://www.davidcraddock.net/2011/09/07/jsoup-method-for-page-scraping/</link>
		<comments>http://www.davidcraddock.net/2011/09/07/jsoup-method-for-page-scraping/#comments</comments>
		<pubDate>Wed, 07 Sep 2011 18:35:17 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Solutions to a Specific Problem]]></category>
		<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[Java]]></category>
		<category><![CDATA[JSoup]]></category>
		<category><![CDATA[Scraper]]></category>
		<category><![CDATA[Scraping webpages]]></category>
		<category><![CDATA[screen scrape]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=938</guid>
		<description><![CDATA[I&#8217;m currently in the process of writing a web scraper for the forums on Gaia Online. Previously, I used to use Python to develop web scrapers, with the very handy Python library BeautifulSoup. Java has an equivalent called JSoup. Here &#8230; <a href="http://www.davidcraddock.net/2011/09/07/jsoup-method-for-page-scraping/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/09/soup.jpg"><img src="http://www.davidcraddock.net/wp-content/uploads/2011/09/soup.jpg" alt="Soup bowl" title="Soup" width="300" height="300" class="aligncenter size-full wp-image-946" /></a></p>
<p>I&#8217;m currently in the process of writing a web scraper for the forums on <a href="http://www.gaiaonline.com/forum" title="Gaia Online">Gaia Online</a>. Previously, I used to use Python to develop web scrapers, with the very handy Python library <a href="http://www.crummy.com/software/BeautifulSoup/" title="BeautifulSoup">BeautifulSoup</a>. Java has an equivalent called JSoup.</p>
<p>Here I have written a class which is extended by each class in my project that wants to scrape HTML. This &#8216;Scraper&#8217; class fetches the HTML and converts it into a JSoup tree that can be navigated to pick out the data. It advertises itself as a &#8216;web spider&#8217; type of web agent, and waits a random 0-7 seconds before fetching each page so that it can&#8217;t be used to overload a web server. It also converts the entire page to ASCII, which may not be the best thing to do for multi-language web pages, but has certainly made scraping the English-language site Gaia Online much easier.</p>
<p>Here it is:</p>
<pre lang="Java">
import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.text.Normalizer;
import java.util.Random;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

/**
* Generic scraper object that contains the basic methods required to fetch
* and parse HTML content. Extended by other classes that need to scrape.
*
* @author David
*/
public class Scraper {

        public String pageHTML = ""; // the HTML for the page
        public Document pageSoup; // the JSoup scraped hierarchy for the page

        public String fetchPageHTML(String URL) throws IOException{

            // this makes sure we don't scrape the same page twice
            if(!this.pageHTML.isEmpty()){ // compare contents; != only compares references
                return this.pageHTML;
            }

            System.getProperties().setProperty("httpclient.useragent", "spider");

            Random randomGenerator = new Random();
            int sleepTime = randomGenerator.nextInt(7000);
            try{
                Thread.sleep(sleepTime); //sleep for x milliseconds
            }catch(InterruptedException e){
                // only fires if this thread is interrupted while sleeping, should never happen
                Thread.currentThread().interrupt(); // keep the interrupt flag set for callers
            }

            String pageHTML = "";

            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpget = new HttpGet(URL);

                HttpResponse response = httpclient.execute(httpget);
                HttpEntity entity = response.getEntity();

                if (entity != null) {
                    InputStream instream = entity.getContent();
                    String encoding = "UTF-8";

                    StringWriter writer = new StringWriter();
                    IOUtils.copy(instream, writer, encoding);

                    pageHTML = writer.toString();

                    // convert entire page scrape to ASCII-safe string
                    pageHTML = Normalizer.normalize(pageHTML, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");

                }

                return pageHTML;
        }

        public Document fetchPageSoup(String pageHTML) throws FetchSoupException{

            // this makes sure we don't soupify the same page twice
            if(this.pageSoup != null){
                return this.pageSoup;
            }

            if(pageHTML == null || pageHTML.isEmpty()){
                throw new FetchSoupException("We have no supplied HTML to soupify.");
            }

            Document pageSoup = Jsoup.parse(pageHTML);

            return pageSoup;
        }
}
</pre>
<p>Then each class subclasses this scraper class, and adds the actual drilling down through the JSoup hierarchy tree to get what is required:</p>
<pre lang="java">
...
this.pageHTML = this.fetchPageHTML(this.rootURL);
this.pageSoup = this.fetchPageSoup(this.pageHTML);

// get the first div section with id "forum_hd_topic_pagelinks" on the page
Element forumPageLinkSection = this.pageSoup.getElementsByAttributeValue("id","forum_hd_topic_pagelinks").first();
// get all the links in the above div section
Elements forumPageLinks = forumPageLinkSection.getElementsByAttribute("href");
...
</pre>
<p>I&#8217;ve found that this method provides a simple and effective way of scraping pages and using the resultant JSoup tree to pick out important data.</p>
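<p>For comparison with my earlier Python scrapers, the ASCII conversion step in the Scraper class can be sketched in Python like this. It follows the same approach as the Java line (decompose to NFD, then throw away anything that is not ASCII). The function name to_ascii is just for illustration.</p>

```python
import unicodedata


def to_ascii(text):
    """Decompose accented characters (NFD), then drop everything outside ASCII."""
    decomposed = unicodedata.normalize("NFD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")
```

<p>Accented letters keep their base letter, while characters with no ASCII base, such as a pound sign, simply disappear. That is the trade-off mentioned above for multi-language pages.</p>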
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/09/07/jsoup-method-for-page-scraping/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Disabling Control-Enter and Control-B shortcut keys in Outlook 2003</title>
		<link>http://www.davidcraddock.net/2011/07/13/disabling-control-enter-and-control-b-shortcut-keys-in-outlook-2003/</link>
		<comments>http://www.davidcraddock.net/2011/07/13/disabling-control-enter-and-control-b-shortcut-keys-in-outlook-2003/#comments</comments>
		<pubDate>Wed, 13 Jul 2011 16:34:39 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Solutions to a Specific Problem]]></category>
		<category><![CDATA[disabling shortcut]]></category>
		<category><![CDATA[outlook 2003]]></category>
		<category><![CDATA[regedit]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=924</guid>
		<description><![CDATA[At work, I still have to use Windows XP and Outlook 2003. I don&#8217;t particularly mind this, except when I draft an email to someone and accidentally press Control-B instead of Control-V. Control-B will go ahead and send your &#8230; <a href="http://www.davidcraddock.net/2011/07/13/disabling-control-enter-and-control-b-shortcut-keys-in-outlook-2003/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/07/email-oops.jpg"><img src="http://www.davidcraddock.net/wp-content/uploads/2011/07/email-oops.jpg" alt="" title="" width="240" height="159" class="aligncenter size-full wp-image-936" /></a></p>
<p>At work, I still have to use Windows XP and Outlook 2003. I don&#8217;t particularly mind this, except when I draft an email to someone and accidentally press Control-B instead of Control-V. Control-B will go ahead and send your partially composed email, resulting in some embarrassment as you have to tell everyone to disregard it.</p>
<p>So I wanted to remove the &#8216;send email&#8217; shortcut keys in Outlook 2003. There are two ways of doing this. One involves editing your group policy, which is something only my IT administration team can do, and I didn&#8217;t want to have to involve them. The other is to make a change to your registry, which I will describe here.</p>
<ol>
<li>Open up regedit, and browse to the following registry key: HKEY_CURRENT_USER -> Software -> Policies -> Microsoft -> office -> 11.0 -> outlook</li>
<li>Then create a new key called: &#8220;DisabledShortcutKeysCheckBoxes&#8221;.</li>
<li>Under that key, create two new String Values:<br />
Name: CtrlB Data: 66,8<br />
Name: CtrlEnter Data: 13,8
</li>
<li>Then restart Outlook and those keys will be disabled.</li>
</ol>
<p>Click on the thumbnail below to see what the finished edit should look like:</p>
<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/07/disablingshortcutkeys.jpg"><img src="http://www.davidcraddock.net/wp-content/uploads/2011/07/disablingshortcutkeys-300x123.jpg" alt="" title="disablingshortcutkeys" width="300" height="123" class="aligncenter size-medium wp-image-928" /></a></p>
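<p>If you need to make the same change on several machines, the steps above could be captured in a .reg file instead of being clicked through by hand. The following is a rough Python sketch that only builds the text of such a file, using the key and values from the list above. The function name is made up, and I have not tested importing the result on every Windows version, so treat it as a starting point.</p>

```python
KEY_PATH = (r"HKEY_CURRENT_USER\Software\Policies\Microsoft\office"
            r"\11.0\outlook\DisabledShortcutKeysCheckBoxes")

# Value name -> data, exactly as entered in regedit above.
SHORTCUTS = {"CtrlB": "66,8", "CtrlEnter": "13,8"}


def build_reg_file(key_path=KEY_PATH, values=SHORTCUTS):
    """Return the text of a .reg file that creates the key and its string values."""
    lines = ["Windows Registry Editor Version 5.00", "", "[%s]" % key_path]
    for name, data in sorted(values.items()):
        lines.append('"%s"="%s"' % (name, data))
    return "\r\n".join(lines) + "\r\n"
```

<p>Saving the returned text as, say, disable-shortcuts.reg and double-clicking it should import the same entries that the manual steps create.</p>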
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/07/13/disabling-control-enter-and-control-b-shortcut-keys-in-outlook-2003/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Directory names not visible under ls? Change your colours.</title>
		<link>http://www.davidcraddock.net/2011/05/04/directory-names-not-visable-under-ls-change-your-colours/</link>
		<comments>http://www.davidcraddock.net/2011/05/04/directory-names-not-visable-under-ls-change-your-colours/#comments</comments>
		<pubDate>Wed, 04 May 2011 16:03:55 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Solutions to a Specific Problem]]></category>
		<category><![CDATA[centos]]></category>
		<category><![CDATA[console]]></category>
		<category><![CDATA[directory name not visable]]></category>
		<category><![CDATA[fedora]]></category>
		<category><![CDATA[ls]]></category>
		<category><![CDATA[LS_COLORS]]></category>
		<category><![CDATA[putty]]></category>
		<category><![CDATA[redhat]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=910</guid>
		<description><![CDATA[There is a problem I frequently encounter on Redhat/Fedora/CentOS systems with the output of the ls command. Under those distributions, the default setup is to display directories in a very dark colour. If you usually use a white foreground and &#8230; <a href="http://www.davidcraddock.net/2011/05/04/directory-names-not-visable-under-ls-change-your-colours/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/05/range_of_colours.jpg"><img style="border: none;" src="http://www.davidcraddock.net/wp-content/uploads/2011/05/range_of_colours-300x199.jpg" alt="" title="range_of_colours" width="300" height="199" class="aligncenter size-medium wp-image-913" /></a></p>
<p>There is a problem I frequently encounter on Redhat/Fedora/CentOS systems with the output of the <strong>ls</strong> command. Under those distributions, the default setup is to display directories in a very dark colour. If you usually use a white foreground and a black background on your terminal client (such as Putty) then you will struggle to read the names of the directories under Redhat-based distributions.</p>
<p>There are two solutions that I have used:</p>
<p><strong>1. Change the colour settings in Putty </strong></p>
<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/05/screenshot-of-use-system-colors.bmp"><img src="http://www.davidcraddock.net/wp-content/uploads/2011/05/screenshot-of-use-system-colors.bmp" alt="" style="border: none;" title="screenshot of use system colors" class="aligncenter size-full wp-image-911" /></a></p>
<p>If you use Putty, ticking &#8216;Use System Colours&#8217; here changes the &#8220;white foreground, black background&#8221; default into a &#8220;white background, black foreground&#8221;. This way you can at least read the console properly, which is good for a quick fix. You can also save these settings in Putty as the default for the host that you are connecting to, or even for all hosts.</p>
<p><strong>2. Change the LS_COLORS directive temporarily in the shell.</strong></p>
<p>Alternatively, you can ask the <strong>ls</strong> command to display directories and other entries in colours that you specify. You could add these lines to the bottom of your .bashrc to make the change permanent or, if you are using a shared machine, just copy and paste the following lines into the terminal and they will switch to a reddish, more visible set of colours until you log out:</p>
<pre lang="bash">
alias ls='ls --color' # just to make sure we are using coloured ls
LS_COLORS='di=94:fi=0:ln=31:pi=5:so=5:bd=5:cd=5:or=31:mi=0:ex=35:*.rpm=90'
export LS_COLORS
</pre>
<p>(Original source for this particular LS_COLORS combo: <a href="http://linux-sxs.org/housekeeping/lscolors.html">http://linux-sxs.org/housekeeping/lscolors.html</a>)</p>
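<p>The LS_COLORS value is just a colon-separated list of key=code pairs, where the key is either a two-letter file type (di is the entry for directories) or a glob such as *.rpm. If you want to inspect or tweak a long value, a small Python helper along these lines can split it up. The function name is my own invention.</p>

```python
def parse_ls_colors(value):
    """Split an LS_COLORS string into a {file type or glob: colour code} dict."""
    entries = {}
    for item in value.split(":"):
        if "=" in item:
            key, code = item.split("=", 1)
            entries[key] = code
    return entries


settings = parse_ls_colors(
    "di=94:fi=0:ln=31:pi=5:so=5:bd=5:cd=5:or=31:mi=0:ex=35:*.rpm=90")
```

<p>Here settings["di"] holds the code used for directory names, which is the one to change if directories are still hard to read on your terminal.</p>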
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/05/04/directory-names-not-visable-under-ls-change-your-colours/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Scraping Gumtree Property Adverts with Python and BeautifulSoup</title>
		<link>http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/</link>
		<comments>http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/#comments</comments>
		<pubDate>Sun, 01 May 2011 14:07:02 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[BeautifulSoup]]></category>
		<category><![CDATA[property adverts]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[scraping]]></category>
		<category><![CDATA[scraping Gumtree]]></category>
		<category><![CDATA[web scraping]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=886</guid>
		<description><![CDATA[I am moving to Manchester soon, and so I thought I&#8217;d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once in the database, I could do things like &#8230; <a href="http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/05/soup.jpg"><img style="border: none" src="http://www.davidcraddock.net/wp-content/uploads/2011/05/soup-300x199.jpg" alt="" title="soup" width="300" height="199" class="aligncenter size-medium wp-image-897" /></a></p>
<p>I am moving to Manchester soon, and so I thought I&#8217;d get an idea of the housing market there by scraping all the Manchester Gumtree property adverts into a MySQL database. Once they were in the database, I could run simple SQL queries via <a href="http://www.phpmyadmin.net/home_page/index.php">phpMyAdmin</a> to do things like find the average monthly price for a 2 bedroom flat in an area, or spot bargains by looking at how many standard deviations a price falls below the mean.</p>
<p>I really like the Python library <a href="http://www.crummy.com/software/BeautifulSoup/">BeautifulSoup</a> for writing scrapers; there is also a Java equivalent called <a href="http://jsoup.org/">JSoup</a>. BeautifulSoup does a really good job of tolerating markup mistakes in the input data, and transforms a page into a tree structure that is easy to work with.</p>
<p>I chose the following layout for the program:</p>
<p><strong>advert.py</strong> &#8211; Stores all information about each property advert, with a &#8216;save&#8217; method that inserts the data into the mysql database<br />
<strong>listing.py</strong> &#8211; Stores all the information on each listing page, which is broken down into links for specific adverts, and also the link to the next listing page in the sequence (ie: the &#8216;next page&#8217; link)<br />
<strong>scrapeAdvert.py</strong> &#8211; When given an advert URL, this creates and populates an advert object<br />
<strong>scrapeListing.py</strong> &#8211; When given a listing URL, this creates and populates a listing object<br />
<strong>scrapeSequence.py</strong> &#8211; This walks through a series of listings, calling scrapeListing and scrapeAdvert for all of them, and finishes when there are no more listings in the sequence to scrape</p>
<p>Here is the MySQL table I created for this project (which you will have to set up if you want to run the scraper):</p>
<pre lang="SQL">
--
-- Database: `manchester`
--

-- --------------------------------------------------------

--
-- Table structure for table `adverts`
--

CREATE TABLE IF NOT EXISTS `adverts` (
  `url` varchar(255) NOT NULL,
  `title` text NOT NULL,
  `pricePW` int(10) unsigned NOT NULL,
  `pricePCM` int(11) NOT NULL,
  `location` text NOT NULL,
  `dateAvailable` date NOT NULL,
  `propertyType` text NOT NULL,
  `bedroomNumber` int(11) NOT NULL,
  `description` text NOT NULL,
  PRIMARY KEY (`url`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
</pre>
<p>pricePCM is the price per calendar month and pricePW is the price per week. Usually each advert will have one or the other specified.</p>
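<p>As an illustration of the bargain-spotting idea, here is a small Python sketch that takes a list of pricePCM values (for example, all the 2 bedroom flats in one area) and returns the ones that sit more than a chosen number of standard deviations below the mean. The function name and the threshold of one standard deviation are arbitrary choices for the example, not part of the scraper.</p>

```python
import statistics


def find_bargains(prices, threshold=1.0):
    """Return the prices more than 'threshold' standard deviations below the mean."""
    mean = statistics.mean(prices)
    spread = statistics.pstdev(prices)  # population standard deviation
    cutoff = mean - threshold * spread
    return [price for price in prices if cutoff > price]
```

<p>With monthly prices of 500, 520, 480, 510 and 300, only the 300 advert falls below the cutoff, so that is the one worth a closer look.</p>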
<p><b>advert.py:</b></p>
<pre lang="python">
import MySQLdb
import chardet
import sys

class advert:

        url = ""
        title = ""
        pricePW = 0
        pricePCM = 0
        location = ""
        dateAvailable = ""
        propertyType = ""
        bedroomNumber = 0
        description = ""

        def save(self):
                # you will need to change the following to match your mysql credentials:
                db=MySQLdb.connect("localhost","root","secret","manchester")
                c=db.cursor()

                self.description = unicode(self.description, errors='replace')
                self.description = self.description.encode('ascii','ignore')
                # TODO: might need to convert the other strings in the advert if there are any unicode conversion errors

                # pass the values as parameters so that the driver does the quoting and escaping
                sql = "INSERT INTO adverts (url,title,pricePCM,pricePW,location,dateAvailable,propertyType,bedroomNumber,description) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"
                params = (self.url, self.title, self.pricePCM, self.pricePW, self.location, self.dateAvailable, self.propertyType, self.bedroomNumber, self.description)

                c.execute(sql, params)
                db.commit()
                db.close()
</pre>
<p>In advert.py we convert the Unicode output that BeautifulSoup gives us into plain ASCII so that we can put it in the MySQL database without any problems. I could have used Unicode in the database as well, but the chances of really needing Unicode to represent Gumtree ads are quite slim. If you intend to use this code then you will also want to enter the MySQL credentials for your database.</p>
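<p>One thing to watch when saving scraped text is apostrophes in titles and descriptions, which break SQL that is assembled by hand. Passing the values as query parameters avoids the problem. The sketch below shows the idea with Python&#8217;s built-in sqlite3 module so that it runs without a MySQL server; the cut-down table and the save_advert function are stand-ins for this example. MySQLdb works the same way but uses %s as the placeholder instead of ?.</p>

```python
import sqlite3


def save_advert(connection, url, title, price_pcm):
    """Insert one advert, letting the driver quote the values."""
    connection.execute(
        "INSERT INTO adverts (url, title, pricePCM) VALUES (?, ?, ?)",
        (url, title, price_pcm),
    )
    connection.commit()


# An in-memory database with a cut-down version of the adverts table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE adverts (url TEXT PRIMARY KEY, title TEXT, pricePCM INTEGER)")
save_advert(connection, "http://example.com/ad/1", "Tom's 2 bed flat", 650)
```

<p>The title goes in and comes back out with its apostrophe intact, without any manual replacing of quote characters.</p>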
<p><b>listing.py:</b></p>
<pre lang="python">
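class listing:

        def __init__(self):
                # set per instance, so each listing gets its own list of advert URLs
                self.url = ""
                self.adverturls = []
                self.nextLink = ""

        def addAdvertURL(self,url):

                self.adverturls.append(url)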
</pre>
<p><b>scrapeAdvert.py:</b></p>
<pre lang="python">
from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2
from advert import advert
import time

class scrapeAdvert:

        page = ""
        soup = ""

        def scrape(self,advertURL):

                # give it a bit of time so gumtree doesn't
                # ban us
                time.sleep(2)

                url = advertURL
                # print "-- scraping "+url+" --"
                page = urllib2.urlopen(url)
                self.soup = BeautifulSoup(page)

                self.anAd = advert()

                self.anAd.url = url
                self.anAd.title = self.extractTitle()
                self.anAd.pricePW = self.extractPricePW()
                self.anAd.pricePCM = self.extractPricePCM()

                self.anAd.location = self.extractLocation()
                self.anAd.dateAvailable = self.extractDateAvailable()
                self.anAd.propertyType = self.extractPropertyType()
                self.anAd.bedroomNumber = self.extractBedroomNumber()
                self.anAd.description = self.extractDescription()

        def extractTitle(self):

                location = self.soup.find('h1')
                string = location.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'&quot;')
                # print '|' + stripped + '|'
                return stripped

        def extractPricePCM(self):

                location = self.soup.find('span',attrs={"class" : "price"})
                try:
                        string = location.contents[0]
                        string.index('pcm')
                except AttributeError: # for ads with no prices set
                        return 0
                except ValueError: # for ads with pw specified
                        return 0

                stripped = string.replace('&pound;','')
                stripped = stripped.replace('pcm','')
                stripped = stripped.replace(',','')
                stripped = stripped.replace("'",'&quot;')
                stripped = ' '.join(stripped.split())
                # print '|' + stripped + '|'
                return int(stripped)

        def extractPricePW(self):

                location = self.soup.find('span',attrs={"class" : "price"})
                try:
                        string = location.contents[0]
                        string.index('pw')
                except AttributeError: # for ads with no prices set
                        return 0
                except ValueError: # for ads with pcm specified
                        return 0
                stripped = string.replace('&pound;','')
                stripped = stripped.replace('pw','')
                stripped = stripped.replace(',','')
                stripped = stripped.replace("'",'&quot;')
                stripped = ' '.join(stripped.split())
                # print '|' + stripped + '|'
                return int(stripped)

        def extractLocation(self):

                location = self.soup.find('span',attrs={"class" : "location"})
                string = location.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'&quot;')
                # print '|' + stripped + '|'
                return stripped

        def extractDateAvailable(self):

                current_year = '2011'

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                firstP = ul.findAll('p')[0]
                string = firstP.contents[0]
                stripped = ' '.join(string.split())
                date_to_convert = stripped + '/'+current_year
                try:
                        date_object = time.strptime(date_to_convert, "%d/%m/%Y")
                except ValueError: # for adverts with no date available
                        return ""

                full_date = time.strftime('%Y-%m-%d %H:%M:%S', date_object)
                # print '|' + full_date + '|'
                return full_date

        def extractPropertyType(self):

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                try:
                        secondP = ul.findAll('p')[1]
                except IndexError: # for properties with no type
                        return ""
                string = secondP.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'&quot;')
                # print '|' + stripped + '|'
                return stripped

        def extractBedroomNumber(self):

                ul = self.soup.find('ul',attrs={"id" : "ad-details"})
                try:
                        thirdP = ul.findAll('p')[2]
                except IndexError: # for properties with no bedroom number
                        return 0
                string = thirdP.contents[0]
                stripped = ' '.join(string.split())
                stripped = stripped.replace("'",'&quot;')
                # print '|' + stripped + '|'
                return stripped

        def extractDescription(self):

                div = self.soup.find('div',attrs={"id" : "description"})
                description = div.find('p')
                contents = description.renderContents()
                contents = contents.replace("'",'&quot;')
                # print '|' + contents + '|'
                return contents
</pre>
<p>scrapeAdvert.py contains a lot of string manipulation to strip out unwanted characters, such as the &#8216;pw&#8217; suffix (short for per week) in the price string, which has to be removed before the weekly price can be stored as an integer.</p>
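<p>As a rough illustration of that clean-up step, here is a minimal standalone sketch in Python 3 (the scraper itself is Python 2). The function name parse_weekly_price and the sample strings are made up for this example; it only mirrors the idea of removing the currency marker, the &#8216;pw&#8217; suffix and the thousands separator before converting to an integer.</p>

```python
def parse_weekly_price(raw):
    """Return the weekly price as an int, or 0 if it cannot be parsed.

    Hypothetical helper for illustration; not part of scrapeAdvert.py.
    """
    cleaned = raw
    # Remove the currency marker (entity or literal), the 'pw' suffix and commas.
    for unwanted in ("&pound;", "\u00a3", "pw", ","):
        cleaned = cleaned.replace(unwanted, "")
    # Drop any leftover whitespace.
    cleaned = "".join(cleaned.split())
    try:
        return int(cleaned)
    except ValueError:
        # e.g. adverts priced per calendar month ('pcm') are not handled here
        return 0


print(parse_weekly_price("&pound;1,250pw"))
print(parse_weekly_price(" &pound;95 pw "))
print(parse_weekly_price("&pound;400pcm"))
```

<p>Returning 0 for anything that will not convert follows the same fallback the scraper uses for adverts that quote a monthly price.</p>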
<p>Using BeautifulSoup to pull out elements is quite easy, for example:</p>
<pre lang="python">
ul = self.soup.find('ul',attrs={"id" : "ad-details"})
</pre>
<p>That finds the first &lt;ul id=&quot;ad-details&quot;&gt; element in the page. Its child elements, such as the &lt;p&gt; tags inside the list, can then be pulled out with further calls such as findAll. More detail can be found in the <a href="http://www.crummy.com/software/BeautifulSoup/documentation.html">Beautiful Soup documentation</a>, which is very good.</p>
<p><b>scrapeListing.py:</b></p>
<pre lang="python">
from BeautifulSoup import BeautifulSoup          # For processing HTML
import urllib2
from listing import listing
import time

class scrapeListing:

        soup = ""
        url = ""
        aListing = ""

        def scrape(self,url):
                # give it a bit of time so gumtree doesn't
                # ban us
                time.sleep(3)

                print "scraping url = "+str(url)

                page = urllib2.urlopen(url)
                self.soup = BeautifulSoup(page)

                self.aListing = listing()
                self.aListing.url = url
                self.aListing.adverturls = self.extractAdvertURLs()
                self.aListing.nextLink = self.extractNextLink()

        def extractAdvertURLs(self):

                toReturn = []
                h3s = self.soup.findAll("h3")
                for h3 in h3s:
                        links = h3.findAll('a',{"class":"summary"})
                        for link in links:
                                print "|"+link['href']+"|"
                                toReturn.append(link['href'])

                return toReturn

        def extractNextLink(self):

                links = self.soup.findAll("a",{"class":"next"})
                try:
                        print ">"+links[0]['href']+">"
                except IndexError: # if there is no 'next' link found..
                        return ""
                return links[0]['href']
</pre>
<p>The extractNextLink method here extracts the pagination &#8216;next&#8217; link which will bring up the next listing page from the selection of listing pages to browse. We use it to step through the pagination &#8216;sequence&#8217; of resultant listing pages.</p>
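<p>The stepping-through idea can be sketched on its own, without any network access. This is a Python 3 sketch under stated assumptions: walk_listing_pages, get_next_link and the fake_site dictionary are hypothetical names invented here, standing in for the real scraping of each page&#8217;s &#8216;next&#8217; link.</p>

```python
def walk_listing_pages(start_url, get_next_link, max_pages=50):
    """Yield each listing page URL in turn, following 'next' links.

    get_next_link is a caller-supplied function (hypothetical here) that
    returns the next page's URL, or an empty string when there is none.
    """
    url = start_url
    seen = set()
    # Stop on an empty link, a repeated page, or after max_pages pages.
    while url and url not in seen and len(seen) < max_pages:
        seen.add(url)
        yield url
        url = get_next_link(url)


# Stand-in for real scraping: a dict that maps each page to its 'next' link.
fake_site = {
    "/listing?page=1": "/listing?page=2",
    "/listing?page=2": "/listing?page=3",
    "/listing?page=3": "",
}

pages = list(walk_listing_pages("/listing?page=1", fake_site.get))
print(pages)
```

<p>The seen set and the max_pages cap are additions of this sketch: they guard against a site whose &#8216;next&#8217; links loop back on themselves.</p>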
<p><b>scrapeSequence.py:</b></p>
<pre lang="python">
from scrapeListing import scrapeListing
from scrapeAdvert import scrapeAdvert
from listing import listing
from advert import advert
import MySQLdb
import _mysql_exceptions

# change this to the gumtree page you want to start scraping from
url = "http://www.gumtree.com/flats-and-houses-for-rent/salford-quays"

while url != None:
        print "scraping URL = "+url
        sl = ""
        sl = scrapeListing()
        sl.scrape(url)
        for advertURL in sl.aListing.adverturls:
                sa = ""
                sa = scrapeAdvert()
                sa.scrape(advertURL)
                try:
                        sa.anAd.save()
                except _mysql_exceptions.IntegrityError:
                        print "** Advert " + sa.anAd.url + " already saved **"
                sa.anAd = ""

        url = ""
        if sl.aListing.nextLink:
                print "nextLink = "+sl.aListing.nextLink
                url = sl.aListing.nextLink
        else:
                print 'all done.'
                break
</pre>
<p>This is the file you run to kick off the scrape. It wraps each save in a try/except block that catches the MySQL IntegrityError, which is raised when an advert has already been entered into the database. The URL of the advert is the primary key, and no two records can have the same primary key.</p>
<p>The URL set at the top of the file is the listing page the scrape starts from.</p>
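<p>The primary-key trick is easy to try in isolation. The sketch below swaps MySQL for Python&#8217;s built-in sqlite3 module so that it runs without a database server; the adverts table, its columns and the save_advert function are assumptions made for this example, not the schema used by the scraper.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The advert URL is the primary key, so it must be unique.
conn.execute("CREATE TABLE adverts (url TEXT PRIMARY KEY, title TEXT)")


def save_advert(url, title):
    """Insert an advert; return False if its URL is already stored."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO adverts (url, title) VALUES (?, ?)", (url, title)
            )
        return True
    except sqlite3.IntegrityError:
        # Raised by the database when the primary key already exists.
        return False


print(save_advert("http://example.com/ad/1", "Two bed flat"))
print(save_advert("http://example.com/ad/1", "Two bed flat"))
```

<p>Letting the database reject the duplicate avoids a separate &#8216;does this URL exist?&#8217; query before every insert.</p>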
<p>The above code worked well for scraping several hundred Manchester Gumtree ads into a database, from which point I was able to use a combination of phpMyAdmin and OpenOffice Spreadsheet to analyse the data and find out useful statistics about the property market in said area.</p>
<p><center><a href="http://www.davidcraddock.net/uploads/gumtree-scraper.tgz">Download the scraper source code in a tar.gz archive</a></center></p>
<p>Note: Due to the nature of web scraping, if &#8211; or more accurately, when &#8211; Gumtree changes its user interface, the scraper I have written will need to be tweaked accordingly to find the right data. This is meant to be an informative tutorial, not a finished product.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/05/01/scraping-gumtree-property-adverts-with-python-and-beautifulsoup/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>RESTful Web Services</title>
		<link>http://www.davidcraddock.net/2011/03/02/restful-web-services/</link>
		<comments>http://www.davidcraddock.net/2011/03/02/restful-web-services/#comments</comments>
		<pubDate>Wed, 02 Mar 2011 14:21:23 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Tutorials]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=876</guid>
		<description><![CDATA[REST (Representational State Transfer) is an architectural style for delivering web services. When a web service conforms to REST, it is known as RESTful. The largest system built along RESTful lines is the World Wide Web, whose Hypertext Transfer Protocol (HTTP) you use every day to &#8230; <a href="http://www.davidcraddock.net/2011/03/02/restful-web-services/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p><img src="http://www.davidcraddock.net/wp-content/uploads/2011/03/hammock-200x300.jpg" alt="Hammock with the background of a clear blue sky" title="hammock" width="200" height="300" class="aligncenter size-medium wp-image-882" /></p>
<p>REST (Representational State Transfer) is an architectural style for delivering web services. When a web service conforms to REST, it is known as RESTful. The largest system built along RESTful lines is the World Wide Web, whose Hypertext Transfer Protocol (HTTP) you use every day to send and receive information from web servers while browsing the internet.</p>
<p>A RESTful web service built on HTTP typically supports four methods: GET, PUT, POST and DELETE. Resources on RESTful web services are typically defined as collections of elements. The REST methods can act either on a whole collection or on a specific element in a collection.</p>
<p>A collection is usually expressed as a hierarchy in the URL. Take this fictitious layout, for example:</p>
<p><strong>Collection:</strong> www.bbc.co.uk/iplayer/programmes/<br />
<strong>Element:</strong> www.bbc.co.uk/iplayer/programmes/24<br />
<strong>Element:</strong> www.bbc.co.uk/iplayer/programmes/25<br />
<strong>Element:</strong> www.bbc.co.uk/iplayer/programmes/26</p>
<p>The REST methods you use do different things depending on whether you are interacting with a Collection resource or an Element resource. See below:</p>
<p><strong>On a Collection, e.g. www.bbc.co.uk/iplayer/programmes/</strong><br/><br />
GET – List the URLs of the collection’s members.<br />
PUT – Replace the entire collection with another collection.<br />
POST – Create a new element in the collection, returning the new element’s URL.<br />
DELETE – Delete the entire collection.</p>
<p><strong>On an Element, e.g. www.bbc.co.uk/iplayer/programmes/24</strong><br/><br />
GET – Retrieve the addressed element in the appropriate internet media type, e.g. a music file or an image.<br />
PUT – Replace the addressed element of the collection, or if it doesn’t exist, create it in the parent collection.<br />
POST – Treat the addressed element of the collection as a new collection, and add an element into it.<br />
DELETE – Delete the addressed element of the collection.</p>
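<p>To make the collection/element split concrete, here is a toy in-memory sketch in Python 3. It is not a web server or a real framework: the ProgrammeStore class and its handle method are invented for this example, and integer ids stand in for URLs. POST on an element (treating it as a nested collection) is left out to keep the sketch short.</p>

```python
class ProgrammeStore:
    """Toy in-memory resource store; hypothetical, for illustration only."""

    def __init__(self):
        self.items = {}  # element id -> representation

    def handle(self, method, element_id=None, body=None):
        # No id means the request addresses the whole collection.
        if element_id is None:
            return self._collection(method, body)
        return self._element(method, element_id, body)

    def _collection(self, method, body):
        if method == "GET":
            return sorted(self.items)        # list the members
        if method == "PUT":
            self.items = dict(body)          # replace the entire collection
            return None
        if method == "POST":
            new_id = max(self.items, default=0) + 1
            self.items[new_id] = body        # create a new element
            return new_id                    # and report where it lives
        if method == "DELETE":
            self.items.clear()               # delete the entire collection
            return None
        raise ValueError("unsupported method: " + method)

    def _element(self, method, element_id, body):
        if method == "GET":
            return self.items[element_id]    # retrieve one element
        if method == "PUT":
            self.items[element_id] = body    # replace it, or create it
            return element_id
        if method == "DELETE":
            del self.items[element_id]       # delete one element
            return None
        raise ValueError("unsupported method: " + method)


store = ProgrammeStore()
first = store.handle("POST", body="Doctor Who")
store.handle("PUT", element_id=24, body="Newsnight")
print(store.handle("GET"))
print(store.handle("GET", element_id=24))
store.handle("DELETE", element_id=first)
print(store.handle("GET"))
```

<p>The same method name does different work depending on whether an element id is present, which is the distinction the two lists above describe.</p>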
<p>REST is a simple and clear way of exposing the basic operations of data storage, known as CRUD (Create, Read, Update and Delete). See: <a href="http://en.wikipedia.org/wiki/Create,_read,_update_and_delete">http://en.wikipedia.org/wiki/Create,_read,_update_and_delete</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/03/02/restful-web-services/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>&#8216;Weather Forecast&#8217; Calendar Service in PHP</title>
		<link>http://www.davidcraddock.net/2011/02/24/a-3-day-weather-forecast-calendar-service/</link>
		<comments>http://www.davidcraddock.net/2011/02/24/a-3-day-weather-forecast-calendar-service/#comments</comments>
		<pubDate>Thu, 24 Feb 2011 19:31:48 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Tutorials]]></category>
		<category><![CDATA[bbc weather feed]]></category>
		<category><![CDATA[ical service]]></category>
		<category><![CDATA[php]]></category>
		<category><![CDATA[weather forecast]]></category>
		<category><![CDATA[web services]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=857</guid>
		<description><![CDATA[The BBC provide 3 day weather RSS feeds for most locations in the UK. I thought it would be interesting to create a web service to turn the weather feed into calendar feed format, so I could have a constantly &#8230; <a href="http://www.davidcraddock.net/2011/02/24/a-3-day-weather-forecast-calendar-service/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>The BBC provide 3 day weather RSS feeds for most locations in the UK. I thought it would be interesting to create a web service to turn the weather feed into calendar feed format, so I could have a constantly updated forecast of the next 3 days of weather mapped on to my iPhone’s calendar. Here it is on my iPhone:</p>
<p><a href="http://www.davidcraddock.net/wp-content/uploads/2011/02/weathercal.png"><img src="http://www.davidcraddock.net/wp-content/uploads/2011/02/weathercal.png" alt="Picture shows weather forecast on an iPhone calendar screenshot" title="weathercal" width="320" height="480" class="aligncenter size-full wp-image-868" /></a></p>
<p><strong>Overview</strong></p>
<p>The service is separated into five files:</p>
<ul>
<li><b>ical.php</b> – this contains the class ical which corresponds to a single calendar feed. A method called ‘addevent’ allows you to add new events to the calendar, and a method called ‘returncal’ redirects the resulting calendar file to the browser so people can subscribe to it using their calendar application.</li>
<li><b>forecast.php</b> – this file contains the class forecast, which has properties for everything we want to record for each day’s forecast, e.g. Wind Speed and Humidity. It also contains the forecast_set class, which is a collection of forecast objects. The set can be serialized, which means every forecast object it holds, including the Wind Speed, Humidity and all the other daily values, can be stored in a text file.</li>
<li><b>scrape-weather.php</b> – this file contains code that scrapes the weather feed, populates the forecast set with all the weather information for the next 3 days, and stores the result in a file called forecasts.ser.</li>
<li><b>forecasts.ser</b> – this is all the data for the three day weather forecast, in serialized format. It is automatically deleted and recreated when the scrape-weather.php script is run.</li>
<li><b>reader.php</b> – this file converts the forecasts.ser file into an iCal calendar, and outputs the iCal formatted result to the calendar application that requests the reader.php page.</li>
</ul>
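<p>The scraping step in the list above boils down to pulling a few numbers out of each feed item with regular expressions. Here is a minimal Python 3 sketch of that idea. The Forecast dataclass, the first_int and parse_item helpers and the sample strings are all assumptions made for this example; the sample text is only shaped after the field labels the PHP code searches for, and the real BBC feed may differ.</p>

```python
import re
from dataclasses import dataclass


@dataclass
class Forecast:
    summary: str
    high: int
    low: int
    windspeed: int
    humidity: int


def first_int(label, text):
    """Return the integer that follows 'label:' in text."""
    match = re.search(re.escape(label) + r":\s*(-?\d+)", text)
    if match is None:
        raise ValueError("missing field: " + label)
    return int(match.group(1))


def parse_item(title, description):
    """Build a Forecast from one feed item's title and description."""
    # The summary sits between the first colon and the following comma.
    summary_match = re.search(r":\s*(.+?),", title)
    if summary_match is None:
        raise ValueError("no summary found in title")
    return Forecast(
        summary=summary_match.group(1).title(),
        high=first_int("Max Temp", title),
        low=first_int("Min Temp", title),
        windspeed=first_int("Wind Speed", description),
        humidity=first_int("Humidity", description),
    )


# Made-up sample strings, shaped after the fields the PHP regexes look for.
sample_title = "Thursday: sunny intervals, Max Temp: 9\u00b0C, Min Temp: -1\u00b0C"
sample_description = "Wind Direction: SW, Wind Speed: 12mph, Humidity: 76%"

print(parse_item(sample_title, sample_description))
```

<p>Capturing the number in a regex group avoids the separate find-then-strip steps, and the optional minus sign keeps sub-zero temperatures intact.</p>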
<p>It uses two external libraries:</p>
<ul>
<li><b>MagpieRSS 0.72</b> – this popular library is used for reading the weather RSS feed and converting it into a PHP object that is easier for scrape-weather.php to manipulate.</li>
<li><b>iCalcreator 2.8</b> – this is used for creating the output iCal format of the calendar in ical.php and outputting it to the browser in reader.php.</li>
</ul>
<p><strong>Files</strong></p>
<pre lang="PHP">
<?php
// ical.php
require_once( 'ical/iCalcreator.class.php' );

class ical {
	public $v;

	function ical(){
		$this->init();
	}	

	function init(){
		$config = array( 'unique_id' => 'weather.davidcraddock.net' );
		  // set Your unique id
		$this->v = new vcalendar( $config );
		  // create a new calendar instance

		$this->v->setProperty( 'method', 'PUBLISH' );
		  // required of some calendar software
		$this->v->setProperty( "x-wr-calname", "Calendar Sample" );
		  // required of some calendar software
		$this->v->setProperty( "X-WR-CALDESC", "Calendar Description" );
		  // required of some calendar software
		$this->v->setProperty( "X-WR-TIMEZONE", "Europe/London" );
		  // required of some calendar software
	}

	function addevent($start_year,$start_month,$start_day,$start_hour,$start_min,
		  $finish_year,$finish_month,$finish_day,$finish_hour,$finish_min,
		  $summary,$description,$comment
	){
		$vevent = & $this->v->newComponent( 'vevent' );
		  // create an event calendar component
		$start = array( 'year'=>$start_year, 'month'=>$start_month, 'day'=>$start_day, 'hour'=>$start_hour, 'min'=>$start_min, 'sec'=>0 );
		$vevent->setProperty( 'dtstart', $start );
		$end = array( 'year'=>$finish_year, 'month'=>$finish_month, 'day'=>$finish_day, 'hour'=>$finish_hour, 'min'=>$finish_min, 'sec'=>0 );
		$vevent->setProperty( 'dtend', $end );
		$vevent->setProperty( 'LOCATION', '' );
		  // property name - case independent
		$vevent->setProperty( 'summary', $summary );
		$vevent->setProperty( 'description',$description );
		$vevent->setProperty( 'comment', $comment );
		$vevent->setProperty( 'attendee', 'contact@davidcraddock.net' );
	}

	function returncal(){
		// redirect calendar file to browser
		$this->v->returnCalendar();
	}
}
?>
</pre>
<pre lang="PHP">
<?php
//forecast.php

class forecast {
	public $day;
	public $month;
	public $year;

	public $high;
	public $low;
	public $summary;

	public $humidity;
	public $windspeed;
}

class forecast_set {
	public $forecasts;

	function forecast_set(){
		$this->forecasts = new ArrayObject();
	}
}
</pre>
<pre lang="PHP">
<?php
// scrape-weather.php
require_once('magpierss/rss_fetch.inc');
require_once('forecast.php');

class scrape3day {
	var $set; // forecast set

	// configuration variables

	// weather forecasts are stored in this file:
	var $store_path = "/home/david_craddock/work.davidcraddock.net/weather/forecasts.ser";
	// weather forecasts are fetched from this BBC feed:
	var $feed_url = "http://newsrss.bbc.co.uk/weather/forecast/2376/Next3DaysRSS.xml";

	function scrape3day(){
		$this->scrapecurrent();
		$this->store();
	}

	function store(){
		$store_path = $this->store_path;
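		// remove any previous forecast file; on the first run there is none
		if (file_exists($store_path)) {
			unlink($store_path);
		}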
		file_put_contents($store_path, serialize($this->set));
	}

	function scrapecurrent(){
		$url = $this->feed_url;
		$rss = fetch_rss( $url );
		$message = "";
		if(sizeof($rss->items) != 3){
			die("Problem with BBC weather feed.. dying");
		}
		$i=0;
		$set = new forecast_set();
		$curdate = date("Y-m-d");
		echo $curdate;
		foreach ($rss->items as $item) {
			$href = $item['link'];
			$title = $item['title'];
			$description = $item['description'];
			print_r($item);
			$curyear = date('Y',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			$curmonth = date('m',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			$curday = date('d',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
			preg_match('/:.+?,/',$title,$summary);
			preg_match('/Min Temp:.+?-*\d*/',$title,$mintemp);
			preg_match('/Max Temp:.+?-*\d*/',$title,$maxtemp);
			preg_match('/Wind Speed:.+?-*\d*/',$description,$windspeed);
			preg_match('/Humidity:.+?-*\d*/',$description,$humidity);
			$summary[0] = str_replace(': ','',$summary[0]);
			$summary[0] = str_replace(',','',$summary[0]);
			$mintemp[0] = str_replace('Min Temp: ','',$mintemp[0]);
			$maxtemp[0] = str_replace('Max Temp: ','',$maxtemp[0]);
			$windspeed[0] = str_replace('Wind Speed: ','',$windspeed[0]);
			$humidity[0] = str_replace('Humidity: ','',$humidity[0]);
			$mins[$i] = (int)$mintemp[0];
			$maxs[$i] = (int)$maxtemp[0];
			$forecast = new forecast();
			$forecast->low = (int)$mintemp[0];
			$forecast->high = (int)$maxtemp[0];
			$forecast->year = (int)$curyear;
			$forecast->month = (int)$curmonth;
			$forecast->day = (int)$curday;
			$forecast->windspeed = $windspeed[0];
			$forecast->humidity = $humidity[0];
			$forecast->summary = ucwords($summary[0]);
			$set->forecasts->append($forecast);
			$i++;
			$curdate = date('Y-m-d',strtotime(date("Y-m-d", strtotime($curdate)) . " +1 day"));
		}
		print_r($set);
		$this->set = $set;

	}

}
$s = new scrape3day();
</pre>
<pre lang="PHP">
<?php
require_once('ical.php');
require_once('forecast.php');

$c = new ical();
$f = unserialize(file_get_contents('forecasts.ser'));
for($i=0;$i<3;$i++){
	$curforecast = $f->forecasts[$i];
	$weather_digest = "Max: ".$curforecast->high." Min: ".$curforecast->low." Humidity: ".$curforecast->humidity."% Wind Speed: ".$curforecast->windspeed."mph.";
	$c->addevent($curforecast->year,$curforecast->month,$curforecast->day,7,0,$curforecast->year,$curforecast->month,$curforecast->day,7,30,$curforecast->summary,$weather_digest,$weather_digest);
}
$c->returncal();
?>
</pre>
<p><strong>SVN Version</strong></p>
<p>If you have subversion, you can check out the project from: http://svn.davidcraddock.net/weather-services/. There are a couple extra files in that directory for my automated freezing weather alerts, but you can safely ignore those.</p>
<p><strong>Installation</strong></p>
<p>You will have to add an entry to your crontab so that the scrape runs once per day. To run the script at midnight, add the following:</p>
<pre>0 0 * * * &lt;path to PHP interpreter&gt; &lt;path to scrape-weather.php&gt;</pre>
<p>For example, in my case:</p>
<pre>0 0 * * * /usr/local/bin/php /home/david_craddock/work.davidcraddock.net/weather/scrape-weather.php </pre>
<p>You will then need to edit the $store_path and $feed_url variables in scrape-weather.php. $store_path should refer to a file path that the web server can create and edit files in, and $feed_url should refer to the RSS feed for your local area, copied from the <a href="http://news.bbc.co.uk/weather/">http://news.bbc.co.uk/weather/</a> site. Don&#8217;t use mine, because your area is likely different. After that, you&#8217;re set to go.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/02/24/a-3-day-weather-forecast-calendar-service/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Find large files by using the OSX commandline</title>
		<link>http://www.davidcraddock.net/2011/02/22/find-large-files-by-using-the-osx-commandline/</link>
		<comments>http://www.davidcraddock.net/2011/02/22/find-large-files-by-using-the-osx-commandline/#comments</comments>
		<pubDate>Tue, 22 Feb 2011 00:16:12 +0000</pubDate>
		<dc:creator>David Craddock</dc:creator>
				<category><![CDATA[Solutions to a Specific Problem]]></category>
		<category><![CDATA[command line]]></category>
		<category><![CDATA[finding large files]]></category>
		<category><![CDATA[osx]]></category>

		<guid isPermaLink="false">http://www.davidcraddock.net/?p=852</guid>
		<description><![CDATA[To quickly find large files to delete if you have filled your startup disk, enter this command on the OSX terminal: sudo find / -size +500000 -print This will find and print out file paths to files over roughly 250MB. You &#8230; <a href="http://www.davidcraddock.net/2011/02/22/find-large-files-by-using-the-osx-commandline/">Continue reading <span class="meta-nav">&#8594;</span></a>]]></description>
			<content:encoded><![CDATA[<p>To quickly find large files to delete if you have filled your startup disk, enter this command on the OSX terminal:</p>
<pre lang="bash">
sudo find / -size +500000 -print
</pre>
<p>This will find and print out the paths of files over roughly 250MB: find measures -size in 512-byte blocks, so +500000 means more than 256,000,000 bytes. You can then go through them and delete them individually by typing <strong>rm &quot;&lt;file path&gt;&quot;</strong>. There is no undelete, so make sure you know you won&#8217;t miss them.</p>
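<p>If you would rather state the threshold directly in bytes, the same kind of search can be sketched in Python 3 using only the standard library. The function name find_large_files is made up for this example, and this is an illustrative alternative, not a replacement for the find command above.</p>

```python
import os


def find_large_files(root, min_bytes):
    """Yield (path, size) for regular files under root larger than min_bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue  # skip symlinks so targets are not counted twice
                size = os.path.getsize(path)
            except OSError:
                continue  # unreadable or vanished file
            if size > min_bytes:
                yield path, size
```

<p>For example, looping over find_large_files(os.path.expanduser(&quot;~&quot;), 500 * 1000 * 1000) and printing each pair would list files over 500MB in your home directory. Nothing is deleted; the function only reports what it finds.</p>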
]]></content:encoded>
			<wfw:commentRss>http://www.davidcraddock.net/2011/02/22/find-large-files-by-using-the-osx-commandline/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

