Solutions to a Specific Problem

...now browsing by category

Posts that are a record of my solution to a specific problem that I’ve encountered.

 

JSoup Method for Page Scraping

Wednesday, September 7th, 2011

Soup bowl

I’m currently in the process of writing a web scraper for the forums on Gaia Online. Previously, I used to use Python to develop web scrapers, with the very handy Python library BeautifulSoup. Java has an equivalent called JSoup.

Here I have written a class which is extended by each class in my project that wants to scrape HTML. This ‘Scraper’ class deals with the fetching of the HTML and converting it into a JSoup tree to be navigated and have the data picked out of. It advertises itself as a ‘web spider’ type of web agent and also adds a 0-7 second random wait before fetching the page to make sure it isn’t used to overload a web server. It also converts the entire page to ASCII, which may not be the best thing to do for multi-language web pages, but certainly has made the scraping of the English language site Gaia Online much easier.

Here it is:

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.text.Normalizer;
import java.util.Random;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
 
/**
* Generic scraper object that contains the basic methods required to fetch
* and parse HTML content. Extended by other classes that need to scrape.
*
* @author David
*/
public class Scraper {
 
        public String pageHTML = ""; // the HTML for the page
        public Document pageSoup; // the JSoup scraped hierachy for the page
 
 
        public String fetchPageHTML(String URL) throws IOException{
 
            // this makes sure we don't scrape the same page twice
            if(this.pageHTML != ""){
                return this.pageHTML;
            }
 
            System.getProperties().setProperty("httpclient.useragent", "spider");
 
            Random randomGenerator = new Random();
            int sleepTime = randomGenerator.nextInt(7000);
            try{
                Thread.sleep(sleepTime); //sleep for x milliseconds
            }catch(Exception e){
                // only fires if topic is interruped by another process, should never happen
            }
 
            String pageHTML = "";
 
            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpget = new HttpGet(URL);
 
                HttpResponse response = httpclient.execute(httpget);
                HttpEntity entity = response.getEntity();
 
                if (entity != null) {
                    InputStream instream = entity.getContent();
                    String encoding = "UTF-8";
 
                    StringWriter writer = new StringWriter();
                    IOUtils.copy(instream, writer, encoding);
 
                    pageHTML = writer.toString();
 
                    // convert entire page scrape to ASCII-safe string
                    pageHTML = Normalizer.normalize(pageHTML, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
 
                }
 
                return pageHTML;
        }
 
        public Document fetchPageSoup(String pageHTML) throws FetchSoupException{
 
            // this makes sure we don't soupify the same page twice
            if(this.pageSoup != null){
                return this.pageSoup;
            }
 
            if(pageHTML.equalsIgnoreCase("")){
                throw new FetchSoupException("We have no supplied HTML to soupify.");
            }
 
            Document pageSoup = Jsoup.parse(pageHTML);
 
            return pageSoup;
        }
}

Then each class subclasses this scraper class, and adds the actual drilling down through the JSoup hierachy tree to get what is required:

...
this.pageHTML = this.fetchPageHTML(this.rootURL);
this.pageSoup = this.fetchPageSoup(this.pageHTML);
 
// get the first <div id="forum_hd_topic_pagelinks">..</div> section on the page
Element forumPageLinkSection = this.pageSoup.getElementsByAttributeValue("id","forum_hd_topic_pagelinks").first();
// get all the links in the above <div> section
Elements forumPageLinks = forumPageLinkSection.getElementsByAttribute("href");
...

I’ve found that this method provides a simple and effective way of scraping pages and using the resultant JSoup tree to pick out important data.

Disabling Control-Enter and Control-B shortcut keys in Outlook 2003

Wednesday, July 13th, 2011

At work, I still have to use Windows XP and Outlook 2003. I don’t particually mind this, except when I draft an email to someone and accidently I press Control-B instead of Control-V. Control-B will go ahead and send your partially composed email, resulting in some embarassment as you have to tell everyone to disregard it.

So I wanted to remove the ‘send email’ shortcut keys in Outlook 2003. There are two ways of doing this, one involves editing your group policy, which is something only my IT administration team can do, and I didn’t want to have to involve them. The other way is by making a change to your registry, which I will describe here.

  1. Open up regedit, and browse to the following registry key: HKEY_CURRENT_USER -> Software -> Microsoft -> office -> 11.0 -> outlook
  2. Then create a new key called: “DisabledShortcutKeysCheckBoxes”.
  3. Under that key, create two new String Values:
    Name: CtrlB Data: 66,8
    Name: CtrlEnter Data: 13,8
  4. Then restart Outlook and those keys will be disabled.

Click on the thumbnail below to see what the finished edit should look like:

Directory names not visable under ls? Change your colours.

Wednesday, May 4th, 2011

There is a problem I frequently encouter on Redhat/Fedora/CentOS systems with the output of the ls command. Under those distributions, the default setup is to display directories in a very dark colour. If you usually use a white foreground and a black background on your terminal client (such as Putty) then you will struggle to read the names of the directories under Redhat-based distributions.

There are two soloutions that I have used:

1. Change the colour settings in Putty

If you use Putty, ticking ‘Use System Colours’ here changes the “white foreground, black background” default into a “white background, black foreground”. This way you can at least read the console properly, good for a quick fix. You can also save these settings in putty to be the default for the host that you are connecting to, or even all hosts.

2. Change the LS_COLORS directive temporarily in the shell.

Alternatively, you can ask the ls command to display directories and other entries in colours that you specify. You could add these lines to the bottom of your .bashrc to make these changes permanent, or if you are using a shared machine, just copy and paste the following lines into the terminal and they will change the colours to a reddish more visable set, until you logout. :

alias ls='ls --color' # just to make sure we are using coloured ls
LS_COLORS='di=94:fi=0:ln=31:pi=5:so=5:bd=5:cd=5:or=31:mi=0:ex=35:*.rpm=90'
export LS_COLORS

(Original source for this particular LS_COLORS combo: http://linux-sxs.org/housekeeping/lscolors.html)

Find large files by using the OSX commandline

Tuesday, February 22nd, 2011

To quickly find large files to delete if you have filled your startup disk, enter this command on the OSX terminal:

sudo find / -size +500000 -print

This will find and print out file paths to files over 500MB. You can then go through them and delete them individually by typing rm “<file path>”, although there is no undelete so make sure you know you won’t miss them.

Finding files in Linux modified between two dates

Wednesday, February 16th, 2011

You use the ‘touch’ command to create two blank files, with a last modified date that you specify – one with a date of the start of the range you want to specify, and the second with a date at the end of the range you want to specify. Then you reference to those two files in your find command:

touch /tmp/temp -t 200604141130
touch /tmp/ntemp -t 200604261630
find /data/ -cnewer /tmp/temp -and ! -cnewer /tmp/ntemp