JSoup

...now browsing by tag

 
 

JSoup Method for Page Scraping

Wednesday, September 7th, 2011

Soup bowl

I’m currently in the process of writing a web scraper for the forums on Gaia Online. Previously, I used to use Python to develop web scrapers, with the very handy Python library BeautifulSoup. Java has an equivalent called JSoup.

Here I have written a class which is extended by each class in my project that wants to scrape HTML. This ‘Scraper’ class deals with the fetching of the HTML and converting it into a JSoup tree to be navigated and have the data picked out of. It advertises itself as a ‘web spider’ type of web agent and also adds a 0-7 second random wait before fetching the page to make sure it isn’t used to overload a web server. It also converts the entire page to ASCII, which may not be the best thing to do for multi-language web pages, but certainly has made the scraping of the English language site Gaia Online much easier.

Here it is:

import java.io.IOException;
import java.io.InputStream;
import java.io.StringWriter;
import java.text.Normalizer;
import java.util.Random;
import org.apache.commons.io.IOUtils;
import org.apache.http.HttpEntity;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
 
/**
* Generic scraper object that contains the basic methods required to fetch
* and parse HTML content. Extended by other classes that need to scrape.
*
* @author David
*/
public class Scraper {
 
        public String pageHTML = ""; // the HTML for the page
        public Document pageSoup; // the JSoup scraped hierachy for the page
 
 
        public String fetchPageHTML(String URL) throws IOException{
 
            // this makes sure we don't scrape the same page twice
            if(this.pageHTML != ""){
                return this.pageHTML;
            }
 
            System.getProperties().setProperty("httpclient.useragent", "spider");
 
            Random randomGenerator = new Random();
            int sleepTime = randomGenerator.nextInt(7000);
            try{
                Thread.sleep(sleepTime); //sleep for x milliseconds
            }catch(Exception e){
                // only fires if topic is interruped by another process, should never happen
            }
 
            String pageHTML = "";
 
            HttpClient httpclient = new DefaultHttpClient();
            HttpGet httpget = new HttpGet(URL);
 
                HttpResponse response = httpclient.execute(httpget);
                HttpEntity entity = response.getEntity();
 
                if (entity != null) {
                    InputStream instream = entity.getContent();
                    String encoding = "UTF-8";
 
                    StringWriter writer = new StringWriter();
                    IOUtils.copy(instream, writer, encoding);
 
                    pageHTML = writer.toString();
 
                    // convert entire page scrape to ASCII-safe string
                    pageHTML = Normalizer.normalize(pageHTML, Normalizer.Form.NFD).replaceAll("[^\\p{ASCII}]", "");
 
                }
 
                return pageHTML;
        }
 
        public Document fetchPageSoup(String pageHTML) throws FetchSoupException{
 
            // this makes sure we don't soupify the same page twice
            if(this.pageSoup != null){
                return this.pageSoup;
            }
 
            if(pageHTML.equalsIgnoreCase("")){
                throw new FetchSoupException("We have no supplied HTML to soupify.");
            }
 
            Document pageSoup = Jsoup.parse(pageHTML);
 
            return pageSoup;
        }
}

Then each class subclasses this scraper class, and adds the actual drilling down through the JSoup hierachy tree to get what is required:

...
this.pageHTML = this.fetchPageHTML(this.rootURL);
this.pageSoup = this.fetchPageSoup(this.pageHTML);
 
// get the first <div id="forum_hd_topic_pagelinks">..</div> section on the page
Element forumPageLinkSection = this.pageSoup.getElementsByAttributeValue("id","forum_hd_topic_pagelinks").first();
// get all the links in the above <div> section
Elements forumPageLinks = forumPageLinkSection.getElementsByAttribute("href");
...

I’ve found that this method provides a simple and effective way of scraping pages and using the resultant JSoup tree to pick out important data.