
April 28, 2011

Web scraping in Java with Jsoup, Part 1

In order to obtain the data to feed into my blog analyzer, content must be parsed from the pages of the blog itself.  This is called “web scraping”.  Jsoup will be used to parse the pages, and because this is a Spring project, Spring scheduling will be used to invoke the parser.

The following classes were created:

  • BlogRequest – invokes the parser on a given blog URL, passes parsed content to service layer
  • BlogRequestQueue – queues up and executes blog requests
  • BlogParser – interface with parseURL method
  • DishBlogParser – implements BlogParser, used to parse the blog The Dish

Each of these (aside from the interface) is configured as a Spring-managed bean.  The code for BlogRequest:

@Component("blogRequest")
public class BlogRequest {
    @Autowired(required=true)
    private BlogParser parser;
    @Autowired(required=true)
    private BlogService service;
    @Autowired(required=true)
    private BlogRequestQueue requestQueue;
    /**
     * blogUrl URL to pages of the blog, in a format like
     * http://andrewsullivan.thedailybeast.com/page/##/ where
     * ## stands in for the page number.
     */
    @Value("${config.blogUrl}")
    private String blogUrl;

    /**
     * Execute a blog request for a given page number, invoking the parser,
     * and passing the data to the service layer for further processing.
     * @param pageNumber the page number of the blog page to parse
     */
    public void execute(int pageNumber) {
        System.out.println("executing blog request");
        try {
            List<Link> links = parser.parseURL(new URL(blogUrl.replace("##", String.valueOf(pageNumber))));
            for (Link link : links) {
                service.addLinkAsync(link);
            }
            //if page > 1 and links not empty, queue a request for the next page
            if (pageNumber > 1 && !links.isEmpty())
                requestQueue.enqueue(pageNumber + 1);
        } catch (ParseException pe) {
            Logger.getLogger(BlogRequest.class.getName()).log(Level.SEVERE, null, pe);
            // requeue the page so it can be reprocessed
            requestQueue.enqueue(pageNumber);
        }
        catch (MalformedURLException mue) {
            Logger.getLogger(BlogRequest.class.getName()).log(Level.SEVERE, null, mue);
        }
    }
}

The @Component annotation registers this class as a Spring-managed bean via component scanning, which in turn allows it to be autowired into other beans.  Here, we autowire in a parser, a service bean, and the request queue bean.  blogUrl is configured in a properties file and wired in with @Value.
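
For reference, the wiring behind @Value might look something like this; the properties file name is an assumption, as the actual file isn't shown here:

# blog.properties (file name assumed)
config.blogUrl=http://andrewsullivan.thedailybeast.com/page/##/

along with a property placeholder declared in applicationContext.xml to resolve the ${...} expressions:

<context:property-placeholder location="classpath:blog.properties"/>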

The execute method invokes the parser, which returns a List of Link objects.  A "Link", in the context of this project, is a blog entry: specifically, its URL, posted date, excerpt, and assorted other data.  This list is passed to the service bean for further processing.  If the page number is greater than 1 and the parse returned at least one link, the next page is added to the queue.  In this way, older blog pages will be scraped one by one until the oldest is reached.
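
The Link class itself isn't shown in this post; based on the fields described above, a minimal sketch might look like this (the field names are assumptions):

import java.util.Date;

public class Link {
    private String url;       // URL of the blog entry
    private Date postedDate;  // date the entry was posted
    private String excerpt;   // excerpt of the entry text
    // the "assorted other data", plus getters and setters, omitted
}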

The code for BlogRequestQueue is pretty simple:

public class BlogRequestQueue {
    @Autowired(required=true)
    private BlogRequest blogRequest;
    private Queue<Integer> queue = new LinkedList<Integer>();
    /**
     * pageScanStart page number to initialize the queue to.
     * A value of zero or less will disable page scanning.
     */
    @Value("${config.pageScanStart}")
    private int pageScanStart = 0;
    
    @PostConstruct    
    public void initializeQueue() {
        if (pageScanStart > 0)
            queue.add(pageScanStart);
    }

    public boolean enqueue(int pageNumber) {
        return queue.add(pageNumber);
    }

    /**
     * Execute the next request in the queue
     */
    //@Scheduled(fixedDelay=30000l)
    public synchronized void executeNext() {
        Integer pageNumber = queue.poll();
        if (pageNumber != null)
            blogRequest.execute(pageNumber);
    }
}

The value of pageScanStart is set via @Value from a properties file. The @PostConstruct annotation is used to mark initializeQueue() as an init method. Note that the @Scheduled annotation on the executeNext method is commented out. Rather than hard-code this setting, I moved the scheduling to applicationContext.xml to make the interval configurable.
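
The moved configuration might look something like this, using the Spring task namespace (the bean id and the delay property name are assumptions, not my actual config):

<!-- requires xmlns:task="http://www.springframework.org/schema/task" -->
<task:scheduler id="scheduler"/>
<task:scheduled-tasks scheduler="scheduler">
    <!-- delay in milliseconds, resolved from the properties file -->
    <task:scheduled ref="blogRequestQueue" method="executeNext"
                    fixed-delay="${config.queueDelay}"/>
</task:scheduled-tasks>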

BlogParser.java:

public interface BlogParser {

    public List<Link> parseURL(URL url) throws ParseException;

}

The code for DishBlogParser is where all the interesting web scraping happens. See Web scraping in Java with Jsoup, Part 2 for the details.
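
In the meantime, here's a minimal sketch of what a Jsoup-based implementation might look like. The class name, the CSS selectors, and the Link setter are placeholders, and java.text.ParseException is an assumption; none of this is the actual DishBlogParser code.

import java.io.IOException;
import java.net.URL;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SketchBlogParser implements BlogParser {

    public List<Link> parseURL(URL url) throws ParseException {
        List<Link> links = new ArrayList<Link>();
        try {
            // fetch the page and parse it into a DOM
            Document doc = Jsoup.connect(url.toString()).get();
            // each matched element becomes one Link; "div.post" and
            // "a.permalink" are placeholder selectors
            for (Element post : doc.select("div.post")) {
                Link link = new Link();
                link.setUrl(post.select("a.permalink").attr("href"));
                links.add(link);
            }
        } catch (IOException ioe) {
            throw new ParseException("could not fetch " + url, 0);
        }
        return links;
    }
}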
