
May 14, 2011

Web scraping in Java with Jsoup, Part 2 (How-to)

Web scraping refers to programmatically downloading a page and traversing its DOM to extract the data you are interested in. I wrote a parser class in Java to perform the web scraping for my blog analyzer project. In Part 1 of this how-to I explained how I set up the calling mechanism for executing the parser against blog URLs. Here, I explain the parser class itself.

But before getting into the code, it is important to take note of the HTML structure of the document that will be parsed. The pages of The Dish are quite heavy, full of menus, JavaScript, and other clutter, but the area of interest is the set of blog posts themselves. This example shows the HTML structure of each blog post on The Dish:

<article>
    <aside>
        <ul class="entryActions" id="meta-6a00d83451c45669e2014e885e4354970d">
            <li class="entryEmail ir">
                <div class="st_email_custom maildiv" st_url="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html" st_title="Face Of The Day">email</div>
            </li>
            <li class="entryLink ir">
                <a href="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html" title="permalink this entry">permalink</a>
            </li>
            <li class="entryTweet"></li>
            <li class="entryLike"></li>
        </ul>

        <time datetime="2011-05-12T23:37:00-4:00" pubdate>12 May 2011 07:37 PM</time>
    </aside>

    <div class="entry">
        <h1>
            <a href="http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html">Face Of The Day</a>
        </h1>
        <p>
            <a href="http://dailydish.typepad.com/.a/6a00d83451c45669e2014e885e4233970d-popup" onclick="window.open( this.href, &#39;_blank&#39;, &#39;width=640,height=480,scrollbars=no,resizable=no,toolbar=no,directories=no,location=no,menubar=no,status=no,left=0,top=0&#39; ); return false" style="display: inline;">
                <img alt="GT_WWII-VET-JEWISH-110511" class="asset  asset-image at-xid-6a00d83451c45669e2014e885e4233970d" src="http://dailydish.typepad.com/.a/6a00d83451c45669e2014e885e4233970d-550wi" style="width: 515px;" title="GT_WWII-VET-JEWISH-110511" />
            </a>
        </p>
        <p>
        A decorated  veteran takes part [truncated]
        </p>
    </div>
</article>

Blog posts are each contained within an HTML5 article tag. There is a time tag holding the date and time the post was published. A div with class entry holds both the title and body of the post. The title is within an h1, which also contains the permalink for the post.

Now, the code to parse this page.

The simple blog parser interface again:

public interface BlogParser {
    public List<Link> parseURL(URL url) throws ParseException;
} 

Now to talk about the implementation class: DishBlogParser. The goal is to return a list of Link objects (a “Link” in this context represents one blog URL and its associated data). DishBlogParser will extract the title and body text of each blog post along with the post date, images, videos, and links contained therein. I’ll go through the class a section at a time. Starting from the top:

@Component("blogParser")
public class DishBlogParser implements BlogParser {
    
    @Value("${config.excerptLength}")
    private int excerptLength;
    @Autowired
    private DateTimeFormatter blogDateFormat;
    private final Cleaner cleaner;
    private final UrlValidator urlvalidator;
    
    public DishBlogParser() {
        Whitelist clean = Whitelist.simpleText().addTags("blockquote", "cite", "code", "p", "q", "s", "strike");
        cleaner = new Cleaner(clean);        
        urlvalidator = new UrlValidator(new String[]{"http","https"});
    }

The excerptLength field defines the maximum length for post body excerpts. The @Value annotation pulls in the value from a properties file configured in applicationContext.xml.
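For @Value to resolve, a property placeholder must be registered in the Spring context. A minimal sketch of what that might look like (the properties file name and the value shown are hypothetical):

<!-- in applicationContext.xml, assuming the "context" namespace is declared -->
<context:property-placeholder location="classpath:app.properties"/>

<!-- app.properties would then contain a line like: config.excerptLength=500 -->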

The blogDateFormat is a Joda formatter configured also in applicationContext.xml to match the date/time format used on The Dish. It will be used to parse dates from HTML into Joda DateTime objects. Here is how blogDateFormat is configured in applicationContext.xml:

<bean id="blogDateFormat" 
         class="org.joda.time.format.DateTimeFormat" 
         factory-method="forPattern">
    <constructor-arg value="dd MMM yyyy hh:mm aa"/>
</bean>
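To illustrate, the formatter produced by that bean parses the human-readable timestamp seen in the sample HTML above:

// same pattern as the Spring bean above
DateTimeFormatter fmt = DateTimeFormat.forPattern("dd MMM yyyy hh:mm aa");
DateTime dt = fmt.parseDateTime("12 May 2011 07:37 PM");
// dt represents 2011-05-12 19:37 in the default time zone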

The Cleaner object is a Jsoup class that applies a whitelist filter to HTML. In this case, the cleaner is used to whitelist tags that will be allowed to appear in blog body excerpts.
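As a quick sketch of what the Cleaner does (made-up input), tags outside the whitelist are dropped while their inner text is preserved:

Whitelist wl = Whitelist.simpleText().addTags("p", "blockquote");
Document dirty = Jsoup.parseBodyFragment("<p>Keep <b>this</b>, drop the <span class=\"x\">span wrapper</span></p>");
String safe = new Cleaner(wl).clean(dirty).body().html();
// safe: <p>Keep <b>this</b>, drop the span wrapper</p>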

Finally, the UrlValidator comes from Apache Commons and will be used to validate the syntax of URLs contained within blog posts.
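The scheme array restricts which protocols pass validation. For example:

UrlValidator v = new UrlValidator(new String[]{"http", "https"});
v.isValid("http://example.com/page");  // true
v.isValid("ftp://example.com/file");   // false: scheme not allowed
v.isValid("http://not a url");         // false: malformed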

Now, for the parseURL method:

    public List<Link> parseURL(URL url) throws ParseException {
        try {            
            // retrieve the document using Jsoup
            Connection conn = Jsoup.connect(url.toString());
            conn.timeout(12000);
            conn.userAgent("Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Trident/4.0)");
            Document doc = conn.get();

            // select all article tags
            Elements posts = doc.select("article");
            
            // base URI will be used within the loop below
            String baseUri = (new StringBuilder())
                .append(url.getProtocol())
                .append("://")
                .append(url.getHost())
                .toString();

            // initialize a list of Links
            List<Link> links = new ArrayList<Link>();

Here, Jsoup is used to connect to the URL. I set a generous connection timeout, because at times The Dish server is not very snappy. I also set a common user agent, just as a general practice when requesting a web page programmatically.

The call to conn.get() retrieves the Document, a DOM representation of the entire page. For this project, only the blog posts themselves are needed. Because each blog post is contained in an article tag, the set of posts is obtained by calling doc.select("article"). We're about to loop through them, but first we need to define the base URI of our URL for something a bit further down, and also initialize the List which will hold our extracted Link objects.
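For example, here is what that base URI construction yields for a typical archive URL (the exact path is hypothetical):

URL url = new URL("http://andrewsullivan.thedailybeast.com/2011/05/index.html");
String baseUri = url.getProtocol() + "://" + url.getHost();
// baseUri: "http://andrewsullivan.thedailybeast.com"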

Now, the loop. It starts like this:

            // loop through, extracting relevant data
            for (Element post : posts) {
                Link link = new Link();
                
                // extract the title of the post
                Elements elms = post.select(".entry h1");
                String title = (elms.isEmpty() ? "No Title" : elms.first().text().trim());
                link.setTitle(title);

First, an empty Link object is initialized. Then we extract the title. Recall that “post” is a Jsoup element pointing to the article tag in the DOM. post.select(".entry h1") grabs the h1 title tag, from which we get the title string.
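To make the selector concrete, here is a minimal sketch run against a stripped-down version of the sample markup shown earlier:

String html = "<article><div class=\"entry\">"
    + "<h1><a href=\"http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html\">Face Of The Day</a></h1>"
    + "<p>Post body...</p></div></article>";
Element post = Jsoup.parse(html).select("article").first();
String title = post.select(".entry h1").first().text();  // "Face Of The Day"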

In a similar fashion, we grab the URL and the date:

                // extract the URL of the post
                elms = post.select("aside .entryLink a");
                if (elms.isEmpty()) {
                    Logger.getLogger(DishBlogParser.class.getName()).log(Level.WARNING, "UNABLE TO LOCATE PERMALINK, TITLE = "+ title +", URL = "+ url);
                    continue;
                }
                link.setUrl(elms.first().attr("href"));

                // extract the date of the post
                elms = post.select("aside time");
                if (elms.isEmpty()) {
                    Logger.getLogger(DishBlogParser.class.getName()).log(Level.WARNING, "UNABLE TO LOCATE DATE, TITLE = "+ title +", URL = "+ url);
                    continue;
                }
                // parse the date string into a Joda DateTime object
                DateTime dt = blogDateFormat.parseDateTime(elms.first().text().trim());
                link.setLinkDate(dt);

Failure to extract the URL or date is unacceptable: in either case a warning is logged and the rest of the processing for that post is skipped. Note also that blogDateFormat is used here to parse the date string from the HTML into a DateTime object.

Next, let’s grab the body of the post and create an excerpt from it:

                // extract the body of the post (includes title tag at this point)
                Elements body = post.select(".entry");
                // remove the "more" link
                body.select(".moreLink").remove();
                
                // remove the title (h1) now from the body
                body.select("h1").remove();
                // set full text on link, used for indexing/searching (not stored)
                link.setFullText(body.text());

                // create a body "Document"                
                Document bodyDoc = Document.createShell(baseUri);
                for (Element bodyEl : body)
                    bodyDoc.body().appendChild(bodyEl);
                // remove unwanted tags by applying a tag whitelist
                // the whitelisted tags will appear when displaying excerpts
                String bodyhtml = cleaner.clean(bodyDoc).body().html();
                
                if (bodyhtml.length() > excerptLength) {
                    // we need to trim it down to excerptLength
                    bodyhtml = trimExcerpt(bodyhtml, excerptLength);
                    // we need to parse this again now to fix possible unclosed tags caused by trimming
                    bodyhtml = Jsoup.parseBodyFragment(bodyhtml).body().html();
                }
                link.setExcerpt(bodyhtml);

Recall the body is contained in a div classed entry. The body may contain a “read on” link that expands the content; if present, that link is removed by body.select(".moreLink").remove(). The title h1 tag is also removed, and the remaining text is stored with link.setFullText(body.text()). This full text is not destined to be stored in the database; instead it will be indexed by our search engine.

To create the excerpt, unwanted HTML tags must be removed. This is where the Jsoup Cleaner comes in. Because the Cleaner only processes Document objects, a dummy Document is created for the post (this is also where baseUri is used).

If, after processing the post body through the Cleaner, the length exceeds the excerptLength, it must be trimmed down to size. The trimExcerpt method does this. Because trimming might truncate closing HTML tags, Jsoup is used once more to parse the excerpt string, correcting any unbalanced tags. Finally, we have our excerpt.
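To see why that final re-parse matters, consider a trimmed fragment left with an unclosed tag; Jsoup balances it when parsing:

String trimmed = "<p>She argued that <b>the quick brown";
String fixed = Jsoup.parseBodyFragment(trimmed).body().html();
// fixed: <p>She argued that <b>the quick brown</b></p>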

Here is the trimExcerpt method called above:

    private String trimExcerpt(String str, int maxLen) {
        if (str.length() <= maxLen)
            return str;
        
        int endIdx = maxLen;
        while (endIdx > 0 && str.charAt(endIdx) != ' ')
            endIdx--;

        return str.substring(0, endIdx);
    }

The idea is to treat maxLen as an upper bound and keep backing up until a space character is found. In this way, words will not be cut off in the middle.
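A quick example of its behavior:

trimExcerpt("The quick brown fox", 12);  // returns "The quick"
trimExcerpt("short", 12);                // returns "short" unchanged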

Continuing the loop, next the links are extracted. They are represented by InnerLink objects. Any invalid or self links are skipped.

                // extract the links within the post
                List<InnerLink> inlinks = new ArrayList<InnerLink>();
                Elements innerlinks = body.select("a[href]");                
                
                // loop through each link, discarding self-links and invalids
                for (Element innerlink : innerlinks) {
                    String linkUrl = innerlink.attr("abs:href").trim();
                    if (linkUrl.equals(link.getUrl()))
                        continue;
                    else if (urlvalidator.isValid(linkUrl)) {                        
                        //System.out.println("link = "+ linkUrl);
                        InnerLink inlink = new InnerLink();
                        inlink.setUrl(linkUrl);
                        inlinks.add(inlink);
                    }
                    else
                        Logger.getLogger(DishBlogParser.class.getName()).log(Level.INFO, "INVALID URL: "+ linkUrl);
                }
                link.setInnerLinks(inlinks);
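Note the use of attr("abs:href") above: Jsoup resolves relative hrefs against the document's base URI, which here is the URL the page was fetched from. A quick sketch:

Document d = Jsoup.parse("<a href=\"/2011/05/fac-5.html\">link</a>",
        "http://andrewsullivan.thedailybeast.com/");
String abs = d.select("a[href]").first().attr("abs:href");
// abs: "http://andrewsullivan.thedailybeast.com/2011/05/fac-5.html"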

Next, extract any images:

                // extract the images from the post
                List<Image> linkimgs = new ArrayList<Image>();
                Elements images = body.select("img");
                for (Element image : images) {
                    Image img = new Image();
                    img.setOrigUrl(image.attr("src"));
                    img.setAltText(image.attr("alt").replaceAll("_", " "));
                    linkimgs.add(img);
                }
                link.setImages(linkimgs);

Finally, extract any YouTube or Vimeo videos (the two most popular providers). Note that this requires a more complex selector, in particular because several different embed formats have been used over the years:

                // extract Youtube and Vimeo videos from the post
                elms = body.select("iframe[src~=(youtube\\.com|vimeo\\.com)], object[data~=(youtube\\.com|vimeo\\.com)], embed[src~=(youtube\\.com|vimeo\\.com)]");
                List<Video> videos = new ArrayList<Video>(2);
                for (Element video : elms) {
                    String vidurl = video.attr("src");
                    // Jsoup's attr() returns an empty string, not null, when an attribute is absent
                    if (vidurl.isEmpty())
                        vidurl = video.attr("data");
                    if (vidurl.trim().isEmpty())
                        continue;
                    Video vid = new Video();
                    vid.setUrl(vidurl);
                    if (vidurl.toLowerCase().contains("vimeo.com"))
                        vid.setProvider(VideoProvider.VIMEO);
                    else
                        vid.setProvider(VideoProvider.YOUTUBE);
                    videos.add(vid);
                }
                link.setVideos(videos);
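The [attr~=regex] selector form matches elements whose attribute value matches the given regular expression, so one selector covers all three embed styles. A quick sketch against made-up embed markup:

Document d = Jsoup.parseBodyFragment(
      "<iframe src=\"http://www.youtube.com/embed/abc123\"></iframe>"
    + "<embed src=\"http://vimeo.com/moogaloop.swf?clip_id=123\">");
Elements vids = d.select("iframe[src~=(youtube\\.com|vimeo\\.com)], embed[src~=(youtube\\.com|vimeo\\.com)]");
// vids.size() == 2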

The loop is now finished; all data for this post has been gathered. The Link object is added to the List, the loop ends, and the list is returned:

                links.add(link);
            }
            return links;
        }
        catch (IOException ex) {
            Logger.getLogger(DishBlogParser.class.getName()).log(Level.SEVERE, "IOException when attempting to parse URL "+ url, ex);
            throw new ParseException(ex);
        }
    }

In conclusion…

This post has demonstrated web scraping using the open-source Jsoup library. Specifically, we loaded a page from a URL and used Jsoup’s selector syntax to extract the desired pieces of data. In a future post, I will write about what happens next: the list of Links is processed by a service bean and stored in the database.
