
March 19, 2011


How to extract titles from web pages in Java

Let’s say you have a set of URLs and you want the web page titles associated with them. Maybe you’ve data-mined a bunch of links from HTML pages, or acquired a flat file listing URLs. How would you go about getting the corresponding page titles, and associating them with the URLs using Java?

You could use an HTML parser such as Jsoup to request the HTML document associated with each URL and parse it into a DOM document.  Once obtained, you could navigate the document and select the text from the title tag, like so:

String titleText = document.select("title").first().text();

Elegant, but a lot of overhead for such a simple task: you'd be loading the whole page into memory and parsing it into a DOM structure just to extract the title. Alternatively, you could fetch the page with the Apache HttpClient library, which provides a robust API for requesting resources over HTTP, but that would also be more than this task needs. Let's keep it simple and rely only on the Java standard library.

To extract the title from a web page, you need to open a URLConnection. With this connection, you can read the response headers from the server as well as the response body (which ought to contain a title tag). Before attempting to grab the page title, check the Content-Type response header to validate that the URL does indeed reference a document of type text/html; otherwise the URL may point to an image file, a PDF or some other type of resource.
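
The Content-Type check can be tried on plain header strings before any connection is involved. This is only a minimal sketch: the class name MediaTypeCheck and the method isHtml are made up for illustration, and it assumes that only text/html counts as HTML (a stricter or looser rule may suit your data better).

```java
import java.util.Locale;

// Hypothetical helper, not part of the TitleExtractor class in this post.
final class MediaTypeCheck {
    /**
     * Returns true if a raw Content-Type header value denotes text/html.
     * Anything after the first ';' (such as a charset parameter) is ignored.
     */
    static boolean isHtml(String headerValue) {
        if (headerValue == null) {
            return false; // the server sent no Content-Type header
        }
        int semicolon = headerValue.indexOf(';');
        String mediaType = (semicolon == -1)
            ? headerValue
            : headerValue.substring(0, semicolon);
        // media types are case-insensitive and may be padded with whitespace
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // prints "true"
        System.out.println(isHtml("Text/HTML"));                // prints "true"
        System.out.println(isHtml("application/pdf"));          // prints "false"
        System.out.println(isHtml(null));                       // prints "false"
    }
}
```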

Next, it is good practice to determine the character set of the HTML page. The server frequently sends it as the charset parameter of the Content-Type header value, but not always; it may instead be declared in an HTML meta tag. For this example we'll look only at the Content-Type header, and if the character set is not specified there, we'll fall back to the platform's default character set.
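
As a rough illustration of the header-only lookup, the sketch below pulls the charset parameter out of a Content-Type value and falls back to a charset chosen by the caller. The names CharsetResolver and resolve are hypothetical, and accepting a quoted value (charset="utf-8") is an assumption on my part rather than something the rest of this post relies on.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the TitleExtractor class in this post.
final class CharsetResolver {
    // matches charset=utf-8 as well as charset="utf-8"
    private static final Pattern CHARSET_PARAM =
        Pattern.compile("charset\\s*=\\s*\"?([-_a-zA-Z0-9.:+]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, or fallback if none is usable. */
    static Charset resolve(String headerValue, Charset fallback) {
        if (headerValue != null) {
            Matcher m = CHARSET_PARAM.matcher(headerValue);
            if (m.find()) {
                String name = m.group(1);
                try {
                    if (Charset.isSupported(name)) {
                        return Charset.forName(name);
                    }
                } catch (IllegalArgumentException e) {
                    // syntactically illegal charset name: fall through to the fallback
                }
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // prints "ISO-8859-1"
        System.out.println(resolve("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8));
        // no charset parameter, so the fallback is used: prints "UTF-8"
        System.out.println(resolve("text/html", StandardCharsets.UTF_8));
    }
}
```

Passing Charset.defaultCharset() as the fallback gives the behaviour described above.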

After you’ve verified that the URL points to an HTML page and have determined the character set, the next step is to extract the title text from the response body. Only roughly the first 8192 characters of the body are read, since the title tag normally sits near the top of the document. In this example, I use regular expressions to extract and clean up the title. Have a look at this TitleExtractor class, with comments to explain what is going on:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class TitleExtractor {
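    /* the CASE_INSENSITIVE flag accounts for
     * sites that use uppercase title tags.
     * the DOTALL flag accounts for sites that have
     * line feeds in the title text.
     * the reluctant (.*?) stops at the first closing tag, so a
     * later </title> (e.g. inside inline SVG) is not swallowed,
     * and [^>]* tolerates attributes on the opening tag */
    private static final Pattern TITLE_TAG =
        Pattern.compile("<title\\b[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);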
 
    /**
     * @param url the HTML page
     * @return title text (null if document isn't HTML or lacks a title tag)
     * @throws IOException
     */
    public static String getPageTitle(String url) throws IOException {
        URL u = new URL(url);
        URLConnection conn = u.openConnection();
 
        // ContentType is an inner class defined below;
        // it is null when the server sent no Content-Type header
        ContentType contentType = getContentTypeHeader(conn);
        if (contentType == null || !contentType.contentType.equalsIgnoreCase("text/html"))
            return null; // don't continue if not HTML
        else {
            // determine the charset, or use the default
            Charset charset = getCharset(contentType);
            if (charset == null)
                charset = Charset.defaultCharset();
 
            // read the response body, using BufferedReader for performance;
            // try-with-resources closes the reader even if reading fails
            InputStream in = conn.getInputStream();
            StringBuilder content = new StringBuilder();
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
                int n = 0, totalRead = 0;
                char[] buf = new char[1024];
 
                // read until EOF or until at least 8192 characters are buffered
                while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
                    content.append(buf, 0, n);
                    totalRead += n;
                }
            }
 
            // extract the title
            Matcher matcher = TITLE_TAG.matcher(content);
            if (matcher.find()) {
                /* replace any occurrences of whitespace (which may
                 * include line feeds and other uglies) as well
                 * as HTML brackets with a space */
                return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
            }
            else
                return null;
        }
    }
 
    /**
     * Loops through response headers until Content-Type is found.
     * @param conn
     * @return ContentType object representing the value of
     * the Content-Type header
     */
    private static ContentType getContentTypeHeader(URLConnection conn) {
        int i = 0;
        boolean moreHeaders = true;
        do {
            String headerName = conn.getHeaderFieldKey(i);
            String headerValue = conn.getHeaderField(i);
            // HTTP header names are case-insensitive
            if (headerName != null && headerName.equalsIgnoreCase("Content-Type"))
                return new ContentType(headerValue);
 
            i++;
            moreHeaders = headerName != null || headerValue != null;
        }
        while (moreHeaders);
 
        return null;
    }
 
    private static Charset getCharset(ContentType contentType) {
        if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
            return Charset.forName(contentType.charsetName);
        else
            return null;
    }
 
    /**
     * Class holds the content type and charset (if present)
     */
    private static final class ContentType {
        private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
 
        private String contentType;
        private String charsetName;
        private ContentType(String headerValue) {
            if (headerValue == null)
                throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
            int n = headerValue.indexOf(";");
            if (n != -1) {
                // trim so that "text/html ; charset=..." still compares equal to "text/html"
                contentType = headerValue.substring(0, n).trim();
                Matcher matcher = CHARSET_HEADER.matcher(headerValue);
                if (matcher.find())
                    charsetName = matcher.group(1);
            }
            else
                contentType = headerValue.trim();
        }
    }
}
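
The regex step can also be exercised on an in-memory string, without opening a connection. The sketch below is a standalone illustration (TitleRegexDemo and extractTitle are hypothetical names); it uses a reluctant quantifier so that the match ends at the first closing title tag, and it applies the same whitespace and bracket cleanup as the class above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for trying the pattern on sample markup.
final class TitleRegexDemo {
    private static final Pattern TITLE_TAG =
        Pattern.compile("<title\\b[^>]*>(.*?)</title>",
                        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the cleaned-up title text, or null if no title tag is found. */
    static String extractTitle(CharSequence html) {
        Matcher matcher = TITLE_TAG.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        // collapse runs of whitespace and stray angle brackets into one space
        return matcher.group(1).replaceAll("[\\s<>]+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<html><head><TITLE>\n  Hello,\n  world </TITLE></head>"
            + "<body><svg><title>icon</title></svg></body></html>";
        System.out.println(extractTitle(html));                   // prints "Hello, world"
        System.out.println(extractTitle("<p>no title here</p>")); // prints "null"
    }
}
```

With a greedy (.*) the first call would run on to the closing tag of the SVG title, which is why the reluctant form is used here.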

Making use of this class is simple:

String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
System.out.println(title);

Output: Wikipedia, the free encyclopedia
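
To get back to the original goal of associating a set of URLs with their titles, a loop along the following lines could work. This is a sketch under assumptions: TitleIndex, TitleLookup and collect are names invented here, and the lookup is passed in as a parameter so the example runs with a stub instead of real network calls.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper that maps URLs to page titles.
final class TitleIndex {
    /** Stand-in for a method such as TitleExtractor.getPageTitle. */
    interface TitleLookup {
        String titleOf(String url) throws IOException;
    }

    /** Returns URL -> title in input order, skipping URLs without a title. */
    static Map<String, String> collect(List<String> urls, TitleLookup lookup) {
        Map<String, String> titles = new LinkedHashMap<>();
        for (String url : urls) {
            try {
                String title = lookup.titleOf(url);
                if (title != null) { // null means "not HTML" or "no title tag"
                    titles.put(url, title);
                }
            } catch (IOException e) {
                // one unreachable URL should not abort the whole batch
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // stub lookup with made-up URLs, used instead of real HTTP requests
        TitleLookup stub = url -> {
            if (url.endsWith(".pdf")) return null;
            if (url.contains("down")) throw new IOException("connection refused");
            return "Title of " + url;
        };
        List<String> urls = Arrays.asList(
            "http://a.example/", "http://b.example/doc.pdf", "http://down.example/");
        collect(urls, stub).forEach((url, title) -> System.out.println(url + " -> " + title));
    }
}
```

In real use, the stub could be swapped for the method reference TitleExtractor::getPageTitle.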

So in this example, we used the Java standard library to fetch a web page and extract its title. Normally I would recommend using an HTML parser, but for this simple task it was not necessary.

10 Comments
  1. ahmed
    May 14 2011

    is there any like button here 😀

  2. Martha
    May 19 2012

    thnx alot 🙂

  3. Arie
    Jul 11 2012

    Thanks a lot very helpful. Saved me some time.

  4. Herbert
    Jul 12 2012

    Thank you for this code, it works fine.

  5. Laurentiu
    Jul 16 2012

    great post, thank you!

  6. simonalsa
    Dec 5 2013

    Nice contribution. Ty.

  7. soumya
    Dec 17 2013

    nice post

  8. Nada
    Mar 31 2014

    Thanks thats really helpful 🙂

  9. Tyler
    Apr 18 2014

    Thank you!

  10. Alex
    Jun 12 2014

    I attempted to do this on Android with an API level past 3.0, where this code runs on the main thread. I used it before, but as of now that is no longer possible.
