How to extract titles from web pages in Java

Let’s say you have a set of URLs and you want the web page titles associated with them. Maybe you’ve data-mined a bunch of links from HTML pages, or acquired a flat file listing URLs. How would you go about getting the corresponding page titles, and associating them with the URLs using Java?

You could use an HTML parser such as Jsoup to request the HTML document associated with each URL and parse it into a DOM document. Once obtained, you could navigate the document and select the text from the title tag, like so:

String titleText = document.select("title").first().text();

Elegant, but a lot of overhead for such a simple task. You’d be loading the whole page into memory and parsing it into a DOM structure just to extract the title. Alternatively, you could use the Apache HttpClient library, which provides a robust API for requesting resources over HTTP, but that is also more than this task needs. Let’s keep it simple and rely only on the Java standard library.

To extract the title from a web page, you need to open a URLConnection. With this connection, you’ll be able to read the response headers from the server as well as the response body (which ought to contain a title tag). Before attempting to grab the page title, check the Content-Type response header. Validate that the URL does indeed reference a document of type text/html; otherwise it may point to an image file, a PDF, or some other type of resource.
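As a minimal sketch of that check, the hypothetical helper below (MediaTypeCheck.isHtml is an illustrative name, not part of the class shown later) takes a raw Content-Type header value, drops any parameters after the semicolon, and compares the media type case-insensitively. It assumes that only text/html counts as HTML; whether to also accept something like application/xhtml+xml is a choice left to you.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the TitleExtractor class.
class MediaTypeCheck {

    /** Returns true if a raw Content-Type header value denotes text/html. */
    static boolean isHtml(String headerValue) {
        if (headerValue == null) {
            return false; // no header at all: treat as "not HTML"
        }
        // Drop parameters such as "; charset=UTF-8"
        int semicolon = headerValue.indexOf(';');
        String mediaType = semicolon == -1 ? headerValue : headerValue.substring(0, semicolon);
        // Media types are case-insensitive and may carry stray whitespace
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8")); // true
        System.out.println(isHtml("TEXT/HTML"));                // true
        System.out.println(isHtml("application/pdf"));          // false
    }
}
```

In a live program the header value would typically come from URLConnection.getContentType(), which returns null when the server did not send the header.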

Next, it is good practice to determine the character set of the HTML page. The server frequently sends this in the Content-Type header value, but not always; sometimes it is declared only in an HTML meta tag. For this example we’ll look only at the Content-Type header, and if the character set is not specified there, we will fall back to the platform’s default character set.
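The example in this article stops at the header, but the meta-tag fallback could be sketched roughly as follows. MetaCharsetSniffer and its regular expression are assumptions made for illustration: the first bytes of the body are decoded as ISO-8859-1 (which maps every byte to a character) and searched for a charset declaration inside a meta tag. This is a simplification and far looser than the encoding detection browsers perform.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the TitleExtractor class.
class MetaCharsetSniffer {
    // Matches both <meta charset="utf-8"> and
    // <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]*?charset\\s*=\\s*[\"']?([a-zA-Z0-9][-_a-zA-Z0-9]*)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the charset declared in a meta tag, or the fallback if none is usable. */
    static Charset sniff(byte[] head, Charset fallback) {
        // ISO-8859-1 never fails to decode, and the markup we need is plain ASCII
        String text = new String(head, StandardCharsets.ISO_8859_1);
        Matcher matcher = META_CHARSET.matcher(text);
        if (matcher.find() && Charset.isSupported(matcher.group(1))) {
            return Charset.forName(matcher.group(1));
        }
        return fallback;
    }

    public static void main(String[] args) {
        byte[] head = "<html><head><meta charset=\"utf-8\"><title>x</title>"
            .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(head, Charset.defaultCharset())); // UTF-8
    }
}
```

A caller would use this only when the Content-Type header carries no charset, passing the first kilobyte or so of the response body.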

After you’ve verified that the URL points to an HTML page and have determined the character set, the next step is to extract the title text from the response body. In this example, I use regular expressions to extract and clean up the title. Have a look at this TitleExtractor class, with comments to explain what is going on:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
 
public class TitleExtractor {
    /* the CASE_INSENSITIVE flag accounts for
     * sites that use uppercase title tags.
     * the DOTALL flag accounts for sites that have
     * line feeds in the title text.
     * the reluctant (.*?) stops at the first closing tag,
     * and [^>]* tolerates attributes on the title tag */
    private static final Pattern TITLE_TAG =
        Pattern.compile("<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
 
    /**
     * @param url the HTML page
     * @return title text (null if document isn't HTML or lacks a title tag)
     * @throws IOException
     */
    public static String getPageTitle(String url) throws IOException {
        URL u = new URL(url);
        URLConnection conn = u.openConnection();
 
        // ContentType is an inner class defined below
        ContentType contentType = getContentTypeHeader(conn);
        // a missing Content-Type header yields null, so guard against it
        if (contentType == null || !contentType.contentType.equalsIgnoreCase("text/html"))
            return null; // don't continue if not HTML
        else {
            // determine the charset, or use the default
            Charset charset = getCharset(contentType);
            if (charset == null)
                charset = Charset.defaultCharset();
 
            // read the response body, using BufferedReader for performance;
            // try-with-resources closes the stream even if a read fails
            StringBuilder content = new StringBuilder();
            try (InputStream in = conn.getInputStream();
                 BufferedReader reader = new BufferedReader(new InputStreamReader(in, charset))) {
                int n = 0, totalRead = 0;
                char[] buf = new char[1024];

                // read until EOF or first 8192 characters
                while (totalRead < 8192 && (n = reader.read(buf, 0, buf.length)) != -1) {
                    content.append(buf, 0, n);
                    totalRead += n;
                }
            }
 
            // extract the title
            Matcher matcher = TITLE_TAG.matcher(content);
            if (matcher.find()) {
                /* replace any occurrences of whitespace (which may
                 * include line feeds and other uglies) as well
                 * as HTML brackets with a space */
                return matcher.group(1).replaceAll("[\\s\\<>]+", " ").trim();
            }
            else
                return null;
        }
    }
 
    /**
     * Loops through response headers until Content-Type is found.
     * @param conn
     * @return ContentType object representing the value of
     * the Content-Type header
     */
    private static ContentType getContentTypeHeader(URLConnection conn) {
        int i = 0;
        boolean moreHeaders = true;
        do {
            String headerName = conn.getHeaderFieldKey(i);
            String headerValue = conn.getHeaderField(i);
            // HTTP header names are case-insensitive
            if (headerName != null && headerName.equalsIgnoreCase("Content-Type"))
                return new ContentType(headerValue);
 
            i++;
            moreHeaders = headerName != null || headerValue != null;
        }
        while (moreHeaders);
 
        return null;
    }
 
    private static Charset getCharset(ContentType contentType) {
        if (contentType != null && contentType.charsetName != null && Charset.isSupported(contentType.charsetName))
            return Charset.forName(contentType.charsetName);
        else
            return null;
    }
 
    /**
     * Class holds the content type and charset (if present)
     */
    private static final class ContentType {
        private static final Pattern CHARSET_HEADER = Pattern.compile("charset=([-_a-zA-Z0-9]+)", Pattern.CASE_INSENSITIVE|Pattern.DOTALL);
 
        private String contentType;
        private String charsetName;
        private ContentType(String headerValue) {
            if (headerValue == null)
                throw new IllegalArgumentException("ContentType must be constructed with a not-null headerValue");
            int n = headerValue.indexOf(";");
            if (n != -1) {
                // trim in case the server sends "text/html ; charset=..."
                contentType = headerValue.substring(0, n).trim();
                Matcher matcher = CHARSET_HEADER.matcher(headerValue);
                if (matcher.find())
                    charsetName = matcher.group(1);
            }
            else
                contentType = headerValue.trim();
        }
    }
}
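
To try the title pattern without making a network request, it can be exercised against an in-memory string. The sketch below is a standalone illustration rather than part of the class above: TitlePatternDemo is a hypothetical name, and its pattern uses a reluctant quantifier and tolerates attributes on the title tag, on the assumption that a later closing title tag (for example inside inline SVG) should not extend the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical standalone demo; names are illustrative.
class TitlePatternDemo {
    private static final Pattern TITLE_TAG = Pattern.compile(
        "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the cleaned-up text of the first title tag, or null if there is none. */
    static String extractTitle(CharSequence html) {
        Matcher matcher = TITLE_TAG.matcher(html);
        if (!matcher.find()) {
            return null;
        }
        // Collapse whitespace runs and stray angle brackets into single spaces
        return matcher.group(1).replaceAll("[\\s<>]+", " ").trim();
    }

    public static void main(String[] args) {
        String html = "<HTML><HEAD><TITLE>\n  Example\n  Domain </TITLE></HEAD>"
            + "<body><svg><title>icon</title></svg></body></HTML>";
        System.out.println(extractTitle(html)); // Example Domain
    }
}
```

With a greedy (.*) the same input would match through to the second closing tag and pull markup from the page body into the result.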


Making use of this class is simple:

String title = TitleExtractor.getPageTitle("http://en.wikipedia.org/");
System.out.println(title);

Output: Wikipedia, the free encyclopedia
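
Returning to the original scenario of a whole list of URLs, the titles can be collected into a map keyed by URL. This is a rough sketch: TitleIndex and its TitleLookup interface are hypothetical names introduced here so the lookup can be stubbed out and run offline; in real use you could pass TitleExtractor::getPageTitle. It assumes that a failed or title-less lookup should be skipped rather than abort the whole batch.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; names are not from the article's class.
class TitleIndex {

    /** Lookup abstraction; TitleExtractor::getPageTitle has this shape. */
    interface TitleLookup {
        String titleOf(String url) throws IOException;
    }

    /** Maps each URL to its title, skipping URLs that fail or have no title. */
    static Map<String, String> titlesFor(List<String> urls, TitleLookup lookup) {
        Map<String, String> titles = new LinkedHashMap<>(); // preserves input order
        for (String url : urls) {
            try {
                String title = lookup.titleOf(url);
                if (title != null) {
                    titles.put(url, title);
                }
            } catch (IOException e) {
                System.err.println("Skipping " + url + ": " + e.getMessage());
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        // Stub lookup so the sketch runs without network access
        TitleLookup stub = url -> {
            if (url.endsWith(".pdf")) return null; // not HTML
            if (url.contains("unreachable")) throw new IOException("connection refused");
            return "Title of " + url;
        };
        List<String> urls = Arrays.asList(
            "http://a.example/", "http://a.example/doc.pdf", "http://unreachable.example/");
        titlesFor(urls, stub).forEach((url, title) -> System.out.println(url + " -> " + title));
    }
}
```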

So in this example, we used the Java standard library to look up a web page and extract its title. Normally I would recommend using an HTML parser, but for this simple task it was not necessary.

10 Comments

ahmed
May 14 2011
is there any like button here 😀

Martha
May 19 2012
thnx alot 🙂

Arie
Jul 11 2012
Thanks a lot very helpful. Saved me some time.

Herbert
Jul 12 2012
Thank you for this code, it works fine.

Laurentiu
Jul 16 2012
great post, thank you!

simonalsa
Dec 5 2013
Nice contirbution. Ty.

soumya
Dec 17 2013
nice post

Nada
Mar 31 2014
Thanks thats really helpful 🙂

Tyler
Apr 18 2014
Thank you!

Alex
Jun 12 2014
I attempted to do this on Android with API past 3.0 and this code runs on the Main Thread. I used it before but as of present it is not possible

March 19, 2011
