Posts tagged ‘jsoup’

14 May

Web scraping in Java with Jsoup, Part 2 (How-to)

Web scraping refers to programmatically downloading a page and traversing its DOM to extract the data you are interested in. I wrote a parser class in Java to perform the web scraping for my blog analyzer project. In Part 1 of this how-to I explained how I set up the calling mechanism for executing the parser against blog URLs. Here, I explain the parser class itself.

But before getting into the code, it is important to take note of the HTML structure of the document that will be parsed. The pages of The Dish are quite heavy, full of menus, JavaScript, and other clutter, but the area of interest is the set of blog posts themselves. This example shows the HTML structure of each blog post on The Dish:
Read more
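To give a flavor of what that parsing looks like, here is a minimal, hypothetical sketch of a Jsoup traversal over such a structure. The URL, selectors, and class names below are placeholders for illustration, not The Dish's actual markup:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ScrapeSketch {
    public static void main(String[] args) throws Exception {
        // Download the page and parse it into a DOM document in one step
        Document doc = Jsoup.connect("http://dish.andrewsullivan.com")
                .userAgent("Mozilla/5.0")
                .timeout(10000)
                .get();

        // ".post" is a placeholder selector -- the real class names
        // come from inspecting the blog's markup as described above
        Elements posts = doc.select(".post");
        for (Element post : posts) {
            String title = post.select("h2 a").text();     // hypothetical structure
            String body  = post.select(".post-body").text();
            System.out.println(title);
        }
    }
}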

28 Apr

Web scraping in Java with Jsoup, Part 1

In order to obtain the data to feed into my blog analyzer, content must be parsed from the pages of the blog itself.  This is called “web scraping”.  Jsoup will be used to parse the pages, and because this is a Spring project, Spring scheduling will be used to invoke the parser.

The following classes were created:

  • BlogRequest – invokes the parser on a given blog URL, passes parsed content to service layer
  • BlogRequestQueue – queues up and executes blog requests
  • BlogParser – interface with parseURL method
  • DishBlogParser – implements BlogParser, used to parse the blog The Dish

Each of these (aside from the interface) is configured as a Spring-managed bean.  The code for BlogRequest:
Read more
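In the meantime, here is a rough sketch of how these pieces might fit together. Only BlogParser and its parseURL method come from the list above; ParsedBlog, BlogService, and the method names are stand-ins for illustration:

// Hypothetical shapes of the classes listed above. Only BlogParser
// and parseURL are named in the post; the rest are illustrative.
interface BlogParser {
    ParsedBlog parseURL(String url) throws Exception;
}

class ParsedBlog { /* parsed posts would live here */ }

interface BlogService {
    void save(ParsedBlog blog); // hypothetical service-layer entry point
}

class BlogRequest {
    private final BlogParser parser;
    private final BlogService service;
    private final String url;

    BlogRequest(BlogParser parser, BlogService service, String url) {
        this.parser = parser;
        this.service = service;
        this.url = url;
    }

    // Invoke the parser on the given blog URL and pass the
    // parsed content to the service layer
    void execute() throws Exception {
        service.save(parser.parseURL(url));
    }
}

In the real project each concrete class would be registered as a Spring-managed bean, so the parser and service are injected rather than constructed by hand.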

19 Mar

How to extract titles from web pages in Java

Let’s say you have a set of URLs and you want the web page titles associated with them. Maybe you’ve data-mined a bunch of links from HTML pages, or acquired a flat file listing URLs. How would you go about getting the corresponding page titles, and associating them with the URLs using Java?

You could use an HTML parser such as Jsoup to request the HTML document associated with each URL and parse it into a DOM document.  Once obtained, you could navigate the document and select the text from the title tag, like so:

String titleText = document.select("title").first().text();

Elegant, but a lot of overhead for such a simple task. You'd be loading the whole page into memory and parsing it into a DOM structure just to extract the title. Alternatively, you could use the Apache HttpClient library, which provides a robust API for requesting resources over HTTP, but that too would be unnecessary here. Let's keep it simple and rely only on the Java standard library.

Read more
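As a taste of that approach, here is a minimal sketch that reads the response stream with java.net.URL and scans it for the title tag with a regular expression. It assumes a regex is acceptable for opportunistically grabbing one well-known tag from reasonably well-formed pages (unlike sanitization, below, this is not a security boundary):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TitleFetchSketch {
    // Case-insensitive match for the contents of the <title> tag
    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    public static String fetchTitle(String url) throws Exception {
        StringBuilder head = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(url).openStream(), StandardCharsets.UTF_8))) {
            String line;
            // Stop reading as soon as the title appears; no need to
            // buffer the whole page or build a DOM
            while ((line = in.readLine()) != null) {
                head.append(line).append('\n');
                Matcher m = TITLE.matcher(head);
                if (m.find()) {
                    return m.group(1).trim();
                }
            }
        }
        return null; // no <title> found
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchTitle("https://example.com"));
    }
}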

27 Feb

To sanitize user content, use an HTML parser

If you allow any HTML at all in user-submitted content, it is especially important to sanitize that content by actually parsing the HTML and filtering out any tags or attributes you wish to exclude. If you fail to do so, your site may be vulnerable to XSS (cross-site scripting) attacks.

Q: “But isn’t it overkill to parse the HTML? Can’t I use other techniques, such as regular expressions or simple string replacement, to filter out dangerous tags and attributes?” A: No, and I’ll explain why. Read more
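For what it’s worth, Jsoup itself ships a sanitizer built on exactly this parse-then-filter approach. A minimal sketch (Safelist is the class name in recent Jsoup releases; older versions call it Whitelist):

import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class SanitizeSketch {
    public static void main(String[] args) {
        String userHtml = "<p>Hello <b>world</b>"
                + "<script>alert('xss')</script>"
                + "<a href=\"javascript:evil()\">click</a></p>";

        // Parse the fragment into a real DOM, then keep only the tags
        // and attributes the safelist allows; the script element and
        // the javascript: href are dropped
        String safe = Jsoup.clean(userHtml, Safelist.basic());

        System.out.println(safe);
        // prints something like:
        // <p>Hello <b>world</b><a rel="nofollow">click</a></p>
    }
}

Because Jsoup walks a parsed DOM and copies across only the nodes the safelist permits, malformed or disguised markup cannot sneak through the way it can with string matching.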