web scraping | A Web Coding Blog

May

Web scraping in Java with Jsoup, Part 2 (How-to)

Web scraping refers to programmatically downloading a page and traversing its DOM to extract the data you are interested in. I wrote a parser class in Java to perform the web scraping for my blog analyzer project. In Part 1 of this how-to I explained how I set up the calling mechanism for executing the parser against blog URLs. Here, I explain the parser class itself.
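
To make the "download, then traverse the DOM" idea concrete, here is a minimal sketch. It is not the parser from this project. Jsoup is not part of the JDK, so to stay dependency-free this sketch swaps in the JDK's built-in XML DOM parser and runs it on a small, well-formed, entirely hypothetical fragment. The class name, the markup, and the "post" class are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class DomTraversalSketch {

    // Hypothetical, well-formed markup standing in for a downloaded page.
    static final String PAGE =
        "<html><body>"
        + "<div class=\"menu\">Home</div>"
        + "<div class=\"post\"><h2>First post</h2><p>Hello</p></div>"
        + "<div class=\"post\"><h2>Second post</h2><p>World</p></div>"
        + "</body></html>";

    // Walks the DOM and collects the <h2> text of every <div class="post">.
    static List<String> extractPostTitles(String markup) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));

            List<String> titles = new ArrayList<>();
            NodeList divs = doc.getElementsByTagName("div");
            for (int i = 0; i < divs.getLength(); i++) {
                Element div = (Element) divs.item(i);
                // Skip menus and other elements that are not blog posts.
                if (!"post".equals(div.getAttribute("class"))) {
                    continue;
                }
                NodeList headings = div.getElementsByTagName("h2");
                if (headings.getLength() > 0) {
                    titles.add(headings.item(0).getTextContent().trim());
                }
            }
            return titles;
        } catch (Exception e) {
            // Parsing problems are rethrown unchecked to keep the sketch short.
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(extractPostTitles(PAGE)); // prints [First post, Second post]
    }
}
```

Real pages are rarely well-formed XML, which is one reason to reach for a lenient HTML parser such as Jsoup instead of the strict parser used here.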

Before getting into the code, it is important to take note of the HTML structure of the document that will be parsed. The pages of The Dish are quite heavy, full of menus, JavaScript, and other clutter, but the area of interest is the set of blog posts themselves. This example shows the HTML structure of each blog post on The Dish:
Read more

Apr

Web scraping in Java with Jsoup, Part 1

To obtain the data that feeds my blog analyzer, content must be parsed from the pages of the blog itself, a technique called “web scraping”. Jsoup parses the pages, and because this is a Spring project, Spring scheduling invokes the parser.

The following classes were created:

BlogRequest – invokes the parser on a given blog URL and passes the parsed content to the service layer
BlogRequestQueue – queues up and executes blog requests
BlogParser – interface with a parseURL method
DishBlogParser – implements BlogParser and is used to parse the blog The Dish
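
As a rough illustration of how these responsibilities could fit together, here is a plain-Java sketch with no Spring and no Jsoup. It is not the project's actual code. The method signatures, the StubBlogParser stand-in (used in place of DishBlogParser), the Consumer that plays the role of the service layer, and the example URL are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.function.Consumer;

public class BlogPipelineSketch {

    // Assumed signature: the original only says the interface has a parseURL method.
    interface BlogParser {
        List<String> parseURL(String url);
    }

    // Hypothetical stand-in for DishBlogParser: returns canned content, no network access.
    static class StubBlogParser implements BlogParser {
        @Override
        public List<String> parseURL(String url) {
            return List.of("parsed content from " + url);
        }
    }

    // Invokes the parser on one URL and hands the result to the "service layer".
    static class BlogRequest {
        private final String url;
        private final BlogParser parser;
        private final Consumer<List<String>> service;

        BlogRequest(String url, BlogParser parser, Consumer<List<String>> service) {
            this.url = url;
            this.parser = parser;
            this.service = service;
        }

        void execute() {
            service.accept(parser.parseURL(url));
        }
    }

    // Queues requests and runs them in first-in, first-out order.
    static class BlogRequestQueue {
        private final Queue<BlogRequest> pending = new ArrayDeque<>();

        void add(BlogRequest request) {
            pending.add(request);
        }

        void executeAll() {
            BlogRequest next;
            while ((next = pending.poll()) != null) {
                next.execute();
            }
        }
    }

    // Wires the pieces together by hand; in the project, Spring does the wiring.
    static List<String> run(List<String> urls) {
        List<String> received = new ArrayList<>();
        BlogParser parser = new StubBlogParser();
        BlogRequestQueue queue = new BlogRequestQueue();
        for (String url : urls) {
            queue.add(new BlogRequest(url, parser, received::addAll));
        }
        queue.executeAll();
        return received;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("http://example.com/blog")));
    }
}
```

Because BlogRequest depends only on the BlogParser interface, a parser for a different blog could be swapped in without touching the request or queue classes.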

Each of these (aside from the interface) is configured as a Spring-managed bean. The code for BlogRequest:
Read more
