
Showing posts from December, 2010

Review of new parseHtml() Function in Google Refine

In my last post, I mentioned how Beautiful Soup is an elegant way to parse HTML with Google Refine.

Well, it just got better thanks to Iain Sproat's latest commit to Google Refine (and his Java skills are getting better all the time!). If you pull down trunk and build, you'll see that he has integrated jsoup (jsoup.org), a Java HTML parsing library in the same spirit as Beautiful Soup. Iain has done a great job of pushing the jsoup Element stack right up to GREL (Google Refine Expression Language) for concise usage. I love it!

Using jsoup's simple selector syntax, I was able to easily parse out company websites from LinkedIn's public pages. The expression I used selects the div called data-table that contains the term Website and returns the htmlText of the 2nd <a href>. In Refine, indexing starts at [0], so in this case [1] gives the 2nd href link. The cookbook on the jsoup.org website, especially its section on selector syntax, is a great place to start learning more.
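
Something along these lines, as a rough sketch rather than the exact expression. It assumes that data-table is a CSS class on the div (if it is an id, the selector would start with div#data-table instead) and that the cell value holds the raw HTML of the page:

```
value.parseHtml().select("div.data-table:contains(Website) a[href]")[1].htmlText()
```

Here parseHtml() turns the cell into a jsoup document, select() applies the selector and returns the matching elements in order, [1] picks the 2nd match, and htmlText() returns its text.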

Enjoy!
-Thad