Posts

Showing posts from 2010

Review of new parseHtml() Function in Google Refine

In my last post, I mentioned that Beautiful Soup is an elegant way to parse HTML in Google Refine.

Well, it just got better thanks to Iain Sproat's latest commit to Google Refine (and his Java skills are getting better all the time!).  If you pull down trunk and build, you'll see that he has integrated the jsoup (jsoup.org) Java library, an HTML parser that does much the same job for Java that Beautiful Soup does for Python.  Iain has done a great job of exposing jsoup's Element objects directly in GREL (Google Refine Expression Language) through the new parseHtml() function, which keeps expressions concise.  I love it!

Using jsoup's simple selector syntax, I was able to easily parse out company websites from LinkedIn's public pages.  My expression says: select the div called data-table that contains the term Website, and return the htmlText of the 2nd <a href> inside it.  In Refine, indexing starts at [0], so in this case [1] gives the 2nd href link.  The cookbook on the jsoup.org website, especially its section on selector syntax, is a great place to start learning more.
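
As a rough sketch, the GREL would look something like value.parseHtml().select("div.data-table:contains(Website) a")[1].htmlText(), assuming data-table is a class on the div rather than an id. To show what that selector does step by step, here is the same logic in plain Python using only the standard library. This is not jsoup or GREL; the sample HTML, the class name, and the helper names (LinkTextCollector, second_link_text) are made up for illustration.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the text of every <a href> found inside <div class="data-table">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0    # > 0 while we are inside the target div
        self.in_link = False  # True while we are inside an <a href> in that div
        self.div_text = []    # all text seen inside the target div
        self.link_texts = []  # one entry per <a href> inside the target div

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            # Enter the target div, or track a nested div while already inside it.
            if self.div_depth > 0 or "data-table" in classes:
                self.div_depth += 1
        elif tag == "a" and self.div_depth > 0 and "href" in attrs:
            self.in_link = True
            self.link_texts.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth > 0:
            self.div_depth -= 1
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.div_depth > 0:
            self.div_text.append(data)
            if self.in_link:
                self.link_texts[-1] += data


def second_link_text(html):
    """Return the text of the 2nd link in the data-table div, or None."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    # Mimic :contains(Website): the div must mention the term somewhere.
    if "Website" not in "".join(parser.div_text):
        return None
    links = parser.link_texts
    # Indexing starts at 0, so [1] is the 2nd link.
    return links[1].strip() if len(links) > 1 else None


# Hypothetical markup, loosely shaped like a company profile table.
SAMPLE_HTML = """
<div class="data-table">
  <dl>
    <dt>Industry</dt><dd><a href="/industry/software">Software</a></dd>
    <dt>Website</dt><dd><a href="http://www.example.com">http://www.example.com</a></dd>
  </dl>
</div>
"""

company_website = second_link_text(SAMPLE_HTML)
print(company_website)
```

One simplification to keep in mind: if a page had several data-table divs, this sketch would pool their text and links together, whereas a real selector engine matches each div separately.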

Enjoy!
-Thad

Google Refine and easier HTML parsing

After spending most of the day BANGING my head against Regex and GREL to handle HTML parsing, I thought: there MUST be a better way to parse HTML!!!

I know several of you have thought the same thing.  So, I took the time today to find out where and how this could be improved, either directly in Google Refine or with an extension.
It just so happens that Google Refine already has a wonderful extension that brings in another language: Jython.

Enter BeautifulSoup (love the name?), a Python library that runs under Jython, for powerful HTML parsing and entity extraction.

Here's more on how to use it easily within Refine:

http://code.google.com/p/google-refine/wiki/StrippingHTML
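
The wiki page above has the actual BeautifulSoup recipe for Refine. To give a feel for what stripping HTML from a cell involves, here is a minimal stand-in sketch in plain Python 3 using only the standard library. It is not BeautifulSoup and not the wiki's code, and the names TagStripper and strip_html are hypothetical.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Keep only the text between tags and drop the tags themselves."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities like &amp; arrive as "&".
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(value):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    stripper = TagStripper()
    stripper.feed(value)
    stripper.close()  # flush any text still buffered by the parser
    return " ".join("".join(stripper.chunks).split())


# A made-up cell value, as it might look in a scraped column.
cell = "<p>Acme &amp; Co. makes <b>widgets</b>.</p>"
stripped = strip_html(cell)
print(stripped)
```

Inside Refine itself you would work on the cell's value in a Jython expression instead of a hard-coded string, and BeautifulSoup copes with badly broken markup far better than this sketch does.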

Enjoy!
-Thad