Google Refine and easier HTML parsing

After spending most of the day BANGING my head against regex and GREL to handle HTML parsing, I thought: there MUST be a better way to parse HTML!

I know several of you have thought the same thing.  So I took the time today to find out where and how this could be improved, either directly in Google Refine or with an extension.
It just so happens that Google Refine already supports another expression language besides GREL: Jython.

Enter BeautifulSoup (love the name?), a Python library, usable from Jython, for powerful HTML parsing and entity extraction.

Here's more on how to use it easily within Refine:

http://code.google.com/p/google-refine/wiki/StrippingHTML
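
To give a feel for why a real parser beats regex here, below is a minimal sketch of tag stripping. It is NOT BeautifulSoup and not the recipe from the wiki page above: it assumes only Python's standard-library HTML parser, so it runs without installing anything, and the names TextExtractor and strip_html are made up for illustration. In a Refine Jython expression the cell contents arrive as `value` and the expression ends with a `return`, so something like `return strip_html(value)` would be the last line there.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2 / Jython


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and ignores the tags."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for every run of text between tags.
        self.chunks.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Join the text nodes, then squeeze runs of whitespace into single spaces.
    return " ".join("".join(parser.chunks).split())


print(strip_html("<p>Hello <b>world</b>!</p>"))  # Hello world!
```

The point of the sketch is that the parser, not a pattern, decides where tags begin and end, so nested or attribute-heavy markup does not need its own special-case regex.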

Enjoy!
-Thad
