Google Refine and easier HTML parsing
After spending most of the day BANGING my head against Regex and GREL to handle HTML parsing, I thought: there MUST be a better way to parse HTML!
I know several of you have thought the same thing. So I took the time today to find out where and how this could be improved, either directly in Google Refine or with an extension.
It just so happens that Google Refine already supports another expression language besides GREL: Jython.
Enter BeautifulSoup (love the name?), a Python library that can be used from Jython for powerful HTML parsing and entity extraction.
Here's more on how to use it easily within Refine:
http://code.google.com/p/google-refine/wiki/StrippingHTML
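To give a feel for the idea before you click through, here is a minimal stand-alone sketch of stripping tags from a cell value. It is written for plain Python 3, not for Refine's Jython, and it is not copied from the wiki page. The function name `strip_html` and the helper class `_TextCollector` are my own hypothetical names. It uses BeautifulSoup (the `bs4` package) if it happens to be installed, and otherwise falls back to the standard library's `html.parser`, which is a swapped-in substitute so the sketch runs anywhere.

```python
from html.parser import HTMLParser

try:
    # Third-party package; assumed to be installed separately.
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None


class _TextCollector(HTMLParser):
    """Hypothetical helper: collects only the text between tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text outside of tags.
        self.parts.append(data)


def strip_html(value):
    """Return the text content of an HTML fragment, with tags removed."""
    if BeautifulSoup is not None:
        # Parse with Python's built-in parser and join all text nodes.
        return BeautifulSoup(value, "html.parser").get_text()
    # Fallback: standard library only.
    collector = _TextCollector()
    collector.feed(value)
    collector.close()  # flush any text still buffered by the parser
    return "".join(collector.parts)


if __name__ == "__main__":
    print(strip_html("<p>Hello <b>world</b> &amp; friends</p>"))
```

Inside Refine you would choose Jython as the expression language, where the cell content arrives as `value` and the expression ends with a `return`. The exact import and setup for BeautifulSoup under Jython depend on your Refine and BeautifulSoup versions, so treat the wiki page above as the reference for that part.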
Enjoy!
-Thad