Posts

Display HTML content from a URL within OpenRefine using IFRAMES

OpenRefine has a powerful feature in its Custom Tabular Exporter, which can be used to fetch and preview URLs so you can review their HTML content in your web browser.  Handy for doing small reconciling tasks at times. :)

1. Create an empty <iframe> element with the URL as its src (you can also add any iframe attributes you need, like width, height, etc.; see https://developer.mozilla.org/en-US/docs/Web/HTML/Element/iframe ).



2. Choose EXPORT --> Custom Tabular Exporter and select the URL column(s) and any others you may wish to view alongside the HTML content.



3. Choose the Download tab, select HTML Table, and then click the Preview button.  This should open another browser tab.




4. The HTML Table is rendered, and the browser fetches each URL's content into its iframe.
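
If you want to see what the steps above amount to outside of OpenRefine, here is a minimal Python sketch. It is not what the exporter runs internally; it just builds the same kind of HTML table, with the URL column wrapped in an iframe. The helper names `iframe_cell` and `html_table`, the default width and height, and the sample row are all made up for illustration.

```python
from html import escape


def iframe_cell(url, width=600, height=300):
    """Wrap a URL in an empty <iframe> element (hypothetical helper)."""
    return '<iframe src="%s" width="%d" height="%d"></iframe>' % (
        escape(url, quote=True), width, height)


def html_table(rows, url_column):
    """Render a list of dict rows as an HTML table.

    Cells in url_column become iframes; every other cell is escaped text.
    """
    columns = list(rows[0]) if rows else []
    header = "".join("<th>%s</th>" % escape(c) for c in columns)
    lines = ["<table>", "<tr>%s</tr>" % header]
    for row in rows:
        cells = []
        for c in columns:
            value = str(row[c])
            cells.append(iframe_cell(value) if c == url_column else escape(value))
        lines.append("<tr>%s</tr>" % "".join("<td>%s</td>" % x for x in cells))
    lines.append("</table>")
    return "\n".join(lines)


# Made-up sample data: one row with a name column and a URL column.
rows = [{"name": "Example", "url": "http://example.com/"}]
page = html_table(rows, url_column="url")
```

Save `page` to a file and open it in a browser, and each row shows the fetched page next to its other columns, much like the exporter preview.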





Open Source Tools for Data Mining

I have been asked what my favorite Open Source Tools for Data Mining with Statistics support are.  In no particular order, other than recall, here they are.  Feel free to comment on these, or on any others you like that fall into this same category, and the reasons why:

R - http://r-project.org/

GGobi - http://www.ggobi.org - a visualization program for R

Mondrian - http://stats.math.uni-augsburg.de/Mondrian - a visualization program for R (geared more toward categorical data)

KNIME - http://www.knime.org

Orange - http://orange.biolab.si

Tanagra - http://eric.univ-lyon2.fr/~ricco/tanagra

Weka - http://www.cs.waikato.ac.nz/ml/weka

Yale / RapidMiner - http://rapid-i.com

Enjoy!
-Thad

Review of new parseHtml() Function in Google Refine

In my last post, I mentioned how Beautiful Soup is an elegant way to parse HTML with Google Refine.

Well, it just got better thanks to Iain Sproat's latest commit to Google Refine (and his Java skills are getting better all the time!).  If you pull down trunk and build, you'll see that he has integrated the jsoup (jsoup.org) Java library, which draws on Beautiful Soup.  Iain has done a great job of pushing the jsoup Element stack right up to GREL (Google Refine Expression Language) for concise usage.  I love it!

Using jsoup's simple selector syntax, I was able to easily parse out company websites from LinkedIn's public pages.  The expression I used selects the div called data-table that contains the term Website and returns the htmlText of the 2nd <a href>.  In Refine, indexing starts at [0], so in this case [1] gives the 2nd href link.  The cookbook on the jsoup.org website, and its section on selector syntax, is a great place to start learning more.
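
To make the selection logic concrete, here is a rough Python sketch of the same idea using only the standard library. It is not the GREL expression from the post, and it is not jsoup. It assumes "data-table" is a CSS class on the div (it could just as well be an id), and the sample HTML is invented, not copied from LinkedIn. The names `DataTableLinks` and `second_link_text` are hypothetical.

```python
from html.parser import HTMLParser


class DataTableLinks(HTMLParser):
    """Collect the text and link text of each <div class="data-table">."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # one {"text": str, "links": [str]} per matching div
        self._depth = 0        # div nesting depth inside the current match
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self._depth:
                self._depth += 1          # nested div inside a match
            elif "data-table" in classes:
                self._depth = 1           # a new matching div starts here
                self._in_link = False
                self.blocks.append({"text": "", "links": []})
        elif tag == "a" and self._depth:
            self._in_link = True
            self.blocks[-1]["links"].append("")

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._depth:
            block = self.blocks[-1]
            block["text"] += data
            if self._in_link:
                block["links"][-1] += data


def second_link_text(html, term="Website"):
    """Return the text of the 2nd link (index [1]) in the first
    data-table div whose text contains term, or None if there is none."""
    parser = DataTableLinks()
    parser.feed(html)
    for block in parser.blocks:
        if term in block["text"] and len(block["links"]) > 1:
            return block["links"][1].strip()
    return None


# Invented sample markup, for illustration only.
sample = """
<div class="data-table">
  <dl><dt>Website</dt>
  <dd><a href="/redirect?x=1">Company profile</a>
      <a href="http://www.example.com">http://www.example.com</a></dd></dl>
</div>
"""
result = second_link_text(sample)
```

The point to notice is the zero-based index: `links[1]` is the second link, just as [1] is in Refine.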

Enjoy !
-Thad

Google Refine and easier HTML parsing

After spending most of the day BANGING my head on using Regex and GREL to handle HTML parsing, I thought: there MUST be a better way to parse HTML!!!

I know several of you have thought the same thing.  So, I took the time today to find out where and how this could be improved directly in Google Refine or with an extension.
It just so happens that Google Refine already has a wonderful extension that provides another language: Jython.

Enter BeautifulSoup (love the name?), a Python library, usable from Jython, for powerful HTML parsing and entity extraction.



Here's more on how to use it easily within Refine:

http://code.google.com/p/google-refine/wiki/StrippingHTML
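
For a feel of what tag stripping looks like in Python, here is a small sketch. Note that it swaps in the standard library's `html.parser` instead of BeautifulSoup, so it runs without installing anything, and it is written for Python 3; Refine's Jython is Python 2 flavored, so it would need adjusting there. The names `TextOnly` and `strip_html` are mine, not from the wiki page.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Keep only the text between tags; the tags themselves are dropped."""

    def __init__(self):
        # convert_charrefs defaults to True, so entities like &amp; become &.
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(value):
    """Return the text of an HTML fragment with whitespace collapsed."""
    parser = TextOnly()
    parser.feed(value)
    parser.close()
    return " ".join("".join(parser.parts).split())


cleaned = strip_html("<p>Hello <b>Refine</b> &amp; friends</p>")
```

This simple version also keeps the contents of script and style elements, so a real parser library is still the better choice for messy pages.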

Enjoy!
-Thad