boilerpipe

Ruby wrapper of the Boilerpipe API

A Ruby wrapper for the Boilerpipe API.
Definition of Boilerpipe:

The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

For more information: http://code.google.com/p/boilerpipe/

Explanation

The Boilerpipe module has a single method, extract. It takes two parameters: the URL of the page to process and an options hash.
The hash accepts three options:

  • output => :html, :htmlFragment, :text, :json, :debug
  • extractor => :ArticleExtractor, :DefaultExtractor, :LargestContentExtractor, :KeepEverythingExtractor, :CanolaExtractor
  • api => the URL of the API

None of these options is mandatory. To find out more about them, check out the Boilerpipe API: http://boilerpipe-web.appspot.com/
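
To make the options more concrete, here is a minimal sketch of how a wrapper like this one might turn them into a request URL. It is not the gem's actual code: the helper name build_extract_url, the DEFAULT_API constant, the /extract endpoint path, and the query parameter names are all assumptions made for illustration.

```ruby
require "uri"

# Assumed default endpoint; the real path may differ.
DEFAULT_API = "http://boilerpipe-web.appspot.com/extract"

# Hypothetical helper (not part of the gem): maps the three documented
# options onto a query string. Options left out of the hash are simply
# omitted from the request.
def build_extract_url(url, opts = {})
  api    = opts.fetch(:api, DEFAULT_API)
  params = { "url" => url }
  params["output"]    = opts[:output].to_s    if opts[:output]
  params["extractor"] = opts[:extractor].to_s if opts[:extractor]
  "#{api}?#{URI.encode_www_form(params)}"
end

puts build_extract_url("http://example.com/post",
                       {:output => :json, :extractor => :ArticleExtractor})
```

Because every option is optional, calling the helper with only a URL produces a request that relies on whatever defaults the API applies.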

Example

require "boilerpipe"

# Process the page at the given URL, requesting JSON output
Boilerpipe.extract("http://techcrunch.com/2011/05/12/karma-is-a-bitch/", {:output => :json})