boilerpipe

Ruby wrapper of the Boilerpipe API

A Ruby wrapper for the Boilerpipe API.
Boilerpipe definition:

The boilerpipe library provides algorithms to detect and remove the surplus “clutter” (boilerplate, templates) around the main textual content of a web page.

For more information: http://code.google.com/p/boilerpipe/

Explanation

The Boilerpipe module has a single method, extract. It takes two parameters: first the URL of the page to process, and second a hash of options.
The hash accepts three options:

  • output => :html, :htmlFragment, :text, :json, :debug
  • extractor => :ArticleExtractor, :DefaultExtractor, :LargestContentExtractor, :KeepEverythingExtractor, :CanolaExtractor
  • api => the URL of the API

None of these options is mandatory. To find out more about them, check out the Boilerpipe API: http://boilerpipe-web.appspot.com/
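
To illustrate how an options hash like this could map onto a request to the web API, here is a minimal sketch. It is not the gem's actual implementation: the helper build_extract_url, the DEFAULT_API constant, the /extract path, and the query parameter names (url, extractor, output) are all assumptions made for the example.

```ruby
require "uri"

# Assumed default endpoint; the gem's real default is not documented here.
DEFAULT_API = "http://boilerpipe-web.appspot.com/extract"

# Hypothetical helper (not part of the gem): turns the page URL and the
# optional hash into the request URL a wrapper might call.
def build_extract_url(url, opts = {})
  api = opts.fetch(:api, DEFAULT_API)
  params = { "url" => url }
  # Only send the options the caller actually supplied.
  params["extractor"] = opts[:extractor].to_s if opts.key?(:extractor)
  params["output"] = opts[:output].to_s if opts.key?(:output)
  "#{api}?#{URI.encode_www_form(params)}"
end

puts build_extract_url("http://example.com/post",
                       :output => :json,
                       :extractor => :ArticleExtractor)
```

Because every key is optional, calling the helper with only a URL produces a request that carries no extractor or output parameter and leaves those choices to the service.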

Example

require "boilerpipe"
Boilerpipe.extract("http://techcrunch.com/2011/05/12/karma-is-a-bitch/", {:output => :json})