No commit activity in last 3 years
No release in over 3 years
This gem removes the surplus “clutter” (boilerplate, templates) around the main textual content of a web page (pure Ruby implementation). BoilerpipeArticle can also be used to parse (Open Graph) metadata and microdata. Check GitHub for usage examples.
Dependencies

Runtime

mida = 0.3.9
nokogiri = 1.6.8
 Project Readme

# BoilerpipeArticle

This gem removes the surplus “clutter” (boilerplate, templates) around the main textual content of a web page (pure Ruby implementation). It is designed primarily for news website content. It can also extract schema.org microdata and other HTML metadata.

## Installation

gem install BoilerpipeArticle

### Usage Example

require 'boilerpipe_article'
require 'net/http'

# Download the raw HTML of the page to process.
uri = URI('http://www.bbc.com/news/election-us-2016-36935175')
html = Net::HTTP.get(uri)

parser = BoilerpipeArticle.new(html)

# Extract the article text, meta data, microdata, and full page text.
articleText = parser.getArticle
metas = parser.getMetas
microdata = parser.getMicroData
allText = parser.getAllText

puts articleText
puts metas
puts microdata
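
`Net::HTTP.get` does not follow redirects, so a page that answers with a 3xx response (for example, an `http://` URL that redirects to `https://`) would hand the parser the redirect response body rather than the article. A minimal sketch of one way to handle this with the standard library follows. The `fetch_html` helper and its `limit` parameter are hypothetical names introduced here for illustration and are not part of BoilerpipeArticle.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper (not part of BoilerpipeArticle): return the body of
# `url`, following at most `limit` HTTP redirects.
def fetch_html(url, limit = 5)
  raise ArgumentError, 'too many HTTP redirects' if limit <= 0

  response = Net::HTTP.get_response(URI(url))
  case response
  when Net::HTTPSuccess
    response.body
  when Net::HTTPRedirection
    # The Location header may be relative, so resolve it against the current URL.
    fetch_html(URI.join(url, response['location']).to_s, limit - 1)
  else
    # Raises an exception for any non-2xx response.
    response.value
  end
end
```

With this helper, `html = fetch_html('http://www.bbc.com/news/election-us-2016-36935175')` could stand in for the `Net::HTTP.get(uri)` call, and the rest of the example would stay unchanged.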

Runtime Dependencies:

- nokogiri = 1.6.8
- mida = 0.3.9
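
The `=` in these entries is an exact version pin, so only that release (not a newer patch release) satisfies the requirement. The sketch below shows how RubyGems evaluates such a pin. The `PINNED` constant and the `pin_satisfied?` helper are hypothetical names used only for this illustration.

```ruby
require 'rubygems'

# Pins copied from the runtime dependency list.
PINNED = { 'nokogiri' => '= 1.6.8', 'mida' => '= 0.3.9' }.freeze

# Hypothetical helper: true when `version` satisfies the pin recorded for `name`.
def pin_satisfied?(name, version)
  requirement = Gem::Requirement.new(PINNED.fetch(name))
  requirement.satisfied_by?(Gem::Version.new(version))
end
```

For example, `pin_satisfied?('nokogiri', '1.6.8')` is true, while `pin_satisfied?('nokogiri', '1.6.8.1')` is false, which is worth keeping in mind when another gem in the same bundle needs a newer nokogiri.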

### Support

Check out textracto.com for the latest updates and the API.