Project

webtractor

0.0
No commit activity in last 3 years
No release in over 3 years
The Webtractor library can extract main content from websites like news, blogs, etc without unwanted boilerplate (menus, footer, comments)
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

Runtime

 Project Readme

Webtractor

The Webtractor is a ruby library which is able to extract main content from webpages like news, blogs, etc. As a result you can just have a main content without any boilerplate (menu, footer, comments, etc).

Installation

You can install it directly via gem:

gem install webtractor

Or you can put it in your Gemfile:

gem 'webtractor'

Then run:

bundle install

Basic usage

extractor = Webtractor::Extractor.new
result = extractor.extract_from_url
'http://techcrunch.com/2014/05/24/dont-believe-anyone-who-tells-you-learning-to-code-is-easy/'
puts result.text

Or

extractor = Webtractor::Extractor.new
result = extractor.extract '<html><body>...</body></html>'

Or

page = Nokogiri::HTML(...)
extractor = Webtractor::Extractor.new
result = extractor.extract_from_xml page

You can also access Nokogiri document from result via xml attribute:

puts result.xml.xpath('...').text 

Advanced usage

Process of getting main content from the webpage is really simple. It consists of applying multiple filters on the document where every filter gets on input output of the last applied filter.

You can look at the names of default filters:

p Webtractor::Filters::DefaultFilter.new.filters.map{|f| f.class.to_s}

You can remove any filter:

extractor.remove_filter Webtractor::Filters::RemoveComments

Or you can also create your own filter. It can be any class which implements process method which takes page as an argument and returns page:

class RemoveBolds
  def process page
    page.css('b').remove
    page
  end
end

extractor.add_filter RemoveBolds.new

License

This library is distributed under the Beerware license.