Project

slasher

No commit activity in last 3 years
No release in over 3 years
This gem extracts the real content of an HTML article based on the weight of words in HTML DOM nodes.
 Dependencies

Development

Runtime

~> 1.6
 Project Readme

slasherrb

This project is the Ruby version of slasherjs. Slasher is a library that extracts the main content of an HTML article document. The extraction result depends on assumptions about the structure of the HTML document, so the result may be flawed if the document doesn't match a structure that the library recognises. For that reason, the library will be improved over time.
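To make the idea of weighting words in DOM nodes more concrete, here is a minimal sketch in plain Ruby. It is not slasher's actual implementation: the names `BLOCK_PATTERN`, `LINK_PENALTY`, `strip_tags`, `block_score` and `heaviest_block` are hypothetical, the scoring rule (word count minus a penalty for words inside links) is an assumption chosen for illustration, and the regex-based matching does not handle nested tags of the same name.

```ruby
# Illustrative sketch only, NOT the slasher implementation.
# Each <p>, <div>, <article> or <section> block is scored by its word count,
# with words inside links penalised, and the text of the best block is returned.

# Hypothetical pattern: matches a block element and captures its inner HTML.
# Nested tags of the same name are not handled.
BLOCK_PATTERN = %r{<(p|div|article|section)\b[^>]*>(.*?)</\1>}mi

# Hypothetical weight: each linked word counts twice against a block.
LINK_PENALTY = 2

# Replace tags with spaces and collapse runs of whitespace.
def strip_tags(fragment)
  fragment.gsub(/<[^>]+>/, ' ').gsub(/\s+/, ' ').strip
end

# Score a block: total words minus a penalty for words that sit inside <a> tags.
def block_score(inner_html)
  total = strip_tags(inner_html).split.size
  link_texts = inner_html.scan(%r{<a\b[^>]*>(.*?)</a>}mi).flatten
  link_words = link_texts.sum { |text| strip_tags(text).split.size }
  total - LINK_PENALTY * link_words
end

# Return the plain text of the highest-scoring block, or '' if none is found.
def heaviest_block(html)
  blocks = html.scan(BLOCK_PATTERN).map { |_tag, inner| inner }
  best = blocks.max_by { |inner| block_score(inner) }
  best ? strip_tags(best) : ''
end

sample = '<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>' \
         '<div id="story"><p>First paragraph of the story text.</p>' \
         '<p>Second paragraph with a <a href="/x">link</a> inside.</p></div>'
puts heaviest_block(sample)
```

In this sample the navigation block consists only of links, so it scores below the story block, which is why the story text is selected. A real extractor would work on a parsed DOM rather than on regular expressions.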

How To Install

As with other RubyGems, just run:

gem install slasher

or add this to your Gemfile:

gem 'slasher'

How To Use

To use the library, you need to have an HTML document first.

require 'net/http'
require 'slasher'

uri = URI("http://sea-games-2015.liputan6.com/read/2252937/all-indonesia-finals-ganda-putra-sumbang-emas")
html = Net::HTTP.get(uri)

slasher = Slasher.new(html)
content = slasher.slash

# The content variable now holds the main content of the HTML document (the article).

Website Coverage

This library has been tested against several websites; the complete list is in this document.

TODO

  1. Add more test cases: international websites
  2. Performance analysis
  3. Better API documentation