0.02
No commit activity in last 3 years
No release in over 3 years
Massages HTML how you want to: sanitize tags, remove headers and footers; output to html, markdown, or plain text.
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
 Dependencies

Development

~> 2.5

Runtime

 Project Readme

HTML Massage Build Status Gem Version

Supported Ruby versions

Ruby 2.1 and above.

Summary

  • Remove headers and footers and navigation, and strip to only the "content" part of the HTML
  • Sanitize tags, removing javascript and styling
  • Convert HTML to markdown, plain text, or sanitized HTML

Massaging from the command line

html_massage html https://en.wikipedia.org/wiki/Technological_singularity > singularity.html
html_massage text https://en.wikipedia.org/wiki/Technological_singularity > singularity.txt
html_massage markdown https://en.wikipedia.org/wiki/Technological_singularity > singularity.md

These files will look something like:

==> singularity.html <==
<h1 id="firstHeading" class="firstHeading"><span dir="auto">Technological singularity</span></h1>

<p>The <b>technological singularity</b> is the theoretical emergence of greater-than-human <a href="/wiki/Superintelligence" title="Superintelligence">superintelligence</a> through technological means.<sup id="cite_ref-1" class="reference"><a href="#cite_note-1"><span>[</span>1<span>]</span></a></sup> Since the capabilities of such intelligence would be difficult for an unaided human mind to comprehend, the occurrence of a technological singularity is seen as an intellectual <a href="/wiki/Event_horizon" title="Event horizon">event horizon</a>, beyond which events cannot be predicted or understood.</p>
...

==> singularity.md <==
# Technological singularity

The **technological singularity** is the theoretical emergence of greater-than-human [superintelligence](https://en.wikipedia.org/wiki/Superintelligence "Superintelligence") through technological means. [1] Since the capabilities of such intelligence would be difficult for an unaided human mind to comprehend, the occurrence of a technological singularity is seen as an intellectual [event horizon](https://en.wikipedia.org/wiki/Event_horizon "Event horizon") , beyond which events cannot be predicted or understood.
...

==> singularity.txt <==
Technological singularity

The technological singularity is the theoretical emergence of greater-than-human superintelligence through technological means.[1] Since the capabilities of such intelligence would be difficult for an unaided human mind to comprehend, the occurrence of a technological singularity is seen as an intellectual event horizon, beyond which events cannot be predicted or understood.
...

Massaging from Ruby

Full Massage

  • Use default whitelist of tags and attributes to sanitize HTML
  • Use default selectors (both include and exclude lists) to attempt to capture only the "content" part of the HTML page
require 'html_massage'

html = %{
  <html>
    <head>
      <script type="text/javascript">document.write('I am a bad script');</script>
    </head>
    <body>
      <div id="header">My Site</div>
      <div>This is some <i>great</i> content!</div>
    </body>
  </html>
}

HtmlMassage.html( html )
# => "<div>This is some <i>great</i> content!</div>"

HtmlMassage.markdown( html )
# => "This is some _great_ content!"

HtmlMassage.text( html )
# => "This is some great content!"

Custom includes and excludes

html = %{
  <html>
    <body>
      <div class="custom_navigation">some links to other pages...</div>
      <div>This is some <i>great</i> content!</div>
    </body>
  </html>
}

html_massage = HtmlMassage.new( html )
html_massage.exclude!( [ '.custom_navigation' ] )
html_massage.include!( [ 'body' ] )
html_massage.to_html
# => <div>This is some <i>great</i> content!</div>

Sanitize HTML

html = %{
  <html>
    <head>
      <script type="text/javascript">document.write('I am a bad script');</script>
    </head>
    <body>
      <div>This is some <i>great</i> content!</div>
    </body>
  </html>
}

html_massage = HtmlMassage.new( html )
html_massage.sanitize!(  :elements => ['div'] )
html_massage.to_html
# => <div>This is some <i>great</i> content!</div>

Make Links Absolute

html = %{
  <a href ="/foo/bar.html">Click this link</a>
}

html_massage = HtmlMassage.new( html )
html_massage.absolutify_links!( 'http://example.com/joe/page1.html' )
html_massage.to_html
# => <a href ="http://example.com/foo/bar.html">Click this link</a>