Project: invisiblellama-repub

Repub

by Invisible Llama (dg at invisiblellama dot net)

DESCRIPTION:

Repub is a simple HTML to ePub converter.

It lacks imagination and won’t try to guess the source document structure; you have to describe where to look for the title and the table of contents. In return, it gives you greater control over the generated ePub documents.

FEATURES:

Repub accepts the following parameters:

Source document URL
List of XPath expressions for locating source document title, table of contents, TOC items and TOC sub-sections
List of XPath expressions for describing elements that will be removed from the converted document
List of regular expressions for editing the source document
Publication information metadata tags

All parameters except the document URL are optional. Without them, the resulting ePub will probably be readable (provided the original HTML isn’t too badly broken), but it will lack metadata and a TOC.

A few examples:

Project Gutenberg’s The Adventures Of Sherlock Holmes (with proper table of contents)

repub -x 'title:div[@class="book"]//h1' \
  -x 'toc://table' \
  -x 'toc_item://tr' \
  http://www.gutenberg.org/dirs/etext99/advsh12h.htm

This tells Repub to look for the title in the first H1 found in the DIV of class “book”, for the table of contents in the first TABLE, and for TOC items inside TR elements. The result is a readable ePub, which can be further improved by removing some “noise” content:

repub -x 'title:div[@class="book"]//h1' \
  -x 'toc://table' \
  -x 'toc_item://tr' \
  -X '//pre' -X '//hr' -X '//body/h1' -X '//body/h2' \
  http://www.gutenberg.org/dirs/etext99/advsh12h.htm

In addition to parsing, the above command removes all PRE and HR elements, as well as the top-level H1 and H2 elements of the body, from the final document.
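To make the selectors more concrete, here is a minimal sketch of what these XPath expressions pick out. It uses Ruby's REXML library purely as a stand-in and runs against a tiny, hypothetical page rather than the real Gutenberg file. The selectors are written as absolute or explicitly relative paths ('.//tr' instead of '//tr'), which is an assumption about how Repub scopes them, not a description of its internals.

```ruby
require 'rexml/document'

# Hypothetical stand-in for the source page, not the real Gutenberg file.
html = <<~XHTML
  <html><body>
    <h1>Project Gutenberg header</h1>
    <pre>License boilerplate</pre>
    <hr/>
    <div class="book"><h1>The Adventures of Sherlock Holmes</h1></div>
    <table>
      <tr><td><a href="#ch1">A Scandal in Bohemia</a></td></tr>
      <tr><td><a href="#ch2">The Red-Headed League</a></td></tr>
    </table>
  </body></html>
XHTML

doc = REXML::Document.new(html)

# title: first H1 inside a DIV of class "book".
title = REXML::XPath.first(doc, '//div[@class="book"]//h1').text

# toc: first TABLE; toc_item: every TR inside that table.
toc = REXML::XPath.first(doc, '//table')
toc_items = REXML::XPath.match(toc, './/tr').map do |tr|
  link = REXML::XPath.first(tr, './/a')
  [link.text, link.attributes['href']]
end

# -X: remove every element matched by each expression.
['//pre', '//hr', '//body/h1', '//body/h2'].each do |xpath|
  REXML::XPath.match(doc, xpath).each(&:remove)
end
remaining = REXML::XPath.match(doc, '//body/*').map(&:name)

puts title
toc_items.each { |text, href| puts "  #{text} (#{href})" }
puts "Top-level elements left: #{remaining.join(', ')}"
```

Note that '//body/h1' only matches H1 elements that are direct children of the body, so the title H1 inside the DIV survives the removal step.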

A slightly more complicated example:

Git User’s Manual

repub -x 'title://h1' \
  -x 'toc://div[@class="toc"]/dl' \
  -x 'toc_item:dt' \
  -x 'toc_section:following-sibling::*[1]/dl' \
  -w git-manual \
  http://www.kernel.org/pub/software/scm/git/docs/user-manual.html

This tells Repub to look for the title in the first H1 found, for the TOC in the DL element of the DIV with class “toc”, and for TOC items inside DT elements. Additionally, a TOC item can have a child TOC section: a DL that is a child of the element immediately following the DT.
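The way toc_item and toc_section combine into a nested table of contents can be sketched as a small recursive walk. This is an illustration only: it uses REXML as a stand-in parser, the markup is a hypothetical reduction of the Git manual's TOC, and toc_entries is a made-up helper name, not part of Repub.

```ruby
require 'rexml/document'

# Hypothetical, heavily reduced version of the Git manual's TOC markup.
html = <<~XHTML
  <div class="toc"><dl>
    <dt><a href="#intro">Introduction</a></dt>
    <dt><a href="#repos">Repositories and Branches</a></dt>
    <dd><dl>
      <dt><a href="#clone">How to get a Git repository</a></dt>
    </dl></dd>
  </dl></div>
XHTML

# Hypothetical helper: build nested TOC entries from a DL element.
def toc_entries(dl)
  # toc_item: each DT that is a direct child of the current DL.
  REXML::XPath.match(dl, 'dt').map do |dt|
    # toc_section: a DL inside the element right after this DT, if any.
    section = REXML::XPath.first(dt, 'following-sibling::*[1]/dl')
    {
      title: REXML::XPath.first(dt, './/a').text,
      children: section ? toc_entries(section) : []
    }
  end
end

doc = REXML::Document.new(html)
toc = toc_entries(REXML::XPath.first(doc, '//div[@class="toc"]/dl'))

# Print the TOC as an indented outline.
print_toc = lambda do |entries, depth|
  entries.each do |entry|
    puts "#{'  ' * depth}#{entry[:title]}"
    print_toc.call(entry[:children], depth + 1)
  end
end
print_toc.call(toc, 0)
```

For the "Introduction" item, the element right after the DT is another DT with no DL inside it, so the item gets no child section.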

The Git manual command also saves all XPath expressions as the “git-manual” profile, which can be reused later to save keystrokes. For example, if you later decide to regenerate the Git manual ePub without the TOC at the beginning of the document, you can run:

repub -l git-manual -X '//div[@class="toc"]' http://www.kernel.org/pub/software/scm/git/docs/user-manual.html

A few more examples:

Open Packaging Format (OPF) 2.0 (one of the ePub standards, in ePub)

repub -x 'title://p[@class="Title"]' \
  -x 'toc://div[@class="TOC"]' \
  -x 'toc_item:.//p' \
  -x 'toc_section:.//div[@class="TOCSection"]' \
  http://www.idpf.org/2007/opf/OPF_2.0_final_spec.html

GNU Wget Manual

repub -m 'creator:gnu.org' \
  -x 'title://h1' -x 'toc://div[@class="contents"]/ul' -x 'toc_item:li' -x 'toc_section:ul' \
  -X '//div[@class="contents"]' \
  http://www.gnu.org/software/wget/manual/wget.html

And finally, the “Hello World” of e-books, Alice’s Adventures In Wonderland

repub -x 'title:body/h1' -x 'toc://table' -x 'toc_item://tr' -X '//pre' -X '//hr' -X '//body/h4' \
  http://www.gutenberg.org/files/11/11-h/11-h.htm

SYNOPSIS:

Repub is a simple HTML to ePub converter.

Usage: repub [options] url

General options:

-D, --downloader NAME            Which downloader to use to get files (wget or httrack).
                                 Default is wget.
-o, --output PATH                Output path for generated ePub file.
                                 Default is <current directory>/<Parsed_Title>.epub
-w, --write-profile NAME         Save given options for later reuse as profile NAME.
-l, --load-profile NAME          Load options from saved profile NAME.
-W, --write-default              Save given options for later reuse as default profile.
-L, --list-profiles              List saved profiles.
-C, --cleanup                    Clean up download cache.
-v, --verbose                    Turn on verbose output.
-q, --quiet                      Turn off any output except errors.
-V, --version                    Show version.
-h, --help                       Show this help message.

Parser options:

-x, --selector NAME:VALUE        Set parser XPath selector NAME to VALUE.
                                 Recognized selectors are: [title toc toc_item toc_section]
-m, --meta NAME:VALUE            Set publication information metadata NAME to VALUE.
                                 Valid metadata names are: [creator date description
                                 language publisher relation rights subject title]
-e, --encoding NAME              Set source document encoding. Default is to autodetect.

Post-processing options:

-s, --stylesheet PATH            Use custom stylesheet at PATH. Use -s- to remove
                                 all links to stylesheets and <style> blocks from the source.
-a, --add PATH                   Add external file to the generated ePub.
-N, --new-fragment XHTML         Prepare document fragment for -A and -P operations.
-A, --after SELECTOR             Insert fragment after element with XPath selector.
-P, --before SELECTOR            Insert fragment before element with XPath selector.
-X, --remove SELECTOR            Remove source element using XPath selector.
                                 Use -X- to ignore stored profile.
-R, --rx /PATTERN/REPLACEMENT/   Edit source HTML using regular expressions.
                                 Use -R- to ignore stored profile.
-B, --browser                    After processing, open resulting HTML in default browser.
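As a rough illustration of the /PATTERN/REPLACEMENT/ form taken by -R, the sketch below splits such an expression on its delimiter and applies it to a string. The helper name apply_rx, the global (every match) replacement, and the lack of support for escaped delimiters are all assumptions made for this example; Repub's actual handling may differ.

```ruby
# Hypothetical helper: split "/PATTERN/REPLACEMENT/" on its delimiter and
# replace every match of PATTERN in the source text.
# Assumes the delimiter never appears inside PATTERN or REPLACEMENT.
def apply_rx(source, expression)
  delimiter = expression[0]
  _, pattern, replacement, rest = expression.split(delimiter, -1)
  unless pattern && !pattern.empty? && replacement && rest == ''
    raise ArgumentError, "expected /PATTERN/REPLACEMENT/, got #{expression.inspect}"
  end
  source.gsub(Regexp.new(pattern), replacement)
end

# Example: turn non-breaking space entities into plain spaces.
html = '<p>Chapter&nbsp;1</p><p>Chapter&nbsp;2</p>'
cleaned = apply_rx(html, '/&nbsp;/ /')
puts cleaned
```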

DEPENDENCIES:

The following tools must be somewhere in $PATH:

wget or httrack
zip (Info-ZIP)
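A quick way to check for these tools before running a conversion is sketched below. The which helper is hypothetical (it is not part of Repub) and simply scans the directories listed in $PATH for an executable file.

```ruby
# Hypothetical helper: return the full path of an executable found in
# the given PATH string, or nil if it is not there.
def which(command, path = ENV.fetch('PATH', ''))
  path.split(File::PATH_SEPARATOR).each do |dir|
    candidate = File.join(dir, command)
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

# Either downloader is enough; zip is always needed.
missing = []
missing << 'wget or httrack' unless %w[wget httrack].any? { |tool| which(tool) }
missing << 'zip' unless which('zip')

if missing.empty?
  puts 'All external tools found.'
else
  puts "Missing from $PATH: #{missing.join(', ')}"
end
```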

LIMITATIONS/BUGS:

Currently, only “everything-on-one-page” HTML sources are supported. Repub will download and process all page requisites (stylesheets and images) but all actual content must be on one page.

Encoding auto-detection is slow.

Chardet 0.9.0 is broken under Ruby 1.9, so if you want to use Ruby 1.9, you have to set the encoding manually with -e.

Bugs: probably. If you find any, please report them to dg at invisiblellama dot net.

INSTALL:

sudo gem install repub

LICENSE:

(The MIT License)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
