Project

rdig

0.01
No commit activity in last 3 years
No release in over 3 years
Website crawler and fulltext indexer.
 Dependencies

Runtime

>= 0.11.6
>= 0.6
>= 4.0.0
 Project Readme

RDig

RDig provides an HTTP crawler and content extraction utilities to help you build a site search for web sites or intranets. Internally, Ferret is used for the full text indexing. After creating a config file for your site, you can build the index with a single call to rdig.

RDig depends on Ferret (>= 0.10.0) and, for parsing HTML, on either Hpricot (>= 0.4) or the RubyfulSoup library (>= 1.0.4). As I know of no way to specify such an OR dependency in a gem specification, the gem depends on Hpricot. If this is a problem for you, install the gem with --force and then manually run gem install rubyful_soup.

basic usage

Index creation

  • create a config file based on the template in doc/examples

  • to create an index:

    rdig -c CONFIGFILE
  • to run a query against the index (just to try it out):

    rdig -c CONFIGFILE -q 'your query'

    This dumps the first 10 search results to STDOUT.

Handle search in your application:

require 'rdig'
require 'rdig_config'   # load your config file here
search_results = RDig.searcher.search(query)

See RDig::Search::Searcher for more information.

usage in rails

  • add to config/environment.rb:

    require 'rdig'
    require 'rdig_config'
    
  • place rdig_config.rb in the config/ directory.

  • build index:

    rdig -c config/rdig_config.rb
  • in your controller that handles the search form:

    search_results = RDig.searcher.search(params[:query])
    @results = search_results[:list]
    @hitcount = search_results[:hitcount]
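
The controller lines above read two keys from the returned hash, :list and :hitcount. As a minimal sketch of how a view might use them, the helper below builds a one-line summary. The method name result_summary and the stand-in data are hypothetical and not part of RDig; only the two keys come from this README.

```ruby
# Hypothetical helper (not part of RDig): summarizes the hash returned by
# RDig.searcher.search, relying only on the :list and :hitcount keys that
# the controller snippet above reads.
def result_summary(search_results)
  shown = search_results[:list].size    # results retrieved for this request
  total = search_results[:hitcount]     # total number of matches in the index
  return 'No results found.' if total.zero?
  "Showing #{shown} of #{total} results"
end

# Stand-in data so the sketch runs without a built index.
sample = { :list => %w[doc1 doc2 doc3], :hitcount => 42 }
puts result_summary(sample)
```

Keeping the summary logic outside the controller means it can be exercised with plain hashes, without running a crawl first.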
    

search result paging

Use the :first_doc and :num_docs options to implement paging through search results. :num_docs defaults to 10, so without these options only the first 10 results are retrieved.
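
As a rough sketch of the paging arithmetic, the helper below turns a 1-based page number into these two options. The name paging_options is hypothetical and not part of RDig. The sketch also assumes that :first_doc is a zero-based offset and that the options are passed as a hash in a second argument to search. Neither is stated in this README, so check RDig::Search::Searcher before relying on it.

```ruby
# Hypothetical helper (not part of RDig): maps a 1-based page number to the
# :first_doc and :num_docs options. Assumes :first_doc is a zero-based offset.
def paging_options(page, per_page = 10)
  page = [page.to_i, 1].max   # nil, 0, negative or non-numeric input becomes page 1
  { :first_doc => (page - 1) * per_page, :num_docs => per_page }
end

p paging_options(1)
p paging_options('3')

# Assumed call shape (options hash as second argument, unconfirmed here):
# search_results = RDig.searcher.search(query, paging_options(params[:page]))
```

Clamping the page number to 1 keeps a missing or malformed page parameter from producing a negative offset.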

sample configuration

The sample configuration is in doc/examples/config.rb. The tag_selector properties are called with a BeautifulSoup instance as a parameter. See the RubyfulSoup site for more info about this cool lib. You can also have a look at the html_content_extractor unit test.
