Crawler4J filter plugin for Embulk
 Project Readme

Crawler filter plugin for Embulk

An Embulk filter plugin that uses Crawler4J to crawl the URLs found in an input column.

Overview

  • Plugin type: filter

Configuration

  • target_key: name of the column that holds the base URL to crawl (string, required)
  • max_depth_of_crawling: maximum crawl depth (integer, default: unlimited)
  • number_of_crawlers: number of crawlers to run in parallel (integer, default: 1)
  • max_pages_to_fetch: maximum number of pages to fetch (integer, default: unlimited)
  • crawl_storage_folder: folder for the crawl data (string, required)
  • politeness_delay: politeness delay between requests (integer, default: null)
  • user_agent_string: User-Agent string used by the crawler (string, default: null)
  • output_prefix: output prefix (string, default: "")
  • connection_timeout: connection timeout in milliseconds (integer, default: 30000)
  • socket_timeout: socket timeout in milliseconds (integer, default: 20000)

Example

in:
  type: mysql
  host: dbs04
  user: application
  password: XXXXXXXX
  database: iap
  query: |
    select url from companies limit 100
filters:
  - type: crawler
    target_key: url
    number_of_crawlers: 10
    max_depth_of_crawling: 4
    politeness_delay: 100
    crawl_storage_folder: "/tmp/crawl/%s"
out:
  type: stdout
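
The example above leaves several options at their defaults. As a rough sketch, a filter entry that sets the remaining options might look like the following. The option names come from the Configuration list; every value (page limit, user agent, prefix, storage path) is an illustrative assumption, not a recommended setting.

```yaml
filters:
  - type: crawler
    target_key: url                          # column holding the base URL
    crawl_storage_folder: "/tmp/crawl/%s"    # same path as in the example above
    max_pages_to_fetch: 1000                 # hypothetical limit; default is unlimited
    user_agent_string: "example-crawler/1.0" # hypothetical user agent
    output_prefix: "crawl_"                  # hypothetical prefix; default is ""
    connection_timeout: 30000                # milliseconds (the documented default)
    socket_timeout: 20000                    # milliseconds (the documented default)
```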

Build

$ ./gradlew gem  # -t to watch for file changes and rebuild continuously