
General Notes

jackpay edited this page Jan 29, 2020 · 5 revisions

Setting up the environment

Installing the crawler/scraping architecture

The crawler/scraping architecture depends on three libraries/repositories, which must be downloaded or cloned before any work begins.

  1. norconex-crawler - Contains the majority of the codebase for crawling and scraping sites.
  2. spring-crawler - Codebase for bridging the crawling architecture with Camunda and associated databases via SpringBoot.
  3. JobQueueManager (JQM) - An external library; its details are discussed in the relevant (sub-)sections.
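The three dependencies above can be fetched with `git clone`. The sketch below prints the clone commands as a reviewable dry run; the GitHub organisation `example-org` is a placeholder assumption, not stated in this wiki, so substitute the real remote locations before running them.

```shell
#!/bin/sh
# Dry run: print the clone command for each dependency.
# "example-org" is a placeholder organisation -- replace it with the
# actual remotes before executing the printed commands.
clone_commands() {
    for repo in norconex-crawler spring-crawler jqm; do
        echo "git clone https://github.com/example-org/${repo}.git"
    done
}

clone_commands
```

Once the URLs are corrected, piping the output through `sh` (`clone_commands | sh`) performs the actual clones.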

Specific screen sessions

  • It can be useful to set up an individual screen session for each component of the crawling/scraping architecture (discussed throughout this wiki).

e.g. screen -S <INDICATIVE_SCREEN_NAME>

It is suggested that you create five screens, one for each of the following processes:

  1. Seed submission - submitting multiple seeds to JQM.
  2. Crawl polling - incremental crawl monitoring and submission.
  3. JEF - starting/stopping the JEF crawler monitoring service.
  4. Configuring JEF - updating its configuration file with new crawl indexes.
  5. JQM nodes - starting, stopping, creating, and otherwise managing JQM nodes (most other operations can be performed in the JQM admin UI).
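The five sessions above can be created in one pass with `screen -d -m -S <name>`, which starts each session detached so the loop does not block. The sketch below prints the commands as a dry run for review; the session names are illustrative assumptions, not mandated by this wiki.

```shell
#!/bin/sh
# Dry run: print one detached-screen command per process listed above.
# Session names are illustrative -- pick any indicative names you like.
screen_commands() {
    for name in seed-submission crawl-polling jef jef-config jqm-nodes; do
        echo "screen -d -m -S ${name}"
    done
}

screen_commands
```

Piping the output through `sh` (`screen_commands | sh`) creates the sessions; `screen -ls` then lists them, and `screen -r <name>` reattaches to a given one.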
