
General Notes

jackpay edited this page Jan 29, 2020 · 5 revisions

Setting up the environment

Installing the crawler/scraping architecture

The crawler/scraping architecture depends on three libraries/repositories, which must be downloaded or cloned before any work begins.

  1. norconex-crawler - Contains the majority of the codebase for crawling and scraping sites.
  2. spring-crawler - Codebase for bridging the crawling architecture with Camunda and associated databases via SpringBoot.
  3. JobQueueManager (JQM) - An external library; its details are discussed in the relevant (sub-)sections.
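The three dependencies above can be fetched with `git clone`. The sketch below prints the clone commands as a reviewable dry run; the GitHub organisation `example-org` is a placeholder assumption, not stated in this wiki, so substitute the real remote locations before running them.

```shell
#!/bin/sh
# Dry run: print the clone command for each dependency.
# "example-org" is a placeholder organisation -- replace it with the
# actual remotes before executing the printed commands.
clone_commands() {
    for repo in norconex-crawler spring-crawler jqm; do
        echo "git clone https://github.com/example-org/${repo}.git"
    done
}

clone_commands
```

Once the URLs are corrected, piping the output through `sh` (`clone_commands | sh`) performs the actual clones.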

Specific screen sessions

  • It can be useful to set up an individual screen session for each component of the crawling/scraping architecture (discussed throughout this wiki).

e.g. screen -S <INDICATIVE_SCREEN_NAME>

It is suggested that you create five screens, one for each of the following processes:

  1. Seed submission - submitting multiple seeds to JQM.
  2. Crawl polling - incremental crawl monitoring and submission.
  3. JEF - starting/stopping the JEF crawler monitoring service.
  4. Configuring JEF - updating its configuration file with new crawl indexes.
  5. JQM nodes - starting, stopping, creating, and otherwise managing JQM nodes (most other operations can be performed in the JQM admin UI).
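The five sessions above can be created in one pass with `screen -d -m -S <name>`, which starts each session detached so the loop does not block. The sketch below prints the commands as a dry run for review; the session names are illustrative assumptions, not mandated by this wiki.

```shell
#!/bin/sh
# Dry run: print one detached-screen command per process listed above.
# Session names are illustrative -- pick any indicative names you like.
screen_commands() {
    for name in seed-submission crawl-polling jef jef-config jqm-nodes; do
        echo "screen -d -m -S ${name}"
    done
}

screen_commands
```

Piping the output through `sh` (`screen_commands | sh`) creates the sessions; `screen -ls` then lists them, and `screen -r <name>` reattaches to a given one.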
