General Notes
jackpay edited this page Jan 29, 2020 · 5 revisions
The crawling/scraping architecture depends on three libraries/repositories, which must be downloaded or cloned before any work can begin:
- norconex-crawler - Contains the majority of the codebase for crawling and scraping sites.
- spring-crawler - Codebase for bridging the crawling architecture with Camunda and associated databases via SpringBoot.
- JobQueueManager (JQM) - An external library; its details are discussed in the relevant (sub-)sections.
It can be useful to set up an individual screen session for each component involved in managing the crawling/scraping architecture (discussed throughout this Wiki).
e.g. screen -S <INDICATIVE_SCREEN_NAME>
Five screens are suggested, one for each of the following processes:
- Seed submission - submitting multiple seeds to JQM.
- Crawl polling - incremental crawl monitoring and submission.
- JEF - starting/stopping the JEF crawler monitoring service.
- Configuring JEF - updating its configuration file with new crawl indexes.
- JQM nodes - starting, stopping, creating and otherwise managing JQM nodes (most other operations can be performed in the JQM admin UI).
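
The five screens listed above could be created in one go with a small script along these lines. This is only a sketch: the session names are hypothetical placeholders (this Wiki does not prescribe any), and the script defaults to a dry run that prints the commands rather than executing them. It relies on `screen -dmS <name>`, which starts a named session in detached mode.

```shell
#!/bin/sh
# Hypothetical session names, one per process listed above; rename as needed.
SESSIONS="seed-submission crawl-polling jef jef-config jqm-nodes"

# Print the screen command for each session (default), or run it when DRY_RUN=0.
create_sessions() {
  for name in $SESSIONS; do
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "screen -dmS $name"
    else
      # -dmS starts a named session detached, so the loop does not block.
      screen -dmS "$name"
    fi
  done
}

create_sessions
```

Running the script with `DRY_RUN=0` creates the sessions; `screen -ls` then lists them and `screen -r <name>` reattaches to one.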