The Open Crawler uses YAML configuration files to define crawl behavior and connection settings. Template configuration files and examples are located in the config directory.
There are two types of configuration files:
- `crawler.yml` - Required for all crawls. Defines what to crawl, where to send results, and crawl behavior settings.
  - Example: `config/crawler.yml.example`
- `elasticsearch.yml` - Optional separate file for Elasticsearch connection settings. Useful when running multiple crawlers that share the same Elasticsearch instance.
  - Example: `config/elasticsearch.yml.example`
Note: You can include Elasticsearch settings directly in your crawler configuration file instead of using a separate file. Settings in the crawler configuration take precedence over the separate Elasticsearch configuration. The standalone Elasticsearch configuration file is only needed if you want to share connection settings across multiple crawlers. Most users can put everything in the crawler configuration file.
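As an illustration, a single crawler configuration with Elasticsearch connection settings inlined might look like the following. This is a sketch based on the template in `config/crawler.yml.example`; the URL, index name, and credentials are placeholders, and key names should be verified against the example file for your version:

```yaml
# my-crawler.yml - crawl settings and Elasticsearch connection in one file
domains:
  - url: https://www.example.com # placeholder site to crawl
output_sink: elasticsearch
output_index: my-crawl-index     # placeholder index name
elasticsearch:
  host: http://localhost
  port: 9200
  username: elastic              # placeholder credentials
  password: changeme
```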
Crawler configuration files are required for all crawl jobs.
If `elasticsearch` is the output sink, the Elasticsearch instance settings can also be included directly in the crawler configuration file.
If the Elasticsearch configuration is provided this way, it overrides any settings from a separate Elasticsearch configuration file.
The crawler configuration file is passed to the CLI as a positional argument, e.g. `bin/crawler crawl path/to/my-crawler.yml`.
An Elasticsearch configuration file is only required if the output sink is `elasticsearch`; it is not needed for the `file` or `console` sinks.
Even then, this file is optional: all of its settings can be provided in the crawler configuration file instead. The crawler configuration is loaded after the Elasticsearch configuration, so any Elasticsearch settings in the crawler configuration take priority.
The Elasticsearch configuration file is passed to the CLI with the `--es-config` option, e.g. `bin/crawler crawl path/to/my-crawler.yml --es-config=/path/to/elasticsearch.yml`.
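A standalone Elasticsearch configuration file might look like the following. This is a sketch modeled on `config/elasticsearch.yml.example`; the host, port, and credentials are placeholders, and key names should be checked against the example file for your version:

```yaml
# elasticsearch.yml - connection settings shared by multiple crawlers
elasticsearch:
  host: http://localhost
  port: 9200
  username: elastic   # placeholder; an api_key can typically be used instead
  password: changeme  # placeholder
```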
If your Elasticsearch instance uses SSL/TLS with certificates signed by a private Certificate Authority (CA) or uses self-signed certificates, you will need to configure the crawler to trust these certificates.
- Obtain the CA certificate: Get the CA certificate file (usually a `.pem` or `.crt` file) that was used to sign your Elasticsearch node certificates. If using self-signed certificates directly on the nodes, you might need the certificate file for each node or a combined CA file.
- Configure `ca_file`: Place the CA certificate file(s) in a directory accessible to the crawler. In your `elasticsearch.yml` or crawler configuration file, set the `elasticsearch.ca_file` parameter to the certificate:

```yaml
elasticsearch.ca_file: /path/to/your/ca.crt
```
Note: For detailed explanations of all Elasticsearch connection parameters, including authentication and other SSL options, refer to the comments within the `config/elasticsearch.yml.example` file.
These settings control the retry behavior when the Elasticsearch output sink is locked.
- `sink_lock_retry_interval`: The interval in seconds to wait before retrying to acquire the sink lock. Defaults to `1`.
- `sink_lock_max_retries`: The maximum number of times to retry acquiring the sink lock before dropping the crawl result. Defaults to `120`.
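For example, the two settings above could be tuned as follows. The values are illustrative, and the placement shown here (top-level, alongside other crawl behavior settings) is an assumption; confirm against `config/crawler.yml.example` for your version:

```yaml
# Sketch: wait 2 seconds between lock attempts, give up after 60 tries
sink_lock_retry_interval: 2
sink_lock_max_retries: 60
```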
See CLI in Docker for details on how to mount configuration files into the Docker container for use with commands.
The config files are provided via the CLI as shown below. The order of CLI options is not important.
When performing a crawl with only a crawl config:

```shell
bin/crawler crawl config/my-crawler.yml
```

When performing a crawl with both a crawl config and an Elasticsearch config:

```shell
bin/crawler crawl config/my-crawler.yml --es-config config/elasticsearch.yml
```

Example: Environment Variables

```yaml
elasticsearch:
  username: <%= ENV['ES_USER'] %>
  password: <%= ENV['ES_PASS'] %>
```

Example: Default Value Logic

```yaml
output_path: <%= ENV['OUTPUT_PATH'] || '/tmp/crawl-output' %>
```

How it works:
- Before parsing YAML, the file is processed with Embedded Ruby (ERB) template syntax.
- You can use any Ruby code inside `<%= ... %>` tags, but the most common use is referencing environment variables.
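The two steps above can be sketched in plain Ruby. This snippet is illustrative of ERB-then-YAML processing, not the crawler's actual loading code:

```ruby
require 'erb'
require 'yaml'

# A config line using the default-value logic shown above
template = "output_path: <%= ENV['OUTPUT_PATH'] || '/tmp/crawl-output' %>"

ENV.delete('OUTPUT_PATH')                 # variable unset: the default applies
config = YAML.safe_load(ERB.new(template).result)
puts config['output_path']                # => /tmp/crawl-output

ENV['OUTPUT_PATH'] = '/data/crawl-output' # variable set: it takes precedence
config = YAML.safe_load(ERB.new(template).result)
puts config['output_path']                # => /data/crawl-output
```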
For more examples, see the sample configuration files in the config directory.