Exploring approaches to quick distributed scraping with the help of Akka.
The easiest way to run the scraper in stand-alone mode is to use its neat CLI interface.
# Build "fat jar" with SBT
sbt assembly
# Run it with
java -jar target/*/scraper.jar --categories prodaja --pages 10

By default, the scraper spits out JSON.
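For illustration, a single output record might look something like this (the field names are taken from the jq/CSV filter further down; the values are invented):

```json
{
  "refNumber": "12345",
  "title": "Example listing",
  "price": 199000,
  "location": { "latitude": 46.05, "longitude": 14.51 }
}
```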
java -jar target/*/scraper.jar --categories prodaja --pages 2 | jq -R 'fromjson?'

To make things a bit easier on the eyes, you can use jq to format the output, or to restructure it further, for example into CSV.
java -jar target/*/scraper.jar --categories prodaja --pages 10 \
| jq -R 'fromjson?' \
| jq -r "([.refNumber, .title, .price, .location.latitude, .location.longitude]) | @csv" \
> prodaja.csv

Adjusting parallelism and other fine-grained application.conf switches is easily done by loading a different configuration file.
java -Dconfig.resource=quick.conf -jar target/*/scraper.jar --categories najem

Some configuration options can also be adjusted via environment variables, e.g.
INITIAL_CATEGORIES=prodaja,najem
CATEGORY_PAGES_LIMIT=3
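Assuming the usual Typesafe Config setup for an Akka application, such environment overrides are typically wired up with optional substitutions in the configuration file itself. The key names below are hypothetical, sketched only to illustrate the pattern:

```hocon
# quick.conf — a hypothetical override file. It includes the base
# configuration and then overrides selected keys; all key names here
# are assumptions, not taken from the project.
include "application"

scraper {
  # Defaults, each overridable from the environment:
  initial-categories = "prodaja,najem"
  initial-categories = ${?INITIAL_CATEGORIES}

  category-pages-limit = 10
  category-pages-limit = ${?CATEGORY_PAGES_LIMIT}
}
```

The `${?VAR}` form is standard HOCON: the second assignment only takes effect when the environment variable is actually set, otherwise the default above it is kept.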