Detect and Report Web Traffic Anomalies in Near Real-Time using Flume, Spark Streaming and Impala (https://blog.cloudera.com/blog/2016/06/how-to-detect-and-report-web-traffic-anomalies-in-near-real-time/)
This repository contains example code showing how Spark Streaming can be used to collect, count and aggregate occurrences of bad HTTP response codes from a collection of web servers in near real-time. The code can be used to detect peaks of bad web traffic as they happen. It uses Flume to collect events, Spark Streaming to process and persist them in near real-time, and Impala to present them to a reporting front-end.
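A minimal sketch of the per-event logic such a streaming job needs: extract the HTTP status code from a web server access-log line and decide whether it counts as a "bad" event. The class and method names below (StatusClassifier, parseStatus, isBad) are illustrative assumptions, not names taken from this repository.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StatusClassifier {
    // Combined Log Format: the status code is the first standalone 3-digit
    // field after the quoted request string, e.g.
    // 127.0.0.1 - - [10/Oct/2016:13:55:36 -0700] "GET / HTTP/1.1" 404 2326
    private static final Pattern LOG_LINE = Pattern.compile(
            "^\\S+ \\S+ \\S+ \\[[^\\]]+\\] \"[^\"]*\" (\\d{3}) \\S+.*$");

    /** Returns the HTTP status code, or empty if the line does not parse. */
    public static Optional<Integer> parseStatus(String line) {
        Matcher m = LOG_LINE.matcher(line);
        return m.matches() ? Optional.of(Integer.parseInt(m.group(1)))
                           : Optional.empty();
    }

    /** Any 4xx or 5xx response is treated as a bad event. */
    public static boolean isBad(int status) {
        return status >= 400;
    }
}
```

Inside the streaming job, helpers like these would be applied to the body of each Flume event, filtering on bad status codes before counting them per time window.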
Maven is used to compile the Spark Streaming application and manage its dependencies. The Cloudera Maven repository is used to easily match all versions of library dependencies. The Maven Shade plugin is used to build a large shaded JAR file that contains all direct and transitive dependencies, avoiding the need to install additional JARs on your cluster.
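For reference, the standard way to wire the Shade plugin into the package phase looks like the fragment below. This is a generic illustration of typical maven-shade-plugin usage, not a copy of this repository's pom.xml.

```xml
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <executions>
    <execution>
      <!-- Produce the shaded "fat" JAR during mvn package -->
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```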
Build via: mvn package
In order to run the application you need:
- One or more running Flume agents. The configurations can be found under /flume in this repository. Launching the Flume agents depends on your cluster setup.
- An Impala table to persist the aggregated event counts that are generated by the Spark Streaming application. The DDL can be found under /sql in this repository.
- The compiled Spark Streaming application JAR connecting to the Flume agents.

The Spark Streaming application can then be launched via /scripts/run_ErrorsNRT.sh in this repository.
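To give an idea of what a Flume agent feeding this pipeline looks like, here is a hedged sketch of an agent that tails a web server log and forwards events over an Avro sink to the host/port the Spark Streaming receiver listens on. Agent, source, channel, and sink names (as well as paths, hosts, and ports) are illustrative assumptions; the actual configurations are under /flume in this repository.

```properties
a1.sources = webLogs
a1.channels = mem
a1.sinks = sparkAvro

# Tail-like source reading web server access logs
a1.sources.webLogs.type = exec
a1.sources.webLogs.command = tail -F /var/log/httpd/access_log
a1.sources.webLogs.channels = mem

# In-memory channel buffering events between source and sink
a1.channels.mem.type = memory
a1.channels.mem.capacity = 10000

# Avro sink pointing at the Spark Streaming receiver
a1.sinks.sparkAvro.type = avro
a1.sinks.sparkAvro.hostname = spark-receiver-host
a1.sinks.sparkAvro.port = 41414
a1.sinks.sparkAvro.channel = mem
```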
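As an illustration of the kind of table the DDL under /sql creates, a sketch might look like the following. The table and column names here are invented for illustration and are not taken from this repository.

```sql
-- Aggregated counts of bad HTTP responses per time window,
-- written by the Spark Streaming job and queried by Impala.
CREATE TABLE IF NOT EXISTS error_counts (
  window_start TIMESTAMP,
  status_code  INT,
  error_count  BIGINT
)
STORED AS PARQUET;
```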
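The launch script essentially wraps spark-submit around the shaded JAR. A hedged sketch of such an invocation follows; the class name, JAR name, and arguments are assumptions for illustration, not contents of /scripts/run_ErrorsNRT.sh.

```shell
# Submit the shaded application JAR to the cluster; the trailing
# arguments would tell the receiver where to listen for Flume events.
spark-submit \
  --master yarn \
  --class com.example.ErrorsNRT \
  target/errors-nrt-shaded.jar \
  spark-receiver-host 41414
```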