NRT detection of web-traffic anomalies using Flume, Spark Streaming and Impala

jqnik/ErrorsNRT

ErrorsNRT

Detect and Report Web Traffic Anomalies in Near Real-Time using Flume, Spark Streaming and Impala (https://blog.cloudera.com/blog/2016/06/how-to-detect-and-report-web-traffic-anomalies-in-near-real-time/)

This repository contains example code showing how Spark Streaming can be used to collect, count and aggregate occurrences of HTTP error status codes from a collection of web servers in near real-time. The code can be used to detect peaks of bad web traffic as they happen. It uses Flume to collect events, Spark Streaming to process and persist them in near real-time, and Impala to present them to a reporting front-end.
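The counting and aggregation step can be illustrated with a minimal sketch. This is not the repository's Spark Streaming code. It is plain Python that mimics the idea of grouping error events into fixed time windows. The function names, the 60-second window, the "status >= 400" rule and the sample events are all assumptions made for illustration.

```python
from collections import Counter


def is_error(status):
    # Assumption: any HTTP status of 400 or above counts as a "bad" code.
    return status >= 400


def count_errors_per_window(events, window_seconds=60):
    """Count error events per (window start, status code).

    `events` is an iterable of (timestamp_in_seconds, http_status) pairs.
    The window length is a hypothetical default, not taken from the repository.
    """
    counts = Counter()
    for timestamp, status in events:
        if is_error(status):
            # Align the timestamp to the start of its fixed-size window.
            window_start = (timestamp // window_seconds) * window_seconds
            counts[(window_start, status)] += 1
    return dict(counts)


# Hypothetical sample events: (seconds since start, HTTP status code).
sample_events = [(0, 200), (5, 404), (30, 500), (59, 404), (61, 503), (125, 200)]
window_counts = count_errors_per_window(sample_events)
```

In a streaming setup, each window's counts would be written to a table as the window closes, so that a reporting front-end can query them and spot peaks.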

Compilation:

Maven is used to compile the Spark Streaming application and manage its dependencies. The Cloudera Maven repository is used to easily match all versions of library dependencies. The Maven Shade plugin is used to build a large shaded JAR file that contains all direct and transitive dependencies, which avoids the need to install additional JARs on your cluster.

Build via: mvn package

Running:

In order to run the application you need:

  • One or more running Flume agents. Their configurations can be found under /flume in this repository. How you launch the Flume agents depends on your cluster setup.
  • An Impala table to persist the aggregated event counts that are generated by the Spark Streaming application. The DDL can be found under /sql in this repository.
  • The compiled Spark Streaming application JAR, which connects to the Flume agents.

The Spark Streaming application can then be launched via /scripts/run_ErrorsNRT.sh in this repository.
