This module provides an example of processing event data using Apache Spark.

## Getting started
This example assumes that you're running a CDH5.1 or later cluster (such as the [Cloudera Quickstart VM][getvm]) that has Spark configured. This example requires the `spark-submit` command to execute the Spark job on the cluster. If you're using the Quickstart VM, run this example from the VM rather than the host computer.
To build the project, enter the following command in a terminal window.

```bash
mvn install
```
## Creating and Populating the Events Dataset
In this example, you store raw events in a Hive-backed dataset so that you can process the results using Hive. Use `CreateEvents`, provided with the demo, to both create and populate random event records. Execute the following command from a terminal window in the `kite-examples/spark` directory.
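The `CreateEvents` tool's source isn't reproduced in this README. As a rough, stdlib-only illustration of what "random event records" means here (the field names below are assumptions for the sketch, not the demo's actual Avro `StandardEvent` schema), event generation amounts to:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomEvents {
    // Illustrative only: the demo's real CreateEvents tool writes Avro
    // StandardEvent records into the Hive-backed dataset. This sketch just
    // shows the shape of "random events" as (ip, timestampMs) pairs.
    static List<String[]> generate(int n, long seed) {
        Random random = new Random(seed);
        List<String[]> events = new ArrayList<>();
        long timestampMs = 0L;
        for (int i = 0; i < n; i++) {
            String ip = "192.168.0." + random.nextInt(16);
            timestampMs += random.nextInt(60_000); // events up to a minute apart
            events.add(new String[] { ip, Long.toString(timestampMs) });
        }
        return events;
    }
}
```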
You can browse the generated events using [Hue on the Quickstart VM](http://localhost:8888/metastore/table/default/events/read).
## Using Spark to Correlate Events
In this example, you use Spark to correlate events generated from the same IP address within a five-minute window. Begin by configuring Spark to use the Kryo serialization library.
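Before looking at the Spark-specific configuration, it may help to see the correlation rule in isolation. This stdlib-only sketch (with an illustrative `Event` type, not the demo's `StandardEvent`) expresses the same-IP, five-minute-window test, along with the group-by-IP step that the Spark job performs on a keyed RDD:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CorrelateSketch {
    static final long FIVE_MINUTES_MS = 5 * 60 * 1000L;

    // A bare-bones event: an IP address plus a millisecond timestamp
    // (illustrative; the demo's real records are Avro StandardEvents)
    record Event(String ip, long timestampMs) {}

    // Two events correlate when they come from the same IP address and
    // their timestamps fall within five minutes of each other
    static boolean correlated(Event a, Event b) {
        return a.ip().equals(b.ip())
                && Math.abs(a.timestampMs() - b.timestampMs()) <= FIVE_MINUTES_MS;
    }

    // The Spark job keys events by IP so the window test only runs within
    // each IP's group; the same grouping in plain Java:
    static Map<String, List<Event>> byIp(List<Event> events) {
        return events.stream().collect(Collectors.groupingBy(Event::ip));
    }
}
```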
Register your Avro classes with the following Scala class to use Avro's specific binary serialization for both the `StandardEvent` and `CorrelatedEvents` classes.
### AvroKyroRegistrator.scala
```scala
class AvroKyroRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    // The registration body is elided in this excerpt; this reconstruction
    // assumes chill-avro's AvroSerializer helpers for Avro specific records
    kryo.register(classOf[StandardEvent],
      AvroSerializer.SpecificRecordBinarySerializer[StandardEvent])
    kryo.register(classOf[CorrelatedEvents],
      AvroSerializer.SpecificRecordBinarySerializer[CorrelatedEvents])
  }
}
```
### Highlights from CorrelateEventsTask.class
The following snippets show examples of code you use to configure and invoke Spark tasks.
Configure Kryo to automatically serialize Avro objects.
```java
// Create the Spark configuration and get a Java context
SparkConf sparkConf = new SparkConf()
    .setAppName("Correlate Events")
    // Configure the use of Kryo serialization including the Avro registrator
    // (the two settings below are reconstructed; see CorrelateEventsTask for
    // the exact values used by the demo)
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", AvroKyroRegistrator.class.getName());
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
```
The application can now process events as needed. Using your RDD, configure `DatasetKeyOutputFormat` the same way and use `saveAsNewAPIHadoopFile` to store data in an output dataset.