Commit 6f4ca09 by DennisDawson: "Updates for accuracy, typos, and to avail discussion." (parent 2e84ec3; spark/README.md: 69 additions, 50 deletions)

This module provides an example of processing event data using Apache Spark.

## Getting started

This example assumes that you're running a CDH5.1 or later cluster (such as the [Cloudera Quickstart VM][getvm]) that has Spark configured. This example requires the `spark-submit` command to execute the Spark job on the cluster. If you're using the Quickstart VM, run this example from the VM rather than the host computer.

[getvm]: http://www.cloudera.com/content/support/en/downloads/quickstart_vms.html

On the cluster, check out a copy of the code and navigate to the `spark` directory using the following commands in a terminal window.

```
git clone https://github.com/kite-sdk/kite-examples.git
cd kite-examples
cd spark
```

## Building the Application

To build the project, enter the following command in a terminal window.

```
mvn install
```

## Creating and Populating the Events Dataset

In this example, you store raw events in a Hive-backed dataset so that you can process the results using Hive. Use `CreateEvents`, provided with the demo, to both create and populate random event records. Execute the following command from a terminal window in the `kite-examples/spark` directory.

```
mvn exec:java -Dexec.mainClass="org.kitesdk.examples.spark.CreateEvents"
```

You can browse the generated events using [Hue on the QuickstartVM](http://localhost:8888/metastore/table/default/events/read).

## Using Spark to Correlate Events

In this example, you use Spark to correlate events generated from the same IP address within a five-minute window. Begin by configuring Spark to use the Kryo serialization library.
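
The correlation criterion is easy to state on its own: two events match when they come from the same IP address and their timestamps fall within five minutes of each other. The following standalone sketch shows that predicate in plain Java; the names `WindowSketch` and `correlated` are illustrative, not the demo's actual `CorrelateEventsTask` code.

```java
public class WindowSketch {
    static final long FIVE_MINUTES_MS = 5 * 60 * 1000L;

    // True when two events should be correlated: same source IP and
    // timestamps no more than five minutes apart.
    static boolean correlated(String ipA, long tsA, String ipB, long tsB) {
        return ipA.equals(ipB) && Math.abs(tsA - tsB) <= FIVE_MINUTES_MS;
    }

    public static void main(String[] args) {
        System.out.println(correlated("10.0.0.1", 0L, "10.0.0.1", 299_000L)); // true
        System.out.println(correlated("10.0.0.1", 0L, "10.0.0.2", 1_000L));   // false
    }
}
```

The Spark job applies this kind of test across the whole dataset after grouping events, rather than pair by pair.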

Register your Avro classes with the following Scala class to use Avro's specific binary serialization for both the `StandardEvent` and `CorrelatedEvents` classes.

### AvroKyroRegistrator.scala

```scala
class AvroKyroRegistrator extends KryoRegistrator {
  // Registers StandardEvent and CorrelatedEvents for Avro's specific
  // binary serialization (body elided here)
}
```

### Highlights from CorrelateEventsTask.class

The following snippets show examples of code you use to configure and invoke Spark tasks.

Configure Kryo to automatically serialize Avro objects.
```java
// Create the Spark configuration and get a Java context
SparkConf sparkConf = new SparkConf()
    .setAppName("Correlate Events")
    // Configure the use of Kryo serialization including the Avro registrator
    .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .set("spark.kryo.registrator", "org.kitesdk.examples.spark.AvroKyroRegistrator");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
```

To access Hive-backed datasets from remote Spark tasks, register JARs in the Spark equivalent of the Hadoop DistributedCache:

```java
// Register classes needed for remote Spark tasks
addJarFromClass(sparkContext, getClass());
addJars(sparkContext, System.getenv("HIVE_HOME"), "lib");
sparkContext.addFile(System.getenv("HIVE_HOME")+"/conf/hive-site.xml");
```

Configure the MapReduce `DatasetKeyInputFormat` to enable the application to read from the _events_ dataset. Use Spark's built-in support to generate an RDD (Resilient Distributed Dataset) from the input format.

```java
Configuration conf = new Configuration();
DatasetKeyInputFormat.configure(conf).readFrom(eventsUri).withType(StandardEvent.class);

JavaPairRDD<StandardEvent, Void> events = sparkContext.newAPIHadoopRDD(conf,
    DatasetKeyInputFormat.class, StandardEvent.class, Void.class);
```

The application can now process events as needed. Using your RDD, configure `DatasetKeyOutputFormat` the same way and use `saveAsNewAPIHadoopFile` to store data in an output dataset.

```java
DatasetKeyOutputFormat.configure(conf).writeTo(correlatedEventsUri).withType(CorrelatedEvents.class);
matches.saveAsNewAPIHadoopFile("dummy", CorrelatedEvents.class, Void.class,
    DatasetKeyOutputFormat.class, conf);
```

In a terminal window, run the Spark job using the following command.

```
spark-submit --class org.kitesdk.examples.spark.CorrelateEvents --jars $(mvn dependency:build-classpath | grep -v '^\[' | sed -e 's/:/,/g') target/kite-spark-demo-*.jar
```
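
The `--jars` value in the command is built from the project classpath: `mvn dependency:build-classpath` prints the colon-separated classpath, `grep -v '^\['` drops Maven's bracketed log lines, and `sed` swaps colons for the commas that `spark-submit` expects. A quick way to see the transformation, using placeholder paths rather than the real classpath:

```shell
# sed turns a colon-separated classpath into spark-submit's comma-separated form
echo '/path/a.jar:/path/b.jar' | sed -e 's/:/,/g'
# prints /path/a.jar,/path/b.jar
```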

You can browse the correlated events using [Hue on the QuickstartVM](http://localhost:8888/metastore/table/default/correlated_events/read).

## Deleting the Datasets

When you're done, or if you want to run the example again, delete the datasets using the Kite CLI `delete` command.

```
curl http://central.maven.org/maven2/org/kitesdk/kite-tools/0.17.0/kite-tools-0.17.0-binary.jar -o kite-dataset
chmod +x kite-dataset
./kite-dataset delete events
./kite-dataset delete correlated_events
```

## Troubleshooting

The following are known issues and their solutions.

### ClassNotFoundException

The first time you execute `spark-submit`, the process might not find `CorrelateEvents`.

```
java.lang.ClassNotFoundException: org.kitesdk.examples.spark.CorrelateEvents
```

Execute the command a second time to get past this exception.

### AccessControlException

On some VMs, you might receive the following exception.

```
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.security.AccessControlException): \
Permission denied: user=cloudera, access=EXECUTE, inode="/user/spark":spark:spark:drwxr-x---
```

In a terminal window, update permissions using the following commands.

```
$ sudo su - hdfs
$ hadoop fs -chmod -R 777 /user/spark
```
