Commit 94c2a8e

bbrownzcerner-bot
authored and committed
CDK-476 example oozie application using a kite URI handler to chain workflows through View URIs
1 parent a762c3a commit 94c2a8e

13 files changed

+727
-0
lines changed

kite-examples-oozie/README.md

Lines changed: 76 additions & 0 deletions
@@ -0,0 +1,76 @@
1+
# Kite Oozie Versioned Datasets Example
2+
3+
This example demonstrates creating Oozie applications using versioned Datasets as described in https://groups.google.com/a/cloudera.org/d/msg/cdk-dev/uUm-wOv1B3o/Sm6cDVBMusoJ.
4+
5+
This example uses different entity models than those discussed in the thread. It uses three:
6+
7+
* Person
8+
* PersonOutcomes
9+
* PersonSummary
10+
11+
For each entity model there is a corresponding Oozie coordinator that produces "nominal_time" partitions in a Dataset
12+
oriented around that model.
13+
14+
The data flow is Person -> PersonOutcomes -> PersonSummary. The model is inspired by a processing system in the healthcare
15+
arena where raw person data is processed into a form containing a number of outcomes for each person. Later, each
16+
person's outcomes might be reduced into a record representing a high-level summary of the person.
17+
18+
The entity models used in the example are bare skeletons, and realistic processing logic to transform between the models
19+
is absent. What those might look like in a more realistic system is left to the reader's imagination.
20+
21+
Person is the first dataset in this example. The input used to produce that Dataset is a simple text file.
22+
23+
## Prerequisites
24+
25+
The steps that follow "Restart Oozie" exist only to cope with the limited resources of running Hadoop in a VM. On a normally sized cluster, only the steps up to and including restarting Oozie are required.
26+
27+
* Running instance of [CDH5 Quickstart VM](http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo). The example's POM was set up and tested with a CDH5.2 VM.
28+
29+
* Copy persons.txt from src/main/resources to /user/cloudera directory in HDFS
30+
31+
* Copy datasets.xml from src/main/resources to /user/cloudera/apps directory in HDFS
32+
33+
* Drop the following jars in /var/lib/oozie
34+
* kite-data-core-1.0.1-SNAPSHOT.jar
35+
* kite-data-oozie-1.0.1-SNAPSHOT.jar
36+
* kite-hadoop-compatibility-1.0.1-SNAPSHOT.jar
37+
* kite-data-hive-1.0.1-SNAPSHOT.jar
38+
* commons-jexl-2.1.1.jar
39+
* jackson-core-2.3.1.jar
40+
* jackson-databind-2.3.1.jar
41+
42+
* Add the following to oozie-site.xml Safety Valve:
43+
44+
<property>
45+
<name>oozie.service.URIHandlerService.uri.handlers</name>
46+
<value>org.apache.oozie.dependency.FSURIHandler,org.apache.oozie.dependency.HCatURIHandler,org.kitesdk.data.oozie.KiteURIHandler</value>
47+
</property>
48+
49+
* Restart Oozie
50+
51+
* Tweak YARN config (in service yarn -> Gateway Base Group -> Resource Management) in Cloudera Manager.
52+
* ApplicationMaster Memory: 128
53+
* ApplicationMaster Java Maximum Heap Size: 100
54+
* Map Task Memory: 128
55+
* Reduce Task Memory: 128
56+
* Map Task Max Heap: 100
57+
* Reduce Task Max Heap: 100
58+
59+
* Deploy YARN client configuration via Service -> yarn -> Actions -> Deploy Client Configuration
60+
61+
* Tweak YARN service configurations:
62+
* Service -> yarn -> Configuration -> ResourceManager Base Group -> Resource Management -> Container Memory Increment: 256
63+
* Service -> yarn -> Configuration -> NodeManager Base Group -> Resource Management -> Container Virtual CPU Cores: 16
64+
65+
* Restart YARN
66+
67+
* Consider stopping Cloudera Management Services in Cloudera Manager to free more resources on the VM
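
The file-staging steps in the list above (copying persons.txt and datasets.xml into HDFS, and dropping the jars into /var/lib/oozie) could be scripted roughly as follows. This is only a sketch under assumptions: it is run from the example's project directory on the VM, `JAR_DIR` is a hypothetical directory where the listed jars have already been collected, and writing to /var/lib/oozie may need elevated privileges. By default it only prints the commands; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch of the staging steps; prints commands unless DRY_RUN=0.
# JAR_DIR is a hypothetical location holding the jars listed above.
DRY_RUN="${DRY_RUN:-1}"
JAR_DIR="${JAR_DIR:-./jars}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

stage_files() {
  # Input data and dataset definitions go into HDFS.
  run hdfs dfs -mkdir -p /user/cloudera/apps
  run hdfs dfs -put src/main/resources/persons.txt /user/cloudera/
  run hdfs dfs -put src/main/resources/datasets.xml /user/cloudera/apps/

  # Kite and supporting jars go into the local Oozie directory.
  for jar in \
      kite-data-core-1.0.1-SNAPSHOT.jar \
      kite-data-oozie-1.0.1-SNAPSHOT.jar \
      kite-hadoop-compatibility-1.0.1-SNAPSHOT.jar \
      kite-data-hive-1.0.1-SNAPSHOT.jar \
      commons-jexl-2.1.1.jar \
      jackson-core-2.3.1.jar \
      jackson-databind-2.3.1.jar
  do
    run cp "$JAR_DIR/$jar" /var/lib/oozie/
  done
}

stage_files
```

Reviewing the printed commands before running with `DRY_RUN=0` makes it easy to adjust paths for a different user directory or jar location.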
68+
69+
## Running
70+
* Run "mvn clean package -Pdeploy-example"
71+
* This will create the Person, PersonOutcomes, and PersonSummary base Datasets, then deploy and start the associated
72+
coordinators that populate "nominalTime" partitions of the datasets. Look at the coordinators and workflows in Hue to
73+
see them progress.
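
As an alternative to watching progress in Hue, the running coordinators could also be listed with the Oozie command-line client. This is a sketch, not part of the example itself: it assumes the `oozie` client is installed on the VM and that the server listens at `http://localhost:11000/oozie` (override `OOZIE_URL` if not). It prints the command by default; set `DRY_RUN=0` to execute it.

```shell
#!/bin/sh
# Sketch: list running Oozie coordinators from the command line.
# OOZIE_URL is an assumed server address; adjust for your environment.
DRY_RUN="${DRY_RUN:-1}"
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

list_coordinators() {
  run oozie jobs -oozie "$OOZIE_URL" -jobtype coordinator -filter status=RUNNING
}

list_coordinators
```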
74+
75+
## Re-deploying
76+
After making any code/configuration changes, kill the running coordinators from Hue and re-run "mvn clean package -Pdeploy-example"
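
The re-deploy cycle could also be driven from a shell instead of Hue. The sketch below assumes the `oozie` client is available, that the server is at `http://localhost:11000/oozie`, and that the coordinator job IDs to kill are passed as arguments (the IDs themselves are whatever Oozie or Hue reports for your deployment). It prints the commands by default; set `DRY_RUN=0` to execute them.

```shell
#!/bin/sh
# Sketch: kill the given coordinators, then rebuild and redeploy.
# Usage (hypothetical script name): redeploy.sh <coordinator-id>...
DRY_RUN="${DRY_RUN:-1}"
OOZIE_URL="${OOZIE_URL:-http://localhost:11000/oozie}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

redeploy() {
  # Kill each coordinator whose job ID was supplied.
  for id in "$@"; do
    run oozie job -oozie "$OOZIE_URL" -kill "$id"
  done
  # Rebuild and redeploy the example, as in the Running section.
  run mvn clean package -Pdeploy-example
}

redeploy "$@"
```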
