# Kite Oozie Versioned Datasets Example

This example demonstrates how to create Oozie applications using versioned Datasets, as described in https://groups.google.com/a/cloudera.org/d/msg/cdk-dev/uUm-wOv1B3o/Sm6cDVBMusoJ.

The entity models in this example differ somewhat from those discussed in the thread. Three entity models are used:

* Person
* PersonOutcomes
* PersonSummary

For each entity model there is a corresponding Oozie coordinator that produces "nominal_time" partitions in a Dataset
oriented around that model.

The data flow is Person -> PersonOutcomes -> PersonSummary. The model is inspired by a processing system in the
healthcare arena where raw person data is processed into a form containing a number of outcomes for each person.
Later, each person's outcomes might be reduced into a record representing a high-level summary of the person.

The entity models used in the example are bare skeletons, and realistic processing logic to transform between the
models is absent. What that logic might look like in a more realistic system is left to the reader's imagination.

Person is the first dataset in this example. The input used to produce that Dataset is a simple text file.

## Prerequisites

The later steps below exist purely to contend with the limited resources of running Hadoop in a VM. Running this
example on a normally sized cluster would only require the steps up through restarting Oozie.

* Running instance of the [CDH5 Quickstart VM](http://www.cloudera.com/content/support/en/downloads/download-components/download-products.html?productID=F6mO278Rvo). The example's POM was set up and tested with a CDH5.2 VM.

* Copy persons.txt from src/main/resources to the /user/cloudera directory in HDFS

* Copy datasets.xml from src/main/resources to the /user/cloudera/apps directory in HDFS

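The two copies above can be done with the HDFS command-line client; a sketch, assuming the commands are run from the example's root directory as the cloudera user on the VM:

```shell
# Copy the example inputs into HDFS. Paths assume the CDH5 Quickstart VM
# defaults (HDFS home directory /user/cloudera).
hdfs dfs -put src/main/resources/persons.txt /user/cloudera/
hdfs dfs -mkdir -p /user/cloudera/apps
hdfs dfs -put src/main/resources/datasets.xml /user/cloudera/apps/
```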
* Drop the following jars in /var/lib/oozie
  * kite-data-core-1.0.1-SNAPSHOT.jar
  * kite-data-oozie-1.0.1-SNAPSHOT.jar
  * kite-hadoop-compatibility-1.0.1-SNAPSHOT.jar
  * kite-data-hive-1.0.1-SNAPSHOT.jar
  * commons-jexl-2.1.1.jar
  * jackson-core-2.3.1.jar
  * jackson-databind-2.3.1.jar

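Copying the jars can be scripted; a sketch, assuming all seven jars have already been collected into the current directory (the Kite jars from a local build or Maven repository, the rest from the CDH parcel):

```shell
# Copy the required jars into Oozie's lib directory on the VM.
for jar in kite-data-core-1.0.1-SNAPSHOT.jar \
           kite-data-oozie-1.0.1-SNAPSHOT.jar \
           kite-hadoop-compatibility-1.0.1-SNAPSHOT.jar \
           kite-data-hive-1.0.1-SNAPSHOT.jar \
           commons-jexl-2.1.1.jar \
           jackson-core-2.3.1.jar \
           jackson-databind-2.3.1.jar; do
  sudo cp "$jar" /var/lib/oozie/
done
```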
* Add the following to the oozie-site.xml Safety Valve in Cloudera Manager:

      <property>
        <name>oozie.service.URIHandlerService.uri.handlers</name>
        <value>org.apache.oozie.dependency.FSURIHandler,org.apache.oozie.dependency.HCatURIHandler,org.kitesdk.data.oozie.KiteURIHandler</value>
      </property>

* Restart Oozie

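The restart can be done from Cloudera Manager or, if Oozie runs as a plain system service on the VM rather than under Cloudera Manager, from the shell:

```shell
# Restart the Oozie service so it picks up the new jars and the
# KiteURIHandler configured above.
sudo service oozie restart
```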
* Tweak YARN config (in service yarn -> Gateway Base Group -> Resource Management) in Cloudera Manager:
  * ApplicationMaster Memory: 128
  * ApplicationMaster Java Maximum Heap Size: 100
  * Map Task Memory: 128
  * Reduce Task Memory: 128
  * Map Task Max Heap: 100
  * Reduce Task Max Heap: 100

* Deploy YARN client configuration via Service -> yarn -> Actions -> Deploy Client Configuration

* Tweak YARN service configurations:
  * Service -> yarn -> Configuration -> ResourceManager Base Group -> Resource Management -> Container Memory Increment: 256
  * Service -> yarn -> Configuration -> NodeManager Base Group -> Resource Management -> Container Virtual CPU Cores: 16

* Restart YARN

* Consider stopping Cloudera Management Services in Cloudera Manager to free more resources on the VM

## Running

* Run "mvn clean package -Pdeploy-example"
  * This creates the Person, PersonOutcomes, and PersonSummary base Datasets, then deploys and starts the associated
    coordinators that populate "nominalTime" partitions of the datasets. Watch the coordinators and workflows in Hue
    to see them progress.

## Re-deploying

After making any code or configuration changes, kill the running coordinators from Hue and re-run "mvn clean package -Pdeploy-example".
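
The coordinators can also be killed from the command line with the Oozie CLI; a sketch, assuming Oozie listens at its default URL on the VM (`<coordinator-id>` is a placeholder for each ID reported by the first command):

```shell
# Point the Oozie CLI at the local Oozie server.
export OOZIE_URL=http://localhost:11000/oozie

# List running coordinators, then kill each one by ID before redeploying.
oozie jobs -jobtype coordinator -filter status=RUNNING
oozie job -kill <coordinator-id>   # repeat for each listed coordinator

# Rebuild and redeploy the example.
mvn clean package -Pdeploy-example
```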