Skip to content

CDK-476 oozie UriHandler for Kite URIs#364

Closed
bbrownz wants to merge 1 commit intokite-sdk:masterfrom
bbrownz:CDK-476_data_oozie
Closed

CDK-476 oozie UriHandler for Kite URIs#364
bbrownz wants to merge 1 commit intokite-sdk:masterfrom
bbrownz:CDK-476_data_oozie

Conversation

@bbrownz
Copy link
Copy Markdown
Contributor

@bbrownz bbrownz commented May 13, 2015

An Oozie URIHandler for KiteURIs. This allows KiteURIs to be used for data input events. Particularly when chaining together workflows based on the availability of an upstream dataset (a populated view in this case).

Uses the signal ready concept implemented in CDK-451 as an alternative to the traditional "_SUCCESS" file.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: Notificatiosn

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch 35ec9e5

@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented May 13, 2015

Created a PR to kite-examples showing a use case for the URIHandler, kite-sdk/kite-examples#26

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 22, 2015

Hi @bbrownz, sorry it is taking me so long to get to this. I'm not very familiar with Oozie, but I talked with someone today that was interested and will probably have a look. Otherwise I'll follow up on this next week. Thanks for being patient!

@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented May 22, 2015

No worries! The biggest questions I had posting this in were
*we build the project, since not all the profiles made a lot of sense with the URIHandler being in oozie 4, which we only get under the CDH5 profile now. Currently the oozie version is just always locked in.
*If we should support deleting the views through the LauncherURIHandler, that felt like something we wouldn't want for the first iteration though

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this should attempt to read the signal files and catch any IOExceptions that are thrown. Then this would have at least basic support.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were you thinking along the route of reading the signal files directly (so we'd be able to flex the configuration and user used for the FileSystem access)? That seems like it would bind us very directly to the current implementation, although it would work.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I meant try to do the operation like you do here, but if it fails with a permissions-related IOException, return false instead of propagating the exception.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I follow, thats an interesting trade off.

I think returning false would make the most sense if we were approaching it from the perspective that the (oozie process) user without permissions shouldn't know that the dataset exists or not. Propagating the exception through helps to reveal the issue that the permissions are blocking the operation they were hoping would work.

I think propagating the exception makes the most sense in that alert a consumer that their permission setup isn't going to work with the URI handler as is. If we were to create documentation around the usage we would definitely want to note that case (outside of the example project I didn't create any, and perhaps should have).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds fine to me.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is precedent for catching IOException and rethrowing as HadoopAccessorException in FSURIHandler. It seems reasonable to me to be consistent with that.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a catch in the exists method for a DatasetException to re-throw that exception as a HadoopAccessorException. I'd considered the more specific DatasetIOException that IOExceptions are wrapped in but this seemed like it would cover our bases a little better. It also occurred to me that HadoopAccessorException might be less appropriate in the future (for example if we added a signal implementation that was HBase specific) but that felt like I had started to nitpick myself. I'm open to further opinions though 😄

a310ebf

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 28, 2015

I have a couple of minor comments, but this looks good to me overall. Have you tested this with a cluster?

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 28, 2015

@ben-roling, could you review this as well?

@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented May 28, 2015

Have you tested this with a cluster?

So far we've tested this on 5.2 and 5.3 with quickstart VMs via the example (kite-sdk/kite-examples#26). We're in progress doing some testing now with one of our dev clusters that has Kerberos enabled.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 28, 2015

Sounds good! I'd like someone more familiar with Oozie to take a look, but I'm happy with this going in. Since Ben is a committer, I think it makes sense to get his +1 before merging. Thanks for working on this, @bbrownz!

@ben-roling
Copy link
Copy Markdown

@bbrownz works with me and as such I had already reviewed most of this prior to the PR being posted. I'll take another quick pass through and then I can formally document my +1.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit confusing to me. Why is this ant task necessary? I presume this is copy/paste from one of the other sub-projects. It looks like it is in kite-data-hbase, kite-data-hive, kite-data-s3, and kite-hadoop-compatibility.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because the module does not have any public classes so there's no javadoc for it. This task generates an empty javadoc JAR, which is needed for Maven central.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, that explanation makes sense in the context of the other projects but this project does have public classes and with this ant task removed it does still generate a javadoc JAR.

I think maybe this task just needs to be removed from this POM. Alternatively we can update the package excludes filter for the javadoc plugin (defined in the parent POM) to exclude org.kitesdk.data.oozie directly indicating we choose not to publish javadoc. I don't expect anyone is really going to be reading the javadoc as it's not really a Kite consumer oriented project.

@tomwhite or @rdblue - do you have an opinion about whether or not we should publish the javadoc for these classes?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we should publish javadoc for this module as it doesn't contain classes that users will program to in their programs. So we should update the package excludes filters and leave the ant task as it is in this patch.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Tom. Most of the code here is implementing a public API documented in Oozie, right?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct and I agree it would be fine not to publish javadoc. These classes will not be consumed by "users" - just by Oozie.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added the exclusion to the javadoc plugin 32186c3

@ben-roling
Copy link
Copy Markdown

I think there are just 3 things left with this:

  • exclude the package on the javadoc plugin config
  • possibly change the IOException handling for exists()
  • implement getContext()

Once @bbrownz implements the trivial javadoc plugin config change I am +1.

I don't care that much about the IOException handling thing. It could be changed to rethrow as HadoopAccessorException or left alone and either way would be fine by me.

With CDK-1007 I have provided a sub-task to defer the getContext() change. I think we could merge this without it.

@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented May 29, 2015

Addressed @ben-roling's comments from above. I think it's best to leave the implementation of the getContext() method to the optimization in the CDK-1007 issue when the usage would really come into play.

I everything looks good here I can squash this down into a single change before the merge.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented May 29, 2015

Squash away!

@bbrownz bbrownz force-pushed the CDK-476_data_oozie branch 2 times, most recently from e68be98 to 11146a9 Compare May 29, 2015 19:41
@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented May 29, 2015

Rebased and squashed, should be ready after the travis build goes through.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in Oozie, we use our XLog class for logging; the methods should be the same though. For example:

private static XLog LOG = XLog.getLog(KiteURIHandler.class);

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good call, updated to XLog here: c743837

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 1, 2015

Hi, I'm one of the Oozie devs. I took a look at this, and it looks fine to me. I made a minor comment about logging. That said, the Yahoo! Oozie devs are more familiar with the URIHandler code; they're the ones who generalized it and added support for HCat URIs.

What do you think about including this directly in Oozie? That would make things easier for users because they wouldn't have to add jars into the Oozie server. And we can include the example Coordinator too. I know that Ben was working on OOZIE-1829, though I guess that's no longer needed for now. Anyway, I think putting this in Oozie would be a great integration story, and I can also ask one of the Yahoo! Oozie devs to review the code.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 1, 2015

Thanks for taking a look @rkanter!

@bbrownz and @ben-roling, we should definitely decide if we want this to go in Oozie or Kite. I'd like to avoid adding it here only to deprecate it in favor of an Oozie version in a release or two. My vote is to get this into Oozie, since that seems like the right place for it to live, as with Flume and the DatasetSink. What do you guys think?

@ben-roling
Copy link
Copy Markdown

Thanks for the feedback @rkanter!

At first glance I thought it would be nice to have this included in Oozie itself as well but @bbrownz and I talked through it more and I'm not so sure.

The primary concern we had was with regard to the speed at which we could get changes pushed through the system. For example, if a consumer needs a fix to Kite they need to wait for a Kite release, an Oozie release after that, and then finally a distro release before they finally have it in their environment.

If the handler is only included in Kite things can work much more quickly. Once the new Kite version is released they just deploy it and it is done with no concern for when the next Oozie or distro release might happen.

All this said, there is nothing we could do to stop the Oozie devs from incorporating Kite URI support directly into their project if they wanted to. If they did, it could be difficult for consumers to override the Kite version if they needed to pick up a newer Kite version more quickly than they could get a new Oozie version.

Assuming we continue down a path of keeping the packaging separate from Oozie their is also the possibility we could do a documentation patch to Oozie such that Oozie doc makes mention of the Kite integration, making the functionality easier for users to find.

Thoughts?

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 2, 2015

That's a good point. We don't have a regular release schedule for Oozie; it's just sort of whenever someone feels like doing it. And we don't have frequent releases. If Kite wants to iterate faster on it, then it would make sense to put it in Kite. If we do that, I do like the idea of mentioning it in the Oozie docs.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 2, 2015

How much do we expect this component to evolve? It seems like we should be designing it so that it touches Kite's public API -- which I think is how it currently works -- and doesn't need to change much since it is just doing change detection based on URI and the signalling API. I know we've identified some improvements and opened issues for them, but shouldn't a short-term goal be to get this to the point where we don't need to change it often?

If we could just update the version of Kite on a cluster and have Oozie pick up the latest, then I think that would be preferable.

@ben-roling
Copy link
Copy Markdown

I'm not expecting this component specifically to change that much. I was thinking more about Kite core changing. Of course this component doesn't use the full surface area of the Kite API so there are only certain types of changes to Kite core that could affect the Oozie integration. The change would have to be something that affects Datasets.load() or Signalable.isReady() - maybe a bug fix or an enhancement or optimization of some sort.

If we could just update the version of Kite on a cluster and have Oozie pick up the latest, then I think that would be preferable.

Per your suggestion, we could take a hybrid approach where we include the KiteURIHandler class in Kite itself and do an Oozie patch that adds configuration to pick it up.

More specifically, the Oozie patch would:

  • update oozie-default.xml to add the new URI schema to URIHandler class mapping
  • include a default Kite version in the Oozie dependencies so things work out of the box
  • include a configuration mechanism to point to an alternate Kite installation path as a mechanism to pick up a newer/different Kite install

I don't off the top of my head know exactly how that would be implemented on the Oozie side but we could look into it if that direction is desirable.

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 2, 2015

That is another option. The code could live in Kite and Oozie just pulls it in like any other 3rd-party maven dependency during the build. We already parameterize a number of dependencies in the pom, so we could just add a "kite.version". Then, anyone who builds Oozie can easily switch versions by changing the pom, and anyone who already has Oozie can simply swap out the kite jar with a newer version. That leaves all of the code in Kite, while still making it work in Oozie out-of-the-box.

@ben-roling
Copy link
Copy Markdown

My hope is to avoid a solution where when a consumer wants a new Kite version before an Oozie release it requires rebuilding Oozie or modifying the files provided by the existing Oozie deployment in their cluster.

@rkanter - the way you describe sounds to me like you are suggesting a solution that would require deleting JARs from the existing Oozie deployment to get the new Kite version to go into effect.

My thought was slightly different in that we would add some mechanism to Oozie where the Kite dependencies live in some specific directory and an environment variable or configuration key of some sort can be set to point to a non-default directory to override the Kite install.

Ideally the mechanism would be such that in the CDH world a parcel could be used to apply a new Kite version to Oozie. If deleting files out of the original Oozie deployment is required I don't think a parcel would be able to do the job.

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 3, 2015

In CDH, all those jars would be symlinked. So, if you update Kite and restart Oozie, Oozie should automatically pick the new jars up. I was talking more from an Apache release perspective, with a tarball install. i.e. You have Oozie 4.1.0, and you want to update the Kite jars, you can just add the new Kite jars and redeploy the Oozie server. You'd have to do this anyway if Oozie didn't ship with any Kite jars, right?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just spotted this because it caused CI test failures. This should be the upstream Apache version of Oozie instead of the CDH one. We test against CDH in the cdh5 profile, so you can override the version to be 4.1.0-cdh5.4.2 there. (Looks like it is based on 4.1.0, which is why the test failed in the first place.)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, it looks like when I rebased and squashed I pulled in the update for the CDH5 profile to 5.4.2 (which included an update in the CDH5 profile to go to the 4.1.0 version of oozie).

This project will still build under the default and CDH4 profiles where "vers.oozie" would be 3.3.2 based (before the URI handler). So just "vers.oozie" wouldn't work.

Just to make sure I'm in sync, are you suggesting I just set this to 4.1.0 and drop the CDH portion?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the version in the cdh5 profile should be 4.1.0-cdh5.4.2. I don't mind what we use for upstream, but feel free to update it if that works better for you.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this commit should get us closer to what we want dbf34f9

It's still awkward in that the module always needs to build with at least oozie 4.x, and the CDH4 and default profiles both build with oozie 3.3.2. To help get around this I added a vers.oozie4 property to the data-oozie module, and extended the cdh5 profile to set vers.oozie4 to use the CDH version under that profile. This way we should still be using the CDH5 version when testing under that profile and the oozie 4 version under the other two profiles.

@ben-roling
Copy link
Copy Markdown

In CDH, all those jars would be symlinked. So, if you update Kite and restart Oozie, Oozie should automatically pick the new jars up.

Yeah, but I'm not sure you follow what I mean. What would your "update Kite" steps be? Let's pretend we add kite-data as a dependency of Oozie and in CDH5.5 I get this new version of Oozie installed into my environment as part of the standard CDH5.5 parcel. Now kite-data-1.1.jar is in /opt/cloudera/parcels/CDH-5.5.0/lib/oozie/libserver. Now, kite-data-1.1.1 comes out with a critical (to me) bug fix and I want to get it into my environment. Cloudera hasn't deemed this fix important enough so there is no CDH-5.5.1 parcel for me to pick it up from. How do I incorporate it?

Sure, I suppose I could go manually muck with the /opt/cloudera/parcels/CDH-5.5.0/lib/oozie/libserver directory on my system but I don't want to do that. What I was thinking was I wanted to create my own parcel that I can install with a fix. I guess I could create my own CDH-5.5.0-myfix parcel that repackages the whole CDH5.5 distribution but I wasn't wanting to go there. I want to be able to create a parcel that contains only the new Kite JARs and install that as an addition to the system on top of the CDH-5.5.0 parcel. I can do that but the problem is there is no way that I am aware (via a parcel) of for me to take the original Kite JARs off the Oozie classpath without repackaging all of Oozie or all of CDH. Even if I wanted (and I don't) to take the route of just repackaging all of Oozie I'm not yet familiar enough with the parcel system to know how I could do that either.

I was talking more from an Apache release perspective, with a tarball install. i.e. You have Oozie 4.1.0, and you want to update the Kite jars, you can just add the new Kite jars and redeploy the Oozie server. You'd have to do this anyway if Oozie didn't ship with any Kite jars, right?

Maybe it depends on what you mean by "redeploy the Oozie server". If Oozie didn't ship any changes related to this feature what a user would do to use the feature is change their oozie-site.xml, add the necessary JARs to /var/lib/oozie, and restart Oozie. They don't have to touch any of the content that was shipped by Oozie or in any way rebuild Oozie. I wouldn't consider that "redeploying the Oozie server" but you could be using a different definition of the word "deploy".

In the model I just described If there was a new release of Kite you needed you would delete the Kite-related JARs from /var/lib/oozie, put the new ones in, and restart Oozie.

The direction I was trying to suggest is such that instead of dumping the Kite JARs in /var/lib/oozie you would dump them wherever you like (e.g. /my/kite/jar/path) and then update some Oozie config to add /my/kite/jar/path to the classpath. When you want a new version of Kite you can either delete the old jars from /my/kite/jar/path and add the new ones or you can create a new /my/kite/jar/path/v2 folder and update the Oozie config to point to the new location.

With this model Oozie could ship default Kite JARs in some folder embedded in Oozie and add that directory to the classpath unless the configuration is overridden to point to another directory.

Am I starting to make more sense?

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 4, 2015

Cloudera hasn't deemed this fix important enough so there is no CDH-5.5.1 parcel for me to pick it up from. How do I incorporate it?
What I was thinking was I wanted to create my own parcel that I can install with a fix. I guess I could create my own CDH-5.5.0-myfix parcel that repackages the whole CDH5.5 distribution but I wasn't wanting to go there. I want to be able to create a parcel that contains only the new Kite JARs and install that as an addition to the system on top of the CDH-5.5.0 parcel.

I'm not familiar with the release process for Kite, but I was I don't know if we want to go down the rabbit-hole of specifically helping users doing custom things like patching and building their own version of Kite (unless that's common for Kite?). And as far as I know, when you build a Parcel, you have to include every component, not just the updated jars; so I don't think you can do that.

Now kite-data-1.1.jar is in /opt/cloudera/parcels/CDH-5.5.0/lib/oozie/libserver. Now, kite-data-1.1.1 comes out with a critical (to me) bug fix and I want to get it into my environment.

For CDH components, /opt/cloudera/parcels/CDH-5.5.0/lib/oozie/libserver doesn't have version numbers in the jar names. Instead, it uses symlinks to unversioned jars in that component's dir, which are actually symlinks that point to versioned jars. So, /opt/cloudera/parcels/CDH-5.5.0/lib/oozie/libserver would have kite-data.jar which points to ../../../lib/kite/lib/kite-data.jar (or something similar), which in-turn points to kite-data-1.1.jar. When you upgrade the Kite parcel, it should have a new unversioned symlink that points to kite-data-1.1.1.jar.

Maybe it depends on what you mean by "redeploy the Oozie server". If Oozie didn't ship any changes related to this feature what a user would do to use the feature is change their oozie-site.xml, add the necessary JARs to /var/lib/oozie, and restart Oozie.

Redeploying the Oozie server is only for tarball installs, which is what is done upstream. And /var/lib/oozie only works for CDH package or parcel installs. Upstream Oozie doesn't have a binary distribution. When you want to install that Oozie, you download the source tarball and compile it. There's then a setup script that you have to run, which lets you add additional jars to the WAR file before deploying it in Tomcat. In upstream Oozie, if you want to change or add additional jars, you have to re-run the setup script to redeploy Oozie. In CDH packages/parcels, we do some magic stuff with exploded WAR files where you don't have to do this, and /var/lib/oozie is symlinked into the deployed WAR "file" in Tomcat.

The direction I was trying to suggest is such that instead of dumping the Kite JARs in /var/lib/oozie you would dump them wherever you like (e.g. /my/kite/jar/path) and then update some Oozie config to add /my/kite/jar/path to the classpath.

Because of Tomcat, Oozie doesn't have lots of control over manipulating it's classpath. Currently, we've only made classes configurable in oozie-site (e.g. Here's the list of classes to load on startup, here's the list of URIHandler classes, etc), and Oozie just assumes their already on the classpath. And for that to happen, the jars have to be in the right place as I described in my previous paragraph.

Another option is to only include the Kite jar with the CDH version of Oozie, and require users of the upstream version of Oozie download it manually. Mixing CDH versions is unsupported, so this shouldn't be a problem; and again, the unversioned symlinks help here anyway. We currently do this with some of the optional dependencies in Oozie, so there's precedent for that.

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 4, 2015

By the way, I'll be on vacation until Tuesday with limited access, so you probably won't get another reply from me until then.

@bbrownz bbrownz force-pushed the CDK-476_data_oozie branch from c743837 to dbf34f9 Compare June 4, 2015 13:19
@ben-roling
Copy link
Copy Markdown

Thanks for all the feedback and extra information @rkanter. I guess my thoughts were probably mostly rooted in my organization's current usage of kite in a mentality where kite is not a distribution-provided library. We're still mostly on CDH4 where it seems to me kite is not distribution-provided. Now that I am looking more closely it seems that kite is a distribution-provided library in at least recent CDH5 parcels. For example, I have a Quickstart VM with a CDH5.3.3 parcel on it at the moment and I see Kite is included inside it.

In our non-distribution-provided usage so far we have just picked up Kite versions as Kite project releases them (independent from CDH releases). I have not even thought to look to see the relationship between Kite releases and CDH releases. For example, what is the average time between a Kite release and a CDH parcel release that follows it? If those generally happen in tandem then clearly my whole thinking was wrong as there would always be a CDH parcel release I could pick up to get the change.

To be clear, my goal wasn't to try to support use cases of users running forks of Kite. I was targeting use cases of official Kite releases that hadn't gotten into the distribution yet. Like I indicated in my previous paragraph, perhaps the idea of Kite releases that aren't in the distribution yet is fallacy? Maybe such things don't really exist or only exist for very short periods of time.

I should also note that I am not yet as familiar as I'd like to be with the parcel concepts in CDH. I understand there is a core CDH parcel and then there are some other additive parcels you can install on top (e.g. LZO support).

I was talking to a colleague yesterday and he pointed me to this thread that helped me understand things a bit better. Parcels are expected to be disjoint and there isn't really a reliable mechanism to override something in one parcel with something else in another parcel.

I guess the idea I had before was kind of like the idea of having a parcel that overrides the kite from the CDH parcel but not exactly. I was thinking more like it was something that adds an additional Kite version to the system and then allows an Oozie configuration option that allows the administrator to use it just for Oozie. That said, the idea doesn't really make sense when you think of kite as a distribution-provided dependency across the platform (not just in the context of Oozie). If I wanted Oozie to use a new Kite version I would most likely want the whole platform to use the new Kite version. At that point what I would want would be more directly like the idea of having a parcel that overrides the CDH provided cross-platform Kite library, which is something the current parcel system doesn't support.

In the end I guess it appears to me the correct way to get a new Kite version in the context of CDH is to get a whole new CDH parcel and presumably just to wait for Cloudera to release one. Again, I haven't yet checked the relationship between Kite releases and unsolicited CDH parcel releases but if there is too big of a gap there then of course the consumer would try to coerce Cloudera to perform and unscheduled parcel packaging. It's now my understanding that it isn't terribly uncommon for Cloudera to prepare an unscheduled parcel for a customer that needs some fix before the fix is scheduled to go in a regular CDH update release.

So, I’m giving up on the idea of the configuration option being added to Oozie. We still have a choice to make about how to incorporate these changes into Oozie:

  • All the code goes into Kite and we add nothing except a documentation reference to Oozie
  • The kite-oozie module stays in Kite and we add Kite dependencies and ooze-default.xml update to Oozie
  • The kite-oozie module moves to oozie (likely named oozie-kite instead since now it is an oozie module instead of a kite module) such that there is nothing related to the URIHandler in Kite itself at all except maybe a documentation reference

I presume @rdblue and @rkanter you both prefer the 2nd or 3rd option over the 1st? Which do you prefer of the 2nd and 3rd options?

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 11, 2015

@ben-roling, I'm inclined to keep this in Kite for now and update Oozie documentation. It looks like Oozie releases are rare and we'll be able to progress much more quickly with Kite's more regular release timeline. That said, if we end up not changing anything in this for a few months, I think we should revisit the issue and move it to Oozie because we will have confirmed that it won't change much as long as it uses Kite's public API.

As far as CDH releases go, we do keep up with Kite features in those releases. CDH 5.4.x has Kite 1.0.0. Since Kite is both a client library, you can either use it in your code (whatever version you like) or use it through components that aren't libraries. Kite shipped with CDH is primarily for components like Flume that use Kite internally, but also for customers that prefer to pick up changes more slowly and not have to update their software with the upstream project.

Next steps: let's get this merged! Is there anything that needs to be fixed before adding this to Kite?

@bbrownz bbrownz force-pushed the CDK-476_data_oozie branch from 32fdc72 to 7498953 Compare June 11, 2015 21:45
@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented Jun 11, 2015

I don't believe there is anything left to fix. I pushed up a new rebased version. Should be ready to go if everything looks good on the build.

@bbrownz bbrownz force-pushed the CDK-476_data_oozie branch from 7498953 to 32e75b2 Compare June 12, 2015 13:32
@ben-roling
Copy link
Copy Markdown

Thanks for the feedback @rdblue - it will be nice to finally have a released form of this functionality!

We can certainly revisit the possibility of moving this over to the Oozie project in the future.

@rkanter
Copy link
Copy Markdown

rkanter commented Jun 12, 2015

I've created https://issues.apache.org/jira/browse/OOZIE-2269 to add the docs to Oozie. If someone here wants to do it, I can review it.

Given that CDH releases ship with a single version of Kite (and the user isn't expected to update it without updating CDH), I'll probably look into having the kite-oozie jars in Oozie automatically in just CDH; that will save those users some extra steps.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 12, 2015

@rkanter, that sounds good to me.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 12, 2015

@bbrownz: local tests are failing because the column projection API was merged before this. Sorry! Could you update?

@bbrownz bbrownz force-pushed the CDK-476_data_oozie branch from 32e75b2 to c6bfa7a Compare June 12, 2015 19:13
@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented Jun 12, 2015

Sure, pushed up a quick rebase against master. Local tests looked good on the default profile.

@bbrownz
Copy link
Copy Markdown
Contributor Author

bbrownz commented Jun 12, 2015

It looks like the CDH5 profile build failed on the kite-data-flume module with the native thread error. It's likely unrelated but it might be worth kicking off another travis build just to be sure.

java.lang.OutOfMemoryError: unable to create new native thread

@ben-roling
Copy link
Copy Markdown

@rkanter - thanks for logging OOZIE-2269 - I think @bbrownz or I could probably get something for you to review on that.

If you can also get the update into CDH to include kite-oozie automatically that sounds nice.

@rdblue
Copy link
Copy Markdown
Contributor

rdblue commented Jun 13, 2015

Merged as d540ead0

It's great to add support for Oozie, thanks for making it happen @bbrownz and @ben-roling!

@rdblue rdblue closed this Jun 13, 2015
@bbrownz bbrownz deleted the CDK-476_data_oozie branch June 15, 2015 14:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants