CDK-476 oozie UriHandler for Kite URIs#364
Conversation
|
Created a PR to kite-examples showing a use case for the URIHandler, kite-sdk/kite-examples#26 |
|
Hi @bbrownz, sorry it is taking me so long to get to this. I'm not very familiar with Oozie, but I talked with someone today that was interested and will probably have a look. Otherwise I'll follow up on this next week. Thanks for being patient! |
|
No worries! The biggest questions I had posting this in were |
There was a problem hiding this comment.
Maybe this should attempt to read the signal files and catch any IOExceptions that are thrown. Then this would have at least basic support.
There was a problem hiding this comment.
Were you thinking along the route of reading the signal files directly (so we'd be able to flex the configuration and user used for the FileSystem access)? That seems like it would bind us very directly to the current implementation, although it would work.
There was a problem hiding this comment.
No, I meant try to do the operation like you do here, but if it fails with a permissions-related IOException, return false instead of propagating the exception.
There was a problem hiding this comment.
Ah, I follow, thats an interesting trade off.
I think returning false would make the most sense if we were approaching it from the perspective that the (oozie process) user without permissions shouldn't know that the dataset exists or not. Propagating the exception through helps to reveal the issue that the permissions are blocking the operation they were hoping would work.
I think propagating the exception makes the most sense in that alert a consumer that their permission setup isn't going to work with the URI handler as is. If we were to create documentation around the usage we would definitely want to note that case (outside of the example project I didn't create any, and perhaps should have).
There was a problem hiding this comment.
There is precedent for catching IOException and rethrowing as HadoopAccessorException in FSURIHandler. It seems reasonable to me to be consistent with that.
There was a problem hiding this comment.
I added a catch in the exists method for a DatasetException to re-throw that exception as a HadoopAccessorException. I'd considered the more specific DatasetIOException that IOExceptions are wrapped in but this seemed like it would cover our bases a little better. It also occurred to me that HadoopAccessorException might be less appropriate in the future (for example if we added a signal implementation that was HBase specific) but that felt like I had started to nitpick myself. I'm open to further opinions though 😄
|
I have a couple of minor comments, but this looks good to me overall. Have you tested this with a cluster? |
|
@ben-roling, could you review this as well? |
So far we've tested this on 5.2 and 5.3 with quickstart VMs via the example (kite-sdk/kite-examples#26). We're in progress doing some testing now with one of our dev clusters that has Kerberos enabled. |
|
Sounds good! I'd like someone more familiar with Oozie to take a look, but I'm happy with this going in. Since Ben is a committer, I think it makes sense to get his +1 before merging. Thanks for working on this, @bbrownz! |
|
@bbrownz works with me and as such I had already reviewed most of this prior to the PR being posted. I'll take another quick pass through and then I can formally document my +1. |
There was a problem hiding this comment.
This is a bit confusing to me. Why is this ant task necessary? I presume this is copy/paste from one of the other sub-projects. It looks like it is in kite-data-hbase, kite-data-hive, kite-data-s3, and kite-hadoop-compatibility.
There was a problem hiding this comment.
It's because the module does not have any public classes so there's no javadoc for it. This task generates an empty javadoc JAR, which is needed for Maven central.
There was a problem hiding this comment.
Ok, that explanation makes sense in the context of the other projects but this project does have public classes and with this ant task removed it does still generate a javadoc JAR.
I think maybe this task just needs to be removed from this POM. Alternatively we can update the package excludes filter for the javadoc plugin (defined in the parent POM) to exclude org.kitesdk.data.oozie directly indicating we choose not to publish javadoc. I don't expect anyone is really going to be reading the javadoc as it's not really a Kite consumer oriented project.
@tomwhite or @rdblue - do you have an opinion about whether or not we should publish the javadoc for these classes?
There was a problem hiding this comment.
I don't think we should publish javadoc for this module as it doesn't contain classes that users will program to in their programs. So we should update the package excludes filters and leave the ant task as it is in this patch.
There was a problem hiding this comment.
I agree with Tom. Most of the code here is implementing a public API documented in Oozie, right?
There was a problem hiding this comment.
Correct and I agree it would be fine not to publish javadoc. These classes will not be consumed by "users" - just by Oozie.
There was a problem hiding this comment.
Added the exclusion to the javadoc plugin 32186c3
|
I think there are just 3 things left with this:
Once @bbrownz implements the trivial javadoc plugin config change I am +1. I don't care that much about the IOException handling thing. It could be changed to rethrow as HadoopAccessorException or left alone and either way would be fine by me. With CDK-1007 I have provided a sub-task to defer the getContext() change. I think we could merge this without it. |
|
Addressed @ben-roling's comments from above. I think it's best to leave the implementation of the getContext() method to the optimization in the CDK-1007 issue when the usage would really come into play. I everything looks good here I can squash this down into a single change before the merge. |
|
Squash away! |
e68be98 to
11146a9
Compare
|
Rebased and squashed, should be ready after the travis build goes through. |
There was a problem hiding this comment.
in Oozie, we use our XLog class for logging; the methods should be the same though. For example:
private static XLog LOG = XLog.getLog(KiteURIHandler.class);
There was a problem hiding this comment.
good call, updated to XLog here: c743837
|
Hi, I'm one of the Oozie devs. I took a look at this, and it looks fine to me. I made a minor comment about logging. That said, the Yahoo! Oozie devs are more familiar with the URIHandler code; they're the ones who generalized it and added support for HCat URIs. What do you think about including this directly in Oozie? That would make things easier for users because they wouldn't have to add jars into the Oozie server. And we can include the example Coordinator too. I know that Ben was working on OOZIE-1829, though I guess that's no longer needed for now. Anyway, I think putting this in Oozie would be a great integration story, and I can also ask one of the Yahoo! Oozie devs to review the code. |
|
Thanks for taking a look @rkanter! @bbrownz and @ben-roling, we should definitely decide if we want this to go in Oozie or Kite. I'd like to avoid adding it here only to deprecate it in favor of an Oozie version in a release or two. My vote is to get this into Oozie, since that seems like the right place for it to live, as with Flume and the DatasetSink. What do you guys think? |
|
Thanks for the feedback @rkanter! At first glance I thought it would be nice to have this included in Oozie itself as well but @bbrownz and I talked through it more and I'm not so sure. The primary concern we had was with regard to the speed at which we could get changes pushed through the system. For example, if a consumer needs a fix to Kite they need to wait for a Kite release, an Oozie release after that, and then finally a distro release before they finally have it in their environment. If the handler is only included in Kite things can work much more quickly. Once the new Kite version is released they just deploy it and it is done with no concern for when the next Oozie or distro release might happen. All this said, there is nothing we could do to stop the Oozie devs from incorporating Kite URI support directly into their project if they wanted to. If they did, it could be difficult for consumers to override the Kite version if they needed to pick up a newer Kite version more quickly than they could get a new Oozie version. Assuming we continue down a path of keeping the packaging separate from Oozie their is also the possibility we could do a documentation patch to Oozie such that Oozie doc makes mention of the Kite integration, making the functionality easier for users to find. Thoughts? |
|
That's a good point. We don't have a regular release schedule for Oozie; it's just sort of whenever someone feels like doing it. And we don't have frequent releases. If Kite wants to iterate faster on it, then it would make sense to put it in Kite. If we do that, I do like the idea of mentioning it in the Oozie docs. |
|
How much do we expect this component to evolve? It seems like we should be designing it so that it touches Kite's public API -- which I think is how it currently works -- and doesn't need to change much since it is just doing change detection based on URI and the signalling API. I know we've identified some improvements and opened issues for them, but shouldn't a short-term goal be to get this to the point where we don't need to change it often? If we could just update the version of Kite on a cluster and have Oozie pick up the latest, then I think that would be preferable. |
|
I'm not expecting this component specifically to change that much. I was thinking more about Kite core changing. Of course this component doesn't use the full surface area of the Kite API so there are only certain types of changes to Kite core that could affect the Oozie integration. The change would have to be something that affects Datasets.load() or Signalable.isReady() - maybe a bug fix or an enhancement or optimization of some sort.
Per your suggestion, we could take a hybrid approach where we include the KiteURIHandler class in Kite itself and do an Oozie patch that adds configuration to pick it up. More specifically, the Oozie patch would:
I don't off the top of my head know exactly how that would be implemented on the Oozie side but we could look into it if that direction is desirable. |
|
That is another option. The code could live in Kite and Oozie just pulls it in like any other 3rd-party maven dependency during the build. We already parameterize a number of dependencies in the pom, so we could just add a "kite.version". Then, anyone who builds Oozie can easily switch versions by changing the pom, and anyone who already has Oozie can simply swap out the kite jar with a newer version. That leaves all of the code in Kite, while still making it work in Oozie out-of-the-box. |
|
My hope is to avoid a solution where when a consumer wants a new Kite version before an Oozie release it requires rebuilding Oozie or modifying the files provided by the existing Oozie deployment in their cluster. @rkanter - the way you describe sounds to me like you are suggesting a solution that would require deleting JARs from the existing Oozie deployment to get the new Kite version to go into effect. My thought was slightly different in that we would add some mechanism to Oozie where the Kite dependencies live in some specific directory and an environment variable or configuration key of some sort can be set to point to a non-default directory to override the Kite install. Ideally the mechanism would be such that in the CDH world a parcel could be used to apply a new Kite version to Oozie. If deleting files out of the original Oozie deployment is required I don't think a parcel would be able to do the job. |
|
In CDH, all those jars would be symlinked. So, if you update Kite and restart Oozie, Oozie should automatically pick the new jars up. I was talking more from an Apache release perspective, with a tarball install. i.e. You have Oozie 4.1.0, and you want to update the Kite jars, you can just add the new Kite jars and redeploy the Oozie server. You'd have to do this anyway if Oozie didn't ship with any Kite jars, right? |
kite-data/kite-data-oozie/pom.xml
Outdated
There was a problem hiding this comment.
I just spotted this because it caused CI test failures. This should be the upstream Apache version of Oozie instead of the CDH one. We test against CDH in the cdh5 profile, so you can override the version to be 4.1.0-cdh5.4.2 there. (Looks like it is based on 4.1.0, which is why the test failed in the first place.)
There was a problem hiding this comment.
Ah, yeah, it looks like when I rebased and squashed I pulled in the update for the CDH5 profile to 5.4.2 (which included an update in the CDH5 profile to go to the 4.1.0 version of oozie).
This project will still build under the default and CDH4 profiles where "vers.oozie" would be 3.3.2 based (before the URI handler). So just "vers.oozie" wouldn't work.
Just to make sure I'm in sync, are you suggesting I just set this to 4.1.0 and drop the CDH portion?
There was a problem hiding this comment.
I think the version in the cdh5 profile should be 4.1.0-cdh5.4.2. I don't mind what we use for upstream, but feel free to update it if that works better for you.
There was a problem hiding this comment.
I think this commit should get us closer to what we want dbf34f9
It's still awkward in that the module always needs to build with at least oozie 4.x, and the CDH4 and default profiles both build with oozie 3.3.2. To help get around this I added a vers.oozie4 property to the data-oozie module, and extended the cdh5 profile to set vers.oozie4 to use the CDH version under that profile. This way we should still be using the CDH5 version when testing under that profile and the oozie 4 version under the other two profiles.
Yeah, but I'm not sure you follow what I mean. What would your "update Kite" steps be? Let's pretend we add kite-data as a dependency of Oozie and in CDH5.5 I get this new version of Oozie installed into my environment as part of the standard CDH5.5 parcel. Now kite-data-1.1.jar is in Sure, I suppose I could go manually muck with the
Maybe it depends on what you mean by "redeploy the Oozie server". If Oozie didn't ship any changes related to this feature what a user would do to use the feature is change their oozie-site.xml, add the necessary JARs to /var/lib/oozie, and restart Oozie. They don't have to touch any of the content that was shipped by Oozie or in any way rebuild Oozie. I wouldn't consider that "redeploying the Oozie server" but you could be using a different definition of the word "deploy". In the model I just described If there was a new release of Kite you needed you would delete the Kite-related JARs from /var/lib/oozie, put the new ones in, and restart Oozie. The direction I was trying to suggest is such that instead of dumping the Kite JARs in /var/lib/oozie you would dump them wherever you like (e.g. /my/kite/jar/path) and then update some Oozie config to add /my/kite/jar/path to the classpath. When you want a new version of Kite you can either delete the old jars from /my/kite/jar/path and add the new ones or you can create a new /my/kite/jar/path/v2 folder and update the Oozie config to point to the new location. With this model Oozie could ship default Kite JARs in some folder embedded in Oozie and add that directory to the classpath unless the configuration is overridden to point to another directory. Am I starting to make more sense? |
I'm not familiar with the release process for Kite, but I was I don't know if we want to go down the rabbit-hole of specifically helping users doing custom things like patching and building their own version of Kite (unless that's common for Kite?). And as far as I know, when you build a Parcel, you have to include every component, not just the updated jars; so I don't think you can do that.
For CDH components,
Redeploying the Oozie server is only for tarball installs, which is what is done upstream. And
Because of Tomcat, Oozie doesn't have lots of control over manipulating it's classpath. Currently, we've only made classes configurable in oozie-site (e.g. Here's the list of classes to load on startup, here's the list of URIHandler classes, etc), and Oozie just assumes their already on the classpath. And for that to happen, the jars have to be in the right place as I described in my previous paragraph. Another option is to only include the Kite jar with the CDH version of Oozie, and require users of the upstream version of Oozie download it manually. Mixing CDH versions is unsupported, so this shouldn't be a problem; and again, the unversioned symlinks help here anyway. We currently do this with some of the optional dependencies in Oozie, so there's precedent for that. |
|
By the way, I'll be on vacation until Tuesday with limited access, so you probably won't get another reply from me until then. |
c743837 to
dbf34f9
Compare
|
Thanks for all the feedback and extra information @rkanter. I guess my thoughts were probably mostly rooted in my organization's current usage of kite in a mentality where kite is not a distribution-provided library. We're still mostly on CDH4 where it seems to me kite is not distribution-provided. Now that I am looking more closely it seems that kite is a distribution-provided library in at least recent CDH5 parcels. For example, I have a Quickstart VM with a CDH5.3.3 parcel on it at the moment and I see Kite is included inside it. In our non-distribution-provided usage so far we have just picked up Kite versions as Kite project releases them (independent from CDH releases). I have not even thought to look to see the relationship between Kite releases and CDH releases. For example, what is the average time between a Kite release and a CDH parcel release that follows it? If those generally happen in tandem then clearly my whole thinking was wrong as there would always be a CDH parcel release I could pick up to get the change. To be clear, my goal wasn't to try to support use cases of users running forks of Kite. I was targeting use cases of official Kite releases that hadn't gotten into the distribution yet. Like I indicated in my previous paragraph, perhaps the idea of Kite releases that aren't in the distribution yet is fallacy? Maybe such things don't really exist or only exist for very short periods of time. I should also note that I am not yet as familiar as I'd like to be with the parcel concepts in CDH. I understand there is a core CDH parcel and then there are some other additive parcels you can install on top (e.g. LZO support). I was talking to a colleague yesterday and he pointed me to this thread that helped me understand things a bit better. Parcels are expected to be disjoint and there isn't really a reliable mechanism to override something in one parcel with something else in another parcel. I guess the idea I had before was kind of like the idea of having a parcel that overrides the kite from the CDH parcel but not exactly. I was thinking more like it was something that adds an additional Kite version to the system and then allows an Oozie configuration option that allows the administrator to use it just for Oozie. That said, the idea doesn't really make sense when you think of kite as a distribution-provided dependency across the platform (not just in the context of Oozie). If I wanted Oozie to use a new Kite version I would most likely want the whole platform to use the new Kite version. At that point what I would want would be more directly like the idea of having a parcel that overrides the CDH provided cross-platform Kite library, which is something the current parcel system doesn't support. In the end I guess it appears to me the correct way to get a new Kite version in the context of CDH is to get a whole new CDH parcel and presumably just to wait for Cloudera to release one. Again, I haven't yet checked the relationship between Kite releases and unsolicited CDH parcel releases but if there is too big of a gap there then of course the consumer would try to coerce Cloudera to perform and unscheduled parcel packaging. It's now my understanding that it isn't terribly uncommon for Cloudera to prepare an unscheduled parcel for a customer that needs some fix before the fix is scheduled to go in a regular CDH update release. So, I’m giving up on the idea of the configuration option being added to Oozie. We still have a choice to make about how to incorporate these changes into Oozie:
I presume @rdblue and @rkanter you both prefer the 2nd or 3rd option over the 1st? Which do you prefer of the 2nd and 3rd options? |
|
@ben-roling, I'm inclined to keep this in Kite for now and update Oozie documentation. It looks like Oozie releases are rare and we'll be able to progress much more quickly with Kite's more regular release timeline. That said, if we end up not changing anything in this for a few months, I think we should revisit the issue and move it to Oozie because we will have confirmed that it won't change much as long as it uses Kite's public API. As far as CDH releases go, we do keep up with Kite features in those releases. CDH 5.4.x has Kite 1.0.0. Since Kite is both a client library, you can either use it in your code (whatever version you like) or use it through components that aren't libraries. Kite shipped with CDH is primarily for components like Flume that use Kite internally, but also for customers that prefer to pick up changes more slowly and not have to update their software with the upstream project. Next steps: let's get this merged! Is there anything that needs to be fixed before adding this to Kite? |
32fdc72 to
7498953
Compare
|
I don't believe there is anything left to fix. I pushed up a new rebased version. Should be ready to go if everything looks good on the build. |
7498953 to
32e75b2
Compare
|
Thanks for the feedback @rdblue - it will be nice to finally have a released form of this functionality! We can certainly revisit the possibility of moving this over to the Oozie project in the future. |
|
I've created https://issues.apache.org/jira/browse/OOZIE-2269 to add the docs to Oozie. If someone here wants to do it, I can review it. Given that CDH releases ship with a single version of Kite (and the user isn't expected to update it without updating CDH), I'll probably look into having the kite-oozie jars in Oozie automatically in just CDH; that will save those users some extra steps. |
|
@rkanter, that sounds good to me. |
|
@bbrownz: local tests are failing because the column projection API was merged before this. Sorry! Could you update? |
32e75b2 to
c6bfa7a
Compare
|
Sure, pushed up a quick rebase against master. Local tests looked good on the default profile. |
|
It looks like the CDH5 profile build failed on the kite-data-flume module with the native thread error. It's likely unrelated but it might be worth kicking off another travis build just to be sure.
|
|
Merged as d540ead0 It's great to add support for Oozie, thanks for making it happen @bbrownz and @ben-roling! |
An Oozie URIHandler for KiteURIs. This allows KiteURIs to be used for data input events. Particularly when chaining together workflows based on the availability of an upstream dataset (a populated view in this case).
Uses the signal ready concept implemented in CDK-451 as an alternative to the traditional "_SUCCESS" file.