A Quick Example
For folks who like to "see it go" to get a feel for something new, there is a pre-baked example of processing text documents (the complete works of William Shakespeare). This example is maintained here:
https://github.com/nsoft/jesterj/tree/master/code/examples/shakespeare
If you experience difficulty or errors please report them as issues.
Of course the example above is a toy example. You will likely look at it and wonder which parts are significant, so here's a link to an example of how to convert the Shakespeare example into your own ingestion project. This example does the following:
- Starts with the Shakespeare example (astute observers may notice I accidentally made one minor change before checking that in, but "starting from Shakespeare" is a supported and intentional bootstrap methodology for initiating a project of your own)
- Indexes the Solr Ref Guide instead of Shakespeare's works
- Includes a custom `DocumentProcessor` that you might want to write for this use case. (The ref guide has a `<nav>` element on every page, and searches that hit terms from that `<nav>` element return every page, which is horrible precision from a user perspective)
- Demonstrates the use of 3rd party libraries (for parsing the ref guide HTML to find the `<nav>` elements)
- Demonstrates the three simple steps for adding a step to a JesterJ ingestion plan.
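
To make the `<nav>` problem concrete, here is a minimal sketch of the kind of clean-up such a processor performs. It is not the code from the linked example: that example uses a third party HTML parsing library inside a `DocumentProcessor`, whereas this stand-in uses only the JDK's regular expressions so that it runs without extra dependencies. The class and method names (`NavStripper`, `stripNav`) are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only. A real processor would more
// likely use a proper HTML parser, as the linked example does.
class NavStripper {
    // Matches one <nav ...>...</nav> block. DOTALL lets '.' span line breaks,
    // and the reluctant quantifier stops at the first closing tag.
    private static final Pattern NAV = Pattern.compile(
            "<nav\\b[^>]*>.*?</nav>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the HTML with every (non-nested) <nav> block removed.
    static String stripNav(String html) {
        return NAV.matcher(html).replaceAll("");
    }
}
```

A regular expression like this does not handle nested `<nav>` elements or malformed markup, which is one reason a real HTML parser is the better choice in a production plan.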
One obvious worry in any content ingestion project is dependency conflicts, and by its very nature JesterJ has an enormous set of dependencies. Consider that we depend on all of Cassandra, Solr, and Tika, each of which is a dependency monster. JesterJ wouldn't be very useful if it forced you into closely constrained versions of libraries that you might want to use, or forced you to update to the latest Solr client (for example). JesterJ has therefore developed a class loading solution that allows you to specify and use either our dependencies or your custom dependencies.
For code loaded from the jar you supply, a custom class loader is used and it only supplies the native JesterJ dependencies if it can't find the dependency among what you are supplying. JesterJ core code never sees anything loaded specifically for you so this should generally avoid conflicts with JesterJ. The trick to this is that you will be packaging your code using uno-jar, which is a class loader based fat-jar packaging system (a fork of OneJar that we also maintain). When you invoke JesterJ you will tell us where to find the uno-jar with your plan and any custom code, and every dependency you specified in your ingestion project. This is conceptually similar to a *.war file, but without the need for any xml descriptors, and with no "special" api jars you have to carefully exclude.
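
The lookup order described above (your jar first, JesterJ's own dependencies only as a fallback) is commonly called child-first or parent-last delegation. The following is a minimal sketch of that general technique using only the JDK. It is not JesterJ's or uno-jar's actual implementation, and the class name `ChildFirstClassLoader` is hypothetical.

```java
import java.net.URL;
import java.net.URLClassLoader;

// Hypothetical illustration of child-first delegation, not JesterJ code.
class ChildFirstClassLoader extends URLClassLoader {

    ChildFirstClassLoader(URL[] userJars, ClassLoader parent) {
        super(userJars, parent);
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve)
            throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            // Reuse a class this loader has already defined.
            Class<?> c = findLoadedClass(name);
            if (c == null) {
                if (name.startsWith("java.")) {
                    // Core JDK classes must always come from the parent chain.
                    c = super.loadClass(name, false);
                } else {
                    try {
                        // Look in the user-supplied jars first...
                        c = findClass(name);
                    } catch (ClassNotFoundException e) {
                        // ...and fall back to the parent only when not found.
                        c = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(c);
            }
            return c;
        }
    }
}
```

The default JDK behaviour is the reverse (ask the parent first), which is why a custom loader is needed at all. Because the parent loader never delegates downward, classes loaded this way stay invisible to code loaded by the parent, matching the isolation described above.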
Thus, if you write a custom class against Solr 8, your fat-jar will contain Solr 8's dependencies and your code will ignore the Solr 9 dependencies available from JesterJ. If you need to send to an ancient Solr, the recommended procedure is to copy and modify our SendToSolrProcessor, and possibly one of its subclasses (such as SendToSolrCloudZkProcessor), in your project and use that. While this is not out-of-the-box support for arbitrary Solr versions, you will have a solid starting point and a narrowly scoped set of changes to implement.