twagoo edited this page Dec 22, 2014 · 3 revisions

Project: YAMS (Yet another metadata search)

YAMS is a CMDI/IMDI archive search tool based on Arbil data nodes, developed at The Language Archive. It has two parts: the crawler, which is run from the command line, and the web application, which runs in Tomcat. The crawler writes directly to a local BaseX database. When the crawl is complete, the resulting database is moved into the running BaseX instance. The web application can select from the available databases and needs only read-only access to them.


People

  • Peter Withers (original developer)
  • Twan Goosen

Dependencies

YAMS depends (via Maven) on the Plugins framework and Arbil metadata reading and writing.

It also depends on a running BaseX XML database instance, and a connection to a 'Corpus Structure 2' provider for corpus browsing.

See the Design section for an overview.

Crawler usage

The crawler needs to be run via the command line. It has the following command line options:

-a,--append                   Restart crawling adding missing documents.
-ams,--amspermissions <arg>   REST service URL where permissions
                              information from AMS can be obtained (default:
                              https://lux16.mpi.nl/ds/yams-cs-connector/rest/node?id=).
-c,--crawl                    Crawl the provided url or the default url
                              if not otherwise specified.
-d,--drop                     Drop the existing data and recrawl. This
                              option implies the c option.
-db,--dbname <arg>            Name of the database to use (default:
                              YAMS-DB).
-f,--facets                   Preload the facets from the existing
                              crawled data.
-l,--limit <arg>              Limit crawling to URLs which contain the
                              provided string (default:
                              http://lux16.mpi.nl/).
-n,--number <arg>             Number of documents to insert (default: 90).
-p,--password <arg>           Database password (default: admin).
-s,--server <arg>             Database server URL or file path (when a
                              file path is provided it is used as the
                              local basex directory via the java bindings
                              rather than the REST interface); default is
                              to use the unmodified local basex directory.
-t,--target <arg>             Target URL of the start documents to crawl (default:
                              http://hdl.handle.net/11142/00-74BB450B-4E5E-4EC7-B043-F444C62DB5C0).
                              This option implies the c option.
-u,--user <arg>               Database user name (default: admin).
-x,--debug                    Display debug output.

In general, you will want to use the -a option to make the crawler recursively look up all links found in the root node. Use -n to limit the number of nodes to crawl in one session. The next time the crawler runs, it will continue processing links encountered in previous crawler sessions.
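
The batch-wise workflow described above could be scripted roughly as follows. This is only a sketch under assumptions: the jar file name and the BaseX directory are placeholders (the build produces a 'jar-with-dependencies' jar in crawler/target, but its exact name is not given here), and DRY_RUN is a switch of this script, not a crawler option.

```shell
#!/bin/sh
# Assumed locations; adjust to your build output and BaseX directory.
CRAWLER_JAR="${CRAWLER_JAR:-crawler/target/yams-crawler-jar-with-dependencies.jar}"
BASEX_DIR="${BASEX_DIR:-$HOME/basex/}"

# Run one crawler session that processes at most $1 documents.
# With DRY_RUN=1 (a switch of this script only) the command is printed
# instead of executed.
run_session() {
    set -- java -jar "$CRAWLER_JAR" -s "$BASEX_DIR" -c -a -n "$1"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Run $1 sessions of at most $2 documents each; stop at the first failure.
run_sessions() {
    remaining=$1
    while [ "$remaining" -gt 0 ]; do
        run_session "$2" || return 1
        remaining=$((remaining - 1))
    done
}
```

With DRY_RUN set to 1, calling run_sessions 3 500 only prints the command line three times, which is a cheap way to check the options before starting a real crawl.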

Design

YAMS ecosystem

Dependencies between YAMS and common external components

Thick lines = public interface; blue line = GUI

YAMS modules

Dependencies between YAMS modules

Thick lines = public interface; black line = REST service; blue line = GUI; red line = server side application

YAMS server architecture

Runtime connections between YAMS components:

Lucidchart source

Development & testing

Quick start

To start developing in the trunk, check out/clone the following projects and run mvn clean install on each of them:

Crawler

  • Create a basex directory in your home directory

  • Run/debug the 'YAMS crawler' project from Netbeans (some default options are defined in nbactions.xml)

    • OR run the YAMS 'jar-with-dependencies.jar' file generated in crawler/target with the following options: -Xmx2048m -s ~/basex/ -c -a -n 1000 -f
  • This will start crawling from the default start URL (currently hdl:11142/00-74BB450B-4E5E-4EC7-B043-F444C62DB5C0), follow links (-a), process a maximum of 1000 files (-n) and create statistics (-f). It writes some output to stdout and also creates a yams-crawler.log file in the working directory with more information.

  • Add the -x option to enable debugging output

BaseX connector

Running the BaseX connector REST service against an existing BaseX database HTTP server

Assuming such a service is up and running at tlatest06 on port 8984:

  • Modify the parameters in context.xml as follows (globally or in the META-INF directory of the yams-basex-connector project before deployment):

    <Parameter name="basexRestUrl" override="false" value="http://tlatest06:8984"/>
    <Parameter name="basexUser" override="false" value="admin"/>
    <Parameter name="basexPass" override="false" value="admin"/>
    
  • Build and run the yams-basex-connector project in Tomcat

  • Browse to the 'rest' servlet within the deployed application, e.g. by going to http://localhost:8080/yams-basex-connector/rest/. This should show an HTML listing of the available databases and links to info (as JSON) for each database

Running the BaseX connector REST service against a local database

Assuming a database already exists on the filesystem (i.e. the crawler has been executed) in location /Users/me/basex:

  • Start a local BaseX REST server by running the following command in the 'bin' directory of BaseX: ./basexhttp -X -Dorg.basex.path=/Users/me/basex -l

  • A REST service should now be running locally on port 8984, test this by browsing to http://localhost:8984/rest/

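  • Modify the parameters in context.xml as in the previous section (globally or in the META-INF directory of the yams-basex-connector project before deployment), but with basexRestUrl pointing to the local server, presumably http://localhost:8984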

  • Build and run the yams-basex-connector project in Tomcat

  • Browse to the 'rest' servlet within the deployed application, e.g. by going to http://localhost:8080/yams-basex-connector/rest/. This should show an HTML listing of the available databases and links to info (as JSON) for each database
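
The browser checks in the two setups above can also be done from the command line. The following sketch assumes curl is installed and uses the local URLs from the steps above. The admin:admin credentials are the values from the context.xml example; depending on how the BaseX server is configured they may be unnecessary or different.

```shell
#!/bin/sh
# URLs taken from the steps above; override through the environment if needed.
BASEX_REST_URL="${BASEX_REST_URL:-http://localhost:8984/rest/}"
CONNECTOR_URL="${CONNECTOR_URL:-http://localhost:8080/yams-basex-connector/rest/}"
# Assumed credentials (same values as in the context.xml example).
BASEX_CREDENTIALS="${BASEX_CREDENTIALS:-admin:admin}"

# Print the HTTP status code for URL $1, optionally authenticating
# with the user:password pair given as $2.
http_status() {
    if [ -n "${2:-}" ]; then
        curl -s -o /dev/null -w '%{http_code}' -u "$2" "$1"
    else
        curl -s -o /dev/null -w '%{http_code}' "$1"
    fi
}

# Print a short status report for both services.
check_services() {
    echo "BaseX REST server: $(http_status "$BASEX_REST_URL" "$BASEX_CREDENTIALS")"
    echo "BaseX connector:   $(http_status "$CONNECTOR_URL")"
}
```

Calling check_services prints one status line per service. A status of 000 means curl could not connect at all, which usually points to a server that is not running or a wrong port.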

Corpus Structure 2 connector

TODO

GWT application

Assuming that a BaseX connector and a CS2 connector are already running locally or remotely:

  • Locate the /web/src/main/resources/nl/mpi/yams/client/ServiceLocations.properties file and open it for editing

  • Make sure the following properties link to existing and correctly functioning locations (they are relative to the deployment path of the GWT application):

    nl.mpi.yams.jsonCsAdaptorUrl=../yams-cs-connector/rest
    nl.mpi.yams.jsonBasexAdaptorUrl=../yams-basex-connector/rest
    
  • You can also configure remote locations, e.g.:

    nl.mpi.yams.jsonCsAdaptorUrl=https://lux16.mpi.nl/ds/yams-cs-connector/rest
    nl.mpi.yams.jsonBasexAdaptorUrl=https://lux16.mpi.nl/ds/yams-basex-connector/rest
    
  • Build YAMS GWT project

  • IMPORTANT: To enable development mode, start the application with the custom goal mvn gwt:run -Pdev (defined as a custom action in the Netbeans project)

  • To enable debugging (i.e. use breakpoints), use mvn gwt:debug -Pdev and attach the debugger to port 8000

  • This will start the 'GWT Development Mode' application. Launch the browser from within this application and use the 'dev mode' button on the first page to run the client side application in the browser and get debugging output in the console on the server

Testing the search REST interface

The interface deployed on lux17 is used as an example here:

  • BaseX connector service: https://lux17.mpi.nl/cmdi/lat/yams-basex-connector/rest
  • Connector to the BaseX database REST interface: http://lux17.mpi.nl:8986/rest/
  • Query example: {TODO}

Some performance testing scripts are available in /crawler/src/test/script/performance; see also Trac ticket #4215.

Status, Planning and Roadmap

Status: no current or planned development at The Language Archive