Home
YAMS is a CMDI/IMDI archive search tool based on Arbil datanodes, developed at The Language Archive. It has two parts: the crawler, which is run from the command line, and the web application, which runs in Tomcat. The crawler writes directly to a local BaseX database. When the crawl is complete, the resulting database is moved into the running BaseX instance. The web application can select from the available databases and requires only read-only access to them.
- Peter Withers (original developer)
- Twan Goosen
YAMS depends (via Maven) on the Plugins framework and Arbil metadata reading and writing.
It also depends on a running BaseX XML database instance, and a connection to a 'Corpus Structure 2' provider for corpus browsing.
See the Design section for an overview.
The crawler needs to be run via the command line. It has the following command line options:
-a,--append                   Restart crawling, adding missing documents.
-ams,--amspermissions <arg>   REST service URL where permissions information from AMS can be obtained (default: https://lux16.mpi.nl/ds/yams-cs-connector/rest/node?id=).
-c,--crawl                    Crawl the provided URL or the default URL if not otherwise specified.
-d,--drop                     Drop the existing data and recrawl. This option implies the -c option.
-db,--dbname <arg>            Name of the database to use (default: YAMS-DB).
-f,--facets                   Preload the facets from the existing crawled data.
-l,--limit <arg>              Limit crawling to URLs which contain the provided string (default: http://lux16.mpi.nl/).
-n,--number <arg>             Number of documents to insert (default: 90).
-p,--password <arg>           Database password (default: admin).
-s,--server <arg>             Database server URL or file path. When a file path is provided, it is used as the local BaseX directory via the Java bindings rather than the REST interface. The default is to use the unmodified local BaseX directory.
-t,--target <arg>             Target URL of the start documents to crawl (default: http://hdl.handle.net/11142/00-74BB450B-4E5E-4EC7-B043-F444C62DB5C0). This option implies the -c option.
-u,--user <arg>               Database user name (default: admin).
-x,--debug                    Display debug output.
In general, you will want to use the -a option to make the crawler recursively look up all links found in the root node. Use -n to limit the number of nodes to crawl in one session. The next time the crawler runs, it will continue processing links encountered in previous crawler sessions.
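The session-based crawl described above can be scripted. The following is a minimal sketch, not part of YAMS, that assembles a crawler command line from the options listed above and optionally runs several sessions in a row. The helper names `build_crawler_command` and `run_sessions` and the jar path are hypothetical; adjust them to your own build.

```python
import os
import subprocess
from typing import List, Optional


def build_crawler_command(jar_path: str,
                          server: Optional[str] = None,
                          number: Optional[int] = None,
                          append: bool = True,
                          facets: bool = False,
                          debug: bool = False) -> List[str]:
    """Assemble a crawler command line using only the documented options."""
    command = ["java", "-jar", jar_path, "-c"]
    if append:
        command.append("-a")  # continue with links found in earlier sessions
    if server is not None:
        # Expand "~" ourselves because no shell is involved when running the command.
        command += ["-s", os.path.expanduser(server)]
    if number is not None:
        command += ["-n", str(number)]  # documents to insert in this session
    if facets:
        command.append("-f")
    if debug:
        command.append("-x")
    return command


def run_sessions(command: List[str], sessions: int) -> None:
    """Run several crawler sessions in a row; stops at the first failing one."""
    for _ in range(sessions):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Hypothetical jar name; use the file actually generated in crawler/target.
    cmd = build_crawler_command(
        "crawler/target/yams-crawler-jar-with-dependencies.jar",
        server="~/basex/", number=1000, facets=True)
    print(" ".join(cmd))
```

Because each session picks up where the previous one stopped, calling `run_sessions(cmd, 5)` with `-n 1000` would process up to 5000 documents in total, assuming the crawler behaves as described above.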
Dependencies between YAMS and common external components

Thick lines = public interface; blue line = GUI
Dependencies between YAMS modules

Thick lines = public interface; black line = REST service; blue line = GUI; red line = server side application
Runtime connections between YAMS components:

To start developing in the trunk, check out/clone the following projects and do a mvn clean install on each of them:
- Create a basex directory in your home directory
- Run/debug the 'YAMS crawler' project from Netbeans (some default options are defined in nbactions.xml)
- OR run the YAMS 'jar-with-dependencies.jar' file generated in crawler/target with the following options:
  -Xmx2048m -s ~/basex/ -c -a -n 1000 -f
- This will start crawling from the default start URL (currently hdl:11142/00-74BB450B-4E5E-4EC7-B043-F444C62DB5C0), follow links (-a), process a maximum of 1000 files (-n) and create statistics (-f). It writes some output to stdout and also creates a yams-crawler.log file in the working directory with more information.
- Add the -x option to enable debugging output
Assuming such a service is up and running at tlatest06 on port 8984:
- Modify the parameters in context.xml as follows (globally or in the META-INF directory of the yams-basex-connector project before deployment):
  <Parameter name="basexRestUrl" override="false" value="http://tlatest06:8984"/>
  <Parameter name="basexUser" override="false" value="admin"/>
  <Parameter name="basexPass" override="false" value="admin"/>
- Build and run the yams-basex-connector project in Tomcat
- Browse to the 'rest' servlet within the deployed application, e.g. by going to http://localhost:8080/yams-basex-connector/rest/. This should show an HTML listing of the available databases and links to info (as JSON) for each database
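Before deploying, it can help to check that context.xml really defines the three parameters named above. The following is a minimal sketch, not part of YAMS: the helper names are hypothetical, and the enclosing `<Context>` element in the sample is an assumption about how the file is laid out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Optional

# Parameter names taken from the configuration step above.
REQUIRED = ("basexRestUrl", "basexUser", "basexPass")


def read_parameters(context_xml: str) -> Dict[Optional[str], Optional[str]]:
    """Map each <Parameter name="..."> to its value attribute."""
    root = ET.fromstring(context_xml)
    return {p.get("name"): p.get("value") for p in root.iter("Parameter")}


def missing_parameters(context_xml: str) -> List[str]:
    """Return the required parameter names that are absent or empty."""
    params = read_parameters(context_xml)
    return [name for name in REQUIRED if not params.get(name)]


if __name__ == "__main__":
    # Sample fragment; the <Context> wrapper is assumed for illustration.
    sample = """<Context>
      <Parameter name="basexRestUrl" override="false" value="http://tlatest06:8984"/>
      <Parameter name="basexUser" override="false" value="admin"/>
      <Parameter name="basexPass" override="false" value="admin"/>
    </Context>"""
    print(read_parameters(sample))
    print("missing:", missing_parameters(sample))
```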
Assuming a database already exists on the filesystem (i.e. the crawler has been executed) in location /Users/me/basex:
- Start a local BaseX REST server by running the following command in the 'bin' directory of BaseX:
  ./basexhttp -X -Dorg.basex.path=/Users/me/basex -l
- A REST service should now be running locally on port 8984; test this by browsing to http://localhost:8984/rest/
- Modify the parameters in context.xml as follows (globally or in the META-INF directory of the yams-basex-connector project before deployment):
- Build and run the yams-basex-connector project in Tomcat
- Browse to the 'rest' servlet within the deployed application, e.g. by going to http://localhost:8080/yams-basex-connector/rest/. This should show an HTML listing of the available databases and links to info (as JSON) for each database
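The browser check of the local REST service can also be done from a script. The following is a minimal sketch under stated assumptions: it uses HTTP Basic authentication with the admin/admin credentials mentioned on this page, and it assumes the listing is XML in which each database appears as an element with the local name `database`. The response layout may differ between BaseX versions, so treat `database_names` as illustrative; all helper names are hypothetical.

```python
import base64
import urllib.request
import xml.etree.ElementTree as ET
from typing import List


def basic_auth_header(user: str, password: str) -> str:
    """Build the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return "Basic " + token


def fetch(url: str, user: str, password: str, timeout: float = 10.0) -> str:
    """GET the URL with Basic authentication and return the body as text."""
    request = urllib.request.Request(
        url, headers={"Authorization": basic_auth_header(user, password)})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")


def database_names(listing_xml: str) -> List[str]:
    """Collect the text of every element whose local name is 'database'."""
    names = []
    for element in ET.fromstring(listing_xml).iter():
        local_name = element.tag.rsplit("}", 1)[-1]  # drop any namespace prefix
        if local_name == "database" and element.text:
            names.append(element.text.strip())
    return names


if __name__ == "__main__":
    try:
        listing = fetch("http://localhost:8984/rest/", "admin", "admin")
        print(database_names(listing))
    except (OSError, ET.ParseError) as error:
        # Covers connection failures, HTTP errors and unexpected response bodies.
        print("REST check failed:", error)
```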
TODO
Assuming that a BaseX connector and a CS2 connector are already running locally or remotely
- Locate the /web/src/main/resources/nl/mpi/yams/client/ServiceLocations.properties file and open it for editing
- Make sure the following properties link to existing and correctly functioning locations (they are relative to the deployment path of the GWT application):
  nl.mpi.yams.jsonCsAdaptorUrl=../yams-cs-connector/rest
  nl.mpi.yams.jsonBasexAdaptorUrl=../yams-basex-connector/rest
- You can also configure remote locations, e.g.:
  nl.mpi.yams.jsonCsAdaptorUrl=https://lux16.mpi.nl/ds/yams-cs-connector/rest
  nl.mpi.yams.jsonBasexAdaptorUrl=https://lux16.mpi.nl/ds/yams-basex-connector/rest
- Build YAMS GWT project
- IMPORTANT: To enable development mode, start the application with the custom goal mvn gwt:run -Pdev (defined as a custom action in the Netbeans project)
- To enable debugging (i.e. use breakpoints), use mvn gwt:debug -Pdev and attach the debugger to port 8000
- This will start the 'GWT Development Mode' application. Launch the browser from within this application and use the 'dev mode' button on the first page to run the client side application in the browser and get debugging output in the console on the server
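Since the service locations in ServiceLocations.properties may be relative to the deployment path of the GWT application, it can be useful to see which absolute URLs they resolve to. The following is a minimal sketch using standard URL resolution; the deployment URL `http://localhost:8080/yams/` and the helper names are hypothetical, and the parser handles only simple `key=value` lines.

```python
from typing import Dict
from urllib.parse import urljoin


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blanks and comment lines."""
    properties = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        properties[key.strip()] = value.strip()
    return properties


def resolve_locations(properties: Dict[str, str], deployment_url: str) -> Dict[str, str]:
    """Resolve each value against the deployment URL; absolute URLs stay unchanged."""
    return {key: urljoin(deployment_url, value) for key, value in properties.items()}


if __name__ == "__main__":
    text = """
    nl.mpi.yams.jsonCsAdaptorUrl=../yams-cs-connector/rest
    nl.mpi.yams.jsonBasexAdaptorUrl=../yams-basex-connector/rest
    """
    # Hypothetical deployment path of the GWT application.
    resolved = resolve_locations(parse_properties(text), "http://localhost:8080/yams/")
    for key, url in resolved.items():
        print(key, "->", url)
```

Note the trailing slash on the deployment URL: without it, `..` would be resolved one directory level higher.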
The interface deployed on lux17 is used as an example here:
- BaseX connector service: https://lux17.mpi.nl/cmdi/lat/yams-basex-connector/rest
- Connector to the BaseX database REST interface: http://lux17.mpi.nl:8986/rest/
- Query example: {TODO}
Some performance testing scripts are available in /crawler/src/test/script/performance, also see Trac ticket #4215.
Status: no current or planned development at The Language Archive