SILKNOW crawler that collects metadata records describing silk material from various museums.
- Node 8+
First, install the dependencies using npm:
npm install
The crawler takes one parameter: the name of the museum to be crawled. For example:
npm start -- mfa-boston
Available parameters:
| Parameter | Description |
|---|---|
| --no-files | Do not download files such as photos |
| --no-records | Do not write the JSON records |
| --list-fields | Returns a list of unique fields from JSON records. Also takes a --format parameter (values: "md" or "markdown" for Markdown, "json" for JSON, defaults to Markdown) |
| --check-images | Re-download images marked with the hasError flag |
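As an illustration of how these options might be combined with the museum name, here is a minimal sketch. It assumes the options are passed after the museum name, which is not stated explicitly here. The `crawl` function and the `DRY_RUN` variable are hypothetical names introduced for this example and are not part of the crawler.

```shell
# Hypothetical wrapper around the documented "npm start -- <museum>" command.
# Assumption: options such as --no-files are passed after the museum name.
# Set DRY_RUN=1 to print the command instead of running it.
crawl() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "npm start -- $*"
  else
    npm start -- "$@"
  fi
}
```

With this helper, `crawl mfa-boston --no-files` would run the MFA Boston crawler without downloading photos, and `DRY_RUN=1 crawl mfa-boston --no-files` only prints the command that would be run.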
- artic - Art Institute of Chicago
- ceres-mcu - Red Digital de Colecciones de Museos de España
- el-tesoro - Museo de Arte Sacro El Tesoro de la Concepción
- europeana - Europeana
- gallica - Gallica
- garin - Garín 1820
- imatex - Centre de Documentació i Museu Tèxtil
- joconde - Joconde Database of French Museum Collections
- les-arts-decoratifs - Musée des Arts Décoratifs
- met-museum - The Metropolitan Museum of Art
- mfa-boston - Boston Museum of Fine Arts
- mobilier-international - Collection of the Mobilier national in France
- mtmad - Musée des Tissus
- paris-musees - Paris Musées
- risd-museum - Rhode Island School of Design Museum
- smithsonian - Smithsonian
- unipa - Sicily Cultural Heritage
- vam - Victoria and Albert Museum
- venezia - Musei di Venezia
- versailles - Versailles
Crawled JSON structure of each museum can be found here
The UNIPA crawler parses local files only. It requires a database.json along with an images folder. The data has to be stored in data/unipa/resources.
Link to the dataset: https://www.dropbox.com/sh/a8zzv22r59q67eq/AAB4SOAGf1byLFwakYkzbcYFa?dl=0
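Before running the UNIPA crawler, it can help to confirm that the downloaded dataset is in the expected place. The following is a minimal sketch of such a check; `check_unipa_resources` is a hypothetical helper written for this example, not something the crawler provides.

```shell
# Hypothetical pre-flight check for the UNIPA crawler's local resources.
# Returns 0 when database.json and the images folder are both present
# in the given directory (default: data/unipa/resources).
check_unipa_resources() {
  dir="${1:-data/unipa/resources}"
  if [ ! -f "$dir/database.json" ]; then
    echo "Missing file: $dir/database.json" >&2
    return 1
  fi
  if [ ! -d "$dir/images" ]; then
    echo "Missing folder: $dir/images" >&2
    return 1
  fi
  return 0
}
```

For example, `check_unipa_resources && npm start -- unipa` only starts the crawler when both items are found.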
The Paris Musées API requires a token, which can be generated by following the Paris Musées API documentation.
Once a token has been obtained, add the environment variable PARIS_MUSEES_TOKEN=<token> (replace <token> with the token) before running the crawler.
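One way to set the variable for a single run is to prefix the command with it. The sketch below wraps that in a small guard that stops early when the token is missing; `run_paris_musees` is a hypothetical helper name, and `paris-musees` is the museum name taken from the list of supported museums.

```shell
# Hypothetical wrapper: refuse to start the crawler when the token is missing.
run_paris_musees() {
  if [ -z "${PARIS_MUSEES_TOKEN:-}" ]; then
    echo "PARIS_MUSEES_TOKEN is not set" >&2
    return 1
  fi
  # Prefixing the command passes the variable to npm for this run only.
  PARIS_MUSEES_TOKEN="$PARIS_MUSEES_TOKEN" npm start -- paris-musees
}
```

Without the helper, the equivalent one-off invocation would be `PARIS_MUSEES_TOKEN=<token> npm start -- paris-musees`, with `<token>` replaced by your own token.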
MET Museum implements an anti-scraping strategy, which requires you to first open this page in a web browser, then open the browser's inspector and type document.cookie in the console to get the cookies. The result should look like this: "incap_ses_XXX_XXXXXXX=abcDEFgHIjkLmNoPQrSTUvWxyZABCDEFGHijklMNOPqrSTUVwXYZAb==".
Finally, add the environment variable MET_MUSEUM_COOKIE="incap_ses_XXX_XXXXXXX=abcDEFgHIjkLmNoPQrSTUvWxyZABCDEFGHijklMNOPqrSTUVwXYZAb==" (replace with your own cookie) before running the crawler.
This cookie is only valid for a limited amount of time, but it should be enough to crawl the entire collection.
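Since the cookie is copied by hand, a quick sanity check before starting a long crawl may save time. The sketch below only tests whether the value contains an `incap_ses_` cookie, based on the example shown above; that this is a sufficient check is an assumption, and `has_incap_cookie` is a hypothetical helper name.

```shell
# Hypothetical sanity check: does the value contain an incap_ses_* cookie?
# Assumption: a usable cookie string includes "incap_ses_<...>=<value>",
# as in the example above. This does not verify that the cookie is still valid.
has_incap_cookie() {
  case "$1" in
    *incap_ses_*=*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `has_incap_cookie "$MET_MUSEUM_COOKIE" && npm start -- met-museum` would start the crawler only when the variable looks plausible.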
The Musée d'Art et d'Industrie (St Etienne) crawler parses local files only. It requires an export silknow.tsv file along with a media folder. The data has to be stored in data/musee-st-etienne/resources.
Link to the dataset: https://drive.google.com/drive/folders/1V-p9cJ-lNtUtGHW1ePv_k4rLsd_xbbyb
Add the environment variable DEBUG=silknow:* to also output the debug logs.