Skip to content

Conversation

@timfrazee
Copy link
Contributor

@timfrazee timfrazee commented Oct 2, 2023

Implements infrastructure for mapper unit smoke tests.

  • test_mapper iterates over all implemented mappers, builds data fixtures for them, and runs to_UCLDC(), expecting no exceptions
  • Test Helpers help prepare tests, prepare records, and generate fixtures. Helpers are organized in the same was as mappers and are dynamically looked up by name/module path.
  • If specific testing is needed for a mapper, an override method can be provided named test_{mapper_name}_mapper in test_mapper.py. The generic mapper test will not run for this mapper.

@timfrazee
Copy link
Contributor Author

@amywieliczka Here are the pre-mapping methods (rikolti__pre_mapping) returned by the registry, by mapper type, for each currently-implemented mapper on main:

{
    "oac.oac": ["/select-oac-id"],
    "flickr.sppl": ["/select-id?prop=id"],
    "flickr.sdasm": ["/select-id?prop=id"],
    "flickr.flickr": ["/select-id?prop=id"],
    "oai.samvera": ["/select-id?prop=id"],
    "oai.omeka": [],
    "oai.csu_dspace": ["/select-id?prop=id"],
    "oai.omeka.csa": [],
    "oai.omeka.nothumb": [],
    "oai.oai": [],
    "oai.quartex": ["/select-id?prop=id"],
    "oai.chapman": ["/select-id?prop=id"],
    "oai.islandora.chs": ["/select-id?prop=id"],
    "oai.islandora.burbank": ["/select-id?prop=id"],
    "oai.content_dm.csudh": ["/select-id?prop=id"],
    "oai.content_dm.pepperdine": ["/select-id?prop=id"],
    "oai.content_dm.blackgold": ["/select-id?prop=id"],
    "oai.content_dm.arck": ["/select-id?prop=id"],
    "oai.content_dm.cvpl": ["select-id?prop=id"],
    "oai.content_dm.lapl": ["/select-id?prop=id"],
    "oai.content_dm.chico": ["/select-id?prop=id"],
    "oai.content_dm.contentdm": [],
    "oai.content_dm.csu_sac": ["/select-id?prop=id"],
    "oai.tv_academy": ["/select-id?prop=id"],
    "oai.csu_dspace.csuci": [],
    "oai.cca_vault": ["/select-id?prop=id"],
    "oai.up": ["/select-id?prop=id"],
    "oai.pspl": [],
    "oai.islandora": ["/select-id?prop=id"],
    "oai.yosemite": ["/select-id?prop=id"],
    "nuxeo.nuxeo": ["/select-id?prop=uid"],
}

At quick glance, it looks like it's pretty consistent - there are some OAI mappers that don't have any pre-mapping, but those that do are all the same.

@timfrazee
Copy link
Contributor Author

@amywieliczka Ok, here is the full dataset for all 55 defined mapper types. Format is:

{
  mapper_type: {
    pre_mapper: [ collection IDs for this mapper_type that include this pre_mapper ]
  },
  ...
}

Nones are ignored, though you'll notice the ucd_json mapper has a blank string.

{
    "oac_dc": {
        "/select-oac-id":[2,7,9,10,14,42,44,45,48,49,51,52,53,54,55,56,57,58,59,60]
    },
    "ucd_json": {
        "":[8,3378,3696,5656,8918,9757,10503,27202,27275,27306,27307,27548,27685,28047,28048,28049,28050,28051,28052,28053]
    },
    "dublin_core": {
        "/select-id?prop=id":[11,13,75,101,123,124,125,126,127,130,131,133,134,135,149,151],
        "/select-id": [32, 156, 9075, 11353],
    },
    "ucldc_nuxeo": {
        "/select-id?prop=uid":[19,20,21,22,27,28,29,30,31,36,38,40,50,65,66,68,69,76,78,79]
    },
    "ucsb_aleph_marc": {"/ucsb-aleph-marc-id": [113]},
    "ucb_tind_marc": {
        "/select-id?prop=id":[150,747,1454,2515,3461,3928,4442,5058,5636,6083,6200,7497,7875,8776,8980,9673,10373,10425,11167,11237]
    },
    "ucsc_oai_dpla": {
        "/select-id?prop=id":[153,154,157,159,4877,5014,5105,5191,12661,16530,18580,23349,27086,27156,27371,27736,27737,27738,27739,27740]
    },
    "ucsd_blacklight_dc": {
        "/select-id?prop=id,/jsonfy-prop":[172,173,174,178,183,184,186,516,1092,1412,2497,3475,4155,5020,5344,5938,8540,8844,9652,10125]
    },
    "csa_omeka": {
        "/select-id?prop=id":[232,3297,5445,8476,14210,17012,18384,19562,20936,24852,26673,27142,27277,27278,27279,27280,27281,27282,27283,27284]
    },
    "ucla_solr_dc": {"/select-id?prop=PID": [354, 17210, 26092, 26093]},
    "quartex_oai": {
        "/select-id?prop=id":[2206,2287,2487,2658,6137,10527,20666,23760,26212,26213,26214,26215,26216,26217,26218,26220,26486,26487,26488,26489]
    },
    "sjsu_islandora": {
        "/select-id?prop=id":[3155,6406,7523,7855,8736,9615,9825,11725,13181,17161,19008,19256,19264,19303,20698,25340,26574,26575,26598,26599]
    },
    "cca_vault_oai_dc": {"/select-id?prop=id": [3433, 3767, 26391, 26470]},
    "chs_islandora": {
        "/select-id?prop=id":[3501,6124,26726,26727,26923,26924,26925,26926,27504,27505,27506,27507,27508,27509,27510,27511,27512,27513,27514,27515]
    },
    "contentdm_oai_dc": {
        "/select-id?prop=id":[4425,5401,7955,8235,9844,13762,14885,15063,15934,16161,16887,16888,16908,17466,18106,18980,19404,19611,23030,24195]
    },
    "preservica_api": {
        "/select-preservica-id":[4789,5861,7157,7562,9160,10207,11143,11367,15686,18469,21804,22816,24251,24415,25302,26462,26464,26670,26671],
        "/select-id?prop=id": [26460],
    },
    "black_gold_oai": {
        "/select-id?prop=id":[4891,26865,26866,26867,26868,26869,26870,26896,26897,27844]
    },
    "omeka_santa_clara": {"/select-id?prop=id": [5138, 27798]},
    "calpoly_oai_dc": {
        "/select-id?prop=id":[9670,9674,10423,14360,16958,18308,19436,20425,26796,26797,26798,26799,26800,26801,26802,26803,26804,26805,26806,26807]
    },
    "csu_sac_oai_dc": {"/select-id?prop=id": [11068, 26723, 27968]},
    "csudh_contentdm_oai_dc": {
        "/select-id?prop=id":[14167,26203,26397,26398,26399,26400,26401,26402,26403,26944,27288]
    },
    "chula_vista_pl_contentdm_oai_dc": {
        "select-id?prop=id": [18598, 26477, 26480, 26481, 26482]
    },
    "lapl_oai": {
        "/select-id?prop=id":[26094,27219,27220,27221,27222,27223,27224,27225,27226,27227]
    },
    "sfpl_marc": {"/sfpl-marc-id": [26095]},
    "lapl_26096": {"/select-id?prop=id": [26096]},
    "ucsf_solr": {"": [26099, 26100, 28029, 28030, 28031, 28032, 28033, 28034]},
    "cavpp_islandora": {
        "/select-id?prop=id":[26198,26225,26226,26227,26228,26229,26230,26231,26232,26233,26235,26236,26237,26239,26240,26241,26242,26243,26244,26245]
    },
    "up_oai_dc": {
        "/select-id?prop=id":[26224,26309,26310,26311,26312,26313,26314,26315,26316,26317,26318,26319,26320,26321,26323,26324,26325,26326,26849,26850]
    },
    "chapman_oai_dc": {
        "/select-id?prop=id":[26284,26285,26286,26287,26288,26289,26290,26291,26292,26293,26294,26295,26296,26297,26328,26329,26330,26331,26332,26333]
    },
    "califa_oai_dc": {"/select-id?prop=id": [26483]},
    "csl_marc": {"/csl-marc-id": [26559]},
    "contentdm_oai_dc_get_sound_thumbs": {"/select-id?prop=id": [26684]},
    "pspl_oai_dc": {
        "/select-id?prop=id": [26707, 26715, 26716, 26717, 26718, 26719, 26720, 26721]
    },
    "omeka": {
        "/select-id?prop=id":[26724,26725,27121,27122,27140,27144,27145,27147,27149,27150,27165,27433,27618,27619,27620,27621,27622,27623,27624,27625]
    },
    "chico_oai_dc": {
        "/select-id?prop=id": [26762, 26763, 26764, 26765, 26766, 26768, 26769]
    },
    "ucb_bampfa_solr": {"/select-id?prop=id,/jsonfy-prop": [26772]},
    "islandora_oai_dc": {
        "/select-id?prop=id":[26773,26774,26775,26776,26777,26778,26779,26885,26951]
    },
    "flickr_api": {"/select-id?prop=id": [26857, 27125, 27316]},
    "youtube_video_snippet": {
        "/select-id?prop=id":[26858,27015,27022,27023,27024,27029,27030,27081,27162,27163,27274,27296,27360,27419,27420,27421,27422,27423]
    },
    "csuci_mets": {"/select-id?prop=id": [26928]},
    "pastperfect_xml": {
        "/select-id?prop=metadata/identifier": [26935, 27166, 27429, 27434, 27566]
    },
    "caltech_restrict": {"/select-id?prop=id": [26952, 27118]},
    "usc_oai_dc": {
        "/select-id?prop=id":[26988,27087,27089,27090,27091,27092,27093,27094,27095,27096,27097,27098,27099,27100,27101,27102,27103,27104,27105,27106]
    },
    "yosemite_oai_dc": {"/select-id?prop=id": [26989, 27016, 27631, 27632, 27633]},
    "emuseum_xml": {"/select-id?prop=metadata/identifier": [27011]},
    "csu_dspace_mets": {"/select-id?prop=id": [27085]},
    "sierramadre_marc": {"/csl-marc-id": [27127]},
    "burbank_islandora": {
        "/select-id?prop=id":[27128,27129,27130,27131,27132,27133,27134,27135,27136,27137,27138]
    },
    "omeka_nothumb": {"/select-id?prop=id": [27143, 27146, 27148]},
    "sanjose_pastperfect": {"/select-id?prop=metadata/identifier": [27289]},
    "tv_academy_oai_dc": {"/select-id?prop=id": [27305]},
    "flickr_sdasm": {"/select-id?prop=id": [27324, 27424, 27425, 27426, 27427]},
    "flickr_sppl": {
        "/select-id?prop=id":[27338,27362,27363,27364,27365,27366,27367,27373,27374,27375,27376,27377,27378,27379,27380,27381,27382,27383,27384,27386]
    },
    "internet_archive": {"/select-id?prop=identifier": [27369, 27616]},
    "arck_oai": {
        "/select-id?prop=id":[27564,27582,27583,27584,27585,27586,27587,27588,27589,27590,27591]
    },
}

@timfrazee timfrazee force-pushed the mapper-unit-tests branch 2 times, most recently from 12aa322 to 31ef66b Compare November 13, 2023 18:54
@timfrazee timfrazee marked this pull request as ready for review November 20, 2023 18:54
@amywieliczka
Copy link
Collaborator

amywieliczka commented Nov 30, 2023

@timfrazee wow, this is really solid work. @barbarahui and I sat down today to take a good look at things and have these questions:

  1. The merge paradigm of schema definitions was not immediately apparent to us: the current implementation looks like it describes two levels of specification: a default schema and a schema, and the schema is merged into the default schema (thus overwriting any schema specs set in the default schema), but there's a full hierarchy of mappers. Do you intend to build this out to merge arbitrarily deep down a hierarchy of helpers? For example: the csudh schema gets merged into contentdm schema gets merged into oai schema gets merged into base default schema? That seems the only way, but that also seems super complicated! Would this happen in the @classmethod for_mapper? At the same time, a csudh schema that gets merged directly into the base default schema (or even a csudh schema that gets merged directly into the oai schema) also seems untenable.

  2. It seems you're trying to cache the results of the generated id and identifier fields in a self.static array on the fixture generator base class under the assumption that the generated result of these values will be used elsewhere within the fixture, and that mapping will raise an exception if the value is different. Which mappers have this kind of requirement?

  3. It seems you're attempting to add capacity to define multiple schemas with the schema_index parameter passed to generate_fixture. I do think it could be valuable to define multiple alternative schemas (for cases when a mapper has to do one thing if, for example, description is a string but another thing if description is an array), but I don't actually see this implemented, and before I do, it would probably be good to check that this is actually a requirement of some mappers. It also seems quite complex to implement: an entire matrix of field variants intersecting with the presumed requirement for cached static fields (which may themselves have variants) and the schema-merge class inheritance model.

Overall, I'm wondering if we can find some way to concretize this. Flattening the mapper class inheritance model would at least reduce one axis in that matrix and is definitely the most appealing to me. We already know that there is no namespace collision in our leaf mappers. Incidentally, I'm honestly wondering if it doesn't make sense to flatten the mapper inheritance tree as well (though that will likely come after launch sometime). Each time we make updates to base mappers (such as the base oai mapper), we have to re-run validation reports to every single collection that uses that base mapper because they're all implicated in the change. As this code base evolves, it will have to be a best practice to only update leaf mappers in order to reduce implicating all the other data in the system, unless we're ready to implicate (and validate) everything.

I know you've done a bit more analysis of the OAI mappers specifically since you last worked on this, so I'd be curious to hear your thoughts regarding the requirements for our fixture generators grounded in some analysis. Probably worth taking a look at (or talking to Lucas about) another tree of mappers as well.

All in all, this use of faker to populate schema definitions of the incoming vernacular metadata seems like a great approach.

@timfrazee timfrazee force-pushed the mapper-unit-tests branch 2 times, most recently from 706e98b to f82fd09 Compare December 4, 2023 18:54
@christinklez christinklez linked an issue Dec 4, 2023 that may be closed by this pull request
@christinklez christinklez added this to the #4 CIC work milestone Dec 11, 2023
@timfrazee
Copy link
Contributor Author

timfrazee commented Dec 11, 2023

@amywieliczka @barbarahui I've implemented a few improvements and, currently, a functional ContentDM mapper test.

  • Ability to filter mappers to test
    • You can pass a CLI arg to pytest to filter mappers that will be tested.
pytest metadata_mapper/test --mapper=contentdm
pytest metadata_mapper/test --mapper=contentdm_mapper # same as above - `_mapper` is optional

pytest metadata_mapper/test --mappers=contentdm,cca_vault # `mappers` arg can take mutiple comma-delimited mapper names
  • Implemented requests-mock to handle HTTP requets
    • If a mapper makes a request, it must be mocked in the associated helper's setup_mocks method. Otherwise HTTP requests are blocked and will raise an exception.
def setup_mocks(self):
  url = "https://example.com"
  response = { "success": True }
  self.request_mock.register_uri("GET", url, json=response)

See ContentdmTestHelper for an example.

  • Finished ContentdmTestHelper, providing the necessary fixture changes for the mapper to initialize and map successfully.

Next up, I'm working on the ScudhTestHelper and CcaVaultTestHelper, as well as addressing the deficiencies of the existing SCHEMA/DEFAULT_SCHEMA test helper pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Fixture generator for: content_dm, csudh, cca_vault

4 participants