This repository contains the XML Schema Definition (XSD) for validating Kentucky Digital Newspaper Program (KDNP) newspaper issue metadata. It enforces strict formats and controlled vocabularies for various fields such as dates, identifiers, and geographic data relevant to Kentucky newspaper collections.
- Date Fields:
  - IssueDate must follow YYYY-MM-DD format.
  - Year is a 4-digit number.
  - Month, Day, and Edition are two-digit numbers.
- Identifiers:
  - KDNPTitleCode: exactly 3 lowercase letters (a-z).
  - KDNPControlIdentifier: 13 characters; first 3 are lowercase letters, followed by 10 digits.
- Geographic Fields:
  - Region must be one of six predefined Kentucky regions.
  - County must be a valid Kentucky county name (120 counties enumerated).
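As a quick illustration of the format rules listed above, the following Python sketch mirrors them as regular expressions. The patterns are transcribed from this description, not extracted from the XSD, so the schema remains the authoritative definition. The helper name `matches_format` and the sample values are hypothetical, and the check covers shape only (it does not verify that a date exists on the calendar).

```python
import re

# Patterns transcribed from the field rules in this README (an assumption:
# the XSD may express them differently, e.g. with built-in date types).
PATTERNS = {
    "IssueDate": re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}"),
    "Year": re.compile(r"[0-9]{4}"),
    "Month": re.compile(r"[0-9]{2}"),
    "Day": re.compile(r"[0-9]{2}"),
    "Edition": re.compile(r"[0-9]{2}"),
    "KDNPTitleCode": re.compile(r"[a-z]{3}"),
    "KDNPControlIdentifier": re.compile(r"[a-z]{3}[0-9]{10}"),
}


def matches_format(field: str, value: str) -> bool:
    """Return True if the whole value matches the pattern for the named field."""
    return PATTERNS[field].fullmatch(value) is not None


# Hypothetical sample values, chosen only to exercise the patterns.
print(matches_format("KDNPTitleCode", "abc"))  # True
print(matches_format("KDNPTitleCode", "ABC"))  # False
```

Using `fullmatch` rather than `match` matters here: it rejects values with trailing characters, which is closer to how a schema pattern facet behaves.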
Validate KDNP XML files describing newspaper issues against this schema to ensure data consistency and integrity.
- Developed with XML Schema 1.0 standards.
- Custom simple types and enumerations enforce format and controlled vocabularies.
- Extensible for additional fields as needed.
This Python script validates multiple XML files in a directory against a specified XSD (XML Schema Definition) file using the lxml library. It logs any validation or parsing errors into a log file.
- XSD Schema File
  - XSD_PATH: Path to the .xsd file used to validate XML files.
- XML Files Directory
  - XML_DIR: Directory containing XML files to be validated.
- Validation Log File
  - LOG_FILE: Path where validation errors (if any) are written.
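The three settings above might look like the following at the top of the script. The paths are placeholders for illustration only, not the project's real locations.

```python
from pathlib import PureWindowsPath

# Placeholder values (assumptions); substitute your own locations.
XSD_PATH = r"C:\path\to\schema.xsd"
XML_DIR = r"C:\path\to\xml_files"
LOG_FILE = r"C:\path\to\validation_log.txt"

# Raw strings keep backslashes literal, so the "\t" in "\to" is not a tab.
print(PureWindowsPath(XSD_PATH).suffix)  # .xsd
```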
Description:
Validates each .xml file found in the provided directory (and subdirectories) against the provided XSD schema.
Steps:
- Load XSD schema using lxml.etree.XMLSchema.
- Traverse the XML directory using os.walk.
- Parse and validate each .xml file:
  - If validation fails, the schema error log is captured.
  - If parsing fails, the exception message is recorded.
- Write results to log file:
  - If errors are found, they are written to the specified log file.
  - If no errors are found, a success message is logged.
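The first three steps can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `collect_validation_errors` is hypothetical, the layout of each message is an assumption, and errors are returned as a list instead of being written to the log.

```python
import os

from lxml import etree


def collect_validation_errors(xsd_path, xml_dir):
    """Validate every .xml file under xml_dir; return a list of error strings."""
    # Step 1: load the schema. This raises if the XSD is malformed or invalid.
    schema = etree.XMLSchema(etree.parse(xsd_path))

    errors = []
    # Step 2: walk the directory tree, including subdirectories.
    for root, _dirs, files in os.walk(xml_dir):
        for name in sorted(files):
            if not name.lower().endswith(".xml"):
                continue
            path = os.path.join(root, name)
            # Step 3: parse the file, then validate it against the schema.
            try:
                doc = etree.parse(path)
            except etree.XMLSyntaxError as exc:
                errors.append(f"File: {path}\nParsing error: {exc}")
                continue
            if not schema.validate(doc):
                # error_log holds one entry per schema violation.
                for entry in schema.error_log:
                    errors.append(
                        f"File: {path}\nLine {entry.line}: {entry.message}"
                    )
    return errors
```

An empty list means every file parsed and validated cleanly, which is the condition for the success message in the last step.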
Example Output in Log File:
- XML Validation Errors:
File: C:\path\to\file.xml Line 15: Element 'title': This element is not expected. Expected is ( name ).
File: C:\path\to\another_file.xml Parsing error: Premature end of data in tag book line 3
If all XML files pass validation, the log will contain:
- All XML files validated successfully.
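The two log outcomes described above could be produced by a small helper along these lines. The name `write_log` is hypothetical, the errors are assumed to arrive as preformatted strings, and the exact spacing between entries is an assumption; only the header and success texts come from the examples above.

```python
def write_log(errors, log_file):
    """Write collected error strings to log_file, or a success line if none."""
    with open(log_file, "w", encoding="utf-8") as log:
        if errors:
            log.write("XML Validation Errors:\n\n")
            # One blank line between entries keeps per-file errors readable.
            log.write("\n\n".join(errors) + "\n")
        else:
            log.write("All XML files validated successfully.\n")
```

Opening the file in "w" mode overwrites any previous log, so each run leaves only its own results.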
To run the script directly:
python kdanp_meta_validate_xsd.py
Make sure this block is at the bottom of the file:
if __name__ == "__main__":
    validate_xml_files(XSD_PATH, XML_DIR, LOG_FILE)
- Python 3.x
- lxml package (Install with: pip install lxml)
- File paths are Windows-specific (using raw strings r"...").
- Ensure the XSD file is well-formed, or schema loading will fail.
- XML parsing errors (e.g., malformed XML) are caught and logged.
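Because schema loading fails on a malformed XSD, it can help to fail early with a readable message. The sketch below shows one way to do that. The function name `load_schema` is hypothetical and is not part of the script described here.

```python
from lxml import etree


def load_schema(xsd_path):
    """Load an XSD file, exiting with a clear message if it cannot be used."""
    try:
        return etree.XMLSchema(etree.parse(xsd_path))
    except etree.XMLSyntaxError as exc:
        # The XSD file is not well-formed XML.
        raise SystemExit(f"Schema is not well-formed XML: {xsd_path}: {exc}")
    except etree.XMLSchemaParseError as exc:
        # The file is well-formed XML but not a valid XML Schema.
        raise SystemExit(f"Schema is not a valid XSD: {xsd_path}: {exc}")
```

The two exception types separate "the file is broken XML" from "the file is XML but not a usable schema", which usually call for different fixes.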