Feat/theodw 2355 embed anonymisation raw to std dev #2442

Merged
stef-solirius merged 52 commits into main from
feat/THEODW-2355-Embed-anonymisation-raw-to-std-DEV
Apr 8, 2026

@stef-solirius stef-solirius commented Apr 2, 2026

https://pins-ds.atlassian.net/browse/THEODW-2355

Key Changes

1. Anonymisation Integration in Standardisation Processes

  • Service Bus Standardisation (service_bus_standardisation_process.py):

    • Added environment detection via Util.is_non_production_environment()
    • Integrated AnonymisationEngine to automatically anonymise sensitive fields based on Purview classifications
    • Applied only in DEV/TEST environments to protect sensitive data during development
  • Horizon Standardisation (horizon_standardisation_process.py):

    • Added same anonymisation logic for Horizon entities
    • Fixed entity name filtering to process only specified files (prevents "No definition found" errors)
    • Entity-specific seed column configuration support
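
The per-entity seed column support mentioned above can be pictured with a minimal plain-Python sketch. The field names `seed_column` and `entity_seed_columns` come from this PR description; the `SeedConfig` class, the `seed_column_for` method, and the example entity and column names are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class SeedConfig:
    """Hypothetical container for the seed settings described in this PR."""

    # Default column used to seed anonymisation for every entity.
    seed_column: str
    # Optional per-entity overrides, keyed by entity name.
    entity_seed_columns: Dict[str, str] = field(default_factory=dict)

    def seed_column_for(self, entity_name: str) -> str:
        """Return the override for this entity, or fall back to the default."""
        return self.entity_seed_columns.get(entity_name, self.seed_column)


# Example values are made up for illustration only.
config = SeedConfig(
    seed_column="id",
    entity_seed_columns={"example_entity": "caseReference"},
)
```

With this shape, an entity without an override uses the default seed column, so adding a new entity needs no configuration change unless its seed column differs.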

2. Enhanced Anonymisation Engine (odw/core/anonymisation/)

  • Deterministic Anonymisation:

    • NI Numbers now generated deterministically using seed columns (previously random)
    • All masking strategies use consistent seed-based hashing for reproducible results
  • Improved Credential Resolution:

    • Added support for Synapse linked service (ls_kv) for secure credential retrieval
    • Enhanced fallback mechanism: env vars → Key Vault via linked service → direct vault access → DefaultAzureCredential
    • Workspace identity token support via mssparkutils.credentials.getToken()
  • Configuration Enhancements (config.py):

    • Added seed_column for default anonymisation seed
    • Added entity_seed_columns for per-entity seed column overrides
    • Policy-driven anonymisation via YAML configuration
  • Strategy Improvements (base.py):

    • All strategies now use native Spark column expressions (removed Python UDFs for better performance)
    • Email masking preserves domain while masking local part
    • Name masking keeps the first and last characters and masks those in between
    • Birthdate shifts dates deterministically within ±14 days
  • Asset FQN Fixes:

    • Corrected Service Bus timestamp format in asset qualified names
    • Fixed Horizon file path construction
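
To illustrate what "deterministic, seed-based" masking means here, the following is a rough plain-Python sketch. It is not the engine's code: the PR states that the real strategies are native Spark column expressions, whereas this sketch uses `hashlib` for readability. The function names, the choice of SHA-256, the salt strings, the `user_` token format, and the NI-number letter set are all assumptions. The generated NI number only follows the general two-letters, six-digits, one-letter shape and is not checked against the official allocation rules.

```python
import hashlib
from datetime import date, timedelta


def seed_hash(seed: str, salt: str) -> int:
    """Map a seed value to a stable integer (SHA-256 is an assumed choice)."""
    digest = hashlib.sha256(f"{salt}:{seed}".encode("utf-8")).hexdigest()
    return int(digest, 16)


def mask_email(email: str, seed: str) -> str:
    """Replace the local part with a seed-derived token, keep the domain."""
    local, sep, domain = email.partition("@")
    if not sep:
        return email  # not shaped like an email address, leave untouched
    token = format(seed_hash(seed, "email") % 16**8, "08x")
    return f"user_{token}@{domain}"


def mask_name(name: str) -> str:
    """Keep the first and last characters and mask everything in between."""
    if len(name) <= 2:
        return name
    return name[0] + "*" * (len(name) - 2) + name[-1]


def shift_birthdate(birthdate: date, seed: str) -> date:
    """Shift a date by a seed-derived offset between -14 and +14 days."""
    offset = seed_hash(seed, "birthdate") % 29 - 14
    return birthdate + timedelta(days=offset)


def fake_ni_number(seed: str) -> str:
    """Build an NI-number-shaped string from the seed (format only)."""
    h = seed_hash(seed, "ni")
    letters = "ABCEGHJKLMNPRSTWXYZ"  # assumed letter set for illustration
    first = letters[h % len(letters)]
    h //= len(letters)
    second = letters[h % len(letters)]
    h //= len(letters)
    digits = f"{h % 1_000_000:06d}"
    h //= 1_000_000
    suffix = "ABCD"[h % 4]
    return f"{first}{second}{digits}{suffix}"
```

Because every output depends only on the seed value, re-running the process on the same record yields the same masked values, which keeps joins and repeated loads consistent across runs.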

3. Bug Fixes

  • JSON Syntax: Fixed trailing comma in py_etl_orchestrator.json
  • Environment Detection: Fixed Spark config key for environment detection
  • File Filtering: Added entity name filtering in HorizonStandardisationProcess.load_data() to prevent loading unwanted files
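
The file filtering fix can be sketched roughly as follows. This is an illustration under assumptions, not the code in `HorizonStandardisationProcess.load_data()`: the function name, the matching rule (file stem equals entity name, case-insensitive), and the example paths are all hypothetical.

```python
from pathlib import PurePosixPath
from typing import Iterable, List


def filter_files_for_entity(paths: Iterable[str], entity_name: str) -> List[str]:
    """Keep only the files whose stem matches the requested entity name.

    Matching on the file stem is an assumed rule for illustration; the real
    process may match on a different part of the path.
    """
    wanted = entity_name.lower()
    return [p for p in paths if PurePosixPath(p).stem.lower() == wanted]


# Made-up paths for illustration only.
candidate_files = [
    "horizon/2026-04-02/ExampleEntity.csv",
    "horizon/2026-04-02/UnrelatedEntity.csv",
]
selected = filter_files_for_entity(candidate_files, "exampleentity")
```

Filtering before loading means files without a matching definition are never passed downstream, which is the situation the "No definition found" errors point to.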

4. Testing

  • Added comprehensive unit tests for anonymisation strategies
  • Added integration tests for standardisation processes with anonymisation
  • Tests cover both Service Bus and Horizon entities

5. Documentation

  • Updated anonymisation README with:
    • Seed column configuration details
    • Deterministic transformation behavior
    • Configuration examples
    • Per-entity seed column override documentation

Environment Scope

  • Limited to DEV/TEST environments only - anonymisation is skipped in production
  • Environment detection via Spark configuration: spark.executorEnv.environment
  • Configurable via anonymisation policy YAML file in odw-config/anonymisation/policy.yaml
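
As a rough picture of the environment gate, the sketch below reads the environment from a plain mapping that stands in for the Spark configuration. The key `spark.executorEnv.environment` and the function name `is_non_production_environment` come from this PR description; the accepted values (`dev`, `test`), the treatment of a missing key, and the `maybe_anonymise` helper are assumptions.

```python
from typing import Callable, List, Mapping

ENVIRONMENT_KEY = "spark.executorEnv.environment"  # key named in the PR
NON_PRODUCTION = {"dev", "test"}  # assumed value spellings


def is_non_production_environment(conf: Mapping[str, str]) -> bool:
    """Return True only when the configured environment is DEV or TEST."""
    env = conf.get(ENVIRONMENT_KEY, "").strip().lower()
    return env in NON_PRODUCTION


def maybe_anonymise(
    rows: List[dict],
    conf: Mapping[str, str],
    anonymise: Callable[[List[dict]], List[dict]],
) -> List[dict]:
    """Apply the anonymiser in DEV/TEST and pass data through otherwise."""
    if is_non_production_environment(conf):
        return anonymise(rows)
    return rows
```

In this sketch a missing or unrecognised environment value is treated like production, so anonymisation is skipped. A stricter variant could anonymise by default instead; which behaviour the real `Util.is_non_production_environment()` has is not stated in this PR.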

stef-solirius and others added 8 commits April 2, 2026 15:19
Update test cases to expect correct single-argument log_info calls after
fixing the bug where log_info was called with two arguments. The logging_util.py
fix was already committed, but the tests were not updated to match.

Changes:
- Renamed tests to reflect corrected behavior instead of buggy behavior
- Updated assertions to expect log_info called with formatted strings
- Tests now verify graceful exception handling instead of expecting TypeError

The pytest import was no longer needed after refactoring tests to not use pytest.raises.
All tests still pass and ruff linting is now clean.

- Downgrade pydantic from ^2.12.5 to 1.10.0 in pyproject.toml
- Replace model_dump(mode='json') with json() in py_etl_orchestrator
- Replace model_dump_json() with json() in test_etl_process.py
- Verified all ETL process tests pass with pydantic 1.10
@stef-solirius stef-solirius merged commit a90eb64 into main Apr 8, 2026
9 checks passed
@stef-solirius stef-solirius deleted the feat/THEODW-2355-Embed-anonymisation-raw-to-std-DEV branch April 8, 2026 10:51

2 participants