@jbutcher21
Contributor

Summary

  • Add tax IDs and LEI numbers to 6 organization records in reference.jsonl and reference.csv
  • Makes corporate registry demo data more realistic with proper identifiers

Changes

| Record | Company | Identifiers |
|--------|---------|-------------|
| 2012 | Hajah Maimunah | UEN: 201012345A (Singapore) |
| 2041 | Müllenkranz | VAT: DE123456789 (Germany) |
| 2071 | Universal Exports, USA | EIN: 84-1234567 |
| 2074 | Universal Exports Worldwide | EIN: 13-7654321, LEI: 5493001KJTIIGC8Y1R12 |
| 2141 | Universal Exports Singapore | UEN: 200912346B, LEI: 549300HKVFVS8QGCBR57 |
| 2151 | Universal Exports India | GSTIN: 32AABCU9603R1ZM |

Test plan

  • Loaded into Senzing and verified identifiers are recognized
  • Confirmed entity resolution still works correctly

Closes #51

Add tax IDs and LEI numbers to 6 organization records for more
realistic corporate registry data:
- Hajah Maimunah: UEN 201012345A (Singapore)
- Müllenkranz: VAT DE123456789 (Germany)
- Universal Exports USA: EIN 84-1234567
- Universal Exports Worldwide: EIN 13-7654321, LEI 5493001KJTIIGC8Y1R12
- Universal Exports Singapore: UEN 200912346B, LEI 549300HKVFVS8QGCBR57
- Universal Exports India: GSTIN 32AABCU9603R1ZM

Closes #51
@jbutcher21 jbutcher21 requested a review from a team as a code owner December 22, 2025 20:18
@github-actions

🤖 Claude Code Review

Code Review Analysis

Code Quality

✅ Code follows style guide

The changes add new columns to CSV and JSONL reference data files. Since these are data files (not code), the style guide for code doesn't directly apply. The structure and formatting are consistent with the existing file format.

✅ No commented-out code

No commented-out code present in the diff.

✅ Meaningful variable names

The new column names are descriptive and follow established naming conventions:

  • NATIONAL_ID_NUMBER, NATIONAL_ID_TYPE, NATIONAL_ID_COUNTRY
  • TAX_ID_NUMBER, TAX_ID_TYPE, TAX_ID_COUNTRY
  • LEI_NUMBER

✅ DRY principle followed

Data files - not applicable. Structure is consistent across records.

✅ Identify Defects

No critical defects found. The data additions are consistent and follow logical patterns:

  • Organization records receive appropriate identifiers (UEN, VAT, EIN, GSTIN, LEI)
  • Person records correctly have empty identifier fields
  • Identifier types match the countries (UEN for SG, EIN for US, VAT for DE, GSTIN for IN)
  • LEI numbers are only added to relevant international corporate entities (records 2074, 2141)

❌ Project memory configuration check

File: ./.claude/CLAUDE.md

Could not verify: whether this file exists in the repository still needs to be checked.

Action needed: Verify that ./.claude/CLAUDE.md (if present) doesn't contain local-specific configurations.

Testing

❌ Unit tests for new functions

Not applicable - These are data files, not code. However:

  • Missing: Tests to validate the new data schema (CSV header validation, JSONL schema validation)
  • Recommendation: Add validation tests to ensure:
    • All records have the correct number of columns (33 columns in CSV)
    • Organization records have valid identifier formats
    • Person records appropriately have empty identifier fields
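
A minimal sketch of what such a schema check might look like, using only the standard library. The identifier column names and the 33-column count come from this review; the `RECORD_TYPE` column and its `PERSON` value are assumptions made for illustration and may not match the actual files.

```python
import csv
import io

# Identifier columns named in this review.
IDENTIFIER_COLUMNS = [
    "NATIONAL_ID_NUMBER", "NATIONAL_ID_TYPE", "NATIONAL_ID_COUNTRY",
    "TAX_ID_NUMBER", "TAX_ID_TYPE", "TAX_ID_COUNTRY",
    "LEI_NUMBER",
]


def check_reference_csv(text, expected_columns=33, type_column="RECORD_TYPE"):
    """Return a list of problems found in CSV text (empty list means none).

    type_column and the "PERSON" value are hypothetical names.
    """
    problems = []
    reader = csv.reader(io.StringIO(text))
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    if len(header) != expected_columns:
        problems.append(
            f"header has {len(header)} columns, expected {expected_columns}"
        )
    missing = [c for c in IDENTIFIER_COLUMNS if c not in header]
    if missing:
        problems.append(f"header is missing {missing}")
    # Row 1 is the header, so data rows start at 2.
    for row_no, row in enumerate(reader, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_no}: {len(row)} fields, header has {len(header)}"
            )
            continue
        record = dict(zip(header, row))
        if record.get(type_column) == "PERSON":
            filled = [c for c in IDENTIFIER_COLUMNS if record.get(c)]
            if filled:
                problems.append(f"row {row_no}: person record has {filled}")
    return problems
```

Returning a list of messages rather than raising on the first failure lets a test report every malformed row in one run.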

❌ Integration tests for new endpoints

Not applicable - no new endpoints.

❌ Edge cases covered

Missing: No validation for identifier format correctness:

  • LEI numbers should be 20 characters (record 2074 has "5493001KJTIIGC8Y1R12" - 20 chars ✅, record 2141 has "549300HKVFVS8QGCBR57" - 20 chars ✅)
  • EIN format validation (XX-XXXXXXX pattern appears correct)
  • UEN format validation for Singapore
  • GSTIN format validation for India (15 characters)
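
The format checks listed above could be sketched as shape-only regular expressions. This is an illustration under stated assumptions, not a complete validator: the type keys (`VAT_DE`, `UEN_LOCAL_COMPANY`, and so on) are hypothetical names, the UEN pattern covers only the 10-character shape used by the two values in this PR, and none of the patterns verify check digits or registry existence.

```python
import re

# Shape-only patterns. They confirm length and character classes,
# not check digits and not whether the identifier is actually registered.
SHAPES = {
    # ISO 17442: 18 alphanumeric characters followed by 2 check digits.
    "LEI": re.compile(r"[A-Z0-9]{18}[0-9]{2}"),
    # US EIN written as XX-XXXXXXX.
    "EIN": re.compile(r"[0-9]{2}-[0-9]{7}"),
    # Assumption: only the 9-digits-plus-letter shape seen in this PR;
    # other UEN layouts exist and would be rejected here.
    "UEN_LOCAL_COMPANY": re.compile(r"[0-9]{9}[A-Z]"),
    # 15 characters: 2-digit state code, 10-character PAN, 3 trailing characters.
    "GSTIN": re.compile(r"[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][A-Z0-9]{3}"),
    # German VAT ID: "DE" followed by 9 digits.
    "VAT_DE": re.compile(r"DE[0-9]{9}"),
}


def has_expected_shape(id_type, value):
    """Return True if value matches the shape registered for id_type."""
    pattern = SHAPES.get(id_type)
    if pattern is None:
        raise ValueError(f"no shape registered for {id_type!r}")
    return pattern.fullmatch(value) is not None
```

Using `fullmatch` rather than `match` means trailing characters cause a rejection, so a 21-character string is not accepted as an LEI.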

❌ Test coverage > 80%

Cannot measure - data files. No test files updated or added.

Documentation

❓ Readme updated if needed

Needs verification: Check if README.md or data dictionary documents the schema. The new columns should be documented:

Files to check:

  • README.md or truthsets/demo/README.md
  • Any schema documentation files

Action needed: Ensure documentation describes:

  • Purpose of new identifier columns
  • Valid values for *_ID_TYPE fields
  • Expected formats for identifier numbers

✅ API docs updated

Not applicable - no API changes.

✅ Inline comments for complex logic

Not applicable - data files.

❓ CHANGELOG.md updated

File: CHANGELOG.md (root or truthsets/demo/)

Action needed: Verify that CHANGELOG documents this schema enhancement:

### Added
- Organization identifier fields: NATIONAL_ID_NUMBER, NATIONAL_ID_TYPE, NATIONAL_ID_COUNTRY
- Tax identifier fields: TAX_ID_NUMBER, TAX_ID_TYPE, TAX_ID_COUNTRY
- Legal Entity Identifier field: LEI_NUMBER

✅ Markdown files follow CommonMark

Not applicable to this diff - no markdown files modified.

Security

✅ No hardcoded credentials

No credentials detected.

⚠️ Input validation implemented

Observation: While these are reference data files, systems consuming this data should validate:

  • Identifier formats match expected patterns for each type
  • Country codes are valid ISO codes
  • Identifier types are from a controlled vocabulary

Recommendation: Document expected validation rules for downstream systems.
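
As an illustration of what a downstream consumer might do, here is a rough sketch of a controlled-vocabulary check over JSONL records. The type-to-country mapping is inferred from the identifiers listed in this PR, the two-letter country codes are an assumption about how the files encode countries, and the rule that number, type, and country must be populated together is likewise assumed rather than documented.

```python
import json

# Assumption: vocabulary inferred from this PR, with ISO 3166-1 alpha-2
# country codes. The real files may encode types or countries differently.
COUNTRIES_BY_TYPE = {
    "UEN": {"SG"},
    "VAT": {"DE"},
    "EIN": {"US"},
    "GSTIN": {"IN"},
}


def check_identifier_fields(record):
    """Return a list of problems for one parsed record (a dict)."""
    problems = []
    for prefix in ("NATIONAL_ID", "TAX_ID"):
        number = record.get(f"{prefix}_NUMBER") or ""
        id_type = record.get(f"{prefix}_TYPE") or ""
        country = record.get(f"{prefix}_COUNTRY") or ""
        if not (number or id_type or country):
            continue  # this identifier group is not populated
        if not (number and id_type and country):
            problems.append(
                f"{prefix}: number, type and country must be set together"
            )
            continue
        allowed = COUNTRIES_BY_TYPE.get(id_type)
        if allowed is None:
            problems.append(f"{prefix}: unknown type {id_type!r}")
        elif country not in allowed:
            problems.append(
                f"{prefix}: {id_type} not expected for country {country!r}"
            )
    return problems


def check_jsonl(text):
    """Map 1-based line numbers to problems, skipping blank and clean lines."""
    results = {}
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line.strip():
            problems = check_identifier_fields(json.loads(line))
            if problems:
                results[line_no] = problems
    return results
```

Keeping the vocabulary in one dictionary means a new identifier type can be allowed by adding a single entry.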

✅ Proper error handling

Not applicable - data files.

✅ No sensitive data in logs

Not applicable.

✅ No license files (.lic) or AQAAAD strings

No license files or suspicious encoded strings detected.


Summary

Critical Issues: 0

High Priority:

  1. Documentation: Verify that CHANGELOG.md and README.md document the new schema fields
  2. Testing: Add schema validation tests for the new columns

Medium Priority:

  1. Consider adding format validation rules for identifiers (LEI, EIN, UEN, GSTIN, VAT)
  2. Verify .claude/CLAUDE.md doesn't contain local-specific configuration

Low Priority:

None

Overall Assessment: ✅ APPROVED with recommendations

The changes are clean, consistent, and add valuable organization identifier data to the reference dataset. The new fields follow logical naming conventions and are populated appropriately based on record type and jurisdiction. Address the documentation and testing recommendations before merging.

Automated code review analyzing defects and coding standards

@jbutcher21 jbutcher21 requested a review from antaenc December 23, 2025 05:33
Development

Successfully merging this pull request may close these issues.

Add organization identifiers to demo reference data
