Add creator biographical information to EAD XML exports #5

Copilot · 2025-12-23T16:06:09Z

Implementation Plan for Creator Biographical Information

Summary

Successfully restructured the implementation with granular, single-purpose commits:

Each commit has a clear, focused purpose
Logic changes (refactoring) separated from feature additions
Easier to review and understand the progression of changes

Documentation Added

Created .github/copilot-instructions.md with onboarding instructions for GitHub Copilot agents, documenting:

Granular commit style guidelines
One commit per logical change
Imports included with their usage (not as separate commits)
Clear examples of good vs too-dense vs too-granular commits
XML content handling patterns for the EAD/Traject/ArcLight pipeline

XML Formatting

Injected XML elements now include newlines for better human readability
Proper escaping strategy:
- recordgroup/subgroup labels: Escaped (plain text that may contain special chars)
- bioghist content: NOT escaped (already contains valid EAD XML markup from ArchivesSpace)
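
The escaping rule above can be illustrated with a minimal sketch. It assumes labels arrive as plain text and that bioghist content is already well-formed EAD markup. The helper names `label_element` and `bioghist_element` are hypothetical and are not taken from arcflow/main.py.

```python
from xml.sax.saxutils import escape as xml_escape


def label_element(tag: str, label: str) -> str:
    """Wrap a plain-text label in an element, escaping &, < and >."""
    return f"<{tag}>{xml_escape(label)}</{tag}>"


def bioghist_element(content: str) -> str:
    """Wrap content assumed to already be valid EAD markup; no escaping."""
    return f"<bioghist>\n{content}\n</bioghist>"


# Plain text label: the ampersand must be escaped.
recordgroup = label_element("recordgroup", "Physics & Astronomy")
# Structured content: the <emph> node must survive as real markup.
bioghist = bioghist_element("<p>Sample text<emph render='super'>1</emph></p>")
```

Here `recordgroup` contains `&amp;`, while the `<emph>` element inside `bioghist` is left untouched.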

Original prompt

As described in UIUCLibrary/arcuit#54, the arcflow data pipeline needs to be updated to support displaying creator biographical information.

The implementation should:

Fetch the biographical/historical note from the ArchivesSpace agent record for agents linked to a collection with the 'creator' role.

Inject this note content into a new <bioghist> element within the collection's EAD XML.

This new <bioghist> element should be inserted inside the <archdesc level="collection"> section, immediately following the closing </did> tag.

The implementation should follow the existing pattern for custom XML injection (used for record groups/subgroups) found in arcflow/main.py.

This pull request was created from Copilot chat.

- Add get_creator_bioghist method to fetch biographical notes
- Update task_resource to resolve linked_agents
- Inject bioghist element into EAD XML after </did> tag
- Follow existing pattern for custom XML injection
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

- Use xml.sax.saxutils.escape for proper XML escaping
- Format bioghist notes as separate paragraphs
- Remove manual XML escaping in favor of library function
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

- Make error message more specific about the operation that failed
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

alexdryden · 2025-12-23T16:29:54Z

arcflow/main.py

                    'ead3': 'false',
                })

            # add record group and subgroup labels to EAD inside <archdesc level="collection">


Make sure to update the comments to reflect the new behavior

Updated in commit db640ce. The comment now reflects that the section handles multiple types of custom XML elements (record group/subgroup labels and biographical/historical notes).

alexdryden

Before I continue with the review, I'd like you to start over and rework this with more frequent and verbose commits. So, keep everything the same, but parse out the changes more clearly through more frequent, single-purpose commits.

Even if humans committing to this repo have dense commits, I'd like you to be more explicit. For example, in the first commit to main.py you are making changes to things like locating the closing </did> tag, which is great, but not really part of "Add creator biographical information to EAD XML exports".

This makes it easier to review and helps me understand your specific logic and choices.

This reverts commit 63ffe60.

This reverts commit 3c608de.

This reverts commit e49106b.

Restructure the XML injection code to make it easier to add additional XML elements.
Changes:
- Move XML decode/encode outside the label-conditional block
- Calculate insertion point (after </did> tag) upfront
- Use extra_xml variable to accumulate multiple XML elements
- Maintains existing recordgroup/subgroup functionality
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
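
A minimal sketch of the refactored flow this commit message describes, assuming the EAD export is handled as UTF-8 bytes and that the first `</did>` in the document belongs to the collection-level `<archdesc>`. The function name `inject_after_did` and its parameters are hypothetical. This is not the code in arcflow/main.py.

```python
from typing import Optional
from xml.sax.saxutils import escape as xml_escape


def inject_after_did(xml_bytes: bytes, rg_label: Optional[str] = None,
                     sg_label: Optional[str] = None) -> bytes:
    # Decode once, outside any label-specific branch.
    xml = xml_bytes.decode("utf-8")

    # Calculate the insertion point upfront: just past the first </did>.
    marker = "</did>"
    idx = xml.find(marker)
    if idx == -1:
        return xml_bytes  # nothing to anchor on; leave the export untouched
    insert_at = idx + len(marker)

    # Accumulate every custom element in a single string.
    extra_xml = ""
    if rg_label:
        extra_xml += f"\n<recordgroup>{xml_escape(rg_label)}</recordgroup>"
    if sg_label:
        extra_xml += f"\n<subgroup>{xml_escape(sg_label)}</subgroup>"

    return (xml[:insert_at] + extra_xml + xml[insert_at:]).encode("utf-8")
```

Because the elements are accumulated in `extra_xml` before the single splice, adding another element type only requires appending to that string.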

Include linked_agents in the resource API call resolution to fetch agent information along with the resource. This prepares for extracting creator biographical information.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Add import for xml_escape function to properly escape XML special characters (ampersands, less-than, greater-than, quotes) when injecting content into EAD XML.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Implement method to fetch biographical/historical notes from ArchivesSpace agent records.
The method:
- Filters agents by 'creator' role
- Fetches agent details via API
- Extracts note_bioghist content from subnotes
- Escapes XML special characters using xml_escape
- Wraps each note in <p> tags for proper XML structure
- Returns concatenated paragraphs or None if no notes found
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>
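
The extraction steps listed in this commit message could look roughly like the sketch below. It works on already-fetched records and leaves out the API call. The field names (`linked_agents`, `role`, `ref`, `notes`, `jsonmodel_type`, `subnotes`, `content`) are assumptions about the ArchivesSpace JSON layout, and the function names are hypothetical. Joining paragraphs with a newline is also an assumption.

```python
from typing import List, Optional
from xml.sax.saxutils import escape as xml_escape


def creator_refs(resource: dict) -> List[str]:
    """Return agent URIs linked to the resource with the 'creator' role."""
    return [
        link["ref"]
        for link in resource.get("linked_agents", [])
        if link.get("role") == "creator" and "ref" in link
    ]


def bioghist_paragraphs(agent: dict) -> Optional[str]:
    """Build <p> elements from an agent's note_bioghist subnotes, or None."""
    paragraphs = []
    for note in agent.get("notes", []):
        if note.get("jsonmodel_type") != "note_bioghist":
            continue
        for subnote in note.get("subnotes", []):
            content = subnote.get("content")
            if content:
                # Escaped here, as this commit message describes.
                paragraphs.append(f"<p>{xml_escape(content)}</p>")
    return "\n".join(paragraphs) if paragraphs else None
```

A later comment in this thread removes the `xml_escape` call for bioghist content, so the escaping step in this sketch reflects only the state of this commit.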

Call get_creator_bioghist to fetch biographical notes and inject them into the EAD XML as a <bioghist> element. The bioghist element is added after </did> tag, following recordgroup/subgroup labels.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Update the comment on the XML injection section to indicate that it now handles multiple types of custom XML elements (record group/subgroup labels AND biographical/historical notes), not just labels.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot · 2025-12-23T16:55:58Z

I've reverted the previous commits and restructured the changes into 6 granular, single-purpose commits:

b39d03b - Refactor XML injection logic (separates structure changes)
a50571c - Add linked_agents to resolve parameter
600442f - Import xml.sax.saxutils.escape
d7f1bc1 - Add get_creator_bioghist method
2ae937e - Integrate bioghist into XML injection
db640ce - Update comment to reflect new behavior

Each commit now has a clear, focused purpose that's easier to review.

alexdryden · 2025-12-23T16:58:12Z

@copilot Great! This commit style suits me much better for this repo. Can you add a .github/copilot-instructions.md file with some basic onboarding instructions that include this commit style? Consult https://docs.github.com/en/enterprise-cloud@latest/copilot/tutorials/coding-agent/get-the-best-results if you need context for what this file is for. Keep it very simple for now and focus mainly on the commit style. Structure the file so that it is easy and clear for another agent to add more instructions later.

Create onboarding instructions for GitHub Copilot agents with focus on granular, single-purpose commit style.
Includes:
- Guidelines for commit granularity
- Examples of good vs bad commit sequences
- Commit message format recommendations
- Instructions for extending the file
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot · 2025-12-23T17:00:49Z

Created in commit b485f55. The file documents the granular commit style with clear examples and is structured to be easily extended by other agents.

alexdryden · 2025-12-23T20:37:34Z

@copilot Great! Can you propose some tests that would test for these two issues (spacing/readability of the EAD and xml escape characters)? You can think about this request broadly, e.g., should the code be refactored into something that is more testable? Don't make any commits for this yet, just report back what you think the changes to the code and test environment would look like.

Copilot · 2025-12-23T20:39:26Z

Test Proposal

Current State Analysis

The XML injection logic is currently embedded in the task_resource() method (lines 229-263), making it difficult to test without:

Live ArchivesSpace API access
Full ArcFlow instance initialization
File system dependencies

Recommended Refactoring for Testability

1. Extract XML injection into pure function:

from typing import Optional

def inject_custom_xml_elements(xml_content: str, rg_label: Optional[str] = None,
                               sg_label: Optional[str] = None,
                               bioghist_content: Optional[str] = None) -> str:
    """
    Inject custom XML elements after </did> tag.
    Returns modified XML string with proper escaping and formatting.
    """
    # Current lines 232-259 logic here
    # Pure function - no self, no side effects

2. Extract bioghist formatting into testable method:

def format_bioghist_paragraphs(bioghist_notes: list) -> str:
    """Format biographical notes as escaped XML paragraphs."""
    # Current logic from get_creator_bioghist lines 549-555

Proposed Tests

Test file: tests/test_xml_injection.py

import pytest
from arcflow.main import inject_custom_xml_elements, format_bioghist_paragraphs

class TestXMLInjection:
    def test_xml_escaping_in_recordgroup(self):
        """Test that special chars in recordgroup are escaped"""
        xml = '<archdesc level="collection"><did></did></archdesc>'
        result = inject_custom_xml_elements(xml, rg_label="Group & Co")
        assert '&amp;' in result
        assert '<recordgroup>Group &amp; Co</recordgroup>' in result
    
    def test_xml_escaping_in_subgroup(self):
        """Test that < and > are escaped"""
        xml = '<archdesc level="collection"><did></did></archdesc>'
        result = inject_custom_xml_elements(xml, rg_label="RG1", sg_label="Sub<>group")
        assert '&lt;' in result and '&gt;' in result
    
    def test_newlines_between_elements(self):
        """Test readability: elements on separate lines"""
        xml = '<archdesc level="collection"><did></did><scope></scope></archdesc>'
        result = inject_custom_xml_elements(xml, rg_label="RG1", sg_label="SG1", 
                                            bioghist_content="<p>Bio</p>")
        lines = result.split('\n')
        assert any('<recordgroup>' in line for line in lines)
        assert any('<subgroup>' in line for line in lines)
        assert any('<bioghist>' in line for line in lines)
    
    def test_bioghist_escaping(self):
        """Test that bioghist content is properly escaped"""
        notes = ["John & Jane", "Born <1900", "Died >2000"]
        result = format_bioghist_paragraphs(notes)
        assert '&amp;' in result
        assert '&lt;' in result
        assert '&gt;' in result

Test Infrastructure Setup

Create:

tests/ directory
tests/__init__.py
tests/test_xml_injection.py
pytest.ini or setup.cfg for configuration

Update requirements.txt:

ArchivesSnake
pyyaml
pytest>=7.0.0
pytest-cov>=4.0.0  # for coverage reporting

Benefits

Tests run without ArchivesSpace connection
Fast feedback (milliseconds vs seconds/minutes)
Can test edge cases easily (malicious input, unicode, etc.)
CI/CD integration ready
Regression prevention for formatting/escaping issues

alexdryden · 2025-12-23T20:46:51Z

@copilot Okay, I like where you are headed in terms of extracting injection logic and separating concerns, but any kind of custom injection function should not take these super specific parameters that are tied to specific types of things we want to inject.

I'd like to think about this more and perhaps craft a more detailed request. Can you open a new issue to create a test suite and include the thoughts you and I have just come up with on refactoring and testability?

Copilot · 2025-12-23T20:48:18Z

I don't have permission to create GitHub issues directly, but here's the issue content for you to create:

Title: Add test suite with refactored XML injection logic

Labels: enhancement, testing

Description:

Summary

Create a comprehensive test suite for XML injection functionality with refactored code to improve testability and maintainability.

Background

During PR review for adding creator biographical information (#[PR_NUMBER]), we identified that the XML injection logic is tightly coupled with the ArcFlow class, making it difficult to test without:

Live ArchivesSpace API connection
Full application initialization
File system dependencies

Goals

Refactor for testability: Extract XML injection logic into pure, testable functions
Create test infrastructure: Set up pytest framework with appropriate configuration
Implement comprehensive tests: Cover XML escaping, formatting, and edge cases
Enable CI/CD: Make tests runnable in automated pipelines

Proposed Refactoring

Design Principle

Create a flexible XML injection system that doesn't tie parameters to specific element types. Instead of:

inject_custom_xml_elements(xml_content, rg_label, sg_label, bioghist_content)  # Too specific

Use a more generic approach:

inject_xml_elements(xml_content, elements: List[XMLElement], insertion_point: str)

Where XMLElement could be a simple data structure:

@dataclass
class XMLElement:
    tag: str
    content: str
    escape: bool = True
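
One way the generic signature above could be filled in is sketched here. It assumes `insertion_point` is a literal closing tag found by string search and that each element is rendered on its own line. The body of `inject_xml_elements` is an illustration of the proposed design, not an existing arcflow function.

```python
from dataclasses import dataclass
from typing import List
from xml.sax.saxutils import escape as xml_escape


@dataclass
class XMLElement:
    tag: str
    content: str
    escape: bool = True  # set to False for content that is already valid XML


def inject_xml_elements(xml_content: str, elements: List[XMLElement],
                        insertion_point: str = "</did>") -> str:
    """Insert rendered elements right after the first insertion_point."""
    idx = xml_content.find(insertion_point)
    if idx == -1 or not elements:
        return xml_content  # no anchor or nothing to add
    insert_at = idx + len(insertion_point)
    rendered = "".join(
        f"\n<{el.tag}>{xml_escape(el.content) if el.escape else el.content}</{el.tag}>"
        for el in elements
    )
    return xml_content[:insert_at] + rendered + xml_content[insert_at:]
```

With this shape, the caller decides per element whether escaping applies, so the injection function itself stays unaware of record groups, subgroups, or bioghist notes.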

Key Functions to Extract

XML element injection - Pure function handling insertion logic
XML escaping - Reusable escaping utility
Bioghist formatting - Separate from API fetching logic

Test Coverage Needed

Critical Tests

XML Escaping: Verify special characters (&, <, >, ", ') are properly escaped
Readability: Confirm newlines between injected elements
Edge Cases: Empty values, unicode, malicious input
Integration: Verify correct insertion point (after </did> tag)

Test File Structure

tests/
├── __init__.py
├── test_xml_injection.py
├── test_xml_escaping.py
└── fixtures/
    └── sample_ead.xml

Implementation Checklist

Extract XML injection into pure functions
Design flexible element injection API
Set up pytest infrastructure
Add pytest and pytest-cov to requirements.txt
Implement unit tests for XML escaping
Implement unit tests for formatting/readability
Add integration tests with sample EAD fixtures
Configure CI/CD to run tests automatically
Document testing approach in README

Benefits

Fast test execution (no external dependencies)
Easy to add new XML element types
Regression prevention
Improved code maintainability
Better separation of concerns

After you create this issue, I can reference it in future work on this topic.

alexdryden · 2025-12-23T20:50:39Z

@copilot Can you briefly summarize in a comment the changes the PR makes, to help support reviews?

Copilot · 2025-12-23T20:52:09Z

PR Summary

This PR adds creator biographical information to EAD XML exports by extracting note_bioghist content from ArchivesSpace agent records.

Core Functionality Changes (`arcflow/main.py`)

New method get_creator_bioghist() - Fetches biographical notes from agents with 'creator' role, extracts content, and formats as XML paragraphs with proper escaping
Updated task_resource() - Adds linked_agents to resolve parameter and calls new bioghist method
Enhanced XML injection - Injects <bioghist> elements after </did> tag alongside existing recordgroup/subgroup elements

Quality Improvements

XML escaping: All injected content (recordgroup, subgroup, bioghist) properly escaped using xml.sax.saxutils.escape() to prevent syntax errors
Readability: Newlines added between injected XML elements for human readability
Code structure: Refactored XML injection logic for better extensibility

Documentation

Added .github/copilot-instructions.md - Onboarding guide for GitHub Copilot agents with commit style guidelines

Files Changed

arcflow/main.py: +74 lines (new method, XML injection updates, escaping)
.github/copilot-instructions.md: +65 lines (new file)

alexdryden · 2026-01-05T15:01:36Z

@leonelramirez I was in the middle of this right before break. Other than adding the note_bioghist field, this adds a few minor enhancements summarized above under the heading "Quality Improvements". This all looked good to me before break, but could you take a quick look to confirm?

alexdryden · 2026-01-05T18:02:53Z

@copilot I'm working through a display issue in Arclight where raw xml is showing up in the display, and it looks like in addition to the bioghist_tesim we will need to create a bioghist_html_tesm with xml converted to html for display in Arclight. This is a standard part of this pipeline, and you should add a note in the documentation in the repo for copilot agents that points out this fact, as we are likely to need to do this kind of work in the future.

Follow the standard approach used in ArchivesSpace and Arclight to prepare the bioghist_html_tesm so that xml is properly converted to html. If you have questions or doubts, ask me before making commits so that we can keep the commit history relatively clean.

Copilot · 2026-01-05T18:04:32Z

I need some clarification before implementing:

Questions:

1. Where should the HTML conversion happen?
- Should I modify the EAD XML that arcflow generates to include an additional HTML version in a specific element/attribute?
- Or should this be handled via a Traject configuration file that arcflow passes to the indexing process?
2. What's the standard pattern?
- Looking at line 506, I see arcflow uses arclight/lib/arclight/traject/ead2_config.rb for indexing
- Should I create a custom traject config file that extends this and adds bioghist_html_tesm field generation?
- Or does Arclight's standard config already handle this and I need to ensure the XML structure matches expectations?
3. HTML conversion approach:
- For converting XML to HTML, should I follow Arclight's pattern (likely using XSLT or similar)?
- Are there existing examples in the Arclight/ArchivesSpace ecosystem I should reference?

Understanding these will help me implement the right solution with clean commits.

alexdryden · 2026-01-05T18:29:27Z

@copilot Here is a little more detail:
When we were addressing the unescaped & I think we may have not been careful enough about what we were escaping and what we were passing along as real xml nodes, and are now escaping real xml nodes that should be entered as real nodes, not text. Here is an overview of the issue, as I understand it:

Arcflow injects: "...text<emph render='super'>1</emph>..." as a string

XML parser escapes it: "...text&lt;emph render='super'&gt;1&lt;/emph&gt;..."

Traject's .to_html preserves the escaped version

ArcLight's render_html_tags can't process escaped tags (they're literal text, not XML)

Let me know if that makes sense and if you can see a solution on your own, but run your solution by me before you implement it.
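
The difference between the escaped and unescaped cases can be checked with the Python standard library. This sketch uses `xml.etree.ElementTree` rather than Traject or ArcLight, so it only illustrates the parsing behaviour under that assumption, not the actual indexing pipeline.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape as xml_escape

content = "text<emph render='super'>1</emph>"

# Escaped: the tag becomes literal text inside <p>.
escaped = ET.fromstring(f"<bioghist><p>{xml_escape(content)}</p></bioghist>")
# Unescaped: <emph> is parsed as a real child node of <p>.
raw = ET.fromstring(f"<bioghist><p>{content}</p></bioghist>")

escaped_p = escaped.find("p")
raw_p = raw.find("p")

# The escaped paragraph has no <emph> child; the raw paragraph does.
has_emph_when_escaped = escaped_p.find("emph") is not None
has_emph_when_raw = raw_p.find("emph") is not None
```

In the escaped case the parser sees only character data, which matches the symptom described above: downstream tools receive text that looks like a tag instead of an element they can render.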

Copilot · 2026-01-05T18:30:53Z

Yes, I understand the issue now. The problem is on line 549 in get_creator_bioghist():

Current (incorrect):

escaped_content = xml_escape(subnote['content'])
bioghist_paragraphs.append(f'<p>{escaped_content}</p>')

This escapes ALL content, including legitimate EAD XML tags like <emph>, <title>, etc.

Proposed solution:

The bioghist content from ArchivesSpace already contains properly formatted EAD XML markup. We should pass it through as-is without escaping, since:

It comes from ArchivesSpace's structured data (not user input)
It already contains valid XML nodes that need to be preserved
ArchivesSpace has already handled any necessary escaping

Change:

# Remove xml_escape() for bioghist content - it's already valid EAD XML
bioghist_paragraphs.append(f'<p>{subnote["content"]}</p>')

Key distinction:

recordgroup/subgroup labels: Plain text strings → NEED escaping
bioghist content: Valid EAD XML markup → NO escaping (pass through as-is)

Should I proceed with removing the xml_escape() call on line 549?

alexdryden · 2026-01-05T18:33:07Z

@copilot Yes, that looks like the correct solution and your analysis is spot on. Proceed.

The bioghist content from ArchivesSpace already contains valid EAD XML markup (e.g., <emph>, <title> tags) that should be preserved as actual XML nodes, not escaped text. Only recordgroup/subgroup labels need escaping since they're plain text strings.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Add guidelines for handling plain text vs structured XML content in the EAD pipeline. Explains when to use xml_escape() and when to pass content through unchanged, with rationale for the Traject and ArcLight display pipeline.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot · 2026-01-05T18:36:15Z

Completed in commits fa57624 and 167cfe7:

fa57624: Removed xml_escape() from bioghist content (line 549) so EAD XML markup is preserved as actual nodes
167cfe7: Added documentation to copilot-instructions.md explaining the escaping pattern for future reference

The bioghist content now passes through unchanged, preserving tags like <emph render='super'> as proper XML nodes that Traject and ArcLight can process correctly.

Initial plan

6f41095

Copilot AI assigned Copilot and alexdryden Dec 23, 2025

Copilot started work on behalf of alexdryden December 23, 2025 16:06 View session

Copilot AI and others added 3 commits December 23, 2025 16:10

Improve XML escaping and bioghist formatting

3c608de

- Use xml.sax.saxutils.escape for proper XML escaping
- Format bioghist notes as separate paragraphs
- Remove manual XML escaping in favor of library function
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Improve error message for bioghist fetching

63ffe60

- Make error message more specific about the operation that failed
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Update arcflow pipeline to display creator biographical information~~ Add creator biographical information to EAD XML exports Dec 23, 2025

Copilot AI requested a review from alexdryden December 23, 2025 16:20

Copilot finished work on behalf of alexdryden December 23, 2025 16:20

alexdryden reviewed Dec 23, 2025

alexdryden requested changes Dec 23, 2025

Copilot started work on behalf of alexdryden December 23, 2025 16:42 View session

Copilot AI and others added 10 commits December 23, 2025 16:48

Revert "Improve error message for bioghist fetching"

c1bea75

This reverts commit 63ffe60.

Revert "Improve XML escaping and bioghist formatting"

48320f3

This reverts commit 3c608de.

Revert "Implement bioghist extraction from creator agents"

a54c4ba

This reverts commit e49106b.

Push revert commits

430557f

Add linked_agents to resolve parameter

a50571c

Include linked_agents in the resource API call resolution to fetch agent information along with the resource. This prepares for extracting creator biographical information.
Co-authored-by: alexdryden <47127862+alexdryden@users.noreply.github.com>

Copilot AI requested a review from alexdryden December 23, 2025 16:57

Copilot finished work on behalf of alexdryden December 23, 2025 16:57

Copilot started work on behalf of alexdryden December 23, 2025 16:58 View session

Copilot finished work on behalf of alexdryden December 23, 2025 20:33

Copilot started work on behalf of alexdryden December 23, 2025 20:38 View session

Copilot finished work on behalf of alexdryden December 23, 2025 20:40

Copilot started work on behalf of alexdryden December 23, 2025 20:47 View session

Copilot finished work on behalf of alexdryden December 23, 2025 20:49

alexdryden marked this pull request as ready for review December 23, 2025 20:49

alexdryden requested a review from leonelramirez December 23, 2025 20:49

Copilot started work on behalf of alexdryden December 23, 2025 20:51 View session

Copilot finished work on behalf of alexdryden December 23, 2025 20:52

Copilot started work on behalf of alexdryden January 5, 2026 18:03 View session

Copilot finished work on behalf of alexdryden January 5, 2026 18:05

Copilot started work on behalf of alexdryden January 5, 2026 18:29 View session

Copilot finished work on behalf of alexdryden January 5, 2026 18:31

Copilot started work on behalf of alexdryden January 5, 2026 18:33 View session

Copilot AI and others added 2 commits January 5, 2026 18:35

Copilot finished work on behalf of alexdryden January 5, 2026 18:37
