
feat: Capture parent element context in TreeScraper link extraction #20

@willgriffin


Problem

TreeScraper extracts links but loses hierarchical context. When a link says "View" but its parent element says "Council Minutes December 30, 2024", we lose that date information.

Current Behavior

interface Link {
  href: string;
  text: string;        // Just "View"
  title?: string;
  ariaLabel?: string;
  // ... no parent context
}

Proposed Enhancement

Extend the Link interface to capture parent context:

interface Link {
  href: string;
  text: string;
  title?: string;
  ariaLabel?: string;
  // NEW: Parent context
  parentText?: string;        // Immediate parent's text content
  ancestorTexts?: string[];   // Path of ancestor texts (e.g., ["2024", "Minutes", "December 30"])
  hierarchyLevel?: number;    // Depth in tree expansion
}

Implementation Notes

The code already constructs element paths in extractLinksWithTreeExpansion (lines 196-209 in tree.ts) but discards them. Changes needed:

  1. In extractLinks() (line 124), traverse up from each <a> to capture parent text
  2. In extractLinksWithTreeExpansion(), track which expansion iteration revealed each link
  3. Add parentText, ancestorTexts, and hierarchyLevel to the Link interface in types.ts
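
A minimal sketch of steps 1 and 2, under stated assumptions. The NodeLike type, captureParentContext, tagHierarchyLevels, and the maxDepth parameter are all hypothetical names used for illustration; none of them come from tree.ts. NodeLike stands in for a DOM Element so the sketch runs without a browser, and the per-iteration snapshots are an assumed shape for what extractLinksWithTreeExpansion() sees after each expansion pass.

```typescript
// Structural stand-in for a DOM Element (assumption: only these two
// properties are needed, and a real Element satisfies this shape).
interface NodeLike {
  textContent: string | null;
  parentElement: NodeLike | null;
}

// Trimmed-down Link with only the fields this sketch touches.
interface Link {
  href: string;
  text: string;
  parentText?: string;
  ancestorTexts?: string[];
  hierarchyLevel?: number;
}

// Collapse whitespace runs so text from nested markup compares cleanly.
function normalize(text: string | null): string {
  return (text ?? "").replace(/\s+/g, " ").trim();
}

// Step 1 (hypothetical helper): walk up from a link and collect the text
// of each ancestor. maxDepth bounds the walk so it stops well short of <body>.
function captureParentContext(
  link: NodeLike,
  maxDepth = 3,
): Pick<Link, "parentText" | "ancestorTexts"> {
  const nearestFirst: string[] = [];
  let node = link.parentElement;
  for (let depth = 0; node !== null && depth < maxDepth; depth++) {
    const text = normalize(node.textContent);
    if (text.length > 0) nearestFirst.push(text);
    node = node.parentElement;
  }
  if (nearestFirst.length === 0) return {};
  return {
    parentText: nearestFirst[0], // nearest non-empty ancestor
    ancestorTexts: nearestFirst.slice().reverse(), // outermost first
  };
}

// Step 2 (hypothetical helper): snapshots[i] holds the links visible after
// the i-th expansion pass. A link is tagged with the first pass that showed it.
function tagHierarchyLevels(snapshots: Link[][]): Link[] {
  const firstSeen = new Map<string, Link>();
  snapshots.forEach((links, level) => {
    for (const link of links) {
      if (!firstSeen.has(link.href)) {
        firstSeen.set(link.href, { ...link, hierarchyLevel: level });
      }
    }
  });
  return Array.from(firstSeen.values());
}
```

One caveat with this approach: an element's textContent includes the text of all its descendants, so the parent text here also contains the link's own text ("View") and outer ancestors repeat inner ones. A real implementation might subtract the link text or read only sibling text nodes.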

Use Case

Municipal sites like eckville.com have:

<div class="meeting-item">
  <span>Council Minutes December 30, 2024</span>
  <a href="/public/download/files/266576">View</a>
</div>

With parent context, praeco's parser can extract "December 30, 2024" from parentText instead of just seeing "View".
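
To illustrate the consumer side, here is a rough sketch of how a parser could fall back to parentText when the link text carries no date. praeco's actual parser is not shown in this issue, so extractDateText and DATE_PATTERN are hypothetical, and the pattern only handles the "Month D, YYYY" form from the example above.

```typescript
// Trimmed-down Link with only the fields this sketch reads.
interface Link {
  href: string;
  text: string;
  parentText?: string;
}

const MONTHS =
  "January|February|March|April|May|June|July|August|September|October|November|December";

// Matches dates written like "December 30, 2024".
const DATE_PATTERN = new RegExp(`\\b(${MONTHS})\\s+(\\d{1,2}),\\s+(\\d{4})\\b`);

// Hypothetical helper: prefer a date in the link's own text, then fall
// back to the parent context. Returns undefined when neither has one.
function extractDateText(link: Link): string | undefined {
  for (const candidate of [link.text, link.parentText]) {
    const match = candidate?.match(DATE_PATTERN);
    if (match) return match[0];
  }
  return undefined;
}

const link: Link = {
  href: "/public/download/files/266576",
  text: "View",
  parentText: "Council Minutes December 30, 2024",
};

console.log(extractDateText(link)); // "December 30, 2024"
```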

Backwards Compatibility

  • New fields are optional, so existing consumers keep working
  • Existing link extraction behavior is unchanged
  • The change only adds metadata to the Link objects
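
As a small illustration of why optional fields are safe, the sketch below shows a consumer written against the original fields handling both old-style and enriched Link objects. describeLink is a hypothetical consumer, not code from the project.

```typescript
interface Link {
  href: string;
  text: string;
  title?: string;
  ariaLabel?: string;
  // Proposed additions, all optional
  parentText?: string;
  ancestorTexts?: string[];
  hierarchyLevel?: number;
}

// Hypothetical pre-existing consumer: it reads only the original fields.
function describeLink(link: Link): string {
  return `${link.text} -> ${link.href}`;
}

// An object without the new fields still type-checks...
const legacy: Link = { href: "/public/download/files/266576", text: "View" };

// ...and an enriched object is handled identically by old code.
const enriched: Link = {
  ...legacy,
  parentText: "Council Minutes December 30, 2024",
  hierarchyLevel: 0,
};

console.log(describeLink(legacy) === describeLink(enriched)); // true
```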
