Add unified ISamplesGraph API for narrow and wide formats #11

@rdhyee

Summary

pqg currently has good query support for the narrow format (the PQG class plus TypedEdgeQueries), but for the wide format it offers only schema detection and validation utilities. To work with wide-format files, users must write raw DuckDB SQL queries.

This issue proposes a unified ISamplesGraph class that auto-detects the file format and provides a common query interface for both narrow and wide files.

Current State

Narrow format support:

  • PQG class: getNode(), getRelations(), getNodeIds(), etc.
  • TypedEdgeQueries: get_edges_by_type(), get_edge_type_statistics(), etc.

Wide format support:

  • get_schema_from_parquet(): Auto-detects format
  • WideSchemaValidator: Validates schema
  • No query APIs
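
For illustration, format detection of this kind can be approximated from column names alone. The sketch below is not how get_schema_from_parquet() works internally; it assumes that wide files carry p__* columns and that narrow files store edges as rows with s, p and o columns. The SchemaFormat enum and detect_format function here are hypothetical stand-ins.

```python
from enum import Enum
from typing import Iterable


class SchemaFormat(Enum):
    """Hypothetical stand-in for the format enum used in this proposal."""
    NARROW = "narrow"
    WIDE = "wide"


def detect_format(columns: Iterable[str]) -> SchemaFormat:
    """Guess the layout from column names alone.

    Assumption: wide files have one p__<predicate> column per edge type,
    while narrow files store edges as rows with s/p/o columns.
    """
    names = set(columns)
    if any(name.startswith("p__") for name in names):
        return SchemaFormat.WIDE
    if {"s", "p", "o"} <= names:
        return SchemaFormat.NARROW
    raise ValueError("columns match neither the narrow nor the wide layout")


# Column lists are made up for the example.
print(detect_format(["pid", "otype", "s", "p", "o"]).value)
print(detect_format(["pid", "otype", "p__produced_by"]).value)
```

A real implementation would read the column names from the Parquet file's schema rather than take them as an argument.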

Proposal

New Class: ISamplesGraph

from pqg import ISamplesGraph, ISamplesEdgeType

# Auto-detects format
graph = ISamplesGraph('path/to/file.parquet')
print(f"Format: {graph.format.value}")  # "narrow" or "wide"

# Same API works for both formats
entities = graph.count_by_type()
relations = graph.count_relations_by_type()

# Query by edge type
produced_by = graph.get_relations(
    edge_type=ISamplesEdgeType.MSR_PRODUCED_BY, 
    limit=100
)

# Multi-hop traversal
locations = graph.traverse(
    start_pid="some_sample_pid",
    edge_types=[
        ISamplesEdgeType.MSR_PRODUCED_BY, 
        ISamplesEdgeType.EVENT_SAMPLE_LOCATION
    ]
)

Proposed Methods

from __future__ import annotations  # annotations stay unevaluated, so this stub imports cleanly

from typing import List, Optional

import pandas as pd

# ISamplesEdgeType and SchemaFormat are the pqg types referenced in this proposal.


class ISamplesGraph:
    def __init__(self, parquet_path: str) -> None: ...

    # Properties
    @property
    def format(self) -> SchemaFormat: ...
    @property
    def is_narrow(self) -> bool: ...
    @property
    def is_wide(self) -> bool: ...

    # Entity queries (same for both formats)
    def get_entities(self, otype: str, limit: Optional[int] = None) -> pd.DataFrame: ...
    def get_entity(self, pid: str) -> dict: ...
    def count_by_type(self) -> pd.DataFrame: ...

    # Relationship queries (different implementations internally)
    def get_relations(self, subject_type: Optional[str] = None,
                      edge_type: Optional[ISamplesEdgeType] = None,
                      limit: Optional[int] = None) -> pd.DataFrame: ...
    def count_relations_by_type(self) -> pd.DataFrame: ...

    # Traversal (multi-hop)
    def traverse(self, start_pid: str,
                 edge_types: List[ISamplesEdgeType]) -> pd.DataFrame: ...
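
One possible reading of traverse() is that it follows the given edge types in order, one hop per entry, starting from start_pid. The sketch below illustrates that reading over an in-memory edge list. It is an assumption about the intended semantics, it returns a plain list of pids rather than a DataFrame, and the pids are made up.

```python
from typing import Dict, Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (subject pid, edge type name, object pid)


def traverse(edges: Iterable[Edge], start_pid: str, edge_types: List[str]) -> List[str]:
    """Follow edge_types in order from start_pid and return the pids reached."""
    # Index edges by (subject, edge type) so each hop is a dictionary lookup.
    index: Dict[Tuple[str, str], Set[str]] = {}
    for subject, edge_type, obj in edges:
        index.setdefault((subject, edge_type), set()).add(obj)

    frontier = {start_pid}
    for edge_type in edge_types:
        next_frontier: Set[str] = set()
        for pid in frontier:
            next_frontier |= index.get((pid, edge_type), set())
        frontier = next_frontier
    return sorted(frontier)


# Made-up pids, for illustration only.
edges = [
    ("sample-1", "MSR_PRODUCED_BY", "event-1"),
    ("event-1", "EVENT_SAMPLE_LOCATION", "location-1"),
    ("event-1", "EVENT_SAMPLE_LOCATION", "location-2"),
    ("sample-2", "MSR_PRODUCED_BY", "event-2"),
]
print(traverse(edges, "sample-1", ["MSR_PRODUCED_BY", "EVENT_SAMPLE_LOCATION"]))
# prints ['location-1', 'location-2']
```

In the wide format the same hop sequence would more likely become a chain of joins in DuckDB than a Python loop.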

Internal Implementation

Narrow format: Delegate to existing PQG + TypedEdgeQueries
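
The delegation could be structured as a thin facade that picks a backend once, at construction time. The following is a minimal sketch of that pattern only. GraphFacade, Backend and StubBackend are hypothetical names, and the stub returns canned counts instead of calling PQG or TypedEdgeQueries.

```python
from enum import Enum
from typing import Dict, Protocol


class SchemaFormat(Enum):
    NARROW = "narrow"
    WIDE = "wide"


class Backend(Protocol):
    """Interface every format-specific backend would implement."""
    def count_by_type(self) -> Dict[str, int]: ...


class StubBackend:
    """Stand-in backend that returns canned counts (no file access)."""

    def __init__(self, counts: Dict[str, int]) -> None:
        self._counts = dict(counts)

    def count_by_type(self) -> Dict[str, int]:
        return dict(self._counts)


class GraphFacade:
    """Chooses a backend by format, then forwards every query to it."""

    def __init__(self, fmt: SchemaFormat, backends: Dict[SchemaFormat, Backend]) -> None:
        self.format = fmt
        self._backend = backends[fmt]

    @property
    def is_narrow(self) -> bool:
        return self.format is SchemaFormat.NARROW

    @property
    def is_wide(self) -> bool:
        return self.format is SchemaFormat.WIDE

    def count_by_type(self) -> Dict[str, int]:
        return self._backend.count_by_type()


# In a real class the narrow entry would wrap PQG + TypedEdgeQueries.
backends = {
    SchemaFormat.NARROW: StubBackend({"MaterialSampleRecord": 2}),
    SchemaFormat.WIDE: StubBackend({"MaterialSampleRecord": 3}),
}
graph = GraphFacade(SchemaFormat.WIDE, backends)
print(graph.is_wide, graph.count_by_type())
```

Keeping the format check in one place means the public methods never branch on format themselves.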

Wide format: Generate DuckDB queries using edge type knowledge:

  • Map ISamplesEdgeType to p__* column names
  • Filter by otype to disambiguate shared predicates (e.g., responsibility)
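
As a rough sketch of the two points above, the wide-format path could keep a table from edge type to (subject otype, column) and build the DuckDB SQL from it. The enum members, otype values and p__* column names below are assumptions made for the example, not the actual pqg schema. The otype filter is what separates the two edge types that share a responsibility column.

```python
from enum import Enum
from typing import Dict, Optional, Tuple


class EdgeType(Enum):
    """Hypothetical subset standing in for ISamplesEdgeType."""
    MSR_PRODUCED_BY = "msr_produced_by"
    EVENT_SAMPLE_LOCATION = "event_sample_location"
    MSR_RESPONSIBILITY = "msr_responsibility"
    EVENT_RESPONSIBILITY = "event_responsibility"


# Assumed mapping: edge type -> (subject otype, wide-format column).
WIDE_COLUMNS: Dict[EdgeType, Tuple[str, str]] = {
    EdgeType.MSR_PRODUCED_BY: ("MaterialSampleRecord", "p__produced_by"),
    EdgeType.EVENT_SAMPLE_LOCATION: ("SamplingEvent", "p__sample_location"),
    EdgeType.MSR_RESPONSIBILITY: ("MaterialSampleRecord", "p__responsibility"),
    EdgeType.EVENT_RESPONSIBILITY: ("SamplingEvent", "p__responsibility"),
}


def wide_relations_sql(path: str, edge_type: EdgeType,
                       limit: Optional[int] = None) -> str:
    """Build a DuckDB query listing the relations of one edge type."""
    otype, column = WIDE_COLUMNS[edge_type]
    safe_path = path.replace("'", "''")  # escape quotes inside the SQL literal
    sql = (
        f"SELECT pid, {column} AS target "
        f"FROM read_parquet('{safe_path}') "
        f"WHERE otype = '{otype}' AND {column} IS NOT NULL"
    )
    if limit is not None:
        sql += f" LIMIT {int(limit)}"
    return sql


print(wide_relations_sql("path/to/file.parquet", EdgeType.MSR_PRODUCED_BY, limit=100))
print(wide_relations_sql("path/to/file.parquet", EdgeType.EVENT_RESPONSIBILITY))
```

The function only builds the query string. Running it would go through a DuckDB connection, and if the p__* columns hold lists of targets the query would also need to unnest them.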

Benefits

  1. User-friendly: No need to understand format differences
  2. Format-agnostic code: Same notebook/script works with either format
  3. Leverages existing work: Reuses PQG, TypedEdgeQueries, schema detection
  4. Future-proof: Can add EXPORT format support later

Related
