Summary
Currently, pqg has good query support for the narrow format (the PQG class plus TypedEdgeQueries), but the wide format only has validation utilities. Users must write raw SQL/DuckDB queries to work with wide-format files.
This issue proposes a unified ISamplesGraph class that auto-detects the file format and provides a common query interface for both.
Current State
Narrow format support:
- PQG class: getNode(), getRelations(), getNodeIds(), etc.
- TypedEdgeQueries: get_edges_by_type(), get_edge_type_statistics(), etc.
Wide format support:
- get_schema_from_parquet(): Auto-detects format
- WideSchemaValidator: Validates schema
- No query APIs
Proposal
New Class: ISamplesGraph
from pqg import ISamplesGraph, ISamplesEdgeType
# Auto-detects format
graph = ISamplesGraph('path/to/file.parquet')
print(f"Format: {graph.format.value}") # "narrow" or "wide"
# Same API works for both formats
entities = graph.count_by_type()
relations = graph.count_relations_by_type()
# Query by edge type
produced_by = graph.get_relations(
    edge_type=ISamplesEdgeType.MSR_PRODUCED_BY,
    limit=100
)
# Multi-hop traversal
locations = graph.traverse(
    start_pid="some_sample_pid",
    edge_types=[
        ISamplesEdgeType.MSR_PRODUCED_BY,
        ISamplesEdgeType.EVENT_SAMPLE_LOCATION
    ]
)

Proposed Methods
# Interface stub only; SchemaFormat and ISamplesEdgeType come from pqg.
from __future__ import annotations
from typing import List, Optional
import pandas as pd

class ISamplesGraph:
    def __init__(self, parquet_path: str): ...

    # Properties
    @property
    def format(self) -> SchemaFormat: ...
    @property
    def is_narrow(self) -> bool: ...
    @property
    def is_wide(self) -> bool: ...

    # Entity queries (same for both formats)
    def get_entities(self, otype: str, limit: Optional[int] = None) -> pd.DataFrame: ...
    def get_entity(self, pid: str) -> dict: ...
    def count_by_type(self) -> pd.DataFrame: ...

    # Relationship queries (different implementations internally)
    def get_relations(self, subject_type: Optional[str] = None,
                      edge_type: Optional[ISamplesEdgeType] = None,
                      limit: Optional[int] = None) -> pd.DataFrame: ...
    def count_relations_by_type(self) -> pd.DataFrame: ...

    # Traversal (multi-hop)
    def traverse(self, start_pid: str,
                 edge_types: List[ISamplesEdgeType]) -> pd.DataFrame: ...

Internal Implementation
Narrow format: Delegate to existing PQG + TypedEdgeQueries
Wide format: Generate DuckDB queries using edge type knowledge:
- Map ISamplesEdgeType to p__* column names
- Filter by otype to disambiguate shared predicates (e.g., responsibility)
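To make the wide-format idea concrete, here is a minimal sketch of how an edge type could be turned into a DuckDB query. Everything specific in it is an assumption for illustration, not the actual pqg schema: the stand-in enum, the `WIDE_EDGE_COLUMNS` table, the `p__produced_by` / `p__sample_location` column names, the `MaterialSampleRecord` / `SamplingEvent` otype values, and the `build_wide_relations_query` helper are all hypothetical.

```python
from enum import Enum
from typing import List, Optional, Tuple


class ISamplesEdgeType(Enum):
    """Stand-in for the proposed enum; only the two members used above."""
    MSR_PRODUCED_BY = "msr_produced_by"
    EVENT_SAMPLE_LOCATION = "event_sample_location"


# ASSUMPTION: each edge type maps to (subject otype, wide-format column).
# The real column names and otype values must come from the wide schema.
WIDE_EDGE_COLUMNS = {
    ISamplesEdgeType.MSR_PRODUCED_BY: ("MaterialSampleRecord", "p__produced_by"),
    ISamplesEdgeType.EVENT_SAMPLE_LOCATION: ("SamplingEvent", "p__sample_location"),
}


def build_wide_relations_query(
    parquet_path: str,
    edge_type: ISamplesEdgeType,
    limit: Optional[int] = None,
) -> Tuple[str, List[object]]:
    """Return (sql, params) selecting subjects that carry the given edge.

    The otype filter is what disambiguates a predicate column shared by
    several subject types.
    """
    subject_otype, column = WIDE_EDGE_COLUMNS[edge_type]
    sql = (
        f'SELECT pid AS subject_pid, "{column}" AS targets '
        "FROM read_parquet(?) "
        f'WHERE otype = ? AND "{column}" IS NOT NULL'
    )
    params: List[object] = [parquet_path, subject_otype]
    if limit is not None:
        sql += " LIMIT ?"
        params.append(int(limit))
    return sql, params


sql, params = build_wide_relations_query(
    "path/to/file.parquet", ISamplesEdgeType.MSR_PRODUCED_BY, limit=100
)
print(sql)
print(params)
```

The returned pair is meant to be handed to a DuckDB connection's `execute(sql, params)`. The column name is interpolated only from the fixed lookup table, never from user input, while the path, otype, and limit are passed as bound parameters.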
Benefits
- User-friendly: No need to understand format differences
- Format-agnostic code: Same notebook/script works with either format
- Leverages existing work: Reuses PQG, TypedEdgeQueries, schema detection
- Future-proof: Can add EXPORT format support later
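The delegation described under Internal Implementation, which is what lets the same script run against either format, could be structured roughly as below. This is a sketch under stated assumptions: `detect_format`, `NarrowBackend`, `WideBackend`, and `GraphFacade` are hypothetical names, and the rule "a `p__*` column means wide" is a guess for illustration. The real detection would go through the existing get_schema_from_parquet(), whose logic is not shown in this issue.

```python
from enum import Enum
from typing import Iterable


class SchemaFormat(Enum):
    NARROW = "narrow"
    WIDE = "wide"


def detect_format(column_names: Iterable[str]) -> SchemaFormat:
    # ASSUMPTION: wide files carry at least one p__* column.
    if any(name.startswith("p__") for name in column_names):
        return SchemaFormat.WIDE
    return SchemaFormat.NARROW


class NarrowBackend:
    """Placeholder: would wrap the existing PQG + TypedEdgeQueries."""

    def count_relations_by_type(self) -> str:
        return "narrow: delegate to TypedEdgeQueries"


class WideBackend:
    """Placeholder: would run generated DuckDB queries over p__* columns."""

    def count_relations_by_type(self) -> str:
        return "wide: generated DuckDB query"


class GraphFacade:
    """Picks a backend once, then forwards every call to it."""

    def __init__(self, column_names: Iterable[str]):
        self.format = detect_format(column_names)
        self._backend = (
            WideBackend() if self.format is SchemaFormat.WIDE else NarrowBackend()
        )

    @property
    def is_wide(self) -> bool:
        return self.format is SchemaFormat.WIDE

    @property
    def is_narrow(self) -> bool:
        return self.format is SchemaFormat.NARROW

    def count_relations_by_type(self) -> str:
        return self._backend.count_relations_by_type()


wide = GraphFacade(["pid", "otype", "p__produced_by"])
narrow = GraphFacade(["pid", "otype", "s", "p", "o"])
print(wide.format.value, wide.count_relations_by_type())
print(narrow.format.value, narrow.count_relations_by_type())
```

Keeping the format decision in the constructor means the public methods contain no format branching, so adding another backend later (for example an EXPORT format) would only touch the backend selection.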
Related
- Issue #10: Add human-readable labels for IdentifiedConcept URIs in Wide/Narrow formats
- Existing format detection: pqg/schemas/base.py:get_schema_from_parquet()