Feature/glob pattern support#1

Open
TheLazzziest wants to merge 23 commits into main from feature/glob-pattern-support

Conversation

@TheLazzziest
Collaborator

No description provided.

Collaborator Author

@TheLazzziest TheLazzziest left a comment

So far it looks great, but there is some room for improvement.

@TheLazzziest
Collaborator Author

  1. Move the code-generated package structure from fixtures to assets
  2. Add a mermaid diagram for a generic case
  3. Refactor the fixtures so they provision directories for the test cases instead of creating them
  4. Add E2E tests for different cases

@Houston56
Owner

Refactored test fixtures to use asset templates

  • Moved package structures to tests/assets/ (single_level, nested, multi_star, namespace)
  • Simplified fixtures to use shutil.copytree() instead of programmatic file creation
  • All fixtures now follow the same pattern as package_path fixture
  • Reduces code duplication and improves maintainability
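For readers following along, the copy-from-template idea could look roughly like the sketch below. The helper name provision_package and its parameters are hypothetical and are not taken from this PR; in a real test suite a pytest fixture would presumably call something like it with tmp_path as the destination.

```python
import shutil
from pathlib import Path


def provision_package(assets_dir: Path, template: str, dest_root: Path) -> Path:
    """Copy a ready-made package template into a scratch directory.

    `assets_dir` stands for a directory such as tests/assets/ and `template`
    for one of its subdirectories (for example "nested"). Both names are
    assumptions made for this illustration.
    """
    source = assets_dir / template
    if not source.is_dir():
        raise FileNotFoundError(f"unknown asset template: {source}")
    target = dest_root / template
    # copytree creates `target` itself and fails if it already exists,
    # which keeps every test run on a fresh copy.
    shutil.copytree(source, target)
    return target
```

Because the template is copied rather than generated, the expected layout can be read directly from the assets directory instead of from fixture code.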

Did not use mermaid_assert in test_get_graph_string_with_nested_glob_pattern because mermaid_assert validates the specific structure from package_path fixture (users/posts/comments tables), while this test uses nested_package_path which has a different structure (users/products/api_resources tables). The current assertion correctly verifies that glob pattern matching works by checking for expected tables.

Next steps if this looks good:

  1. Implement module search functionality for remaining patterns (nested glob patterns like example.*.*.models and multiple stars like example.*.api.*.models)
  2. Then add integration tests for these patterns

Collaborator Author

It looks much better now. Just don't forget about the glob functionality itself:

  • Matching any string: example.*.models
  • Matching a single character: example.fo?.models
  • Matching character groups:
    • Character classes: example.api.v[12].models
    • Character ranges: example.api.v[0-9].models
  • Complementation:
    • Character class: example.api.v[!1].models
    • Character range: example.api.v[!0-9].models
  • Pathnames:
    • Multiple packages deep: example.v?.*.*.models
    • Recursive package lookup: example.**.api.*.models
  • Errors:
    • Missing rule: example.v?..models
    • Invalid delimiter: example.v?,,models
As for tables, there is no need so far: they would be too time-consuming to implement, and we still don't know whether the maintainer will agree with this approach. So my suggestion is the following:

  1. Complete the test suite
  2. Implement the first two cases (any string, a single character)
  3. Open a draft PR asking whether this approach is OK, then continue the work if the maintainer says yes.

@Houston56
Owner

Thanks for the feedback! I've implemented full support for all the requested glob pattern cases:

  • Basic wildcard (*) - matches any string at a single level
  • Multiple wildcards (*.*, *.*.*) - matches multiple package levels
  • Single character matching (?) - matches exactly one character
  • Character classes with prefix (v[12]) - matches exactly one character from the set
  • Character ranges (v[0-9]) - matches exactly one character in the range
  • Negation/complementation ([!1], [!0-9]) - matches any character except those specified
  • Recursive patterns (**) - matches zero or more package levels at any depth
  • Mixed wildcards (v?.*.*.models) - supports combinations of different pattern types
  • Namespace packages (PEP 420) - handles packages with multiple __path__ entries
  • Comprehensive error handling - validates patterns and provides clear error messages for invalid inputs

The code has been refactored into a dedicated finders module with full test coverage. All tests are passing.
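As a rough illustration of the pattern validation mentioned in the list above, the sketch below rejects the two error cases raised earlier in the review (an empty segment and an invalid delimiter). The names validate_pattern and InvalidPatternError, and the exact set of allowed characters, are assumptions made for this example and may differ from what the finders module does.

```python
import re

# Characters allowed inside one segment: identifier characters plus the
# glob metacharacters discussed in this thread (assumed set).
_SEGMENT = re.compile(r"[A-Za-z0-9_*?\[\]!\-]+")


class InvalidPatternError(ValueError):
    """Raised when a glob pattern cannot be interpreted."""


def validate_pattern(pattern: str) -> list[str]:
    """Split a dotted glob into segments, raising on malformed input."""
    segments = pattern.split(".")
    for position, segment in enumerate(segments):
        if segment == "":
            # Covers "example.v?..models" as well as leading or trailing dots.
            raise InvalidPatternError(
                f"empty segment at position {position} in {pattern!r}"
            )
        if not _SEGMENT.fullmatch(segment):
            # Covers "example.v?,,models": a comma is not a valid delimiter.
            raise InvalidPatternError(
                f"invalid character in segment {segment!r} of {pattern!r}"
            )
        if segment.count("[") != segment.count("]"):
            raise InvalidPatternError(f"unbalanced brackets in segment {segment!r}")
        if "**" in segment and segment != "**":
            raise InvalidPatternError(f"'**' must be a whole segment, got {segment!r}")
    return segments
```

Validating once up front means the traversal code can assume every segment is non-empty and well formed.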

@TheLazzziest
Collaborator Author

Summary: Introducing a state machine

The current implementation treats module discovery as a graph traversal problem solved with a Breadth-First Search (BFS) using a state machine.

  1. The Mask (Pattern): Converted into a Linked List of GlobNode objects. This allows us to track our progress through the mask using object pointers rather than fragile integer indices.

  2. Execution Flow: We move through the filesystem and the pattern independently. A standard match moves both cursors forward. A ** match moves the filesystem cursor forward while keeping the pattern cursor stationary (consuming directories).

  • Type-Aware Matching:
    • Files: Matched by Stem (e.g., mask models matches models.py).
    • Directories: Matched by Name (e.g., mask models matches models/).
    • Namespace Support: Any directory with a valid Python identifier name (and not __pycache__) is treated as a potential package, complying with PEP 420.
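The three points above can be sketched in a few dozen lines. This is a simplified reconstruction from the description, not the code in this PR: the GlobNode fields, compile_mask, is_package_dir and find_modules are all assumed names, and a mask that ends in ** is not handled.

```python
from collections import deque
from dataclasses import dataclass
from fnmatch import fnmatchcase
from pathlib import Path
from typing import Iterator, Optional


@dataclass(frozen=True)
class GlobNode:
    """One segment of the mask; `next` points at the rest of the pattern."""
    mask: str
    next: Optional["GlobNode"] = None


def compile_mask(pattern: str) -> Optional[GlobNode]:
    """Build the linked list back to front so each node can reference its tail."""
    head: Optional[GlobNode] = None
    for mask in reversed(pattern.split(".")):
        head = GlobNode(mask, head)
    return head


def is_package_dir(entry: Path) -> bool:
    # PEP 420: any identifier-named directory may be a (namespace) package.
    return entry.is_dir() and entry.name.isidentifier() and entry.name != "__pycache__"


def find_modules(root: Path, pattern: str) -> Iterator[Path]:
    queue = deque([(root, compile_mask(pattern))])
    visited = set()  # (directory, remaining pattern) states already expanded
    while queue:
        directory, node = queue.popleft()
        if node is None or (directory, node) in visited:
            continue
        visited.add((directory, node))
        recursive = node.mask == "**"
        if recursive:
            # "**" may match zero levels: retry this directory with the next node.
            queue.append((directory, node.next))
        for entry in sorted(directory.iterdir()):
            if is_package_dir(entry):
                if recursive:
                    queue.append((entry, node))  # pattern cursor stays put
                elif fnmatchcase(entry.name, node.mask):
                    if node.next is None:
                        yield entry  # package matched by directory name
                    else:
                        queue.append((entry, node.next))  # both cursors advance
            elif entry.is_file() and entry.suffix == ".py":
                if node.next is None and not recursive and fnmatchcase(entry.stem, node.mask):
                    yield entry  # module matched by file stem
```

Because GlobNode is a frozen dataclass, two nodes with the same remaining pattern hash equally, so the visited set deduplicates states by "where am I, and what is left to match" rather than by an integer index.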

Tradeoffs

Advantages

  • Strict Correctness with **: The state machine approach is one of the few ways to correctly handle the non-determinism of ** (recursive wildcards) without getting stuck in infinite loops or missing overlapping paths.
  • Extensibility: Because the pattern is a Linked List, you can easily implement "Macros" later. For example, if you encounter a token @django_apps, you can dynamically inject a new chain of GlobNodes into the list at runtime without breaking the traverser.
  • OS Agnostic: Full use of pathlib ensures consistent behavior across Windows and POSIX systems.

Risks & Limitations

  • Memory Overhead (BFS): We store every unique state (Path, PatternNode) in the visited set. On massive filesystems with broad ** patterns, this set can grow significantly.

  • False Positives (Namespace Packages):

    • Issue: PEP 420 says any directory can be a package.
    • Risk: If your structure has a folder named media or templates (which are valid Python identifiers), the traverser will enter them and try to match modules inside.
    • Mitigation: You must rely on your glob mask being specific enough (e.g., apps.**.models) to avoid wandering into asset directories.
  • Stem Matching Ambiguity:

    • Issue: A mask of utils will match both utils.py and a package utils/.
    • Risk: If you have both in the same folder (bad practice, but possible), both will be returned. The consumer of this generator must decide which one to prioritize.

@TheLazzziest
Collaborator Author

A small recap of what has been done:

  1. Implement a state machine for searching and glob pattern matching
  2. Implement glob pattern validation rules
  3. Improve tests for the state machine
  4. Decouple searching logic from import execution
  5. Add tests for get_graph_string

Remaining scope of the work:

  1. Add mermaid chart parser
  2. Perform chart comparison between the expected chart and the one produced by the library
  3. Update mermaid_assert function to perform dynamic comparison
  4. Define what to do with namespace packages when we need to select a base model (both packages must define a base model)

@Houston56
Owner

Separate graph building from serialization and add dynamic graph comparison

This commit refactors the graph generation logic to separate graph building
from serialization, enabling direct MetaData comparison in tests without
parsing strings. It also adds support for namespace packages with multiple
base classes.

Changes

Core Refactoring

  1. Separated graph building and serialization

    • Added get_graph_metadata(): Returns MetaData object instead of string
    • Added serialize_metadata(): Serializes MetaData to string format
    • Refactored get_graph_string(): Now a thin wrapper combining the above

    This separation allows tests to compare graph structures directly using
    MetaData objects, making tests more robust and independent of serialization
    format.

  2. Added graph comparison functionality

    • Added compare_metadata(): Structurally compares two MetaData objects
    • Compares tables, columns, types, constraints, and foreign key relationships
    • Provides detailed error messages on mismatch

    This enables dynamic testing of graphs generated from glob patterns where
    the expected structure is not known in advance.

  3. Updated mermaid_assert() for dynamic comparison

    • Supports both legacy mode (string assertions) and dynamic mode (MetaData comparison)
    • Maintains backward compatibility with existing tests
    • Allows tests to work with dynamically discovered models

Namespace Packages Support

  1. Added support for multiple base classes in namespace packages

    • Added _find_base_classes_by_pattern(): Finds all base classes matching glob pattern
    • Added _merge_metadata(): Merges MetaData from multiple base classes
    • Handles table name conflicts by adding namespace prefixes
    • Re-merges metadata after model imports to capture all registered tables

    When using glob patterns like project*.example.base:Base, the system now:

    • Finds all matching base classes (e.g., project1.example.base:Base, project2.example.base:Base)
    • Merges their MetaData into a single unified graph
    • Resolves naming conflicts automatically
    • Ensures all models from all namespace packages are included

Test Fixes

  1. Fixed test_find_modules tests
    • Changed to_module_name() calls to use Path.cwd() instead of package_path
    • Reason: Test fixtures call os.chdir(), so Path.cwd() correctly reflects
      the working directory after the change. This matches the behavior in
      get_graph_metadata() which also uses Path.cwd().
    • ModuleFinder returns absolute paths, and Path.cwd() ensures correct
      relative path calculation
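The relative-path calculation described here can be illustrated as follows. The real signature of to_module_name is not shown in this thread, so the two-argument form below is an assumption made purely for the example.

```python
from pathlib import Path


def to_module_name(path: Path, root: Path) -> str:
    """Turn an absolute file or package path into a dotted module name.

    `root` plays the role of the working directory (Path.cwd() in the tests).
    """
    relative = path.relative_to(root)  # raises ValueError if path is outside root
    parts = relative.with_suffix("").parts  # drop ".py" from the last component
    if parts and parts[-1] == "__init__":
        parts = parts[:-1]  # a package is named after its directory
    return ".".join(parts)


print(to_module_name(Path("/srv/app/example/users/models.py"), Path("/srv/app")))
```

If the root passed in does not match the directory the absolute paths were produced from, relative_to raises ValueError, which is the kind of mismatch the switch to Path.cwd() avoids.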

Technical Details

Why separate serialization?

Previously, get_graph_string() did everything in one step:

  1. Import modules
  2. Build graph (get MetaData)
  3. Filter tables
  4. Serialize to string

This made it impossible to:

  • Compare two graphs structurally
  • Test with dynamic glob patterns
  • Reuse graph building logic

Now the flow is:

  1. get_graph_metadata() → Returns MetaData
  2. compare_metadata() → Compares MetaData objects
  3. serialize_metadata() → Converts MetaData to string
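To show why a structural comparison is easier to test against than a serialized string, here is a small sketch that uses plain dataclasses as a stand-in for MetaData. TableSpec, Graph and compare_graphs are hypothetical names; the actual compare_metadata() in this PR operates on MetaData objects and its signature is not shown in this thread.

```python
from dataclasses import dataclass, field


@dataclass
class TableSpec:
    """Stand-in for one table: column name -> type name, plus foreign key targets."""
    columns: dict[str, str]
    foreign_keys: set[str] = field(default_factory=set)


Graph = dict[str, TableSpec]  # stand-in for the table registry of a MetaData object


def compare_graphs(expected: Graph, actual: Graph) -> list[str]:
    """Return human-readable differences; an empty list means the graphs match."""
    problems = []
    for name in sorted(expected.keys() - actual.keys()):
        problems.append(f"missing table: {name}")
    for name in sorted(actual.keys() - expected.keys()):
        problems.append(f"unexpected table: {name}")
    for name in sorted(expected.keys() & actual.keys()):
        exp, act = expected[name], actual[name]
        if exp.columns != act.columns:
            problems.append(f"{name}: columns differ: {exp.columns} != {act.columns}")
        if exp.foreign_keys != act.foreign_keys:
            problems.append(
                f"{name}: foreign keys differ: "
                f"{sorted(exp.foreign_keys)} != {sorted(act.foreign_keys)}"
            )
    return problems
```

A test can then assert that the returned list is empty and, on failure, print the individual messages, which stays stable even if the serialized output format changes.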

Why re-merge metadata after imports?

When merging metadata from multiple namespace packages:

  • Initial merge happens before model imports (empty metadata)
  • Models register themselves in their original base class metadata
  • After imports, we re-merge to capture all newly registered tables

This ensures the final merged metadata contains all tables from all
namespace packages.
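The ordering problem can be reproduced with plain dictionaries standing in for the per-base-class metadata. Everything in this sketch is hypothetical, including the merge_registries name and the underscore-joined prefix used for conflicting table names; it only demonstrates why a merge taken before the imports is empty and has to be repeated afterwards.

```python
from collections import Counter


def merge_registries(registries: dict[str, dict[str, str]]) -> dict[str, str]:
    """Merge per-namespace table registries, prefixing names that collide."""
    counts = Counter(name for tables in registries.values() for name in tables)
    merged = {}
    for namespace, tables in registries.items():
        for name, table in tables.items():
            # Only names defined in more than one namespace get a prefix.
            key = f"{namespace}_{name}" if counts[name] > 1 else name
            merged[key] = table
    return merged


# Stand-ins for the metadata of two base classes; both start out empty.
registries = {"project1": {}, "project2": {}}
before_imports = merge_registries(registries)

# Importing the model modules registers tables on their own base class.
registries["project1"]["users"] = "project1 users table"
registries["project2"]["users"] = "project2 users table"
registries["project2"]["orders"] = "project2 orders table"
after_imports = merge_registries(registries)

print(before_imports)
print(sorted(after_imports))
```

The first merge sees nothing because no model has been imported yet, so only the second merge contains the tables from both namespaces.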

Testing

  • All 66 tests pass
  • Code coverage: 86%
  • Backward compatibility maintained
  • New tests added for namespace packages and wildcard patterns

All changes are backward compatible. get_graph_string() continues
to work as before, now implemented as a wrapper around the new functions.

@TheLazzziest
Collaborator Author

LGTM!

2 participants