
Architecture Overview

Project Summary

This project implements a complete regular expression engine from scratch, building up from basic automata operations to advanced features like capturing groups and backreferences. The project demonstrates core computer science concepts including:

  • Nondeterministic Finite Automata (NFA) construction and simulation
  • Regular expression parsing using recursive descent
  • Regex-to-NFA conversion using Thompson's construction
  • Pattern matching with capturing groups
  • Text processing via a mini sed implementation
  • Backreference matching for advanced regex features

Project Structure

The project is organized into 4 checkpoints (cp1-cp4), each building on the previous:

├── cp1/          # Core NFA data structures and basic operations
├── cp2/          # Regular expression parsing and NFA construction
├── cp3/          # Capturing groups and text processing (msed)
├── cp4/          # Backreferences support
├── examples/     # Test files (NFAs, regexes, CNF formulas, etc.)
└── tests/        # Automated test suites for each checkpoint

Checkpoint 1: Core NFA Foundation

Purpose: Implement the fundamental NFA data structure and basic operations.

Key Components

nfa.py - NFA Core Library

  • NFA class: Represents a nondeterministic finite automaton
    • States, alphabet, start state, accept states
    • Transition function: transitions[(state, symbol)] → list of transitions
    • Supports ε-transitions (empty-string transitions)
  • Transition class: Represents a state transition (q, a, r)
    • From state q, on symbol a, to state r
    • Uses EPSILON = '&' for empty-string transitions
  • match(m, w) function: Simulates NFA m on input string w
    • Uses breadth-first search (BFS) to explore all possible paths
    • Returns an accepting path (a list of transitions) if the string is accepted
    • Handles ε-transitions correctly
    • Time complexity: O(n·m), where n = string length and m = NFA size
  • read() / write(): File I/O for NFA serialization
    • Text format: states, alphabet, start state, accept states, then transitions
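As a concrete reference, the BFS simulation can be sketched as follows. This is a minimal standalone version, not the project's actual code: it represents the transition function as a plain dict mapping (state, symbol) to successor states rather than using the NFA and Transition classes, and it returns accept/reject instead of the full accepting path.

```python
from collections import deque

EPSILON = '&'  # sentinel symbol for empty-string transitions

def match(transitions, start, accept, w):
    """BFS over configurations (state, input_position).

    Returns True iff some path consumes all of w and ends in an
    accept state. `transitions` maps (state, symbol) -> list of
    successor states.
    """
    seen = {(start, 0)}
    queue = deque(seen)
    while queue:
        state, i = queue.popleft()
        if i == len(w) and state in accept:
            return True
        successors = []
        # epsilon moves do not consume input
        for r in transitions.get((state, EPSILON), []):
            successors.append((r, i))
        # ordinary moves consume one input symbol
        if i < len(w):
            for r in transitions.get((state, w[i]), []):
                successors.append((r, i + 1))
        for cfg in successors:
            if cfg not in seen:     # each configuration explored once
                seen.add(cfg)
                queue.append(cfg)
    return False
```

Because each of the O(n·m) configurations is enqueued at most once, this matches the stated O(n·m) bound.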

regular.py - Basic NFA Constructors

  • symbol(a): Creates NFA recognizing single character {a}
  • epsilon(): Creates NFA recognizing empty string {ε}
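The two base constructors are small enough to sketch in full. The dict representation below is an assumption for illustration; the project builds NFA objects instead.

```python
def symbol(a):
    """NFA for the language {a}: two states, one transition on a."""
    return {'states': {'q0', 'q1'}, 'start': 'q0', 'accept': {'q1'},
            'transitions': [('q0', a, 'q1')]}

def epsilon():
    """NFA for the language {ε}: one state, both start and accept."""
    return {'states': {'q0'}, 'start': 'q0', 'accept': {'q0'},
            'transitions': []}
```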

Executables

  • epsilon_nfa: Outputs NFA for ε
  • symbol_nfa <char>: Outputs NFA for single character
  • nfa_path <nfa_file> <string>: Tests if string is accepted, prints path

Design Patterns

  • State renaming: When combining NFAs, states are prefixed (e.g., 1_0, 2_0) to avoid collisions
  • BFS simulation: Explores all possible computation paths simultaneously
  • Configuration tracking: (state, input_position) pairs represent computation states

Checkpoint 2: Regular Expressions to NFAs

Purpose: Parse regular expressions and convert them to NFAs using Thompson's construction.

Key Components

parse_re.py - Recursive Descent Parser

  • Grammar (productions listed from lowest to highest operator precedence; union binds loosest, star tightest):
E  → T E'          (Union/alternation)
E' → | T E' | ε
T  → F T'          (Concatenation)
T' → F T' | ε
F  → P F'          (Kleene star)
F' → * F' | ε
P  → (E) | a | ε   (Primitives: parentheses, symbols, empty)
  • Tokenization: Handles escaped characters (\', \"), quotes, special symbols
  • Output format: Prefix notation like union(concat(symbol[a],symbol[b]),star(symbol[c]))
  • Edge cases: Empty strings, leading/trailing |, nested parentheses
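A minimal recursive-descent parser for this grammar can be sketched as below, emitting the project's prefix notation. Escaped characters and quoting are omitted here for brevity; the function names mirror the grammar nonterminals but are otherwise assumptions.

```python
def parse(s):
    """Recursive-descent parser for E/T/F/P, emitting prefix notation."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def parse_E():                      # E -> T ('|' T)*   (union)
        nonlocal pos
        node = parse_T()
        while peek() == '|':
            pos += 1
            node = f'union({node},{parse_T()})'
        return node

    def parse_T():                      # T -> F T'         (concatenation)
        node = parse_F()
        while peek() is not None and peek() not in '|)':
            node = f'concat({node},{parse_F()})'
        return node

    def parse_F():                      # F -> P '*'*       (Kleene star)
        nonlocal pos
        node = parse_P()
        while peek() == '*':
            pos += 1
            node = f'star({node})'
        return node

    def parse_P():                      # P -> (E) | a | epsilon
        nonlocal pos
        if peek() == '(':
            pos += 1
            node = parse_E()
            pos += 1                    # consume ')'
            return node
        if peek() is None or peek() in '|)*':
            return 'epsilon'            # empty alternative
        c = s[pos]
        pos += 1
        return f'symbol[{c}]'

    return parse_E()
```

For example, `parse('ab|c*')` yields the output format quoted above: `union(concat(symbol[a],symbol[b]),star(symbol[c]))`.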

re_to_nfa.py - Regex-to-NFA Converter

  • re_to_nfa(expression): Main entry point
    1. Parses the regex string into a prefix-notation AST
    2. Recursively builds an NFA from the AST
  • build_nfa(expr): Recursive construction
    • symbol[a] → calls regular.symbol(a)
    • epsilon → calls regular.epsilon()
    • union(L,R) → calls union_nfa(build_nfa(L), build_nfa(R))
    • concat(L,R) → calls concat_nfa(build_nfa(L), build_nfa(R))
    • star(X) → calls star_nfa(build_nfa(X))
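The recursive dispatch can be sketched independently of the NFA operations themselves by injecting the constructors. The tuple-based AST encoding and the `ops` parameter are assumptions made so the sketch stands alone:

```python
def build_nfa(expr, ops):
    """Recursive dispatch over a tuple AST:
    ('union', L, R), ('concat', L, R), ('star', X),
    ('symbol', a), or the literal 'epsilon'.
    `ops` supplies the five constructor functions."""
    if expr == 'epsilon':
        return ops['epsilon']()
    kind = expr[0]
    if kind == 'symbol':
        return ops['symbol'](expr[1])
    if kind == 'star':                  # unary operator
        return ops['star'](build_nfa(expr[1], ops))
    # union and concat are binary operators
    return ops[kind](build_nfa(expr[1], ops), build_nfa(expr[2], ops))
```

Passing string-building stubs as `ops` is a convenient way to unit-test the dispatch without constructing real NFAs.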

NFA Construction Operations

union_nfa.py - Union (L₁ ∪ L₂)

  • Creates a new start state with ε-transitions to both NFAs' start states
  • Preserves all accept states from both NFAs
  • Key insight: Nondeterminism allows "choosing" which NFA to follow

concat_nfa.py - Concatenation (L₁ ∘ L₂)

  • Links accept states of the first NFA to the start state of the second via ε-transitions
  • New start = first NFA's start; new accept = second NFA's accept states
  • Key insight: Empty-string transitions connect the two languages

star_nfa.py - Kleene Star (L*)

  • Creates new start/accept states
  • ε-transitions: start → original start, accept states → original start, accept states → new accept
  • Key insight: Allows zero or more repetitions via looping
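The union construction together with state renaming can be sketched as follows, again on dict-based NFAs. The fresh-state name 'u0' is arbitrary; the '1_'/'2_' prefixes match the renaming convention described under Checkpoint 1.

```python
EPSILON = '&'

def rename(nfa, prefix):
    """Prefix every state name (e.g. 0 -> 1_0) so two NFAs can be
    combined without state-name collisions."""
    f = lambda q: f'{prefix}_{q}'
    return {
        'states': {f(q) for q in nfa['states']},
        'start': f(nfa['start']),
        'accept': {f(q) for q in nfa['accept']},
        'transitions': [(f(q), a, f(r)) for (q, a, r) in nfa['transitions']],
    }

def union_nfa(m1, m2):
    """Union: a fresh start state reaches both renamed NFAs via
    epsilon-transitions; both accept sets are preserved."""
    m1, m2 = rename(m1, '1'), rename(m2, '2')
    start = 'u0'  # fresh state; name chosen arbitrarily
    return {
        'states': {start} | m1['states'] | m2['states'],
        'start': start,
        'accept': m1['accept'] | m2['accept'],
        'transitions': (m1['transitions'] + m2['transitions'] +
                        [(start, EPSILON, m1['start']),
                         (start, EPSILON, m2['start'])]),
    }
```

Concatenation and star follow the same shape, differing only in where the ε-transitions are added.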

agrep.py - Regex Matching Tool

  • re_to_nfa(regex): Converts the regex to an NFA
  • precompute_epsilon_closures(nfa): Precomputes ε-closures for all states (optimization)
  • simulate_nfa(nfa, string, closures): Efficient NFA simulation
    • Uses precomputed closures to avoid redundant BFS
    • Time complexity: O(n·m), where n = string length and m = NFA size
  • Processes stdin line by line, printing matching lines
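The closure precomputation and the closure-based simulation can be sketched like this (dict-based NFA representation assumed, as in the earlier sketches; the closures are computed once per NFA and reused for every input line):

```python
EPSILON = '&'

def precompute_epsilon_closures(states, transitions):
    """For each state, the set of states reachable by ε-moves alone."""
    closures = {}
    for s in states:
        closure, stack = {s}, [s]
        while stack:                    # depth-first reachability
            q = stack.pop()
            for r in transitions.get((q, EPSILON), []):
                if r not in closure:
                    closure.add(r)
                    stack.append(r)
        closures[s] = closure
    return closures

def simulate(transitions, start, accept, closures, w):
    """Track the set of reachable states, expanding through
    precomputed ε-closures after every consumed symbol."""
    current = set(closures[start])
    for c in w:
        nxt = set()
        for q in current:
            for r in transitions.get((q, c), []):
                nxt |= closures[r]      # no per-step BFS needed
        current = nxt
    return bool(current & accept)
```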

Algorithm Highlights

Thompson's Construction:

  • Each regex operator (union, concat, star) has a corresponding NFA construction
  • Recursively combines smaller NFAs into larger ones
  • Resulting NFA has O(m) states, where m = regex length

Epsilon Closure Optimization:

  • Precompute the states reachable via ε-transitions from each state
  • Reduces redundant exploration during simulation
  • Critical for performance on complex regexes

Checkpoint 3: Capturing Groups and Text Processing

Purpose: Extend regex engine with capturing groups and implement a sed-like text processor.

Key Components

regexp.py - Enhanced AST with Groups

  • New AST node: GroupNode(number, child)
    • Represents a capturing group (pattern) with an assigned group number
    • Groups are numbered 1, 2, 3... in order of opening parentheses
  • Parser enhancements:
    • Tracks _group_count during parsing
    • Wraps parenthesized expressions in GroupNode
  • AST printer: Converts the AST to prefix notation with group[k](...)
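A sketch of the node and its printer; `Sym` is a hypothetical stand-in for the project's symbol leaf node, and `prefix()` mirrors the group[k](...) output format:

```python
class Sym:
    """Leaf node for a single symbol (illustrative stand-in)."""
    def __init__(self, c):
        self.c = c
    def prefix(self):
        return f'symbol[{self.c}]'

class GroupNode:
    """Capturing group: a child expression plus the group number
    assigned in order of opening parentheses."""
    def __init__(self, number, child):
        self.number = number
        self.child = child
    def prefix(self):
        return f'group[{self.number}]({self.child.prefix()})'
```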

re_groups.py - Group Matching Engine

  • Matcher class: Recursive matcher that tracks captures
    • match(s): Full match (the entire string must match)
    • match_partial(s): Partial match (longest matching prefix)
    • _match_node(node, s, pos, captures): Core recursive matching
      • Returns (new_position, updated_captures_dict)
      • GroupNode: Captures substring s[start:end] into captures[group_num]
      • Handles union, concat, and star with capture propagation
  • apply_replacement(regex, replace, current): Substitution with backreferences
    • Matches the pattern and extracts groups
    • Replaces \g<1>, \g<2>, etc. with the captured groups
    • Returns the modified string
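The replacement-expansion half of this step can be sketched as follows. Note the hedge: Python's `re` module is used here only to tokenize the \g<k> markers in the replacement string, not to do the pattern matching (which the project's own engine performs); the function name is an assumption.

```python
import re

def expand_replacement(replace, captures):
    """Expand \\g<k> tokens in a replacement string using a dict of
    captured groups (group number -> matched substring).
    Unset groups expand to the empty string."""
    def sub(m):
        return captures.get(int(m.group(1)), '')
    return re.sub(r'\\g<(\d+)>', sub, replace)
```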

msed.py - Mini sed Implementation

  • Commands:
    • s/pattern/replace/: Substitution (uses re_groups.apply_replacement)
    • /pattern/target: Branching (loop, accept, reject, skip)
    • :label: Labels for branching
  • Execution model:
    • Processes each input line through the command sequence
    • Branch commands jump to labels based on pattern matching
  • Supports -f scriptfile and -e expression flags
  • Verbose mode (-v) shows execution steps
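The execution model reduces to a small interpreter loop, sketched below. The command encodings (tuples of a kind tag and an argument) are illustrative, not msed's real script syntax:

```python
def run_script(commands, labels, line):
    """Sketch of msed's execution loop: a program counter (pc) steps
    through the command list; branch commands jump to labels."""
    pc = 0
    while pc < len(commands):
        kind, arg = commands[pc]
        if kind == 'sub':              # ('sub', transform): rewrite the line
            line = arg(line)
            pc += 1
        elif kind == 'branch':         # ('branch', (predicate, label))
            pred, target = arg
            pc = labels[target] if pred(line) else pc + 1
        else:                          # labels themselves are no-ops
            pc += 1
    return line
```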

re_groups (executable)

  • CLI tool: re_groups <pattern> <string>
  • Outputs "accept" or "reject" plus captured groups

Design Patterns

  • Capture propagation: Captures flow through recursive matching, updated at group boundaries
  • Partial matching: Tries all possible starting positions, selects longest match
  • State machine: msed uses program counter (pc) to step through commands

Checkpoint 4: Backreferences

Purpose: Add backreference support (\g<1>, \g<2>, etc.) to match previously captured groups.

Key Components

backrefs.py - Backreference Parser and Matcher

  • parse_backrefs(s): Parses replacement strings
    • Extracts \g<1>, \g<2>, etc. as Backreference tokens
    • Returns a list of tokens (plain strings and Backreference objects)
  • match_with_backrefs(tree, line): Enhanced matcher
    • Environment (env): Dictionary mapping group numbers → captured strings
    • Backreference matching: on encountering a backref node:
      • Looks up env[group_num] to get the previously captured string
      • Checks whether the remaining input starts with that string
      • Advances the position by the length of the captured string
    • Group capture: on matching a group node, captures the substring into the environment: env[group_num] = s[start:end]
    • Star handling: Prevents infinite loops by tracking seen configurations
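The backreference step described above can be sketched in isolation (the helper name `match_backref` is hypothetical; in the project this logic lives inside the recursive matcher):

```python
def match_backref(env, group_num, s, pos):
    """One matching step for a backreference node: the remaining
    input must start with the text previously captured for
    group_num. Returns the new position, or None on failure."""
    captured = env.get(group_num)
    if captured is None:
        return None                    # group never matched: backref fails
    if s.startswith(captured, pos):
        return pos + len(captured)     # advance past the repeated text
    return None
```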

bgrep.py - Backreference-Aware grep

  • Parses regex pattern (supports groups)
  • Uses match_with_backrefs to match lines
  • Filters stdin, prints matching lines

parse_re.py (cp4 version)

  • Enhanced to parse backreferences in patterns
  • Converts \g<1> to backref nodes in AST

Algorithm Highlights

Backreference Matching:

  • Requires context (an environment) to store captured groups
  • Matching is context-dependent: the same pattern matches differently depending on previous captures
  • Example: Pattern (a+)\g<1> matches "aa" (group 1 = "a", so the backref must also match "a")
  • Complexity: Can be exponential in the worst case (backtracking)

Environment Management:

  • Each recursive call maintains its own environment copy
  • Groups captured in one branch don't affect other branches (union)
  • Captures propagate through concatenation and star

Data Flow Architecture

Input String/Regex
   ↓
[cp2] parse_re.py → Prefix AST
   ↓
[cp2] re_to_nfa.py → NFA
   ↓
[cp1] nfa.match() → Accept/Reject + Path
   OR
[cp2] agrep.py → Line filtering
   OR
[cp3] re_groups.py → Match + Captures
   ↓
[cp3] msed.py → Text transformation
   OR
[cp4] backrefs.py → Match with backreferences
   ↓
[cp4] bgrep.py → Advanced pattern matching

Key Design Decisions

  1. NFA over DFA: NFAs are easier to construct (Thompson's algorithm) and handle ε-transitions naturally
  2. BFS simulation: Ensures O(n·m) time complexity and finds a shortest accepting path
  3. Prefix notation: Simplifies parsing and AST manipulation
  4. State renaming: Prevents collisions when combining NFAs
  5. Precomputed closures: Optimization for repeated NFA simulation
  6. Recursive descent parsing: Clear, maintainable parser structure
  7. Environment-based captures: Clean separation between matching and capture tracking

Complexity Analysis

  • NFA construction: O(m) states for regex of length m
  • NFA simulation: O(n·m) time, O(m) space for n-length string
  • Regex parsing: O(m) time and space
  • Group matching: O(n·m) time in typical case, can be exponential with backreferences
  • Backreference matching: Worst-case exponential, but optimized with cycle detection

Testing Strategy

Each checkpoint has comprehensive test suite (test-cp*.sh):

  • Unit tests for individual operations
  • Integration tests for full pipelines
  • Performance tests (ensures O(n) scaling, not O(n²))
  • Edge case coverage (empty strings, special characters, nested structures)

Extensibility

The architecture supports easy extension:

  • New regex operators: Add AST node + NFA construction + parser rule
  • New matching modes: Extend Matcher class with new methods
  • New text processors: Follow msed.py pattern (command parser + execution loop)

Portfolio Showcase Points

This project demonstrates:

  • Algorithm Implementation: Thompson's construction, BFS graph traversal, recursive descent parsing
  • Data Structures: NFA representation, AST manipulation, state machines
  • Software Engineering: Modular design, clear separation of concerns, comprehensive testing
  • Performance Optimization: Epsilon closure precomputation, efficient state space exploration
  • Language Theory: Regular languages, automata theory, formal language parsing
  • System Programming: CLI tools, file I/O, stdin/stdout processing
  • Advanced Features: Capturing groups, backreferences, text transformation