This project implements a complete regular expression engine from scratch, building up from basic automata operations to advanced features like capturing groups and backreferences. The project demonstrates core computer science concepts including:
- Nondeterministic Finite Automata (NFA) construction and simulation
- Regular expression parsing using recursive descent
- Regex-to-NFA conversion using Thompson's construction
- Pattern matching with capturing groups
- Text processing via a mini sed implementation
- Backreference matching for advanced regex features
The project is organized into 4 checkpoints (cp1-cp4), each building on the previous:
├── cp1/ # Core NFA data structures and basic operations
├── cp2/ # Regular expression parsing and NFA construction
├── cp3/ # Capturing groups and text processing (msed)
├── cp4/ # Backreferences support
├── examples/ # Test files (NFAs, regexes, CNF formulas, etc.)
└── tests/ # Automated test suites for each checkpoint
Purpose: Implement the fundamental NFA data structure and basic operations.
- `NFA` class: Represents a nondeterministic finite automaton
  - States, alphabet, start state, accept states
  - Transition function: `transitions[(state, symbol)]` → list of transitions
  - Supports ε-transitions (empty string transitions)
- `Transition` class: Represents a state transition (q, a, r)
  - From state `q`, on symbol `a`, to state `r`
  - Uses `EPSILON = '&'` for empty-string transitions
- `match(m, w)` function: Simulates the NFA `m` on input string `w`
  - Uses breadth-first search (BFS) to explore all possible paths
  - Returns an accepting path (list of transitions) if the string is accepted
  - Handles ε-transitions correctly
  - Time complexity: O(n·m), where n = string length, m = NFA size
- `read()` / `write()`: File I/O for NFA serialization
  - Text format: states, alphabet, start state, accept states, then transitions
- `symbol(a)`: Creates an NFA recognizing the single character, {a}
- `epsilon()`: Creates an NFA recognizing only the empty string, {ε}
- `epsilon_nfa`: Outputs the NFA for ε
- `symbol_nfa <char>`: Outputs the NFA for a single character
- `nfa_path <nfa_file> <string>`: Tests whether the string is accepted and prints the path
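As an illustration, the two base constructors might look like this over a plain-dict NFA representation (the keys and state names here are assumptions for the sketch, not the project's actual class):

```python
EPSILON = '&'  # the project's sentinel for empty-string transitions

def symbol(a):
    """NFA accepting exactly the one-character string `a`."""
    return {
        'states': {'q0', 'q1'},
        'start': 'q0',
        'accept': {'q1'},
        'transitions': {('q0', a): ['q1']},  # single labeled edge
    }

def epsilon():
    """NFA accepting only the empty string: the start state accepts."""
    return {
        'states': {'q0'},
        'start': 'q0',
        'accept': {'q0'},
        'transitions': {},
    }
```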
- State renaming: When combining NFAs, states are prefixed (e.g., `1_0`, `2_0`) to avoid collisions
- BFS simulation: Explores all possible computation paths simultaneously
- Configuration tracking: `(state, input_position)` pairs represent computation states
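The BFS simulation over `(state, input_position)` configurations can be sketched as follows, again over a plain-dict NFA (an illustration of the idea, not the project's `match(m, w)` itself; it returns only accept/reject rather than the full path):

```python
from collections import deque

EPSILON = '&'  # the project's sentinel for empty-string transitions

def nfa_accepts(nfa, w):
    """BFS over (state, input_position) configurations."""
    start = (nfa['start'], 0)
    queue, seen = deque([start]), {start}
    while queue:
        state, pos = queue.popleft()
        if pos == len(w) and state in nfa['accept']:
            return True
        # ε-moves consume no input
        for nxt in nfa['transitions'].get((state, EPSILON), []):
            if (nxt, pos) not in seen:
                seen.add((nxt, pos))
                queue.append((nxt, pos))
        # ordinary moves consume one character
        if pos < len(w):
            for nxt in nfa['transitions'].get((state, w[pos]), []):
                if (nxt, pos + 1) not in seen:
                    seen.add((nxt, pos + 1))
                    queue.append((nxt, pos + 1))
    return False
```

Marking each configuration as seen is what bounds the work to O(n·m): there are at most (n+1)·m distinct `(state, position)` pairs.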
Purpose: Parse regular expressions and convert them to NFAs using Thompson's construction.
- Grammar (productions listed from lowest to highest operator precedence):
E → T E' (Union/alternation)
E' → | T E' | ε
T → F T' (Concatenation)
T' → F T' | ε
F → P F' (Kleene star)
F' → * F' | ε
P → (E) | a | ε (Primitives: parentheses, symbols, empty)
- Tokenization: Handles escaped characters (`\'`, `\"`), quotes, and special symbols
- Output format: Prefix notation like `union(concat(symbol[a],symbol[b]),star(symbol[c]))`
- Edge cases: Empty strings, leading/trailing `|`, nested parentheses
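A compact recursive-descent sketch of this grammar (no escape or quote handling, and the prefix-string output shape is assumed from the example above):

```python
def parse_re(src):
    """Recursive descent over E/T/F/P, emitting prefix notation."""
    pos = 0

    def peek():
        return src[pos] if pos < len(src) else None

    def parse_E():                      # E -> T ('|' T)*  (union)
        nonlocal pos
        node = parse_T()
        while peek() == '|':
            pos += 1
            node = f'union({node},{parse_T()})'
        return node

    def parse_T():                      # T -> F F ...     (concatenation)
        node = parse_F()
        while peek() is not None and peek() not in '|)':
            node = f'concat({node},{parse_F()})'
        return node

    def parse_F():                      # F -> P '*'*      (Kleene star)
        nonlocal pos
        node = parse_P()
        while peek() == '*':
            pos += 1
            node = f'star({node})'
        return node

    def parse_P():                      # P -> (E) | a | ε (primitives)
        nonlocal pos
        c = peek()
        if c == '(':
            pos += 1
            node = parse_E()
            assert peek() == ')', 'unbalanced parentheses'
            pos += 1
            return node
        if c is None or c in '|)*':
            return 'epsilon'            # empty alternative, e.g. trailing '|'
        pos += 1
        return f'symbol[{c}]'

    return parse_E()
```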
- `re_to_nfa(expression)`: Main entry point
  - Parses the regex string into a prefix-notation AST
  - Recursively builds the NFA from the AST
- `build_nfa(expr)`: Recursive construction
  - `symbol[a]` → calls `regular.symbol(a)`
  - `epsilon` → calls `regular.epsilon()`
  - `union(L,R)` → calls `union_nfa(build_nfa(L), build_nfa(R))`
  - `concat(L,R)` → calls `concat_nfa(build_nfa(L), build_nfa(R))`
  - `star(X)` → calls `star_nfa(build_nfa(X))`
- `union_nfa.py` - Union (L₁ ∪ L₂)
  - Creates a new start state with ε-transitions to both NFAs' start states
  - Preserves all accept states from both NFAs
  - Key insight: Nondeterminism allows "choosing" which NFA to follow
- `concat_nfa.py` - Concatenation (L₁ ∘ L₂)
  - Links the accept states of the first NFA to the start state of the second via ε-transitions
  - New start = first NFA's start; new accept = second NFA's accept states
  - Key insight: Empty-string transitions connect the two languages
- `star_nfa.py` - Kleene star (L*)
  - Creates new start/accept states
  - ε-transitions: new start → original start, original accept states → original start, original accept states → new accept
  - Key insight: Allows zero or more repetitions via looping
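The three constructions can be sketched over a plain-dict NFA shape (keys, prefixes, and state names here are illustrative, not the project's API; the star sketch also adds an explicit new-start → new-accept ε-move so that ε itself is accepted):

```python
EPSILON = '&'

def rename(nfa, prefix):
    """Prefix every state name to avoid collisions (e.g. '1_q0', '2_q0')."""
    r = lambda q: f'{prefix}_{q}'
    return {'start': r(nfa['start']),
            'accept': {r(q) for q in nfa['accept']},
            'transitions': {(r(q), a): [r(p) for p in ps]
                            for (q, a), ps in nfa['transitions'].items()}}

def union_nfa(n1, n2):
    n1, n2 = rename(n1, '1'), rename(n2, '2')
    trans = {**n1['transitions'], **n2['transitions']}
    trans[('u0', EPSILON)] = [n1['start'], n2['start']]  # new start "chooses"
    return {'start': 'u0', 'accept': n1['accept'] | n2['accept'],
            'transitions': trans}

def concat_nfa(n1, n2):
    n1, n2 = rename(n1, '1'), rename(n2, '2')
    trans = {**n1['transitions'], **n2['transitions']}
    for q in n1['accept']:                 # link first's accepts to second's start
        trans.setdefault((q, EPSILON), []).append(n2['start'])
    return {'start': n1['start'], 'accept': n2['accept'], 'transitions': trans}

def star_nfa(n):
    n = rename(n, '1')
    trans = dict(n['transitions'])
    trans[('s0', EPSILON)] = [n['start'], 's1']          # enter loop, or accept ε
    for q in n['accept']:                                # loop back, or exit
        trans.setdefault((q, EPSILON), []).extend([n['start'], 's1'])
    return {'start': 's0', 'accept': {'s1'}, 'transitions': trans}
```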
- `re_to_nfa(regex)`: Converts a regex to an NFA
- `precompute_epsilon_closures(nfa)`: Precomputes ε-closures for all states (optimization)
- `simulate_nfa(nfa, string, closures)`: Efficient NFA simulation
  - Uses precomputed closures to avoid redundant BFS
  - Time complexity: O(n·m), where n = string length, m = NFA size
- Processes stdin line by line and prints matching lines
Thompson's Construction:
- Each regex operator (union, concat, star) has a corresponding NFA construction
- Recursively combines smaller NFAs into larger ones
- Resulting NFA has O(m) states, where m = regex length

Epsilon Closure Optimization:
- Precompute reachable states via ε-transitions for each state
- Reduces redundant exploration during simulation
- Critical for performance on complex regexes
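A sketch of the closure precomputation and the subset-style simulation it enables, assuming a plain-dict NFA with a `transitions[(state, symbol)] -> [states]` map (an illustration, not the project's code):

```python
EPSILON = '&'

def precompute_epsilon_closures(nfa):
    """For each state, the set reachable via ε-transitions alone (incl. itself)."""
    closures = {}
    for q in nfa['states']:
        closure, stack = {q}, [q]
        while stack:
            s = stack.pop()
            for nxt in nfa['transitions'].get((s, EPSILON), []):
                if nxt not in closure:
                    closure.add(nxt)
                    stack.append(nxt)
        closures[q] = closure
    return closures

def simulate_nfa(nfa, string, closures):
    """Track the set of reachable states; closures replace repeated ε-BFS."""
    current = set(closures[nfa['start']])
    for ch in string:
        nxt = set()
        for q in current:
            for p in nfa['transitions'].get((q, ch), []):
                nxt |= closures[p]          # jump straight through ε-edges
        current = nxt
    return bool(current & nfa['accept'])
```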
Purpose: Extend regex engine with capturing groups and implement a sed-like text processor.
- New AST node: `GroupNode(number, child)`
  - Represents a capturing group `(pattern)` with its assigned group number
  - Groups are numbered 1, 2, 3, ... in order of opening parentheses
- Parser enhancements:
  - Tracks `_group_count` during parsing
  - Wraps parenthesized expressions in `GroupNode`
- AST printer: Converts the AST to prefix notation with `group[k](...)`
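A minimal sketch of what the node and its printer might look like (the dataclass shape is an assumption; only already-printed children are handled, to keep the example short):

```python
from dataclasses import dataclass

@dataclass
class GroupNode:
    """Capturing group: wraps a child AST node and carries its group number."""
    number: int
    child: object

def to_prefix(node):
    """Tiny printer in the README's prefix notation (group[k](...))."""
    if isinstance(node, GroupNode):
        return f'group[{node.number}]({to_prefix(node.child)})'
    return str(node)  # assume the child is already a printed prefix string
```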
- `Matcher` class: Recursive matcher that tracks captures
  - `match(s)`: Full match (entire string must match)
  - `match_partial(s)`: Partial match (longest matching prefix)
  - `_match_node(node, s, pos, captures)`: Core recursive matching
    - Returns `(new_position, updated_captures_dict)`
  - `GroupNode`: Captures `s[start:end]` into `captures[group_num]`
  - Handles union, concat, and star with capture propagation
- `apply_replacement(regex, replace, current)`: Substitution with backreferences
  - Matches the pattern and extracts groups
  - Replaces `\g<1>`, `\g<2>`, etc. with the captured groups
  - Returns the modified string
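The substitution half can be sketched with a manual scan for `\g<k>` markers (the matching half is elided; `captures` is the group-number → substring dict described above, and the function name is hypothetical):

```python
def substitute(template, captures):
    """Expand \g<1>, \g<2>, ... from a captures dict; unknown group
    numbers expand to the empty string. A sketch, not apply_replacement."""
    out, i = [], 0
    while i < len(template):
        if template.startswith(r'\g<', i):
            j = template.index('>', i)                   # end of the \g<k> marker
            out.append(captures.get(int(template[i + 3:j]), ''))
            i = j + 1
        else:
            out.append(template[i])
            i += 1
    return ''.join(out)
```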
- Commands:
  - `s/pattern/replace/`: Substitution (uses `re_groups.apply_replacement`)
  - `/pattern/target`: Branching (loop, accept, reject, skip)
  - `:label`: Labels for branching
- Execution model:
  - Processes each input line through the command sequence
  - Branch commands jump to labels based on pattern matching
- Supports `-f scriptfile` and `-e expression` flags
- Verbose mode (`-v`) shows execution steps
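The execution model reduces to a small program-counter loop. In this sketch the command tuples and the injected `matches`/`substitute` hooks are simplifications of the real `s/…/…/`, `/…/target`, and `:label` syntax (and the `skip` target is omitted):

```python
def run_msed(commands, line, matches, substitute):
    """Program-counter execution over one input line. Commands are
    ('s', pat, repl), ('b', pat, target), or ('label', name) tuples."""
    labels = {cmd[1]: i for i, cmd in enumerate(commands) if cmd[0] == 'label'}
    pc = 0
    while pc < len(commands):
        kind = commands[pc][0]
        if kind == 's':
            _, pat, repl = commands[pc]
            line = substitute(pat, repl, line)
        elif kind == 'b':
            _, pat, target = commands[pc]
            if matches(pat, line):
                if target == 'accept':
                    return line            # emit the transformed line
                if target == 'reject':
                    return None            # drop the line
                pc = labels[target]        # jump to the label
                continue
        pc += 1
    return line
```

With substring hooks standing in for the regex engine, a loop that collapses runs of `aa` to `a` looks like `[('label','top'), ('s','aa','a'), ('b','aa','top'), ('b','a','accept')]`.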
- CLI tool: `re_groups <pattern> <string>`
  - Outputs "accept" or "reject" plus the captured groups
- Capture propagation: Captures flow through recursive matching, updated at group boundaries
- Partial matching: Tries all possible starting positions, selects longest match
- State machine: msed uses program counter (pc) to step through commands
Purpose: Add backreference support (`\g<1>`, `\g<2>`, etc.) to match previously captured groups.
- `parse_backrefs(s)`: Parses replacement strings
  - Extracts `\g<1>`, `\g<2>`, etc. as `Backreference` tokens
  - Returns a list of tokens (plain strings and `Backreference` objects)
- `match_with_backrefs(tree, line)`: Enhanced matcher
  - Environment (`env`): Dictionary mapping group numbers → captured strings
  - Backreference matching: When encountering a `backref` node:
    - Looks up `env[group_num]` to get the previously captured string
    - Checks whether the remaining input starts with that string
    - Advances the position by the length of the captured string
  - Group capture: When matching a `group` node, captures the substring into the environment: `env[group_num] = s[start:end]`
  - Star handling: Prevents infinite loops by tracking seen configurations
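The backreference step is small enough to show in isolation (a fragment with assumed names, not the full matcher):

```python
def match_backref(env, s, pos, group_num):
    """One backreference step: look up the captured text and check that
    the input at `pos` continues with it. Returns the new position, or
    None if the group was never captured or the text does not match."""
    captured = env.get(group_num)
    if captured is None:
        return None
    if s.startswith(captured, pos):
        return pos + len(captured)
    return None
```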
- Parses the regex pattern (supports groups)
- Uses `match_with_backrefs` to match lines
- Filters stdin, printing matching lines
- Enhanced to parse backreferences in patterns
- Converts `\g<1>` to backref nodes in the AST
Backreference Matching:
- Requires context (environment) to store captured groups
- Matching is context-dependent: same pattern matches differently based on previous captures
- Example: The pattern `(a+)\g<1>` matches "aa" (group 1 = "a", so the backref must also be "a")
- Complexity: Can be exponential in the worst case (backtracking)

Environment Management:
- Each recursive call maintains its own environment copy
- Groups captured in one branch don't affect other branches (union)
- Captures propagate through concatenation and star
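The context dependence can be checked against Python's built-in engine, where the same backreference is spelled `\1`:

```python
import re

# Same idea as (a+)\g<1>: \1 must repeat whatever group 1 captured
# on this particular attempt.
pattern = r'(a+)\1'
assert re.fullmatch(pattern, 'aa')            # group 1 = "a"
assert re.fullmatch(pattern, 'aaaa')          # group 1 = "aa"
assert re.fullmatch(pattern, 'aaa') is None   # odd length: no equal split exists
```

Note that matching `aaaa` requires backtracking: the greedy `a+` first grabs all four characters, fails the backreference, and retreats to `aa` — the behavior that makes the worst case exponential.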
Input String/Regex
↓
[cp2] parse_re.py → Prefix AST
↓
[cp2] re_to_nfa.py → NFA
↓
[cp1] nfa.match() → Accept/Reject + Path
OR
[cp2] agrep.py → Line filtering
OR
[cp3] re_groups.py → Match + Captures
↓
[cp3] msed.py → Text transformation
OR
[cp4] backrefs.py → Match with backreferences
↓
[cp4] bgrep.py → Advanced pattern matching
- NFA over DFA: Chose NFA for easier construction (Thompson's algorithm), handles ε-transitions naturally
- BFS simulation: Ensures O(n·m) time complexity, finds shortest accepting path
- Prefix notation: Simplifies parsing and AST manipulation
- State renaming: Prevents collisions when combining NFAs
- Precomputed closures: Optimization for repeated NFA simulation
- Recursive descent parsing: Clear, maintainable parser structure
- Environment-based captures: Clean separation between matching and capture tracking
- NFA construction: O(m) states for regex of length m
- NFA simulation: O(n·m) time, O(m) space for n-length string
- Regex parsing: O(m) time and space
- Group matching: O(n·m) time in typical case, can be exponential with backreferences
- Backreference matching: Worst-case exponential, but optimized with cycle detection
Each checkpoint has a comprehensive test suite (`test-cp*.sh`):
- Unit tests for individual operations
- Integration tests for full pipelines
- Performance tests (ensuring O(n) scaling, not O(n²))
- Edge case coverage (empty strings, special characters, nested structures)
The architecture supports easy extension:
- New regex operators: Add AST node + NFA construction + parser rule
- New matching modes: Extend the `Matcher` class with new methods
- New text processors: Follow the `msed.py` pattern (command parser + execution loop)
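As a concrete (hypothetical) sketch of the first extension path, a `plus` operator could even be added without touching the NFA layer at all, by desugaring `plus(X)` to `concat(X,star(X))` on the prefix string before `re_to_nfa` runs (the helper name and the string-rewriting approach are assumptions, not project code):

```python
def desugar_plus(expr):
    """Rewrite every plus(X) in a prefix-notation string to concat(X,star(X))."""
    while 'plus(' in expr:
        i = expr.index('plus(')
        depth, j = 1, i + 5                 # j starts just after 'plus('
        while depth:                        # find the matching close paren
            depth += {'(': 1, ')': -1}.get(expr[j], 0)
            j += 1
        inner = expr[i + 5:j - 1]
        expr = expr[:i] + f'concat({inner},star({inner}))' + expr[j:]
    return expr
```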
This project demonstrates:
- ✅ Algorithm Implementation: Thompson's construction, BFS graph traversal, recursive descent parsing
- ✅ Data Structures: NFA representation, AST manipulation, state machines
- ✅ Software Engineering: Modular design, clear separation of concerns, comprehensive testing
- ✅ Performance Optimization: Epsilon closure precomputation, efficient state space exploration
- ✅ Language Theory: Regular languages, automata theory, formal language parsing
- ✅ System Programming: CLI tools, file I/O, stdin/stdout processing
- ✅ Advanced Features: Capturing groups, backreferences, text transformation