This project implements a complete regular expression engine from scratch, building up from basic automata operations to advanced features like capturing groups and backreferences. The project demonstrates core computer science concepts including:
- Nondeterministic Finite Automata (NFA) construction and simulation
- Regular expression parsing using recursive descent
- Regex-to-NFA conversion using Thompson's construction
- Pattern matching with capturing groups
- Text processing via a mini sed implementation
- Backreference matching for advanced regex features
The project is organized into 4 checkpoints (cp1-cp4), each building on the previous:
├── cp1/ # Core NFA data structures and basic operations
├── cp2/ # Regular expression parsing and NFA construction
├── cp3/ # Capturing groups and text processing (msed)
├── cp4/ # Backreferences support
├── examples/ # Test files (NFAs, regexes, CNF formulas, etc.)
└── tests/ # Automated test suites for each checkpoint
Purpose: Implement the fundamental NFA data structure and basic operations.
- `NFA` class: Represents a nondeterministic finite automaton
  - States, alphabet, start state, accept states
  - Transition function: `transitions[(state, symbol)]` → list of transitions
  - Supports ε-transitions (empty string transitions)
- `Transition` class: Represents a state transition (q, a, r)
  - From state `q`, on symbol `a`, to state `r`
  - Uses `EPSILON = '&'` for empty-string transitions
- `match(m, w)` function: Simulates the NFA `m` on input string `w`
  - Uses breadth-first search (BFS) to explore all possible paths
  - Returns an accepting path (list of transitions) if the string is accepted
  - Handles ε-transitions correctly
  - Time complexity: O(n·m), where n = string length, m = NFA size
- `read()` / `write()`: File I/O for NFA serialization
  - Text format: states, alphabet, start state, accept states, then transitions
- `symbol(a)`: Creates an NFA recognizing the single character, {a}
- `epsilon()`: Creates an NFA recognizing only the empty string, {ε}
- `epsilon_nfa`: Outputs the NFA for ε
- `symbol_nfa <char>`: Outputs the NFA for a single character
- `nfa_path <nfa_file> <string>`: Tests whether the string is accepted and prints the path
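As an illustration, the two base constructors might look like this over a plain-dict NFA representation (the keys and state names here are assumptions for the sketch, not the project's actual class):

```python
EPSILON = '&'  # the project's sentinel for empty-string transitions

def symbol(a):
    """NFA accepting exactly the one-character string `a`."""
    return {
        'states': {'q0', 'q1'},
        'start': 'q0',
        'accept': {'q1'},
        'transitions': {('q0', a): ['q1']},  # single labeled edge
    }

def epsilon():
    """NFA accepting only the empty string: the start state accepts."""
    return {
        'states': {'q0'},
        'start': 'q0',
        'accept': {'q0'},
        'transitions': {},
    }
```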
- State renaming: When combining NFAs, states are prefixed (e.g., `1_0`, `2_0`) to avoid collisions
- BFS simulation: Explores all possible computation paths simultaneously
- Configuration tracking: `(state, input_position)` pairs represent computation states
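The BFS simulation over `(state, input_position)` configurations can be sketched as follows, again over a plain-dict NFA (an illustration of the idea, not the project's `match(m, w)` itself; it returns only accept/reject rather than the full path):

```python
from collections import deque

EPSILON = '&'  # the project's sentinel for empty-string transitions

def nfa_accepts(nfa, w):
    """BFS over (state, input_position) configurations."""
    start = (nfa['start'], 0)
    queue, seen = deque([start]), {start}
    while queue:
        state, pos = queue.popleft()
        if pos == len(w) and state in nfa['accept']:
            return True
        # ε-moves consume no input
        for nxt in nfa['transitions'].get((state, EPSILON), []):
            if (nxt, pos) not in seen:
                seen.add((nxt, pos))
                queue.append((nxt, pos))
        # ordinary moves consume one character
        if pos < len(w):
            for nxt in nfa['transitions'].get((state, w[pos]), []):
                if (nxt, pos + 1) not in seen:
                    seen.add((nxt, pos + 1))
                    queue.append((nxt, pos + 1))
    return False
```

Marking each configuration as seen is what bounds the work to O(n·m): there are at most (n+1)·m distinct `(state, position)` pairs.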
Purpose: Parse regular expressions and convert them to NFAs using Thompson's construction.
- Grammar (productions listed from lowest to highest operator precedence):
E → T E' (Union/alternation)
E' → | T E' | ε
T → F T' (Concatenation)
T' → F T' | ε
F → P F' (Kleene star)
F' → * F' | ε
P → (E) | a | ε (Primitives: parentheses, symbols, empty)
- Tokenization: Handles escaped characters (`\'`, `\"`), quotes, and special symbols
- Output format: Prefix notation like `union(concat(symbol[a],symbol[b]),star(symbol[c]))`
- Edge cases: Empty strings, leading/trailing `|`, nested parentheses
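A compact recursive-descent sketch of this grammar (no escape or quote handling, and the prefix-string output shape is assumed from the example above):

```python
def parse_re(src):
    """Recursive descent over E/T/F/P, emitting prefix notation."""
    pos = 0

    def peek():
        return src[pos] if pos < len(src) else None

    def parse_E():                      # E -> T ('|' T)*  (union)
        nonlocal pos
        node = parse_T()
        while peek() == '|':
            pos += 1
            node = f'union({node},{parse_T()})'
        return node

    def parse_T():                      # T -> F F ...     (concatenation)
        node = parse_F()
        while peek() is not None and peek() not in '|)':
            node = f'concat({node},{parse_F()})'
        return node

    def parse_F():                      # F -> P '*'*      (Kleene star)
        nonlocal pos
        node = parse_P()
        while peek() == '*':
            pos += 1
            node = f'star({node})'
        return node

    def parse_P():                      # P -> (E) | a | ε (primitives)
        nonlocal pos
        c = peek()
        if c == '(':
            pos += 1
            node = parse_E()
            assert peek() == ')', 'unbalanced parentheses'
            pos += 1
            return node
        if c is None or c in '|)*':
            return 'epsilon'            # empty alternative, e.g. trailing '|'
        pos += 1
        return f'symbol[{c}]'

    return parse_E()
```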
- `re_to_nfa(expression)`: Main entry point
  - Parses the regex string into a prefix-notation AST
  - Recursively builds the NFA from the AST
- `build_nfa(expr)`: Recursive construction
  - `symbol[a]` → calls `regular.symbol(a)`
  - `epsilon` → calls `regular.epsilon()`
  - `union(L,R)` → calls `union_nfa(build_nfa(L), build_nfa(R))`
  - `concat(L,R)` → calls `concat_nfa(build_nfa(L), build_nfa(R))`
  - `star(X)` → calls `star_nfa(build_nfa(X))`
- `union_nfa.py` - Union (L₁ ∪ L₂)
  - Creates a new start state with ε-transitions to both NFAs' start states
  - Preserves all accept states from both NFAs
  - Key insight: Nondeterminism allows "choosing" which NFA to follow
- `concat_nfa.py` - Concatenation (L₁ ∘ L₂)
  - Links the accept states of the first NFA to the start state of the second via ε-transitions
  - New start = first NFA's start; new accept = second NFA's accept states
  - Key insight: Empty-string transitions connect the two languages
- `star_nfa.py` - Kleene star (L*)
  - Creates new start/accept states
  - ε-transitions: new start → original start, original accept states → original start, original accept states → new accept
  - Key insight: Allows zero or more repetitions via looping
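The three constructions can be sketched over a plain-dict NFA shape (keys, prefixes, and state names here are illustrative, not the project's API; the star sketch also adds an explicit new-start → new-accept ε-move so that ε itself is accepted):

```python
EPSILON = '&'

def rename(nfa, prefix):
    """Prefix every state name to avoid collisions (e.g. '1_q0', '2_q0')."""
    r = lambda q: f'{prefix}_{q}'
    return {'start': r(nfa['start']),
            'accept': {r(q) for q in nfa['accept']},
            'transitions': {(r(q), a): [r(p) for p in ps]
                            for (q, a), ps in nfa['transitions'].items()}}

def union_nfa(n1, n2):
    n1, n2 = rename(n1, '1'), rename(n2, '2')
    trans = {**n1['transitions'], **n2['transitions']}
    trans[('u0', EPSILON)] = [n1['start'], n2['start']]  # new start "chooses"
    return {'start': 'u0', 'accept': n1['accept'] | n2['accept'],
            'transitions': trans}

def concat_nfa(n1, n2):
    n1, n2 = rename(n1, '1'), rename(n2, '2')
    trans = {**n1['transitions'], **n2['transitions']}
    for q in n1['accept']:                 # link first's accepts to second's start
        trans.setdefault((q, EPSILON), []).append(n2['start'])
    return {'start': n1['start'], 'accept': n2['accept'], 'transitions': trans}

def star_nfa(n):
    n = rename(n, '1')
    trans = dict(n['transitions'])
    trans[('s0', EPSILON)] = [n['start'], 's1']          # enter loop, or accept ε
    for q in n['accept']:                                # loop back, or exit
        trans.setdefault((q, EPSILON), []).extend([n['start'], 's1'])
    return {'start': 's0', 'accept': {'s1'}, 'transitions': trans}
```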
- `re_to_nfa(regex)`: Converts a regex to an NFA
- `precompute_epsilon_closures(nfa)`: Precomputes ε-closures for all states (optimization)
- `simulate_nfa(nfa, string, closures)`: Efficient NFA simulation
  - Uses precomputed closures to avoid redundant BFS
  - Time complexity: O(n·m), where n = string length, m = NFA size
- Processes stdin line by line and prints matching lines
Thompson's Construction:
- Each regex operator (union, concat, star) has a corresponding NFA construction
- Recursively combines smaller NFAs into larger ones
- Resulting NFA has O(m) states, where m = regex length

Epsilon Closure Optimization:
- Precompute reachable states via ε-transitions for each state
- Reduces redundant exploration during simulation
- Critical for performance on complex regexes
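A sketch of the closure precomputation and the subset-style simulation it enables, assuming a plain-dict NFA with a `transitions[(state, symbol)] -> [states]` map (an illustration, not the project's code):

```python
EPSILON = '&'

def precompute_epsilon_closures(nfa):
    """For each state, the set reachable via ε-transitions alone (incl. itself)."""
    closures = {}
    for q in nfa['states']:
        closure, stack = {q}, [q]
        while stack:
            s = stack.pop()
            for nxt in nfa['transitions'].get((s, EPSILON), []):
                if nxt not in closure:
                    closure.add(nxt)
                    stack.append(nxt)
        closures[q] = closure
    return closures

def simulate_nfa(nfa, string, closures):
    """Track the set of reachable states; closures replace repeated ε-BFS."""
    current = set(closures[nfa['start']])
    for ch in string:
        nxt = set()
        for q in current:
            for p in nfa['transitions'].get((q, ch), []):
                nxt |= closures[p]          # jump straight through ε-edges
        current = nxt
    return bool(current & nfa['accept'])
```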
Purpose: Extend regex engine with capturing groups and implement a sed-like text processor.
- New AST node: `GroupNode(number, child)`
  - Represents a capturing group `(pattern)` with its assigned group number
  - Groups are numbered 1, 2, 3, ... in order of opening parentheses
- Parser enhancements:
  - Tracks `_group_count` during parsing
  - Wraps parenthesized expressions in `GroupNode`
- AST printer: Converts the AST to prefix notation with `group[k](...)`
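A minimal sketch of what the node and its printer might look like (the dataclass shape is an assumption; only already-printed children are handled, to keep the example short):

```python
from dataclasses import dataclass

@dataclass
class GroupNode:
    """Capturing group: wraps a child AST node and carries its group number."""
    number: int
    child: object

def to_prefix(node):
    """Tiny printer in the README's prefix notation (group[k](...))."""
    if isinstance(node, GroupNode):
        return f'group[{node.number}]({to_prefix(node.child)})'
    return str(node)  # assume the child is already a printed prefix string
```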
- `Matcher` class: Recursive matcher that tracks captures
  - `match(s)`: Full match (entire string must match)
  - `match_partial(s)`: Partial match (longest matching prefix)
  - `_match_node(node, s, pos, captures)`: Core recursive matching
    - Returns `(new_position, updated_captures_dict)`
  - `GroupNode`: Captures `s[start:end]` into `captures[group_num]`
  - Handles union, concat, and star with capture propagation
- `apply_replacement(regex, replace, current)`: Substitution with backreferences
  - Matches the pattern and extracts groups
  - Replaces `\g<1>`, `\g<2>`, etc. with the captured groups
  - Returns the modified string
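The substitution half can be sketched with a manual scan for `\g<k>` markers (the matching half is elided; `captures` is the group-number → substring dict described above, and the function name is hypothetical):

```python
def substitute(template, captures):
    """Expand \g<1>, \g<2>, ... from a captures dict; unknown group
    numbers expand to the empty string. A sketch, not apply_replacement."""
    out, i = [], 0
    while i < len(template):
        if template.startswith(r'\g<', i):
            j = template.index('>', i)                   # end of the \g<k> marker
            out.append(captures.get(int(template[i + 3:j]), ''))
            i = j + 1
        else:
            out.append(template[i])
            i += 1
    return ''.join(out)
```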
- Commands:
  - `s/pattern/replace/`: Substitution (uses `re_groups.apply_replacement`)
  - `/pattern/target`: Branching (loop, accept, reject, skip)
  - `:label`: Labels for branching
- Execution model:
  - Processes each input line through the command sequence
  - Branch commands jump to labels based on pattern matching
- Supports `-f scriptfile` and `-e expression` flags
- Verbose mode (`-v`) shows execution steps
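The execution model reduces to a small program-counter loop. In this sketch the command tuples and the injected `matches`/`substitute` hooks are simplifications of the real `s/…/…/`, `/…/target`, and `:label` syntax (and the `skip` target is omitted):

```python
def run_msed(commands, line, matches, substitute):
    """Program-counter execution over one input line. Commands are
    ('s', pat, repl), ('b', pat, target), or ('label', name) tuples."""
    labels = {cmd[1]: i for i, cmd in enumerate(commands) if cmd[0] == 'label'}
    pc = 0
    while pc < len(commands):
        kind = commands[pc][0]
        if kind == 's':
            _, pat, repl = commands[pc]
            line = substitute(pat, repl, line)
        elif kind == 'b':
            _, pat, target = commands[pc]
            if matches(pat, line):
                if target == 'accept':
                    return line            # emit the transformed line
                if target == 'reject':
                    return None            # drop the line
                pc = labels[target]        # jump to the label
                continue
        pc += 1
    return line
```

With substring hooks standing in for the regex engine, a loop that collapses runs of `aa` to `a` looks like `[('label','top'), ('s','aa','a'), ('b','aa','top'), ('b','a','accept')]`.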
- CLI tool: `re_groups <pattern> <string>`
  - Outputs "accept" or "reject" plus the captured groups
- Capture propagation: Captures flow through recursive matching, updated at group boundaries
- Partial matching: Tries all possible starting positions, selects longest match
- State machine: msed uses program counter (pc) to step through commands
Purpose: Add backreference support (`\g<1>`, `\g<2>`, etc.) to match previously captured groups.
- `parse_backrefs(s)`: Parses replacement strings
  - Extracts `\g<1>`, `\g<2>`, etc. as `Backreference` tokens
  - Returns a list of tokens (plain strings and `Backreference` objects)
- `match_with_backrefs(tree, line)`: Enhanced matcher
  - Environment (`env`): Dictionary mapping group numbers → captured strings
  - Backreference matching: When encountering a `backref` node:
    - Looks up `env[group_num]` to get the previously captured string
    - Checks whether the remaining input starts with that string
    - Advances the position by the length of the captured string
  - Group capture: When matching a `group` node, captures the substring into the environment: `env[group_num] = s[start:end]`
  - Star handling: Prevents infinite loops by tracking seen configurations
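The backreference step is small enough to show in isolation (a fragment with assumed names, not the full matcher):

```python
def match_backref(env, s, pos, group_num):
    """One backreference step: look up the captured text and check that
    the input at `pos` continues with it. Returns the new position, or
    None if the group was never captured or the text does not match."""
    captured = env.get(group_num)
    if captured is None:
        return None
    if s.startswith(captured, pos):
        return pos + len(captured)
    return None
```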
- Parses the regex pattern (supports groups)
- Uses `match_with_backrefs` to match lines
- Filters stdin, printing matching lines
- Enhanced to parse backreferences in patterns
- Converts `\g<1>` to backref nodes in the AST
Backreference Matching:
- Requires context (environment) to store captured groups
- Matching is context-dependent: same pattern matches differently based on previous captures
- Example: The pattern `(a+)\g<1>` matches "aa" (group 1 = "a", so the backref must also be "a")
- Complexity: Can be exponential in the worst case (backtracking)

Environment Management:
- Each recursive call maintains its own environment copy
- Groups captured in one branch don't affect other branches (union)
- Captures propagate through concatenation and star
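The context dependence can be checked against Python's built-in engine, where the same backreference is spelled `\1`:

```python
import re

# Same idea as (a+)\g<1>: \1 must repeat whatever group 1 captured
# on this particular attempt.
pattern = r'(a+)\1'
assert re.fullmatch(pattern, 'aa')            # group 1 = "a"
assert re.fullmatch(pattern, 'aaaa')          # group 1 = "aa"
assert re.fullmatch(pattern, 'aaa') is None   # odd length: no equal split exists
```

Note that matching `aaaa` requires backtracking: the greedy `a+` first grabs all four characters, fails the backreference, and retreats to `aa` — the behavior that makes the worst case exponential.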
Input String/Regex
↓
[cp2] parse_re.py → Prefix AST
↓
[cp2] re_to_nfa.py → NFA
↓
[cp1] nfa.match() → Accept/Reject + Path
OR
[cp2] agrep.py → Line filtering
OR
[cp3] re_groups.py → Match + Captures
↓
[cp3] msed.py → Text transformation
OR
[cp4] backrefs.py → Match with backreferences
↓
[cp4] bgrep.py → Advanced pattern matching
- NFA over DFA: Chose NFA for easier construction (Thompson's algorithm), handles ε-transitions naturally
- BFS simulation: Ensures O(n·m) time complexity, finds shortest accepting path
- Prefix notation: Simplifies parsing and AST manipulation
- State renaming: Prevents collisions when combining NFAs
- Precomputed closures: Optimization for repeated NFA simulation
- Recursive descent parsing: Clear, maintainable parser structure
- Environment-based captures: Clean separation between matching and capture tracking
- NFA construction: O(m) states for regex of length m
- NFA simulation: O(n·m) time, O(m) space for n-length string
- Regex parsing: O(m) time and space
- Group matching: O(n·m) time in typical case, can be exponential with backreferences
- Backreference matching: Worst-case exponential, but optimized with cycle detection
Each checkpoint has a comprehensive test suite (`test-cp*.sh`):
- Unit tests for individual operations
- Integration tests for full pipelines
- Performance tests (ensuring O(n) scaling, not O(n²))
- Edge case coverage (empty strings, special characters, nested structures)
The architecture supports easy extension:
- New regex operators: Add AST node + NFA construction + parser rule
- New matching modes: Extend the `Matcher` class with new methods
- New text processors: Follow the `msed.py` pattern (command parser + execution loop)
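As a concrete (hypothetical) sketch of the first extension path, a `plus` operator could even be added without touching the NFA layer at all, by desugaring `plus(X)` to `concat(X,star(X))` on the prefix string before `re_to_nfa` runs (the helper name and the string-rewriting approach are assumptions, not project code):

```python
def desugar_plus(expr):
    """Rewrite every plus(X) in a prefix-notation string to concat(X,star(X))."""
    while 'plus(' in expr:
        i = expr.index('plus(')
        depth, j = 1, i + 5                 # j starts just after 'plus('
        while depth:                        # find the matching close paren
            depth += {'(': 1, ')': -1}.get(expr[j], 0)
            j += 1
        inner = expr[i + 5:j - 1]
        expr = expr[:i] + f'concat({inner},star({inner}))' + expr[j:]
    return expr
```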
This project demonstrates:
- ✅ Algorithm Implementation: Thompson's construction, BFS graph traversal, recursive descent parsing
- ✅ Data Structures: NFA representation, AST manipulation, state machines
- ✅ Software Engineering: Modular design, clear separation of concerns, comprehensive testing
- ✅ Performance Optimization: Epsilon closure precomputation, efficient state space exploration
- ✅ Language Theory: Regular languages, automata theory, formal language parsing
- ✅ System Programming: CLI tools, file I/O, stdin/stdout processing
- ✅ Advanced Features: Capturing groups, backreferences, text transformation