A complete regular expression engine implemented from scratch, building from basic automata operations to advanced features like capturing groups and backreferences. This class project for [CSE 30151: Theory of Computing] demonstrates algorithm design and implementation (Thompson's construction, BFS graph traversal), data structure engineering (NFA representation, AST manipulation), parser construction (recursive descent parsing with operator precedence), performance optimization (epsilon closure precomputation), software engineering practices (modular design, comprehensive testing), system programming (CLI tool development, file I/O), and formal language theory (automata theory, regular languages, context-dependent matching).
# Basic regex matching
echo "hello world" | python3 cp2/agrep.py "hello.*"
# Capturing groups
python3 cp3/re_groups.py "(a+)(b+)" "aaabbb"
# Backreferences
echo "abab" | python3 cp4/bgrep.py "(a+)\g<1>"- Languages:
- Core Concepts: Nondeterministic Finite Automata (NFA), Thompson's Construction, Recursive Descent Parsing
- Algorithms: BFS graph traversal, epsilon closure optimization, context-dependent matching
- NFA Construction & Simulation: Implements nondeterministic finite automata with ε-transitions and efficient BFS-based simulation
- Regex Parsing: Recursive descent parser supporting union (
|), concatenation, and Kleene star (*) - Regex-to-NFA Conversion: Thompson's construction algorithm converting regex patterns to NFAs
- Capturing Groups: Extracts and tracks matched substrings with group numbering
- Backreferences: Supports
\g<1>,\g<2>, etc. to match previously captured groups - Text Processing: Mini sed implementation (
msed) for pattern-based text transformation - CLI Tools: Production-ready command-line utilities (
agrep,bgrep,re_groups,msed)
- Thompson's Construction: Learned how to systematically convert regex operators to NFA structures, understanding the elegance of using nondeterminism to handle alternation
- BFS Simulation: Implemented efficient NFA simulation with O(n·m) complexity, learning to balance correctness with performance
- Epsilon Closure Optimization: Discovered the importance of precomputation to avoid redundant graph traversals during matching
- NFA over DFA: Chose NFAs for easier construction via Thompson's algorithm, accepting slightly slower simulation for simpler implementation
- State Renaming: Implemented prefix-based state renaming when combining NFAs to prevent collisions, ensuring modularity
- Environment-Based Captures: Used a dictionary-based environment to track captures, allowing clean separation between matching logic and capture tracking
- Recursive Descent Parsing: Built a maintainable parser using recursive descent, handling operator precedence through grammar structure
- NFA construction: O(m) states for regex of length m
- NFA simulation: O(n·m) time complexity, optimized with precomputed epsilon closures
- Backreference matching: Exponential worst-case, mitigated with cycle detection
# Ensure Python 3 is installed
python3 --version1. Test NFA matching:
python3 cp1/nfa_path.py examples/sipser-n1.nfa "ab"2. Convert regex to NFA:
echo "a*b" | python3 cp2/re_to_nfa.py3. Pattern matching with agrep:
echo -e "hello\nworld\nhi" | python3 cp2/agrep.py "h.*"4. Capturing groups:
python3 cp3/re_groups.py "(a+)(b+)" "aaabbb"5. Backreferences:
echo -e "abab\nabba" | python3 cp4/bgrep.py "(a+)\g<1>"6. Text processing with msed:
echo "hello world" | python3 cp3/msed.py -e "s/hello/hi/"# Test all checkpoints
bash tests/test-cp1.sh
bash tests/test-cp2.sh
bash tests/test-cp3.sh
bash tests/test-cp4.shThe project is organized into 4 progressive checkpoints, each building on the previous:
Checkpoint 1 implements nondeterministic finite automata (NFAs). Nondeterminism (essentially, unbounded parallelism) is one of the core concepts in the course, and implementing it demonstrates how to simulate nondeterminism on deterministic hardware.
Checkpoint 2 implements a parser for regular expressions and combines it with NFAs to build a regular expression matcher like grep. The implementation is asymptotically much faster than implementations using Perl or Python's built-in regular expressions.
Checkpoint 3 uses the regular expression engine to implement a fragment of sed. It also demonstrates how, in principle, any computer program could be compiled into this fragment of sed.
Checkpoint 4 extends the regular expression matcher to handle backreferences. It shows how this extended matcher can be used to (slowly) solve the Boolean satisfiability problem and therefore any problem in NP.
cp1/ → Core NFA data structures and basic operations
cp2/ → Regular expression parsing and NFA construction (Thompson's algorithm)
cp3/ → Capturing groups and text processing (msed)
cp4/ → Backreferences support
Input String/Regex
↓
[cp2] parse_re.py → Prefix AST
↓
[cp2] re_to_nfa.py → NFA
↓
[cp1] nfa.match() → Accept/Reject + Path
OR
[cp2] agrep.py → Line filtering
OR
[cp3] re_groups.py → Match + Captures
↓
[cp3] msed.py → Text transformation
OR
[cp4] backrefs.py → Match with backreferences
See Architecture.md for detailed design documentation.
├── cp1/ # Core NFA foundation
├── cp2/ # Regex parsing and NFA construction
├── cp3/ # Capturing groups and msed
├── cp4/ # Backreferences
├── examples/ # Test files (NFAs, regexes, CNF formulas)
└── tests/ # Automated test suites
MIT License - see LICENSE file for details.