Regex Engine

A complete regular expression engine implemented from scratch, building from basic automata operations to advanced features like capturing groups and backreferences. This class project for [CSE 30151: Theory of Computing] demonstrates algorithm design and implementation (Thompson's construction, BFS graph traversal), data structure engineering (NFA representation, AST manipulation), parser construction (recursive descent parsing with operator precedence), performance optimization (epsilon closure precomputation), software engineering practices (modular design, comprehensive testing), system programming (CLI tool development, file I/O), and formal language theory (automata theory, regular languages, context-dependent matching).

Demo

# Basic regex matching
echo "hello world" | python3 cp2/agrep.py "hello.*"

# Capturing groups
python3 cp3/re_groups.py "(a+)(b+)" "aaabbb"

# Backreferences
echo "abab" | python3 cp4/bgrep.py "(a+)\g<1>"

Tech Stack

Languages:
Core Concepts: Nondeterministic Finite Automata (NFA), Thompson's Construction, Recursive Descent Parsing
Algorithms: BFS graph traversal, epsilon closure optimization, context-dependent matching

Key Features

NFA Construction & Simulation: Implements nondeterministic finite automata with ε-transitions and efficient BFS-based simulation
Regex Parsing: Recursive descent parser supporting union (|), concatenation, and Kleene star (*)
Regex-to-NFA Conversion: Thompson's construction algorithm converting regex patterns to NFAs
Capturing Groups: Extracts and tracks matched substrings with group numbering
Backreferences: Supports \g<1>, \g<2>, etc. to match previously captured groups
Text Processing: Mini sed implementation (msed) for pattern-based text transformation
CLI Tools: Production-ready command-line utilities (agrep, bgrep, re_groups, msed)

What I Learned / Design Tradeoffs

Algorithm Implementation

Thompson's Construction: Learned how to systematically convert regex operators to NFA structures, understanding the elegance of using nondeterminism to handle alternation
BFS Simulation: Implemented efficient NFA simulation with O(n·m) complexity, learning to balance correctness with performance
Epsilon Closure Optimization: Discovered the importance of precomputation to avoid redundant graph traversals during matching

Design Decisions

NFA over DFA: Chose NFAs for easier construction via Thompson's algorithm, accepting slightly slower simulation for simpler implementation
State Renaming: Implemented prefix-based state renaming when combining NFAs to prevent collisions, ensuring modularity
Environment-Based Captures: Used a dictionary-based environment to track captures, allowing clean separation between matching logic and capture tracking
Recursive Descent Parsing: Built a maintainable parser using recursive descent, handling operator precedence through grammar structure

Complexity Insights

NFA construction: O(m) states for regex of length m
NFA simulation: O(n·m) time complexity, optimized with precomputed epsilon closures
Backreference matching: Exponential worst-case, mitigated with cycle detection

How to Run Locally

Prerequisites

# Ensure Python 3 is installed
python3 --version

Basic Usage

1. Test NFA matching:

python3 cp1/nfa_path.py examples/sipser-n1.nfa "ab"

2. Convert regex to NFA:

echo "a*b" | python3 cp2/re_to_nfa.py

3. Pattern matching with agrep:

echo -e "hello\nworld\nhi" | python3 cp2/agrep.py "h.*"

4. Capturing groups:

python3 cp3/re_groups.py "(a+)(b+)" "aaabbb"

5. Backreferences:

echo -e "abab\nabba" | python3 cp4/bgrep.py "(a+)\g<1>"

6. Text processing with msed:

echo "hello world" | python3 cp3/msed.py -e "s/hello/hi/"

Run Test Suites

# Test all checkpoints
bash tests/test-cp1.sh
bash tests/test-cp2.sh
bash tests/test-cp3.sh
bash tests/test-cp4.sh

Architecture

The project is organized into 4 progressive checkpoints, each building on the previous:

Project 1: Nondeterministic Finite Automata (NFAs)

Checkpoint 1 implements nondeterministic finite automata (NFAs). Nondeterminism (essentially, unbounded parallelism) is one of the core concepts in the course, and implementing it demonstrates how to simulate nondeterminism on deterministic hardware.

Project 2: Regular Expression Matcher

Checkpoint 2 implements a parser for regular expressions and combines it with NFAs to build a regular expression matcher like grep. The implementation is asymptotically much faster than implementations using Perl or Python's built-in regular expressions.

Project 3: Mini sed Implementation

Checkpoint 3 uses the regular expression engine to implement a fragment of sed. It also demonstrates how, in principle, any computer program could be compiled into this fragment of sed.

Project 4: Backreferences and NP-Completeness

Checkpoint 4 extends the regular expression matcher to handle backreferences. It shows how this extended matcher can be used to (slowly) solve the Boolean satisfiability problem and therefore any problem in NP.

cp1/  → Core NFA data structures and basic operations
cp2/  → Regular expression parsing and NFA construction (Thompson's algorithm)
cp3/  → Capturing groups and text processing (msed)
cp4/  → Backreferences support

Data Flow

Input String/Regex
   ↓
[cp2] parse_re.py → Prefix AST
   ↓
[cp2] re_to_nfa.py → NFA
   ↓
[cp1] nfa.match() → Accept/Reject + Path
   OR
[cp2] agrep.py → Line filtering
   OR
[cp3] re_groups.py → Match + Captures
   ↓
[cp3] msed.py → Text transformation
   OR
[cp4] backrefs.py → Match with backreferences

See Architecture.md for detailed design documentation.

Project Structure

├── cp1/          # Core NFA foundation
├── cp2/          # Regex parsing and NFA construction
├── cp3/          # Capturing groups and msed
├── cp4/          # Backreferences
├── examples/     # Test files (NFAs, regexes, CNF formulas)
└── tests/        # Automated test suites

License

MIT License - see LICENSE file for details.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Regex Engine

Demo

Tech Stack

Key Features

What I Learned / Design Tradeoffs

Algorithm Implementation

Design Decisions

Complexity Insights

How to Run Locally

Prerequisites

Basic Usage

Run Test Suites

Architecture

Project 1: Nondeterministic Finite Automata (NFAs)

Project 2: Regular Expression Matcher

Project 3: Mini sed Implementation

Project 4: Backreferences and NP-Completeness

Data Flow

Project Structure

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
cp1		cp1
cp2		cp2
cp3		cp3
cp4		cp4
examples		examples
tests		tests
.gitignore		.gitignore
Architecture.md		Architecture.md
LICENSE		LICENSE
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Regex Engine

Demo

Tech Stack

Key Features

What I Learned / Design Tradeoffs

Algorithm Implementation

Design Decisions

Complexity Insights

How to Run Locally

Prerequisites

Basic Usage

Run Test Suites

Architecture

Project 1: Nondeterministic Finite Automata (NFAs)

Project 2: Regular Expression Matcher

Project 3: Mini sed Implementation

Project 4: Backreferences and NP-Completeness

Data Flow

Project Structure

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages