Skip to content

swindar-zhou/regex-engine

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

5 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Regex Engine

Python Shell License

A complete regular expression engine implemented from scratch, building from basic automata operations to advanced features like capturing groups and backreferences. This class project for [CSE 30151: Theory of Computing] demonstrates algorithm design and implementation (Thompson's construction, BFS graph traversal), data structure engineering (NFA representation, AST manipulation), parser construction (recursive descent parsing with operator precedence), performance optimization (epsilon closure precomputation), software engineering practices (modular design, comprehensive testing), system programming (CLI tool development, file I/O), and formal language theory (automata theory, regular languages, context-dependent matching).

Demo

# Basic regex matching
echo "hello world" | python3 cp2/agrep.py "hello.*"

# Capturing groups
python3 cp3/re_groups.py "(a+)(b+)" "aaabbb"

# Backreferences
echo "abab" | python3 cp4/bgrep.py "(a+)\g<1>"

Tech Stack

  • Languages: Python Shell Makefile
  • Core Concepts: Nondeterministic Finite Automata (NFA), Thompson's Construction, Recursive Descent Parsing
  • Algorithms: BFS graph traversal, epsilon closure optimization, context-dependent matching

Key Features

  • NFA Construction & Simulation: Implements nondeterministic finite automata with ε-transitions and efficient BFS-based simulation
  • Regex Parsing: Recursive descent parser supporting union (|), concatenation, and Kleene star (*)
  • Regex-to-NFA Conversion: Thompson's construction algorithm converting regex patterns to NFAs
  • Capturing Groups: Extracts and tracks matched substrings with group numbering
  • Backreferences: Supports \g<1>, \g<2>, etc. to match previously captured groups
  • Text Processing: Mini sed implementation (msed) for pattern-based text transformation
  • CLI Tools: Production-ready command-line utilities (agrep, bgrep, re_groups, msed)

What I Learned / Design Tradeoffs

Algorithm Implementation

  • Thompson's Construction: Learned how to systematically convert regex operators to NFA structures, understanding the elegance of using nondeterminism to handle alternation
  • BFS Simulation: Implemented efficient NFA simulation with O(n·m) complexity, learning to balance correctness with performance
  • Epsilon Closure Optimization: Discovered the importance of precomputation to avoid redundant graph traversals during matching

Design Decisions

  • NFA over DFA: Chose NFAs for easier construction via Thompson's algorithm, accepting slightly slower simulation for simpler implementation
  • State Renaming: Implemented prefix-based state renaming when combining NFAs to prevent collisions, ensuring modularity
  • Environment-Based Captures: Used a dictionary-based environment to track captures, allowing clean separation between matching logic and capture tracking
  • Recursive Descent Parsing: Built a maintainable parser using recursive descent, handling operator precedence through grammar structure

Complexity Insights

  • NFA construction: O(m) states for regex of length m
  • NFA simulation: O(n·m) time complexity, optimized with precomputed epsilon closures
  • Backreference matching: Exponential worst-case, mitigated with cycle detection

How to Run Locally

Prerequisites

# Ensure Python 3 is installed
python3 --version

Basic Usage

1. Test NFA matching:

python3 cp1/nfa_path.py examples/sipser-n1.nfa "ab"

2. Convert regex to NFA:

echo "a*b" | python3 cp2/re_to_nfa.py

3. Pattern matching with agrep:

echo -e "hello\nworld\nhi" | python3 cp2/agrep.py "h.*"

4. Capturing groups:

python3 cp3/re_groups.py "(a+)(b+)" "aaabbb"

5. Backreferences:

echo -e "abab\nabba" | python3 cp4/bgrep.py "(a+)\g<1>"

6. Text processing with msed:

echo "hello world" | python3 cp3/msed.py -e "s/hello/hi/"

Run Test Suites

# Test all checkpoints
bash tests/test-cp1.sh
bash tests/test-cp2.sh
bash tests/test-cp3.sh
bash tests/test-cp4.sh

Architecture

The project is organized into 4 progressive checkpoints, each building on the previous:

Project 1: Nondeterministic Finite Automata (NFAs)

Checkpoint 1 implements nondeterministic finite automata (NFAs). Nondeterminism (essentially, unbounded parallelism) is one of the core concepts in the course, and implementing it demonstrates how to simulate nondeterminism on deterministic hardware.

Project 2: Regular Expression Matcher

Checkpoint 2 implements a parser for regular expressions and combines it with NFAs to build a regular expression matcher like grep. The implementation is asymptotically much faster than implementations using Perl or Python's built-in regular expressions.

Project 3: Mini sed Implementation

Checkpoint 3 uses the regular expression engine to implement a fragment of sed. It also demonstrates how, in principle, any computer program could be compiled into this fragment of sed.

Project 4: Backreferences and NP-Completeness

Checkpoint 4 extends the regular expression matcher to handle backreferences. It shows how this extended matcher can be used to (slowly) solve the Boolean satisfiability problem and therefore any problem in NP.

cp1/  → Core NFA data structures and basic operations
cp2/  → Regular expression parsing and NFA construction (Thompson's algorithm)
cp3/  → Capturing groups and text processing (msed)
cp4/  → Backreferences support

Data Flow

Input String/Regex
   ↓
[cp2] parse_re.py → Prefix AST
   ↓
[cp2] re_to_nfa.py → NFA
   ↓
[cp1] nfa.match() → Accept/Reject + Path
   OR
[cp2] agrep.py → Line filtering
   OR
[cp3] re_groups.py → Match + Captures
   ↓
[cp3] msed.py → Text transformation
   OR
[cp4] backrefs.py → Match with backreferences

See Architecture.md for detailed design documentation.

Project Structure

├── cp1/          # Core NFA foundation
├── cp2/          # Regex parsing and NFA construction
├── cp3/          # Capturing groups and msed
├── cp4/          # Backreferences
├── examples/     # Test files (NFAs, regexes, CNF formulas)
└── tests/        # Automated test suites

License

MIT License - see LICENSE file for details.

About

A custom regular-expression engine implementing parsing, matching, and backreferences—built without using Python’s re library.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors