# ft_lex

A lexical analyzer generator written in Rust, inspired by the classic lex and flex tools.
## Overview

ft_lex is a lexical analyzer generator that reads specification files (`.l` files) containing regular expressions and actions, then generates C code that tokenizes input according to those specifications. Internally, it converts NFAs (non-deterministic finite automata) into DFAs (deterministic finite automata) for efficient pattern matching.
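The core of that conversion is the classic subset construction. The sketch below illustrates the idea with hypothetical minimal types; it is not ft_lex's actual internal representation:

```rust
use std::collections::{BTreeSet, HashMap};

// Hypothetical minimal NFA -- an illustrative sketch, not ft_lex's real
// types. Transitions map a (state, label) pair to target states; a
// `None` label represents an epsilon move.
struct Nfa {
    transitions: HashMap<(usize, Option<char>), Vec<usize>>,
}

impl Nfa {
    /// Epsilon-closure: every state reachable through epsilon moves alone.
    fn closure(&self, states: &BTreeSet<usize>) -> BTreeSet<usize> {
        let mut result = states.clone();
        let mut stack: Vec<usize> = states.iter().copied().collect();
        while let Some(s) = stack.pop() {
            if let Some(targets) = self.transitions.get(&(s, None)) {
                for &t in targets {
                    if result.insert(t) {
                        stack.push(t);
                    }
                }
            }
        }
        result
    }

    /// One subset-construction step: a DFA state is a set of NFA states;
    /// moving on `c` means stepping every member on `c`, then closing.
    fn step(&self, from: &BTreeSet<usize>, c: char) -> BTreeSet<usize> {
        let mut moved = BTreeSet::new();
        for &s in from {
            if let Some(targets) = self.transitions.get(&(s, Some(c))) {
                moved.extend(targets.iter().copied());
            }
        }
        self.closure(&moved)
    }
}
```

Repeating `step` over the input alphabet, starting from the closure of the NFA start state and registering each previously unseen set as a new DFA state, yields the complete DFA.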
## Features

- Regular Expression Support: Full regex support including character classes, quantifiers, alternation, and grouping
- Start Conditions: Both inclusive (`%s`) and exclusive (`%x`) start conditions for managing lexer states
- DFA Generation: Automatic conversion from NFA to an optimized DFA
- Standard Lex Compatibility: Supports common lex/flex syntax and features
- C Code Generation: Generates standalone C lexer code with minimal dependencies
## Requirements

- Rust toolchain (stable)
- C compiler (gcc/clang)
- Make
## Building and Usage

```sh
# Build the project
cargo build --release

# Run tests
cargo test

# Format code
cargo fmt

# Generate a lexer from a specification
cargo run -- lex_files/test.l

# Compile the generated lexer
make
```

## Specification File Format

A `.l` file consists of three sections separated by `%%`:
```lex
%{
/* C code and includes */
#include <stdio.h>
%}

/* Definitions section */
%s COMMENT
%x STRING

%%

/* Rules section */
"/*"                    BEGIN(COMMENT);
<COMMENT>"*/"           BEGIN(INITIAL);
[a-zA-Z_][a-zA-Z0-9_]*  printf("IDENTIFIER: %s\n", yytext);
[0-9]+                  printf("NUMBER: %s\n", yytext);

%%

/* User code section */
int main(void) {
    yylex();
    return 0;
}
```

## Project Structure

```
ft_lex/
├── src/
│   ├── main.rs           # Entry point
│   ├── file_parsing/     # Lex file parsing
│   │   ├── definitions/  # Definition section parsing
│   │   ├── rules/        # Rules section parsing
│   │   └── user_routine/ # User code section parsing
│   ├── regex/            # Regex engine
│   │   ├── nfa/          # NFA construction
│   │   ├── dfa/          # DFA construction
│   │   └── tokenizer/    # Regex tokenization
│   └── lex_creation/     # C code generation
│       ├── functions/    # Generated helper functions
│       ├── tables/       # DFA transition tables
│       └── templates/    # C code templates
├── lex_files/            # Example lexer specifications
└── Makefile              # Build automation
```
## Supported Regex Features

- Character classes: `[a-z]`, `[^0-9]`, `[[:alpha:]]`
- Quantifiers: `*`, `+`, `?`, `{n}`, `{n,}`, `{n,m}`
- Alternation: `a|b`
- Grouping: `(ab)+`
- Escape sequences: `\n`, `\t`, `\x41`, `\101`
- Dot operator: `.` (matches any character except newline)
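A few of these operators in context, as a hypothetical rules fragment (not one of the repository's examples):

```lex
%%
[A-Za-z_][A-Za-z0-9_]*  printf("identifier\n");
(foo|bar){2,3}          printf("repeated alternation\n");
\x41\101                printf("two 'A's via hex and octal escapes\n");
.                       ; /* any other character, except newline */
```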
## Start Conditions

```lex
%s INCLUSIVE_STATE
%x EXCLUSIVE_STATE

<STATE>pattern           action
<STATE1,STATE2>pattern   action
<*>pattern               action
```

## Lex Functions and Macros

- `yytext`: Matched text
- `yyleng`: Length of the matched text
- `BEGIN(state)`: Change the lexer state
- `REJECT`: Re-match with the next rule
- `yymore()`: Append the next match to the current one
- `yyless(n)`: Return characters to the input
- `input()`: Read one character
- `unput(c)`: Push a character back
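For example, an exclusive start condition combined with `yymore()` can accumulate a quoted string into `yytext` (an illustrative sketch, not taken from the repository's examples):

```lex
%x STRING
%%
\"               BEGIN(STRING); yymore();
<STRING>[^"\n]+  yymore();  /* keep accumulating into yytext */
<STRING>\"       { printf("STRING: %s\n", yytext); BEGIN(INITIAL); }
```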
## Example Specifications

The `lex_files/valid/` directory contains several example lexer specifications:

- `c_keywords.l`: C language keyword tokenizer
- `json.l`: JSON tokenizer
- `html.l`: HTML tag parser
- `minishell.l`: Shell command lexer
- `pascal.l`: Pascal language lexer
## Testing

```sh
# Run all tests
cargo test

# Run with a single thread (for file system tests)
cargo test -- --test-threads=1

# Build with the dotfile feature
cargo run --features dotfile -- lex_files/test.l
# This generates dfa.dot and dfa.png files
```

## Continuous Integration

The project uses GitHub Actions for continuous integration:
- Code formatting checks (`cargo fmt`)
- Test execution
- Release binary builds
## Known Limitations

- Trailing context (`/`) is tokenized but not fully implemented
- Line/column number tracking for the `^` and `$` anchors is incomplete
- Some advanced flex features are not yet supported
## Contributing

- Fork the repository
- Create a feature branch
- Make your changes
- Run `make fmt` and `make test`
- Submit a pull request
## License

This project is part of the 42 school curriculum.
## References

- Flex Manual
- Lex & Yacc
- The Dragon Book: *Compilers: Principles, Techniques, and Tools*