Lexing
The first stage in femto's compiler pipeline is lexical analysis, or lexing. During this stage, an input stream of characters (a single .fm file) is converted into a stream of character spans, called tokens, which represent keywords, identifiers, operators, etc. For example, take the following file:
```zig
let main = fn() void {
    let a = 5;
    let b = 1;
    let c = a + b;
};
```
The above file gets converted to the following token stream:
```
.k_let, .ident, .equal, .k_fn, .l_paren, .r_paren, .ident, .l_brace,
.k_let, .ident, .equal, .int_lit, .semi,
.k_let, .ident, .equal, .int_lit, .semi,
.k_let, .ident, .equal, .ident, .plus, .ident, .semi,
.r_brace, .semi,
```

A token represents a span into the entire file's string contents. That is, it contains a start index (inclusive), an end index (exclusive), and a token tag representing the type of token.
```zig
pub const Token = struct {
    tag: Tag,
    loc: Loc,

    pub const Loc = struct {
        start: u32,
        end: u32,
    };

    pub const Tag = enum {
        eof,
        ident,
        semi,
        k_let,
        // ...
    };
};
```

Notice that the character indices are `u32`, not `u64`. femtoc places an upper limit on any single file size of `UINT32_MAX` characters. This cuts the memory usage of storing token indices in half, and allows other optimizations later down the pipeline (such as improving cache locality). Note also that the token does not own any string data. When the pipeline wants the contents of a token, it can read it with `buf[loc.start..loc.end]`. There is only ever one copy of the file buffer (although later stages may copy specific strings, such as identifiers, for use in name resolution and string literal storage).
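To illustrate the span-based design, here is a minimal sketch of how a later stage might read a token's text. The `tokenString` helper is hypothetical (not part of femtoc), and the `Tag` enum is abbreviated:

```zig
const std = @import("std");

const Token = struct {
    tag: Tag,
    loc: Loc,

    const Loc = struct { start: u32, end: u32 };
    const Tag = enum { eof, ident, semi, k_let };
};

// Hypothetical helper: a token owns no string data, so reading its
// contents is just slicing the one shared file buffer with its span.
fn tokenString(source: []const u8, token: Token) []const u8 {
    return source[token.loc.start..token.loc.end];
}

test "read a token's contents from the file buffer" {
    const source = "let a = 5;";
    const tok: Token = .{ .tag = .k_let, .loc = .{ .start = 0, .end = 3 } };
    try std.testing.expectEqualStrings("let", tokenString(source, tok));
}
```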
Lexical analysis is performed by the `Lexer` structure. It operates as a finite state machine which matches characters in certain orders, and emits a token when a token-ending character is found. This could be a whitespace character, or any other character that doesn't look like part of the same token (e.g. `abc()` lexes to `.ident`, `.l_paren`, and `.r_paren` even though there is no whitespace).
```zig
pub const Lexer = struct {
    source: [:0]const u8, // null-terminated pointer to read-only buffer
    index: u32, // current location of lexer in buffer (only moves forward)

    const State = enum {
        start,
        ident,
        annot,
        period,
        equal,
        // ...
    };

    // runs the FSM until a token is emitted
    pub fn next(self: *Lexer) Token;
};
```

The `next` function body fires up the FSM, starting in the `.start` state. The span is initialized to start at the current character, and doesn't yet have a known end index. Then, we loop, switching on the current state, and then on the current character. This is documented in the source code of `lex.zig`, but should be fairly straightforward to understand. Here are a few notes:
- Single-character tokens (like punctuation and some operators) don't have any intermediate states; they simply set the token tag and return.
- Multi-character tokens that start with the same characters store state. For example, `>>` and `>=` both store the `.r_angle` state when the `>` is received in the `.start` state. Then, when in the `.r_angle` state, the lexer switches on the current character and emits either `.r_angle_r_angle` or `.r_angle_equal`.
- Tokens are named using the character names, not semantic meaning (`.r_angle_equal` rather than `.greater_equal`), because many operators correspond to multiple meanings, and the lexer is not context aware.
- The loop increments `self.index` at the end of each iteration. However, when breaking at the end of a token, the code must increment manually, since the loop afterthought won't get executed.
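Putting these notes together, here is a simplified sketch of what the `next` loop can look like. This is an illustration only, not femtoc's actual implementation: it handles just whitespace, identifiers, and `;`, and abbreviates the `Token` and `State` definitions shown earlier:

```zig
const Token = struct {
    tag: Tag,
    loc: struct { start: u32, end: u32 },

    const Tag = enum { eof, ident, semi };
};

const Lexer = struct {
    source: [:0]const u8,
    index: u32,

    const State = enum { start, ident };

    pub fn next(self: *Lexer) Token {
        // span starts at the current character; end index unknown for now
        var token: Token = .{ .tag = .eof, .loc = .{ .start = self.index, .end = undefined } };
        var state: State = .start;
        while (true) : (self.index += 1) {
            const c = self.source[self.index];
            switch (state) {
                .start => switch (c) {
                    0 => break, // null terminator: end of buffer
                    ' ', '\t', '\n' => token.loc.start = self.index + 1, // skip whitespace
                    ';' => {
                        token.tag = .semi;
                        // breaking skips the loop afterthought, so consume manually
                        self.index += 1;
                        break;
                    },
                    'a'...'z', 'A'...'Z', '_' => {
                        state = .ident;
                        token.tag = .ident;
                    },
                    else => break, // the real lexer handles many more cases
                },
                .ident => switch (c) {
                    'a'...'z', 'A'...'Z', '0'...'9', '_' => {}, // still in the identifier
                    else => break, // token-ending char: emit without consuming it
                },
            }
        }
        token.loc.end = self.index;
        return token;
    }
};
```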
Each level of the compiler pipeline makes progress towards understanding and validating that femto code is legal. The lexer validates that individual tokens are reasonable (`abc2` is an `.ident`, but `2abc` is not valid, since identifiers cannot start with numerics). However, it doesn't try to make sense of the stream of tokens; `.ident .l_brace .equal .asterisk .r_brace` is not caught.
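As a standalone illustration of this token-level rule (this helper is hypothetical and not femtoc's API; the real check happens inside the FSM states), the `abc2` vs `2abc` distinction boils down to:

```zig
const std = @import("std");

// Hypothetical sketch: an identifier may not start with a digit,
// and may contain only alphanumerics and underscores after that.
fn isValidIdent(text: []const u8) bool {
    if (text.len == 0) return false;
    if (std.ascii.isDigit(text[0])) return false; // 2abc rejected here
    for (text) |c| {
        if (!(std.ascii.isAlphanumeric(c) or c == '_')) return false;
    }
    return true;
}

test "identifiers cannot start with numerics" {
    try std.testing.expect(isValidIdent("abc2"));
    try std.testing.expect(!isValidIdent("2abc"));
}
```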