
Lexing

Overview

The first stage in femto's compiler pipeline is lexical analysis, or lexing. During this stage, an input stream of characters (a single .fm file) is converted into a stream of character spans, called tokens, which represent keywords, identifiers, operators, etc. For example, take the following file:

let main = fn() void {
    let a = 5;
    let b = 1;
    let c = a + b;
};

The above file gets converted to the following token stream:

.k_let, .ident, .equal, .k_fn, .l_paren, .r_paren, .ident, .l_brace,
.k_let, .ident, .equal, .int_lit, .semi,
.k_let, .ident, .equal, .int_lit, .semi,
.k_let, .ident, .equal, .ident, .plus, .ident, .semi,
.r_brace, .semi,

Token Structure

A token represents a span into the entire file's string contents. That is, it contains a start index (inclusive), end index (exclusive), and a token tag representing the type of token.

pub const Token = struct {
    tag: Tag,
    loc: Loc,

    pub const Loc = struct {
        start: u32,
        end: u32,
    };

    pub const Tag = enum {
        eof,
        ident,
        semi,
        k_let,
        // ...
    };
};

Notice that the character indices are u32, not u64. femtoc places an upper limit of UINT32_MAX characters on any single file. This halves the memory needed to store token indices and allows other optimizations later down the pipeline (such as improving cache locality). Note also that the token does not own any string data. When the pipeline wants the contents of a token, it can read it with buf[loc.start..loc.end]. There is only ever one copy of the file buffer (although later stages may copy specific strings, such as identifiers, for use in name resolution and string literal storage).
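
For example, a later stage that holds the source buffer can recover a token's text with a plain slice. The tokenString helper below is a minimal sketch and not an actual femtoc function:

// Hypothetical helper: recover a token's text by slicing the single
// shared source buffer. No string data is copied or owned here.
fn tokenString(source: [:0]const u8, token: Token) []const u8 {
    return source[token.loc.start..token.loc.end];
}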

Lexer

Lexical analysis is performed by the Lexer structure. It runs a finite state machine which matches characters in certain orders, and emits a token when a token-ending character is found. This could be a whitespace character, or any other character that doesn't look like part of the same token (e.g. abc() lexes to .ident and .l_paren even though there is no whitespace).

pub const Lexer = struct {
    source: [:0]const u8,   // null-terminated pointer to read-only buffer
    index: u32,             // current location of lexer in buffer (only moves forward)

    const State = enum {
        start,
        ident,
        annot,
        period,
        equal,
        // ...
    };

    // runs the FSM until a token is emitted
    pub fn next(self: *Lexer) Token;
};
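
As a rough illustration of how a caller drives the lexer, later stages call next in a loop until an .eof token is returned. This sketch assumes the Lexer can be initialized field-by-field and that a tokenize helper like the one below exists; neither is guaranteed to match femtoc exactly:

const std = @import("std");

// Sketch: run the lexer over a source buffer and collect every token,
// stopping once .eof is reported.
fn tokenize(allocator: std.mem.Allocator, source: [:0]const u8) !std.ArrayList(Token) {
    var lexer = Lexer{ .source = source, .index = 0 };
    var tokens = std.ArrayList(Token).init(allocator);
    while (true) {
        const token = lexer.next();
        try tokens.append(token);
        if (token.tag == .eof) break;
    }
    return tokens;
}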

The next function body fires up the FSM, starting in the .start state. The span is initialized to start at the current character, and doesn't yet have a known end index. Then we loop, switching on the current state and then on the current character. This is documented in the source code of lex.zig, but should be fairly straightforward to understand. Here are a few notes:

  • Single-character tokens (like punctuation and some operators) don't have any intermediate states; they simply set the token tag and return.
  • Multi-character tokens that start with the same characters store state. For example, >> and >= store the .r_angle state when the > is received in the .start state. Then, when in the .r_angle state, they switch on the current character and either emit .r_angle_r_angle or .r_angle_equal.
  • Tokens are named using the character names, not semantic meaning (.r_angle_equal rather than .greater_equal) because many operators correspond to multiple meanings, and the lexer is not context aware.
  • The loop increments self.index at the end of each iteration. However, when breaking at the end of a token, the code must increment the index itself, since the loop's continue expression won't be executed (see the sketch after this list).
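
The notes above can be summarized in a simplified sketch of the next body. The exact structure lives in lex.zig; the cases shown here (.semi and the .r_angle family) are illustrative, and any names not mentioned earlier on this page are assumptions:

pub fn next(self: *Lexer) Token {
    var result = Token{
        .tag = .eof,
        .loc = .{ .start = self.index, .end = undefined },
    };
    var state: State = .start;
    while (true) : (self.index += 1) {
        const c = self.source[self.index];
        switch (state) {
            .start => switch (c) {
                0 => break, // null terminator: emit .eof
                ';' => {
                    // single-character token: set the tag and return
                    result.tag = .semi;
                    self.index += 1; // manual increment, since breaking
                    break;           // skips the loop's continue expression
                },
                '>' => state = .r_angle, // could become >, >> or >=
                // ...
                else => {},
            },
            .r_angle => switch (c) {
                '>' => {
                    result.tag = .r_angle_r_angle;
                    self.index += 1;
                    break;
                },
                '=' => {
                    result.tag = .r_angle_equal;
                    self.index += 1;
                    break;
                },
                else => {
                    // any other character ends the token and is not consumed
                    result.tag = .r_angle;
                    break;
                },
            },
            // ...
            else => {},
        }
    }
    result.loc.end = self.index;
    return result;
}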

Level of Analysis

Each level of the compiler pipeline makes progress towards understanding and validating that femto code is legal. The lexer validates that tokens are reasonable (abc2 is an .ident, but 2abc is not valid since identifiers cannot start with digits). However, it doesn't try to make sense of the stream of tokens; a nonsensical sequence like .ident .l_brace .equal .asterisk .r_brace is not caught.
