diff --git a/docs/.oldspec.md b/docs/.oldspec.md deleted file mode 100644 index cd33bf7..0000000 --- a/docs/.oldspec.md +++ /dev/null @@ -1,458 +0,0 @@ -# Raven's specification -# 1. Introduction -Raven is a fast, pragmatic general-purpose language mixing imperative and object-oriented programming. - -> You are reading a very early version of the specification, which means some details might be missing and some information might be vague or incomplete. This document will become more formal and informative with time, but for now, don't expect much, as not even the prototype compiler is done. \ ->\ -> If you notice any errors or have suggestions, please open an issue or a pull request.\ ->\ ->Also note that I am a native Spanish speaker and my English is not perfect. It's a struggle to make even a paragraph the formal, well-structured piece you expect to find in a specification. As I said, feel free to correct anything in a pull request! I may also write a version of the specification in Spanish and translate every version to English to make the process easier, although the English specification would then get updates later. - -# 2. Semantics -> **DISCLAIMER:** work in progress! - -This section covers most of Raven's semantic rules. - -*Note:* two slashes (`//`) start a comment, which does not affect the program in any way and only carries information about it for the reader. Comments will be used frequently throughout this specification. - -## 2.1. Programs -Each Raven program consists of: -- One (and only one) `main` function. -- Zero or more user-defined functions with any name except `main`. - -## 2.2. Values and types -> This section *might* need further extension. - -Raven has a strict static type system. Each variable, field or argument has a type assigned at compile time that can't change at runtime. - -Each *literal* can represent one or more types. 
For example, `"Hello, World!"` can only be a `string`, while `56` can represent signed, unsigned or floating-point numeric types of any bit size (`int8`, `int16`, `int32`, `int64`, `uint8`, `uint16`, `uint32`, `uint64`, `float8`...). - -These are all the primitive types plus some basic structural types in Raven: -- **String** (`string`): represents a string of characters. -- **Numeric**: various types representing any numeric value, which can be an integer (`int`) or a floating-point number (`float`). Integers can be *signed* (`int`) or *unsigned* (`uint`). All numeric types can have an optional bit size: 8, 16, 32 or 64, with 32 being the default. -- **Boolean** (`boolean`): type consisting of only the `true` and `false` literals. An integer can be converted into a boolean (`1` = `true`, `0` = `false`) and vice versa. -- **Void** (`void`): represents **absence of a returned value in functions**. Not to be confused with the `None` types of other languages, which represent **absence of value**. -- **Array** (`[T]`): represents a list of elements sharing a type. -- **Map** (`{K, V}`): represents a list of key-value pairs sharing key and value types. - -Functions in Raven are first-class, which means that functions have their own type (`(A...)R`) and have a literal: -``` -let add = function(a: int, b: int) int { return a + b; }; // Here "add" is "(int, int)int" -``` - -String literals can contain: -- Escaped characters, such as newlines (`\n`), tabs (`\t`), etc. -- Variables/values/calls using string interpolation. For example, to insert the [variable]() "five" of value `5` into a string starting with "The number is ", we write `The number is ${five}` (`${five}` being the interpolation), which compiles to `The number is 5`. - -## 2.3. Declarations -In Raven, a *declaration* is any statement that explicitly declares either a new usable member with a type and value (variables, functions) or a completely new type (structures). - -### 2.3.1. 
Variables -Variables are identifiers with an assigned type and value. They can be mutable (their value can vary) or immutable (their value is only the initially assigned one). - -To declare a mutable variable, the `let` keyword is used: -``` -let you = "cool"; // Let "you" be a string with variable value "cool" -you = "even cooler"; -``` - -To declare an immutable variable, the `const` keyword is used: -``` -const you = "cool"; // Let "you" be a string with the constant value "cool" -you = "even cooler"; // ERROR: "you" is constant. -``` - -All variables have a scope. The scope determines which code can access the variable: -``` -const x = 1; // accessible throughout the program - -function main() int { - const y = 2; // accessible inside the main function - (function() { - const z = 3; // accessible only inside this anonymous function - })(); - return 0; -} -``` - -Variable shadowing (redeclaring a variable in an inner scope) is illegal in Raven. If we, for example, redeclared "x" in the main function, we would get a compile error. - -### 2.3.2. Functions -Functions, in Raven, are values that, when called, run a sequence of statements (a block). Functions have: -- Zero or more arguments, each with a name and type. -- One optional return value. - -To declare a function, the `function` keyword is used: -``` -// Declare function add, which has integer arguments "a", "b" and returns an integer. -function add(a: int, b: int) int { - return a + b; -} -``` -If the function returns no value, set its return type to `void`. - -You can't declare the exact same function twice (same name, arguments and return type), but you can have several overloads if you change the types of the arguments. - -One clear example of overloading is the `print` function, which can print any primitive type as a string. It works by having one overload for each type. - -Another feature functions implement is the variable argument. 
Each function can have at most one argument of undefined length. The variable argument is syntactic sugar which, when compiled, simply converts all values passed to the variable argument into a single array. - -The `add_all.rvn` example shows these two features: -``` -function addAll(...numbers: int) int { - let sum = 0; - for (number in numbers) { - sum += number; - } - return sum; -} - -function addAll(...numbers: float) float { - let sum = 0.0; - for (number in numbers) { - sum += number; - } - return sum; -} - -function main() { - print(addAll(3, 2, 6, 10)); // 21 - print(addAll(3.02, 2.06, 6.1, 10.0)); // 21.18 - - let a, b, c: uint = 5, 3, 7; - print(addAll(a, b, c)); // ERROR: No overload for unsigned integers. -} -``` - -Functions can also have default arguments, which take a default value unless the caller provides one. -``` -function oldEnoughToDrink(age: int, minAge: int = 18) boolean { - return age >= minAge; -} - -function main() int { - print(oldEnoughToDrink(19)); // true - print(oldEnoughToDrink(19, 21)); // false - return 0; -} -``` -*`default.rvn` example* - -### 2.3.3. Structures -Structures represent a whole new structured type. They contain fields and methods, may or may not have a constructor and, unlike in some other languages, can't inherit from other structures (although composition, which consists of using structures inside structures, is allowed and even encouraged). Structures don't require fields or methods, meaning that an empty structure is completely valid, although useless. - -To declare a structure, the `struct` keyword is used: -``` -struct Vector(x: float, y: float) { - private x: float, - private y: float, - - function getX(self) float { - return self.x; - }; - - function getY(self) float { - return self.y; - }; -} -``` - -Let's analyze this structure: -- It has two arguments in its constructor: "x" and "y", both floats. -- It has two private fields: "x" and "y", both floats. 
-- It has two methods: "getX" and "getY", which return fields "x" and "y", as they aren't accessible from outside the structure. - -The constructor is used to assign values to fields at object creation. In this case, the arguments "x" and "y" determine the values of the fields of the same name. -> At the time of writing, constructors only serve this purpose, but this might change to allow code execution at object creation and not only value passing.\ -> Another change that will most likely happen is the introduction of explicit syntax to assign a field inside the structure to one of the arguments of the constructor. - -Fields represent a data member of any type (primitive, structural or user-defined) accessible only through the structure. Fields can be either public or private, and every field is public unless marked with the `private` keyword. Private fields can't be accessed from outside the structure, only from the inside (for example, inside methods). Fields can also be constant, which means that the data inside them can't vary, similar to [immutable variables](). - -To create an object from a structure, we use the `new` keyword: -``` -struct Vector { - // ... -} - -function main() int { - let vec = new Vector(5.0, 2.5); - return 0; -} -``` -The object contains: -- All declared fields with a respective value (as there is no `null`, every field **must** have a value). -- All declared methods as functions (which some might count as fields, considering that functions are first-class, although functions inside structures may always be called "methods"). - -To access a field inside a structure we use the member expression: `.`. The fields "x" and "y" are private, which means that we can't access them from outside the structure (`vec.x`/`vec.y`): -``` -function main() int { - let vec = new Vector(5.0, 2.5); - print(vec.getX()); // Outputs "5.0" - // print(vec.x); // ERROR: Field "x" is private. 
- return 0; -} -``` - -As we passed `5.0` and `2.5` as arguments when creating the object, the fields "x" and "y" will have those values respectively. You may choose to include the names of the arguments when creating this object, although this is completely optional. - -### 2.3.4. Enumerations -Enumerations define a list of constant identifiers, each assigned to a number. Variables can then be assigned one of these identifiers. - -Enumerations are useful when a variable can only take one of a fixed set of values and each value deserves a constant name instead of a bare number or string. Rather than relying on comments or documents stating that 0 is Idle, 1 is Active and 2 is Error, you assign the variable directly to Idle, Active or Error. - -To declare enumerations, use the `enum` keyword: -``` -enum State { - Idle, // 0 - Active, // 1 - Error // 2 -} -``` - -To then use enumerations, you can simply access each member exactly as in [structures](). -``` -let state = State.Idle; -``` -You can also drop the enumeration name and write only `.` followed by the member if the variable/field/argument is already annotated with the enumeration as its type. -``` -let state: State = .Idle; - -struct Program(state: State = .Idle) { - state: State, -} -``` - -## 2.4. Statements -A statement is code that only causes a side effect (a new variable, repetition of an action through a loop...) and has no return value. - -### Variable declaration -Declares a new variable, mutable (declared with `let`) or immutable (declared with `const`) (see [2.3.1.]()). - -Declarations can optionally have a type annotation; if it is absent, the type is inferred. If the compiler can't infer the type, then a type annotation is mandatory. 
- -**Examples:** -``` -let five = 5; // Mutable, inferred type (integer) -const name = "Raven"; // Immutable, inferred type (string) -let red, green, blue: uint8 = 255, 0, 255; // All mutable, specified type (uint8) - -// ERRORS -let empty = []; // ERROR: Can't infer type. -let number: string = 7; // ERROR: Expression (integer) does not match type (string) -``` - -### Function declaration -Declares a new function with a static number of arguments and one return type. - -See [2.3.2.]() for more details. - -### Structure declaration -Declares a new structure with a static number of fields and methods. It may optionally declare constructor arguments. - -See [2.3.3.]() for more details. - -### Enumeration declaration -Declares a new enumeration with a static number of members. Each member consists only of an identifier. - -See [2.3.4.]() for more details. - -### If -Executes code only if a condition is true. -``` -if () { - ... -} -``` - -If-statements can additionally include one or more else-if segments, each of which checks another condition if all previous ones were false, and at most one else segment, which runs if every condition was false. -``` -if () { - ... -} else if () { - ... -} else { // All were false - ... -} -``` - -There can be any number of else-ifs, but long chains are discouraged. -``` -if (...) { - ... -} else if (...) { - ... -} else if (...) { - ... -} else if (...) { - ... -} -``` - -### Switch -The switch statement checks one expression against multiple possible values. - -An expression is supplied as the value we "switch on", then we define several cases, each one including one or more expressions/ranges. If one of these cases matches, the respective code is executed. The switch statement can also define a "default" case, which executes if no other case matched. 
-``` -let x: uint = 5 + 4; -switch (x) { - case (9) { - print("Exactly 9!"); - }, - case (0..8) { - print("Lower than 9!"); - }, - default { - print("Over 9!"); - } -} -``` - - -### Loops -There are four types of loops in Raven: -- For loops -- While loops -- Until loops -- Repeat loops - -#### For loop -Loop over every element in a sequence, including: -- Every member of an array. -- Every pair of a map. -- Every character of a string. -- Every number in a range. -Among others. - -**Some examples:** - -Iterate over each element in an array (one of the most common uses). -``` -let array = [5, 4, 78, 100, 2, 9 + 10, 21]; - -for (index, value in array) { - print("Index: ${index}"); - print("Value: ${value}"); - print("---"); -} -``` -*Output:* -``` -Index: 0 -Value: 5 ---- -Index: 1 -Value: 4 ---- -Index: 2 -Value: 78 -... -``` - -Iterate over each number in a range of 0 through 10. -``` -for (index in 0..10) { - print(" ${index}", ""); // no newline -} -``` -*Output:* -``` -0 1 2 3 4 5 6 7 8 9 10 -``` - -#### While loop -Loop **while** a condition is true. -``` -let i = 0; -while (i * 2 != 12) { - print(i); - i++; -} -``` -*Output:* -``` -0 -1 -2 -3 -4 -5 -``` - -#### Until loop -Loop **until** a condition is true (opposite of while loop). -``` -let i = 0; -until (i * 2 == 12) { - print(i); - i++; -} -``` -*Output:* -``` -0 -1 -2 -3 -4 -5 -``` - -#### Repeat loop -Loops a specific number of times. -``` -let i = 0; -repeat 6 { - print(i); - i++; -} -``` -*Output:* -``` -0 -1 -2 -3 -4 -5 -``` - -### Defer -Defer is a very simple statement that delays the execution of a line/block to the end of the parent block. Defers run in reverse order of declaration. -``` -function main() int { - defer print("I already won!") - defer print("See you at the end!") - - return 0; -} -``` -*Output:* -``` -See you at the end! -I already won! 
-``` - -> Defers are extremely useful for cleaner blocks, as all boilerplate can be written at the start, so all that's next is the actual code and not mandatory boilerplate. - -### Return -Returns an expression, marking the end of the block. Return can only be used inside functions, as the file can't return a value. -> Raven is not [Lua](). - -If return is used in a void function, a compile error is thrown. If code is written after return, a compile error is thrown. - -### Throw -Throw will stop execution of code (unless ran inside of a [try-catch statement](spec#Try-catch)) and throw an error. - -Any expression can be passed to throw, which will convert it into a string and use it as the error message. -> I see this as too unsafe and limiting: what if the user tries to pass a structure as an error thinking it may work? What if the user wants to define a structure to use as their own "error" type, but can't directly pass the error and has to convert it into a string?\\ -> Have to reconsider. - -### Try-catch -Try-catch will safely run a block, and allow the user to access the error and handle it in the desired way (simply ignoring it, printing it, using a user-defined error handler, etc.). - -# 3. Syntax - -> **You can find the grammar [here](./grammar.md)** diff --git a/examples/test.rvn b/examples/test.rvn deleted file mode 100644 index c70b373..0000000 --- a/examples/test.rvn +++ /dev/null @@ -1,26 +0,0 @@ -func exit(int code) -{ - __builtin_syscall(1, code) -} - - -func add(int a, int b) -> int -{ - return a + b -} - -func main() -{ - int sum = add(1, 5) - if(sum >= 5) - { - __builtin_syscall(4, 1, "Greater than\n", 13) - } - - float pi = 3.14 - string text = "Hello from C² Compiler!" 
- char letter = text[0] - - bool isAlive = true - exit(1) -} \ No newline at end of file diff --git a/include/csquare/lexer/lexer.h b/include/csquare/lexer/lexer.h new file mode 100644 index 0000000..980391c --- /dev/null +++ b/include/csquare/lexer/lexer.h @@ -0,0 +1,127 @@ +#ifndef _LEXER_H +#define _LEXER_H + +#include +#include +#include + +#define T(NAME, STR) NAME, +#define TOKEN_TYPES \ + T(T_EOF, "EOF") \ + T(T_ERROR, "ERROR") \ + T(T_IDENTIFIER, "IDENTIFIER") \ + T(T_DECIMAL, "DECIMAL") \ + T(T_UNSIGNED, "UNSIGNED DECIMAL") \ + T(T_DOUBLE, "DOUBLE") \ + T(T_FLOAT, "FLOAT") \ + T(T_QUAD, "QUAD") \ + T(T_DECIMAL_EXPO, "DECIMAL EXPONENT") \ + T(T_UNSIGNED_EXPO, "UNSIGNED DECIMAL EXPONENT") \ + T(T_DOUBLE_EXPO, "DOUBLE EXPONENT") \ + T(T_FLOAT_EXPO, "FLOAT EXPONENT") \ + T(T_QUAD_EXPO, "QUAD EXPONENT") \ + T(T_STRING, "STRING") \ + \ + T(T_EQ, "EQUALS") \ + T(T_NEQ, "NOT EQUALS") \ + T(T_ASSIGN, "ASSIGN") \ + T(T_GREATER, "GREATER") \ + T(T_LESS, "LESS") \ + T(T_GREATER_EQUALS, "GREATER OR EQUALS") \ + T(T_LESS_EQUALS, "LESS OR EQUALS") \ + T(T_ADD, "ADD") \ + T(T_SUB, "SUBTRACT") \ + T(T_DIV, "DIVIDE") \ + T(T_MUL, "MULTIPLY") \ + T(T_ADD_ASSIGN, "ADD AND ASSIGN") \ + T(T_SUB_ASSIGN, "SUBTRACT AND ASSIGN") \ + T(T_DIV_ASSIGN, "DIVIDE AND ASSIGN") \ + T(T_MUL_ASSIGN, "MULTIPLY AND ASSIGN") \ + T(T_OPEN_PAREN, "OPEN PARENTHESES") \ + T(T_CLOSE_PAREN, "CLOSE PARENTHESES") \ + T(T_OPEN_BRACE, "OPEN BRACE") \ + T(T_CLOSE_BRACE, "CLOSE BRACE") \ + T(T_OPEN_BRACKET, "OPEN BRACKET") \ + T(T_CLOSE_BRACKET, "CLOSE BRACKET") \ + T(T_PERIOD, "PERIOD") \ + T(T_COMMA, "COMMA") \ + T(T_COLON, "COLON") \ + T(T_SEMICOLON, "SEMICOLON") \ + T(T_AND, "AND") \ + T(T_OR, "OR") \ + \ + T(T_KW_DO, "DO") \ + T(T_KW_IF, "IF") \ + T(T_KW_FOR, "FOR") \ + T(T_KW_INT, "INT") \ + T(T_KW_CHAR, "CHAR") \ + T(T_KW_VOID, "VOID") \ + T(T_KW_ELSE, "ELSE") \ + T(T_KW_ENUM, "ENUM") \ + T(T_KW_LONG, "LONG") \ + T(T_KW_QUAD, "QUAD") \ + T(T_KW_BOOL, "BOOL") \ + T(T_KW_CASE, "CASE") \ + T(T_KW_CONST, 
"CONST") \ + T(T_KW_TYPE, "TYPE") \ + T(T_KW_FLOAT, "FLOAT") \ + T(T_KW_GOTO, "GOTO") \ + T(T_KW_INFER, "INFER") \ + T(T_KW_SHORT, "SHORT") \ + T(T_KW_UCHAR, "UCHAR") \ + T(T_KW_UINT, "UINT") \ + T(T_KW_ULONG, "ULONG") \ + T(T_KW_ERROR, "ERROR") \ + T(T_KW_RETURN, "RETURN") \ + T(T_KW_STRUCT, "STRUCT") \ + T(T_KW_DOUBLE, "DOUBLE") \ + T(T_KW_STATIC, "STATIC") \ + T(T_KW_WHILE, "WHILE") \ + T(T_KW_DEFAULT, "DEFAULT") \ + T(T_KW_SWITCH, "SWITCH") \ + T(T_KW_USHORT, "USHORT") \ + T(T_KW_CONTINUE, "CONTINUE") + +typedef enum { TOKEN_TYPES T__COUNT } token_type; + +#undef T + +extern const char *token_type_str[T__COUNT]; + +struct token { + const char *start; + int length; + token_type type; +}; + +typedef struct token token; + +token *new_token(const char *start, int length, token_type type); +void free_token(token *tk); +token *error_token(const char *msg); + +typedef struct { + token **tokens; + size_t count; + size_t capacity; +} token_list; + +void init_token_list(token_list *list); +void free_token_list(token_list *list); +void add_token(token_list *list, token *tk); + +#define isws(c) (c == ' ' || c == '\t' || c == '\n' || c == '\r') +#define isdigit(c) (c >= '0' && c <= '9') +#define isalpha(c) (c >= 'a' && c <= 'z' || c >= 'A' && c <= 'Z') + +#define LEX_FUNC_ARGS const char *p, int *len +token *lex_symbol(LEX_FUNC_ARGS); +token *lex_digit(LEX_FUNC_ARGS); +token *lex_string(LEX_FUNC_ARGS); +token *lex_ident(LEX_FUNC_ARGS); +#undef LEX_FUNC_ARGS +token_list *lex(const char *src); + +void print_token(token *tk); + +#endif diff --git a/src/csquare/opt-common.c b/src/csquare/opt-common.c index 38c914c..59b8996 100644 --- a/src/csquare/opt-common.c +++ b/src/csquare/opt-common.c @@ -4,20 +4,20 @@ #include static void handle_info(csq_options *opts, const char *val); -static const opt_map_t opts[] = { - {"--info", "-i", OPT_KIND_FUNC, offsetof(csq_options, show_info), handle_info}}; - -static void handle_info(csq_options *opts, const char *val) { - (void)opts; - 
printf("Csquared - %s (%s)\n", CSQ_VERSION, __DATE__); - printf("Authors: %s\n", CSQ_AUTHORS); - #ifdef CSQ_DEBUG - printf("Build: Debug\n"); - #else - printf("Build: Release\n"); - #endif - opts->show_info = true; +static const opt_map_t opts[] = {{"--info", "-i", OPT_KIND_FUNC, + offsetof(csq_options, show_info), + handle_info}}; +static void handle_info(csq_options *opts, const char *_) { + (void)opts; + printf("Csquared - %s (%s)\n", CSQ_VERSION, __DATE__); + printf("Authors: %s\n", CSQ_AUTHORS); +#ifdef CSQ_DEBUG + printf("Build: Debug\n"); +#else + printf("Build: Release\n"); +#endif + opts->show_info = true; } csq_options *options_parse(int argc, char *argv[]) { csq_options *opt = calloc(1, sizeof(csq_options)); @@ -25,14 +25,14 @@ csq_options *options_parse(int argc, char *argv[]) { return NULL; for (int i = 1; i < argc; ++i) { const char *arg = argv[i]; - bool opt_found = false; + // bool opt_found = false; for (int j = 0; opts[j].long_name != NULL; ++j) { const opt_map_t *m = &opts[j]; if (STRCMP(arg, m->long_name) == 0 || (m->short_name && STRCMP(arg, m->short_name) == 0)) { - opt_found = true; - const char* val = NULL; + // opt_found = true; + const char *val = NULL; if (m->kind == OPT_KIND_FLAG) { *(bool *)((char *)opt + m->offset) = true; } else if (m->kind == OPT_KIND_VAL) { @@ -42,8 +42,8 @@ csq_options *options_parse(int argc, char *argv[]) { fprintf(stderr, "Error: %s requires an argument\n", arg); goto error; } - } else if(m->kind == OPT_KIND_FUNC && m->func) { - m->func(opt, val); + } else if (m->kind == OPT_KIND_FUNC && m->func) { + m->func(opt, val); } break; } diff --git a/src/lexer/lex_digit.c b/src/lexer/lex_digit.c new file mode 100644 index 0000000..64c5d84 --- /dev/null +++ b/src/lexer/lex_digit.c @@ -0,0 +1,46 @@ +#include "csquare/lexer/lexer.h" +#include + +token *lex_digit(const char *p, int *len) { + const char *start = p; + token_type type = T_DECIMAL; + bool has_dot = false; + bool has_exp = false; + + while (isdigit(*p)) + p++; + 
+ if (*p == '.') { + has_dot = true; + p++; + while (isdigit(*p)) + p++; + } + + if (*p == 'e' || *p == 'E') { + has_exp = true; + p++; + if (*p == '+' || *p == '-') + p++; + while (isdigit(*p)) + p++; + } + + if (*p == 'u') { + type = has_exp ? T_UNSIGNED_EXPO : T_UNSIGNED; + p++; + } else if (*p == 'f') { + type = has_exp ? T_FLOAT_EXPO : T_FLOAT; + p++; + } else if (*p == 'q') { + type = has_exp ? T_QUAD_EXPO : T_QUAD; + p++; + } else if (has_dot) { + type = has_exp ? T_DOUBLE_EXPO : T_DOUBLE; + } else if (!has_dot) { + type = has_exp ? T_DECIMAL_EXPO : T_DECIMAL; + } + + *len = (int)(p - start); + return new_token(start, *len, type); +} diff --git a/src/lexer/lex_ident.c b/src/lexer/lex_ident.c new file mode 100644 index 0000000..772678a --- /dev/null +++ b/src/lexer/lex_ident.c @@ -0,0 +1,80 @@ +#include "csquare/lexer/lexer.h" + +const struct { + const char *kw; + token_type type; + int len; +} keyword_table[] = {{"do", T_KW_DO, 2}, + {"if", T_KW_IF, 2}, + {"for", T_KW_FOR, 3}, + {"int", T_KW_INT, 3}, + {"char", T_KW_CHAR, 4}, + {"void", T_KW_VOID, 4}, + {"else", T_KW_ELSE, 4}, + {"enum", T_KW_ENUM, 4}, + {"long", T_KW_LONG, 4}, + {"quad", T_KW_QUAD, 4}, + {"bool", T_KW_BOOL, 4}, + {"case", T_KW_CASE, 4}, + {"type", T_KW_TYPE, 4}, + {"goto", T_KW_GOTO, 4}, + {"uint", T_KW_UINT, 4}, + {"const", T_KW_CONST, 5}, + {"float", T_KW_FLOAT, 5}, + {"infer", T_KW_INFER, 5}, + {"short", T_KW_SHORT, 5}, + {"uchar", T_KW_UCHAR, 5}, + {"ulong", T_KW_ULONG, 5}, + {"error", T_KW_ERROR, 5}, + {"while", T_KW_WHILE, 5}, + {"return", T_KW_RETURN, 6}, + {"struct", T_KW_STRUCT, 6}, + {"double", T_KW_DOUBLE, 6}, + {"static", T_KW_STATIC, 6}, + {"switch", T_KW_SWITCH, 6}, + {"ushort", T_KW_USHORT, 6}, + {"default", T_KW_DEFAULT, 7}, + {"continue", T_KW_CONTINUE, 8}}; + +int keyword_count = sizeof(keyword_table) / sizeof(keyword_table[0]); + +token *lex_ident(const char *p, int *len) { + char buf[64]; + int bufi = 0; + token_type type = T_IDENTIFIER; + + if (isalpha(*p) || *p == '_') 
{ + buf[bufi] = *p; + bufi++; + p++; + } + + while ((isalpha(*p) || isdigit(*p) || *p == '_' || *p == '?') && + (size_t)bufi < sizeof(buf) - 1) { + buf[bufi] = *p; + bufi++; + p++; + } + + buf[bufi] = '\0'; + int skip_kw = (buf[bufi - 1] == '?'); + + if (!skip_kw) { + for (size_t i = 0; i < (size_t)keyword_count; i++) { + if (bufi != keyword_table[i].len) + continue; + + if (buf[0] != keyword_table[i].kw[0]) + continue; + + if (memcmp(buf, keyword_table[i].kw, bufi) == 0) { + type = keyword_table[i].type; + break; + } + } + } + + const char *start = p - bufi; + *len = bufi; + return new_token(start, bufi, type); +} diff --git a/src/lexer/lex_string.c b/src/lexer/lex_string.c new file mode 100644 index 0000000..561c037 --- /dev/null +++ b/src/lexer/lex_string.c @@ -0,0 +1,28 @@ +#include "csquare/lexer/lexer.h" + +token *lex_string(const char *p, int *len) { + const char *start = p; + char delim = *p; + p++; + + while (*p != delim && *p != '\0') { + if (*p == '\\' && *(p + 1) != '\0') + p++; + p++; + } + + if (*p != delim) { + *len = (int)(p - start); + + const char *prefix = "Unterminated string: "; + char *msg = malloc(strlen(prefix) + *len + 1); + + sprintf(msg, "Unterminated string: %.*s", *len, start); + return error_token(msg); + } + + p++; + + *len = (int)(p - start); + return new_token(start, *len, T_STRING); +} diff --git a/src/lexer/lex_symbol.c b/src/lexer/lex_symbol.c new file mode 100644 index 0000000..310ff0c --- /dev/null +++ b/src/lexer/lex_symbol.c @@ -0,0 +1,38 @@ +#include "csquare/lexer/lexer.h" + +const struct { + const char *sym; + token_type type; +} symbol_table[] = { + {"==", T_EQ}, {"!=", T_NEQ}, {">=", T_GREATER_EQUALS}, + {"<=", T_LESS_EQUALS}, {"+=", T_ADD_ASSIGN}, {"-=", T_SUB_ASSIGN}, + {"/=", T_DIV_ASSIGN}, {"*=", T_MUL_ASSIGN}, {"&&", T_AND}, + {"||", T_OR}, {"=", T_ASSIGN}, {"+", T_ADD}, + {"-", T_SUB}, {"/", T_DIV}, {"*", T_MUL}, + {">", T_GREATER}, {"<", T_LESS}, {"(", T_OPEN_PAREN}, + {")", T_CLOSE_PAREN}, {"{", T_OPEN_BRACE}, 
{"}", T_CLOSE_BRACE}, + {"[", T_OPEN_BRACKET}, {"]", T_CLOSE_BRACKET}, {".", T_PERIOD}, + {",", T_COMMA}, {":", T_COLON}, {";", T_SEMICOLON}}; + +int symbol_count = sizeof(symbol_table) / sizeof(symbol_table[0]); + +token *lex_symbol(const char *p, int *len) { + int best_len = 0; + token_type best_type = T_ERROR; + + for (int i = 0; i < symbol_count; i++) { + int sym_len = strlen(symbol_table[i].sym); + if (strncmp(p, symbol_table[i].sym, sym_len) == 0 && sym_len > best_len) { + best_len = sym_len; + best_type = symbol_table[i].type; + } + } + + if (best_len == 0) { + best_len = 1; + best_type = T_ERROR; + } + + *len = best_len; + return new_token(p, best_len, best_type); +} diff --git a/src/lexer/lexer.c b/src/lexer/lexer.c new file mode 100644 index 0000000..c1f7c94 --- /dev/null +++ b/src/lexer/lexer.c @@ -0,0 +1,147 @@ +#include "csquare/lexer/lexer.h" +#include +#include +#include + +#define T(NAME, STR) [NAME] = STR, + +const char *token_type_str[T__COUNT] = {TOKEN_TYPES}; + +token *new_token(const char *start, int length, token_type type) { + token *tk = malloc(sizeof(token)); + if (!tk) { + perror("malloc failed"); + return NULL; + } + + tk->start = strdup(start); + tk->length = length; + tk->type = type; + return tk; +} + +void free_token(token *tk) { + if (!tk) + return; + + free(tk); +} + +token *emit(const char *src, int starti, int endi, token_type type) { + if (starti < 0 || endi < starti) { + return NULL; + } + + int len = endi - starti; + char *start = malloc(len + 1); + if (!start) { + return NULL; + } + + for (int i = 0; i < len; i++) { + start[i] = src[starti + i]; + } + start[len] = '\0'; + + return new_token(start, len, type); +} + +token *error_token(const char *msg) { + return new_token(msg, strlen(msg), T_ERROR); +} + +void free_token_list(token_list *list) { + if (!list) + return; + + for (size_t i = 0; i < list->count; i++) { + if (list->tokens[i]) + free_token(list->tokens[i]); + } +} + +#define peek(n) (p[n]) + +void 
init_token_list(token_list *list) { + list->count = 0; + list->capacity = 32; + list->tokens = malloc(sizeof(token *) * list->capacity); +} + +void add_token(token_list *l, token *tok) { + if (l->count >= l->capacity) { + l->capacity *= 2; + l->tokens = realloc(l->tokens, sizeof(token *) * l->capacity); + } + + l->tokens[l->count++] = tok; +} + +token_list *lex(const char *src) { + token_list *list = malloc(sizeof(token_list)); + init_token_list(list); + + char *p = (char *)src; + + while (*p) { + char c = *p; + + if (isws(c)) { + p++; + continue; + } + + if (c == '/' && p[1] == '/') { + p += 2; + while (*p != '\n') { + p++; + } + continue; + } + + if (c == '/' && p[1] == '*') { + p += 2; + while (*p != '*' && *(p + 1) != '/') { + p++; + } + p += 2; + continue; + } + + int consumed = 0; + token *tk = NULL; + + if (isdigit(c)) { + tk = lex_digit(p, &consumed); + } else if (isalpha(c) || c == '_') { + tk = lex_ident(p, &consumed); + } else if (c == '"' || c == '\'') { + tk = lex_string(p, &consumed); + } else { + tk = lex_symbol(p, &consumed); + } + + p += consumed; + add_token(list, tk); + } + + return list; +} + +void print_token(token *tk) { + const char *type_color = "\x1b[32m"; + if (tk->type == T_ERROR) + type_color = "\x1b[31m"; + + printf("Text: \x1b[33m"); + + for (int i = 0; i < tk->length; i++) { + char c = tk->start[i]; + if (c == '\n') + printf("\x1b[36m\\n\x1b[0m"); + else + putchar(c); + } + + printf("\x1b[0m, Type: %s%s\x1b[0m\n", type_color, token_type_str[tk->type]); +} diff --git a/src/main.c b/src/main.c index 4fab0e3..b86b711 100644 --- a/src/main.c +++ b/src/main.c @@ -1,9 +1,57 @@ -#include +#include "csquare/lexer/lexer.h" #include "csquare/opt-common.h" +#include +#include + +char *read_file(const char *filename) { + FILE *f = fopen(filename, "rb"); + if (!f) { + perror("fopen"); + return NULL; + } + fseek(f, 0, SEEK_END); + long size = ftell(f); + fseek(f, 0, SEEK_SET); + + char *buffer = malloc(size + 1); + if (!buffer) { + 
perror("malloc"); + fclose(f); + return NULL; + } + + if (fread(buffer, 1, size, f) != (size_t)size) { + perror("fread"); + free(buffer); + fclose(f); + return NULL; + } + buffer[size] = '\0'; + fclose(f); + return buffer; +} + +int main(int argc, char *argv[]) { + csq_options *opts = options_parse(argc, argv); + if (!opts) + return EXIT_FAILURE; + + if (argc < 2) { + fprintf(stderr, "Usage: %s \n", argv[0]); + return EXIT_FAILURE; + } + + char *src = read_file(argv[1]); + if (!src) + return EXIT_FAILURE; + + token_list *lexed = lex(src); + + for (size_t i = 0; i < lexed->count; i++) { + print_token(lexed->tokens[i]); + } -int main(int argc, char* argv[]) -{ - csq_options *opts = options_parse(argc, argv); - if(!opts) return EXIT_FAILURE; - return 0; + free_token_list(lexed); + free(src); + return 0; }
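Comment skipping in a hand-written lexer such as `lex()` has to stop at the terminating NUL as well as at the end of the comment, otherwise a source file that ends inside a comment walks the pointer past the buffer. A block comment also ends only at the two-character sequence `*/`, not at the first `*` or `/`. The following is a minimal sketch of two bounds-checked helpers; `skip_line_comment` and `skip_block_comment` are hypothetical names that are not part of this change, and the sketch assumes the NUL-terminated buffer that `read_file` returns.

```c
#include <string.h>

// Hypothetical helper: p points at the "//" that opens a line comment.
// Returns a pointer to the newline that ends the comment, or to the
// terminating NUL when the input ends before a newline is found.
static const char *skip_line_comment(const char *p) {
  const char *nl = strchr(p, '\n');
  return nl ? nl : p + strlen(p);
}

// Hypothetical helper: p points at the two characters that open a block
// comment. Returns a pointer just past the closing pair, or to the
// terminating NUL when the comment is never closed.
static const char *skip_block_comment(const char *p) {
  const char *end = strstr(p + 2, "*/");
  return end ? end + 2 : p + strlen(p);
}
```

Under these assumptions, each comment branch in `lex()` could reduce to a single assignment such as `p = (char *)skip_line_comment(p);` followed by `continue;`, and an unterminated block comment would simply end the token stream instead of reading out of bounds.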