interpreter/DOCUMENTATION.txt at master · k3zi/interpreter · GitHub

Overall Design:
  Tokenizer:
    1. Loads source file by passing file name to `Tokenizer` constructor.
    2. Upon the user calling `NextToken`, fetches the next token:
      i. Reads one or more (if needed) characters from the file stream.
      ii. If the character denotes a symbol, sets the current token to the
        type of symbol and the string covered by the token in the source file.
      iii. If the character is alphanumeric, the class will classify it by:
        - Lowercase -> `NextReservedToken` (since reserved tokens are the only
            tokens that start with a lowercase character)
        - Uppercase -> `NextIdentifier` (since identifier tokens are the only
            tokens that start with an uppercase character)
        - Numeric -> `NextInteger` (since numeric tokens are the only
            tokens that start with a numeric character)
      iv. If the character hasn't matched and is not whitespace, a newline, or
        an end-of-file, the tokenizer will throw an error.
      v. `NextReservedToken` / `NextIdentifier` / `NextInteger` process the
        file stream until all consecutive alphanumeric characters are consumed.
        These functions will throw any errors for tokens that are invalid in
        the CORE language (Ex: Integer/Identifier length greater than 8).
    3. Uses the `currentToken` getter to fetch the most recently consumed
      token.
    4. Uses `Token->getType()` to capture the number representation of every
      token in the source file. If the source file is erroneous, an error is
      output instead of the number representation. If the tokenizer reads all
      tokens successfully, the number representation is printed to the
      program's output.
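    The first-character dispatch in step 2 can be sketched roughly as below.
    This is a minimal illustration, not the code in Tokenizer.cpp: the
    `Dispatch` enum, the `classifyFirstChar` function, and the symbol set are
    all assumptions made for the example.

```cpp
#include <cctype>
#include <string>

// Hypothetical labels for the branch the tokenizer would take. These names
// are illustrative and are not taken from Tokenizer.h.
enum class Dispatch { Symbol, Reserved, Identifier, Integer, Skip, Error };

// Classifies the first character of a potential token, following step 2.
Dispatch classifyFirstChar(char c) {
  const unsigned char uc = static_cast<unsigned char>(c);
  if (std::isspace(uc)) return Dispatch::Skip;        // whitespace / newline
  if (std::islower(uc)) return Dispatch::Reserved;    // -> NextReservedToken
  if (std::isupper(uc)) return Dispatch::Identifier;  // -> NextIdentifier
  if (std::isdigit(uc)) return Dispatch::Integer;     // -> NextInteger

  // Assumed symbol set; the real list of CORE symbols may differ.
  const std::string symbols = ";,=!<>+-*()[]&|";
  if (symbols.find(c) != std::string::npos) return Dispatch::Symbol;

  return Dispatch::Error;  // unmatched character: the tokenizer would throw
}
```

    For example, under these assumptions `classifyFirstChar('X')` selects the
    identifier branch and `classifyFirstChar('#')` selects the error branch.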
  Parser:
    1. Assigns to the AST the value of a constructed program.
    2. The initializer of a `Prog` is passed the parser.
    3. The parsing methods are contained in the constructors of all the nodes.
    4. When an error is reached, it is propagated up to the constructor.
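    The idea of parsing inside node constructors can be sketched as follows.
    This is only an illustration under simplifying assumptions: the grammar is
    reduced to `program begin stmt* end`, tokens are plain strings, and every
    `Sketch...` name and `countStatements` is hypothetical rather than taken
    from the project's headers.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Stand-in for the real parser: a cursor over already-split tokens.
struct SketchParser {
  std::vector<std::string> tokens;
  std::size_t pos;

  explicit SketchParser(std::vector<std::string> t)
      : tokens(std::move(t)), pos(0) {}

  bool peekIs(const std::string &want) const {
    return pos < tokens.size() && tokens[pos] == want;
  }

  // Consumes the expected token or throws; the exception propagates up
  // through every node constructor that is currently running.
  void expect(const std::string &want) {
    if (!peekIs(want)) throw std::runtime_error("expected '" + want + "'");
    ++pos;
  }
};

// A node parses itself in its constructor.
struct SketchStmtSeq {
  int count;
  explicit SketchStmtSeq(SketchParser &p) : count(0) {
    while (p.peekIs("stmt")) {
      ++p.pos;
      ++count;
    }
  }
};

// The root node is handed the parser and constructs its children from it.
struct SketchProg {
  SketchStmtSeq body;
  explicit SketchProg(SketchParser &p) : body(header(p)) { p.expect("end"); }

 private:
  static SketchParser &header(SketchParser &p) {
    p.expect("program");
    p.expect("begin");
    return p;
  }
};

// Returns the number of parsed statements, or -1 if parsing failed.
int countStatements(std::vector<std::string> tokens) {
  SketchParser parser(std::move(tokens));
  try {
    SketchProg prog(parser);
    return prog.body.count;
  } catch (const std::runtime_error &) {
    return -1;
  }
}
```

    In this sketch a missing `begin` makes `expect` throw from inside the
    `SketchProg` constructor, so no partially built tree is ever returned.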
  Interpreter:
    1. Works similarly to how the parser recursively descends.
    2. The AST.Execute() method may be used to execute a parsed AST.
    3. The AST will execute the program (Prog.Execute) which will recursively
      descend the constructed abstract syntax tree calling Execute at each node
      and handling logic for statements.
    4. The ASTContext class houses the symbol table that holds all identifiers
      and their current values.
    5. All programs must initialize a variable at all paths in the program
      before the identifier may be used. This is enforced at the Parser level.
      As such, the errors that occur at the Interpreter level are limited to
      overflow/underflow and invalid integer input.
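    A rough sketch of recursive execution against a symbol table, including
    an overflow/underflow check, is given below. It assumes a much smaller
    node set than the real AST; all `Sketch...` names and `runSketch` are
    hypothetical, and `SketchContext` only loosely mirrors the role described
    for ASTContext.

```cpp
#include <climits>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Holds identifiers and their current values (the symbol table).
struct SketchContext {
  std::map<std::string, int> symbols;
};

struct SketchExpr {
  virtual ~SketchExpr() = default;
  virtual int Execute(SketchContext &ctx) const = 0;
};

struct SketchConst : SketchExpr {
  int value;
  explicit SketchConst(int v) : value(v) {}
  int Execute(SketchContext &) const override { return value; }
};

struct SketchVar : SketchExpr {
  std::string name;
  explicit SketchVar(std::string n) : name(std::move(n)) {}
  // Initialization is assumed to have been checked before execution.
  int Execute(SketchContext &ctx) const override {
    return ctx.symbols.at(name);
  }
};

struct SketchAdd : SketchExpr {
  std::unique_ptr<SketchExpr> lhs, rhs;
  SketchAdd(std::unique_ptr<SketchExpr> l, std::unique_ptr<SketchExpr> r)
      : lhs(std::move(l)), rhs(std::move(r)) {}
  int Execute(SketchContext &ctx) const override {
    const int a = lhs->Execute(ctx);  // descend into the left child
    const int b = rhs->Execute(ctx);  // descend into the right child
    // Check the bounds before adding so signed overflow never occurs.
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
      throw std::overflow_error("integer overflow/underflow in addition");
    return a + b;
  }
};

struct SketchAssign {
  std::string target;
  std::unique_ptr<SketchExpr> value;
  void Execute(SketchContext &ctx) const {
    ctx.symbols[target] = value->Execute(ctx);
  }
};

// Runs the two statements "X = x; Y = X + addend;" and returns Y.
int runSketch(int x, int addend) {
  SketchContext ctx;
  SketchAssign first{"X", std::unique_ptr<SketchExpr>(new SketchConst(x))};
  SketchAssign second{
      "Y", std::unique_ptr<SketchExpr>(new SketchAdd(
               std::unique_ptr<SketchExpr>(new SketchVar("X")),
               std::unique_ptr<SketchExpr>(new SketchConst(addend))))};
  first.Execute(ctx);
  second.Execute(ctx);
  return ctx.symbols.at("Y");
}
```

    Checking the bounds before performing the addition is a design choice of
    this sketch; the project's own overflow detection may work differently.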

Class Structure:
  - Class structure for the Parser & Interpreter can be found in the
    documenting header files.
  - Token.h:
    - TokenType (enum): This enum list is organized by the predetermined
      numbering of tokens. I.e: program, begin, end...
    - Token (class): A class for holding token information.
      - Private Members:
        - Type (TokenType): The type of token the `Data` represents.
        - Data (std::string): The underlying data behind the token.
        - LineNumber (int): The line number of this token.
        - ColumnNumber (int): The column number of this token.
      - Public Members:
        - `Constructor` (Token): Creates an empty token.
        - Getters for various members (defined above):
          - getType (() -> TokenType)
          - getData (() -> std::string)
          - getLineNumber (() -> unsigned)
          - getColumnNumber (() -> unsigned)
        - setToken((TokenType, std::string) -> void): Just a simple setter for
          the token properties.
        - setLocation((unsigned L, unsigned C) -> void): Just a simple setter
          for the token's location in the source file. L = line number; C =
          column number.
  - Tokenizer.h/cpp:
    - Static Private Members:
      - IdentifierMaxLength (static const unsigned): The maximum length of a
        valid identifier in CORE.
      - IntegerMaxLength (static const unsigned): The maximum length of a valid
        integer in CORE.
    - Private Members:
      - CurrentToken (Token): The most recent token. Upon initialization this
        is the undefined token type: `TokenType::undefined`.
      - FileStream (std::ifstream): The file stream to be opened upon
        initialization and closed on destruction.
      - LineNumber (unsigned): The line number that the tokenizer is
        processing. Incremented on line breaks. Counting starts from 1, not
        0, so line 1 is the first line.
      - ColumnNumber (unsigned): The column number that the tokenizer is
        processing. Resets on line breaks and increments on character
        consumption. Counting starts from 1, not 0, so column 1 is the first
        column.
      - NextIdentifier ((std::string) -> unsigned): Processes the next
        identifier token from the file stream. The passed in `tokenString`
        must contain a single uppercase character. This function processes up
        to (but not including) a character that is non-alphanumeric. Identifier
        Format = [A-Z]+[0-9]*
        This function will throw if:
          - Any of the characters after the initial character is a lowercase
            letter. Ex: ABc123.
          - Any character following the first digit is not a digit.
            Ex: ABC123X.
          - The resulting identifier exceeds `IdentifierMaxLength`.
        Returns: The number of characters consumed, i.e. the resultant
          token's length.
      - NextReservedToken ((std::string) -> unsigned): Processes the next
        reserved token from the file stream. The passed in `tokenString` must
        contain a single lowercase character. Reserved Token Format = [a-z]+
        and must match one of the predefined tokens. This function will throw
        if:
          - Any of the characters after the initial character are
            non-lowercase. Ex: beGin, end1.
          - The result matches none of the language's reserved words.
        Returns: The number of characters consumed, i.e. the resultant
          token's length.
      - NextInteger ((std::string) -> unsigned): Processes the next integer
        token from the file stream. The passed in `tokenString` must contain a
        single numeric character. This function processes up to (but not
        including) a character that is non-alphanumeric.
        Integer Format = [0-9]+
        This function will throw if:
          - Any of the characters after the initial character are non-numeric.
            Ex: 123c, 123ABC, 1x4, 45e8.
          - The resulting integer exceeds `IntegerMaxLength`.
        Returns: The number of characters consumed, i.e. the resultant
          token's length.
      - consumeToken (() -> void): Internal method that retrieves the next
        token from the file stream. Is called by public member `NextToken` and
        utilizes: `NextReservedToken`, `NextIdentifier`, and `NextInteger`.
    - Public Members
      - `Constructor` ((std::string) -> Tokenizer): The only constructor for
        the Tokenizer. Takes in a file path to a CORE source file and opens a
        file stream with the initial token set to an undefined token.
      - `Destructor` (() -> virtual): Destructor just closes the file stream.
      - lineNumber (() -> unsigned): Returns the current line number of the
        scanner.
      - columnNumber (() -> unsigned): Returns the current column number of
        the scanner.
      - currentToken (() -> Token): Getter for the current token.
      - NextToken (() -> void): Retrieves the next token from the file stream.
        That token can then subsequently be read by calling
        `Tokenizer::currentToken`. Handles error outputting (specifically line
        numbers).
      - isEOF (() -> bool): Whether the tokenizer (and file stream) has reached
        the end of the file, i.e. the end of token output.
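    The identifier rules listed for `NextIdentifier` (format [A-Z]+[0-9]* and
    a maximum length) can be checked with a small stand-alone function like
    the sketch below. `isValidIdentifier` and `kIdentifierMaxLength` are
    hypothetical names for this example, and the limit of 8 follows the
    length mentioned in the Overall Design section. Unlike `NextIdentifier`,
    it returns a bool instead of throwing.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Assumed to mirror IdentifierMaxLength (the documentation mentions 8).
const std::size_t kIdentifierMaxLength = 8;

// Returns true if `text` matches [A-Z]+[0-9]* and fits the length limit.
bool isValidIdentifier(const std::string &text) {
  if (text.empty() || text.size() > kIdentifierMaxLength) return false;

  std::size_t i = 0;
  // One or more uppercase letters.
  while (i < text.size() && std::isupper(static_cast<unsigned char>(text[i])))
    ++i;
  if (i == 0) return false;

  // Zero or more digits, which must run to the end of the token.
  while (i < text.size() && std::isdigit(static_cast<unsigned char>(text[i])))
    ++i;
  return i == text.size();
}
```

    With this sketch, ABC123 is accepted, while ABc123, ABC123X, and the
    nine-character ABCDEFGHI are rejected.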

Testing / Bugs:
  - Code was tested by batch running it against various valid and invalid CORE
    files and comparing the output to given calculations of the resultant
    token number stream, ensuring that erroneous code failed to produce a
    token stream and halted with a reasonable error description.
  - As of this latest version, no known outstanding bugs are present.
  - For the Parser project, integration tests were introduced.