Self hosting tokenizer by nickd4 · Pull Request #9 · uho/preForth

nickd4 · 2022-04-24T10:40:37Z

More hacking... what I set out to do was to make the seedForth tokenizer self-hosting, so that after bootstrap you would not need gForth to develop applications. So my idea was to make the tokenizer work in gForth like now (for bootstrapping) and also work in the seedForth interactive version (for application development). It turned out to be quite difficult, but ultimately it works.

So the actual changes to seedForth-tokenizer.fs to make it run under seedForth were not that huge, mainly a matter of accounting for seedForth's case sensitivity and restricted syntax for hex and character literals and various things like that, as well as minor differences in the words available (parse-name instead of <name> etc). But the larger difficulty was in making a seedForth or seedForthInteractive program run cleanly as a filter. I had to modify the runtime library and I/O system a lot.

There was also another issue to deal with which concerns the wrapping of the *.seed and *.seedsource files. Originally the input was wrapped in PROGRAM / END and the output was wrapped with an automatic bye token added at the end. I have removed the need for all of this wrapping, at the cost of its being slightly more awkward to invoke the gForth version of the tokenizer. Since this is only done from the Makefile during bootstrap, that's not a big deal. It's only just occurred to me now that the unusual extension *.seedsource was probably due to the wrapping, so maybe we can rename them to *.forth now?

Here is a detailed summary of all the changes I have made to support the self-hosting tokenizer:

Create the ./seedForth-tokenizer script, which operates as a filter and takes a *.seedsource file on stdin and outputs the corresponding *.seed file on stdout. It works similarly to ./seed by concatenating the various input files into ./seedForth.
Build the functionality of cat into every compiled preForth/seedForth application, so it will either process stdin if there are no command line arguments, or else open and read each file specified on the command line in sequence, where - is stdin. The effect of this change is that when running ./seed you no longer need to press Enter after typing bye to make seedForth quit. The extra keystroke was needed to force the front-end cat invocation to try to send something to seedForth and then it would realize the pipe was broken and quit. With seedForth managing its own input, you can quit cleanly.
Make the key? word use a poll() rather than ioctl() system call, since we now expect that stdin might come from a file.
Implement an eemit word throughout the system which is the same as emit but writes to stderr. I use this for debugging.
Remap the tokens key and emit to higher token numbers, to make it easier to detect the EOT character, which used to correspond to the key token. Implement a new eot token at 4 which is similar to the bye token. The reason for this change is that the bye token was overloaded for use as [, i.e. it would restart the interpreter after compiling the ; token and during certain control flow constructs. This meant you couldn't compile a bye token into a program. Moving the original usage of bye onto the new eot token means bye is no longer special and can be compiled normally, while the changes to the existing system stay minimal. As a bonus, if input runs out during a :-definition, the resulting EOT will be interpreted as [ and send us back to interpretive state, where a further EOT is considered invalid and quits the interpreter too.
Split out the essential definitions from seedForthInteractive.seedsource into a new seedForthRuntime.seedsource and from hi.forth into a new runtime.forth. The original files seedForthInteractive.seedsource and hi.forth still exist and contain all the tests as well as less essential words like sqr, which you can grab if you actually need them. To use the system as it was previously, you have to tokenize seedForthRuntime.seedsource + seedForthInteractive.seedsource into seedForthInteractive.seed and then pass it runtime.forth + hi.forth; the Makefile and ./seed script have been updated appropriately. But when running the self-hosted tokenizer, it uses a different *.seed file which is basically generated by tokenizing seedForthRuntime.seedsource + a call to boot, and it uses the runtime.forth without the hi.forth part.
Increase tib from 80 to 255 characters, and also fix a bug in accept which allowed it to write one character beyond the tib. Note that some source lines in the original system were longer than 80 characters, and I think they may have been silently truncated, with the incomplete code going unnoticed. The extra character seemed to cause a crash on my Z80 port, which alerted me to the issue. There isn't really a good way to flag too-long lines to the user, but I have at least made it not echo any extra characters.
Hack on accept, refill and restart words to detect EOT and return something or quit. This is needed to prevent the self-hosted tokenizer from hanging after it tokenizes all the input. I'm not entirely happy with the solutions I came up with here, and I think possibly the entire concept of using EOT as a marker for the end of input might be flawed. Could we make key throw an exception instead? I'm not a very experienced Forth programmer so I don't really know how this would be done conventionally. But at any rate, you can now exit from ./seed by typing Ctrl-D (no Enter) or bye, and I find the first more comfortable. Just be aware that a partial last line is not supported, either in a *.seedsource file or a ./seed session; it will say "not found".
Make the initial state of echo and input-echo be off. That's because after loading a seedForth runtime from *.seed file, you will always want to load further runtime as textual Forth source. So it's cleaner to let the second runtime enable echo. This makes the output of ./seed cleaner as well. But primarily it's needed to avoid junk getting into the tokenized *.seed files.
Implement DO/?DO/LOOP, as the experimental ?DO that was commented out didn't have a correct companion LOOP.
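
To make the input handling in the list above a bit more concrete, here is a rough Python sketch of the built-in cat behaviour (no arguments means stdin, otherwise each named file in sequence, with - meaning stdin) and of a poll()-based check in the spirit of key?. This is only an illustration under assumptions: the real code lives in the preForth/seedForth runtime, and the names open_inputs, read_all and key_available are made up for this example.

```python
import select
import sys


def open_inputs(args):
    """Yield one readable binary stream per argument, like cat.

    No arguments means stdin; "-" means stdin at that position.
    """
    if not args:
        yield sys.stdin.buffer
        return
    for name in args:
        if name == "-":
            yield sys.stdin.buffer
        else:
            # The file is closed again once the caller asks for the next one.
            with open(name, "rb") as stream:
                yield stream


def read_all(args):
    """Concatenate every input source in order and return the bytes."""
    return b"".join(stream.read() for stream in open_inputs(args))


def key_available(fd, timeout_ms=0):
    """Return True if poll() reports an event on fd within timeout_ms.

    Unlike a terminal-only ioctl(), poll() also behaves sensibly when
    the descriptor refers to a pipe or a regular file.
    """
    poller = select.poll()
    poller.register(fd, select.POLLIN)
    return bool(poller.poll(timeout_ms))
```

A filter built this way could call read_all(sys.argv[1:]) and never depend on an external cat process to notice a broken pipe. Note that select.poll is only available on Unix-like systems.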

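The eot/bye change described in the list above can be pictured with a toy model. The Python sketch below is not the seedForth interpreter; it only mimics the two states mentioned there (interpreting and compiling a :-definition). Only the value 4 for eot comes from the description; the other token numbers, the run function and the log strings are hypothetical, and running out of input is modelled as an endless supply of EOT.

```python
import itertools

EOT = 4  # per the description above, the new eot token is number 4

# Hypothetical token numbers, invented for this sketch only.
BYE, COLON, SEMI, DUP = 100, 101, 102, 103


def run(tokens):
    """Run a token stream and return (log, compiled_bodies)."""
    # Model "input ran out" as an endless supply of EOT.
    stream = itertools.chain(tokens, itertools.repeat(EOT))
    compiling = False
    body = []
    bodies = []
    log = []
    for token in stream:
        if compiling:
            if token == EOT:
                # While compiling, eot plays the role of [.
                bodies.append(body)
                compiling = False
                log.append("back to interpreting")
            else:
                # bye is not special here, so it is compiled normally.
                body.append(token)
        elif token == COLON:
            compiling = True
            body = []
        elif token in (EOT, BYE):
            # While interpreting, eot (like bye) ends the session.
            log.append("quit")
            break
        else:
            log.append("execute %d" % token)
    return log, bodies
```

In this model, run([COLON, DUP]) stands for input that ends in the middle of a definition: the first EOT returns to interpreting and the second one quits, which is the "bonus" behaviour mentioned above.
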
Some of the more detailed changes might not be well explained in the above summary, or might be objectionable for whatever reason, so please feel free to check with me. Also, keep in mind that this changeset is "on top of" the previous changeset that I PR'ed the other day, so github will show both changesets. It's annoying the way github does this, and it does not recalculate the changeset after you merge the first PR. But you can force it to, by changing the base branch name and then changing it back.

I had a really good time doing this, even though it involved a lot of head-scratching and dealing with strange crashes and errors and unexpected behaviour. As I mentioned I'm not an experienced Forth programmer, but I've become more conversant with it.

Note: There is a minor bug in this PR, in that I directly invoked gforth in the Makefile instead of $(HOSTFORTH). It is fixed in #12 so I have not fixed it here. If you do want the fixed version of this PR, see the branch self_hosting_tokenizer1 in my github account. I wouldn't recommend using that branch though, because it will cause conflicts later when merging #12 and others.

nick-lifx added 20 commits April 22, 2022 13:40

Add .gitignore, rationalize make clean, remove a gforth warning in tokenizer

df227c7

Remove .$(UNIXFLAVOUR) extension and file copies in favour of ifeq (GNU make)

56c4dcf

Remove redundant stuff in the preForth-generated asm code

a2e4911

Use tabs in preForth-i386-rts.pre and seedForth-i386.pre, improve consistency

cb3ba98

Rationalize how cr is emitted in generated preForth code, make DB/DD lowercase

fa4f0c2

Rationalize the sections in assembly output, make code be compiled into bss rather than text section which avoids the need to call mprotect(), rename things

927ca5a

Split seedForth-i386.pre into machine dependent/less machine dependent portions

77b48fe

Split seedforth-i386.pre further into header and body portions

f843d08

Remove duplicated code in seedForth-i386.pre, take from preForth-i386-rts.pre

cdd8686

Implement built-in "cat" functionality in preForth/seedForth for reading input

c962f42

Make key? use fdin instead of always STDIN_FILENO, and use poll() not ioctl()

ebf5bfa

Rationalize seedForth-tokenize.fs so that seedsource no longer has to be wrapped with PROGRAM / END, also removes automatic bye token that was generated by END

d48d34a

Move most code from seedForthInteractive.seedsource into seedForthRuntime.seedsource, so that we can run textual forth code without the tests or the banner

375ab0f

Rationalize how echo is handled during the seedForthInteractive loading phases

8da6c87

Split out control flow words and a few others from hi.forth into runtime.forth

62bcae1

Make ./seedForth-tokenizer self hosting (works, but hangs after tokenizing)

78426fc

Remap tokens so that EOT is no longer a seedForth token, detect EOT everywhere

cbfdac8

Implement eot token similar to the old bye token (so we can compile bye token)

caac682

Implement a new preForth/seedForth token eemit which is like emit but writes to stderr, fix self-hosted tokenizer termination issue (was debugged with eemit)

75138c3

Implement DO/?DO/LOOP, as experimental ?DO didn't have a correct companion LOOP

7115f49

nickd4 force-pushed the self_hosting_tokenizer branch from e3b1a9b to 7115f49 on May 1, 2022 00:23

nickd4 wants to merge 20 commits into uho:master from nickd4:self_hosting_tokenizer