Literals

This document describes the bytecode for generating heap-allocated literals in the SML/NJ compiler. It replaces the previous bytecode and was introduced to support planned improvements such as 64-bit support, Real32, and better Int64 and IntInf integration.

Compilation

The compiler extracts the literal values from the CPS IR and generates a program for building a record of literal values. Direct references to literal values in the CPS code are replaced by references to components of the literal record. The binary data representing the literal-construction program is packaged as part of the Binfile generated by the compiler.

Endianness

Multi-byte quantities are represented in big-endian form (most-significant byte first).
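As an illustration of the byte order, the following sketch reads big-endian quantities from a byte buffer. The helper names get_u16, get_u32, and get_u64 are hypothetical and are not taken from the SML/NJ runtime.

```c
#include <stdint.h>

/* Read a 16-bit big-endian quantity: most-significant byte first. */
static uint16_t get_u16(const uint8_t *p)
{
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

/* Read a 32-bit big-endian quantity. */
static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Read a 64-bit big-endian quantity as two 32-bit halves. */
static uint64_t get_u64(const uint8_t *p)
{
    return ((uint64_t)get_u32(p) << 32) | get_u32(p + 4);
}
```

Assembling the value byte by byte keeps the code independent of the host's own byte order and alignment rules.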

Header

The first four 32-bit words of the literal representation correspond to the following C struct:

struct literal_header {
    uint32_t    magic;
    uint32_t    maxstk;
    uint32_t    wordsz;
    uint32_t    maxsaved;
};

where

magic contains the version ID (which should be 0x20171031),
maxstk is the maximum stack depth required,
wordsz is the size of an ML value in bits (32 or 64), and
maxsaved is the number of saved literals (used for sharing).

Note that Version 1 files have the version ID 0x19981022 and contain only the first two header fields; they lack the wordsz and maxsaved fields.
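A minimal sketch of reading the header, assuming the literal data is available as a byte buffer. The function parse_literal_header, the LIT_MAGIC_* macro names, and the choice to zero the missing fields for Version 1 files are assumptions made for illustration; they are not taken from the SML/NJ runtime.

```c
#include <stddef.h>
#include <stdint.h>

#define LIT_MAGIC_V2 0x20171031u  /* version ID of the format described here */
#define LIT_MAGIC_V1 0x19981022u  /* Version 1: only magic and maxstk present */

struct literal_header {
    uint32_t magic;
    uint32_t maxstk;
    uint32_t wordsz;
    uint32_t maxsaved;
};

/* Read a 32-bit big-endian quantity. */
static uint32_t get_u32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns the number of header bytes consumed, or 0 if the header is
 * truncated, has an unknown version ID, or has an invalid word size. */
static size_t parse_literal_header(const uint8_t *buf, size_t len,
                                   struct literal_header *hdr)
{
    if (len < 8)
        return 0;
    hdr->magic  = get_u32(buf);
    hdr->maxstk = get_u32(buf + 4);
    if (hdr->magic == LIT_MAGIC_V1) {
        /* Assumption: report the absent fields as 0 ("not recorded"). */
        hdr->wordsz = 0;
        hdr->maxsaved = 0;
        return 8;
    }
    if (hdr->magic != LIT_MAGIC_V2 || len < 16)
        return 0;
    hdr->wordsz   = get_u32(buf + 8);
    hdr->maxsaved = get_u32(buf + 12);
    if (hdr->wordsz != 32 && hdr->wordsz != 64)
        return 0;
    return 16;
}
```

The return value tells the caller where the first instruction byte starts, which differs between the two versions.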

Opcodes

The following is a list of the symbolic opcodes used in the interpreter. We describe the instruction encoding below.

INT(n) literal value in the default (tagged) integer or word type (Int.int or Word.word). The value n should be in the range -2^(w-1) to 2^w - 1 when encoded as a w-bit 2's complement integer. The width w will be 31 or 63 depending on the host architecture.
INT32(n) 32-bit literal value for either the type Int32.int or Word32.word.
INT64(n) 64-bit literal value for either the type Int64.int or Word64.word.
BIGINT(n) arbitrary precision integer literal (currently not used).
IVEC8(n, b1, ..., bn) packed vector of 8-bit integers for either the type Int8Vector.vector or Word8Vector.vector.
IVEC16(n, h1, ..., hn) packed vector of 16-bit integers for either the type Int16Vector.vector or Word16Vector.vector (currently not used).
IVEC32(n, w1, ..., wn) packed vector of 32-bit integers for either the type Int32Vector.vector or Word32Vector.vector (currently not used).
IVEC64(n, d1, ..., dn) packed vector of 64-bit integers for either the type Int64Vector.vector or Word64Vector.vector (currently not used).
REAL32(f) 32-bit floating-point literal for the type Real32.real (currently not used).
REAL64(f) 64-bit floating-point literal for the type Real64.real.
RVEC32(n, f1, ..., fn) packed vector of 32-bit floating-point literals for the type Real32Vector.vector (currently not used).
RVEC64(n, F1, ..., Fn) packed vector of 64-bit floating-point literals for the type Real64Vector.vector.
STR8(s) string literal (8-bit characters)
RECORD(n) construct record from the topmost n literal values
VECTOR(n) construct a vector from the topmost n literal values
RAW8(n) raw sequence of bytes. This literal does not have an SML type.
RAW16(n) raw sequence of 16-bit values. This literal does not have an SML type.
RAW32(n) raw sequence of 32-bit values. This literal does not have an SML type.
RAW64(n) raw sequence of 64-bit values. This literal does not have an SML type.
CONCAT(n) pop n records/vectors from the stack and concatenate them into a single record/vector. This operation allows the implementation to avoid excessively large stacks when building very large record/vector literals.
SAVE(i) save the top of the stack in the i-th save slot, which allows it to be shared by some subsequent aggregate literal.
LOAD(i) push the i-th saved literal onto the stack.
RETURN signals the end of the program; the stack depth should be one and that value is popped and returned as the result.
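The stack discipline implied by the opcodes above can be sketched as a depth calculation, which is also one way to arrive at a value for the maxstk header field. This is only a sketch under stated assumptions: the lit_op enumeration and max_stack_depth are hypothetical names, OP_PUSH stands for any opcode that pushes a single literal (INT, STR8, REAL64, and so on), and SAVE is assumed to leave the saved value on the stack, which the text above does not state explicitly.

```c
#include <stddef.h>

/* Symbolic instructions; OP_PUSH stands for any literal-pushing opcode. */
enum lit_op { OP_PUSH, OP_RECORD, OP_VECTOR, OP_CONCAT, OP_SAVE, OP_LOAD, OP_RETURN };

struct lit_instr {
    enum lit_op op;
    unsigned    arg;   /* element count or save-slot index; unused otherwise */
};

/* Returns the maximum stack depth reached by the program, or -1 if it
 * underflows the stack, lacks a final RETURN, or does not have exactly
 * one value on the stack when RETURN is reached. */
static long max_stack_depth(const struct lit_instr *prog, size_t n)
{
    long depth = 0, max = 0;
    for (size_t i = 0; i < n; i++) {
        long k = (long)prog[i].arg;
        switch (prog[i].op) {
        case OP_PUSH:
        case OP_LOAD:
            depth += 1;
            break;
        case OP_RECORD:
        case OP_VECTOR:
        case OP_CONCAT:
            /* Pop k values, push the single aggregate built from them. */
            if (depth < k)
                return -1;
            depth = depth - k + 1;
            break;
        case OP_SAVE:
            /* Assumption: the saved value stays on the stack. */
            if (depth < 1)
                return -1;
            break;
        case OP_RETURN:
            return (depth == 1 && i + 1 == n) ? max : -1;
        }
        if (depth > max)
            max = depth;
    }
    return -1;  /* ran off the end without a RETURN */
}
```

For example, a program that pushes two literals, builds RECORD(2), saves it, pushes another literal, reloads the saved record, and finishes with RECORD(3) and RETURN needs a stack of depth three under these assumptions.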

Future extensions

There are a number of additional features that we might want to support, which we list here.

support for 32-bit string literals for the type WideString.string
support for array literals (like vectors, but mutable)

Instruction encoding

Notation

In the encoding below, we use the following conventions:

b represents a signed 8-bit integer.
ub represents an unsigned 8-bit integer.
c represents an 8-bit character.
h represents a signed 16-bit integer.
w represents a signed 32-bit integer.
lw represents a signed 64-bit integer.
n represents a 32-bit integer length (usually unsigned).
d represents a bignum digit whose size will be the default word size.
f represents a 32-bit floating-point literal.
F represents a 64-bit floating-point literal.
i represents a tagged default int or word literal (e.g., Int.int or Word.word).
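The signed operand kinds b, h, w, and lw differ only in width, so a decoder can share one routine. The sketch below reads a big-endian operand of 1, 2, 4, or 8 bytes and sign-extends it; the names get_be, sign_extend, and get_signed are hypothetical and not part of the SML/NJ runtime.

```c
#include <stdint.h>

/* Read an nbytes-wide (1..8) big-endian unsigned quantity. */
static uint64_t get_be(const uint8_t *p, unsigned nbytes)
{
    uint64_t v = 0;
    for (unsigned i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Interpret the low `bits` bits (1..64) of v as a 2's complement integer,
 * without relying on implementation-defined unsigned-to-signed casts. */
static int64_t sign_extend(uint64_t v, unsigned bits)
{
    if (bits < 64) {
        uint64_t sign = (uint64_t)1 << (bits - 1);
        v &= (sign << 1) - 1;                       /* drop any higher bits */
        return (int64_t)(v & (sign - 1)) - (int64_t)(v & sign);
    }
    /* Full width: negative values are -(~v) - 1. */
    return (v >> 63) ? -(int64_t)(~v) - 1 : (int64_t)v;
}

/* Decode a b (1), h (2), w (4), or lw (8 byte) signed operand. */
static int64_t get_signed(const uint8_t *p, unsigned nbytes)
{
    return sign_extend(get_be(p, nbytes), 8 * nbytes);
}
```

An unsigned operand such as ub needs no sign extension and can use get_be(p, 1) directly.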

Encoding

00000000 (0x00)
INT(0) default tagged literal value 0.
00000001 (0x01)
INT(1) default tagged literal value 1.
00000010 (0x02)
INT(2) default tagged literal value 2.
00000011 (0x03)
INT(3) default tagged literal value 3.
00000100 (0x04)
INT(4) default tagged literal value 4.
00000101 (0x05)
INT(5) default tagged literal value 5.
00000110 (0x06)
INT(6) default tagged literal value 6.
00000111 (0x07)
INT(7) default tagged literal value 7.
00001000 (0x08)
INT(8) default tagged literal value 8.
00001001 (0x09)
INT(9) default tagged literal value 9.
00001010 (0x0A)
INT(10) default tagged literal value 10.
00001011 (0x0B)
INT(-1) default tagged literal value -1.
00001100 (0x0C)
INT(-2) default tagged literal value -2.
00001101 (0x0D)
INT(-3) default tagged literal value -3.
00001110 (0x0E)
INT(-4) default tagged literal value -4.
00001111 (0x0F)
INT(-5) default tagged literal value -5.
00010000 (0x10 b)
INT(b) --- for tagged integer literals in the range -128..127.
00010001 (0x11 h)
INT(h) --- for tagged integer literals in the range -32768..32767.
00010010 (0x12 w)
INT(w) --- for tagged integer literals in the range -2147483648..2147483647.
00010011 (0x13 lw)
INT(lw) --- for all other tagged integer literals (64-bit target only).
00010100 (0x14 b)
INT32(b) --- for 32-bit integer literals in the range -128..127.
00010101 (0x15 h)
INT32(h) --- for 32-bit integer literals in the range -32768..32767.
00010110 (0x16 w)
INT32(w) --- for all other 32-bit integer literals.
00010111 (0x17 b)
INT64(b) --- for 64-bit integer literals in the range -128..127.
00011000 (0x18 h)
INT64(h) --- for 64-bit integer literals in the range -32768..32767.
00011001 (0x19 w)
INT64(w) --- for 64-bit integer literals in the range -2147483648..2147483647.
00011010 (0x1A lw)
INT64(lw) --- for all other 64-bit integer literals.
00011011 (0x1B n d1 ... d|n|)
BIGINT(i) --- where i = sign(n) * (d|n| * b^(|n|-1) + ... + d2 * b + d1). That is, the absolute value of n is the number of digits, and if n is negative, then i is negative. The digits follow n in least-significant to most-significant order. If n is zero, then i is zero. The base b and the size of the digits depend on the target word size.
00011100 (0x1C ub i1 ... iub)
IVEC(ub, i1, ..., iub) --- short int vector (up to 255 elements).
00011101 (0x1D n i1 ... in)
IVEC(n, i1, ..., in)
00011110 (0x1E ub b1 ... bub)
IVEC8(ub, b1, ..., bub) --- short bytevectors (up to 255 elements).
00011111 (0x1F n b1 ... bn)
IVEC8(n, b1, ..., bn)
00100000 (0x20 ub h1 ... hub)
IVEC16(ub, h1, ..., hub) --- short 16-bit integer vectors (up to 255 elements).
00100001 (0x21 n h1 ... hn)
IVEC16(n, h1, ..., hn)
00100010 (0x22 ub w1 ... wub)
IVEC32(ub, w1, ..., wub) --- short 32-bit integer vectors (up to 255 elements).
00100011 (0x23 n w1 ... wn)
IVEC32(n, w1, ..., wn)
00100100 (0x24 ub lw1 ... lwub)
IVEC64(ub, lw1, ..., lwub) --- short 64-bit integer vectors (up to 255 elements).
00100101 (0x25 n lw1 ... lwn)
IVEC64(n, lw1, ..., lwn)
00100110 (0x26 f)
REAL32(f)
00100111 (0x27 F)
REAL64(F)
00101000 (0x28 ub f1 ... fub)
RVEC32(ub, f1, ..., fub) --- short 32-bit real vectors (up to 255 elements).
00101001 (0x29 n f1 ... fn)
RVEC32(n, f1, ..., fn)
00101010 (0x2A ub F1 ... Fub)
RVEC64(ub, F1, ..., Fub) --- short 64-bit real vectors (up to 255 elements).
00101011 (0x2B n F1 ... Fn)
RVEC64(n, F1, ..., Fn)
00101100 (0x2C ub c1 ... cub)
STR8(s) --- where size(s) = ub and c1, ..., cub are the characters of s.
00101101 (0x2D n c1 ... cn)
STR8(s) --- where size(s) = n and c1, ..., cn are the characters of s.
00101110 (0x2E)
reserved for STR32
00101111 (0x2F)
reserved for STR32
00110000 (0x30)
RECORD(1)
00110001 (0x31)
RECORD(2)
00110010 (0x32)
RECORD(3)
00110011 (0x33)
RECORD(4)
00110100 (0x34)
RECORD(5)
00110101 (0x35)
RECORD(6)
00110110 (0x36)
RECORD(7)
00110111 (0x37 ub)
RECORD(ub)
00111000 (0x38 h)
RECORD(h)
00111001 (0x39 ub)
VECTOR(ub)
00111010 (0x3A h)
VECTOR(h)
00111011 (0x3B ub)
RAW8(ub)
00111100 (0x3C h)
RAW8(h)
00111101 (0x3D ub)
RAW16(ub)
00111110 (0x3E h)
RAW16(h)
00111111 (0x3F ub)
RAW32(ub)
01000000 (0x40 h)
RAW32(h)
01000001 (0x41 ub)
RAW64(ub)
01000010 (0x42 h)
RAW64(h)
01000011 (0x43 h)
CONCAT(h)
01000100 (0x44 ub)
SAVE(ub)
01000101 (0x45 h)
SAVE(h)
01000110 (0x46 ub)
LOAD(ub)
01000111 (0x47 h)
LOAD(h)
01001000 -- 11111110 (0x48 -- 0xFE)
unused
11111111 (0xFF)
RETURN
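To show how the INT encodings fit together, here is a minimal sketch of an encoder that picks the shortest form for a tagged integer literal. The names put_be and encode_int are hypothetical, and the sketch does not check that the value fits the target's 31-bit or 63-bit tagged range; a real generator would need that check.

```c
#include <stddef.h>
#include <stdint.h>

/* Write the low nbytes bytes of v in big-endian order; returns nbytes. */
static size_t put_be(uint8_t *out, uint64_t v, unsigned nbytes)
{
    for (unsigned i = 0; i < nbytes; i++)
        out[i] = (uint8_t)(v >> (8 * (nbytes - 1 - i)));
    return nbytes;
}

/* Encode INT(n) in its shortest form. `out` must have room for 9 bytes.
 * Returns the number of bytes written. */
static size_t encode_int(uint8_t *out, int64_t n)
{
    uint64_t bits = (uint64_t)n;   /* 2's complement bit pattern of n */

    if (0 <= n && n <= 10) {       /* 0x00..0x0A: INT(0)..INT(10) */
        out[0] = (uint8_t)n;
        return 1;
    }
    if (-5 <= n && n <= -1) {      /* 0x0B..0x0F: INT(-1)..INT(-5) */
        out[0] = (uint8_t)(0x0A - n);
        return 1;
    }
    if (INT8_MIN <= n && n <= INT8_MAX) {
        out[0] = 0x10;             /* INT(b) */
        return 1 + put_be(out + 1, bits, 1);
    }
    if (INT16_MIN <= n && n <= INT16_MAX) {
        out[0] = 0x11;             /* INT(h) */
        return 1 + put_be(out + 1, bits, 2);
    }
    if (INT32_MIN <= n && n <= INT32_MAX) {
        out[0] = 0x12;             /* INT(w) */
        return 1 + put_be(out + 1, bits, 4);
    }
    out[0] = 0x13;                 /* INT(lw), 64-bit target only */
    return 1 + put_be(out + 1, bits, 8);
}
```

Under this scheme INT(300) occupies three bytes (0x11 0x01 0x2C), while INT(-3) needs only the single byte 0x0D.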
