boot2

Playing with the boostrap
git clone https://git.ryansepassi.com/git/boot2.git
Log | Files | Refs | README

hex2++

A small, byte-oriented assembler/linker that takes hex source with labels and references and emits a flat binary. Implemented in P1; used by cc.scm and the P1 backends as the final stage of the M1pp → hex2++ toolchain. M1pp output feeds hex2++ directly — there is no intermediate macro/hex stage.

Invocation

hex2++ [-B ADDR]        # base address
       [-E | -e]        # big-endian | little-endian (default: little)
       [-b]             # binary digit mode (default: hex)
       [-N]             # non-executable output
       IN OUT

IN and OUT are positional: a single input file and a single output file. To assemble several sources together, concatenate them upstream (e.g. with catm) and pass the combined file as IN. Output is one flat binary written from Base_Address upward. Unless -N is set and the output is a regular file, the output is chmod 0750'd.

There is no per-target configuration. Any target-specific encoding (RISC-V bitfield-scattered immediates, native branch displacements, etc.) is the responsibility of the upstream M1pp layer, which packs full instruction words at expansion time. hex2++ sees only contiguous-byte values.

Lexical structure

Bytes within a token may be separated by whitespace freely; only digit count matters.

Active characters:

0-9 a-f A-F  hex digits (HEX mode)
0-1          binary digits (BINARY mode)
:            label definition
. (+kw)      directive (.align, .fill, .scope, .endscope, .ptrsize)
! @ $ ~ % &  label reference
- >          label arithmetic in references (synonyms)
# ;          line comment
ws           token separator

Labels

:NAME       define label NAME at the current emit position (ip)

Label names are tokens terminated by whitespace or -. Labels may be referenced before they are defined; forward references resolve in pass 2.

The label namespace is global except that names beginning with . inside a .scope are local to that scope. The leading character of the token disambiguates labels from directives: :.NAME is a label definition, &.NAME / %.NAME / etc. are label references, and a bare .NAME (no leading : or sigil, at statement position) is a directive. Directive names are therefore reserved only at statement position, and remain available as label tokens when prefixed with : or a sigil.

.scope
  :.L1
  ...
  &.L1
.endscope

Label references

A reference is a single sigil character followed by a label expression:

Sigil Width Form Range
! 1 B rel -128..127
@ 2 B rel -32768..32767
$ 2 B abs 0..65535
~ 3 B rel -2^23..2^23-1
% ptrsize rel unchecked
& ptrsize abs unchecked

The width of % and & is set by .ptrsize — 4 bytes by default, 8 for 64-bit pointer targets.

The label expression takes one of two forms:

SIGIL LABEL                    # plain reference
SIGIL LABEL - OTHER            # emit target(LABEL) - target(OTHER)
SIGIL LABEL > OTHER            # synonym for `LABEL - OTHER`

The LABEL - OTHER form overrides the default base with another label, and applies uniformly to all sigils. > is accepted as an alias for - so hex2 inputs that use the relative-base override syntax assemble unchanged; both produce identical bytes. Both labels must be defined somewhere in the input. Range checks apply identically to plain and arithmetic forms.

Only one subtraction per reference; no addition, nesting, or parenthesization.

Examples:

# jump table entries
:jt
  &case0-jt   &case1-jt   &case2-jt

# string length prefix (string bytes themselves come from the
# upstream M1pp layer, which decodes a bare `"hello"` into the
# five hex bytes shown here)
:s_begin
  68 65 6c 6c 6f
:s_end
  &s_end-s_begin

Directives

.align N [PATTERN]

.align N            # pad to N-byte boundary with zero bytes
.align N PATTERN    # pad with the given byte/word pattern

The pad pattern is supplied by whichever upstream layer knows the target (typically a per-backend M1pp macro). hex2++ stays target-neutral.

.fill N B

.fill N B           # emit N copies of byte B

.scope / .endscope

See Labels.

.ptrsize N

.ptrsize 4          # default
.ptrsize 8          # 64-bit pointer targets

Sets the byte width of the & and % sigils. N must be 4 or 8.

.ptrsize is whole-invocation: the first occurrence seen across all inputs binds the width for the entire run, and any subsequent .ptrsize must specify the same value or it is an error. If no .ptrsize directive appears, the width defaults to 4.

Implementation outline

Two passes:

The label table carries (name, target_ip, scope_id) entries. Lookup for a dotted name walks the scope stack innermost-out and returns the first match; lookup for a non-dotted name ignores scope.

Both labels in LABEL-OTHER have known addresses by the start of pass 2, so the subtraction is a single operation at emit time. No third pass is required.