How Can I Represent An Assembly Program With Labels?

Question 1

I am trying to represent assembly programs in Coq and compile them into a sequence of bytes.

I have an assembly language, a decoder, a compiler, and proof that the decoder is the inverse of the compiler for instructions satisfying a well-typedness property.

Now I'm working on creating representations of assembly code that are easier for humans to read and write which will be compiled to bytes. I'd like to support labels so I don't have to manually do instruction-length and offset arithmetic at every branch and jump. That's the difference between this representation and asm below.

The data is supposed to flow like this:

Assembly Representation (What I'm asking about)
 |
 V
Assembler (Incomplete but easy)
 |
 V
Assembly (Complete, see the `asm` type below)
 |
 V
 Compiler (Complete)
 |
 V
 Bytes
 |
 V
 Decoder (Complete)
 |
 V
Assembly (Complete, see the `asm` type below)

This is what I have so far:

Require Import NArith.
Inductive asm : Type :=
 | ADD (rd rs1 rs2:N)
 | BEQ (rs1 rs2:N) (si:Z)
 (* other assembly instructions *)
.
(* Tentative function for finding location of a label *)
Require Import String.
Require Import List.
Import ListNotations.
Open Scope string.
Open Scope N.
Definition insn_length (a:asm) : N := 4.
Fixpoint label_loc (label:string) (base_addr:N) (code:list (sum string asm)) : option N :=
 match code with
 | nil => None
 | inl label' :: t => if (label =? label')%string then Some base_addr 
 else label_loc label base_addr t
 | inr insn :: t => label_loc label (base_addr + (insn_length insn)) t
 end.
(* Ideal program representation is a list of labels and instructions
 and allows instructions to use label locations to compute immediates *)
Definition ideal_example (base_address : N) : list (sum string asm) :=
 [ inr (ADD 0 0 0) ;
 inl "label" ;
 inr (BEQ 0 0 (label_loc "label" base_address (ideal_example base_address)))
 ].

I tried mutually-recursive fixpoints thinking that maybe I could refer to "constant" defined after the program that calculate the label locations. I do not like this solution because of the additional layer of indirection and it's not really a fixpoint operation. As Coq points out "A fixpoint needs at least one parameter." Here's my attempt that throws that error :

Fixpoint mutual_recursion_example : list (sum string asm) :=
 [ inr (ADD 0 0 0) ;
 inl "label" ;
 inr (BEQ 0 0 klabel)
 ]
with klabel := label_loc "label" 0 mutual_recursion_example.

I'm considering creating another language for preprocessing, say pre_asm, but then I need to somehow prove that the step from pre_asm to asm is semantic preserving, so I hope to get some feedback first. I'm hoping I missed some useful pattern or insight that is well known in the community.

(* Possible pre-assembly language. *)
Inductive pre_asm :=
 | pADD (rd rs1 rs2:N)
 | pBEQ (rs1 rs2:N) (si_or_label:sum Z string)
.
Fixpoint preprocess (base_address:N) (code:list (sum string pre_asm)) : list asm
 := (* fill in here *).

Question 2

With Yannick Forster, we used a generic low-level certified compîler for undecidability results in Coq. The preprint can be found here: members.loria.fr/DLarchey/files/papers/UNDEC_CPP19.pdf

Question 3

This work was later included as part of the foundations of the "Coq Library of Undecidability Proofs": github.com/uds-psl/coq-library-undecidability

Question 4

Thank you Julio, I added clarification. Dominique, I'm not sure what to make of the paper or how it is immediately relevant to my question. I'm assuming I should focus on section 6?

Question 5

Sections 4, 5 and 6 and the associated Coq code.

Question 6

I have converted my comments into an answer (SO is for definite Q&A, not discussion). It is a partial answer in the sense that I do not offer concrete definitions: I might complete it depending on your feedback, although I'd rather suggest (as long as everything is clear) that you just accept it and ask more specific questions later if you need.

Question 7

[Here I just offer a possible design: I might complete it with (more) concrete definitions depending on the OP's feedback.]

Let HAsm be the human-readable representation of instructions, i.e. optionally-labelled instructions (do not make "label" itself an instruction), where jump targets are labels (must be existing labels within a HProg, i.e. a human-readable program). Let Asm be the internal representation of instructions, unlabelled and where jump targets are address offsets (must be within-bounds within a Prog, i.e. a program in the internal representation).

We/I'd want a bidirectional transform HProg <--> Prog between programs in the two representations, but we cannot in fact get an invertible such transform, as Prog does not (and shall not) keep track of labels in a corresponding HProg: from a Prog we can only synthesize labels based on some scheme (from a jump's target offset, possibly combined with the jump instruction's offset relative to the program's base address).

Instead, let [HProg] denote classes of HProg under alpha-equivalence by labels (i.e. treating labels like variable names). Then we can define an invertible transform [HProg] <--itransform--> Prog, which, for correctness, we must prove establishes a 1-to-1 correspondence between [HProg] and Prog.

Further, let IProg be the canonical representatives of classes in [HProg], and define a function hcanon : HProg -> IProg that, for given HProg, returns the canonical representative of the [HProg] class it belongs to.

Finally, we can extend our transform chain to HProg --hcanon--> IProg <--itransform--> Prog, where, to close the circle then for correctness, hcanon must rewrite labels in the given HProg in such a way that the returned IProg is labelled exactly as the fixed labelling scheme to transform from Prog to IProg (was [HProg]) would do.

user22862809 · Accepted Answer · 2024-12-30 14:56:57Z

[Here I just offer a possible design: I might complete it with (more) concrete definitions depending on the OP's feedback.]

Let HAsm be the human-readable representation of instructions, i.e. optionally-labelled instructions (do not make "label" itself an instruction), where jump targets are labels (must be existing labels within a HProg, i.e. a human-readable program). Let Asm be the internal representation of instructions, unlabelled and where jump targets are address offsets (must be within-bounds within a Prog, i.e. a program in the internal representation).

We/I'd want a bidirectional transform HProg <--> Prog between programs in the two representations, but we cannot in fact get an invertible such transform, as Prog does not (and shall not) keep track of labels in a corresponding HProg: from a Prog we can only synthesize labels based on some scheme (from a jump's target offset, possibly combined with the jump instruction's offset relative to the program's base address).

Instead, let [HProg] denote classes of HProg under alpha-equivalence by labels (i.e. treating labels like variable names). Then we can define an invertible transform [HProg] <--itransform--> Prog, which, for correctness, we must prove establishes a 1-to-1 correspondence between [HProg] and Prog.

Further, let IProg be the canonical representatives of classes in [HProg], and define a function hcanon : HProg -> IProg that, for given HProg, returns the canonical representative of the [HProg] class it belongs to.

Finally, we can extend our transform chain to HProg --hcanon--> IProg <--itransform--> Prog, where, to close the circle then for correctness, hcanon must rewrite labels in the given HProg in such a way that the returned IProg is labelled exactly as the fixed labelling scheme to transform from Prog to IProg (was [HProg]) would do.

CollectivesTM on Stack Overflow

How Can I Represent An Assembly Program With Labels?

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Hot Network Questions

CollectivesTM on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Post as a guest

Related