Lets Write a Compiler!
Our goal is to write a compiler which is a function:
compiler :: SourceProgram -> TargetProgram
In 131 TargetProgram
is going to be a binary executable.
Lets write our first Compilers
SourceProgram
will be a sequence of four tiny “languages”
- Numbers
- e.g.
7
,12
,42
…
- Numbers + Increment
- e.g.
add1(7)
,add1(add1(12))
, …
- Numbers + Increment + Decrement
- e.g.
add1(7)
,add1(add1(12))
,sub1(add1(42))
- Numbers + Increment + Decrement + Local Variables
- e.g.
let x = add1(7), y = add1(x) in add1(y)
Recall: What does a Compiler look like?
An input source program is converted to an executable binary in many stages:
- Parsed into a data structure called an Abstract Syntax Tree
- Checked to make sure code is well-formed (and well-typed)
- Simplified into some convenient Intermediate Representation
- Optimized into (equivalent) but faster program
- Generated into assembly
x86
- Linked against a run-time (usually written in C)
Simplified Pipeline
Goal: Compile source into executable that, when run, prints the result of evaluating the source.
Approach: Lets figure out how to write
- A compiler from the input string into assembly,
- A run-time that will let us do the printing.
Next, lets see how to do (1) and (2) using our sequence of adder
languages.
Adder-1
- Numbers
- e.g.
7
,12
,42
…
The “Run-time”
Lets work backwards and start with the run-time.
Here’s what it looks like as a C
program main.c
#include <stdio.h>
extern int our_code() asm("our_code_label");
int main(int argc, char** argv) {
int result = our_code();
"%d\n", result);
printf(return 0;
}
main
just callsour_code
and prints its return value,our_code
is (to be) implemented in assembly,- Starting at label
our_code_label
, - With the desired return value stored in register
EAX
- per, the
C
calling convention
- Starting at label
Test Systems in Isolation
Key idea in (Software) Engineering:
Decouple systems so you can test one component without (even implementing) another.
Lets test our “run-time” without even building the compiler.
Testing the Runtime: A Really Simple Example
Given a SourceProgram
42
We want to compile the above into an assembly file forty_two.s
that looks like:
section .text
global our_code_labelour_code_label:
mov eax, 42
ret
For now, lets just
- write that file by hand, and test to ensure
- object-generation and then
- linking works
(On MacOS)
nasm -f macho64 -o forty_two.o forty_two.s
$ clang -g -m64 -o forty_two.run c-bits/main.c forty_two.o $
(On Linux)
nasm -f elf64 -o forty_two.o forty_two.s
$ clang -g -m64 -o forty_two.run c-bits/main.c forty_two.o $
We can now run it:
forty_two.run
$ 42
Hooray!
The “Compiler”
Recall, that compilers were invented to avoid writing assembly by hand
First Step: Types
To go from source to assembly, we must do:
Our first step will be to model the problem domain using types.
Lets create types that represent each intermediate value:
Text
for the raw input sourceExpr
for the ASTAsm
for the output x86 assembly
Defining the Types: Text
Text
is raw strings, i.e. sequences of characters
texts :: [Text]
=
texts "It was a dark and stormy night..."
[ "I wanna hold your hand..."
, "12"
, ]
Defining the Types: Expr
We convert the Text
into a tree-structure defined by the datatype
data Expr = Number Int
Note: As we add features to our language, we will keep adding cases to Expr
.
Defining the Types: Asm
Lets also do this gradually as the x86 instruction set is HUGE!
Recall, we need to represent
section .text
global our_code_labelour_code_label:
mov eax, 42
ret
An Asm
program is a list of instructions each of which can:
- Create a
Label
, or - Move a
Arg
into aRegister
Return
back to the run-time.
type Asm = [Instruction]
data Instruction
= ILabel Text
| IMov Arg Arg
| IRet
Where we have
data Register
= EAX
data Arg
= Const Int -- a fixed number
| Reg Register -- a register
Second Step: Transforms
Ok, now we just need to write the functions:
parse :: Text -> Expr -- 1. Transform source-string into AST
compile :: Expr -> Asm -- 2. Transform AST into assembly
asm :: Asm -> Text -- 3. Transform assembly into output-string
Pretty straightforward:
parse :: Text -> Expr
= parseWith expr
parse where
= integer
expr
compile :: Expr -> Asm
Number n) =
compile (IMov (Reg EAX) (Const n)
[ IRet
,
]
asm :: Asm -> Text
= L.intercalate "\n" [instr i | i <- is] asm is
Where instr
is a Text
representation of each Instruction
instr :: Instruction -> Text
IMov a1 a2) = printf "mov %s, %s" (arg a1) (arg a2)
instr (
arg :: Arg -> Text
Const n) = printf "%d" n
arg (Reg r) = reg r
arg (
reg :: Register -> Text
EAX = "eax" reg
Brief digression: Typeclasses
Note that above we have four separate functions that crunch different types to the Text
representation of x86 assembly:
asm :: Asm -> Text
instr :: Instruction -> Text
arg :: Arg -> Text
reg :: Register -> Text
Remembering names is hard.
We can write an overloaded function, and let the compiler figure out the correct implementation from the type, using Typeclasses.
The following defines an interface for all those types a
that can be converted to x86 assembly:
class ToX86 a where
asm :: a -> Text
Now, to overload, we say that each of the types Asm
, Instruction
, Arg
and Register
implements or has an instance of ToX86
instance ToX86 Asm where
= L.intercalate "\n" [asm i | i <- is]
asm is
instance ToX86 Instruction where
IMov a1 a2) = printf "mov %s, %s" (asm a1) (asm a2)
asm (
instance ToX86 Arg where
Const n) = printf "%d" n
asm (Reg r) = asm r
asm (
instance ToX86 Register where
EAX = "eax" asm
Note in each case above, the compiler figures out the correct implementation, from the types…
Adder-2
Well that was easy! Lets beef up the language!
- Numbers + Increment
- e.g.
add1(7)
,add1(add1(12))
, …
Repeat our Recipe
- Build intuition with examples,
- Model problem with types,
- Implement compiler via type-transforming-functions,
- Validate compiler via tests.
1. Examples
First, lets look at some examples.
Example 1
How should we compile?
add1(7)
In English
- Move
7
into theeax
register - Add
1
to the contents ofeax
In ASM
mov eax, 7
add eax, 1
Aha, note that add
is a new kind of Instruction
Example 2
How should we compile
add1(add1(12))
In English
- Move
12
into theeax
register - Add
1
to the contents ofeax
- Add
1
to the contents ofeax
In ASM
mov eax, 12
add eax, 1
add eax, 1
Compositional Code Generation
Note correspondence between sub-expressions of source and assembly
We will write compiler in compositional manner
- Generating
Asm
for each sub-expression (AST subtree) independently, - Generating
Asm
for super-expression, assuming the value of sub-expression is inEAX
2. Types
Next, lets extend the types to incorporate new language features
Extend Type for Source and Assembly
Source Expressions
data Expr = ...
| Add1 Expr
Assembly Instructions
data Instruction
= ...
| IAdd Arg Arg
Example-1 Revisited
= "add1(7)"
src1
= Add1 (Number 7)
exp1
= [ IMov (Reg EAX) (Const 7)
asm1 IAdd (Reg EAX) (Const 1)
, ]
Example-2 Revisited
= "add1(add1(12))"
src2
= Add1 (Add1 (Number 12))
exp2
= [ IMov (Reg EAX) (Const 12)
asm2 IAdd (Reg EAX) (Const 1)
, IAdd (Reg EAX) (Const 1)
, ]
3. Transforms
Now lets go back and suitably extend the transforms:
parse :: Text -> Expr -- 1. Transform source-string into AST
compile :: Expr -> Asm -- 2. Transform AST into assembly
asm :: Asm -> Text -- 3. Transform assembly into output-string
Lets do the easy bits first, namely parse
and asm
Parse
parse :: Text -> Expr
= parseWith expr
parse
expr :: Parser Expr
= try primExpr
expr <|> integer
primExpr :: Parser Expr
= Add1 <$> rWord "add1" *> parens expr primExpr
Asm
To update asm
just need to handle case for IAdd
instance ToX86 Instruction where
IMov a1 a2) = printf "mov %s, %s" (asm a1) (asm a2)
asm (IAdd a1 a2) = printf "add %s, %s" (asm a1) (asm a2) asm (
Note
- GHC will tell you exactly which functions need to be extended (Types, FTW!)
- We will not discuss
parse
andasm
any more…
Compile
Finally, the key step is
compile :: Expr -> Asm
Number n)
compile (= [ IMov (Reg EAX) (Const n)
IRet
,
]Add1 e)
compile (= compile e -- EAX holds value of result of `e` ...
++ [ IAdd (Reg EAX) (Const 1) ] -- ... so just increment it.
Examples Revisited
Lets check that compile behaves as desired:
>>> (compile (Number 12)
IMov (Reg EAX) (Const 12) ]
[
>>> compile (Add1 (Number 12))
IMov (Reg EAX) (Const 12)
[ IAdd (Reg EAX) (Const 1)
,
]
>>> compile (Add1 (Add1 (Number 12)))
IMov (Reg EAX) (Const 12)
[ IAdd (Reg EAX) (Const 1)
, IAdd (Reg EAX) (Const 1)
, ]
Adder-3
You do it!
- Numbers + Increment + Double
- e.g.
add1(7)
,twice(add1(12))
,twice(twice(add1(42)))
Adder-4
- Numbers + Increment + Decrement + Local Variables
- e.g.
let x = add1(7), y = add1(x) in add1(y)
Can you think why local variables make things more interesting?
Repeat our Recipe
- Build intuition with examples,
- Model problem with types,
- Implement compiler via type-transforming-functions,
- Validate compiler via tests.
Step 1: Examples
Lets look at some examples
Example: let1
let x = 10
in
x
Need to store 1 variable – x
Example: let2
let x = 10 -- x = 10
= add1(x) -- y = 11
, y = add1(y) -- z = 12
, z in
-- 13 add1(z)
Need to store 3 variables– x, y, z
Example: let3
let a = 10
= let b = add1(a)
, c in
add1(b)in
add1(c)
Need to store 3 variables – a
, b
, c
– but at most 2 at a time
- First
a, b
, thena, c
- Don’t need
b
andc
simultaneously
Problem: Registers are Not Enough
A single register eax
is useless:
- May need 2 or 3 or 4 or 5 … values.
There is only a fixed number (say, N
) of registers
- And our programs may need to store more than
N
values, so
Need to dig for more storage space!
Memory: Code, Globals, Heap and Stack
Here’s what the memory – i.e. storage – looks like:
Focusing on “The Stack”
Lets zoom into the stack region, which when we start looks like this:
The stack grows downward (i.e. to smaller addresses)
We have lots of 4-byte slots on the stack at offsets from the “stack pointer” at addresses:
[RBP - 4 * 1]
,[RBP - 4 * 2]
,[RBP - 4 * 3]
…,
Note: On 32-bit machines the “base” is the EBP
register (not RBP
).
How to compute mapping from variables to slots ?
The i
-th stack-variable lives at address [RBP - 4 * i]
Required A mapping
- From source variables (
x
,y
,z
…) - To stack positions (
1
,2
,3
…)
Solution The structure of the let
s is stack-like too…
- Maintain an
Env
that mapsId |-> StackPosition
let x = e1 in e2
adds x |-> i
to Env
- where
i
is ``current’’ size of stack.
Let-bindings and Stacks: Example-1
-- []
let x = 1
in -- [ x |-> 1 ]
x
Let-bindings and Stacks: Example-2
-- []
let x = 1
-- [x |-> 1]
= add1(x)
, y -- [y |-> 2, x |-> 1]
= add1(y)
, z in -- [z |- 3, y |-> 2, x |-> 1]
add1(z)
QUIZ
At what position on the stack do we store variable c
?
let a = 1
=
, c let b = add1(a)
in add1(b)
in
add1(c)
A. 1
B. 2
C. 3
D. 4
E. not on stack!
Strategy
-- ENV(n)
let x = E1
in -- [x |-> n+1, ENV(n)]
E2
-- ENV(n)
Strategy: Variable Definition
At each point, we have env
that maps (previously defined) Id
to StackPosition
To compile let x = e1 in e2
we
- Compile
e1
usingenv
(i.e. resulting value will be stored ineax
) - Move
eax
into[RBP - 4 * i]
- Compile
e2
usingenv'
(where env'
be env
with x |-> i
i.e. push x
onto env
at position i
)
Strategy: Variable Use
To compile x
given env
- Move
[RBP - 4 * i]
intoeax
(where env
maps x |-> i
)
Example: Let-bindings to Asm
Lets see how our strategy works by example:
Example: let1
QUIZ: let2
When we compile
let x = 10
= add1(x)
, y in
add1(y)
The assembly looks like
mov eax, 10 ; LHS of let x = 10
mov [RBP - 4*1], eax ; save x on the stack
mov eax, [RBP - 4*1] ; LHS of , y = add1(x)
add eax, 1 ; ""
???add eax, 1
What .asm instructions shall we fill in for ???
mov [RBP - 4 * 1], eax ; A
mov eax, [RBP - 4 * 1]
mov [RBP - 4 * 1], eax ; B
mov [RBP - 4 * 2], eax ; C
mov [RBP - 4 * 2], eax ; D
mov eax, [RBP - 4 * 2]
; E (empty! no instructions)
Example: let3
Lets compile
let a = 10
= let b = add1(a)
, c in
add1(b)in
add1(c)
Lets figure out what the assembly looks like!
mov eax, 10 ; LHS of let a = 10
mov [RBP - 4*1], eax ; save a on the stack
???
Step 2: Types
Now, we’re ready to move to the implementation!
Source Expressions
type Id = Text
data Expr = ...
| Let Id Expr Expr -- `let x = e1 in e2` represented as `Let x e1 e2`
| Var Id -- `x` represented as `Var x`
Assembly Instructions
Lets enrich the Instruction
to include the register-offset [RBP - 4*i]
data Arg = ...
| RegOffset Reg Int -- `[RBP - 4*i]` modeled as `RegOffset RBP i`
Environments
An Env
type to track stack-positions of variables with API
push
variable ontoEnv
(returning its position),lookup
a variable’s position inEnv
push :: Id -> Env -> (Int, Env)
= (i, (x, i) : env)
push x env where
= 1 + length env
i
lookup :: Id -> Env -> Maybe Int
lookup x ((y, i) : env)
| x == y = Just i
| otherwise = lookup x env
lookup x [] = Nothing
Step 3: Transforms
Almost done: just write code formalizing the above strategy
Code: Variable Use
Var x) = [ IMov (Reg EAX) (RegOffset RBP i) ]
compileEnv env (where
= fromMaybe err (lookup x env)
i = error (printf "Error: Variable '%s' is unbound" x) err
Code: Variable Definition
Let x e1 e2 l) = compileEnv env e1
compileEnv env (++ IMov (RegOffset RBP i) (Reg EAX)
: compileEnv env' e2
where
= pushEnv x env (i, env')
Step 4: Tests
Lets take our adder
compiler out for a spin!
Recap: We just wrote our first Compilers
SourceProgram
will be a sequence of four tiny “languages”
- Numbers
- e.g.
7
,12
,42
…
- Numbers + Increment
- e.g.
add1(7)
,add1(add1(12))
, …
- Numbers + Increment + Decrement
- e.g.
add1(7)
,add1(add1(12))
,sub1(add1(42))
- Numbers + Increment + Decrement + Local Variables
- e.g.
let x = add1(7), y = add1(x) in add1(y)
Using a Recipe
- Build intuition with examples,
- Model problem with types,
- Implement compiler via type-transforming-functions,
- Validate compiler via tests.
Will iterate on this till we have a pretty kick-ass language.