eForth Overview
Before diving directly into eForth, I would like to
discuss the general principles of Forth language. The language consists of a
collection of words, which reside in the memory of a computer and can be
executed by entering their names on the computer keyboard. A list of words can be compiled, given a
new name and made a new word. In
fact, most words in Forth are defined as lists of existing words. A small set of primitive words are
defined in machine code of the native CPU.
All other words are built from this primitive words and eventually refer
to them when executed.
Words are similar to procedures and subroutines in
other languages. The difference is
that Forth words are executable interactively when referenced by name, and they
can be compiled into lists which can be referenced as new words. Programming in Forth is to define new
and more powerful words as lists of existing words. This process continues until the final
word becomes the solution to an application.
Here I will state 'The Forth Law of Computing"
without a proof:
All computable functions can
be constructed by defining new words as lists of words which include a small
number of primitive words.
This eForth model consists of about 200 words, of
which only 31 are primitive words.
Although it is very difficult to prove the above law, I will demonstrate
it to you that from this small set of primitive words a complete operating
system with many tools, that is the eForth model itself, can be built. If an operating system can be built this
way, it is not difficult to understand that any application can be so
developed.
Forth is very similar to machine code. In a computer, the CPU has a finite set
of machine instructions, and all computable functions are implemented as lists
of these machine instructions. High
level languages generally replace machine instruction lists by statements, functions,
subroutines, and procedures, which can be used to construct procedures and
subroutines at higher levels until the last procedure which is the
application. This also helps
demonstrating the validity of the above law.
The primitive words must be constructed using native
machine code of the host computer.
They are also called low level words or code words. All other words are constructed as lists
of existing words. They are called
high level words or colon words because ":" (colon) is a Forth word
which defines or constructs new words to replace lists of existing words.
Forth as a computing system has two principal components:
an user interface as the Forth language processor which interprets the commands
entered from keyboard or equivalent devices; and a machine interface which
interprets lists or words recursively until it can issue machine instructions
in the primitive words to the host computer for execution. The user interface processes commands in
text form. It is often referred to
as the text interpreter and the outer interpreter.
The machine interface executes words by processing recursively
the word lists compiled in colon words to reach the primitive words which are
handed to the host computer for execution. It is often called the inner
interpreter and the address interpreter, because the word lists are often stored
in the dictionary as address lists.
Virtual Forth Computer
Forth is a
computer model which can be implemented on any real CPU with reasonable
resources. This model is often
called a virtual Forth computer.
The minimal components of a virtual Forth computer are:
1.
A dictionary in memory to hold all the execution procedures.
2.
A return stack to hold return addresses of procedures yet to be
executed.
3.
A data stack to hold parameters passing between procedures.
4.
A user area in RAM memory to hold all the system variables.
5.
A CPU to move date among stacks and memory, and to do ALU operations to parameters
stored on the data stack.
The eForth model is a detailed specification of a virtual
Forth computer which can be implemented on many different CPU's and forces them
to behave identically in executing an identical Forth instruction set. It was
first implemneted on a PC using Intel 8086 CPU as a guiding model for other
implementations. Here we will try
to describe precisely the behavior of the virtual Forth computer. To describe
precisely how this computer functions, we will use the 8086 machine code to
clarify the specification.
The following registers are required for a virtual
Forth computer:
Forth Register 8086
Register Function
IP SI Interpreter
Pointer
SP SP Data
Stack Pointer
RP RP Return
Stack Pointer
WP AX Word
or Work Pointer
UP (in
memory ) User
Area Pointer
In the dictionary, each procedure (or word in Forth
terminology) occupies an area called code field, which contains executable
machine code and data required by the code. There are two types of words used
in eForth: code word whose code field contains only machine instructions, and
colon word whose code field contains a call to the list processing subroutine
and a list of word addresses. A word
address is the code field address of the word in the dictionary. 4 bytes are allocated for the call to
list processor. Word addresses are 2 bytes in length, and are pointers to code
fields of words in the dictionary.
The length of a code field varies depending upon the complexity of the
word.
In the code field of a code word there is a list of
machine instructions of the native CPU.
The machine instructions are terminated by a group of instructions,
generally specified as a macro instruction named $NEXT. The function of $NEXT is to fetch the next
word pointed to by the Interpreter Pointer IP, increment IP to point to the
next word in the word list, and jump to the address just fetched. Since a word address points to a code
field containing executable machine instructions, executing a word means
jumping directly to the code field pointed to by the word address. $NEXT thus allows the virtual Forth computer
to execute a list of words with very little CPU overhead. In the 8086 implementation, $NEXT is a
macro assembling the following two machine instructions as shown below.
In a colon word, the first four byte in the code
field must be a subroutine call instruction to process the address list
following this call instruction.
This address list processing subroutine is named doLIST. doLIST pushes the contents in IP onto
the return stack, copies the address of the first entry in its address list
into IP and then calls $NEXT. $NEXT
will then start executing this list of addresses in sequence.
The last entry in the address list of a colon word
must be EXIT. EXIT is a code word
which undoes what doLIST accomplished.
EXIT pops the top item on the return stack into the IP register.
Consequently, IP points to the address following the colon word just
executed. EXIT then invokes $NEXT
which continues the processing of the word list, briefly interrupted by the
last colon word in this word list.
$NEXT MACRO
LODSW
\
load next word into WP (AX)
JMP
AX \
jump directly to the word thru WP
ENDM \
IP (SI) now points to the next word
doLIST ( a -- ) \
Run address list in a colon word.
XCHG BP,SP \
exchange pointers
PUSH SI \
push return stack
XCHG BP,SP \
restore the pointers
POP SI \
new list address
$NEXT
CODE EXIT \
Terminate a colon definition.
XCHG BP,SP \
exchange pointers
POP SI \
pop return stack
XCHG BP,SP \
restore the pointers
$NEXT
It is interesting to note that in this eForth
implementation, $NEXT is a macro, doLIST is a subroutine, and EXIT is actually
a Forth code words. $NEXT, doLIST
and EXIT are collectively call the
'inner interpreters' and 'address interpreters' of Forth. They are the corner stones of a virtual
Forth computer as they control the execution flow of Forth words in the system.
Based on the above mechanism to execute code words
and colon words, a Forth computer can be constructed using a small set of
machine dependent code words and a much larger set of colon words. Tools are provided so that the user can
extend the system by adding new words in truly modular fashion to solve any
practical problems.
There are 190 high level words in eForth, built on
the 31 low level primitive words.
The high level word set is required to build the outer interpreter and
the associated utility words. As
the outer interpreter itself represents a fairly substantial application, the
word set necessary to build the outer interpreter forms a very solid foundation
to build most other applications.
However, for any real world application one would not expect that this
eForth word set is sufficient. The
beauty of Forth is that in programming an application, the user designs and
implements a new word set best tailored to his application. Forth is an open system, assuming that
no operating system can be complete and all-encompassing. The user has the best understanding of
his own needs, and he knows the best way to accomplish his goal.
Memory Map
The most
important contribution by von Neumann to the computer design was the
recognition that a single, uniform memory device can be used to store program
and data, contrasting to the then prevailing architectures in which program and
data were stored separately and most often using very different storage
media. It greatly simplified the
design of computers and had become the dominant computer architecture for all
the important computer families ever since.
Memory space is a concept of paramount importance in
computer hardware and assembly programming, but often hidden and ignored in
most conventional high level languages.
High level languages and operating systems hide the addressable memory
space from the user in order to protect the operating system, because there are
very sensitive areas in the memory space and unintentional alterations to the
information stored in these areas would cause the system to malfunction or even
to crash. The point of view from
the operating system and from the computer priesthood, these sensitive areas must be protected
at all cost, and they are the reserved territory of the systems
programmers. Ordinary applications
programmers are allocated only enough space to run their programs safely, for
their own good.
Forth opens the entire memory space to the user. The user can freely store data and code
into memory and retrieve them from the memory. Coming with the freedom is the
responsibility of handling the memory correctly.
Memory used in eForth is separated into the following
areas:
Cold boot 100H-17FH Cold start and variable initial values
Code dictionary 180H-1344H Code dictionary growing upward
Free space 1346H-33E4H Shared by code and name dictionaries
Name/word 33E6H-3BFFH Name dictionary growing downward
Data stack 3C00H-3E7FH Growing downward
TIB
3E80H-
Growing upward
Return stack -3F7FH Growing
downward
User variables 3F80H-3FFFH
These areas are allocated by assembly constants and
can be changed conveniently to suit the target environment. The following assembly code segment
prescribes the memory allocation in a typical eForth system. The memory map is also illustrated in a
schematic drawing for easier visulization.
;; Memory allocation
;; 0//code>--//--<name//up>--<sp//tib>--rp//em
EM EQU 04000H ;top
of memory
COLDD EQU 00100H ;cold
start vector
US EQU 64*CELLL ;user
area size in cells
RTS EQU 64*CELLL ;return
stack/TIB size
RPP EQU EM-8*CELLL ;start
of return stack (RP0)
TIBB EQU RPP-RTS ;terminal
input buffer (TIB)
SPP EQU TIBB-8*CELLL ;start
of data stack (SP0)
UPP EQU EM-256*CELLL ;start
of user area (UP0)
NAMEE EQU UPP-8*CELLL ;name
dictionary
CODEE EQU COLDD+US ;code
dictionary

eForth Kernel
One of the most important feature of eForth is the
small machine dependent kernel, which allows its to be ported to other CPU's very
conveniently. The selection of words in this kernel is based on the criteria
that they are very difficult if not impossible to synthesize from other primitive
words. From this set of kernel
words, all other Forth words have to be built. The kernel words can be classified as
following:
System interface: BYE,
?rx, tx!, !io
Inner interpreters: doLIT, doLIST, next, ?branch, branch, EXECUTE, EXIT
Memory access:
! , @, C!, C@
Return stack:
RP@, RP!, R>, R@, R>
Data stack:
SP@, SP!, DROP, DUP, SWAP, OVER
Logic:
0<, AND, OR,
XOR
Arithmetic:
UM+
The virtual Forth computer is based on a tw-stack
architecture. The return stack is
used to allow a high level word to be executed in the address list of another
high level word. It is very similar
to the return stack used for nested subroutine calls in a conventional
computer. Before executing a high
level word in an address list, the next address of the list is pushed on the
return stack so that the IP register can be used to scan the address list in
the called word. When the called
word is executed to completion, the stored address on the returned stack is
popped back into IP register and execution of the calling word list can be
continued.
The data stack is used to pass parameters from one
word to another. Conventional
computers use the return stack to do the parameter passing, and it takes a very
complicated compiler to figure out which are return addresses and which are
parameters. Forth segregated these
two types of information on two separate stacks and thus greatly simplies the
execution and compilation of words.
Passing parameter on the data stack also reduces the syntactical
complexity of Forth language to the minimum and allows words to be strung
together into lists with minimum overhead in compilation and interpretation.
The kernal words move and process data and address
among the stacks and the memory.
They emcompass the minimal functionality necessary to make a computer to
behave like a Forth computer. A
complete understanding of these kernel word is vital to the understanding of a
virtual Forth computer. However, it
is not difficult to understand the kernel words, because there are only 31 of
them.
It is my intention to use this eForth model to
illustrate the validity of 'the Forth Law of Computing', which stated that all
computable functions can be constructed by lists of these kernel words and the
high level words built from these kernel words. The eForth model includes a text
interpreter which allows the user to type lists of word names and execute them
in sequence, a compiler which allows the user to name lists of words and
compile new words, and utilities like memory dump, stack dump, and a colon word
decompiler. Thus the eForth system
forms a fairly complete software development environment for the user to
develop applications. If such a
system can be built from this small set of kernel words, it should be obvious
that most practical applications can also be built from it.
System Interface
BYE returns control from eForth back to the operating
system. !io initializes the serial
I/O device in the system so that it can interact with the user through a
terminal. These two words are not
needed once the eForth system is up and running, but they are essential to
bring the system up in DOS. ?rx is
used to implement ?KEY and KEY, and tx! is used to implement EMIT. eForth communicates with the user
through these words which supports terminal interactions and file
download/upload. Here these words
are defined using the DOS service calls.
For embedded controllers, these three words must be defined for the
specific I/O devices.
?RX is a unique design invented by Bill Muench to
support serial input . ?RX provides
the functions required of both KEY and KEY? which accept input from a
terminal. ?RX inspects the terminal
device and returns a character and a true flag if the character has been
received and is waiting to be retrieved.
If no character was received, ?RX simply returns a false flag. With ?RX, both KEY and KEY? can be
defined as high level colon definitions.
TX! sends a character on the data stack to the
terminal device. Both ?RX and TX!
are coded here as DOS calls. In
embedded applications, they will have to be coded in machine specific code to
handle the specific serial I/O device.
!IO initializes the serial I/O device, which is not necessary here
because it is taking care of by the DOS.
In embedded systems, the I/O device must be initialized by !IO.
CODE BYE ( -- , exit Forth )
INT 020H \
return to DOS
CODE ?RX (
-- c T | F ) \
Return input character and true,
\
or a false if no input.
$CODE 3,'?RX',QRX
XOR BX,BX \
BX=0 setup for false flag
MOV DL,0FFH \
input command
MOV AH,6 \
MS-DOS Direct Console I/O
INT 021H
JZ QRX3 \
?key ready
OR AL,AL \
AL=0 if extended char
JNZ QRX1 \
?extended character code
INT 021H
MOV BH,AL \
extended code in msb
JMP QRX2
QRX1: MOV BL,AL
QRX2: PUSH BX \
save character
MOV BX,-1 \
true flag
QRX3: PUSH BX
$NEXT
CODE TX! (
c -- ) \
Send character c to output device.
POP DX \
char in DL
CMP DL,0FFH \
0FFH is interpreted as input
JNZ TX1 \
do NOT allow input
MOV DL,32 \
change to blank
TX1: MOV AH,6 \
MS-DOS Direct Console I/O
INT 021H \
display character
$NEXT
CODE !IO (
-- ) \
Initialize the serial I/O devices.
$NEXT
Inner Interpreter
In the word list of a colon definition, it is
generally assumed that words are execution addresses, which can be executed
sequentially by the address interpreter $NEXT. However, occasionally we do need to
compile other types of data in-line with the words. Special mechanisms must be used to tell
the address interpreter to treat these data differently. All data entries must be preceded by
special words which can handle the data properly. A special word and its associated data
form a data structure. Data
structures are extensions of words and can be thought of as building blocks to
form lists in colon definitions.
$NEXT must be assembled at the end of a code
word. It fetches the next address
in the address list pointed to by IP and jumps to that address. It allows an address list to be scanned
and thus executed. doLIST starts
the execution of an address list by saving IP on the return stack and stores
the starting address of an address list into IP, and then $NEXT starts
executing this address list. EXIT
must be compiled as the last entry in an address list. It terminates the execution of the
current address list and returns execution to the address saved on the return
stack.
EXECUTE takes the execution address from the data
stack and executes that word. This
powerful word allows the user to execute any word which is not a part of an
address list.
doLIT pushes the next word onto the data stack as an
integer literal instead of as an addresses to be executed by $NEXT. It allows numbers to be compiled as
in-line literals, supplying data to the data stack at run time. doLIT is not used by itself, but rather
compiled by LITERAL which inserts doLIT and its asociated integer into the
address list under construction.
Anytime you see a number in a colon definition, LITERAL is invoked to
compile an integer literal with doLIT.
Integer literals are by far the most numerous data
structures in colon definitions other than regular words. Address literals are used to build
control structures. String literals
are used to embed text strings in colon definitions.
$NEXT MACRO
LODSW
\
load next word into WP (AX)
JMP
AX \
jump directly to the word thru WP
ENDM \
IP (SI) now points to the next word
doLIST ( a -- ) \
Run address list in a colon word.
XCHG BP,SP \
exchange pointers
PUSH SI \
push return stack
XCHG BP,SP \
restore the pointers
POP SI \
new list address
$NEXT
CODE EXIT \
Terminate a colon definition.
XCHG BP,SP \
exchange pointers
POP SI \
pop return stack
XCHG BP,SP \
restore the pointers
$NEXT
CODE EXECUTE (
ca -- ) \
Execute the word at ca.
POP BX
JMP BX \
jump to the code address
CODE doLIT (
-- w ) \
Push inline literal on data stack.
LODSW \
get the literal compiled in-line
PUSH AX \
push literal on the stack
$NEXT \
execute next word after literal
Loops and Branches
eForth uses three different types of address
literals. 'next', '?branch' and 'branch' are followed not by word addresses but
by pointers to locations in a list to be executed next. These address literals are the building
blocks upon which loops and branching structures are constructed. An address literal is followed by a
branch pointer which causes execution to be transferred to that location. The branch location most often points to
a different location in the address list of the same colon word.
CODE next (
-- ) \
Decrement index and exit loop
\
if index is less than 0.
SUB WORD
PTR [BP],1 \
decrement the index
JC NEXT1 \
?decrement below 0
MOV SI,0[SI] \
no, continue loop
$NEXT
NEXT1:ADD BP,2 \
yes, pop the index
ADD SI,2 \
exit loop
$NEXT
CODE ?branch (
f -- ) \
Branch if flag is zero.
POP BX \
pop flag
OR BX,BX \
?flag=0
JZ BRAN1 \
yes, so branch
ADD SI,2 \
point IP to next cell
$NEXT
BRAN1:MOV SI,0[SI] \
IP:=(IP), jump to new address
$NEXT
CODE branch (
-- ) \
Branch to an inline address.
MOV SI,0[SI] \
jump to new address unconditionally
$NEXT
Address literals are used to construct control
structures in colon definitions.
'next' is compiled by NEXT.
'?branch' is compiled by IF, WHILE and UNTIL. 'branch' is compiled by AFT, ELSE,
REPEAT and AGAIN. In the colon
words to be discussed in the later sections, you will not see these kernel
words but words which construct loops and branches. For examples:
IF
( compiles ?branch and address after THEN ) <true clause> THEN
IF
( compiles ?branch and address after ELSE ) <true clause>
ELSE
( compiles branch and address after THEN ) <false clause>
THEN
BEGIN
(marks current address ) <loop clause>
AGAIN
( compiles branch and address after BEGIN )
BEGIN
( mark current address ) <loop clause>
UNTIL
( compiles ?branch and address after BEGIN )
BEGIN
( mark current address ) <loop clause>
WHILE
( compiles ?branch and address after REPEAT ) <true clause>
REPEAT
( compile branch and address after BEGIN )
FOR ( set up loop, mark current address )
<loop clause>
NEXT
( compile next and address after FOR )
FOR
( set up loop, mark current address ) <loopclause>
AFT
( change marked address to current address, compile branch
and
address after THEN ) <skip clause>
THEN
<loop clause> NEXT ( compile
next and address after AFT )
Memory Access
Four memory accessing words are included in the
eForth kernel: ! (store), @ (fetch), C! (C-store) and C@ (C-fetch). ! and @ access memory in cells, whose
size depends on the CPU underneath.
eForth assumes that the CPU can access memory in bytes and that all
addresses are in the units of bytes. C! and C@ allow the user access
memory in bytes.
The two most important resources in a computer are
the CPU and the memory. There is
not much one can do with the CPU, except to use its instruction set to write
programs. However, the real
usefulness and intelligence lies with the memory, which holds both the program
and the data. In conventional
languages, you humbly request memory to store your data, and the compiler reluctantly
allocate it to you. If you exceed
your memory allocation, your program will be ruthlessly terminated.
In Forth, you have all the memory and you are allowed
to do anything with the memory. !,
@, C! and C@ do not place restriction on their use. You can use them to write self-modifying
code if you like. However, you must
know exactly what you are doing.
It is not a very good idea to change the contents of
the dictionary, except in the parameter fields of variables and arrays you
defined specifically for data storage.
The space occupied by the stacks should be respected, too. The user variable area holds vital
information for the system to run correctly. The space bewteen the code dictionary
and the name dictionary are not used and you are free to use it to store
temporary data. Be reminded,
however, that as you define new words, the dictionaries are extended and may
over-write data you placed there.
The moral is: Use @ and C@ freely, but be careful
with ! and C!.
CODE ! (
w a -- ) \
Pop the data stack to memory.
POP BX \
get address from tos
POP 0[BX] \
store data to that adddress
$NEXT
CODE @ (
a -- w ) \
Push memory location to data stack.
POP BX \
get address
PUSH 0[BX] \
fetch data
$NEXT
CODE C! (
c b -- ) \
Pop data stack to byte memory.
POP BX \
get address
POP AX \
get data in a cell
MOV 0[BX],AL \
store one byte
$NEXT
CODE C@ (
b -- c ) \
Push byte memory content on data stack.
POP BX \
get address
XOR AX,AX
\
AX=0 zero the hi byte
MOV AL,0[BX] \
get low byte
PUSH AX \
push on stack
$NEXT
Return Stack
RP! pushes the address on the top of the data stack
to the return stack and thus initializes the return stack. RP! is only used to initialize the
system and are seldom used in applications. RP@ pushes the contents of the return
stack pointer RP on the data stack. It is also used very rarely in applications.
>R pops a number off the data stack and pushes it
on the return stack.. R> does
the opposite. R@ copies the top
item on the return stack and pushes it on the data stack.
The eForth system uses the return stack for two
specific purposes: to save addresses while recusing through an address list,
and to store the loop index during a FOR-NEXT loop. As the addresses piled up on the return
stack changes dynamically as words are executed, there is very little useful
information the user can get from the return stack at the run time. In setting up a loop, FOR compiles
>R, which pushes the loop index from the data stack to the return
stack. Inside the FOR-NEXT loop,
the running index can be recalled by R@.
NEXT compiles 'next' with an address after FOR. when 'next' is executed, it decrements
the loop index on the top of the return stack. If the index becomes negative, the loop
is terminated; otherwise, 'next' jumps back to the word after FOR.
Return stack is used by the virtual Forth computer to
save return addresses to be processes later. It is also a convenient place to store
data temporarily. The return stack
can thus be considered as a extension of the data stack. However, one must be very careful in
using the return stack for temporary storage. The data pushed on the return stack must
be popped off before EXIT is executed.
Otherwise, EXIT will get the wrong address to return to, and the system
generally will crash.
CODE RP@ (
-- a ) \
Push current RP to data stack.
PUSH BP \
copy address to return stack
$NEXT \
pointer register BP
CODE RP! (
a -- ) \
Set the return stack pointer.
POP BP \
copy (BP) to tos
$NEXT
CODE R> (
-- w ) \
Pop return stack to data stack.
PUSH 0[BP] \
copy w to data stack
ADD BP,2 \
adjust RP for popping
$NEXT
CODE R@ (
-- w ) \
Copy top of return stack to data stack.
PUSH 0[BP] \
copy w to data stack