eForth Overview

 

 

Before diving directly into eForth, I would like to discuss the general principles of Forth language. The language consists of a collection of words, which reside in the memory of a computer and can be executed by entering their names on the computer keyboard.  A list of words can be compiled, given a new name and made a new word.  In fact, most words in Forth are defined as lists of existing words.  A small set of primitive words are defined in machine code of the native CPU.  All other words are built from this primitive words and eventually refer to them when executed.

 

Words are similar to procedures and subroutines in other languages.  The difference is that Forth words are executable interactively when referenced by name, and they can be compiled into lists which can be referenced as new words.  Programming in Forth is to define new and more powerful words as lists of existing words.  This process continues until the final word becomes the solution to an application.

 

Here I will state 'The Forth Law of Computing" without a proof:

 

All computable functions can be constructed by defining new words as lists of words which include a small number of primitive words.

 

This eForth model consists of about 200 words, of which only 31 are primitive words.  Although it is very difficult to prove the above law, I will demonstrate it to you that from this small set of primitive words a complete operating system with many tools, that is the eForth model itself, can be built.  If an operating system can be built this way, it is not difficult to understand that any application can be so developed.

 

Forth is very similar to machine code.  In a computer, the CPU has a finite set of machine instructions, and all computable functions are implemented as lists of these machine instructions.  High level languages generally replace machine instruction lists by statements, functions, subroutines, and procedures, which can be used to construct procedures and subroutines at higher levels until the last procedure which is the application.  This also helps demonstrating the validity of the above law.

 

The primitive words must be constructed using native machine code of the host computer.  They are also called low level words or code words.  All other words are constructed as lists of existing words.  They are called high level words or colon words because ":" (colon) is a Forth word which defines or constructs new words to replace lists of existing words.

 

Forth as a computing system has two principal components: an user interface as the Forth language processor which interprets the commands entered from keyboard or equivalent devices; and a machine interface which interprets lists or words recursively until it can issue machine instructions in the primitive words to the host computer for execution.  The user interface processes commands in text form.  It is often referred to as the text interpreter and the outer interpreter. 

 

The machine interface executes words by processing recursively the word lists compiled in colon words to reach the primitive words which are handed to the host computer for execution. It is often called the inner interpreter and the address interpreter, because the word lists are often stored in the dictionary as address lists.

 

 

Virtual Forth Computer

 

Forth is a computer model which can be implemented on any real CPU with reasonable resources.  This model is often called a virtual Forth computer.  The minimal components of a virtual Forth computer are:

 

1.   A dictionary in memory to hold all the execution procedures.

2.   A return stack to hold return addresses of procedures yet to be executed.

3.   A data stack to hold parameters passing between procedures.

4.   A user area in RAM memory to hold all the system variables.

5.   A CPU to move date among stacks and memory, and to do ALU operations to parameters stored on the data stack.

 

The eForth model is a detailed specification of a virtual Forth computer which can be implemented on many different CPU's and forces them to behave identically in executing an identical Forth instruction set. It was first implemneted on a PC using Intel 8086 CPU as a guiding model for other implementations.  Here we will try to describe precisely the behavior of the virtual Forth computer. To describe precisely how this computer functions, we will use the 8086 machine code to clarify the specification.

 

The following registers are required for a virtual Forth computer:

 

Forth Register 8086 Register               Function          

           

IP                                 SI                             Interpreter Pointer       

SP                                SP                                         Data Stack Pointer      

RP                               RP                            Return Stack Pointer

WP                              AX                               Word or Work Pointer

UP                               (in memory )               User Area Pointer               

 

In the dictionary, each procedure (or word in Forth terminology) occupies an area called code field, which contains executable machine code and data required by the code. There are two types of words used in eForth: code word whose code field contains only machine instructions, and colon word whose code field contains a call to the list processing subroutine and a list of word addresses.  A word address is the code field address of the word in the dictionary.  4 bytes are allocated for the call to list processor. Word addresses are 2 bytes in length, and are pointers to code fields of words in the dictionary.  The length of a code field varies depending upon the complexity of the word.

 

In the code field of a code word there is a list of machine instructions of the native CPU.  The machine instructions are terminated by a group of instructions, generally specified as a macro instruction named $NEXT.  The function of $NEXT is to fetch the next word pointed to by the Interpreter Pointer IP, increment IP to point to the next word in the word list, and jump to the address just fetched.  Since a word address points to a code field containing executable machine instructions, executing a word means jumping directly to the code field pointed to by the word address.  $NEXT thus allows the virtual Forth computer to execute a list of words with very little CPU overhead.  In the 8086 implementation, $NEXT is a macro assembling the following two machine instructions as shown below.

 

In a colon word, the first four byte in the code field must be a subroutine call instruction to process the address list following this call instruction.  This address list processing subroutine is named doLIST.  doLIST pushes the contents in IP onto the return stack, copies the address of the first entry in its address list into IP and then calls $NEXT.  $NEXT will then start executing this list of addresses in sequence.

 

The last entry in the address list of a colon word must be EXIT.  EXIT is a code word which undoes what doLIST accomplished.  EXIT pops the top item on the return stack into the IP register. Consequently, IP points to the address following the colon word just executed.  EXIT then invokes $NEXT which continues the processing of the word list, briefly interrupted by the last colon word in this word list.

 

$NEXT   MACRO

        LODSW                       \ load next word into WP (AX)

        JMP     AX                  \ jump directly to the word thru WP

        ENDM                        \ IP (SI) now points to the next word

 

doLIST      ( a -- )                \ Run address list in a colon word.

      XCHG  BP,SP                   \ exchange pointers

      PUSH  SI                      \ push return stack

      XCHG  BP,SP                   \ restore the pointers

      POP   SI                      \ new list address

      $NEXT

 

CODE  EXIT                          \ Terminate a colon definition.

      XCHG  BP,SP                   \ exchange pointers

      POP   SI                      \ pop return stack

      XCHG  BP,SP                   \ restore the pointers

      $NEXT

 

It is interesting to note that in this eForth implementation, $NEXT is a macro, doLIST is a subroutine, and EXIT is actually a Forth code words.  $NEXT, doLIST and EXIT are collectively  call the 'inner interpreters' and 'address interpreters' of Forth.  They are the corner stones of a virtual Forth computer as they control the execution flow of Forth words in the system.

 

Based on the above mechanism to execute code words and colon words, a Forth computer can be constructed using a small set of machine dependent code words and a much larger set of colon words.  Tools are provided so that the user can extend the system by adding new words in truly modular fashion to solve any practical problems.

 

There are 190 high level words in eForth, built on the 31 low level primitive words.  The high level word set is required to build the outer interpreter and the associated utility words.  As the outer interpreter itself represents a fairly substantial application, the word set necessary to build the outer interpreter forms a very solid foundation to build most other applications.  However, for any real world application one would not expect that this eForth word set is sufficient.  The beauty of Forth is that in programming an application, the user designs and implements a new word set best tailored to his application.  Forth is an open system, assuming that no operating system can be complete and all-encompassing.  The user has the best understanding of his own needs, and he knows the best way to accomplish his goal.

 

 

Memory Map

 

The most important contribution by von Neumann to the computer design was the recognition that a single, uniform memory device can be used to store program and data, contrasting to the then prevailing architectures in which program and data were stored separately and most often using very different storage media.  It greatly simplified the design of computers and had become the dominant computer architecture for all the important computer families ever since.

 

Memory space is a concept of paramount importance in computer hardware and assembly programming, but often hidden and ignored in most conventional high level languages.  High level languages and operating systems hide the addressable memory space from the user in order to protect the operating system, because there are very sensitive areas in the memory space and unintentional alterations to the information stored in these areas would cause the system to malfunction or even to crash.  The point of view from the operating system and from the computer priesthood,  these sensitive areas must be protected at all cost, and they are the reserved territory of the systems programmers.  Ordinary applications programmers are allocated only enough space to run their programs safely, for their own good.

 

Forth opens the entire memory space to the user.  The user can freely store data and code into memory and retrieve them from the memory.  Coming with the freedom is the responsibility of handling the memory correctly.

 

Memory used in eForth is separated into the following areas:

 

Cold boot         100H-17FH         Cold start and variable initial values

Code dictionary   180H-1344H        Code dictionary growing upward

Free space        1346H-33E4H       Shared by code and name dictionaries

Name/word         33E6H-3BFFH       Name dictionary growing downward

Data stack        3C00H-3E7FH       Growing downward

TIB               3E80H-            Growing upward

Return stack      -3F7FH            Growing downward

User variables    3F80H-3FFFH

 

 

These areas are allocated by assembly constants and can be changed conveniently to suit the target environment.  The following assembly code segment prescribes the memory allocation in a typical eForth system.  The memory map is also illustrated in a schematic drawing for easier visulization.

 

;; Memory allocation

;; 0//code>--//--<name//up>--<sp//tib>--rp//em

EM    EQU   04000H                  ;top of memory

COLDD EQU   00100H                  ;cold start vector

US    EQU   64*CELLL                ;user area size in cells

RTS   EQU   64*CELLL                ;return stack/TIB size

RPP   EQU   EM-8*CELLL              ;start of return stack (RP0)

TIBB  EQU   RPP-RTS                 ;terminal input buffer (TIB)

SPP   EQU   TIBB-8*CELLL            ;start of data stack (SP0)

UPP   EQU   EM-256*CELLL            ;start of user area (UP0)

NAMEE EQU   UPP-8*CELLL             ;name dictionary

CODEE EQU   COLDD+US                ;code dictionary

 

 

 


eForth Kernel

 

 

 

One of the most important feature of eForth is the small machine dependent kernel, which allows its to be ported to other CPU's very conveniently. The selection of words in this kernel is based on the criteria that they are very difficult if not impossible to synthesize from other primitive words.  From this set of kernel words, all other Forth words have to be built.  The kernel words can be classified as following:

 

System interface:           BYE, ?rx, tx!, !io

Inner interpreters:       doLIT, doLIST, next, ?branch,  branch, EXECUTE, EXIT

Memory access:          ! , @,  C!,  C@

Return stack:                       RP@,  RP!,  R>, R@,  R>

Data stack:                         SP@,  SP!,  DROP, DUP,  SWAP,  OVER

Logic:                              0<,  AND,  OR,  XOR

Arithmetic:              UM+

 

The virtual Forth computer is based on a tw-stack architecture.  The return stack is used to allow a high level word to be executed in the address list of another high level word.  It is very similar to the return stack used for nested subroutine calls in a conventional computer.  Before executing a high level word in an address list, the next address of the list is pushed on the return stack so that the IP register can be used to scan the address list in the called word.  When the called word is executed to completion, the stored address on the returned stack is popped back into IP register and execution of the calling word list can be continued.

 

The data stack is used to pass parameters from one word to another.  Conventional computers use the return stack to do the parameter passing, and it takes a very complicated compiler to figure out which are return addresses and which are parameters.  Forth segregated these two types of information on two separate stacks and thus greatly simplies the execution and compilation of words.  Passing parameter on the data stack also reduces the syntactical complexity of Forth language to the minimum and allows words to be strung together into lists with minimum overhead in compilation and interpretation.

 

The kernal words move and process data and address among the stacks and the memory.  They emcompass the minimal functionality necessary to make a computer to behave like a Forth computer.  A complete understanding of these kernel word is vital to the understanding of a virtual Forth computer.  However, it is not difficult to understand the kernel words, because there are only 31 of them.

 

It is my intention to use this eForth model to illustrate the validity of 'the Forth Law of Computing', which stated that all computable functions can be constructed by lists of these kernel words and the high level words built from these kernel words.  The eForth model includes a text interpreter which allows the user to type lists of word names and execute them in sequence, a compiler which allows the user to name lists of words and compile new words, and utilities like memory dump, stack dump, and a colon word decompiler.  Thus the eForth system forms a fairly complete software development environment for the user to develop applications.  If such a system can be built from this small set of kernel words, it should be obvious that most practical applications can also be built from it. 

 

 


System Interface

 

BYE returns control from eForth back to the operating system.  !io initializes the serial I/O device in the system so that it can interact with the user through a terminal.  These two words are not needed once the eForth system is up and running, but they are essential to bring the system up in DOS.  ?rx is used to implement ?KEY and KEY, and tx! is used to implement EMIT.  eForth communicates with the user through these words which supports terminal interactions and file download/upload.  Here these words are defined using the DOS service calls.  For embedded controllers, these three words must be defined for the specific I/O devices.

 

?RX is a unique design invented by Bill Muench to support serial input .  ?RX provides the functions required of both KEY and KEY? which accept input from a terminal.  ?RX inspects the terminal device and returns a character and a true flag if the character has been received and is waiting to be retrieved.  If no character was received, ?RX simply returns a false flag.  With ?RX, both KEY and KEY? can be defined as high level colon definitions.

 

TX! sends a character on the data stack to the terminal device.  Both ?RX and TX! are coded here as DOS calls.  In embedded applications, they will have to be coded in machine specific code to handle the specific serial I/O device.  !IO initializes the serial I/O device, which is not necessary here because it is taking care of by the DOS.  In embedded systems, the I/O device must be initialized by !IO.

 

CODE BYE    ( -- , exit Forth )

      INT   020H                    \ return to DOS

 

CODE  ?RX   ( -- c T | F )          \ Return input character and true,

                                    \ or a false if no input.

      $CODE 3,'?RX',QRX

      XOR   BX,BX                   \ BX=0 setup for false flag

      MOV   DL,0FFH                 \ input command

      MOV   AH,6                    \ MS-DOS Direct Console I/O

      INT   021H

      JZ    QRX3                    \ ?key ready

      OR    AL,AL                   \ AL=0 if extended char

      JNZ   QRX1                    \ ?extended character code

      INT   021H

      MOV   BH,AL                   \ extended code in msb

      JMP   QRX2

QRX1: MOV   BL,AL

QRX2: PUSH  BX                      \ save character

      MOV   BX,-1                   \ true flag

QRX3: PUSH  BX

      $NEXT

 

CODE  TX!   ( c -- )                \ Send character c to output device.

      POP   DX                      \ char in DL

      CMP   DL,0FFH                 \ 0FFH is interpreted as input

      JNZ   TX1                     \ do NOT allow input

      MOV   DL,32                   \ change to blank

TX1:  MOV   AH,6                    \ MS-DOS Direct Console I/O

      INT   021H                    \ display character

      $NEXT

 

CODE  !IO   ( -- )                  \ Initialize the serial I/O devices.

      $NEXT

 


Inner Interpreter

 

In the word list of a colon definition, it is generally assumed that words are execution addresses, which can be executed sequentially by the address interpreter $NEXT.  However, occasionally we do need to compile other types of data in-line with the words.  Special mechanisms must be used to tell the address interpreter to treat these data differently.  All data entries must be preceded by special words which can handle the data properly.  A special word and its associated data form a data structure.  Data structures are extensions of words and can be thought of as building blocks to form lists in colon definitions.

 

$NEXT must be assembled at the end of a code word.  It fetches the next address in the address list pointed to by IP and jumps to that address.  It allows an address list to be scanned and thus executed.  doLIST starts the execution of an address list by saving IP on the return stack and stores the starting address of an address list into IP, and then $NEXT starts executing this address list.  EXIT must be compiled as the last entry in an address list.  It terminates the execution of the current address list and returns execution to the address saved on the return stack.

 

EXECUTE takes the execution address from the data stack and executes that word.  This powerful word allows the user to execute any word which is not a part of an address list.

 

doLIT pushes the next word onto the data stack as an integer literal instead of as an addresses to be executed by $NEXT.  It allows numbers to be compiled as in-line literals, supplying data to the data stack at run time.  doLIT is not used by itself, but rather compiled by LITERAL which inserts doLIT and its asociated integer into the address list under construction.  Anytime you see a number in a colon definition, LITERAL is invoked to compile an integer literal with doLIT.

 

Integer literals are by far the most numerous data structures in colon definitions other than regular words.  Address literals are used to build control structures.  String literals are used to embed text strings in colon definitions.

 

$NEXT   MACRO

        LODSW                       \ load next word into WP (AX)

        JMP     AX                  \ jump directly to the word thru WP

        ENDM                        \ IP (SI) now points to the next word

 

doLIST      ( a -- )                \ Run address list in a colon word.

      XCHG  BP,SP                   \ exchange pointers

      PUSH  SI                      \ push return stack

      XCHG  BP,SP                   \ restore the pointers

      POP   SI                      \ new list address

      $NEXT

 

CODE  EXIT                          \ Terminate a colon definition.

      XCHG  BP,SP                   \ exchange pointers

      POP   SI                      \ pop return stack

      XCHG  BP,SP                   \ restore the pointers

      $NEXT

 

CODE  EXECUTE     ( ca -- )         \ Execute the word at ca.

      POP   BX

      JMP   BX                      \ jump to the code address

 

CODE  doLIT ( -- w )                \ Push inline literal on data stack.

      LODSW                         \ get the literal compiled in-line

      PUSH  AX                      \ push literal on the stack

      $NEXT                         \ execute next word after literal

 

 


Loops and Branches

 

eForth uses three different types of address literals. 'next', '?branch' and 'branch' are followed not by word addresses but by pointers to locations in a list to be executed next.  These address literals are the building blocks upon which loops and branching structures are constructed.  An address literal is followed by a branch pointer which causes execution to be transferred to that location.  The branch location most often points to a different location in the address list of the same colon word. 

 

CODE  next  ( -- )                  \ Decrement index and exit loop

                                    \ if index is less than 0.

      SUB   WORD PTR [BP],1         \ decrement the index

      JC    NEXT1                   \ ?decrement below 0

      MOV   SI,0[SI]                \ no, continue loop

      $NEXT

NEXT1:ADD   BP,2                    \ yes, pop the index

      ADD   SI,2                    \ exit loop

      $NEXT

 

CODE  ?branch     ( f -- )          \ Branch if flag is zero.

      POP   BX                      \ pop flag

      OR    BX,BX                   \ ?flag=0

      JZ    BRAN1                   \ yes, so branch

      ADD   SI,2                    \ point IP to next cell

      $NEXT

BRAN1:MOV   SI,0[SI]                \ IP:=(IP), jump to new address

      $NEXT

 

CODE  branch      ( -- )            \ Branch to an inline address.

      MOV   SI,0[SI]                \ jump to new address unconditionally

      $NEXT

 

Address literals are used to construct control structures in colon definitions.  'next' is compiled by NEXT.  '?branch' is compiled by IF, WHILE and UNTIL.  'branch' is compiled by AFT, ELSE, REPEAT and AGAIN.  In the colon words to be discussed in the later sections, you will not see these kernel words but words which construct loops and branches.  For examples:

 

            IF ( compiles ?branch and address after THEN ) <true clause>  THEN

            IF ( compiles ?branch and address after ELSE )  <true clause>

                        ELSE ( compiles branch and address after THEN ) <false clause>

                        THEN

            BEGIN (marks current address ) <loop clause>

                        AGAIN ( compiles branch and address after BEGIN )

            BEGIN ( mark current address ) <loop clause>

                        UNTIL ( compiles ?branch and address after BEGIN )

            BEGIN ( mark current address ) <loop clause>

                        WHILE ( compiles ?branch and address after REPEAT ) <true clause>

                        REPEAT ( compile branch and address after BEGIN )

            FOR  ( set up loop, mark current address ) <loop clause>

                        NEXT ( compile next and address after FOR )

            FOR ( set up loop, mark current address ) <loopclause>

                        AFT ( change marked address to current address, compile branch

                                    and address after THEN ) <skip clause>

                        THEN <loop clause>  NEXT ( compile next and address after AFT )

                       


Memory Access

 

Four memory accessing words are included in the eForth kernel: ! (store), @ (fetch), C! (C-store) and C@ (C-fetch).  ! and @ access memory in cells, whose size depends on the CPU underneath.  eForth assumes that the CPU can access memory in bytes and that all addresses are in the units of bytes.   C! and C@ allow the user access memory in bytes. 

 

The two most important resources in a computer are the CPU and the memory.  There is not much one can do with the CPU, except to use its instruction set to write programs.  However, the real usefulness and intelligence lies with the memory, which holds both the program and the data.  In conventional languages, you humbly request memory to store your data, and the compiler reluctantly allocate it to you.  If you exceed your memory allocation, your program will be ruthlessly terminated. 

 

In Forth, you have all the memory and you are allowed to do anything with the memory.  !, @, C! and C@ do not place restriction on their use.  You can use them to write self-modifying code if you like.  However, you must know exactly what you are doing.

 

It is not a very good idea to change the contents of the dictionary, except in the parameter fields of variables and arrays you defined specifically for data storage.  The space occupied by the stacks should be respected, too.  The user variable area holds vital information for the system to run correctly.  The space bewteen the code dictionary and the name dictionary are not used and you are free to use it to store temporary data.  Be reminded, however, that as you define new words, the dictionaries are extended and may over-write data you placed there.

 

The moral is: Use @ and C@ freely, but be careful with ! and C!.

 

CODE  !     ( w a -- )              \ Pop the data stack to memory.

      POP   BX                      \ get address from tos

      POP   0[BX]                   \ store data to that adddress

      $NEXT

 

CODE  @     ( a -- w )              \ Push memory location to data stack.

      POP   BX                      \ get address

      PUSH  0[BX]                   \ fetch data

      $NEXT

 

CODE  C!    ( c b -- )              \ Pop data stack to byte memory.

      POP   BX                      \ get address

      POP   AX                      \ get data in a cell

      MOV   0[BX],AL                \ store one byte

      $NEXT

 

CODE  C@    ( b -- c )              \ Push byte memory content on data stack.

      POP   BX                      \ get address

      XOR   AX,AX                   \ AX=0 zero the hi byte

      MOV   AL,0[BX]                \ get low byte

      PUSH  AX                      \ push on stack

      $NEXT

 

 


Return Stack

 

RP! pushes the address on the top of the data stack to the return stack and thus initializes the return stack.  RP! is only used to initialize the system and are seldom used in applications.  RP@ pushes the contents of the return stack pointer RP on the data stack.  It is also used very rarely in applications.

 

>R pops a number off the data stack and pushes it on the return stack..  R> does the opposite.  R@ copies the top item on the return stack and pushes it on the data stack.

 

The eForth system uses the return stack for two specific purposes: to save addresses while recusing through an address list, and to store the loop index during a FOR-NEXT loop.  As the addresses piled up on the return stack changes dynamically as words are executed, there is very little useful information the user can get from the return stack at the run time.  In setting up a loop, FOR compiles >R, which pushes the loop index from the data stack to the return stack.  Inside the FOR-NEXT loop, the running index can be recalled by R@.  NEXT compiles 'next' with an address after FOR.  when 'next' is executed, it decrements the loop index on the top of the return stack.  If the index becomes negative, the loop is terminated; otherwise, 'next' jumps back to the word after FOR.

 

Return stack is used by the virtual Forth computer to save return addresses to be processes later.  It is also a convenient place to store data temporarily.  The return stack can thus be considered as a extension of the data stack.  However, one must be very careful in using the return stack for temporary storage.  The data pushed on the return stack must be popped off before EXIT is executed.  Otherwise, EXIT will get the wrong address to return to, and the system generally will crash.

 

CODE  RP@   ( -- a )                \ Push current RP to data stack.

      PUSH  BP                      \ copy address to return stack

      $NEXT                         \ pointer register BP

 

CODE  RP!   ( a -- )                \ Set the return stack pointer.

      POP   BP                      \ copy (BP) to tos

      $NEXT

 

CODE  R>    ( -- w )                \ Pop return stack to data stack.

      PUSH  0[BP]                   \ copy w to data stack

      ADD   BP,2                    \ adjust RP for popping

      $NEXT

 

CODE  R@    ( -- w )                \ Copy top of return stack to data stack.

      PUSH  0[BP]                   \ copy w to data stack