How I built my tiny programming language - ModLang

Jun 5, 2021 · 7 min read · c++ project core computer science ·

Share on:

Abstraction

As technology is advancing, the layers of abstraction are also increasing. What was once a necessity to learn in order to create something is now hidden away in abyss. It is not a bad thing but the very essence of advancement.

Curiosity

I have always been very curious about learning how things worked at the deepest level. But often people (and I myself) have questioned if it is worth digging ocean deep when you know you won't be required to know it in your day to day work, or that you could mostly get away without it.

Even past this sort of reasoning, there comes the notion of reinventing the same wheel when you could do something else "worthwhile". I finally decided to put an end to this cycle of thoughts and just get to "doing" rather than "thinking of doing".

While tinkering around developing and competitive coding, I became ever-so curious about how programming languages came into being and how it all worked in the first place. I wanted to know the magic under the black box.

There's so many languages out there today, each having their own strengths and weaknesses, or I should say, their own purpose(s). It doesn't feel right to compare two languages or claim one is better than the other.

What are Programming Languages truly?

Programming languages are sets of abstract mathematical rules, definitions, and restrictions to be followed by the programmer while writing the source code. But they are meaningless to the computer unless it is converted into a form that the computer can understand, ie, machine language, or machine code. This conversion (or translation) is done by programs called the Interpreter or the Compiler.

Every programming language can be implemented by a compiler. Every programming language can be implemented by an interpreter. Many modern mainstream programming languages have both interpreted and compiled implementations.

Interestingly, we have Compiler Design subject in this semester (6) and there couldn't be a better time to put the theory learnt into practise. So I built a minimal, toy programming language, named Mod (or ModLang) in C++ and STL (Standard Template Library) with no other external dependencies. It is dynamically typed and interpreted and it's syntax is a blend of my favourite styles and features in C++, Python and JavaScript.

How does an Interpreter work?

There are mainly 3 phases in an Interpreter:

Lexing (Scanning)
Parsing
Evaluating

Lexing or Scanning is the phase in which the source code is converted to a set of tokens.
A Parser then builds a data structure called Parse Tree or Abstract Syntax Tree (AST) giving a structural representation of the input and checks for correct syntax in the process.
Lastly, the Evaluator evaluates the nodes in the AST and produces the desired output.

Implementation

Following are the rough blueprints of how each phase looks like in code:

1. Lexing

Initially, all the tokens - data types, data structures, keywords, and other symbols are defined as per the language design.

 1class Token
 2{
 3public:
 4	TokenType type;
 5	std::string literal;
 6
 7	Token() {}
 8	Token(TokenType type, std::string literal) : type(type), literal(literal) {}
 9};
10
11const TokenType INTEGER = "INTEGER"; // 'int' data type
12const TokenType ARRAY = "ARRAY"; // 'array' data structure
13const TokenType IDENT = "IDENT"; // identifiers
14const TokenType LET = "LET"; // 'let' keyword
15...

...

gdscript3

The sequence of input characters are then converted into corresponding tokens. If there's no match, an error is thrown.

 1class Lexer
 2{
 3private:
 4	std::string input;
 5	int pos; // current position in input
 6	int readPos; // current reading position in input
 7	char curChar; // current char under examination
 8
 9	void skipWhitespace();
10	void readChar();
11	char peekChar();
12	std::string readIdentifier();
13	std::string readNumber();
14	std::string readString();
15
16public:
17	void New(std::string &input);
18	Token nextToken();
19};

...

2. Parsing

The tokens are wrapped as corresponding nodes and inserted into an AST according to the syntax and statements defined in the language design.

Parsing can be done in 2 ways:

Bottom-up
Top-down

Mod has a Bottom-Up Recursive Descent Parser or Pratt Parser named after it's inventor Vaughan Pratt (in 1973). It works by calling associated functions repeatedly in a recursive fashion.

 1// Interfaces 
 2
 3class Node
 4{
 5public:
 6	virtual std::string tokenLiteral() = 0;
 7	virtual std::string getStringRepr() = 0;
 8	virtual std::string nodeType() = 0;
 9
10	virtual ~Node() {}
11};
12
13class Statement : public Node
14{
15public:
16	// " "
17	virtual ~Statement() {}
18};
19
20class Program : public Node
21{
22public:
23	std::vector<Statement *> statements;
24	// " "
25	virtual ~Program() {}
26};
27
28class Expression : public Node
29{
30public:
31    // " "
32	virtual ~Expression() {}
33};

...

 1class Parser
 2{
 3private:
 4	Lexer lexer;
 5	Token curToken;
 6	Token peekToken;
 7	std::vector<std::string> errors;
 8
 9	std::unordered_map<TokenType, Precedence> precedences;
10
11	std::unordered_map<TokenType, PrefixParseFn> prefixParseFns;
12	std::unordered_map<TokenType, InfixParseFn> infixParseFns;
13
14	Statement *parseStatement();
15
16	Expression *parseExpression(Precedence precedence);
17	Expression *parsePrefixExpression();
18	Expression *parseInfixExpression(Expression *);
19    ...
20
21	Expression *parseIdentifier();
22	Expression *parseIntegerLiteral();
23    ...
24
25public:
26	void New(Lexer &lexer);
27	void nextToken();
28
29	Program *ParseProgram();
30};

...

This is by far the hardest phase as there are a lot of aspects to consider in order to correctly process the tokens and create the AST. Expressions need to be evaluated based on precedence. Based on the location the token is found, we apply associated infix, prefix or postfix functions to it. The whole logic must be very generalized and be applicable in different contexts. This is where it's beauty lies in too! It is so beautifully vowen to produce exactly the intended result whilst returning error(s) by the programmer (if any).

3. Evaluating

The nodes of the AST are then evaluated on-the-fly by the Interpreter and hence is named - Tree-Walking.

 1class Evaluator
 2{
 3private:
 4	Object *evalProgram(Program *program, Environment *env);
 5
 6	Object *evalPrefixExpression(std::string operand, Object *right);
 7	Object *evalInfixExpression(std::string operand, Object *left, Object *right);
 8
 9	Object *evalIdentifier(Identifier *ident, Environment *env);
10    ...
11
12	Environment *extendFunctionEnv(Function *fn, std::vector<Object *> &args, Environment *outer);
13
14public:
15	Object *Eval(Node *node, Environment *env);
16};

...

Objects are created during variable definition and are kept track of by the Environment, which is essentially a hash map storing variable name (key) and it's object (value).

Features of ModLang

Integer, boolean, and string primitive data types.
Basic arithmetic for integer expressions.
If-else conditional statement and while loop statement.
First class functions, higher-order functions, closures and recursion.
Data structures like arrays, hashmaps, hashsets, stacks, queues and more.
Built-in utility functions.
An REPL (Read Evaluate Print Loop) shell for experimenting.

The link to the source code of Mod can be found here. I am open to suggestions and contributions.

It is mesmerizing how all of these features are packed into few thousands of lines of code giving a really good bird's eye view of the real world implementation of the same, which in contrast to this, consists of millions of lines of code.

Future Plans

There are a couple of features I'd love to work on incrementally:

Garbage collection - Currently unused objects or objects out of scope are not deallocated automatically leading to a lot of memory usage and wastage.
Class - Bring Object oriented programming paradigm to Mod.
Macro - Piece of code in a program that is replaced by the value of the macro.

What I learnt with this project

How to write my own programming language!
Modern C++ and STL
- better understanding of C++ and its working.
- implementing interfaces using pure virtual functions.
- making proper use of pointers - function pointers.
Use 'make' tool for speeding up development and automating code compilation.

Mistakes

Reading too much - I spent too much timing reading and researching and kept implementation to the last. Building on the fly would have been a lot better.
Making things unecessarily repetitive and boring - failing to adhere to DRY principle, not looking up ways to improve development productivity sooner ('make' tool), silly bugs, forgetting to commit to GitHub, having to rework on things, etc.

Takeaways

Low Level / System programming is not too hard as I thought it would be. It just requires a bit more in-depth know-how of things and dedication.

Acknowledgement

While looking up for resources, I came across few books worth gold, without which the project could not have been possible. Mod is based on the implementation given in these books.

Special thanks to Lijo Zechariah for coming up with this cool logo for Mod: