How Does the Bash++ Compiler Work?

The source tree for the compiler might seem confusing or intimidating at first – especially if you start with src/main.cpp – but it’s actually pretty simple, if a bit long-winded.

NOTE: It’s my hope that this can serve as a helpful introduction to the compiler for any would-be contributors to Bash++. This article is valid as of Bash++ version 0.4.x. Although I don’t anticipate any fundamental changes to the way the compiler works, nevertheless, this article may become outdated in the future. At any rate, this article only explains how the compiler works in the most broad strokes.

We use ANTLR4 to generate the lexer and parser for Bash++. Here, only a very simplistic explanation of how ANTLR4 works is given. For a more in-depth explanation, please refer to the ANTLR4 documentation.

A fair warning also that the current implementations for the Lexer and the Parser are altogether a terrible mess, and will likely be rewritten in the future.

Lexer
Parser
Walking the Parse Tree
- Entities and the Entity Stack

Lexer

It all starts with the lexer.

The lexer, or lexical analyzer, reads in the raw source code typed by a Bash++ programmer and breaks it down into a series of tokens. Each token has a type and a value. This way, we can identify keywords, identifiers, operators, and other elements of the language.

The lexer rules are defined in src/BashppLexer.g4

The basis of the lexer is essentially a set of regular expressions – but there’s also some more complicated logic in there as well.

For example, here’s a snippet of how we determine whether the token in is a keyword or just an ordinary identifier:

BASH_KEYWORD_IN: 'in' {
	switch (modeStack.top()) {
		case mode_quote:
		case mode_heredoc:
		case mode_arith:
		case mode_array_assignment:
			emit(incoming_token_can_be_lvalue ? IDENTIFIER_LVALUE : IDENTIFIER, getText());
			break;
		default:
			if (metaTokenCount != most_recent_case_keyword_metatoken_position + 2
				&& metaTokenCount != most_recent_for_keyword_metatoken_position + 2) {
				emit(incoming_token_can_be_lvalue ? IDENTIFIER_LVALUE : IDENTIFIER, getText());
			}
	}
};

First, we check the mode stack to see what we’re currently doing. If we’re inside a quoted string, a heredoc, an arithmetic expression, or an array assignment, then we immediately know that in can’t possibly be a keyword. So we emit it as an identifier.

For any other cases, we check the Meta-Token Count to see if the in token is part of a for or case statement. If it is, then we emit it as a keyword. Otherwise, we emit it as an identifier.

An in-depth explanation of the concept of meta-tokens and how we count them can be found in the source code for the lexer. We won’t go into it here.

If you ever want to see the lexer’s output for a given source file, you can run the compiler with the -t option:

bpp -t ./source_file.bpp

This will print out a list of all the tokens emitted by the lexer, along with their types and values (and will not compile the program). This can be very helpful for debugging purposes.

Finally, once we’ve identified all of the tokens in the source code, we can pass them to the parser.

Parser

The parser takes a good, long look at the tokens emitted by the lexer and tries to make sense of them. It does this by checking the tokens against a set of grammar rules defined in src/BashppParser.g4.

For instance, we know that a class definition has to start with the class keyword followed by the class name. After that, we expect to find a class body (methods, data members, etc) in-between a pair of curly braces. So we can check our token stream against this grammar rule and see if we come up with any matches.

Once we’ve found matches and decided what the structure of the code is, the parser builds a parse tree (shown in the above animation) that represents the structure of the program. You can see in the above-generated tree that the program contains two classes, A and B, that class A contains a method and a data member, and that class B contains a method.

The tree structure might look a little unfamiliar, but it’s altogether pretty simple.

The node at the very top is the root node, which represents the entire program. Underneath that, we have a series of child nodes that represent the different things inside the program (in this case, a couple of classes). Then, underneath those nodes are their children, that is, the things inside them. And so on and so forth.

If you’re ever curious about what the parser sees when it parses a given source file, you can run the compiler with the -p option:

bpp -p ./source_file.bpp

This will print out a rough representation of the parse tree to the terminal (and will not compile the program). This can be very helpful for debugging purposes, as well.

Now that we have the parse tree, we can start to do something useful with it. We can walk the tree from top to bottom and do something with each node we encounter.

Walking the Parse Tree

The listener is a special kind of object that we can use to walk the parse tree. It automatically calls a series of functions as it encounters different nodes in the tree. For example, when it encounters a class definition, it will call the enterClass_definition function. When it encounters a method definition, it will call the enterMethod_definition function. And so on and so forth.

The listener is outlined in src/listener/BashppListener.h. Each of the different functions are defined in files under src/listener/handlers.

So here’s the outline:

First, we enter a node in the tree. When we do this, we call the enter function for that node.
Then, we enter that node’s first child. When we do this, we call the enter function for that child node.
Once we’ve hit the bottom, we exit the child node. When we do this, we call the exit function for that child node.
Only then (after we’ve finished with all the children) do we exit the parent node. When we do this, we call the exit function for that parent node.
And then we repeat the process for the next sibling node, etc etc

All the while, we’re generating compiled code at each step along the way.

Here’s an example snippet of how we handle a class definition (src/listener/handlers/class_definition.cpp):

void BashppListener::enterClass_definition(BashppParser::Class_definitionContext *ctx) {
	skip_syntax_errors
	std::shared_ptr<bpp::bpp_class> new_class = std::make_shared<bpp::bpp_class>();
	entity_stack.push(new_class);

	// Get the class name
	std::string class_name = ctx->IDENTIFIER(0)->getText();

	// Verify that the class name is valid
	if (is_protected_keyword(class_name)) {
		entity_stack.pop();
		throw_syntax_error(ctx->IDENTIFIER(0), "Invalid class name: " + class_name);
	}

	// ... (abbreviated) ...
	// More sanity checks on the class name are here in the actual source code
	// ... (abbreviated) ...

	new_class->set_name(class_name);

	// Inherit from a parent class if specified
	if (ctx->IDENTIFIER().size() > 1) {
		std::string parent_class_name = ctx->IDENTIFIER(1)->getText();
		std::shared_ptr<bpp::bpp_class> parent_class = program->get_class(parent_class_name);
		if (parent_class == nullptr) {
			entity_stack.pop();
			throw_syntax_error(ctx->IDENTIFIER(1), "Parent class not found: " + parent_class_name);
		}
		new_class->inherit(parent_class);
	}
}

You can see that the process for handling any given node is pretty straightforward – each piece only has to worry about its one job.

In the above snippet, entering a class definition, we first create an object to represent the new class (of the type bpp::bpp_class). We do some basic sanity checks to guarantee that the class’s name is valid, etc. Then, we push it onto the entity stack.

Entities and the Entity Stack

The entity stack is another relatively simple idea. We can push things onto the top of the stack and pop them off again. This way, we can keep track of what we’re currently working on.

When we enter a class definition, we create an object to represent the class and we push it onto the entity stack. Now, when we enter some other thing inside the class (like a method), we can just look at the top of the stack to see what we’re currently working on – this tells us where to put the new method.

When we exit the class definition, we pop the class object off the stack, announcing that we’re done with it.

The different kinds of entities are outlined in src/bpp_include/bpp.h. Some different kinds of entities are classes, methods, data members, objects, and so on. Each of these entities has its own set of properties and methods that we can use to manipulate them.

For example, there is a distinction between bpp::bpp_entity and bpp::bpp_code_entity. A code entity is an entity that can contain code directly inside it, like a method for example. A non-code entity is an entity that cannot contain code directly inside it, like a class or an object.

Once we’ve handled something in the tree (maybe the echo command, for example) and generated compiled code for it, we don’t need to worry about where to put this code – we just put it inside whatever’s at the top of the entity stack. This makes sure that the code inside your method is understood to be inside the method, and not just go somewhere random in the program.

In general, once we’re ready to pop an entity off the stack, we put all of its code into whatever’s next on the stack. So your echo command gets put inside the method’s code, and then the method’s code gets put inside the broader program.

All of the different kinds of entities are defined in files under src/bpp_include and outlined in src/bpp_include/bpp.h.

Let’s take a look at the exitClass_definition function (src/listener/handlers/class_definition.cpp):

void BashppListener::exitClass_definition(BashppParser::Class_definitionContext *ctx) {
	skip_syntax_errors
	std::shared_ptr<bpp::bpp_class> new_class = std::dynamic_pointer_cast<bpp::bpp_class>(entity_stack.top());

	if (new_class == nullptr) {
		throw internal_error("entity_stack top is not a bpp_class", ctx);
	}

	entity_stack.pop();

	new_class->finalize(program);

	// Add the class to the program
	program->add_class(new_class);
}

You can see it’s pretty simple! We check that the top of the entity stack is the class we’re looking for. If not, something’s gone horribly wrong (a bug in the compiler, since everything inside the class should’ve been popped off the stack by now). We pop the class off the stack and finalize it. At last, we add the class to the program.

Notes

This article is a very high-level overview of how the Bash++ compiler works. There are many more details and intricacies that I haven’t covered here. My hope is only to make the compiler a bit more accessible to anyone who might be interested in contributing to it.

If you’re interested in learning more, I encourage you to take a look at the source code and the ANTLR4 documentation. In addition, we have Doxygen documentation to explain the different classes and functions in the codebase in more detail. This can be very helpful for understanding how everything fits together.

Finally, just because I’m so pleased with how it turned out, here’s the full animation from lexical analysis to walking the parse tree:

Contents

Lexer

Parser

Walking the Parse Tree

Entities and the Entity Stack

Notes