Multiprocess Concurrency in Bash++: Shared Data Structures

There’s a degree to which I’m uncomfortable making decisions about Bash++. When questions come up about how the language should work, and they could have more than one answer, I find myself wishing that there were a committee or collective body that could discuss them.

A good example of this is the question of how to handle static and dynamic linking. Initially, I put this as an option in the compiler – as in C++. However, in time I came to think that it made more sense for it to be a part of the language itself, and so was born the syntax:

@include dynamic "/path/to/compile-time/library.bpp" as "/path/to/run-time/library.sh"

I’m now convinced that this is the right way to do it for our language, but at the time it was a difficult decision to make. I’m much more comfortable working collectively than individually. Nevertheless, it’s my own language, and there is no committee to take these responsibilities from me.

The latest of these decisions is the introduction of SharedStack and SharedQueue to the Bash++ standard library.

Justification

Bash++ does not have “threads” – we have processes. The shell can fork processes, and when it does, the child process inherits the parent’s environment. But the child process is not a thread of the parent process. It is a separate process with its own memory space.

What this means is that these two processes do not share memory. The way that a forked process inherits its parent’s environment is by copying it. This means that if you change the environment in the child process, it does not affect the parent process, and vice versa.
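To make the copying concrete, here is a minimal sketch in plain Bash (not Bash++). The variable name `counter` is purely illustrative. The parenthesised subshell is a forked child: it starts with a copy of the parent's variable, and its assignment only touches that copy.

```bash
#!/usr/bin/env bash
# Illustrative only: 'counter' is a made-up variable name.
counter=1

(
	# This subshell is a forked child holding its own copy of 'counter'.
	counter=2
	echo "child sees: $counter"
)

# The parent's copy was never touched by the child.
echo "parent sees: $counter"
```

The child prints 2 and the parent still prints 1, which is exactly the problem the rest of this post is about.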

So what if we do want to share memory between processes? What if we have a multi-process application that needs to use stacks and queues, and coordinate between its many children?

Enter ‘Shared’ Data Structures

The classes SharedStack and SharedQueue are designed to allow multiple processes to share the same stack or queue, respectively.

It would be a mistake to write the spec in such a way that we become “married” to the implementation – it’s far more important to focus on the effect rather than the implementation – but I will discuss the implementation here for clarity.

The implementation of SharedStack and SharedQueue is based on the idea of using a file to store the data. When a process wants to push an item onto the stack or queue, it writes the item to the file. When it wants to pop an item off, it reads from the file. Because all of our forked processes have access to the file system, they can all read from and write to the same file.

We can make it concurrency-safe by using file locks. When a process wants to push or pop an item, it first acquires a lock on the file. While the lock is held, other processes using the same stack or queue cannot read from or write to the file. Once the operation is complete, the lock is released. If we ask for a lock and it’s not available, we block until it is. This guarantees that each push or pop is atomic with respect to the others: no two processes can read or write to the file at the same time, which would otherwise risk corrupting the data.
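The idea can be sketched in plain Bash. This is not the standard library's actual code: the helper names `stack_push` and `stack_pop`, the separate `.lock` file, and the one-item-per-line layout are all assumptions made for illustration. It also assumes the `flock` utility (from util-linux) is installed and that items contain no newlines.

```bash
#!/usr/bin/env bash
# Sketch only: stack_push/stack_pop are hypothetical helpers, not the Bash++ API.
# Assumes flock(1) is installed and that items contain no newlines.

stack_file="$(mktemp)"
lock_file="$stack_file.lock"

stack_push() {
	{
		flock -x 9                          # block until we hold the exclusive lock
		printf '%s\n' "$1" >> "$stack_file" # one item per line, top of stack last
	} 9>"$lock_file"                        # the lock is released when fd 9 closes
}

stack_pop() {
	local top status=0
	{
		flock -x 9
		if [ -s "$stack_file" ]; then
			top="$(tail -n 1 "$stack_file")"
			# Rewrite the file without its last line, keeping the same inode.
			sed '$d' "$stack_file" > "$stack_file.tmp"
			cat "$stack_file.tmp" > "$stack_file"
			rm -f "$stack_file.tmp"
		else
			status=1                        # empty stack: nothing to pop
		fi
	} 9>"$lock_file"
	[ "$status" -eq 0 ] || return 1
	printf '%s\n' "$top"
}

# Two forked writers share the stack through the file system.
{ stack_push "item1"; stack_push "item2"; } &
{ stack_push "item3"; stack_push "item4"; } &
wait

echo "items on stack: $(wc -l < "$stack_file")"
echo "popped: $(stack_pop)"

rm -f "$stack_file" "$lock_file"
```

Note that `flock` locks are advisory: they only coordinate processes that also take the lock, which is fine here because every access goes through the same two helpers. Which item gets popped depends on how the two writers were scheduled.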

When a SharedStack or SharedQueue is created, it creates a temporary file to store the data. This is done by the constructor. The path to this file is set as a data member of the class. The destructor deletes the file.

One concern might be the security of the data. The first way we address this is by chmodding the file to be readable and writable only by the user who created it. This means that other users on the system cannot read or write to the file, even if they know its path.
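As a rough sketch of that lifecycle in plain Bash: the function names below are hypothetical stand-ins for the constructor and destructor, not the real class methods. `mktemp` typically creates files that are already private to the owner, so the explicit `chmod 600` here is a belt-and-braces step.

```bash
#!/usr/bin/env bash
# Sketch only: these function names are hypothetical, not the Bash++ API.

# "Constructor": create the backing file and restrict it to the owner.
shared_file_create() {
	local path
	path="$(mktemp)" || return 1
	chmod 600 "$path"        # owner read/write only; nothing for group or others
	printf '%s\n' "$path"
}

# "Destructor": remove the backing file.
shared_file_destroy() {
	rm -f "$1"
}

data_file="$(shared_file_create)"
ls -l "$data_file"           # the mode column should begin with -rw-------
shared_file_destroy "$data_file"
```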

If we need to prevent other processes running under the same user from accessing the file, we can tell the object that we’d like to encrypt the data. This is done by calling the setEncrypted method on the object. This will cause the data to be encrypted before it is written to the file, and decrypted when it is read from the file. The encryption is done using a randomly-generated symmetric key which is stored only in memory and is never written to disk. In order to use this feature, you must have the openssl command-line tool installed on your system.
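Here is one way the round trip could look in plain Bash. Everything specific in it is an assumption for illustration rather than a description of what the library does: the helper names, the choice of AES-256-CBC with PBKDF2, the base64 encoding, and the `SHARED_KEY` variable name. It needs an `openssl` build that supports the `-pbkdf2` option.

```bash
#!/usr/bin/env bash
# Sketch only: helper names, cipher choice and SHARED_KEY are assumptions,
# not the Bash++ implementation.

# Random key held only in a shell variable; it is never written to disk.
key="$(openssl rand -hex 32)"

encrypt_item() {
	# -a -A base64-encodes the ciphertext onto a single line (one item per line).
	# -pass env:NAME reads the key from the environment, keeping it off the command line.
	printf '%s' "$1" |
		SHARED_KEY="$key" openssl enc -aes-256-cbc -pbkdf2 -a -A -pass env:SHARED_KEY
}

decrypt_item() {
	printf '%s\n' "$1" |
		SHARED_KEY="$key" openssl enc -d -aes-256-cbc -pbkdf2 -a -A -pass env:SHARED_KEY
}

cipher="$(encrypt_item "item1")"
echo "what would be stored on disk: $cipher"
echo "what a reader gets back:      $(decrypt_item "$cipher")"
```

Because the key lives in a variable created before any fork, child processes inherit a copy of it and can decrypt what their siblings wrote.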

Example Usage

You’ll have to instantiate the SharedStack or SharedQueue object before forking any processes that will use it. As I mentioned earlier, the child process inherits the parent’s environment at the time it’s forked – so in order for it to have access to the same SharedStack or SharedQueue, it must be created in the parent process before the fork.

@include <SharedStack>

@SharedStack stack
@stack.setEncrypted true

{
	@stack.push "item1"
	@stack.push "item2"
} &

{
	@stack.push "item3"
	@stack.push "item4"
} &

The two blocks above will run in parallel, each backgrounded as a child process by the shell. Of course, we can’t predict the order in which the items will be pushed onto the stack, but we can be sure that they will all be pushed onto the same stack, and in an atomic way that prevents data corruption.

In this example, we also turn on encryption for the stack. This means that the items will be encrypted before they are written to the file, and decrypted when they are read from the file. Enabling this encryption is a bit of a performance hit, but you may find it necessary if you’re working with sensitive data.

The decision of whether to encrypt the data must be made before the first item is pushed onto the stack or queue. Once the first item is pushed, the encryption setting cannot be changed. The reason for this is that, if we started using it among multiple processes and then decided we didn’t want to encrypt the data anymore (or vice versa), the child processes wouldn’t get the memo.

Conclusions (Or Lack Thereof)

This is the first time that Bash++ has introduced a feature which relies on external command-line tools. There’s something that feels philosophically unsatisfying about introducing an external dependency to the language. However, I think that the benefits of being able to share data between processes outweigh this concern.

For safety, any application which includes the SharedStack or SharedQueue classes will automatically terminate with an error message if the required command-line tools are not available. The very top of each of these STL files makes this check before even declaring the classes themselves.
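A check of this kind can be sketched in plain Bash as follows. The function name `require_tools` is hypothetical, and the real library files may perform the check differently and for a different list of tools; `openssl` is used here only because it is the one tool named above.

```bash
#!/usr/bin/env bash
# Sketch only: require_tools is a hypothetical helper, not the Bash++ STL code.

require_tools() {
	local tool
	for tool in "$@"; do
		# 'command -v' succeeds only if the shell can find the tool.
		if ! command -v "$tool" >/dev/null 2>&1; then
			echo "error: required tool '$tool' not found" >&2
			return 1
		fi
	done
}

# Fail fast, before anything else in the file is declared.
require_tools openssl || exit 1
```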

The question of whether this implementation is the best way to do this is also open as far as I’m concerned. Is it right to use the file system? Is there a better way to do this? If you have better ideas, I’m open to hearing them.

Another concern is that the file which is created is only deleted when the destructor is called. In Bash++, destructors are only called if the object goes out of scope before the program exits or if the @delete command is called directly. This is a consequence of the fact that we’re compiling to Bash: we can’t know at compile-time where the program will exit. If the program exits without the destructor being called, the file will not be deleted. Again, it feels philosophically problematic to insist that the user must remember to call @delete on the object before the program exits. There may be a way to remedy this, perhaps by using traps or signal handlers to ensure that destructors are always called (something which we haven’t explored just yet), or perhaps by some other mechanism (e.g., working with a file that has already been deleted while it is still open, or similar).
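For what it's worth, the trap idea might look something like this in plain Bash. This is an unexplored sketch, not something Bash++ does today, and `run_worker` is a hypothetical stand-in for a program that owns a temporary file and exits without cleaning up explicitly.

```bash
#!/usr/bin/env bash
# Sketch only: run_worker is a hypothetical stand-in, not Bash++ behaviour.

run_worker() {
	(
		tmp_file="$(mktemp)"
		# Ask the shell to remove the file when this subshell exits.
		trap 'rm -f "$tmp_file"' EXIT
		printf '%s\n' "$tmp_file"
		exit 0    # simulate an early exit with no explicit cleanup call
	)
}

path="$(run_worker)"
if [ -e "$path" ]; then
	echo "file was left behind: $path"
else
	echo "file was cleaned up"
fi
```

An EXIT trap cannot help if the process is killed with SIGKILL, since that signal cannot be caught, so this would narrow the problem rather than eliminate it.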

In the future, I would like to see the standard library grow. In particular, I would like to see thread-like abstractions built on top of the shell’s forking abilities that make it easier to write concurrent applications. My thinking is that SharedStack and SharedQueue are a good first step in that direction. If only there were a collective body to confirm!