How to create the effect of Multithreading in Linux shell scripts?

Technically speaking, there is no multithreading in Linux shell scripts, but its effect can be approximated. A script can start several processes that run in parallel in the background and then synchronize with them. The shell does not make this particularly easy or powerful, but the following techniques and methods achieve it.

1st – The ampersand (&), the Linux wait command and pipes (|)

In a Linux shell script, the effect of multithreading can be achieved by appending an ampersand (&) to a command, program or shell script, or even to a function call or code block inside a script. Whatever is called then runs in the background. The wait command pauses the script until all background processes have terminated. In other words, it stalls the flow until every background job has completed before continuing.
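As a minimal sketch of the & and wait combination, the script below starts three background jobs and then waits for all of them. The worker and run_jobs functions and the output file names are hypothetical, made up for illustration; sleep stands in for real work.

```shell
#!/usr/bin/env bash
# worker is a hypothetical stand-in for real work: it sleeps briefly,
# then records that it finished in a file under the directory given as $2.
worker() {
    sleep 0.2
    echo "job $1 done" > "$2/job_$1.txt"
}

# Start three workers in the background, then block until all have exited.
run_jobs() {
    for id in 1 2 3; do
        worker "$id" "$1" &    # the trailing & puts the call in the background
    done
    wait                       # returns once every background job has finished
}

outdir=$(mktemp -d)
run_jobs "$outdir"
cat "$outdir"/job_*.txt
rm -rf "$outdir"
```

Because the three workers overlap, the whole run should take roughly as long as one worker rather than three. To collect the exit status of an individual job, save $! after starting it and pass that PID to wait.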

Pipes (|) let one program (on the left side of the pipe symbol) supply its output as the input of another (on the right side), and the two run concurrently. Both sides are running at the same time, which can be seen from the fact that the right side pauses while the feeding program on the left side is still computing its next line. On their own, pipes only overlap the stages of a single pipeline, but when used with & and the wait command they do a decent job of running independent work in parallel.
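One way to combine the three, sketched under the assumption that each pipeline writes its result to its own file: two complete pipelines are sent to the background and wait blocks until both have finished. The count_matches helper and the sample data are hypothetical.

```shell
#!/usr/bin/env bash
# count_matches is a hypothetical helper that runs a whole pipeline.
# Usage: count_matches PATTERN INPUT_FILE OUTPUT_FILE
# It stores the number of distinct lines matching PATTERN in OUTPUT_FILE.
count_matches() {
    grep -- "$1" "$2" | sort | uniq | wc -l > "$3"
}

data=$(mktemp)
printf '%s\n' apple avocado banana apple blueberry apricot > "$data"

count_matches '^a' "$data" "$data.a" &   # both pipelines start right away
count_matches '^b' "$data" "$data.b" &
wait                                     # block until both results are written

echo "distinct a-words: $(cat "$data.a")"
echo "distinct b-words: $(cat "$data.b")"
rm -f "$data" "$data.a" "$data.b"
```

Writing to separate files keeps the two pipelines from interleaving their output; the results are only read after wait returns.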

2nd – Xargs

The main, conventional function of xargs (pronounced "ex-args") is to pass a large set of arguments to a command. It is normally used in conjunction with pipes, as shown in the following example:

ls * | xargs rm

This removes the files in the working directory. xargs collects the file names and passes as many as fit on one command line, so it is like saying rm file1.txt file2.txt … By default, xargs runs one command at a time, which is called "serial" execution, but it also has a lesser-known feature that is of interest here: it can act as a thread pool or, more correctly, as a process pool. This is achieved through the --max-procs=max-procs option of xargs, or its short form -P max-procs. The GNU documentation explains it very well:

“When parallelism works in your application, xargs provides an easy way to get your work done faster.

--max-procs=max-procs

-P max-procs

Run up to max-procs processes at a time; the default is 1. If max-procs is 0, xargs will run as many processes as possible at a time. Use the ‘-n’, ‘-s’, or ‘-L’ option with ‘-P’; otherwise chances are that the command will be run only once.

For example, suppose you have a directory tree of large image files and a makeallsizes script that takes a single file name and creates various sized images from it (thumbnail-sized, web-page-sized, printer-sized, and the original large file). The script is doing enough work that it takes significant time to run, even on a single image. You could run:

find originals -name '*.jpg' | xargs -n 1 makeallsizes

This will run makeallsizes filename once for each .jpg file in the originals directory. However, if your system has two central processors, this script will only keep one of them busy. Instead, you could probably finish in about half the time by running:

find originals -name '*.jpg' | xargs -n 1 -P 2 makeallsizes

xargs will run the first two commands in parallel, and then whenever one of them terminates, it will start another one, until the entire job is done.

The same idea can be generalized to as many processors as you have handy. It also generalizes to other resources besides processors. For example, if xargs is running commands that are waiting for a response from a distant network connection, running a few in parallel may reduce the overall latency by overlapping their waiting time.”
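The process-pool behaviour can be tried without any image files. The sketch below assumes nothing beyond xargs and sh: process_all is a hypothetical helper, and the sleep simply simulates a slow job such as makeallsizes.

```shell
#!/usr/bin/env bash
# process_all is a hypothetical helper: it reads items from stdin and runs
# one short-lived sh process per item, with at most $1 of them at a time.
process_all() {
    # -n 1: one argument per command; -P "$1": size of the process pool.
    # The trailing "sh" fills $0, so each item arrives as $1 in the inner script.
    xargs -n 1 -P "$1" sh -c 'sleep 0.2; echo "processed $1"' sh
}

# Six items with a pool of three should take about two rounds of sleeping.
# Completion order is not guaranteed, so the output is sorted for display.
printf '%s\n' a b c d e f | process_all 3 | sort
```

Raising or lowering the pool size passed to process_all shows the trade-off directly: with 1 the jobs run serially, with 6 they all run at once.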

3rd – GNU Parallel to achieve the effect of Multithreading

GNU Parallel is a shell tool for executing jobs in parallel using one or more processors on the same or different machines.

By default, it spawns processes according to the number of processors on the system. This can be customized with the option -j n (which means run n jobs in parallel).
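A minimal sketch of the -j option follows. It assumes GNU Parallel is installed as parallel, which is why it checks first; the have_gnu_parallel and run_parallel helpers are hypothetical names. The ::: separator introduces the list of arguments, {} is replaced by each argument, and -k keeps the output in input order.

```shell
#!/usr/bin/env bash
# have_gnu_parallel is a hypothetical check that the "parallel" on PATH is
# GNU Parallel (another tool with the same name ships in moreutils).
have_gnu_parallel() {
    command -v parallel >/dev/null 2>&1 &&
        parallel --version 2>/dev/null | grep -q 'GNU parallel'
}

# run_parallel is a hypothetical wrapper: it runs one job per argument.
run_parallel() {
    # -j 2: at most two jobs at a time; -k: print output in input order
    parallel -j 2 -k 'sleep 0.2; echo "processed {}"' ::: "$@"
}

if have_gnu_parallel; then
    run_parallel a b c d
else
    echo "GNU Parallel not found; install it to try this sketch" >&2
fi
```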

4th – coproc command to achieve the effect of Multithreading

“The coproc keyword starts a command as a background job, setting up pipes connected to both its stdin and stdout so that you can interact with it bidirectionally. Optionally, the co-process can have a name NAME. If NAME is given, the command that follows must be a compound command. If no NAME is given, then the command can be either simple or compound.” [Link1]

“A coprocess is a shell command preceded by the coproc reserved word. A coprocess is executed asynchronously in a subshell, as if the command had been terminated with the ‘&’ control operator, with a two-way pipe established between the executing shell and the coprocess.

The format for a coprocess is:

coproc [NAME] command [redirections]

This creates a coprocess named NAME. If NAME is not supplied, the default name is COPROC. NAME must not be supplied if command is a simple command.” [Link2]
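As a rough illustration of the two-way pipe, the sketch below starts a named coprocess and exchanges one line with it. It assumes bash 4 or later; the name UPPER, the to_upper helper and the upper_reply variable are made up for this example. The coprocess body is written with shell builtins so that each reply is sent immediately rather than held in an output buffer.

```shell
#!/usr/bin/env bash
# Start a coprocess named UPPER that upper-cases every line it reads.
# Because a NAME is given, the command is a compound command in braces.
coproc UPPER {
    while IFS= read -r line; do
        printf '%s\n' "${line^^}"
    done
}

# to_upper is a hypothetical helper: it sends one line to the coprocess and
# stores the answer in the global variable upper_reply.
# ${UPPER[1]} is the descriptor connected to the coprocess's stdin,
# ${UPPER[0]} the one connected to its stdout.
to_upper() {
    printf '%s\n' "$1" >&"${UPPER[1]}"
    IFS= read -r upper_reply <&"${UPPER[0]}"
}

to_upper "hello from the main shell"
echo "$upper_reply"
```

The helper sets a variable instead of printing, because the coprocess file descriptors are not available inside subshells such as command substitutions. The PID of the coprocess is available as UPPER_PID if it needs to be waited for or killed.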
