There are far too many workflow managers to enumerate here, but perhaps the most popular, at least in bioinformatics, is Nextflow. Like many workflow managers, Nextflow advertises itself as being “polyglot”: it claims to support many programming languages.
It is true that within a Nextflow script you can embed code written in several scripting languages (Bash, Python, Perl, and others). The following example of “polyglot” programming comes from the Nextflow online documentation (https://www.nextflow.io/example2.html):
#!/usr/bin/env nextflow

params.range = 100

/*
 * A trivial Perl script that produces a list of number pairs
 */
process perlTask {
    output:
    stdout

    shell:
    '''
    #!/usr/bin/env perl
    use strict;
    use warnings;

    my $count;
    my $range = !{params.range};
    for ($count = 0; $count < 10; $count++) {
        print rand($range) . ', ' . rand($range) . "\n";
    }
    '''
}

/*
 * A Python script which parses the output of the previous script
 */
process pyTask {
    input:
    stdin

    output:
    stdout

    """
    #!/usr/bin/env python
    import sys

    x = 0
    y = 0
    lines = 0
    for line in sys.stdin:
        items = line.strip().split(",")
        x += float(items[0])
        y += float(items[1])
        lines += 1

    print("avg: %s - %s" % ( x/lines, y/lines ))
    """
}

workflow {
    perlTask | pyTask | view
}
The script defines processes that wrap scripts in Perl and Python. Each script is a full program, hashbang and all. Nextflow takes these scripts, interpolates parameters, writes them to disk, and executes them. The parameter insertion, at least, is helpful. But embedding code in a Nextflow script separates the target language (Python or Perl) from its language ecosystem. For instance, you cannot easily run the popular autoformatter black or the typechecker mypy on Python code that is written inside a Nextflow file, and all language-specific IDE tooling is lost. All that is gained from including foreign code in a Nextflow file is string interpolation and a reduction in the number of files.
Here is my own Python program that reproduces the above Nextflow script:
#!/usr/bin/env python

import subprocess
import tempfile
import os
import re


def pipe(code, data_input=None, params={}):
    # String interpolation for all passed parameters
    for (key, val) in params.items():
        code = re.sub(
            pattern="!{" + key + "}",
            repl=str(val),
            string=code
        )

    try:
        # Create a temporary file to store the script
        with tempfile.NamedTemporaryFile(mode='w', delete=False) as temp_file:
            temp_file.write(code)
            temp_file_path = temp_file.name

        # Make the script executable
        os.chmod(temp_file_path, 0o755)

        # Execute the script, capturing the STDOUT and using input data, if provided
        result = subprocess.run([temp_file_path], capture_output=True, text=True, input=data_input)

        # Delete the temporary file
        if os.path.exists(temp_file_path):
            os.remove(temp_file_path)

        # Check if the command executed successfully
        if result.returncode == 0:
            # On success, return the STDOUT
            return result.stdout
        else:
            # On failure, return the STDERR
            return f"Error: {result.stderr}"
    except Exception as e:
        # If file creation fails, then write the error
        return f"Error: {str(e)}"


perl_code = """\
#!/usr/bin/env perl
use strict;
use warnings;

my $count;
my $range = !{params.range};
for ($count = 0; $count < 10; $count++) {
    print rand($range) . ', ' . rand($range) . "\n";
}
"""

python_code = """\
#!/usr/bin/env python
import sys

x = 0
y = 0
lines = 0
for line in sys.stdin:
    items = line.strip().split(",")
    x += float(items[0])
    y += float(items[1])
    lines += 1

print("avg: %s - %s" % ( x/lines, y/lines ))
"""


def main():
    perl_result = pipe(perl_code, params={"params.range": "100"})
    python_result = pipe(python_code, data_input=perl_result)
    print(python_result)


if __name__ == "__main__":
    main()
This Python code makes two system calls, one to an executable Perl file and one to an executable Python file. This is pretty standard scripting code. The only unusual bit is that we wrote the Perl and Python code as string literals in the controlling script. We then wrote them to temporary files, made them executable, and ran them as standard system applications. The code would have been simpler and more modular if we had written the Perl and Python code to their own dedicated files, as sketched below.
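For comparison, here is a minimal sketch of that more modular alternative. It assumes the two programs live in hypothetical executable files rand_pairs.pl and sum_pairs.py, and it sets parameter interpolation aside (the range would become a command-line argument in this version):

#!/usr/bin/env python
import subprocess

# Run the Perl script and capture its STDOUT
perl_result = subprocess.run(
    ["./rand_pairs.pl"], capture_output=True, text=True, check=True
)

# Feed that output to the Python script on STDIN and capture its STDOUT
python_result = subprocess.run(
    ["./sum_pairs.py"], capture_output=True, text=True,
    input=perl_result.stdout, check=True
)

print(python_result.stdout)

This is essentially what a one-line shell pipe would express; the composition of arbitrary-language executables is an operating system feature, a point I return to below.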
Multi-lingual programming is a major area of research in computer science. Pasting the code for applications in different languages into a script is hardly the polyglot holy grail that the community has been seeking. No, this is just a convoluted variation of normal scripting.
Let’s step back and look at what polyglot computing really is. Most work in this area has focused on “lateral” polyglot systems where a program in one primary language calls a function in a foreign language. The narrowest definition limits calls to foreign function interfaces (FFIs), where structures in memory are shared. In this case, a foreign function call may have little or no overhead relative to a within-language call. Open topics of research include typechecking across language boundaries and designing binary representations that are reusable between languages. A wider definition allows serialization-based interfaces. Work in this direction includes systems such as Google’s protobufs, which can generate, in many different languages, efficient code that serializes types to a common intermediate representation. This allows calls between programs that are entirely agnostic of the foreign implementation. These serialization engines allow rich data structures to be passed between languages and abstract away all the complexity of parsing and formatting. The programmer need only provide a specification of the common data types (i.e., the API). Cross-language serialization loses performance relative to FFIs, but it is easier to implement, more generalizable, and can be used to distribute computing across a network.
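As a rough sketch of the serialization-based idea (using Python's built-in json module rather than protobufs, so the shared-schema step is omitted), a producer encodes a rich data structure into a language-agnostic intermediate representation that any consumer, written in any language, can decode without knowing how the producer was implemented:

import json

# A rich data structure: a list of (x, y) pairs
pairs = [(3.5, 7.25), (1.0, 2.0)]

# Serialize to a language-agnostic intermediate representation. A
# protobuf-based system would instead generate this serialization code
# from a shared specification of the common data types.
payload = json.dumps(pairs)

# Any consumer can decode the payload; here we simply decode it in
# Python again. Note that the tuples come back as JSON arrays (lists).
decoded = json.loads(payload)
assert decoded == [[3.5, 7.25], [1.0, 2.0]]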
Nextflow offers no binary interoperability between languages and provides no form of language-specific serialization. Nextflow can be described as a composition language in which a high-level script describes the coordination of lower-level nodes. The nodes a Nextflow script manages are UNIX applications. These applications can be written in any language and executed, but that is a feature of the OS, not of Nextflow.
My language, morloc, in contrast, is a true polyglot language. Like Nextflow, morloc is a composition language. But the nodes in morloc are functions in specific languages rather than applications. morloc generates interfaces between these functions via automatic, type-directed serialization. Here is the morloc implementation of the Nextflow pipeline. The morloc script foo.loc sources functions from the Python and Perl files “foo.py” and “foo.pl”. The pipeline output is stored in the result term:
module foo (result)
import types
source Perl from "foo.pl" ("randPairs")
randPairs :: Int -> [(Real, Real)]
source Py from "foo.py" ("sumPairs")
sumPairs :: [(Real, Real)] -> (Real, Real)
result = sumPairs (randPairs 100)
Each sourced function is accompanied by a type signature that succinctly describes the domain and co-domain of the function. randPairs is a function from an integer to a list of pairs of real numbers. sumPairs is a function from a list of pairs of reals to a pair of reals. These signatures lessen our reliance on fallible hand-written documentation. Further, if we misuse our functions, the morloc typechecker will raise an error message at compile time (rather than surprise us at runtime).
The Perl file defines the function randPairs:
use strict;
use warnings;

sub randPairs {
    my ($range) = @_;
    my @pairs;
    for (my $i = 0; $i < 10; $i++) {
        my $x = rand($range);
        my $y = rand($range);
        push @pairs, [$x, $y];
    }
    return @pairs;
}
And the Python file defines the function sumPairs:
def sumPairs(pairList):
    xSum = 0
    ySum = 0
    for (x, y) in pairList:
        xSum += x
        ySum += y
    return (xSum, ySum)
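Because sumPairs is an ordinary function in an ordinary Python module, it can be exercised directly with the language’s own tooling. A quick sanity check in a Python session (entirely outside of morloc, assuming foo.py is on the Python path) might look like this:

from foo import sumPairs

# Element-wise sums of the x and y components
print(sumPairs([(1.0, 2.0), (3.0, 4.0)]))  # prints (4.0, 6.0)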
Notice the beautiful absence of all string manipulation, string-to-float conversions, and IO operations. Also notice that the range parameter, 100, is passed in as a typechecked integer literal. There are no hacks to insert text into foreign language source code. These two files are not applications and are not executable. They are modules containing natural function definitions that the morloc compiler will import into the generated code.
Perhaps most importantly, data structures are passed between the functions rather than strings. In the Nextflow code, a table of numbers is printed with comma delimiters, a space after each comma, and UNIX newline characters. The Perl code and the Python code have to agree on these conventions or the code may silently break. The Python programmer must understand the implementation details of the Perl code. In morloc, the two functions may be written independently.
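To make “silently break” concrete, here is a hedged, hypothetical example: suppose the producing side were changed to format numbers with decimal commas, as some locales do. The consuming parser below raises no error at all; it simply computes nonsense:

# The consumer expects "x, y" with decimal points, e.g. "3.5, 7.25"
line = "3,5, 7,25"   # the same pair, written with decimal commas

items = line.strip().split(",")
x = float(items[0])  # 3.0, silently wrong; no exception is raised
y = float(items[1])  # 5.0, the fractional part of x, not y at all
print(x, y)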
Full disclosure: the above morloc code will not currently work because Perl is not yet supported. Currently morloc supports Python, R, and C++. Incidentally, Nextflow does not support compiled languages like C++. The main reason, I suspect, is that there is no demand: anyone writing C++ code would prefer to work in their own ecosystem, where they have better control over the build process.
In summary, morloc is polyglot because it allows composition of pure, idiomatic functions across languages. Nextflow is polyglot only in the sense that it can make system calls to executable files implemented in various languages.