Nextflow is not polyglot

by Zebulun Arendsee

There are far too many workflow managers to enumerate here, but perhaps the most popular, at least in bioinformatics, is Nextflow. Like many workflow managers, Nextflow advertises itself as “polyglot”, claiming to support many programming languages.

It is true that within a Nextflow script you can write code from several scripting languages (Bash, Python, Perl and others). The following code is an example of “polyglot” programming from their online documentation (https://www.nextflow.io/example2.html):

#!/usr/bin/env nextflow
 
params.range = 100
 
/*
 * A trivial Perl script that produces a list of number pairs
 */
process perlTask {
    output:
    stdout
 
    shell:
    '''
    #!/usr/bin/env perl
    use strict;
    use warnings;
 
    my $count;
    my $range = !{params.range};
    for ($count = 0; $count < 10; $count++) {
        print rand($range) . ', ' . rand($range) . "\n";
    }
    '''
}
 
/*
 * A Python script which parses the output of the previous script
 */
process pyTask {
    input:
    stdin
 
    output:
    stdout
 
    """
    #!/usr/bin/env python
    import sys
 
    x = 0
    y = 0
    lines = 0
    for line in sys.stdin:
        items = line.strip().split(",")
        x += float(items[0])
        y += float(items[1])
        lines += 1
 
    print("avg: %s - %s" % ( x/lines, y/lines ))
    """
}
 
workflow {
    perlTask | pyTask | view
}

The script defines processes that wrap scripts in Perl and Python. Each script is a full program, hashbang and all. Nextflow takes these scripts, interpolates parameters, writes them to disk, and executes them. The parameter insertion, at least, is helpful. But writing code inside a Nextflow script cuts the embedded code (Python/Perl) off from its own language ecosystem. For instance, you cannot easily run the popular autoformatter black or the typechecker mypy on Python code that is written in a Nextflow file. Language-specific IDE tooling is lost as well. All that is gained from including foreign code in a Nextflow file is string interpolation and a reduction in the number of files.

Here is my own Python program that reproduces the above Nextflow script:

#!/usr/bin/env python
import subprocess
import tempfile
import os
import re

def pipe(code, data_input=None, params=None):
    # String interpolation for all passed parameters
    for (key, val) in (params or {}).items():
        code = re.sub(
            # Escape the placeholder so "{", "}" and "." are matched literally
            pattern=re.escape("!{" + key + "}"),
            repl=str(val),
            string=code
        )

    try:
        # Create a temporary file to store the script
        with tempfile.NamedTemporaryFile(mode='w', delete=False) as temp_file:
            temp_file.write(code)
            temp_file_path = temp_file.name

            # Make the script executable
            os.chmod(temp_file_path, 0o755)
        
        # Execute the script capturing the STDOUT and using input data, if provided
        result = subprocess.run([temp_file_path], capture_output=True, text=True, input=data_input)

        # Delete the temporary file
        if os.path.exists(temp_file_path):
            os.remove(temp_file_path)
        
        # Check if the command executed successfully
        if result.returncode == 0:
            # On success, return the STDOUT
            return result.stdout
        else:
            # On failure, return the STDERR
            return f"Error: {result.stderr}"
    except Exception as e:
        # If file creation fails, then write the error
        return f"Error: {str(e)}"

# Raw string, so that Python does not expand the "\n" in the Perl source
perl_code = r"""#!/usr/bin/env perl
use strict;
use warnings;

my $count;
my $range = !{params.range};
for ($count = 0; $count < 10; $count++) {
    print rand($range) . ', ' . rand($range) . "\n";
}
"""

python_code = """\
#!/usr/bin/env python
import sys

x = 0
y = 0
lines = 0
for line in sys.stdin:
    items = line.strip().split(",")
    x += float(items[0])
    y += float(items[1])
    lines += 1

print("avg: %s - %s" % ( x/lines, y/lines ))
"""

def main():
    perl_result = pipe(perl_code, params = {"params.range" : "100"})
    python_result = pipe(python_code, data_input=perl_result)
    print(python_result)

if __name__ == "__main__":
    main()

This Python code makes two system calls, one to an executable Perl file and one to an executable Python file. This is pretty standard scripting code. The only unusual bit is that we wrote the Perl and Python code as string literals in the controlling script. We then wrote them to temporary files, made them executable, and ran them as standard system applications. The code would have been simpler and more modular if we had written the Perl and Python code to their own dedicated files.
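As a rough sketch of that simpler version, assume the two scripts live in their own files. The file names rand_pairs.pl and avg_pairs.py are hypothetical, as is the choice to pass the range as a command-line argument instead of interpolating it into the source. The controlling script then reduces to chaining commands:

```python
import subprocess
import sys

def chain(commands, data_input=None):
    """Run each command in order, feeding one command's STDOUT to the next."""
    data = data_input
    for command in commands:
        # check=True raises CalledProcessError on failure, so an error
        # message is never passed downstream as if it were data
        result = subprocess.run(
            command, input=data, capture_output=True, text=True, check=True
        )
        data = result.stdout
    return data

def main():
    # Hypothetical file names; both scripts would need to exist on disk
    output = chain([
        ["perl", "rand_pairs.pl", "100"],
        [sys.executable, "avg_pairs.py"],
    ])
    print(output, end="")
```

Calling main() runs the pipeline. With the scripts in dedicated files, black, mypy, and IDE tooling work on them as usual.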

Multi-lingual programming is a major area of research in computer science. Pasting the code for applications in different languages into a script is hardly the polyglot holy grail that the community has been seeking. No, this is just a convoluted variation of normal scripting.

Let’s step back and look at what polyglot computing really is. Most work in this area has focused on “lateral” polyglot systems, where a program in one primary language calls a function in a foreign language. The narrowest definition limits these calls to foreign function interfaces (FFIs), where structures in memory are shared. In this case, a foreign function call may have little or no overhead relative to a within-language call. Open research topics include typechecking across language boundaries and designing binary representations that are reusable between languages.

A wider definition allows serialization-based interfaces. Work in this direction includes systems such as Google’s Protocol Buffers (protobuf), which generate efficient code for serializing types in many different languages to a common intermediate representation. This allows calls between programs that are entirely agnostic of the foreign implementation. These serialization engines allow rich data structures to be passed between languages and abstract away the complexity of parsing and formatting. The programmer need only provide a specification of the common data types (i.e., the API). Cross-language serialization loses performance relative to FFIs, but it is easier to implement, more generalizable, and can be used to distribute computing across a network.
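The two definitions can be contrasted in a few lines of Python. This is only an illustrative sketch under several assumptions: it expects a POSIX system where the C library is reachable through ctypes.CDLL(None), it uses JSON as a stand-in for a common intermediate representation such as protobuf, and it uses a second Python interpreter as a stand-in for a program written in a foreign language.

```python
import ctypes
import json
import subprocess
import sys

# 1. FFI: call the C function abs() from the C library already loaded into
#    this process. Assumes a POSIX system, where CDLL(None) exposes libc.
libc = ctypes.CDLL(None)
libc.abs.argtypes = [ctypes.c_int]
libc.abs.restype = ctypes.c_int

def ffi_abs(n):
    # The integer is handed to C as a native value; nothing is serialized
    return libc.abs(n)

# 2. Serialization: send a data structure to a separate program as JSON.
#    A second Python interpreter stands in for the foreign language here.
FOREIGN_PROGRAM = """
import json, sys
pairs = json.load(sys.stdin)
json.dump([sum(p[0] for p in pairs), sum(p[1] for p in pairs)], sys.stdout)
"""

def serialized_sum(pairs):
    # Encode the argument, run the foreign program, decode its result
    result = subprocess.run(
        [sys.executable, "-c", FOREIGN_PROGRAM],
        input=json.dumps(pairs), capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)
```

The first call stays inside one process and shares memory with the C code. The second pays for encoding, a process launch, and decoding, but neither side needs to know how the other is implemented.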

Nextflow offers no binary interoperability between languages and provides no form of language-specific serialization either. Nextflow can be described as a composition language, where a high-level script describes the coordination of lower-level nodes. The nodes that a Nextflow script manages are UNIX applications. These applications can be written in any language and executed, but that is a feature of the OS, not of Nextflow.

My language, morloc, in contrast, is a true polyglot language. Like Nextflow, morloc is a composition language. But the nodes in morloc are functions in specific languages rather than applications. morloc generates interfaces between these functions via automatic, type-directed serialization. Here is the morloc implementation of the Nextflow pipeline. The morloc script foo.loc sources functions from the Perl and Python files “foo.pl” and “foo.py”. The pipeline output is stored in the result term:

module foo (result)
import types 

source Perl from "foo.pl" ("randPairs")
randPairs :: Int -> [(Real, Real)]

source Py from "foo.py" ("sumPairs")
sumPairs :: [(Real, Real)] -> (Real, Real)

result = sumPairs (randPairs 100)

Each sourced function is accompanied by a type signature that succinctly describes the domain and co-domain of the function. randPairs is a function from an integer to a list of pairs of real numbers. sumPairs is a function from a list of pairs of reals to a pair of reals. These signatures lessen our reliance on fallible hand-written documentation. Further, if we misuse our functions, the morloc typechecker will raise an error message at compile time (rather than surprise us at runtime).
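To give a feel for what “type-directed” serialization means, here is a minimal Python sketch. It is not morloc’s implementation, and the schema notation is invented for this example. The point is only that a description of the type, such as [(Real, Real)], is enough to derive the code that converts serialized data into native values and rejects data of the wrong shape.

```python
import json

# Hypothetical schema notation (not morloc syntax): "int", "real",
# ("list", element_schema), or ("tuple", schema_1, schema_2, ...)
PAIR_LIST = ("list", ("tuple", "real", "real"))  # stands in for [(Real, Real)]

def unpack(schema, value):
    """Convert parsed JSON into native Python data, guided only by the schema."""
    # bool is a subclass of int in Python, so exclude it explicitly
    if schema == "int" and isinstance(value, int) and not isinstance(value, bool):
        return value
    if schema == "real" and isinstance(value, (int, float)) and not isinstance(value, bool):
        return float(value)
    if isinstance(schema, tuple) and isinstance(value, list):
        kind, *parts = schema
        if kind == "list":
            return [unpack(parts[0], item) for item in value]
        if kind == "tuple" and len(parts) == len(value):
            return tuple(unpack(s, v) for s, v in zip(parts, value))
    raise TypeError(f"{value!r} does not match schema {schema!r}")

def deserialize(schema, text):
    """Parse JSON text and check it against the schema while converting."""
    return unpack(schema, json.loads(text))
```

For example, deserialize(PAIR_LIST, "[[1, 2.5], [3, 4]]") yields a list of float pairs, while a string where a number is expected raises a TypeError. Neither function on either side of the boundary has to contain any of this logic.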

The Perl file defines the function randPairs:

use strict;
use warnings;

sub randPairs {
    my ($range) = @_;
    my @pairs;
    for (my $i = 0; $i < 10; $i++) {
        my $x = rand($range);
        my $y = rand($range);
        push @pairs, [$x, $y];
    }
    return @pairs;
}

And the Python file defines the function sumPairs:

def sumPairs(pairList):
    xSum = 0
    ySum = 0
    for (x,y) in pairList:
        xSum += x
        ySum += y
    return (xSum, ySum)

Notice the beautiful absence of all string manipulation, string to float conversions, and IO operations. Also notice that the range parameter, 100, is passed in as a typechecked integer literal. There are no hacks to insert text into foreign language source code. These two files are not applications and are not executable. They are modules containing natural function definitions that the morloc compiler will import into the generated code.

Perhaps most importantly, data structures are passed between the functions rather than strings. In the Nextflow code, a table of numbers is printed with comma delimiters, a space after each comma, and UNIX newline characters. The Perl code and the Python code have to agree on these conventions or the code may silently break. The Python programmer must understand the implementation details of the Perl code. In morloc, the two functions may be written independently.
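The fragility of these text conventions is easy to demonstrate. The sketch below reuses the parsing logic from pyTask and feeds it two hypothetical changes on the producing side: a switch to tab delimiters and an added index column. Both changes are invented for illustration.

```python
def parse_pairs(text):
    """Parse lines of 'x, y' text the same way pyTask does."""
    pairs = []
    for line in text.splitlines():
        items = line.strip().split(",")
        pairs.append((float(items[0]), float(items[1])))
    return pairs

# Producer and consumer agree on the format, so parsing works
agreed = parse_pairs("1.5, 2.5\n3.0, 4.0\n")

# Hypothetical change 1: the producer switches to tab delimiters.
# The consumer now fails, but only at runtime.
try:
    parse_pairs("1.5\t2.5\n")
    tab_error = None
except ValueError as error:
    tab_error = error

# Hypothetical change 2: the producer adds a leading index column.
# Nothing fails; the consumer silently reads the wrong columns.
shifted = parse_pairs("0, 1.5, 2.5\n")
```

The second case is the dangerous one: shifted holds the index and the x value as if they were an (x, y) pair, and the pipeline reports an average without complaint.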

Full disclosure, the above morloc code will not currently work because Perl is not yet supported. Currently morloc supports Python, R, and C++. Incidentally, Nextflow does not support compiled languages like C++. The main reason they do not, I suspect, is that there is no demand. Anyone writing C++ code would prefer to write in their own ecosystem where they have better control over the build process.

In summary, morloc is polyglot because it allows composition of pure, idiomatic functions across languages. Nextflow is polyglot only in the sense that it can make system calls to executable files implemented in various languages.

built on 2024-03-13 21:43:54.026524143 UTC from file 2024-02-27-nextflow-is-not-polyglot