Playing in the wrong design space

by Zebulun Arendsee

Simplicity is a great virtue but it requires hard work to achieve it and education to appreciate it. And to make matters worse: complexity sells better. – Dijkstra

A bioinformatics workflow is often idealized as a directed graph of pure functions. A pure function is a map from an input domain to an output codomain. Pure functions are desirable because they are predictable, testable, and easily parallelized or memoized for performance. They make code easier to reason about, more robust, and more maintainable. The executable workflow languages (EWLs) commonly used in bioinformatics, however, deviate strongly from this ideal. The root cause of this deviation is that the executable nodes in the workflow carry too many added levels of complexity. Here I will introduce the six layers of application complexity, the seventh layer that workflow languages add, and the consequences all this complexity has on the field of bioinformatics.

Six layers of application complexity

The first layer of complexity arises because nodes are executables that operate on the entire system rather than functions confined to a program’s memory space. The inputs consist of command line arguments and data accessed from the system. This interaction with the OS adds complexity, such as the need to parse arguments, resolve file paths, handle file opening and permission issues, and consider portability and security. The designer must choose whether input comes from STDIN, from a file provided as an argument, or from a file named in a configuration file.
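
For example, below is a minimal Python sketch, not taken from any particular tool, of the input-resolution boilerplate this layer forces on every executable; the tool description, flag names, and config field are hypothetical.

    import argparse
    import json
    import sys

    def resolve_input(argv=None):
        # choose among the three input conventions described above
        parser = argparse.ArgumentParser(description="hypothetical sequence tool")
        parser.add_argument("input", nargs="?", help="input file (default: STDIN)")
        parser.add_argument("--config", help="JSON config that may name the input file")
        args = parser.parse_args(argv)

        if args.config:                  # 1. path taken from a config file
            with open(args.config) as fh:
                return open(json.load(fh)["input"])
        if args.input:                   # 2. path given as a positional argument
            return open(args.input)
        return sys.stdin                 # 3. fall back to reading STDIN

    if __name__ == "__main__":
        print(sum(1 for _ in resolve_input()), "lines read")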

The second layer of complexity is the representation of the result, and its effects, on the system. The outputs include files written to the operating system and other effects, such as updates sent to databases. In a pure function, the result is simply the value that is returned. Here, the designer must choose whether to write output to STDOUT, to a given filename, or to an automatically generated filename. When output files already exist, the designer must choose whether to overwrite them, stop and raise an error, or write to an alternative path. In many cases, output includes more than one file. These extra files may contain logging info, stored data for intermediate steps, temporary files, or collections of output files that cannot be stored by any one of the conventional bioinformatics formats. Overall, there are many reasonable combinations of choices for any given tool.
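
A corresponding sketch of the output side, again hypothetical Python rather than any real tool, shows how much policy hides behind "write the result": both the destination and the overwrite behavior have to be decided and exposed somehow.

    import os
    import sys

    def open_output(path=None, on_exists="error"):
        if path is None:
            return sys.stdout                    # write to STDOUT
        if os.path.exists(path):
            if on_exists == "error":             # stop and raise an error
                raise FileExistsError(path)
            if on_exists == "rename":            # write to an alternative path
                path = path + ".1"
            # on_exists == "overwrite": fall through and truncate the file
        return open(path, "w")

    out = open_output("results.txt", on_exists="rename")
    out.write("done\n")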

The third layer of complexity involves the parsing of the data extracted from the system. Whereas a function operates on native data structures, the executable must parse input binary/text streams. For a given input, the data may be compressed, archived, or even encrypted, so an unwrapping step may be necessary. There are a few ways to see if unwrapping is needed. The file extension could be read (e.g., does the filename end with “.gz”), the data could be checked for binary magic numbers indicating filetype, the user could pass an argument specifying filetype, or the tool could assume the data is of one type (e.g., uncompressed). Once the raw data is unwrapped, it must be parsed into a data structure. In bioinformatics, there are many common formats. These include FASTA for sequences, GFF for features on a sequence, VCF/BCF for variants, SAM/BAM for alignments, PDB for structures, Newick (among others) for trees, CSV for tabular data, and many others. The format may be selected by user choice or may be detected by checking the filename extension or by parsing the file. The formats are mostly not well specified, so the parser must take user input to resolve ambiguities or make assumptions about formatting. The output data must also be formatted before being written to disk, and data and metadata may be encoded in many ways.
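
A sketch of the unwrapping and parsing steps might look like the following; detecting gzip by its magic number and hand-parsing FASTA are only two of the many checks and assumptions real tools pile on here.

    import gzip

    def open_maybe_gzip(path):
        with open(path, "rb") as fh:
            magic = fh.read(2)
        if magic == b"\x1f\x8b":          # gzip magic number
            return gzip.open(path, "rt")
        return open(path)

    def parse_fasta(handle):
        # yield (header, sequence) pairs from a FASTA stream
        header, seq = None, []
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line[1:], []
            else:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)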

The fourth layer of complexity is that the parsed input data may be contaminated with superfluous data that is unused by the core function. This is a consequence of the input formats, which are typically self-describing. They may contain provenance information, comments, or data fields that are not used by this particular tool. This superfluous data must be separated from the data that is passed to the core algorithm. After running the core function, the superfluous data may be passed on directly. For example, a function that aligns sequences in a FASTA file may preserve the FASTA headers. But sometimes the operation invalidates the superfluous data. If instead of aligning, the algorithm alters the sequence by, for example, optimizing codon usage across an mRNA, then annotations in the header specifying the mRNA accession number are no longer correct – this is a new sequence. Since FASTA headers are not standardized, any operation that depends on the content of the input header will be brittle. Alternatively, the header could be completely replaced, but this leads to loss of information. This problem tends to make it difficult to thread information through a pipeline.
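
The codon-optimization example can be made concrete with a small, hypothetical sketch: one operation can safely pass the headers through, while the other leaves them half-wrong, and no standard says what should happen to them.

    def recode(seq):
        return seq  # placeholder for the real codon-recoding step

    def align(records):
        # headers pass through untouched; only the sequences change
        # (right-padding stands in for a real alignment)
        width = max(len(seq) for _, seq in records)
        return [(header, seq.ljust(width, "-")) for header, seq in records]

    def optimize_codons(records):
        # the sequence is new, so header annotations such as an accession
        # number may no longer describe it; what should the new header be?
        return [(header + " [codon-optimized]", recode(seq))
                for header, seq in records]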

The fifth layer of complexity is that even the decontaminated data may have more structure than is needed by the core function. It is the responsibility of the tool to map the core function over a larger data structure. For example, suppose a scientist develops an algorithm to predict some feature of a sequence. Although this is a function of a single sequence, since the input is a FASTA file that may contain many entries, the scientist must map the prediction algorithm over all entries. This may be easily accomplished with a for-loop, but that misses an opportunity for parallelism. Now if parallelism is required in a pipeline, the pipeline designer is responsible for splitting the FASTA file, feeding each piece to the tool, and recombining the output. The situation may be even worse if the scientist attempts to parallelize the program: then the implementation is locked to a particular parallelization scheme, with potentially heavy dependencies and architecture assumptions.
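
A hypothetical sketch of this layer: the core prediction is a function of one sequence, but the tool must map it over every entry in the file, and any parallelism (here Python multiprocessing) becomes part of the tool rather than a choice left to the workflow.

    from multiprocessing import Pool

    def predict(seq):
        # stand-in for the real per-sequence prediction
        return sum(base in "GC" for base in seq) / max(len(seq), 1)

    def predict_all(records, processes=None):
        headers = [header for header, _ in records]
        seqs = [seq for _, seq in records]
        if processes:                                # tool-chosen parallelism
            with Pool(processes) as pool:
                scores = pool.map(predict, seqs)
        else:                                        # the plain for-loop version
            scores = [predict(seq) for seq in seqs]
        return list(zip(headers, scores))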

The sixth layer of complexity arises when many functions interact on a system. The executable name of every program will typically be in the system namespace. When two applications share the same name, one will mask the other. This can cause installations to silently break existing workflows on the system. The size of the namespace can be reduced by running the workflow in a container or by locally redefining the PATH variable (on UNIX). But this only reduces the size of the problem. If a workflow actually uses two tools with the same name, then it will have to rename one of them or specify the pathname for each. A drastic solution is to use absolute pathnames for all commands in the workflow. This requires binding the names of all commands to paths provided by the user (e.g., in a workflow config file).

A common way to lessen this problem is to bundle many functions together in one monolithic program. This creates ad hoc namespaces with potentially consistent interfaces. A common convention is to have the first argument of a command specify the subcommand. For example, the command to clone a git repository starts with git clone, where git is the command name and clone the subcommand. These monolithic programs may wrap other programs, as the Genome Analysis Toolkit (GATK) wraps the packages Picard and bwa. While the monolith approach provides local consistency and lessens the namespace problem, it also raises several new issues. First, it requires installation of heavy toolboxes even when only one tool is needed. Second, it typically requires installation of all dependencies for all tools, even when most are unused. Third, monoliths are susceptible to feature creep, leading to ever more complex programs with high maintenance costs and high variability across versions. Fourth, monoliths require complex user interfaces and complex code to resolve conflicts between arguments. Fifth, failure of any one part of the tool can cause the entire tool to break. Overall, monoliths tend to be brittle and bloated.
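
The masking problem is easy to demonstrate; the snippet below (using samtools only as a familiar example name) prints the copy a workflow would actually run and any other copies hidden behind it on PATH.

    import os
    import shutil

    name = "samtools"
    print("runs:", shutil.which(name))               # first match on PATH wins
    for directory in os.environ["PATH"].split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            print("found:", candidate)               # every copy, masked or not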

It is time to unmask the computing community as a Secret Society for the Creation and Preservation of Artificial Complexity – Dijkstra

Each level of complexity adds additional modes of failure. This is because each layer contracts the input domain. The domain of the entire system is mapped down to just the input file handles and arguments; this domain is mapped to byte strings; these are mapped to parsed formats; these to internal data structures; these to used data; and finally these to the inputs of particular functions. This domain contraction leads to failing states for all values in the original domain that fall outside the contracted domain. For example, if a function is passed a text stream (the original domain) but a FASTA formatted file is expected (the contracted domain), then all inputs other than FASTA must raise (hopefully informative) errors. Further, the reduced domain may be too small and raise errors for reasonable values. For example, if the input domain is byte strings and the output domain is reals, most possible values will necessarily map to errors (e.g., “cat”), but depending on the parser, some values that could be interpreted as numeric may fail as well, such as “4.3e-4” or “1,344”. More generally, the shape of the contracted input domain varies, which means the failure profile will differ across implementations.
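
The numeric example can be made concrete; exactly where the contracted domain ends depends on the parser, so two implementations of the "same" tool can fail on different inputs. The sketch below uses Python's built-in float parser.

    def parse_real(text):
        # the contracted domain is whatever float() happens to accept
        try:
            return float(text)
        except ValueError:
            raise ValueError(f"not a number: {text!r}")

    parse_real("4.3e-4")     # accepted by this parser
    # parse_real("cat")      # fails, as it should
    # parse_real("1,344")    # also fails, though a human reads 1344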

Each level of complexity also adds additional layers of ambiguity. The ambiguity may be addressed with layers of user input, more parameters, or configuration files. Alternatively, the ambiguity may be addressed with a more opinionated solution where assumptions about the system, data format, data structure, data propagation, and output data format are all hardcoded into the program. The former increases cognitive load (more parameters and docs to read) while the latter decreases flexibility. Both approaches prevent otherwise identical tools from being substituted unless they happen to have made the same design choices. At each level, many design choices must be made. Each tool plots a different path through these layers, and the user must relearn this path for every tool. Often the assumptions about formatting, structure, and the propagation of superfluous data are not documented, so the user must learn by experimentation. This large design space also greatly increases the initial development time of a tool and its long-term maintenance burden.

Thus, in the prevailing bioinformatics paradigm, complexity is pushed to the application. Each new tool must implement these six layers of complexity. Every application developer is responsible for managing a large software engineering project rather than the simple implementation of an algorithm. Since these layers of complexity contain so many degrees of freedom, the tools can become very idiosyncratic. No standards can be enforced without agreement across the community and the reimplementation of past tools. Further, since all tools are independently responsible for parsing and formatting, formats cannot be refined without refactoring all existing applications. Thus format conventions are dominated by the most commonly used legacy software. All of this complexity and inter-dependence leads to a brittle ecosystem that is highly resistant to innovation.

Workflow languages: the seventh layer

Workflow languages add a seventh layer of complexity. They take a set of complex, idiosyncratic tools and wrap each in a new script or plugin that provides a standardized interface to the application. Then they provide a runtime to simplify file handling and distribution. To address the bloated dependencies of these complex tools, they add dependency handlers and containers. To improve performance, they offer coarse parallelization of applications by running lists of files on subprocesses or distributing them over the cloud. To manage the complexity of the application UIs, and the application-level parallelization, they require config files that map fields in the config to arguments passed to the wrapped application. In short, they deal with the complexity of the applications by wrapping them in a box and exposing a simpler interface to the user. This is brittle: any change in the application can conflict with the wrapper, and the wrapper must be re-evaluated whenever the wrapped application changes.
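
In plain Python rather than any particular workflow language, the wrapper pattern looks roughly like this; the tool name and flag names below are hypothetical, which is exactly the point: the wrapper encodes assumptions about the tool's interface that can silently go stale.

    import subprocess

    def run_aligner(config):
        # config fields are translated into command-line arguments
        cmd = ["hypothetical-aligner",
               "--threads", str(config.get("threads", 1)),
               "--reference", config["reference"],
               config["reads"]]
        # if the tool renames or reinterprets a flag, this breaks at runtime
        return subprocess.run(cmd, check=True, capture_output=True)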

Workflow languages lack most of the expressiveness of modern programming languages. First, compound data structures are generally limited to the standard formats accepted by the nodes. Second, the only usable types for most bioinformatics nodes are the input format names (such as FASTA, BAM, GFF, etc.). There is no clear method for further parameterizing these types to capture the variation in these files. These fuzzy definitions prevent the use of true generics. For example, the FASTA format couples an annotation (the header) with the sequence, but since there is no means to specify or infer the type of the annotation, there is no general operation that can be done on it. So the header annotation is not a proper generic type, but rather is more like a void pointer in C: since the annotation could be anything, and since we know nothing about it, we can in general do nothing with it except pass it on. Third, since nodes have ill-defined domains/codomains, and since there are many possible parameterization choices the designers may have made, it is not possible in general for a node to be passed as an argument to another node. This means that writing functions that do generic things, like mapping a function over the elements of a collection, is not generally possible.
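
For contrast, here is a sketch of what a properly parameterized sequence type could look like, written in Python's typing notation purely for illustration: once the annotation has a real type parameter, generic operations on it become expressible rather than impossible.

    from dataclasses import dataclass
    from typing import Callable, Generic, TypeVar

    A = TypeVar("A")   # the annotation (header) type
    B = TypeVar("B")

    @dataclass
    class Seq(Generic[A]):
        annotation: A
        sequence: str

    def map_annotation(f: Callable[[A], B], record: Seq[A]) -> Seq[B]:
        # a generic operation on the annotation, possible only because its
        # type is a parameter rather than an opaque string
        return Seq(f(record.annotation), record.sequence)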

An alternative way of dealing with complexity is to simplify the objects themselves. This is the approach taken by morloc. With morloc, all seven layers of complexity are cut away. The algorithm designer writes pure functions using native data structures. The workflow designer composes these functions. They can easily implement fine-scaled parallelization using a method of their choice. Conventional bioinformatics formats, if used at all, are only required once when loading outside data into the pipeline or writing final results, and a single dedicated library may be used for each format. Functions in a morloc composition that share the same type signature may be freely substituted. The entire workflow may be statically typechecked. The rich type-level information allows automatic generation of user interfaces, serialization, and interoperability, and serves as a form of machine-checked documentation.
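
The following is not morloc syntax; it is only a plain-Python illustration of the shape of the approach: the algorithm author writes pure functions over native data structures, the workflow author composes them, and a bioinformatics format appears only at the boundary where data enters or leaves.

    def gc_content(seq: str) -> float:
        return sum(base in "GC" for base in seq) / max(len(seq), 1)

    def classify(gc: float) -> str:
        return "high-GC" if gc > 0.6 else "normal"

    def workflow(records):
        # records were parsed once at the boundary into (header, sequence)
        # pairs; inside the workflow there is nothing but function composition
        return [(header, classify(gc_content(seq))) for header, seq in records]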
