Part 1: Inputs, outputs, and a case for simplicity

by Zebulun Arendsee

This post reviews the installation and execution of the command line ARAGORN program, describes the program’s inputs and outputs, explains the many options, and finally discusses the challenges that ARAGORN faces as a file-oriented program.

Installation

The ARAGORN browser-based graphical user interface (http://www.trna.se/ARAGORN/).

ARAGORN can be accessed online, but we will focus on the command line tool. The official source code is available here. Unfortunately, it is not officially hosted on GitHub. There is, however, an in-progress effort by the super scientist deprekate (see here) to write a Python package with C bindings to ARAGORN.

ARAGORN is a single-file C program that can be easily compiled with the GNU C Compiler (gcc) like so:

gcc -o aragorn aragorn1.2.41.c

After all my bitter years of debugging bioinformatics installations, it is great to see one that compiles this cleanly. The code was written over two decades ago and has survived until today with little maintenance. The key to ARAGORN’s longevity is that the program is written in a stable language (ANSI C), has few dependencies (only the system libraries stdio and stdlib), and has no complex build environment. Going forward with the refactor, we want to be damn sure that we don’t lose this stability.

File input

The program can take a FASTA file input as its sole argument. For our case study, we retrieved a few bacterial tRNAs from tRNADB-CE:

>C181099742|CP024420|Alphaproteobacteria|Brucella
gggcgcctgtGCCCTTATAGCTCAGTTGGTAGAGCACCTGATTTGTAATCAGGGGGTCGGGAGTTCGAGTCTCTCTGGGGGCACCActtttcgaca
>C191077422|CP031664|Bacillota|Staphylococcus
tccattttatGGGGGCTTAGCTCAGCTGGGAGAGCGCCTGCTTTGCACGCAGGAGGTCAGCGGTTCGATCCCGCTAGTCTCCACCAtttatttttt
>C181212629|LS483426|Betaproteobacteria|Kingella
atcaatcagaGCGGGAATAGCTCAGTTGGTAGAGCGCAACCTTGCCAAGGTTGAGGTCGCGAGTTCGAGACTCGTTTCCCGCTCCActctgatttg
>C181189091|CP030174|Gammaproteobacteria|Klebsiella

A FASTA file is a list of entries where each entry includes a header describing the sequence followed by the sequence string itself. The sequence is usually wrapped across many lines and may contain upper and/or lower case characters. The case may or may not have a specialized meaning. There may be ambiguous bases in the sequence and there may be gaps.

ARAGORN may be executed with no parameters:

$ ./aragorn
No sequence file specified, type aragorn -h for help

This is a nice touch: there are way too many bioinformatics tools that would segfault, dump raw Java error messages, or (more forgivably) stall waiting for STDIN. A minor issue, though, is that the exit code is still 0. In fact, the exit code of ARAGORN is always 0. If we pass the program, for example, its own source code rather than a FASTA file, it will dutifully parse itself:

$ ./aragorn aragorn1.2.36.c
------------------------------
ARAGORN v1.2.36   Dean Laslett
------------------------------

Please reference the following paper if you use this
program as part of any published research.

Laslett, D. and Canback, B. (2004) ARAGORN, a
program for the detection of transfer RNA and
transfer-messenger RNA genes in nucleotide sequences.
Nucleic Acids Research, 32;11-16.


Searching for tRNA genes with no introns
Searching for tmRNA genes
Assuming circular topology, search wraps around ends
Searching both strands
Using standard genetic code


Unnamed sequence
8350 nucleotides in sequence
Mean G+C content = 9.1%

Nothing found in Unnamed sequence


,<max> -t -m -mt
13 nucleotides in sequence
Mean G+C content = 15.4%

Nothing found in ,<max> -t -m -mt


 <filename>
3 nucleotides in sequence
Mean G+C content = 0.0%

Nothing found in  <filename>


 is assumed to contain one or more sequences
436 nucleotides in sequence
Mean G+C content = 10.6%

...

Text between a ‘>’ and a newline in the input is interpreted as the header of a new FASTA record. The remaining lines until the next ‘>’ are interpreted as sequence. The sequence strings are internally translated to a six-value integer enumeration, where 0-3 map to the four nucleotides, 4 is an ambiguous base, and 5 is anything else. The inclusion of the catch-all means that all sequences are valid. Anything, even binaries, can be silently parsed.
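The encoding described above can be sketched in a few lines of Python. This is an illustration, not ARAGORN's actual C code: the order in which a, c, g, and t map to 0-3 and the set of letters treated as ambiguous (here the IUPAC codes) are assumptions, and the names are made up for this example.

```python
from typing import List

# Sketch of a catch-all nucleotide encoding (not ARAGORN's real table).
# Assumed: a, c, g, t -> 0..3 in that order; IUPAC ambiguity letters -> 4.
BASES = {"a": 0, "c": 1, "g": 2, "t": 3}
AMBIGUOUS = set("nrykmswbdhv")
AMBIGUOUS_CODE = 4
OTHER_CODE = 5


def encode(sequence: str) -> List[int]:
    """Map every character to a code in 0..5; no input is ever rejected."""
    codes = []
    for ch in sequence.lower():
        if ch in BASES:
            codes.append(BASES[ch])
        elif ch in AMBIGUOUS:
            codes.append(AMBIGUOUS_CODE)
        else:
            codes.append(OTHER_CODE)
    return codes


print(encode("ACGTN"))   # a clean sequence
print(encode("int x;"))  # a line of C source is encoded just as happily
```

Because the last branch accepts everything, there is no point at which such an encoder can report that its input was not a sequence at all.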

Any pipeline managers that depend on exit codes to determine if a program successfully completed will miss these silent errors.
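Since the exit code cannot be trusted, one option is for the pipeline to check the input itself before calling ARAGORN. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the 90% A/C/G/T threshold is arbitrary, and text before the first ‘>’ is simply skipped rather than treated as an unnamed sequence.

```python
from typing import Iterator, List, Tuple


def read_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs; text before the first '>' is skipped."""
    header, chunks = None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def fasta_problems(text: str, min_acgt: float = 0.9) -> List[str]:
    """Return readable problems; an empty list means the input looks sane."""
    records = list(read_fasta(text))
    if not records:
        return ["no FASTA records found"]
    problems = []
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
            continue
        fraction = sum(seq.count(base) for base in "acgtACGT") / len(seq)
        if fraction < min_acgt:
            problems.append(
                f"{header}: only {fraction:.0%} of characters are A/C/G/T"
            )
    return problems
```

A wrapper script could print these problems and call sys.exit(1) when the list is non-empty, which gives the pipeline manager the non-zero exit code it is looking for.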

When ARAGORN is passed a valid FASTA file, we get the following result:

$ ./aragorn example.fa
------------------------------
ARAGORN v1.2.41   Dean Laslett
------------------------------

Please reference the following paper if you use this
program as part of any published research.

Laslett, D. and Canback, B. (2004) ARAGORN, a
program for the detection of transfer RNA and
transfer-messenger RNA genes in nucleotide sequences.
Nucleic Acids Research, 32;11-16.


Searching for tRNA genes with no introns
Searching for tmRNA genes
Assuming circular topology, search wraps around ends
Searching both strands
Using standard genetic code


C181099742|CP024420|Alphaproteobacteria|Brucella
96 nucleotides in sequence
Mean G+C content = 56.2%

1.

                 ca
                c
               a
             g-c
             c-g
             c-g
             c-g
             t+g
             t+g
             a-t     tg
            t   ctctc  a
     ga    a    !+!!!  g
    t  ctcg     gggag  c
   t   !!!!    c     tt
   g   gagc     t
    gta    a     g
            c-ggg
            c-g
            t-a
            g-c
            a-t
           t   a
           t   a
            tgt



    tRNA-Thr(tgt)
    76 bases, %GC = 55.3
    Sequence [11,86]

...

ARAGORN can also accept input from Genbank files. These files contain sequence along with extensive metadata including a description of the origin of the sequence and features that it contains. These annotations may include the locations of tRNAs. As an example, here is a link to the human reference mitochondrial genome: NC_012920.1.

Mitochondrial genomes encode their own tRNAs, so we can download this file (here with name “mito.gb”) and scan it with the -mtmam parameter for mammalian mitochondria:

$ ./aragorn -mtmam mito.gb
...
22.



               a
             c-g
             a-t
             g-c
             a-t
             g-c
             a-t
             a-t     g
            t   tttca a
     aa    a    ++!!! a
    a  tttg     ggagt a
    t  +!!+     t    t
     t gaat     g
      a    c    g
            t-at
            t-a
            a-t
            g-c
            c-g
           t   t
           t   g
            tgg



    mtRNA-Pro(tgg)
    68 bases, %GC = 33.8
    Sequence c[15956,16023]



tRNA Anticodon Frequency
AAA Phe       GAA Phe  1    CAA Leu       TAA Leu  1
AGA Ser       GGA Ser       CGA Ser       TGA Ser  1
ACA Cys       GCA Cys  1    CCA Trp       TCA Trp  1
ATA Tyr       GTA Tyr  1    CTA Pyl       TTA Stop
AAG Leu       GAG Leu       CAG Leu       TAG Leu  1
AGG Pro       GGG Pro       CGG Pro       TGG Pro  1
ACG Arg       GCG Arg       CCG Arg       TCG Arg  1
ATG His       GTG His  1    CTG Gln       TTG Gln  1
AAC Val       GAC Val       CAC Val       TAC Val  1
AGC Ala       GGC Ala       CGC Ala       TGC Ala  1
ACC Gly       GCC Gly       CCC Gly       TCC Gly  1
ATC Asp       GTC Asp  1    CTC Glu       TTC Glu  1
AAT Ile       GAT Ile  1    CAT Met  1    TAT Met
AGT Thr       GGT Thr       CGT Thr       TGT Thr  1
ACT Ser       GCT Ser  1    CCT Stop      TCT Stop
ATT Asn       GTT Asn  1    CTT Lys       TTT Lys  1

tRNA Codon Frequency
TTT Phe       TTC Phe  1    TTG Leu       TTA Leu  1
TCT Ser       TCC Ser       TCG Ser       TCA Ser  1
TGT Cys       TGC Cys  1    TGG Trp       TGA Trp  1
TAT Tyr       TAC Tyr  1    TAG Pyl       TAA Stop
CTT Leu       CTC Leu       CTG Leu       CTA Leu  1
CCT Pro       CCC Pro       CCG Pro       CCA Pro  1
CGT Arg       CGC Arg       CGG Arg       CGA Arg  1
CAT His       CAC His  1    CAG Gln       CAA Gln  1
GTT Val       GTC Val       GTG Val       GTA Val  1
GCT Ala       GCC Ala       GCG Ala       GCA Ala  1
GGT Gly       GGC Gly       GGG Gly       GGA Gly  1
GAT Asp       GAC Asp  1    GAG Glu       GAA Glu  1
ATT Ile       ATC Ile  1    ATG Met  1    ATA Met
ACT Thr       ACC Thr       ACG Thr       ACA Thr  1
AGT Ser       AGC Ser  1    AGG Stop      AGA Stop
AAT Asn       AAC Asn  1    AAG Lys       AAA Lys  1

Number of tRNA genes = 22
Number of D replacement loop tRNA genes = 1
tRNA GC range = 22.4% to 49.3%
Number of tmRNA genes = 0



NC_012920 Homo sapiens mitochondrion, complete genome.
16569 nucleotides in sequence
Mean G+C content = 44.4%
GenBank to Aragorn Comparison

22 annotated tRNA genes
22 detected tRNA genes

  GenBank                               Aragorn
  tRNA-Phe  (577,647)           mtRNA-Phe(gaa) [577,647]
  tRNA-Val  (1602,1670)         mtRNA-Val(tac) [1602,1670]
  tRNA-Leu  (3230,3304)         mtRNA-Leu(taa) [3229,3306]
  tRNA-Ile  (4263,4331)         mtRNA-Ile(gat) [4263,4331]
  tRNA-Gln c(4329,4400)         mtRNA-Gln(ttg) c[4329,4400]
  tRNA-Met  (4402,4469)         mtRNA-Met(cat) [4402,4469]
  tRNA-Trp  (5512,5579)         mtRNA-Trp(tca) [5512,5579]
  tRNA-Ala c(5587,5655)         mtRNA-Ala(tgc) c[5585,5656]
  tRNA-Asn c(5657,5729)         mtRNA-Asn(gtt) c[5657,5729]
  tRNA-Cys c(5761,5826)         mtRNA-Cys(gca) c[5760,5826]
  tRNA-Tyr c(5826,5891)         mtRNA-Tyr(gta) c[5826,5891]
  tRNA-Ser c(7446,7514)         mtRNA-Ser(tga) c[7446,7514]
  tRNA-Asp  (7518,7585)         mtRNA-Asp(gtc) [7518,7585]
  tRNA-Lys  (8295,8364)         mtRNA-Lys(ttt) [8295,8364]
  tRNA-Gly  (9991,10058)        mtRNA-Gly(tcc) [9990,10059]
  tRNA-Arg  (10405,10469)       mtRNA-Arg(tcg) [10404,10470]
  tRNA-His  (12138,12206)       mtRNA-His(gtg) [12138,12206]
  tRNA-Ser  (12207,12265)       D-loop mtRNA-Ser(gct) [12207,12265]
  tRNA-Leu  (12266,12336)       mtRNA-Leu(tag) [12266,12336]
  tRNA-Glu c(14674,14742)       mtRNA-Glu(ttc) c[14673,14743]
  tRNA-Thr  (15888,15953)       mtRNA-Thr(tgt) [15887,15954]
  tRNA-Pro c(15956,16023)       mtRNA-Pro(tgg) c[15956,16023]

Number of false negative genes = 0
Number of false positive genes = 0
Number of false positive D-replacement tRNA genes = 0
Number of false positive TV-replacement tRNA genes = 0

This produces structures for every observed tRNA (the first 21 elided), compares the inferred tRNAs to the ones annotated in the Genbank file, and generates tables describing the locations of the RNAs.

In addition to these free-form textual outputs, ARAGORN offers a batch mode that consists of many tables separated by FASTA headers and annotation lines:

$ ./aragorn -w example.fa
>C181099742|CP024420|Alphaproteobacteria|Brucella
1 gene found
1   tRNA-Thr                       [11,86]      34      (tgt)
>C191077422|CP031664|Bacillota|Staphylococcus
1 gene found
1   tRNA-Ala                       [11,86]      34      (tgc)
>C181212629|LS483426|Betaproteobacteria|Kingella
1 gene found
1   tRNA-Gly                       [11,86]      34      (gcc)
>end    3 sequences 3 tRNA genes 0 tmRNA genes

Additional parameters allow access to variations of the batch output. For example, the flag -br adds the secondary structure of the tRNA in a machine-readable form:

$ ./aragorn -br -w z.fa
>C181099742|CP024420|Alphaproteobacteria|Brucella
1 gene found
1   tRNA-Thr                       [11,86]      34      (tgt)
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgggagttcgagtctctctgggggcacca
(((((((ss((((dddddddd))))s(((((ccAAAcc)))))vvvvv(((((ttttttt))))))))))))

Adding the -svg parameter further appends the text of an SVG figure to the output, yielding batch entries such as:

>C181212629|LS483426|Betaproteobacteria|Kingella
1 gene found
1   tRNA-Gly                       [11,86]      34      (gcc)
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgcgagttcgagactcgtttcccgctcca
(((((((ss((((dddddddd))))s(((((ccAAAcc)))))vvvvv(((((ttttttt))))))))))))
<svg xmlns='http://www.w3.org/2000/svg' version='1.1' width='9.6cm' height='10cm' viewBox='0 0 26 27'>
<title>tRNA-Gly(gcc)</title>
<g font-family='Courier New,Courier,monospace' font-size='1.4' text-anchor='middle' fill='black' stroke='none'>
<text x='12' y='25'>g</text><text x='13' y='25'>c</text><text x='14' y='25'>c</text><text x='11' y='24'>t</text>
<text x='15' y='24'>a</text><text x='11' y='23'>t</text><text x='15' y='23'>a</text><text x='12' y='22'>c</text>
<text x='13' y='22'>-</text><text x='14' y='22'>g</text><text x='12' y='21'>c</text><text x='13' y='21'>-</text>
<text x='14' y='21'>g</text><text x='12' y='20'>a</text><text x='13' y='20'>-</text><text x='14' y='20'>t</text>
<text x='12' y='19'>a</text><text x='13' y='19'>-</text><text x='14' y='19'>t</text><text x='12' y='18'>c</text>
<text x='13' y='18'>-</text><text x='14' y='18'>g</text><text x='15' y='18'>a</text><text x='16' y='18'>g</text>
<text x='4' y='17'>g</text><text x='5' y='17'>t</text><text x='6' y='17'>a</text><text x='11' y='17'>g</text>
<text x='17' y='17'>g</text><text x='3' y='16'>g</text><text x='7' y='16'>g</text><text x='8' y='16'>a</text>
<text x='9' y='16'>g</text><text x='10' y='16'>c</text><text x='16' y='16'>t</text><text x='3' y='15'>t</text>
<text x='15' y='15'>c</text><text x='21' y='15'>t</text><text x='22' y='15'>t</text><text x='4' y='14'>t</text>
<text x='7' y='14'>c</text><text x='8' y='14'>t</text><text x='9' y='14'>c</text><text x='10' y='14'>g</text>
<text x='16' y='14'>g</text><text x='17' y='14'>c</text><text x='18' y='14'>g</text><text x='19' y='14'>a</text>
<text x='20' y='14'>g</text><text x='23' y='14'>c</text><text x='5' y='13'>g</text><text x='6' y='13'>a</text>
<text x='11' y='13'>a</text><text x='16' y='13'>+</text><text x='23' y='13'>g</text><text x='12' y='12'>t</text>
<text x='16' y='12'>t</text><text x='17' y='12'>g</text><text x='18' y='12'>c</text><text x='19' y='12'>t</text>
<text x='20' y='12'>c</text><text x='23' y='12'>a</text><text x='13' y='11'>a</text><text x='14' y='11'>-</text>
<text x='15' y='11'>t</text><text x='21' y='11'>a</text><text x='22' y='11'>g</text><text x='13' y='10'>a</text>
<text x='14' y='10'>-</text><text x='15' y='10'>t</text><text x='13' y='9'>g</text><text x='14' y='9'>-</text>
<text x='15' y='9'>c</text><text x='13' y='8'>g</text><text x='14' y='8'>-</text><text x='15' y='8'>c</text>
<text x='13' y='7'>g</text><text x='14' y='7'>-</text><text x='15' y='7'>c</text><text x='13' y='6'>c</text>
<text x='14' y='6'>-</text><text x='15' y='6'>g</text><text x='13' y='5'>g</text><text x='14' y='5'>-</text>
<text x='15' y='5'>c</text><text x='15' y='4'>t</text><text x='16' y='3'>c</text><text x='17' y='2'>c</text>
<text x='18' y='2'>a</text>
</g><g fill='none' stroke='black' stroke-width='0.075'>
<line x1='7' y1='15' x2='7' y2='14.3'/><line x1='8' y1='15' x2='8' y2='14.3'/>
<line x1='9' y1='15' x2='9' y2='14.3'/><line x1='10' y1='15' x2='10' y2='14.3'/>
<line x1='17' y1='13' x2='17' y2='12.3'/><line x1='18' y1='13' x2='18' y2='12.3'/>
<line x1='19' y1='13' x2='19' y2='12.3'/><line x1='20' y1='13' x2='20' y2='12.3'/>
</g></svg>

The batch format is non-standard and would require a specialized parser. For the broadest interoperability, ARAGORN offers a FASTA-only mode. FASTA headers may be formatted with or without spaces and with or without gene numbering, so there are four output FASTA variants:

# default
$ ./aragorn -fo z.fa
>tRNA-Thr(tgt) [11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>tRNA-Ala(tgc) [11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>tRNA-Gly(gcc) [11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca

# space free
$ ./aragorn -fos z.fa
>tRNA-Thr(tgt)[11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>tRNA-Ala(tgc)[11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>tRNA-Gly(gcc)[11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca

# with numbering
$ ./aragorn -fon z.fa
>1-1 tRNA-Thr(tgt) [11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>2-1 tRNA-Ala(tgc) [11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>3-1 tRNA-Gly(gcc) [11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca

# space free with numbering
$ ./aragorn -fons z.fa
>1-1tRNA-Thr(tgt)[11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>2-1tRNA-Ala(tgc)[11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>3-1tRNA-Gly(gcc)[11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca

The space-free option is useful if the downstream tool breaks on spaces (which is unfortunately common). The gene numbers specify the FASTA entry index and the tRNA index, giving a handle that can link back to the original reference sequences in the input FASTA file.
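A downstream tool that wants to accept all four header variants could use one regular expression, as in the sketch below. The pattern is inferred from the examples above only. The optional c before the bracket is an assumption carried over from the complement-strand locations in the raw output, and a gene name that began with a digit would be ambiguous in the space-free numbered form.

```python
import re
from typing import NamedTuple, Optional


class Header(NamedTuple):
    entry: Optional[int]   # index of the input FASTA record, if numbered
    gene: Optional[int]    # index of the gene within that record, if numbered
    name: str
    anticodon: str
    complement: bool
    start: int
    end: int


_HEADER = re.compile(
    r"^>(?:(\d+)-(\d+)\s*)?"     # optional "1-1" numbering
    r"(\S+?)\((\w+)\)\s*"        # gene name and anticodon, e.g. tRNA-Thr(tgt)
    r"(c?)\[(\d+),(\d+)\]\s*$"   # optional complement mark and location
)


def parse_header(line: str) -> Header:
    match = _HEADER.match(line)
    if match is None:
        raise ValueError(f"unrecognized header: {line!r}")
    entry, gene, name, anticodon, comp, start, end = match.groups()
    return Header(
        int(entry) if entry else None,
        int(gene) if gene else None,
        name, anticodon, comp == "c", int(start), int(end),
    )


print(parse_header(">3-1tRNA-Gly(gcc)[11,86]"))
```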

Parameterization

Behold! The vibroelectrohammerax with attached GPS! – Midjourney

Next, we examine the flags and options that modify ARAGORN’s behavior. Running ./aragorn -h displays a usage message detailing the program’s options – 77 in total. In the following paragraphs, I argue that these options represent scope creep – that the code and interface of ARAGORN handle too many concerns. To be fair, my view is one of radical minimalism; by common standards, ARAGORN is focused and streamlined. My fundamental criticism is about the community design choice, not the ARAGORN tool itself.

Every new bioinformatics tool tends to be, well, a new tool. Each has an independent learning curve and few common conventions. This is clearly seen with command line argument parsing. UNIX has loose conventions for arguments: options may have a short form with one dash and one letter (e.g., -r and -f), short forms may be bundled (-r -f may be written as -rf), and long forms may be written with two dashes and multiple letters (e.g., --recursive and --force). Like many bioinformatics tools, ARAGORN does not follow the conventions. All arguments in ARAGORN have a single dash and no bundling is supported. ARAGORN also silently ignores unsupported arguments and allows mutually exclusive arguments to be passed in tandem, silently accepting the last.
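For contrast, here is what the conventional behavior looks like when a standard argument parser does the work. This sketch uses Python's argparse. The program name and flags are illustrative only and are not a proposal for ARAGORN's interface.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """A hypothetical interface that follows the usual UNIX conventions."""
    parser = argparse.ArgumentParser(prog="trna-scan-demo")
    parser.add_argument("fasta", help="input FASTA file")
    parser.add_argument("-t", "--trna", action="store_true",
                        help="search for tRNA genes")
    parser.add_argument("-m", "--tmrna", action="store_true",
                        help="search for tmRNA genes")
    # Only one topology may be chosen; passing both is an error.
    topology = parser.add_mutually_exclusive_group()
    topology.add_argument("-c", "--circular", action="store_true")
    topology.add_argument("-l", "--linear", action="store_true")
    return parser


parser = build_parser()
# Short flags bundle (-tm) and long forms work alongside them.
args = parser.parse_args(["-tm", "--linear", "genome.fa"])
print(args.trna, args.tmrna, args.linear)
```

With this parser, passing both -c and -l, or passing an unknown flag, prints a usage message and exits with a non-zero status instead of being silently accepted.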

ARAGORN’s 77 parameters alter input/output format, filter results, set algorithmic parameters, transform output, transform input, and switch functions. We will cover each set of parameters below.

Input/output parameters. As detailed above, the parameters passed to ARAGORN determine the basic output format (raw, batch, and FASTA) and specify variants of each. Downstream tools that use ARAGORN will need to be specialized for specific combinations of ARAGORN parameters.

Filter results. ARAGORN allows output tRNAs to be filtered by setting score thresholds to allow detection of low confidence RNAs and pseudogenes; by toggling acceptance of tRNAs overlapping on the opposite strand; by bounding the locations where introns may be inferred; and by a few other metrics. While these filtering steps may allow modest performance improvements, their primary purpose is to specialize the search. An alternative approach, and the one chosen for my refactor, is to filter less within the core algorithm and allow more downstream filtering by returning a data structure with rich structural information.
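As a rough picture of what downstream filtering could look like, the sketch below defines a hypothetical record type and composes small predicates over it. The field names, the predicates, and the values in the usage lines are all invented for illustration; they are not the record type used in the refactor.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class TRNARecord:
    """Hypothetical result record; field names are illustrative only."""
    start: int
    end: int
    strand: str
    anticodon: str
    score: float
    intron_length: int = 0


Predicate = Callable[[TRNARecord], bool]


def min_score(threshold: float) -> Predicate:
    return lambda record: record.score >= threshold


def without_intron(record: TRNARecord) -> bool:
    return record.intron_length == 0


def keep(records: Iterable[TRNARecord], *predicates: Predicate) -> List[TRNARecord]:
    """Keep the records that satisfy every predicate."""
    return [r for r in records if all(p(r) for p in predicates)]


# Made-up records, just to show the call shape.
hits = [
    TRNARecord(11, 86, "+", "tgt", score=110.0),
    TRNARecord(200, 290, "-", "gcc", score=95.0, intron_length=14),
]
print(keep(hits, min_score(100.0), without_intron))
```

The predictor itself stays parameter-free; each caller decides what counts as a confident hit.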

Transform outputs. The predicted tRNA structures may be further processed to identify the amino acid they carry. This requires translating the anticodon. However, the genetic code varies in small ways among certain branches of the tree of life. So ARAGORN contains flags to specify this genetic code. Codon translation, however, is a common problem shared by many tools. Therefore it is reasonable to delegate translation to a common outside library.
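To illustrate how small the translation step is once it is separated out, here is a sketch that treats the genetic code as plain data. The standard table is written in the conventional TCAG order, and the vertebrate mitochondrial variant applies the four differences visible in the -mtmam tables above. The function pairs the anticodon with its strict Watson-Crick codon only and ignores wobble pairing, so it is a simplification.

```python
from typing import Dict

BASES = "TCAG"
# Standard genetic code in the conventional TCAG ordering ('*' marks a stop).
STANDARD_AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD: Dict[str, str] = {
    a + b + c: STANDARD_AA[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Vertebrate mitochondrial code: the same table with four codons reassigned.
VERTEBRATE_MITO: Dict[str, str] = {
    **STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*",
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def amino_acid(anticodon: str, code: Dict[str, str] = STANDARD) -> str:
    """One-letter amino acid for the codon that pairs with an anticodon.

    Both anticodon and codon are written 5' to 3', so the codon is the
    reverse complement of the anticodon.
    """
    codon = anticodon.upper().translate(_COMPLEMENT)[::-1]
    return code[codon]


print(amino_acid("tgt"), amino_acid("tgc"), amino_acid("gcc"))
print(amino_acid("tca"), amino_acid("tca", VERTEBRATE_MITO))
```

Choosing a genetic code becomes a matter of passing a different dictionary, which is the kind of job a shared library can do for every tool at once.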

Transform inputs. A genome may be circular or linear and the matches may be on the plus or minus strand. Extending the search to both strands requires either reversing the search algorithm (tricky in general) or taking the reverse complement of the input sequence, running the original search, and then mapping the hits back to the plus strand. Dealing with circular genomes requires either altering the search algorithm to check the pattern indices and wrap around if they overflow or extending the input sequence far enough to match a full pattern and then remapping the patterns that overlap the right bound. These are non-trivial problems and we will return to them in the final part of this series.
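The two "transform the input" strategies mentioned above (reverse complementing for the minus strand, and extending the sequence for circular genomes) can be sketched as wrappers around any plus-strand, linear search. This is only an illustration of the coordinate bookkeeping: exact_motif is a toy stand-in for a real tRNA search, coordinates are 0-based and half-open, and max_len (assumed to be at least 1) is the longest pattern the search can match.

```python
from typing import Callable, List, Tuple

Hit = Tuple[int, int]                 # 0-based, half-open [start, end)
Search = Callable[[str], List[Hit]]   # any plus-strand-only, linear search

_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(_COMPLEMENT)[::-1]


def search_both_strands(seq: str, search: Search) -> List[Tuple[int, int, str]]:
    """Run a plus-strand search twice; report hits in plus-strand coordinates."""
    n = len(seq)
    hits = [(start, end, "+") for start, end in search(seq)]
    for start, end in search(reverse_complement(seq)):
        hits.append((n - end, n - start, "-"))
    return hits


def search_circular(seq: str, search: Search, max_len: int) -> List[Hit]:
    """Search a circular sequence by appending its first max_len - 1 bases.

    A hit that crosses the origin is reported with end < start.
    """
    n = len(seq)
    extended = seq + seq[:max_len - 1]
    hits = []
    for start, end in search(extended):
        if start < n:  # hits starting in the appended copy are duplicates
            hits.append((start, end if end <= n else end - n))
    return hits


def exact_motif(motif: str) -> Search:
    """Toy stand-in for a real search: find exact copies of a motif."""
    def search(seq: str) -> List[Hit]:
        return [(i, i + len(motif))
                for i in range(len(seq) - len(motif) + 1)
                if seq.startswith(motif, i)]
    return search


print(search_both_strands("acgtttga", exact_motif("tcaa")))
print(search_circular("gtaaac", exact_motif("acgt"), max_len=4))
```

Neither wrapper knows anything about tRNAs, which is the point: the same two functions could serve any sequence search.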

Switch functions. A final pair of flags switches the program from a tRNA to a tmRNA search. tRNA and tmRNA prediction certainly overlap, and the two functions benefit from shared code, but tRNAs and tmRNAs have different output types. tRNAs must specify their codon and tmRNAs their tag.

ARAGORN’s parameters make sense for a lone tool. But in a wider bioinformatics context, they represent duplicated work and excess complexity, and they may cause interface conflicts.

The complexity of the file-oriented program

ARAGORN’s IO options are relatively user-friendly compared to other bioinformatics tools, contributing to its popularity. However, its file-oriented approach raises several problems.

Scope creep. As discussed in the parameterization section, bioinformatics tools often suffer from scope creep. It is not enough to just predict tRNAs; ARAGORN must also parse complex formats, generate a wide range of output formats, transform the input to support both strands and circular genomes, and translate codons across a curated set of genetic codes.

Lossy output. The FASTA output loses the input metadata and derived results including the tRNA structure and energy score. This is unavoidable, since FASTA is not designed to store metadata. Batch mode is richer but still loses most Genbank metadata. None of the formats provide easy access to tRNA sub-components, complicating tasks like comparing T-loop sequences across species.
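To make the T-loop example concrete, here is roughly what a user has to write to recover sub-components from the -br output shown earlier. The sketch lines the structure string up with the sequence from the first base and splits it into runs of identical symbols. Reading A as the anticodon and t as the T-loop is an assumption based on the example, not on documentation.

```python
from itertools import groupby
from typing import List, Tuple


def structure_runs(sequence: str, structure: str) -> List[Tuple[str, str]]:
    """Split the sequence into runs that share one structure symbol."""
    runs = []
    position = 0
    for symbol, group in groupby(structure):
        length = len(list(group))
        runs.append((symbol, sequence[position:position + length]))
        position += length
    return runs


# Sequence and structure lines copied from the -br example (tRNA-Thr).
seq = "gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgggagttcgagtctctctgggggcacca"
struct = "(((((((ss((((dddddddd))))s(((((ccAAAcc)))))vvvvv(((((ttttttt))))))))))))"

runs = structure_runs(seq, struct)
anticodon = next(s for symbol, s in runs if symbol == "A")
t_loop = next(s for symbol, s in runs if symbol == "t")
print(anticodon, t_loop)
```

This works only in batch mode with -br, and only as long as the symbol alphabet never changes, which is a fragile basis for a cross-species comparison.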

Too many output types. ARAGORN offers raw, batch, and FASTA outputs, each with variations based on input options. This complicates caching and may require rerunning ARAGORN for different use cases. Which output should a downstream tool expect? Writing a parser to auto-detect output types is complex, and requiring specific flags for downstream tools is brittle. Directly wrapping ARAGORN in downstream tools limits pipeline flexibility and locks in one tRNA prediction algorithm.

Costly formatting. Parsing FASTA files is straightforward but requires careful implementation for efficiency and validation. Genbank files are even more difficult to handle due to their syntactic complexity and incompletely standardized format. Output formatting is complex, as ARAGORN must display structures, requiring additional UI parameters and code branches, increasing maintenance costs and user cognitive load.

Output formatting must meet user needs while maintaining backwards compatibility. Variations in input format, such as gaps in the input FASTA sequence, might break ARAGORN. Seemingly minor changes in output format can disrupt pipelines dependent on ARAGORN. For instance, replacing “!” with “|” in the ASCII structures could break downstream programs parsing T-loop sequences. Fear of breaking downstream dependencies can hinder a tool’s evolution.

Lack of clarity. Textual representations can be ambiguous, such as ARAGORN’s headers:

>tRNA-Thr(tgt) [11,86]

Without documentation, the meaning of numbers and symbols is unclear. A user may guess that 11 and 86 represent the start and end of the tRNA sequence, but is this 0- or 1-based and does the range include the acceptor stem? Is “tgt” the codon or the anticodon? Extensive documentation is needed, but this is tedious both to make and to read.

Metadata is not easily propagated. ARAGORN’s input includes sequence and metadata, but output formats vary in what metadata they retain. In batch mode, the full FASTA header is passed to output tRNAs, but pure FASTA output loses links to original entries and predicted structures. Genbank input loses most original metadata.

In the morloc implementation, we replace all this complexity with a single function of one type of input and one type of output. The input is a DNA sequence given as a single string and the output is a list of 0 or more tRNA records and their locations. There is no input metadata – threading metadata from input to output is not the role of a tRNA predictor.
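The shape of that function, and the way metadata can be threaded around it instead of through it, might look like the following Python sketch. This is not morloc code and not the actual refactored interface: the TRNA fields, the Predictor alias, and the stand-in predictor are hypothetical.

```python
from typing import Callable, Iterable, List, NamedTuple, Tuple


class TRNA(NamedTuple):
    """Hypothetical minimal record: a location plus the anticodon."""
    start: int
    end: int
    anticodon: str


# The whole interface of the predictor: one string in, a list of records out.
Predictor = Callable[[str], List[TRNA]]


def annotate(entries: Iterable[Tuple[str, str]],
             predict: Predictor) -> List[Tuple[str, TRNA]]:
    """Pair each predicted tRNA with the header of the record it came from.

    The caller owns the metadata; the predictor never sees it.
    """
    return [(header, hit) for header, seq in entries for hit in predict(seq)]


def fake_predict(sequence: str) -> List[TRNA]:
    """Stand-in predictor used only to show the call shape."""
    return [TRNA(0, len(sequence), "nnn")] if sequence else []


print(annotate([("record-1", "acgt"), ("record-2", "")], fake_predict))
```

Swapping in a different predictor with the same signature requires no change to annotate, which is what keeps the pipeline from being locked to one algorithm.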

Conclusions

In morloc, complex behavior is specified through function composition rather than parameterization and runtime input/output polymorphism. This approach, detailed in Part Three, transforms ARAGORN into a simple function that can act as one piece in a larger analysis. The responsibility of ARAGORN becomes only the prediction of tRNAs from a string; all other behavior is delegated to generic libraries. This allows the ARAGORN developers to focus on writing pure functions in their simplest form and sharing these as their sole result. The researchers are free to focus on algorithmic design and not software development.

built on 2024-08-12 11:47:46.22895832 UTC from file 2024-08-04-aragorn-1