Contents:
- Part 0: ARAGORN and the transition to morloc
- Part 1: Inputs, outputs, and a case for simplicity
- Part 2: Refactoring the code
This post reviews the installation and execution of the command line ARAGORN program, describes the program's inputs and outputs, explains its many options, and finally discusses the challenges that ARAGORN faces as a file-oriented program.
Installation
The ARAGORN browser-based graphical user interface (http://www.trna.se/ARAGORN/).
ARAGORN can be accessed online, but we will focus on the command line tool. The official source code is available here. It is unfortunately not officially hosted on GitHub. There is, however, an in-progress effort by the super scientist deprekate (see here) to write a Python package with C bindings to ARAGORN.
ARAGORN is a single-file C program that can be easily compiled with the GNU C
Compiler (gcc) as follows:
gcc -o aragorn aragorn1.2.41.c
After all my bitter years of debugging bioinformatics installations, it is great
to see one that compiles this cleanly. The code was written over two decades ago
and has survived until today with little maintenance. The key to ARAGORN’s
longevity is that the program is written in a stable language (ANSI C), has few
dependencies (only the system libraries stdio and stdlib), and has no complex
build environment. Going forward with the refactor, we want to be damn sure that
we don’t lose this stability.
File input
The program can take a FASTA file input as its sole argument. For our case study, we retrieved a few bacterial tRNAs from tRNADB-CE:
>C181099742|CP024420|Alphaproteobacteria|Brucella
gggcgcctgtGCCCTTATAGCTCAGTTGGTAGAGCACCTGATTTGTAATCAGGGGGTCGGGAGTTCGAGTCTCTCTGGGGGCACCActtttcgaca
>C191077422|CP031664|Bacillota|Staphylococcus
tccattttatGGGGGCTTAGCTCAGCTGGGAGAGCGCCTGCTTTGCACGCAGGAGGTCAGCGGTTCGATCCCGCTAGTCTCCACCAtttatttttt
>C181212629|LS483426|Betaproteobacteria|Kingella
atcaatcagaGCGGGAATAGCTCAGTTGGTAGAGCGCAACCTTGCCAAGGTTGAGGTCGCGAGTTCGAGACTCGTTTCCCGCTCCActctgatttg
>C181189091|CP030174|Gammaproteobacteria|Klebsiella
A FASTA file is a list of entries where each entry includes a header describing the sequence followed by the sequence string itself. The sequence usually is wrapped into many lines and may contain characters of upper and/or lower case. The case may or may not have a specialized meaning. There may be ambiguous bases in the sequence and there may be gaps.
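To make the format description concrete, here is a minimal Python sketch of a FASTA reader. It is not how ARAGORN reads its input; the function name `read_fasta` is made up for this illustration, and the sketch makes no attempt to interpret case, ambiguous bases, or gaps.

```python
import io
from typing import Iterator, TextIO


def read_fasta(handle: TextIO) -> Iterator[tuple[str, str]]:
    """Yield (header, sequence) pairs from a FASTA stream.

    Wrapped sequence lines are joined; case is preserved as-is.
    Any text before the first '>' line is ignored.
    """
    header = None
    chunks: list[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# A tiny made-up input with a wrapped, mixed-case sequence.
example = io.StringIO(">seq1 demo\nacgt\nACGT\n>seq2\nnnnn\n")
records = list(read_fasta(example))
print(records)
```

Even this small reader has to make decisions (what to do with blank lines, with text before the first header, with case) that the format itself leaves open.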
ARAGORN may be executed with no parameters:
$ ./aragorn
No sequence file specified, type aragorn -h for help
This is a nice touch; far too many bioinformatics tools would segfault, dump raw Java error messages, or (more forgivably) stall waiting for STDIN. A minor issue, though, is that the exit code is still 0. In fact, the exit code of ARAGORN is always 0. If we pass the program, for example, its own source code rather than a FASTA file, it will dutifully parse itself:
$ ./aragorn aragorn1.2.36.c
------------------------------
ARAGORN v1.2.36 Dean Laslett
------------------------------
Please reference the following paper if you use this
program as part of any published research.
Laslett, D. and Canback, B. (2004) ARAGORN, a
program for the detection of transfer RNA and
transfer-messenger RNA genes in nucleotide sequences.
Nucleic Acids Research, 32;11-16.
Searching for tRNA genes with no introns
Searching for tmRNA genes
Assuming circular topology, search wraps around ends
Searching both strands
Using standard genetic code
Unnamed sequence
8350 nucleotides in sequence
Mean G+C content = 9.1%
Nothing found in Unnamed sequence
,<max> -t -m -mt
13 nucleotides in sequence
Mean G+C content = 15.4%
Nothing found in ,<max> -t -m -mt
<filename>
3 nucleotides in sequence
Mean G+C content = 0.0%
Nothing found in <filename>
is assumed to contain one or more sequences
436 nucleotides in sequence
Mean G+C content = 10.6%
...
Text between a ‘>’ and a newline in the input is interpreted as the header of a new FASTA record. The remaining lines until the next ‘>’ are interpreted as sequence. The sequence strings are internally translated to a six-value integer enumeration, where 0-3 map to the four nucleotides, 4 is an ambiguous base, and 5 is anything else. Because of this catch-all value, every input is a valid sequence. Anything, even a binary, can be silently parsed.
Any pipeline manager that depends on exit codes to determine whether a program completed successfully will miss these silent errors.
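The encoding and the missing error signal can be sketched in a few lines of Python. This is an illustration, not ARAGORN's code: the order of the four nucleotide codes, the treatment of `u`, the set of ambiguity letters, and the names `encode`, `check_sequence`, and `main` are all assumptions. The point is only that a strict check in front of the catch-all gives a pipeline manager a non-zero exit code to react to.

```python
import sys

# Assumed code assignment: the text only says that 0-3 are the four
# nucleotides, 4 is an ambiguous base, and 5 is anything else.
BASE_CODES = {"a": 0, "c": 1, "g": 2, "t": 3, "u": 3}  # 'u' as 't' is an assumption
AMBIGUOUS = set("rykmswbdhvn")  # IUPAC ambiguity letters
AMBIG, NOBASE = 4, 5


def encode(seq: str) -> list[int]:
    """Encode a sequence; unknown characters fall into the catch-all code 5."""
    codes = []
    for ch in seq.lower():
        if ch in BASE_CODES:
            codes.append(BASE_CODES[ch])
        elif ch in AMBIGUOUS:
            codes.append(AMBIG)
        else:
            codes.append(NOBASE)
    return codes


def check_sequence(seq: str, max_other_fraction: float = 0.0) -> bool:
    """Return True if the sequence is non-empty and has few enough catch-all characters."""
    codes = encode(seq)
    if not codes:
        return False
    return codes.count(NOBASE) / len(codes) <= max_other_fraction


def main(records: list[tuple[str, str]]) -> int:
    """Return a shell-style exit code: 0 if every record passes, 1 otherwise."""
    bad = [name for name, seq in records if not check_sequence(seq)]
    for name in bad:
        print(f"invalid sequence: {name}", file=sys.stderr)
    return 1 if bad else 0


print(encode("acgtNx"))
print(main([("ok", "acgt"), ("source code", "int main(void)")]))
```

With the default threshold of zero, gaps would also be rejected, so a real wrapper would need to decide how tolerant it wants to be.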
When ARAGORN is passed a valid FASTA file, we get the following result:
$ ./aragorn example.fa
------------------------------
ARAGORN v1.2.41 Dean Laslett
------------------------------
Please reference the following paper if you use this
program as part of any published research.
Laslett, D. and Canback, B. (2004) ARAGORN, a
program for the detection of transfer RNA and
transfer-messenger RNA genes in nucleotide sequences.
Nucleic Acids Research, 32;11-16.
Searching for tRNA genes with no introns
Searching for tmRNA genes
Assuming circular topology, search wraps around ends
Searching both strands
Using standard genetic code
C181099742|CP024420|Alphaproteobacteria|Brucella
96 nucleotides in sequence
Mean G+C content = 56.2%
1.
ca
c
a
g-c
c-g
c-g
c-g
t+g
t+g
a-t tg
t ctctc a
ga a !+!!! g
t ctcg gggag c
t !!!! c tt
g gagc t
gta a g
c-ggg
c-g
t-a
g-c
a-t
t a
t a
tgt
tRNA-Thr(tgt)
76 bases, %GC = 55.3
Sequence [11,86]
...
ARAGORN can also accept input from Genbank files. These files contain sequence along with extensive metadata including a description of the origin of the sequence and features that it contains. These annotations may include the locations of tRNAs. As an example, here is a link to the human reference mitochondrial genome: NC_012920.1.
Mitochondrial genomes encode their own tRNAs, so we can download this file (here
with name “mito.gb”) and scan it with the -mtmam parameter for mammalian
mitochondria:
$ ./aragorn -mtmam mito.gb
...
22.
a
c-g
a-t
g-c
a-t
g-c
a-t
a-t g
t tttca a
aa a ++!!! a
a tttg ggagt a
t +!!+ t t
t gaat g
a c g
t-at
t-a
a-t
g-c
c-g
t t
t g
tgg
mtRNA-Pro(tgg)
68 bases, %GC = 33.8
Sequence c[15956,16023]
tRNA Anticodon Frequency
AAA Phe GAA Phe 1 CAA Leu TAA Leu 1
AGA Ser GGA Ser CGA Ser TGA Ser 1
ACA Cys GCA Cys 1 CCA Trp TCA Trp 1
ATA Tyr GTA Tyr 1 CTA Pyl TTA Stop
AAG Leu GAG Leu CAG Leu TAG Leu 1
AGG Pro GGG Pro CGG Pro TGG Pro 1
ACG Arg GCG Arg CCG Arg TCG Arg 1
ATG His GTG His 1 CTG Gln TTG Gln 1
AAC Val GAC Val CAC Val TAC Val 1
AGC Ala GGC Ala CGC Ala TGC Ala 1
ACC Gly GCC Gly CCC Gly TCC Gly 1
ATC Asp GTC Asp 1 CTC Glu TTC Glu 1
AAT Ile GAT Ile 1 CAT Met 1 TAT Met
AGT Thr GGT Thr CGT Thr TGT Thr 1
ACT Ser GCT Ser 1 CCT Stop TCT Stop
ATT Asn GTT Asn 1 CTT Lys TTT Lys 1
tRNA Codon Frequency
TTT Phe TTC Phe 1 TTG Leu TTA Leu 1
TCT Ser TCC Ser TCG Ser TCA Ser 1
TGT Cys TGC Cys 1 TGG Trp TGA Trp 1
TAT Tyr TAC Tyr 1 TAG Pyl TAA Stop
CTT Leu CTC Leu CTG Leu CTA Leu 1
CCT Pro CCC Pro CCG Pro CCA Pro 1
CGT Arg CGC Arg CGG Arg CGA Arg 1
CAT His CAC His 1 CAG Gln CAA Gln 1
GTT Val GTC Val GTG Val GTA Val 1
GCT Ala GCC Ala GCG Ala GCA Ala 1
GGT Gly GGC Gly GGG Gly GGA Gly 1
GAT Asp GAC Asp 1 GAG Glu GAA Glu 1
ATT Ile ATC Ile 1 ATG Met 1 ATA Met
ACT Thr ACC Thr ACG Thr ACA Thr 1
AGT Ser AGC Ser 1 AGG Stop AGA Stop
AAT Asn AAC Asn 1 AAG Lys AAA Lys 1
Number of tRNA genes = 22
Number of D replacement loop tRNA genes = 1
tRNA GC range = 22.4% to 49.3%
Number of tmRNA genes = 0
NC_012920 Homo sapiens mitochondrion, complete genome.
16569 nucleotides in sequence
Mean G+C content = 44.4%
GenBank to Aragorn Comparison
22 annotated tRNA genes
22 detected tRNA genes
GenBank Aragorn
tRNA-Phe (577,647) mtRNA-Phe(gaa) [577,647]
tRNA-Val (1602,1670) mtRNA-Val(tac) [1602,1670]
tRNA-Leu (3230,3304) mtRNA-Leu(taa) [3229,3306]
tRNA-Ile (4263,4331) mtRNA-Ile(gat) [4263,4331]
tRNA-Gln c(4329,4400) mtRNA-Gln(ttg) c[4329,4400]
tRNA-Met (4402,4469) mtRNA-Met(cat) [4402,4469]
tRNA-Trp (5512,5579) mtRNA-Trp(tca) [5512,5579]
tRNA-Ala c(5587,5655) mtRNA-Ala(tgc) c[5585,5656]
tRNA-Asn c(5657,5729) mtRNA-Asn(gtt) c[5657,5729]
tRNA-Cys c(5761,5826) mtRNA-Cys(gca) c[5760,5826]
tRNA-Tyr c(5826,5891) mtRNA-Tyr(gta) c[5826,5891]
tRNA-Ser c(7446,7514) mtRNA-Ser(tga) c[7446,7514]
tRNA-Asp (7518,7585) mtRNA-Asp(gtc) [7518,7585]
tRNA-Lys (8295,8364) mtRNA-Lys(ttt) [8295,8364]
tRNA-Gly (9991,10058) mtRNA-Gly(tcc) [9990,10059]
tRNA-Arg (10405,10469) mtRNA-Arg(tcg) [10404,10470]
tRNA-His (12138,12206) mtRNA-His(gtg) [12138,12206]
tRNA-Ser (12207,12265) D-loop mtRNA-Ser(gct) [12207,12265]
tRNA-Leu (12266,12336) mtRNA-Leu(tag) [12266,12336]
tRNA-Glu c(14674,14742) mtRNA-Glu(ttc) c[14673,14743]
tRNA-Thr (15888,15953) mtRNA-Thr(tgt) [15887,15954]
tRNA-Pro c(15956,16023) mtRNA-Pro(tgg) c[15956,16023]
Number of false negative genes = 0
Number of false positive genes = 0
Number of false positive D-replacement tRNA genes = 0
Number of false positive TV-replacement tRNA genes = 0
This produces structures for every observed tRNA (the first 21 elided), compares the inferred tRNAs to the ones annotated in the Genbank file, and generates tables describing the locations of the RNAs.
In addition to these free-form textual outputs, ARAGORN offers a batch mode that consists of many tables separated by FASTA headers and annotation lines:
$ ./aragorn -w example.fa
>C181099742|CP024420|Alphaproteobacteria|Brucella
1 gene found
1 tRNA-Thr [11,86] 34 (tgt)
>C191077422|CP031664|Bacillota|Staphylococcus
1 gene found
1 tRNA-Ala [11,86] 34 (tgc)
>C181212629|LS483426|Betaproteobacteria|Kingella
1 gene found
1 tRNA-Gly [11,86] 34 (gcc)
>end 3 sequences 3 tRNA genes 0 tmRNA genes
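As a rough illustration of what a downstream parser for this batch output might look like, here is a Python sketch that handles only the tRNA rows shown above. The names `BatchGene` and `parse_batch` are made up, the meaning of the fourth column (called `anticodon_offset` here) is an assumption, and tmRNA rows or other variants are simply skipped.

```python
import re
from dataclasses import dataclass

# Matches tRNA rows such as "1 tRNA-Thr [11,86] 34 (tgt)".
# An optional "c" before the bracket is assumed to mark the complement strand.
GENE_LINE = re.compile(
    r"^(?P<index>\d+)\s+(?P<name>\S+)\s+(?P<comp>c?)\[(?P<start>-?\d+),(?P<end>-?\d+)\]"
    r"\s+(?P<offset>\d+)\s+\((?P<anticodon>[a-z]+)\)$"
)


@dataclass
class BatchGene:
    source: str            # FASTA header of the input entry
    index: int             # gene number within the entry
    name: str              # e.g. "tRNA-Thr"
    complement: bool
    start: int
    end: int
    anticodon_offset: int  # assumed meaning; the column is not documented here
    anticodon: str


def parse_batch(text: str) -> list[BatchGene]:
    """Parse tRNA rows from batch output; unrecognized lines are ignored."""
    genes: list[BatchGene] = []
    source = None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith(">end "):
            break  # summary line closes the output
        if line.startswith(">"):
            source = line[1:]
            continue
        m = GENE_LINE.match(line)
        if m and source is not None:
            genes.append(BatchGene(
                source=source,
                index=int(m["index"]),
                name=m["name"],
                complement=m["comp"] == "c",
                start=int(m["start"]),
                end=int(m["end"]),
                anticodon_offset=int(m["offset"]),
                anticodon=m["anticodon"],
            ))
    return genes


sample = """>C181099742|CP024420|Alphaproteobacteria|Brucella
1 gene found
1 tRNA-Thr [11,86] 34 (tgt)
>C191077422|CP031664|Bacillota|Staphylococcus
1 gene found
1 tRNA-Ala [11,86] 34 (tgc)
>C181212629|LS483426|Betaproteobacteria|Kingella
1 gene found
1 tRNA-Gly [11,86] 34 (gcc)
>end 3 sequences 3 tRNA genes 0 tmRNA genes
"""
for gene in parse_batch(sample):
    print(gene.source.split("|")[0], gene.name, gene.start, gene.end, gene.anticodon)
```

Note one fragility already visible in the sketch: an input entry whose header begins with "end " would be mistaken for the summary line.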
Additional parameters allow access to variations of the batch output. For
example, the flag -br adds the secondary structure of the tRNA in a
computer-readable form:
$ ./aragorn -br -w z.fa
>C181099742|CP024420|Alphaproteobacteria|Brucella
1 gene found
1 tRNA-Thr [11,86] 34 (tgt)
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgggagttcgagtctctctgggggcacca
(((((((ss((((dddddddd))))s(((((ccAAAcc)))))vvvvv(((((ttttttt))))))))))))
Adding the -svg parameter further appends the text of an SVG figure to the
output, yielding batch entries such as:
>C181212629|LS483426|Betaproteobacteria|Kingella
1 gene found
1 tRNA-Gly [11,86] 34 (gcc)
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgcgagttcgagactcgtttcccgctcca
(((((((ss((((dddddddd))))s(((((ccAAAcc)))))vvvvv(((((ttttttt))))))))))))
<svg xmlns='http://www.w3.org/2000/svg' version='1.1' width='9.6cm' height='10cm' viewBox='0 0 26 27'>
<title>tRNA-Gly(gcc)</title>
<g font-family='Courier New,Courier,monospace' font-size='1.4' text-anchor='middle' fill='black' stroke='none'>
<text x='12' y='25'>g</text><text x='13' y='25'>c</text><text x='14' y='25'>c</text><text x='11' y='24'>t</text>
<text x='15' y='24'>a</text><text x='11' y='23'>t</text><text x='15' y='23'>a</text><text x='12' y='22'>c</text>
<text x='13' y='22'>-</text><text x='14' y='22'>g</text><text x='12' y='21'>c</text><text x='13' y='21'>-</text>
<text x='14' y='21'>g</text><text x='12' y='20'>a</text><text x='13' y='20'>-</text><text x='14' y='20'>t</text>
<text x='12' y='19'>a</text><text x='13' y='19'>-</text><text x='14' y='19'>t</text><text x='12' y='18'>c</text>
<text x='13' y='18'>-</text><text x='14' y='18'>g</text><text x='15' y='18'>a</text><text x='16' y='18'>g</text>
<text x='4' y='17'>g</text><text x='5' y='17'>t</text><text x='6' y='17'>a</text><text x='11' y='17'>g</text>
<text x='17' y='17'>g</text><text x='3' y='16'>g</text><text x='7' y='16'>g</text><text x='8' y='16'>a</text>
<text x='9' y='16'>g</text><text x='10' y='16'>c</text><text x='16' y='16'>t</text><text x='3' y='15'>t</text>
<text x='15' y='15'>c</text><text x='21' y='15'>t</text><text x='22' y='15'>t</text><text x='4' y='14'>t</text>
<text x='7' y='14'>c</text><text x='8' y='14'>t</text><text x='9' y='14'>c</text><text x='10' y='14'>g</text>
<text x='16' y='14'>g</text><text x='17' y='14'>c</text><text x='18' y='14'>g</text><text x='19' y='14'>a</text>
<text x='20' y='14'>g</text><text x='23' y='14'>c</text><text x='5' y='13'>g</text><text x='6' y='13'>a</text>
<text x='11' y='13'>a</text><text x='16' y='13'>+</text><text x='23' y='13'>g</text><text x='12' y='12'>t</text>
<text x='16' y='12'>t</text><text x='17' y='12'>g</text><text x='18' y='12'>c</text><text x='19' y='12'>t</text>
<text x='20' y='12'>c</text><text x='23' y='12'>a</text><text x='13' y='11'>a</text><text x='14' y='11'>-</text>
<text x='15' y='11'>t</text><text x='21' y='11'>a</text><text x='22' y='11'>g</text><text x='13' y='10'>a</text>
<text x='14' y='10'>-</text><text x='15' y='10'>t</text><text x='13' y='9'>g</text><text x='14' y='9'>-</text>
<text x='15' y='9'>c</text><text x='13' y='8'>g</text><text x='14' y='8'>-</text><text x='15' y='8'>c</text>
<text x='13' y='7'>g</text><text x='14' y='7'>-</text><text x='15' y='7'>c</text><text x='13' y='6'>c</text>
<text x='14' y='6'>-</text><text x='15' y='6'>g</text><text x='13' y='5'>g</text><text x='14' y='5'>-</text>
<text x='15' y='5'>c</text><text x='15' y='4'>t</text><text x='16' y='3'>c</text><text x='17' y='2'>c</text>
<text x='18' y='2'>a</text>
</g><g fill='none' stroke='black' stroke-width='0.075'>
<line x1='7' y1='15' x2='7' y2='14.3'/><line x1='8' y1='15' x2='8' y2='14.3'/>
<line x1='9' y1='15' x2='9' y2='14.3'/><line x1='10' y1='15' x2='10' y2='14.3'/>
<line x1='17' y1='13' x2='17' y2='12.3'/><line x1='18' y1='13' x2='18' y2='12.3'/>
<line x1='19' y1='13' x2='19' y2='12.3'/><line x1='20' y1='13' x2='20' y2='12.3'/>
</g></svg>
The batch format is non-standard and would require a specialized parser. For the broadest interoperability, ARAGORN offers a FASTA-only mode. FASTA headers may be formatted with or without spaces and with or without gene numbering, so there are four output FASTA variants:
# default
$ ./aragorn -fo z.fa
>tRNA-Thr(tgt) [11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>tRNA-Ala(tgc) [11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>tRNA-Gly(gcc) [11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca
# space free
$ ./aragorn -fos z.fa
>tRNA-Thr(tgt)[11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>tRNA-Ala(tgc)[11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>tRNA-Gly(gcc)[11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca
# with numbering
$ ./aragorn -fon z.fa
>1-1 tRNA-Thr(tgt) [11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>2-1 tRNA-Ala(tgc) [11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>3-1 tRNA-Gly(gcc) [11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca
# space free with numbering
$ ./aragorn -fons z.fa
>1-1tRNA-Thr(tgt)[11,86]
gcccttatagctcagttggtagagcacctgatttgtaatcagggggtcgg
gagttcgagtctctctgggggcacca
>2-1tRNA-Ala(tgc)[11,86]
gggggcttagctcagctgggagagcgcctgctttgcacgcaggaggtcag
cggttcgatcccgctagtctccacca
>3-1tRNA-Gly(gcc)[11,86]
gcgggaatagctcagttggtagagcgcaaccttgccaaggttgaggtcgc
gagttcgagactcgtttcccgctcca
The space-free option is useful if the downstream tool breaks on spaces (which is unfortunately common). The gene numbers specify the FASTA entry index and the tRNA index, giving a handle that can link back to the original reference sequences in the input FASTA file.
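A downstream tool that wants to accept any of the four header variants has to recognize all of them. The following Python sketch does this with one regular expression; the pattern is inferred only from the example headers above, the function name `parse_header` is made up, and the optional `c` prefix for complement-strand hits is an assumption carried over from the coordinate style in the other output modes.

```python
import re
from typing import Optional

HEADER = re.compile(
    r"^>?"
    r"(?:(?P<entry>\d+)-(?P<gene>\d+))?\s*"     # optional "1-1" numbering
    r"(?P<name>[A-Za-z]+-[A-Za-z?]+)"           # e.g. tRNA-Thr
    r"\((?P<anticodon>[a-z]+)\)\s*"             # e.g. (tgt)
    r"(?P<comp>c?)\[(?P<start>-?\d+),(?P<end>-?\d+)\]$"
)


def parse_header(header: str) -> Optional[dict]:
    """Parse one FASTA-mode header; return None if it does not match."""
    m = HEADER.match(header.strip())
    if m is None:
        return None
    return {
        "entry": int(m["entry"]) if m["entry"] else None,
        "gene": int(m["gene"]) if m["gene"] else None,
        "name": m["name"],
        "anticodon": m["anticodon"],
        "complement": m["comp"] == "c",
        "start": int(m["start"]),
        "end": int(m["end"]),
    }


for h in (">tRNA-Thr(tgt) [11,86]", ">tRNA-Thr(tgt)[11,86]",
          ">1-1 tRNA-Thr(tgt) [11,86]", ">1-1tRNA-Thr(tgt)[11,86]"):
    print(parse_header(h))
```

The numbered variants are the only ones that let the result be joined back to an input entry, and then only by position in the input file.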
Parameterization
Behold! The vibroelectrohammerax with attached GPS! – Midjourney
Next, we examine the flags and options that modify ARAGORN’s behavior. Running
./aragorn -h displays a usage message detailing the program’s options – 77 in
total. In the following paragraphs, I argue that these options represent scope
creep – that the code and interface of ARAGORN handle too many concerns. To be
fair, my view is one of radical minimalism; by common standards, ARAGORN is
focused and streamlined. My fundamental criticism is about the community design
choice, not the ARAGORN tool itself.
Every new bioinformatics tool tends to be, well, a new tool. Each has an
independent learning curve and few common conventions. This is clearly seen with
command line argument parsing. UNIX has loose conventions for arguments: options
may have a short form with one dash and one letter (e.g., -r and -f), short
forms may be bundled (-r -f may be written as -rf), and long forms may be
written with two dashes and multiple letters (e.g., --recursive and --force).
Like many bioinformatics tools, ARAGORN does not follow these conventions. All
arguments in ARAGORN have a single dash and no bundling is supported. ARAGORN
also silently ignores unsupported arguments and allows mutually exclusive
arguments to be passed together, silently accepting the last.
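For contrast, here is a small sketch of the conventional behavior using Python's standard `argparse` module. The program name and all option names are hypothetical and are not ARAGORN's interface; the sketch only shows short and long forms, bundling of short flags, and hard failures (exit code 2) for unknown or conflicting options.

```python
import argparse
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for a hypothetical tRNA scanner (not ARAGORN's real CLI)."""
    parser = argparse.ArgumentParser(prog="trna-demo")
    # Topology options conflict, so argparse rejects them when combined.
    topology = parser.add_mutually_exclusive_group()
    topology.add_argument("-c", "--circular", action="store_true")
    topology.add_argument("-l", "--linear", action="store_true")
    parser.add_argument("-b", "--batch", action="store_true")
    parser.add_argument("-v", "--verbose", action="store_true")
    parser.add_argument("fasta", help="input FASTA file")
    return parser


def exit_code(argv: list[str]) -> Optional[int]:
    """Return the exit code argparse would use, or None if parsing succeeds."""
    try:
        build_parser().parse_args(argv)
    except SystemExit as err:
        return err.code
    return None


args = build_parser().parse_args(["-bv", "example.fa"])  # bundled short flags
print(args.batch, args.verbose, args.fasta)
print(exit_code(["-c", "-l", "example.fa"]))  # conflicting options are rejected
print(exit_code(["--bogus", "example.fa"]))   # unknown options are rejected
```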
ARAGORN’s 77 parameters alter input/output format, filter results, set algorithmic parameters, transform output, transform input, and switch functions. We will cover each set of parameters below.
Input/output parameters. As detailed above, the parameters passed to ARAGORN determine the basic output format (raw, batch, and FASTA) and specify variants of each. Downstream tools that use ARAGORN will need to be specialized for specific combinations of ARAGORN parameters.
Filter results. ARAGORN allows output tRNAs to be filtered by setting score thresholds to allow detection of low confidence RNAs and pseudogenes; by toggling acceptance of tRNAs overlapping on the opposite strand; by bounding the locations where introns may be inferred; and by a few other metrics. While these filtering steps may allow modest performance improvements, their primary purpose is to specialize the search. An alternative approach, and the one chosen for my refactor, is to filter less within the core algorithm and allow more downstream filtering by returning a data structure with rich structural information.
Transform outputs. The predicted tRNA structures may be further processed to identify the amino acid they carry. This requires translating the anticodon. However, the genetic code varies in small ways among certain branches of the tree of life. So ARAGORN contains flags to specify this genetic code. Codon translation, however, is a common problem shared by many tools. Therefore it is reasonable to delegate translation to a common outside library.
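As a sketch of how small this delegated step is, the following Python builds the standard genetic code and looks up the amino acid for an anticodon by reverse-complementing it to the codon it reads. The function names are made up for this example, and the variant table includes only the four vertebrate mitochondrial differences from the standard code; a shared library (Biopython, for instance, ships codon tables) would normally supply these tables.

```python
# Standard genetic code in the usual TCAG ordering ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Vertebrate mitochondrial code expressed as overrides of the standard code.
VERTEBRATE_MITO = {**STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an unambiguous DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def amino_acid_for_anticodon(anticodon: str, code: dict = STANDARD) -> str:
    """Return the one-letter amino acid for the codon paired with this anticodon."""
    return code[reverse_complement(anticodon)]


for anticodon in ("tgt", "tgc", "gcc"):
    print(anticodon, amino_acid_for_anticodon(anticodon))
print("tca", amino_acid_for_anticodon("tca", VERTEBRATE_MITO))
```

Choosing a genetic code then becomes a matter of passing a different table rather than setting a program flag.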
Transform inputs. A genome may be circular or linear and the matches may be on the plus or minus strand. Extending the search to both strands requires either reversing the search algorithm (tricky in general) or taking the reverse complement of the input sequence, running the original search, and then mapping the hits back to the plus strand. Dealing with circular genomes requires either altering the search algorithm to check the pattern indices and wrap around if they overflow or extending the input sequence far enough to match a full pattern and then remapping the patterns that overlap the right bound. These are non-trivial problems and we will return to them in the final part of this series.
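The two input transformations can be sketched independently of any particular search algorithm. In the Python below, `find_motif` is a toy stand-in for the real search (it only finds exact substring matches), all names are invented for the example, and coordinates are assumed to be 1-based and inclusive. A hit that wraps around the origin of a circular sequence is reported with an end smaller than its start.

```python
from typing import Callable

Hit = tuple[int, int]                 # (start, end), 1-based inclusive
Search = Callable[[str], list[Hit]]   # any function from sequence to hits

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_motif(motif: str) -> Search:
    """Toy search: report every exact occurrence of motif."""
    def search(seq: str) -> list[Hit]:
        hits, i = [], seq.find(motif)
        while i != -1:
            hits.append((i + 1, i + len(motif)))
            i = seq.find(motif, i + 1)
        return hits
    return search


def search_both_strands(seq: str, search: Search) -> list[tuple[Hit, str]]:
    """Run search on both strands and report all hits in plus-strand coordinates."""
    n = len(seq)
    hits = [(hit, "+") for hit in search(seq)]
    for start, end in search(reverse_complement(seq)):
        # Position p on the reverse complement is position n - p + 1 on the plus strand.
        hits.append(((n - end + 1, n - start + 1), "-"))
    return hits


def search_circular(seq: str, search: Search, max_len: int) -> list[Hit]:
    """Extend the sequence past its end so that wrapping hits can be found."""
    n = len(seq)
    extended = seq + seq[: max_len - 1]
    hits = []
    for start, end in search(extended):
        if start > n:
            continue  # duplicate of a hit already found near the origin
        hits.append((start, end if end <= n else end - n))
    return hits


print(search_both_strands("ACGTTTGA", find_motif("TCAA")))
print(search_circular("GTTTAC", find_motif("ACGT"), max_len=4))
```

Because both wrappers take the search as an argument, they can be written once and reused with any predictor.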
Switch functions. A final pair of flags switches the program from a tRNA to tmRNA search. tRNA and tmRNA prediction certainly overlap, and the two functions benefit from shared code, but tRNAs and tmRNAs have different output types. tRNAs must specify their codon and tmRNAs their tag.
ARAGORN’s parameters make sense for a lone tool. But in a wider bioinformatics context, they represent duplicated work and excess complexity, and they may cause interface conflicts.
The complexity of the file-oriented program
ARAGORN’s IO options are relatively user-friendly compared to other bioinformatics tools, contributing to its popularity. However, its file-oriented approach raises several problems.
Scope creep. As discussed in the parameterization section, bioinformatics tools often suffer from scope creep. It is not enough to just predict tRNAs; ARAGORN must also parse complex formats, generate a wide range of output formats, transform the input to support both strands and circular genomes, and translate codons across a curated set of genetic codes.
Lossy output. The FASTA output loses the input metadata and derived results including the tRNA structure and energy score. This is unavoidable, since FASTA is not designed to store metadata. Batch mode is richer but still loses most Genbank metadata. None of the formats provide easy access to tRNA sub-components, complicating tasks like comparing T-loop sequences across species.
Too many output types. ARAGORN offers raw, batch, and FASTA outputs, each with variations based on input options. This complicates caching and may require rerunning ARAGORN for different use cases. Which output should a downstream tool expect? Writing a parser to auto-detect output types is complex, and requiring specific flags for downstream tools is brittle. Directly wrapping ARAGORN in downstream tools limits pipeline flexibility and locks in one tRNA prediction algorithm.
Costly formatting. Parsing FASTA files is straightforward but requires careful implementation for efficiency and validation. Genbank files are even more difficult to handle due to their syntactic complexity and incompletely standardized format. Output formatting is complex, as ARAGORN must display structures, requiring additional UI parameters and code branches, increasing maintenance costs and user cognitive load.
Output formatting must meet user needs while maintaining backwards compatibility. Variations in input format might break ARAGORN, such as including gaps in the input FASTA sequence. Seemingly minor changes in output format can disrupt pipelines dependent on ARAGORN. For instance, replacing “!” with “|” in the ASCII structures could break downstream programs parsing T-loop sequences. Fear of breaking downstream dependencies can hinder a tool’s evolution.
Lack of clarity. Textual representations can be ambiguous, such as ARAGORN’s headers:
>tRNA-Thr(tgt) [11,86]
Without documentation, the meaning of numbers and symbols is unclear. A user may guess that 11 and 86 represent the start and end of the tRNA sequence, but is this 0- or 1-based and does the range include the acceptor stem? Is “tgt” the codon or the anticodon? Extensive documentation is needed, but this is tedious both to make and to read.
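Without documentation, one way to answer the coordinate question is to test a guess against known data. The Python sketch below checks the reported interval [11,86] against the Brucella entry from the example input. It assumes that the upper-case region of that tRNADB-CE entry marks the annotated tRNA and the lower-case letters are flanking sequence, which the source database may or may not guarantee.

```python
# Brucella entry from the example FASTA file, split for readability.
entry = (
    "gggcgcctgt"
    "GCCCTTATAGCTCAGTTGGTAGAGCACCTGATTTGTAATCAGGGGGTCGG"
    "GAGTTCGAGTCTCTCTGGGGGCACCA"
    "cttttcgaca"
)

start, end = 11, 86  # as reported in the header "tRNA-Thr(tgt) [11,86]"

# Hypothesis: coordinates are 1-based and inclusive at both ends.
trna = entry[start - 1:end]

print(len(entry), len(trna))
print(trna.isupper(), entry[:start - 1].islower(), entry[end:].islower())
```

For this entry the 1-based inclusive reading recovers exactly the 76 upper-case bases, including the terminal CCA, but a single example is evidence rather than a specification.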
Metadata is not easily propagated. ARAGORN’s input includes sequence and metadata, but output formats vary in what metadata they retain. In batch mode, the full FASTA header is passed to output tRNAs, but pure FASTA output loses links to original entries and predicted structures. Genbank input loses most original metadata.
In the morloc implementation, we replace all this complexity with a single
function of one type of input and one type of output. The input is a DNA
sequence given as a single string and the output is a list of 0 or more tRNA
records and their locations. There is no input metadata – threading metadata
from input to output is not the role of a tRNA predictor.
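The shape of this interface can be sketched in Python (this is not morloc code, and every name here is hypothetical). The predictor is just a function from a string to a list of records; attaching FASTA headers to the results is done by a separate, generic function that takes the predictor as an argument. The `placeholder_predictor` below is a stand-in with no biological meaning, used only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TRNA:
    start: int       # assumed 1-based inclusive
    end: int
    anticodon: str


# The whole predictor interface: one sequence string in, a list of records out.
Predictor = Callable[[str], list[TRNA]]


def annotate(records: list[tuple[str, str]], predict: Predictor) -> list[tuple[str, TRNA]]:
    """Thread FASTA headers to predictions outside of the predictor itself."""
    return [(header, hit) for header, seq in records for hit in predict(seq)]


def placeholder_predictor(seq: str) -> list[TRNA]:
    """Stand-in for a real predictor: reports one dummy hit for long inputs."""
    return [TRNA(1, len(seq), "nnn")] if len(seq) >= 70 else []


records = [("entry-a", "A" * 76), ("entry-b", "ACGT")]
print(annotate(records, placeholder_predictor))
```

Swapping in a different prediction algorithm means passing a different function; nothing about metadata handling has to change.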
Conclusions
In morloc, complex behavior is specified through function composition rather
than parameterization and runtime input/output polymorphism. This approach,
detailed in Part Three, transforms ARAGORN into a simple function that can act as
one piece in a larger analysis. The responsibility of ARAGORN becomes only the
prediction of tRNAs from a string; all other behavior is delegated to generic
libraries. This allows ARAGORN developers to focus on writing pure functions in
their simplest form and sharing these as their sole result. The researchers are
free to focus on algorithmic design and not software development.