Contents:
ARAGORN code
The Aragorn
program is written as a single 12000-line C file with dependencies
only on stdio
and stdlib
. The code is undocumented, but the style is
reasonably clear with little gratuitous obscurity or excess frameworking. The
code makes heavy use of pointer arithmetic, single-letter variable names, and
GOTO statements. They never use logical AND, rather opting for nested
if-statements. The programmer learned enough C to express their intent and gave
the bird to modern norms – I can respect that.
Most of the code is wavy lines of conditions speckled with many mutating pointers into a sequence. Here is one short function as an example:
void overlap(data_set *d, int sort[], int n, int it, csw *sw)
{ int i,j,flag,cross,crosstoo;
long a,b,e,f,a2,b2,e2,f2,psmax;
char sname[100],s[100];
= 0;
flag = 0;
cross = d->psmax;
psmax = ts[it].start;
a = ts[it].stop;
b if (b < a)
{ a2 = a - psmax;
= b;
b2 += psmax;
b = 1; }
cross = -1;
j while (++j < n)
{ i = sort[j];
if (i == it) continue;
= ts[i].start;
e = ts[i].stop;
f = 0;
crosstoo if (f < e)
{ e2 = e - psmax;
= f;
f2 += psmax;
f = 1; }
crosstoo if (a <= f)
if (b >= e)
goto OV;
if (crosstoo)
if (a <= f2)
if (b >= e2)
goto OV;
if (cross)
{ if (a2 <= f)
if (b2 >= e)
goto OV;
if (crosstoo)
if (a2 <= f2)
if (b2 >= e2)
goto OV; }
continue;
:
OVif (!flag) fputc('\n',sw->f);
(ts+i,sname,1,sw);
name(s,ts+i,sw,sname);
location(sw->f,"Overlap with %d: %s\n", j+1,s);
fprintf= 1; }
flag if (flag) fputc('\n',sw->f); }
There are few relevant particulars. First, notice the print statements at the
end of the function. IO and algorithmic logic are interwoven throughout the
program. Second, the return type is void
. Nothing is returned. Instead the
function is run for its side effects. One of these effects is the print
statements. Three mutable values are passed into the function (d
, sort
,and
sw
). Skimming the code I don’t see any place where these arguments actually
are mutated. But there is also a global value, ts
, which is a vector of
observed tRNAs, that is mutated in the location
function.
No allocated memory is freed in the program. This may be fine for an executable that always runs to completion on a machine with excess memory. However, failure to free allocated memory can cause memory leaks for long running programs. And since I am making a library, I do not know how the exported functions will be used. So I need to be sure to free any allocated memory.
The program also originally had no dynamic data structures. For example, to store inferred tRNAs, a single array of 3000 elements is allocated. As new tRNAs are predicted, they are set to elements in the array. If the array index is equal to 3000, then an error is raised.
The choice to use pure C has its perks, including portability between systems and very fast compile times. But I’ve moved to C++ by replacing static data structures with vectors and strings from the standard template library.
Without documentation, I am left to guess what much of the code does. For example, the matrix below:
int gc[6][6] = { { 0,0,0,0,0,0 },
{ 0,0,1,0,1,0 },
{ 0,1,0,0,1,0 },
{ 0,0,0,0,0,0 },
{ 0,1,1,0,1,0 },
{ 0,0,0,0,0,0 } };
I know that input sequence is mapped to 6 values: the four nucleotides (A, C, G,
T), an unknown base (N), and an invalid character. A 6-by-6 matrix would map
pairs of bases to one another. This matrix is symmetric, is named gc
, and maps
pairs of bases to binary values. G maps to C and N maps to G or C. Altogether,
this matrix appears to be designed to detect GC pairs. Easy enough.
Similarly, the matrix below recognizes canonical and wobble (G-T) base pairs:
int bp[6][6] = { { 0,0,0,1,1,0 },
{ 0,0,1,0,1,0 },
{ 0,1,0,1,1,0 },
{ 1,0,1,0,1,0 },
{ 1,1,1,1,1,0 },
{ 0,0,0,0,0,0 } };
The matrices are used in code like the following:
= sc;
s1 = sc + 16;
s2 = sc + 6;
sd = bp[*s1][*--s2];
j while (++s1 < sd) j += bp[*s1][*--s2];
if (j >= 6) { /* handle matched D-stem */ }
This code recognizes a hairpin. sc
is pointer to a current search location in
the sequence. j
is a count of the number of base-pair matches. s1
and s2
are pointers to the potential start and stop of the hairpin loop. The ++s1
and
--s2
operations are pre-increment and pre-decrement operations. These mutate
the pointers, shifting to the next pair of bases in the stem, and return the new
values to the expression. The while
loop counts the number of matches in the
stem. If six stem bases match, then the if
condition is met and block for a
matched D-stem is executed.
stem_1 loop stem_2
------ ------
xxx.................xxxxx
^ ^ ^
sc sd sc + 16
^ ^
s1 --> <-- s2
Refactor
I’ve talked in depth so far about what I want to remove from ARAGORN.
I have always argued that morloc
frees the programmer to write simpler
code. Within the morloc
ecosystem, a scientific programmer need only write the
core algorithms of scientific interest. They can ignore parsing and formatting,
user interfaces, parallelism (usually), metadata propagation, and similar
complexities. In this section, I will show how this works in practice.
A new data representation
What data should be returned? First, the RNA type and anticodon are certainly
needed (e.g., tRNA-Arg
and tct
). Second, the location is needed (the
sequence ID and start and stop indices). Third, the inferred structure may
optionally be returned as a connectivity matrix. There may be additional values
that would be useful to return, such as some measure of statistical
significance, the melting points of various structures, the G/C content, and
such.
The input is a configuration record and a sequence. So the basic type signatures will be something like this:
data AragornConfig = AragornConfig {
...
}
findTRNA :: AragornConfig -> DNA -> [(RNAType, Start, Stop, Anticodon, AminoAcid, Matrix)]
void bopt_fastafile(data_set *d, csw *sw)
{ int i,nt,flag,len;
int *s,*sf,*se,*sc,*swrap;
int seq[2*LSEQ+WRAP+1],cseq[2*LSEQ+WRAP+1],wseq[2*WRAP+1];
long gap,start,rewind,drewind,psmax,tmaxlen,vstart,vstop;
double sens;
FILE *f = sw->f;
= MAXTAGDIST + 20;
rewind if (sw->trna | sw->mtrna)
{ tmaxlen = MAXTRNALEN + sw->maxintronlen;
if (rewind < tmaxlen) rewind = tmaxlen; }
if (sw->tmrna)
if (rewind < MAXTMRNALEN) rewind = MAXTMRNALEN;
if (sw->peptide)
if (sw->tagthresh >= 5)
if (rewind < TSWEEP) rewind = TSWEEP;
->loffset = rewind;
sw->roffset = rewind;
sw= 2*rewind;
drewind ->ns = 0;
d->nf = 0;
d->nextseq = 0L;
d->nextseqoff = 0L; d
12241 12240 test-trna_sequence_cmp_arc_1.txt 958811 958748 test-trna_sequence_cmp_bac_1.txt 2188 2188 test-trna_sequence_fungi_1.txt 13919 13897 test-trna_sequence_phage_1.txt 1352 1350 test-trna_sequence_plant_1.txt 3852 3850 test-trna_sequence_plasmid_1.txt 974 974 test-trna_sequence_virus_1.txt
This not-so-simple picture is further complicated by the fact that tRNAs may have introns. Strangely, these introns are not spliced out by the same mechanisms that splice mRNAs. Introns may be present in tRNA bacteria, not just Eukaryotes.
tRNA introns (Schmidt 2019)
CCA acceptor stem - this may be encoded in the tRNA itself or it may be added in a post-processing step.
`Aragorn is reasonably tolerant of malformed input. All bytes are translated into either DNA bases, unknowns, or ambiguous bases.