Part 2: Refactoring the code

by Zebulun Arendsee

Contents:

ARAGORN code

The Aragorn program is written as a single 12000-line C file with dependencies only on stdio and stdlib. The code is undocumented, but the style is reasonably clear with little gratuitous obscurity or excess frameworking. The code makes heavy use of pointer arithmetic, single-letter variable names, and GOTO statements. They never use logical AND, rather opting for nested if-statements. The programmer learned enough C to express their intent and gave the bird to modern norms – I can respect that.

Most of the code is wavy lines of conditions speckled with many mutating pointers into a sequence. Here is one short function as an example:

void overlap(data_set *d, int sort[], int n, int it, csw *sw)
{ int i,j,flag,cross,crosstoo;
  long a,b,e,f,a2,b2,e2,f2,psmax;
  char sname[100],s[100];
  flag = 0;
  cross = 0;
  psmax = d->psmax;
  a = ts[it].start;
  b = ts[it].stop;
  if (b < a)
   { a2 = a - psmax;
     b2 = b;
     b += psmax;
     cross = 1; } 
  j = -1;
  while (++j < n)
   { i = sort[j];
     if (i == it) continue;
     e = ts[i].start;
     f = ts[i].stop;
     crosstoo = 0;
     if (f < e)
      { e2 = e - psmax;
        f2 = f;
        f += psmax;
        crosstoo = 1; }
     if (a <= f)
      if (b >= e)
       goto OV;
     if (crosstoo)
      if (a <= f2)
       if (b >= e2)
        goto OV;
     if (cross)
      { if (a2 <= f)
         if (b2 >= e)
          goto OV;
        if (crosstoo)
         if (a2 <= f2)
          if (b2 >= e2)
           goto OV; }
     continue;
     OV:
     if (!flag) fputc('\n',sw->f);
     name(ts+i,sname,1,sw);
     location(s,ts+i,sw,sname);
     fprintf(sw->f,"Overlap with %d: %s\n", j+1,s);
     flag = 1; }
 if (flag) fputc('\n',sw->f); }

There are few relevant particulars. First, notice the print statements at the end of the function. IO and algorithmic logic are interwoven throughout the program. Second, the return type is void. Nothing is returned. Instead the function is run for its side effects. One of these effects is the print statements. Three mutable values are passed into the function (d, sort,and sw). Skimming the code I don’t see any place where these arguments actually are mutated. But there is also a global value, ts, which is a vector of observed tRNAs, that is mutated in the location function.

No allocated memory is freed in the program. This may be fine for an executable that always runs to completion on a machine with excess memory. However, failure to free allocated memory can cause memory leaks for long running programs. And since I am making a library, I do not know how the exported functions will be used. So I need to be sure to free any allocated memory.

The program also originally had no dynamic data structures. For example, to store inferred tRNAs, a single array of 3000 elements is allocated. As new tRNAs are predicted, they are set to elements in the array. If the array index is equal to 3000, then an error is raised.

The choice to use pure C has its perks, including portability between systems and very fast compile times. But I’ve moved to C++ by replacing static data structures with vectors and strings from the standard template library.

Without documentation, I am left to guess what much of the code does. For example, the matrix below:

int gc[6][6] = { { 0,0,0,0,0,0 },
                 { 0,0,1,0,1,0 },
                 { 0,1,0,0,1,0 },
                 { 0,0,0,0,0,0 },
                 { 0,1,1,0,1,0 },
                 { 0,0,0,0,0,0 } };

I know that input sequence is mapped to 6 values: the four nucleotides (A, C, G, T), an unknown base (N), and an invalid character. A 6-by-6 matrix would map pairs of bases to one another. This matrix is symmetric, is named gc, and maps pairs of bases to binary values. G maps to C and N maps to G or C. Altogether, this matrix appears to be designed to detect GC pairs. Easy enough.

Similarly, the matrix below recognizes canonical and wobble (G-T) base pairs:

int bp[6][6] = { { 0,0,0,1,1,0 },
                 { 0,0,1,0,1,0 },
                 { 0,1,0,1,1,0 },
                 { 1,0,1,0,1,0 },
                 { 1,1,1,1,1,0 },
                 { 0,0,0,0,0,0 } };

The matrices are used in code like the following:

s1 = sc;
s2 = sc + 16;
sd = sc + 6;
j = bp[*s1][*--s2];
while (++s1 < sd) j += bp[*s1][*--s2];
if (j >= 6) { /* handle matched D-stem */ }

This code recognizes a hairpin. sc is pointer to a current search location in the sequence. j is a count of the number of base-pair matches. s1 and s2 are pointers to the potential start and stop of the hairpin loop. The ++s1 and --s2 operations are pre-increment and pre-decrement operations. These mutate the pointers, shifting to the next pair of bases in the stem, and return the new values to the expression. The while loop counts the number of matches in the stem. If six stem bases match, then the if condition is met and block for a matched D-stem is executed.

  stem_1 loop  stem_2  
   ------    ------
xxx.................xxxxx
   ^     ^         ^ 
   sc    sd        sc + 16
   ^               ^
   s1 -->      <-- s2

Refactor

I’ve talked in depth so far about what I want to remove from ARAGORN.

I have always argued that morloc frees the programmer to write simpler code. Within the morloc ecosystem, a scientific programmer need only write the core algorithms of scientific interest. They can ignore parsing and formatting, user interfaces, parallelism (usually), metadata propagation, and similar complexities. In this section, I will show how this works in practice.

A new data representation

What data should be returned? First, the RNA type and anticodon are certainly needed (e.g., tRNA-Arg and tct). Second, the location is needed (the sequence ID and start and stop indices). Third, the inferred structure may optionally be returned as a connectivity matrix. There may be additional values that would be useful to return, such as some measure of statistical significance, the melting points of various structures, the G/C content, and such.

The input is a configuration record and a sequence. So the basic type signatures will be something like this:

data AragornConfig = AragornConfig {
  ...
}

findTRNA :: AragornConfig -> DNA -> [(RNAType, Start, Stop, Anticodon, AminoAcid, Matrix)] 
void bopt_fastafile(data_set *d, csw *sw)
{ int i,nt,flag,len;
  int *s,*sf,*se,*sc,*swrap;
  int seq[2*LSEQ+WRAP+1],cseq[2*LSEQ+WRAP+1],wseq[2*WRAP+1];
  long gap,start,rewind,drewind,psmax,tmaxlen,vstart,vstop;
  double sens;
  FILE *f = sw->f;
  rewind = MAXTAGDIST + 20;
  if (sw->trna | sw->mtrna)
   { tmaxlen = MAXTRNALEN + sw->maxintronlen;
     if (rewind < tmaxlen) rewind = tmaxlen; }
  if (sw->tmrna)
   if (rewind < MAXTMRNALEN) rewind = MAXTMRNALEN;
  if (sw->peptide)
   if (sw->tagthresh >= 5)
    if (rewind < TSWEEP) rewind = TSWEEP;
  sw->loffset = rewind;
  sw->roffset = rewind;
  drewind = 2*rewind;
  d->ns = 0;
  d->nf = 0;
  d->nextseq = 0L;
  d->nextseqoff = 0L;

12241 12240 test-trna_sequence_cmp_arc_1.txt 958811 958748 test-trna_sequence_cmp_bac_1.txt 2188 2188 test-trna_sequence_fungi_1.txt 13919 13897 test-trna_sequence_phage_1.txt 1352 1350 test-trna_sequence_plant_1.txt 3852 3850 test-trna_sequence_plasmid_1.txt 974 974 test-trna_sequence_virus_1.txt

This not-so-simple picture is further complicated by the fact that tRNAs may have introns. Strangely, these introns are not spliced out by the same mechanisms that splice mRNAs. Introns may be present in tRNA bacteria, not just Eukaryotes.

tRNA introns (Schmidt 2019)

CCA acceptor stem - this may be encoded in the tRNA itself or it may be added in a post-processing step.

`Aragorn is reasonably tolerant of malformed input. All bytes are translated into either DNA bases, unknowns, or ambiguous bases.

built on 2024-08-12 11:47:46.22895832 UTC from file 2024-08-17-aragorn-2