Contents:
- Why I write and who you are
- Haskell - The language of terms
- Haskell - The language of types
- Haskell - Typeclasses
- Interlude: how typing simplifies development
- Parser combinators: Fast streaming of huge and complex files
under construction
Bioinformaticians tend to be wizards at regular expressions (regex). We can spin out expressions like
sed 's/^>\([^ ]*\).*\[\([^]]*\)\]$/>\1|\2/'
on the fly before our morning coffee.
But this magic has its limits. Let’s explore the parsing of tree files in newick format. Here is an example tree:
A diagram of the tree specified by the newick code on the left
((A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.6,G:0.7)H:0.8;
The leaf names are always preceded by a left parenthesis or a comma. I can latch onto the pattern and
$ read -r tree << EOF
((A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.6,G:0.7)H:0.8;
EOF
$ echo $tree | grep -Eo "[(,][A-Z]" | sed 's/^.//'
A
B
C
D
G
The read
clause stores the tree file in the variable tree
. Which I then echo
into grep
. The -E
flag turns on POSIX regular expression support and the
-o
flag tells grep to print each match on a new row.
But what if the leaf names aren’t just a single capitol letter? They should be
anything but a “,”, “:”, “;”, “(”, or “)”. To support this, I can adapt the
grep
regex as follows:
"[(,][^,:;()]+"
That works, but leaf names may be quoted. When quoted, they may contain any character other than an un-escaped end quote.
'[(,]([^,:;()]+|"(\\"|[^,:;()"])+")'
Lovely. This works for double quotes. But what if there are singly quotes also?
'[(,]([^,:;()]+|"(\\"|[^,:;()"])+"|'"'(\\'|[^,:;()'])+')"
Note the little switch in bash quotation. This is becoming a little hard to read, even for my old bioinformatician eyes.
What if we wanted to grep out the parent node of a particular leaf?
$ read -r tree << EOF
((A:0.1,B:0.2,(C:0.3,D:0.4)E:0.5)F:0.6,G:0.7)H:0.8;
EOF
$ echo $tree | ???
A F
B F
C E
D E
G H
This is not easy. While it may be possible with sufficient Perl regex magic, parsing recursive data structures such as this is generally not possible with regular expressions.
What alternative is there? Lucky for us, parsing is a highly developed field. I will introduce you to a type of parser called a recursive descent parser.
Haskell has several powerful parser libraries. I will use Attoparsec. This library is blazing fast. It sacrifices flexibility, but for the relatively simple needs of parsing bioinformatics formats, it should be fine.
Our parser will transform a newick string into a Tree
data structure. We can
then write functions of the tree to do whatever we wish. For the tree data
structure, we define a new Tree
algebraic data type:
You can find my full code for this demo on github.
data Tree
= Leaf Text -- a leaf and its name
| Node -- an internal node
Tree, Maybe Double)] -- node children and optional edge lengths
[(Maybe Text) -- optional node name (
This data structure can easily be explored through structural recursion. For example:
import Data.Maybe (fromMaybe)
-- collect the labels for all leaves
leafLabels :: Tree -> [Text]
Leaf label) = [label]
leafLabels (Node children _) = concatMap (leafLabels . fst) children
leafLabels (
-- collect all leaf/parent pairs
leafParents :: Tree -> [(Text, Text)]
Leaf _) = []
leafParents (Node children maybeName) = concatMap (f . fst) children where
leafParents ( f :: Tree -> [(Text, Text)]
Leaf leafName) = [(leafName, fromMaybe "-" maybeName)]
f (= leafParents tree f tree
I am introducing two new functions here concatMap
from Prelude and fromMaybe
from the core module Data.Maybe
. There signatures are as follows:
concatMap :: Foldable t => (a -> [b]) -> t a -> [b]
fromMaybe :: a -> Maybe a -> a
You should be able to guess what each function does from its signature. You
might need to review the Foldable
typeclass from the typeclass
post.
Here is the skeleton of our parser program:
import Data.Attoparsec.Text
import Control.Applicative ((<|>))
newickParser :: Parser Tree
= do
newickParser <- treeParser
(tree, _) <- char ';'
_ return tree
treeParser :: Parser (Tree, Maybe Double)
= leafParser <|> nodeParser treeParser
We first import the Attoparsec library for parsing Text
data
types. Previously, we’ve worked with String
types and I mentioned that they
are inefficient since the are implemented as linked lists of characters. The
Text
type is much more efficient and is used commonly in production
code when performance is important. Text
supports unicode characters, so our
parser will handle Chinese characters, accents, emoticons, and whatever else
might show up in our trees.
After the imports, I’ve defined four parser functions. The Parser
type is the
core monad used in Attoparsec. It wraps the objects we are parsing out of the
text. Every parser we write will have Parser a
as the return type.
The top-level parser newickParser
handles a complete newick file. It first
calls the treeParser
parser to get a tree, then it matches against the
required terminal semicolon of a newick file, and finally it returns the
tree. Since Parser
is a monad, we can use the do
notation to implement
parsers in a step-by-step iterative style.
treeParser
is composed of two parsers: leafParser
and nodeParser
. <|>
is
the “alternative” operator. It indicates that if leafParser
fails to match the
text, then nodeParser
is called instead.
This should give you a feel for the structure of a recursive descent parser. They are defined top down. Each parser is defined in terms of sub-parsers. All the way down to primitive matchers.
The Parser
monad stores an index into the text that is being parsed. As
parsers match patterns to text, the index increments, “consuming” input. When a
parser fails, any update to index is not automatically undone. That is, a “undo”
trace is not maintained. For this reason, you need to mindful of patterns that
may partially match. In these cases, you may use the try
function to store the
original index.
Moving on leafParser
and nodeParser
may be defined as follows:
leafParser :: Parser (Tree, Maybe Double)
= do
leafParser -- match a leaf label (e.g., `unicorn` or `'a weird bug'`)
<- nameParser
leaf
-- match a branch length (e.g., `:4.5`)
<- optional lengthParser
mayLength
-- return the leaf and the optional branch length
return (Leaf leaf, mayLength)
nodeParser :: Parser (Tree, Maybe Double)
= do
nodeParser -- match the opening parenthesis of a subtree
<- char '('
_
-- match a comma separated list of child trees
<- sepBy1 treeParser (char ',')
children
-- match the closing parenthesis of a subtree
<- char ')'
_
-- match the node label and length
<- optional nameParser
mayName <- optional lengthParser
mayLength
-- return the node and optional length
return (Node children mayName, mayLength)
nameParser
and lengthParser
will parse names (optionally quoted) and branch
lengths. I’ll define them shortly. The optional
function transforms a parser
to return a Maybe
value (where Nothing
indicates a failure).
The lengthParser
is simple:
lengthParser :: Parser Double
= char ':' >> double lengthParser
The nameParser
is by far the most complex piece of the parser since it has to
handle quotation:
nameParser :: Parser Text
= quotedName '"' <|> quotedName '\'' <|> unquotedName
nameParser where
unquotedName :: Parser Text
=
unquotedName -- take a string of legal characters
",():;'\"")
takeWhile1 (notInClass
quotedName :: Char -> Parser Text
= do
quotedName quoteChar
-- match opening quote
<- char quoteChar
_
-- match the name
-- the name is comprised of a concatenation of strings of unquoted
-- characters and single escaped characters
<- foldl1 (<>) <$> many1 (unescapedChar quoteChar <|> escapedChar)
name
-- match closing quote
<- char quoteChar
_ return name
unescapedChar :: Char -> Parser Text
=
unescapedChar quoteChar -- match text with characters other than foward slashes or quote characters
-> x /= '\\' && x /= quoteChar)
takeWhile1 (\x
escapedChar :: Parser Text
= do
escapedChar -- match any character that is preceded by a forward slash
<- char '\\' >> anyChar
c
-- convert the Char to Text and return
return $ singleton c
The >>
operator throws away the value stored in a monad while preserving the
monadic state. The operator has the type:
(>>) :: Monad m => m a -> m b -> m b
In contrast, the monadic bind operator has the type:
(>>=) :: Monad m => m a -> (a -> m b) -> m b
See the repo on github for the full code. I have simplified the programming a bit here. In the GitHub repo, I also account for whitespace.
Writing a parser that builds a well-defined data structure, as we have done here, requires more work than just spinning out a fast, specialized regular expression and munging the results with sed and awk. But the extra effort makes the code more robust, extensible, readable, and modular.