
import Data.Maybe (fromMaybe)

-- collect the labels for all leaves
leafLabels :: Tree -> [Text]
leafLabels (Leaf label) = [label]
leafLabels (Node children _) = concatMap (leafLabels . fst) children

-- collect all leaf/parent pairs
leafParents :: Tree -> [(Text, Text)]
leafParents (Leaf _) = []
leafParents (Node children maybeName) = concatMap (f . fst) children where
    f :: Tree -> [(Text, Text)]
    f (Leaf leafName) = [(leafName, fromMaybe "-" maybeName)]
    f tree = leafParents tree

I am introducing two new functions here: concatMap from Prelude and fromMaybe from the core module Data.Maybe. Their signatures are as follows:

concatMap :: Foldable t => (a -> [b]) -> t a -> [b]
fromMaybe :: a -> Maybe a -> a

You should be able to guess what each function does from its signature. You might need to review the Foldable typeclass from the typeclass post.
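If the signatures alone are not enough, a few concrete evaluations may help. The following is a minimal sketch using only base; the names duplicated, named, unnamed, and overMaybe are made up for illustration.

```haskell
import Data.Maybe (fromMaybe)

-- concatMap applies a list-producing function to every element
-- and concatenates the resulting lists.
duplicated :: [Int]
duplicated = concatMap (\x -> [x, x]) [1, 2, 3]
-- [1,1,2,2,3,3]

-- fromMaybe unwraps a Just, or falls back to the default on Nothing.
named :: String
named = fromMaybe "-" (Just "root")
-- "root"

unnamed :: String
unnamed = fromMaybe "-" Nothing
-- "-"

-- concatMap accepts any Foldable container, not only lists.
-- A Maybe folds over its zero or one elements.
overMaybe :: [Int]
overMaybe = concatMap (\x -> [x, x + 1]) (Just 10)
-- [10,11]
```

This is the same pattern leafParents uses: concatMap flattens the per-child results, and fromMaybe supplies "-" for unnamed parents.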

Here is the skeleton of our parser program:

import Data.Attoparsec.Text
import Control.Applicative ((<|>), optional)
import Data.Text (Text, singleton)

newickParser :: Parser Tree
newickParser = do
    (tree, _) <- treeParser
    _ <- char ';'
    return tree

treeParser :: Parser (Tree, Maybe Double)
treeParser = leafParser <|> nodeParser

We first import the Attoparsec library for parsing Text data types. Previously, we’ve worked with String types and I mentioned that they are inefficient since they are implemented as linked lists of characters. The Text type is much more efficient and is commonly used in production code when performance is important. Text supports Unicode characters, so our parser will handle Chinese characters, accents, emoticons, and whatever else might show up in our trees.

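After the imports, the skeleton defines two parsers, newickParser and treeParser, and refers to two more, leafParser and nodeParser, which are defined below. The Parser type is the core monad used in Attoparsec. A value of type Parser a is a parser that, when run, yields a value of type a extracted from the text. Every parser we write will have a type of the form Parser a.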

The top-level parser newickParser handles a complete Newick file. It first calls treeParser to get a tree, then it matches the required terminal semicolon of a Newick file, and finally it returns the tree. Since Parser is a monad, we can use do notation to implement parsers in a step-by-step, imperative-looking style.
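To see what do notation is doing for a parser, it can help to write the same small parser three ways. This is a sketch on a made-up grammar (an integer followed by a semicolon), not part of the Newick parser; the names statementDo, statementBind, and statementApp are hypothetical.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, char, decimal, parseOnly)

-- do notation: run each step in order, naming the results we need.
statementDo :: Parser Int
statementDo = do
  n <- decimal
  _ <- char ';'
  return n

-- The same parser with the do block desugared into explicit binds.
statementBind :: Parser Int
statementBind = decimal >>= \n -> char ';' >>= \_ -> return n

-- The same parser in applicative style: keep the left result,
-- run the right parser only for its effect on the input.
statementApp :: Parser Int
statementApp = decimal <* char ';'

parsedDo :: Either String Int
parsedDo = parseOnly statementDo "7;"      -- Right 7

parsedBind :: Either String Int
parsedBind = parseOnly statementBind "7;"  -- Right 7

parsedApp :: Either String Int
parsedApp = parseOnly statementApp "7;"    -- Right 7
```

By the same reasoning, newickParser could be written as fst <$> treeParser <* char ';', though the do block arguably reads more clearly.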

treeParser is composed of two parsers: leafParser and nodeParser. <|> is the “alternative” operator. It indicates that if leafParser fails to match the text, then nodeParser is called instead.
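As a small illustration of <|> in isolation, here is a sketch with a made-up parser, petParser, that accepts either of two keywords. The parser and the inputs are hypothetical; only string, parseOnly, and <|> come from the libraries.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, parseOnly, string)
import Data.Text (Text)

-- Try "cat" first; if that fails, try "dog" on the same input.
petParser :: Parser Text
petParser = string "cat" <|> string "dog"

firstMatches :: Either String Text
firstMatches = parseOnly petParser "cat"    -- Right "cat"

secondMatches :: Either String Text
secondMatches = parseOnly petParser "dog"   -- Right "dog"

neitherMatches :: Either String Text
neitherMatches = parseOnly petParser "emu"  -- a Left carrying an error message
```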

This should give you a feel for the structure of a recursive descent parser. Such parsers are defined top down: each parser is defined in terms of sub-parsers, all the way down to primitive matchers.

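The Parser monad tracks a position in the text that is being parsed. As parsers match patterns in the text, the position advances, “consuming” input. In Attoparsec, a parser that fails always backtracks: the position is rewound to where that parser started, so the next alternative of <|> sees the same input. This differs from Parsec, where a branch that fails after consuming input is not retried unless it is wrapped in try. Attoparsec exports try only for compatibility with Parsec, and it has no effect there. You should still be mindful of patterns that may partially match, since every failed partial match is work that the next alternative has to repeat.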
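A small experiment shows what happens when a branch of <|> fails after consuming input. The Attoparsec documentation states that its parsers always backtrack on failure, so in the sketch below the second branch should start again from the beginning of the input. The parser abOrAc is a made-up example, not part of the Newick parser.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, char, parseOnly)

-- Both branches begin with 'a'. On the input "ac" the first branch
-- matches 'a', then fails on 'c' because it expects 'b'.
abOrAc :: Parser Char
abOrAc = (char 'a' >> char 'b') <|> (char 'a' >> char 'c')

-- The second branch succeeds, which is only possible if the 'a'
-- consumed by the failed first branch was given back.
partialMatch :: Either String Char
partialMatch = parseOnly abOrAc "ac"  -- Right 'c'
```

In a library without automatic backtracking, such as Parsec, the first branch would need to be wrapped in try for this input to parse.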

Moving on, leafParser and nodeParser may be defined as follows:

leafParser :: Parser (Tree, Maybe Double)
leafParser = do
  -- match a leaf label (e.g., `unicorn` or `'a weird bug'`)
  leaf <- nameParser

  -- match a branch length (e.g., `:4.5`)
  mayLength <- optional lengthParser

  -- return the leaf and the optional branch length
  return (Leaf leaf, mayLength)

nodeParser :: Parser (Tree, Maybe Double)
nodeParser = do
  -- match the opening parenthesis of a subtree
  _ <- char '('

  -- match a comma separated list of child trees
  children <- sepBy1 treeParser (char ',')

  -- match the closing parenthesis of a subtree
  _ <- char ')'

  -- match the node label and length
  mayName <- optional nameParser
  mayLength <- optional lengthParser

  -- return the node and optional length
  return (Node children mayName, mayLength)

nameParser and lengthParser parse names (optionally quoted) and branch lengths, respectively. I’ll define them shortly. The optional function, from Control.Applicative, transforms a parser so that it returns a Maybe value: Just the result if the parser succeeds, and Nothing, with no input consumed, if it fails.
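Here is optional on its own, as a sketch with a hypothetical parser, countParser, that reads an integer followed by an optional exclamation mark.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative (optional)
import Data.Attoparsec.Text (Parser, char, decimal, parseOnly)

-- An integer, then maybe a '!'. The parse succeeds either way.
countParser :: Parser (Int, Maybe Char)
countParser = do
  n <- decimal
  bang <- optional (char '!')
  return (n, bang)

withBang :: Either String (Int, Maybe Char)
withBang = parseOnly countParser "42!"    -- Right (42,Just '!')

withoutBang :: Either String (Int, Maybe Char)
withoutBang = parseOnly countParser "42"  -- Right (42,Nothing)
```

leafParser and nodeParser use the same idea to make the branch length and the node name optional.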

The lengthParser is simple:

lengthParser :: Parser Double
lengthParser = char ':' >> double
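A quick way to check a small parser like this is to run it with parseOnly. The sketch below repeats the definition so that it is self-contained; the sample inputs and the names goodLength, wholeLength, and missingColon are made up.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Attoparsec.Text (Parser, char, double, parseOnly)

-- Same definition as in the text: a colon followed by a number.
lengthParser :: Parser Double
lengthParser = char ':' >> double

goodLength :: Either String Double
goodLength = parseOnly lengthParser ":4.5"   -- Right 4.5

wholeLength :: Either String Double
wholeLength = parseOnly lengthParser ":10"   -- Right 10.0

-- Without the leading colon the parser fails.
missingColon :: Either String Double
missingColon = parseOnly lengthParser "4.5"  -- a Left carrying an error message
```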

The nameParser is by far the most complex piece of the parser since it has to handle quotation:

nameParser :: Parser Text
nameParser = quotedName '"' <|> quotedName '\'' <|> unquotedName
  where
  unquotedName :: Parser Text
  unquotedName =
    -- take a string of legal characters
    takeWhile1 (notInClass ",():;'\"")

  quotedName :: Char -> Parser Text
  quotedName quoteChar = do

    -- match opening quote
    _ <- char quoteChar

    -- match the name
    -- the name is comprised of a concatenation of strings of unquoted
    -- characters and single escaped characters
    name <- foldl1 (<>) <$> many1 (unescapedChar quoteChar <|> escapedChar)

    -- match closing quote
    _ <- char quoteChar
    return name

  unescapedChar :: Char -> Parser Text
  unescapedChar quoteChar =
    -- match a run of characters other than backslashes or the quote character
    takeWhile1 (\x -> x /= '\\' && x /= quoteChar)

  escapedChar :: Parser Text
  escapedChar = do
    -- match any character that is preceded by a backslash
    c <- char '\\' >> anyChar

    -- convert the Char to Text and return
    return $ singleton c
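To make the quoting logic concrete, here is a condensed, single-quote-only variant of quotedName together with two sample inputs. It is a sketch rather than the code from the repository: the names quoted, plain, and escaped are made up, and mconcat is swapped in for foldl1 (<>), which gives the same result on a non-empty list.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative ((<|>))
import Data.Attoparsec.Text (Parser, anyChar, char, many1, parseOnly, takeWhile1)
import Data.Text (Text, singleton)

quoted :: Parser Text
quoted = do
  _ <- char '\''
  -- alternate between runs of ordinary characters and escaped characters
  chunks <- many1 (plain <|> escaped)
  _ <- char '\''
  return (mconcat chunks)
  where
    -- one or more characters that are neither a backslash nor the quote
    plain :: Parser Text
    plain = takeWhile1 (\x -> x /= '\\' && x /= '\'')

    -- a backslash followed by any single character; keep only that character
    escaped :: Parser Text
    escaped = singleton <$> (char '\\' >> anyChar)

simpleName :: Either String Text
simpleName = parseOnly quoted "'a weird bug'"  -- Right "a weird bug"

-- The Haskell literal below is the seven-character input 'it\'s'
escapedName :: Either String Text
escapedName = parseOnly quoted "'it\\'s'"      -- Right "it's"
```

For the second input the chunks are "it", "'", and "s". Because many1 requires at least one chunk, an empty quoted name such as '' is rejected by this sketch.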

The >> operator sequences two monadic actions and throws away the result of the first while keeping its effect (for a parser, the input it consumed). The operator has the type:

(>>) :: Monad m => m a -> m b -> m b

In contrast, the monadic bind operator has the type:

(>>=) :: Monad m => m a -> (a -> m b) -> m b
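The relationship between the two operators can be spelled out directly. The sketch below uses the Maybe monad so that it needs only base; thenIgnoring is a made-up name for a function that behaves like >>.

```haskell
-- (>>) expressed with (>>=): run the first action, ignore its result,
-- then run the second.
thenIgnoring :: Monad m => m a -> m b -> m b
thenIgnoring m k = m >>= \_ -> k

-- The result of the first action is dropped.
keepSecond :: Maybe Int
keepSecond = Just 'x' >> Just 2                  -- Just 2

-- The effect of the first action is kept: a failure still propagates.
failsFirst :: Maybe Int
failsFirst = (Nothing :: Maybe Char) >> Just 2   -- Nothing

-- (>>=) passes the first result on to the next step.
usesFirst :: Maybe Int
usesFirst = Just 1 >>= \x -> Just (x + 1)        -- Just 2
```

In lengthParser, char ':' >> double works the same way: the colon must match and is consumed, but only the number is returned.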

See the repo on GitHub for the full code. I have simplified the code a bit here. In the GitHub repo, I also account for whitespace.

Writing a parser that builds a well-defined data structure, as we have done here, requires more work than just spinning out a fast, specialized regular expression and munging the results with sed and awk. But the extra effort makes the code more robust, extensible, readable, and modular.

built on 2024-08-17 18:13:29.106027925 UTC from file 2024-04-15-parsing

Parser combinators: Fast streaming of huge and complex files

Contents: