Contents:
- Why I write and who you are
- Haskell - The language of terms
- Haskell - The language of types
- Haskell - Typeclasses
- Interlude: how typing simplifies development
- Parser combinators: Fast streaming of huge and complex files
What is a boolean? In most languages, this is a complex question. In C, there is no built-in boolean type. Relational operators (==, <, etc.) return int values of 0 or 1. The condition in an if statement counts as true whenever it evaluates to something non-zero. A "false" value could be a NULL pointer, a 0 integer, the null character '\0', or a double that is 0. For floating point numbers there is the odd case of -0, whose bit pattern is not all zeros but which still evaluates to false.
In Python, there is a built-in boolean type with just two possible values, False and True. This simplicity is deceptive, though. As in C, the bool type is also an integer: the expression 1 + True is 2, and "ab"[True] is "b". All objects may be used in a conditional clause and Python will coerce them to a boolean. "True" objects include non-empty sets, non-empty lists, non-empty strings, and non-zero numbers. Classes in Python 3 can implement a __bool__ method that allows them to be converted to boolean values. Some types may have multiple "zeroes". For instance, if you have a value that may be either None or an integer, and you use it in a conditional, you cannot be sure why the else branch is entered.
In short, the boolean is heavily blessed in most languages. Special boolean behavior is cooked into the compiler. Learning all the quirks of the boolean is just one of many chores on the path to language mastery.
In Haskell, though, there is nothing to learn. The Bool type is defined as follows:
data Bool = False | True
There is no boolean primitive in Haskell. A boolean is simply a set of two values, False and True. It is not an integer. It has no properties. It is exactly equivalent to the logical notion of a boolean that every child understands.
Haskell programmers often speak of how Haskell removes whole categories of errors. This is one such case. It is also the first of many examples I will show of why Haskell is simpler than languages such as Python.
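To see that nothing about Bool is special, consider boolean negation. It needs no compiler support at all; it is just a two-clause pattern match (sketched here as myNot only to avoid clashing with the Prelude's not):
-- boolean negation as an ordinary pattern match
myNot :: Bool -> Bool
myNot True  = False
myNot False = True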
Now let’s switch from booleans and talk about things that may fail. If you look up a key in a map, and the key does not exist, what is returned? Of course, there are many things one might do. Maybe you could raise an exception, as is done for dictionaries in Python. This halts further evaluation and unwinds the stack until caught. Or you could return a NULL value and hope the calling function handles it correctly. However, if a NULL is returned, how can you know whether a NULL is returned because the data structure stored a NULL or because the key did not exist?
In Haskell, the broad notion of values that may or may not exist, and computations that may or may not succeed, is captured by the type Maybe, defined as follows:
data Maybe a = Nothing | Just a
Maybe is a parameterized type with one type variable a. Lowercase terms in type declarations always represent generics; the concrete type of a depends on use. Nothing and Just are data constructors, and Just takes one argument. As with Bool, everything there is to know about Maybe is contained in this definition. There is no special compiler handling, no coupled dictionary of methods, no idiosyncratic implementations in every class. It is a minimal concept. This featurelessness is what allows Maybe to be used universally.
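This is exactly what the map lookup question from before comes down to. Here is a minimal sketch using Data.Map from the containers package (the phoneBook example and its names are my own):
import qualified Data.Map as Map

phoneBook :: Map.Map String Int
phoneBook = Map.fromList [("alice", 5551234), ("bob", 5559876)]

-- Map.lookup returns a Maybe, so the missing-key case must be handled
describe :: String -> String
describe name = case Map.lookup name phoneBook of
  Nothing     -> name ++ " is not in the phone book"
  Just number -> name ++ "'s number is " ++ show number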
But Maybe is not very expressive. What if we want to know more about why no value is returned? What if we are writing a text parser and want to know where and why the parse failed? In this case, we want to return one of two possible values: either the result of a successful parse or a description of the failure. This concept is captured by the data type Either:
data Either a b = Left a | Right b
There is no try-catch in Haskell. There is nothing special about code that may "fail"; what counts as "failure" is entirely up to the programmer. In most languages, the two paths a computation can take, the exception path where something goes wrong and the successful path where nothing goes wrong, use different semantics. In Haskell, both paths are ordinary computation. The pass and fail branches are equal parts of the model and both are, in general, handled with equal care. A Haskell programmer can usually say with confidence that all corner cases are modeled within their program. If there are any cases that are not handled (i.e., partial functions), the programmer will know (if they pay any attention to compiler warnings). This leads us to another category of bugs that Haskell practically eliminates: edge cases.
In practice, you always know if a function you are using can fail because its failure is modeled in the return type as Maybe or Either.
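As a small illustration (my own sketch, not from the text above), here is a division function whose possible failure is visible in its type, and a caller that must handle both outcomes:
safeDiv :: Int -> Int -> Either String Int
safeDiv _ 0 = Left "division by zero"
safeDiv x y = Right (x `div` y)

showQuotient :: Int -> Int -> String
showQuotient x y = case safeDiv x y of
  Left err -> "error: " ++ err
  Right q  -> "quotient: " ++ show q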
So with three lines we define everything needed for handling conditionals, null values and exceptions. All this without any special handling by the compiler:
data Bool = False | True
data Maybe a = Nothing | Just a
data Either a b = Left a | Right b
We can define data structures in a similar fashion. Perhaps the simplest such data structure is the list:
data List a = Cons a (List a) | Nil
If you have some background in computer science, you may recognize this as a linked list. It is not a memory-efficient data structure and it does not permit constant-time random access. For those cases, you may use one of Haskell's vector libraries. We will cover this later.
The string “abc” may be stored as:
abc = Cons 'a' (Cons 'b' (Cons 'c' Nil))
This is verbose, so Haskell adds syntactic sugar for lists, allowing representation with square brackets: ['a','b','c']. The default String type in Haskell stores strings exactly like this.
More complex types may also be easily defined, such as binary trees:
data BinaryTree a = Leaf a | Node a (BinaryTree a) (BinaryTree a)
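Functions over such a type follow its shape. For instance, counting the values stored in a tree (a sketch; treeSize is my own name, not a standard function):
treeSize :: BinaryTree a -> Int
treeSize (Leaf _)     = 1
treeSize (Node _ l r) = 1 + treeSize l + treeSize r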
Haskell has special syntax for tuples, which are sequences of fixed length whose elements may have different types. For example, a tuple of type (Int, String, Bool) may contain values such as (1, "yolo", True).
Long tuples, though, are confusing to work with. Usually “Records” are used instead. Records are just tuples with field names for easier access. Here is how you could define a tuple and a record that store the same information:
-- a tuple containing 3d coordinates
type Point = (Double, Double, Double)
-- a record that assigns names to each dimension
data PointRecord = PointRecord
  { pointX :: Double
  , pointY :: Double
  , pointZ :: Double
  }
The type keyword defines a type synonym. This is a name that exists purely for convenience and that disappears early in the compilation process.
The fields of PointRecord are specified in curly brackets. The compiler will generate accessor functions named pointX, pointY and pointZ for accessing each field in the record. Since these go into the global namespace, they must be unique, which is why we cannot simply name the labels x, y and z.
Annoying? Yes, yes it is. Why can’t we just define Point as:
data Point = Point { x :: Double, y :: Double, z :: Double }
And access fields with foo.x syntax? Well, actually we can with a new Haskell extension. In truth, records are one of the warts in Haskell. Ask any Haskeller what they dislike about Haskell, and awkward records is likely to be one of the items they list.
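The extension in question is presumably OverloadedRecordDot, available since GHC 9.2. A minimal sketch of the foo.x style:
{-# LANGUAGE OverloadedRecordDot #-}

data Point = Point { x :: Double, y :: Double, z :: Double }

-- field access with dot syntax instead of accessor functions
norm :: Point -> Double
norm p = sqrt (p.x * p.x + p.y * p.y + p.z * p.z)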
One last thing: why do we repeat Point? Why not just write:
data Point = {pointX :: Double, pointY :: Double, pointZ :: Double}
The reason is that the first Point is the name of the type. It is the name that goes in type signatures. The second Point is the name of the data constructor. It is the name used to build or match against the record. While records usually have only one data constructor, they may have more, for example:
data Point
= Point1D {x1D :: Double}
| Point2D {x2D :: Double, y2D :: Double}
| Point3D {x3D :: Double, y3D :: Double, z3D :: Double}
We may define points in different dimensions:
x1 = Point1D {x1D = 1}          -- make a 1D point
x2 = Point2D {x2D = 1, y2D = 2} -- make a 2D point
And write functions that access different dimensions:
getX (Point1D x)     = x
getX (Point2D x _)   = x
getX (Point3D x _ _) = x
By convention, when there is a single data constructor for a type, the type constructor and data constructor will have the same name.
Before I go on, I would like to introduce the concepts of “sum types” and “product types”. These terms are often used in practice and understanding the intuition behind them may be useful. And if it is not useful, it is at least beautiful.
Types describe sets, and sets have sizes. The size of the boolean type is 2 since it contains two values: True and False. The size of a type for the days of the week is 7. What about Maybe and Either? A value of type Maybe a can be Nothing or Just a. Nothing is an atom with just one value. Just a has a size equal to the size of a. So the size of Maybe a is 1 + ||a||, where the double bars represent size (or cardinality). What about Either a b? It can be Left a or Right b, so we can just add the sizes of a and b:
||Either a b|| = ||a|| + ||b||
Since the sizes of these types are a summation over the sizes of their different options, we call them “sum types”.
The elements of a tuple or record, though, are free to vary independently. The tuple (a, b) will have ||a|| * ||b|| different possible values. So we call this a “product type”.
This may all seem mathy for the sake of being mathy. But it leads to an interesting idea. If two types have the same size, then each element in one type can be uniquely mapped to an element in the other. This means they are “isomorphic”. One can be translated into the other without loss of information.
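For example (my illustration), Bool and Either () () both have exactly two values, so each can be translated to the other and back without losing anything:
toEither :: Bool -> Either () ()
toEither False = Left ()
toEither True  = Right ()

fromEither :: Either () () -> Bool
fromEither (Left  _) = False
fromEither (Right _) = True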
Haskell adds more syntactic sugar than is strictly necessary. Implementing sum and product types really only requires two constructors (operators). From a mathematical perspective, it might be more elegant to use the type syntax a + b rather than a | b and a * b rather than (a, b). Phrased like this, what are the algebraic laws governing this system? Thinking along these lines will turn you into a computer scientist. You might wonder, “what is subtraction, exponentiation, derivation and integration at the type level?” This is why people hate Haskellers.
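To make the question concrete, here is one such law sketched as a pair of functions (my illustration): products distribute over sums, that is, a * (b + c) is isomorphic to a*b + a*c.
distribute :: (a, Either b c) -> Either (a, b) (a, c)
distribute (a, Left  b) = Left  (a, b)
distribute (a, Right c) = Right (a, c)

factor :: Either (a, b) (a, c) -> (a, Either b c)
factor (Left  (a, b)) = (a, Left  b)
factor (Right (a, c)) = (a, Right c)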
OK, so that about covers everything you need to know about data declarations. The next layer of abstraction is to combine the data types with functions.
Type signatures
To review, we declare data with the data keyword. For example:
data Weather = Sunny | Rainy | Cloudy
Functions of data may be defined as follows:
goOutsideToPlay :: Weather -> Bool
This type signature describes functions that map values from the Weather type to values in the Bool type.
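One possible implementation (which days count is, of course, an arbitrary choice of mine):
goOutsideToPlay :: Weather -> Bool
goOutsideToPlay Sunny  = True
goOutsideToPlay Cloudy = True
goOutsideToPlay Rainy  = False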
A function signature generally maps to many possible functions. Just for fun, what is the size of the goOutsideToPlay type? That is, how many possible functions are there between Weather and Bool? We can make a table in which each column is one possible function, listing its output for each input (with True/False written as 1/0 for convenience):
Sunny   0 0 0 0 1 1 1 1
Rainy   0 1 0 1 0 1 0 1
Cloudy  0 0 1 1 0 0 1 1
These functions range from never going outside to always going outside. We see that there are 8 different functions. Or 2^3. Or, to phrase it differently, ||Bool||^||Weather||. In the previous section, we discussed “sum types” and “product types”. Functions can be described as “exponential types”: the size of the function type a -> b is ||b||^||a||. Cool, huh? Useful? Well, probably not to you. But these mathematical rules can actually be used within the compiler to optimize code. The extreme simplicity of the Haskell language allows the compiler (and us humans) to better reason about the code.
Some function signatures are specific enough to have only one reasonable implementation. In these cases, implementations may be generated automatically. This principle will be applied in the following section to generate many typeclass instances for a given data type.
What if we want to make functions of multiple arguments? Well, you can’t. But you can make a function that returns a function. Playing outside may depend on more than just the weather. Maybe it also depends on the child’s age. Now for a given child, you might want a function that tells you when they specifically will go outside. This can be written as follows:
goOutsideToPlayAge :: Int -> (Weather -> Bool)
Then you can write goOutsideToPlayAge 14 to get a function for children of age 14. Or you can serially apply arguments: (goOutsideToPlayAge 14) Sunny. That is a lot of unnecessary parentheses. They can be skipped since the -> operator is right associative and function application is left associative. So we can write the function signature as:
goOutsideToPlayAge :: Int -> Weather -> Bool
This signature is similar to the C prototype:
bool goOutsideToPlayAge(int, Weather);
But it is more flexible, since it allows convenient partial application, which is very useful when coupled with higher order functions. I will introduce these soon.
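For instance (a sketch, assuming some implementation of goOutsideToPlayAge), fixing the age yields a Weather -> Bool function that can be handed directly to a higher-order function such as map:
willPlayAt14 :: [Bool]
willPlayAt14 = map (goOutsideToPlayAge 14) [Sunny, Rainy, Cloudy]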
Being comfortable with type signatures is an essential skill for the Haskell programmer. This skill is best learned by seeing many examples. So I will show a few here.
The simplest function is the identity function:
id :: a -> a
As seen before, generic types are written in lowercase. It is conventional to use a, b, etc. as generic type variables. Single-letter variables are frowned upon in many communities, but they are common in Haskell, especially for generics. The a in the identity function may be anything. It has no distinguishing features. If we really wanted to call it something, we would have to use something vague, like anything or whatever or thing. By convention, we stick with a unless we have a good reason to do otherwise.
Here are a few more functions; you can probably guess what they do:
fst :: (a, b) -> a
snd :: (a, b) -> b
onFst :: (a -> c) -> (a, b) -> c
In fact, all of these functions have a single non-pathological implementation. The only non-pathological way to get a generic a from (a, b) is to take the first value from the tuple.
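Those single implementations look like this (fst and snd mirror the Prelude's definitions and would need to be hidden in a real module; onFst is my own illustrative name):
fst :: (a, b) -> a
fst (a, _) = a

snd :: (a, b) -> b
snd (_, b) = b

onFst :: (a -> c) -> (a, b) -> c
onFst f (a, _) = f a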
Of course, most functions are not so simple. Here are a few examples of functions that return the first element of a list:
head :: [a] -> a
headMay :: [a] -> Maybe a
headEither :: [a] -> Either String a
Since a list may have length 0, the first function, head, is a partial function. That is, one possible input (in the domain) maps to no output (in the codomain). So head could crash the program at runtime. The compiler knows this and will raise warnings. Now notice what I did here: even without reading the implementation of head, I already know that it is partial. I know exactly how to use (and when to avoid) head.
headMay is a total function. It handles the empty case by returning Nothing and returns Just a otherwise.
headEither returns some sort of string describing the error. This is probably a boring string, since it can’t contain any information that Nothing would not capture equally well.
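Both total versions are one pattern match away (a sketch; headMay and headEither are not in the Prelude, though similar helpers exist in common libraries):
headMay :: [a] -> Maybe a
headMay []      = Nothing
headMay (x : _) = Just x

headEither :: [a] -> Either String a
headEither []      = Left "empty list"
headEither (x : _) = Right x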
Now, notice that all through here I assumed that head was returning the first element. What if the implementation actually returns the second element? This would mean that the function is mis-named, which the type system obviously won’t catch. The head function actually has a name that is narrower than its signature. You could write the following functions with the same signature:
lastElement :: [a] -> a
middleElement :: [a] -> a
secondElement :: [a] -> a
A type system more expressive than vanilla Haskell might allow further refinement of the signature, for example:
head :: xs:[a]{ len xs > 0 } -> x:a{ x == xs[0] }
With these type annotations, the signature will uniquely map to one function for each type a. In languages like Idris, this kind of signature is possible. Thinking along these lines is another reason why people hate Haskellers. For most practical purposes, the type-level ambiguity of what head means is acceptable. The signatures narrow the space of possible implementations and our mental model, using hints from the function name and documentation, does the rest.
Remember when I said simple functions may have only one possible non-pathological implementation? There is exactly one pathological implementation for each of the functions I showed: they could each be undefined. That is, they could map all possible input to “bottom”, the undefined value that is a member of every type. Since we know nothing about the input, we cannot do any specific branching. So either we do the one correct operation or we are undefined on everything.
The undefined value is quite useful. You can substitute the body of any function with undefined, and it will typecheck, since undefined is a member of all types. I will often write all the type signatures for a program before writing the bodies of the functions. For example:
align :: [String] -> Matrix Char
align = undefined

msaToDist :: Matrix Char -> Matrix Double
msaToDist = undefined

distToTree :: Matrix Double -> Tree
distToTree = undefined

makeTree :: [String] -> Tree
makeTree = distToTree . msaToDist . align
Here I outline, at the type level, my approach to building a phylogenetic
tree. The makeTree
function composes the three steps of my tree building
algorithm. First align the sequences to get a character matrix, then build a
distance matrix from the character matrix, and finally make a tree from the
distance matrix. I can typecheck this and if it passes, I can be sure that my
wiring is correct. Then I can implement each piece at my leisure.
Type classes (a foreshadowing)
Let’s talk about a few more complex functions:
sum :: [a] -> a
generalSum :: f a -> a
sort :: [a] -> [a]
All of these signatures are broken.
In sum we want to add together all values in a list. We want this function to work for anything numeric. But addition does not apply to all things. So a is, again, too generic. There is not, for instance, an obvious way to add two Person objects.
In generalSum, we want to generalize sum to work on any collection that can be iterated over. This might include a set of numbers, a binary tree of numbers, or the values in a hashmap of numbers. But f is too generic. Not all types are collections, and certainly not all types take a single type parameter.
In sort, we want to take a list and return a sorted list. But any sort function will require comparing elements, and how do we compare things we know nothing about? We need to specify that a comparison function exists for the type a.
The problem is that generics are too generic. We need to be able to limit generics to types that have specific required traits.
One solution to this problem would be to always pass in higher order functions that provide the required functionality.
Here is one such signature for sum, written as sum':
sum' :: a -> (a -> a -> a) -> [a] -> a
A reasonable sum of an empty list is 0. So whatever type a is, it needs a zero element. We also need a function that takes two members from set a and returns (adds) a new member also in set a. This function has type (a -> a -> a). Taking the summation is then a simple process of iteratively applying the addition operator to elements in the list and the accumulated value. You may know this as a “reduce” operation. In the Haskell world, this is more commonly known as a “fold”.
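Here is a sketch of sum', once as explicit recursion and once using the Prelude's foldr:
sum' :: a -> (a -> a -> a) -> [a] -> a
sum' zero _   []       = zero
sum' zero add (x : xs) = add x (sum' zero add xs)

-- equivalently: sum' zero add = foldr add zero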
The name sum is actually a bit too specific for the function we have developed. The operation here needn’t be addition. It could also be multiplication. Or it could be string concatenation with an empty string as the “zero” element. Or it could be list appending, with the empty list as the “zero” element.
Now let’s look at generalSum'. It needs to be passed the zero element and the addition function, as well as a reduce/fold function for operating on the generic collection f:
generalSum' :: ((b -> c -> c) -> c -> f b -> c)
            -> a
            -> (a -> a -> a)
            -> f a
            -> a
Wow, that is a long signature. The first argument is a general fold function. It takes a function (b -> c -> c), which combines a value from the original collection with the accumulator and yields a new accumulator. The second argument of the fold, c, is the initial accumulator, and f b is the original collection.
The second, third, and fourth arguments of generalSum' will all be applied to the fold function to yield the final summation.
The signatures are really awkward, though. Maybe we could roll all the helper functions into a named record and pass that in instead? We could collect the zero element and the reducing function needed for summing in one record and the folding function in another record:
data Foldable f a b = Foldable {fold :: ((a -> b -> b) -> b -> f a -> b)}
data Summable a = Summable {zero :: a, add :: a -> a -> a}
Now we can write:
generalSum'' :: Foldable f a b -> Summable a -> f a -> a
Now we can define a Foldable record for each type of collection we use and a Summable record for each type we want to sum.
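As a sketch, concrete records for summing a list of Ints might look like the following (this Foldable and its fold field clash with names in the Prelude, so a real module would hide or rename them; an implementation of generalSum'' is also still needed):
listFoldable :: Foldable [] Int Int
listFoldable = Foldable {fold = foldr}

intSummable :: Summable Int
intSummable = Summable {zero = 0, add = (+)}

-- generalSum'' listFoldable intSummable [1, 2, 3]  would evaluate to 6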
“Summable” is a bit too specific a word. We could say “Multiplicatable” if the operator were * rather than +. Or “Appendable” if the operator were list or string appending and the zero element [] or "". What word is there for the class of things with an empty/zero element and a closed binary operator? Well, there isn’t one in English. So we must either choose a deceptively specific word like “Summable” or choose a precise but possibly unfamiliar (and doubtlessly pretentious) mathematical term. The Haskell community has opted for the second path and borrowed the term “Monoid” from abstract algebra.
Passing in records of required functions “specializes” the generics without any built-in shenanigans. No type-level magic. No special methods baked into the objects. No overriding or inheritance or (in this case) multiple inheritance patterns. The functions (folding, adding) and the zero element must be provided for generalSum'' to be called.
Haskell streamlines the process of providing these records through “typeclasses”. Under the compiler’s hood, though, there is no magic here, just record passing.