Contents:
- Why I write and who you are
- Haskell - The language of terms
- Haskell - The language of types
- Haskell - Typeclasses
- Interlude: how typing simplifies development
- Parser combinators: Fast streaming of huge and complex files
Bob, Alice and Charlie - by Midjourney
under construction
The enterprise mean
Let’s say you, Alice, are developing a mathematics library for internal use in your data science company.
Here is your first crack at the problem:
def mean(xs):
    return sum(xs) / len(xs)
You submit this to your nit-picking co-workers and Bob points out that it breaks on generators. The length of an iterator is not defined. This isn't an obscure edge case: Pythonic code favors generators over lists when possible. One solution would be to force evaluation of the iterator, turning it into a list:
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)
This will work. But it is not ideal. If xs is very large, then mean may encounter memory problems. Instead, we could walk through the list and measure its length and sum incrementally:
def mean(xs):
    length = 0
    total = 0
    for x in xs:
        length += 1
        total += x
    return total / length
But what if there is missing data? In computing a mean, a reasonable choice is to ignore missing data. And how might the user represent this missing data? None might be a reasonable choice. For floats, there is nan. So we can add those cases in:
def mean(xs):
    xs = [x for x in xs if x != float("nan") and x != None]
    return sum(xs) / len(xs)
But wait, float("nan") == float("nan") returns False. What was the convention for NaNs? Ah, there is a specialized function for this in the math library:
import math

def mean(xs):
    xs = [x for x in xs if x != None and not math.isnan(x)]
    return sum(xs) / len(xs)
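For reference, the gotcha that forced the switch to math.isnan: NaN is defined to compare unequal to everything, including itself, so the earlier equality filter never removed it.

import math

float("nan") == float("nan")   # False: NaN never compares equal, not even to itself
float("nan") != float("nan")   # True, so the earlier filter kept every NaN
math.isnan(float("nan"))       # True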
But there is another case too: infinity. And, for some reason, sum returns an error on infinity. But logically, the sum of a list with infinity as an element ought to also be infinite. So we can add that in as well.
def mean(xs):
    xs = [x for x in xs if x != None and not math.isnan(x)]
    if any(math.isinf(x) for x in xs):
        return float("inf")
    return sum(xs) / len(xs)
Now does our function handle every type that it should? The comprehension in the first line should work for any iterator. But is that efficient? Should we really be transforming every input into a list? If we are passed a numpy array, this would be terribly memory inefficient. We could extend mean to cover this case as well, but I'm sure you can imagine what it would look like. The fundamental problem here is that the domain and codomain of the function are fuzzy. It is the role of the function developer to anticipate all reasonable inputs and to provide handling for all unreasonable inputs. The role of the user is to guess the domain of the function.
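To make that fuzziness concrete, here is a hypothetical sampling of calls a user might try against the last version of mean above; nothing in the function itself declares which of these fall inside its domain.

import math

mean([1.0, 2.0, 3.0])             # works: 2.0
mean(x / 2 for x in range(10))    # works: the comprehension consumes the generator
mean([1.0, None, 3.0])            # works: None is silently dropped
mean([1.0, math.nan, 3.0])        # works: NaN is silently dropped
mean([])                          # ZeroDivisionError
mean(["1.0", "2.0"])              # TypeError raised inside math.isnan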
We could add type annotations to the Python function. For example:
def mean(xs: Iterator[float]) -> float:
    ...
But that is now too specific. numpy arrays do not store Python floats. Also, type annotations are not binding in Python. The user is perfectly free to send in any data they like. So if we are exposing this as a library, we still need to handle all possible inputs.
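Nothing enforces the annotation at runtime; it only helps static tools such as mypy. A small illustration (the calls here are hypothetical, not from any library):

from typing import Iterator

def mean(xs: Iterator[float]) -> float:
    xs = list(xs)
    return sum(xs) / len(xs)

mean([1, 2, 3])    # a list of ints, not an Iterator of floats: runs anyway, returns 2.0
mean(range(4))     # a range: also runs, returns 1.5
# mean("abc")      # accepted by the interpreter, fails only at runtime inside sum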
In Haskell, no such fuzziness exists. The problem would be written as follows:
mean :: (Foldable f, Fractional a) => f a -> a
mean xs = sum xs / fromIntegral (length xs)
It certainly took far less time to write this than the Python code above. This function will work over any collection that is foldable and over any fractional numeric element type. There are no edge cases to consider. The typechecker statically guarantees that no inappropriate data can enter the function. I can use this function with confidence that it is correct.
But is the Haskell function the one we want? We can begin exploring the Haskell design space. Perhaps we want to handle missing values. How might they be represented? What if we want to avoid division by zero errors? The difference is that in Haskell we spend our time thinking about design rather than frantically addressing edge cases, looking up syntactic conventions, and addressing weird gotchas. To be fair, Haskellers can spend far too much time thinking about design, but that is a temptation that can be resisted.
The enterprise punctuator
Now let's say you, Alice, are assigned a new problem.
def ejoin(xs, delimiter=None):
    """
    Must have equal support for strings and bytes
    If delimiter is None, join with no delimiter (empty string)
    """
    # if xs is not iterable
    if not hasattr(xs, "__iter__"):
        if isinstance(xs, bytes):
            return xs
        else:
            return str(xs)
    if delimiter is None:
        try:
            xs = list(xs)
            if isinstance(xs[0], bytes):
                delimiter = b''
            else:
                delimiter = ''
        except IndexError:
            return None
    return delimiter.join(xs)
If an empty list is given and no delimiter, should the returned type be str or bytes? Or should it be None? Or should it raise an error? Or is this problem so serious that no solution is possible and we simply cannot use the default? If ejoin can return None, then all code using ejoin will need to handle this case (or risk raising the dreaded "NoneType" error).
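To see why a possible None return is infectious, consider a hypothetical call site, assuming the ejoin defined above:

header = ejoin([], delimiter=None)   # returns None, not "" or b''
print("[" + header + "]")            # TypeError: can only concatenate str (not "NoneType") to str
header.upper()                       # AttributeError: 'NoneType' object has no attribute 'upper'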
What about the awkward xs = list(xs) line? We don't know if xs is a list, or a set, or a tuple, or a generator. Only lists and tuples can be indexed into. So we need to convert this to a list.
There are a few additional error cases. What if xs is heterogeneous? Maybe it is a mix of bytes and str? Perhaps this is a sufficiently unlikely case that we don't need to worry about it. What if xs is made up of, say, literal floats? Could ejoin([1,2,3], delimiter=",") yield the string 1,2,3? That might be intuitive. Probably most programmers would argue against it. But maybe you are generating reports from heterogeneous lists of things that may be evaluated to strings (or bytes). In this case, perhaps we should 1) try to convert things to strings if possible and 2) handle heterogeneous lists.
And if we are going to handle heterogeneous lists, then which type has priority, str or bytes? Which type defines the output type? This choice is pretty arbitrary. Maybe, because it is arbitrary, we need to dial back the generality.
This problem seems undecidable. Perhaps instead we should write a pair of functions, one for bytes and one for strings:
def ejoinStr(xs, delimiter=None):
    if not hasattr(xs, "__iter__"):
        xs = [xs]

    for i in range(len(xs)):
        if isinstance(xs[i], bytes):
            xs[i] = xs[i].decode()
        else:
            xs[i] = str(xs[i])

    if delimiter is None:
        delimiter = ""
    elif isinstance(delimiter, bytes):
        delimiter = delimiter.decode()

    return delimiter.join(xs)
def ejoinBytes(xs, delimiter=None):
    if not hasattr(xs, "__iter__"):
        xs = [xs]

    for i in range(len(xs)):
        if not isinstance(xs[i], bytes):
            xs[i] = str(xs[i]).encode("ascii")

    if delimiter is None:
        delimiter = b''
    elif isinstance(delimiter, str):
        delimiter = delimiter.encode("ascii")

    return delimiter.join(xs)
The Haskell equivalent, and its natural generalizations:

ejoin :: [String] -> String -> String
ejoin []  _ = ""
ejoin [x] _ = x
ejoin xs  b = foldl1 (\x y -> x <> b <> y) xs

-- A String is just a special case of a Monoid, so we can generalize to:
ejoin :: Monoid a => [a] -> a -> a
ejoin []  _ = mempty
ejoin [x] _ = x
ejoin xs  b = foldl1 (\x y -> x <> b <> y) xs

-- A list is just a special case of a Foldable, so generalize to:
ejoin :: (Foldable f, Monoid a) => f a -> a -> a
ejoin xs b = case take 2 (toList xs) of   -- toList from Data.Foldable
  []  -> mempty
  [x] -> x
  _   -> foldl1 (\x y -> x <> b <> y) xs
The bodies of the Haskell functions are essentially boilerplate. In this case, no decisions are made there. On a Haskell team, the entire discussion of how to implement ejoin would involve the type signature.
If we wanted to adapt the Haskell ejoin function to handle anything that can be turned into a string, we could simply add the Show typeclass:
ejoin :: (Foldable f, Show b) => f b -> String -> String
ejoin xs b = case take 2 (toList xs) of
  []  -> ""
  [x] -> show x
  _   -> foldl1 (\x y -> x <> b <> y) (map show (toList xs))
However, this is a pretty pointless change; it would be more natural to write:
ejoinStr = ejoin . map show
That is, rather than specialize ejoin to work specifically with the String type, we could just compose it with map show, which would convert each element in the collection to a string.
We could do the same with the Python ejoin function, but there is a critical difference. There is nothing in the ejoin function that tells us the input has been sanitized. It is the responsibility of the user of the ejoin function to clean its input. And without reading the code or the (likely incomplete, unreliable, or annoyingly verbose) documentation, you will not know (or remember) what special handling the function has. Maybe you can try to enforce a convention across your team. But this only solves the local problem. What about third party library code?
Everyone has a different opinion on what punctuate ought to do. An individual programmer may not even have a consistent view. In one place in the code they may want punctuate to work with strings, in another bytes. And what the function actually does depends entirely on the code itself. The only fixed part of the API is the number of arguments and their default values.
You are a Python programmer and you need a simple function that takes a list of strings and returns a punctuated string.
def punctuate(xs):
    return ",".join(xs)
You wouldn't normally even need a dedicated function for this, but the larger program you are building needs to punctuate many different lists, so having one function improves consistency. For example, upper management recently decided that there should be spaces after the commas. Since all punctuation was handled by this one function, you could easily swap out "," for ", ".
However, a few weeks down the line, another programmer on your team raises a concern. In one, rather kludgy in your opinion, section of his code, he sends punctuate a string instead of a list of strings. Since a Python string is just a sequence of characters, punctuate would iterate through each character and insert commas. So the string "cat" turns into "c, a, t". While it is tempting to tell him that this is his problem and that he should fix his code, you worry that you will run into this same problem later. If you try to punctuate a single string, what is the reasonable action? You could raise an error, which would be a bit passive aggressive, or you could treat the single string as a list of length one. The latter choice seems more reasonable. So you rewrite the function as follows.
def punctuate(xs):
    if isinstance(xs, str):
        return xs
    else:
        return ", ".join(xs)
A while later, another programmer messages that your punctuate code is breaking on byte strings. In Python you can specify b'asdf' as a raw character vector of type bytes. This is more memory efficient. Your punctuate function should probably handle this. So you grudgingly reimplement as follows:
def punctuate(xs):
    if not isinstance(xs, list):
        return xs
    else:
        if isinstance(xs[0], bytes):
            return b', '.join(xs)
        else:
            return ", ".join(xs)
But this doesn't seem right. The else statements here assume the inputs are of type list and str. Maybe that is fine. If the wrong types are sent in, then they will raise errors. Again, we have a design choice. Maybe we could try to convert any input to str. This would be a nice behavior for the very common case where we need to delimit a list of floats for csv files. So maybe we can write:
def punctuate(xs):
    if not isinstance(xs, list):
        return xs
    else:
        if isinstance(xs[0], bytes):
            return b', '.join(xs)
        if isinstance(xs[0], str):
            return ", ".join(xs)
        else:
            xs = [str(x) for x in xs]
            return ", ".join(xs)
Seeing this, Alice in the neighboring cubicle tells you emphatically that you are overusing isinstance and that you need to catch any exceptions that might be raised. Oh, and don't use so many return statements. And your checks are leaky, she says. You check that xs[0] is of type bytes, implicitly assuming the list is homogeneous, but is it really? What if the function gets the list ['I-am-str', b'I-byte']? Well, checking for all that would not be pythonic, so let's throw all this into try-except statements:
def punctuate(xs):
    if not isinstance(xs, list):
        out = xs
    else:
        try:
            out = ", ".join(xs)
        except TypeError:
            try:
                out = b', '.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = ", ".join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
    return out
Great, so now it first tries to join strings, then it tries bytes, and if nothing works it converts everything to strings. But then Alice points out that the list type check is too specific: it fails on sets and tuples. Instead you should check that the input is iterable:
def punctuate(xs):
    if not hasattr(xs, '__iter__'):
        out = xs
    else:
        try:
            out = ", ".join(xs)
        except TypeError:
            try:
                out = b', '.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = ", ".join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
    return out
But more bugs appear. By returning anything that isn't an iterable collection, you sometimes return things that aren't even strings, which breaks downstream code. For example, "(" + punctuate(x) + ")" breaks when x=1, a single integer. Now, maybe they shouldn't be sending this in. So can we convert xs to a string? Sure, but some things stringify in weird ways; for example, the bytes value b'asdf' stringifies literally to "b'asdf'". So we need another check:
def punctuate(xs):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = ", ".join(xs)
        except TypeError:
            try:
                out = b', '.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = ", ".join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
    return out
Then Alice helpfully points out the repetition of “,”. Shouldn’t that come in as an argument? What if we want to punctuate without the space (as in CSV files) or we want to use TABs or semicolons instead? So … forward march:
def punctuate(xs, delimiter):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = delimiter.join(xs)
        except TypeError:
            try:
                out = delimiter.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = delimiter.join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
                except TypeError:
                    raise TypeError("The delimiter and list elements don't match. Maybe???")
    return out
Now this introduces a new type of possible error. The type of the delimiter may not match the type of the strings. So we need a try-except to catch this TypeError. Now should we correct the types? We could. Maybe someone should be able to specify the delimiter using a string value and still have it work for bytes? It is becoming hard to trace which errors might arise at which places.
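The delimiter/element mismatch is easy to trigger with Python's built-in join (a quick illustration):

", ".join([b"a", b"b"])   # TypeError: sequence item 0: expected str instance, bytes found
b", ".join(["a", "b"])    # TypeError: sequence item 0: expected a bytes-like object, str found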
def punctuate(xs, delimiter):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = delimiter.join(xs)
        except TypeError:
            if isinstance(delimiter, bytes) and isinstance(xs[0], str):
                delimiter = delimiter.decode('ascii')
            elif isinstance(delimiter, str) and isinstance(xs[0], bytes):
                delimiter = delimiter.encode('ascii')
            elif not isinstance(delimiter, str):
                delimiter = str(delimiter)
            try:
                out = delimiter.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = delimiter.join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
                except TypeError:
                    raise TypeError("The delimiter and list elements don't match. Maybe???")
    return out
Alice, after staring at the code for several long minutes, asks how empty lists are handled. They should be empty strings, but should they be byte strings or ascii strings? Maybe we can infer the output type by looking at the type of the delimiter?
def punctuate(xs, delimiter):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
                if out == "" and isinstance(delimiter, bytes):
                    out = out.encode("ascii")
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = delimiter.join(xs)
        except TypeError:
            try:
                if isinstance(delimiter, bytes) and isinstance(xs[0], str):
                    delimiter = delimiter.decode('ascii')
                elif isinstance(delimiter, str) and isinstance(xs[0], bytes):
                    delimiter = delimiter.encode('ascii')
                elif not isinstance(delimiter, str):
                    delimiter = str(delimiter)
            except IndexError:
                if isinstance(delimiter, bytes):
                    return b''
                else:
                    return ''
            try:
                out = delimiter.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = delimiter.join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
                except TypeError:
                    raise TypeError("The delimiter and list elements don't match. Maybe???")
    return out
Phew! OK. Is all that correct now? Is our enterprise punctuation finally online? There’s surely some cleaner way to refactor that. You show the final code to Alice.
“It’s ugly”, she says, “But you’ve already spent too much time on it. Just add some documentation and move onto the next thing.”
def punctuate(xs, delimiter):
    """
    Punctuate a list of string-like things with the given delimiter

    This will raise errors if xs contains byte strings mingled with anything
    else. If the delimiter is bytes and the list contains strings, or vice
    versa, then the delimiter will be automatically cast to match the list
    type.

    @param xs : homogeneous list of things that can be converted to a string
    @param delimiter : an ascii string (or something that can be converted to one) or a bytes string
    @return either a bytes string if xs contains bytes or a char string otherwise
    """
    ...
Ugh. Did that documentation catch everything? Will anyone read it? Alice raises an eyebrow and says everything is OK: if anyone needs to understand the function, they can always read the code.
Now, maybe this all seems a little absurd. Maybe only strings need to be supported. Maybe only strings and bytes. Maybe only bytes, if you are working with memory-sensitive ASCII data, such as biological sequence data. Maybe it is OK if the function doesn’t raise errors sometimes. Maybe we could drop the handling for mismatched delimiter/list types? Maybe there is a smart third party tool we could use for unifying strings?
There are many approaches. Might one say, there are many ways to do it?
What is the "design space" for punctuate? That is, how many different reasonable ways could we write it? If a programmer sees punctuate in a library, what hypotheses might they form about how it works? We can describe the design space as a decision tree.
Should the function just “do the right thing”? If that is our intent, then we should add handling for punctuating things that aren’t strings and recasting.
So what is the fundamental problem here?
- We are creating an ad hoc type system. Rather than allowing the compiler to figure out the types, we have to explicitly check and branch to the correct behaviour.
- We are creating ad hoc polymorphism.
- We are creating documentation of the types and behavior that may not accurately describe the function.
- Our function has an output type that depends on the input values. This means a programmer using the function cannot be sure what type will be returned, which hurts readability (see the snippet after this list).
- The performance of all these manual checks and branches is bad.
- The code is complex, unreadable, and bug prone. The particular way we implemented it is one of dozens of equally reasonable implementations.
- Since any argument to a function may take any type, you have to form a hypothesis about how the function will be used in the future.
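A quick demonstration of that value-dependent return type, assuming the final version of punctuate above (the inputs are only illustrative):

type(punctuate(["a", "b"], ", "))    # <class 'str'>
type(punctuate([b"a", b"b"], ", "))  # <class 'bytes'> (the delimiter was silently recast)
type(punctuate(7, ", "))             # <class 'str'> (nothing was joined at all)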
punctuate is a very trivial function. If the source code is available, it is not too hard to read through it and figure out what cases are handled. But more complex functions may have completely non-obvious type requirements. Constraints on the types may be set deep within libraries imported by the library you are importing the function from. The source code may not even be available, for instance if you are interacting with a function hidden behind an API.
You can handle the bytes and strings consistently by redefining join:
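A sketch of what such a redefined join might look like; the details here are assumed, reconstructed only from the two problems discussed next:

def join(xs, delimiter=None):
    out = None
    for x in xs:
        if out is None:
            # infer the output type (str or bytes) from the first element
            out = x
        else:
            if delimiter is None:
                delimiter = b'' if isinstance(x, bytes) else ''
            # 'out' is rebuilt from scratch on every iteration
            out = out + delimiter + x
    # if xs was empty we never learned the expected type, so return None
    return out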
But there are two problems with this. The first is that we are redefining out in each iteration of the loop. This leads to very bad performance. The second problem is that we have to return None when passed an empty list. This is because we do not know the expected return type. If we returned the string type "", then this could break code expecting bytes. If we returned b'', we would break string code. But now the returned None is forcing the calling context to handle this extra type.
In Python, if we decide we want to allow bytes to be handled in one function, how does this change propagate? Which library functions handle bytes? When do I need to convert? Which functions may raise exceptions, silently recast my data to strings, or return Nones? How might these Nones be handled?
With languages like Python and R, I have to experiment with the functions to figure out how they work. Understanding the code is a scientific endeavour. Understanding the code from first principles is like understanding biology given chemistry. There are little gotchas and inconsistencies everywhere. Ground truth is hard to find. So we program under a fuzzy model, a collection of assumptions about the code, and these assumptions undermine confidence in the correctness of our programs.
This is in sharp contrast to a strongly typed language like Haskell. Here, most assumptions are verified by the compiler. Not all assumptions are verified; there are limits to the expressiveness of Haskell's type system. But in practice, the unverified assumptions are limited to a small set of edge cases. This frees a Haskell programmer to focus on the design and logic of the program.
In Haskell,
punctuate1 :: [String] -> String -> String
punctuate1 []  _ = ""
punctuate1 [x] _ = x
punctuate1 xs  b = foldl1 (\x y -> x <> b <> y) xs
This is a total function. No exception handling is needed and there is no ambiguity about what it is doing. It is impossible for any developer elsewhere on the team to misuse this function. Further, there are few other ways to write this function. It is very near to a mathematically minimal implementation. However, in Haskell, it is common to generalize one step further:
punctuate :: (Foldable f, Monoid a) => f a -> a -> a
punctuate xs b = case length xs of
  0 -> mempty
  1 -> head (toList xs)   -- toList from Data.Foldable
  _ -> foldl1 (\x y -> x <> b <> y) xs
We define punctuate for any foldable collection of monoids.

This punctuate function is richer than what we set out to find. We could "punctuate" a tree. On our quest to add a few commas between strings, we stumbled on the abstract notion of folding a constant across a collection. What does this mean? How else might it be used? Is the problem of punctuation isomorphic to some seemingly unrelated computational problem?