Contents:
- Why I write and who you are
- Haskell - The language of terms
- Haskell - The language of types
- Haskell - Typeclasses
- Interlude: how typing simplifies development
- Parser combinators: Fast streaming of huge and complex files
Bob, Alice and Charlie - by Midjourney
under construction
The enterprise mean
Let’s say you, Alice, are developing a mathematics library for internal use in your data science company.
Here is your first crack at the problem:
def mean(xs):
    return sum(xs) / len(xs)
You submit this to your nit-picking co-workers and Bob points out that it breaks on generators. The length of an iterator is not defined. This isn't an obscure edge case: Pythonic code favors generators over lists when possible. One solution would be to force evaluation of the iterator, turning it into a list:
def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)
This will work. But it is not ideal. If xs is very large, then mean may encounter memory problems. Instead, we could walk through the list and measure its length and sum incrementally:
def mean(xs):
    length = 0
    total = 0
    for x in xs:
        length += 1
        total += x
    return total / length
But what if there is missing data? In computing a mean, a reasonable choice is to ignore missing data. And how might the user represent this missing data? None might be a reasonable choice. For floats, there is nan. So we can add those cases in:
def mean(xs):
    xs = [x for x in xs if x != float("nan") and x != None]
    return sum(xs) / len(xs)
But wait, float("nan") == float("nan") returns False. What was the convention for NaNs? Ah, there is a specialized function for this in the math library:
import math

def mean(xs):
    xs = [x for x in xs if x != None and not math.isnan(x)]
    return sum(xs) / len(xs)
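For reference, the gotcha that forced the switch to math.isnan: NaN is defined to compare unequal to everything, including itself, so the earlier equality filter never removed it.

import math

float("nan") == float("nan")   # False: NaN never compares equal, not even to itself
float("nan") != float("nan")   # True, so the earlier filter kept every NaN
math.isnan(float("nan"))       # True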
But there is another case too: infinity. And, for some reason, sum returns an error on infinity. But logically, the sum of a list with infinity as an element ought to also be infinite. So we can add that in as well.
def mean(xs):
    xs = [x for x in xs if x != None and not math.isnan(x)]
    if any(math.isinf(x) for x in xs):
        return float("inf")
    return sum(xs) / len(xs)
Now does our function handle every type that it should? The comprehension in the first line should work for any iterator. But is that efficient? Should we really be transforming every input into a list? If we are passed a numpy array, this would be terribly memory inefficient. We could extend mean to cover this case as well, but I'm sure you can imagine what it would look like. The fundamental problem here is that the domain and codomain of the function are fuzzy. It is the role of the function developer to anticipate all reasonable inputs and to provide handling for all unreasonable inputs. The role of the user is to guess the domain of the function.
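To make that fuzziness concrete, here is a hypothetical sampling of calls a user might try against the last version of mean above; nothing in the function itself declares which of these fall inside its domain.

import math

mean([1.0, 2.0, 3.0])             # works: 2.0
mean(x / 2 for x in range(10))    # works: the comprehension consumes the generator
mean([1.0, None, 3.0])            # works: None is silently dropped
mean([1.0, math.nan, 3.0])        # works: NaN is silently dropped
mean([])                          # ZeroDivisionError
mean(["1.0", "2.0"])              # TypeError raised inside math.isnan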
We could add type annotations to the Python function. For example:
def mean(xs: Iterator[float]) -> float:
    ...
But that is now too specific. numpy arrays do not store Python floats. Also, type annotations are not binding in Python. The user is perfectly free to send in any data they like. So if we are exposing this as a library, we still need to handle all possible inputs.
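Nothing enforces the annotation at runtime; it only helps static tools such as mypy. A small illustration (the calls here are hypothetical, not from any library):

from typing import Iterator

def mean(xs: Iterator[float]) -> float:
    xs = list(xs)
    return sum(xs) / len(xs)

mean([1, 2, 3])    # a list of ints, not an Iterator of floats: runs anyway, returns 2.0
mean(range(4))     # a range: also runs, returns 1.5
# mean("abc")      # accepted by the interpreter, fails only at runtime inside sum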
In Haskell, no such fuzziness exists. The problem would be written as follows:
mean :: (Foldable f, Fractional a) => f a -> a
mean xs = sum xs / fromIntegral (length xs)
It certainly took far less time to write this than the Python code above. This function will work over any collection that is foldable and over any fractional numeric element type. There are no edge cases to consider. The typechecker statically guarantees that no inappropriate data can enter the function. I can use this function with confidence that it is correct.
But is the Haskell function the one we want? We can begin exploring the Haskell design space. Perhaps we want to handle missing values. How might they be represented? What if we want to avoid division by zero errors? The difference is that in Haskell we spend our time thinking about design rather than frantically addressing edge cases, looking up syntactic conventions, and addressing weird gotchas. To be fair, Haskellers can spend far too much time thinking about design, but that is a temptation that can be resisted.
The enterprise punctuator
Now let's say you, Alice, are assigned a new problem.
def ejoin(xs, delimiter=None):
    """
    Must have equal support for strings and bytes
    If delimiter is None, join with no delimiter (empty string)
    """
    # if xs is not iterable
    if not hasattr(xs, "__iter__"):
        if isinstance(xs, bytes):
            return xs
        else:
            return str(xs)
    if delimiter is None:
        try:
            xs = list(xs)
            if isinstance(xs[0], bytes):
                delimiter = b''
            else:
                delimiter = ''
        except IndexError:
            return None
    return delimiter.join(xs)
If an empty list is given and no delimiter, should the returned type be str or bytes? Or should it be None? Or should it raise an error? Or is this problem so serious that no solution is possible and we simply cannot use the default? If ejoin can return None, then all code using ejoin will need to handle this case (or risk raising the dreaded "NoneType" error).
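To see why a possible None return is infectious, consider a hypothetical call site, assuming the ejoin defined above:

header = ejoin([], delimiter=None)   # returns None, not "" or b''
print("[" + header + "]")            # TypeError: can only concatenate str (not "NoneType") to str
header.upper()                       # AttributeError: 'NoneType' object has no attribute 'upper'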
What about the awkward xs = list(xs) line? We don't know if xs is a list, or a set, or a tuple, or a generator. Only lists and tuples can be indexed into. So we need to convert this to a list.
There are a few additional error cases. What if xs is heterogeneous? Maybe it is a mix of bytes and str? Perhaps this is a sufficiently unlikely case that we don't need to worry about it. What if xs is made up of, say, literal floats? Could ejoin([1,2,3], delimiter=",") yield the string 1,2,3? That might be intuitive. Probably most programmers would argue against it. But maybe you are generating reports from heterogeneous lists of things that may be evaluated to strings (or bytes). In this case, perhaps we should 1) try to convert things to strings if possible and 2) handle heterogeneous lists.
And if we are going to handle heterogeneous lists, then which type has priority, str or bytes? Which type defines the output type? This choice is pretty arbitrary. Maybe, because it is arbitrary, we need to dial back the generality.
This problem seems undecidable. Perhaps instead we should write a pair of functions, one for bytes and one for strings:
def ejoinStr(xs, delimiter=None):
    if not hasattr(xs, "__iter__"):
        xs = [xs]

    for i in range(len(xs)):
        if isinstance(xs[i], bytes):
            xs[i] = xs[i].decode()
        else:
            xs[i] = str(xs[i])

    if delimiter is None:
        delimiter = ""
    elif isinstance(delimiter, bytes):
        delimiter = delimiter.decode()

    return delimiter.join(xs)
def ejoinBytes(xs, delimiter=None):
    if not hasattr(xs, "__iter__"):
        xs = [xs]

    for i in range(len(xs)):
        if not isinstance(xs[i], bytes):
            xs[i] = str(xs[i]).encode("ascii")

    if delimiter is None:
        delimiter = b''
    elif isinstance(delimiter, str):
        delimiter = delimiter.encode("ascii")

    return delimiter.join(xs)
The Haskell equivalent, and its natural generalizations:

ejoin :: [String] -> String -> String
ejoin []  _ = ""
ejoin [x] _ = x
ejoin xs  b = foldl1 (\x y -> x <> b <> y) xs

-- A String is just a special case of a Monoid, so we can generalize to:
ejoin :: Monoid a => [a] -> a -> a
ejoin []  _ = mempty
ejoin [x] _ = x
ejoin xs  b = foldl1 (\x y -> x <> b <> y) xs

-- A list is just a special case of a Foldable, so generalize to:
ejoin :: (Foldable f, Monoid a) => f a -> a -> a
ejoin xs b = case take 2 (toList xs) of   -- toList from Data.Foldable
  []  -> mempty
  [x] -> x
  _   -> foldl1 (\x y -> x <> b <> y) xs
The bodies of the Haskell functions are essentially boilerplate. In this case, no decisions are made there. On a Haskell team, the entire discussion of how to implement ejoin would involve the type signature.
If we wanted to adapt the Haskell ejoin function to handle anything that can be turned into a string, we could simply add the Show typeclass:
ejoin :: (Foldable f, Show b) => f b -> String -> String
ejoin xs b = case take 2 (toList xs) of
  []  -> ""
  [x] -> show x
  _   -> foldl1 (\x y -> x <> b <> y) (map show (toList xs))
However, this is a pretty pointless change; it would be more natural to write:
ejoinStr = ejoin . map show
That is, rather than specialize ejoin to work specifically with the String type, we could just compose it with map show, which would convert each element in the collection to a string.
We could do the same with the Python ejoin function, but there is a critical difference. There is nothing in the ejoin function that tells us the input has been sanitized. It is the responsibility of the user of the ejoin function to clean its input. And without reading the code or the (likely incomplete, unreliable, or annoyingly verbose) documentation, you will not know (or remember) what special handling the function has. Maybe you can try to enforce a convention across your team. But this only solves the local problem. What about third party library code?
Everyone has a different opinion on what punctuate ought to do. An individual programmer may not even have a consistent view. In one place in the code they may want punctuate to work with strings, in another bytes. And what the function actually does depends entirely on the code itself. The only fixed part of the API is the number of arguments and their default values.
You are a Python programmer and you need a simple function that takes a list of strings and returns a punctuated string.
def punctuate(xs):
    return ",".join(xs)
You wouldn't normally even need a dedicated function for this, but the larger program you are building needs to punctuate many different lists, so having one function improves consistency. For example, upper management recently decided that there should be spaces after the commas. Since all punctuation was handled by this one function, you could easily swap out "," for ", ".
However, a few weeks down the line, another programmer on your team raises a concern. In one, rather kludgy in your opinion, section of his code, he sends punctuate a string instead of a list of strings. Since a Python string is just a sequence of characters, punctuate would iterate through each character and insert commas. So the string "cat" turns into "c, a, t". While it is tempting to tell him that this is his problem and that he should fix his code, you worry that you will run into this same problem later. If you try to punctuate a single string, what is the reasonable action? You could raise an error, which would be a bit passive aggressive, or you could treat the single string as a list of length one. The latter choice seems more reasonable. So you rewrite the function as follows.
def punctuate(xs):
    if isinstance(xs, str):
        return xs
    else:
        return ", ".join(xs)
A while later, another programmer messages that your punctuate code is breaking on byte strings. In Python you can specify b'asdf' as a raw character vector of type bytes. This is more memory efficient. Your punctuate function should probably handle this. So you grudgingly reimplement as follows:
def punctuate(xs):
    if not isinstance(xs, list):
        return xs
    else:
        if isinstance(xs[0], bytes):
            return b', '.join(xs)
        else:
            return ", ".join(xs)
But this doesn't seem right. The else statements here assume the inputs are of type list and str. Maybe that is fine. If the wrong types are sent in, then they will raise errors. Again, we have a design choice. Maybe we could try to convert any input to str. This would be a nice behavior for the very common case where we need to delimit a list of floats for csv files. So maybe we can write:
def punctuate(xs):
    if not isinstance(xs, list):
        return xs
    else:
        if isinstance(xs[0], bytes):
            return b', '.join(xs)
        if isinstance(xs[0], str):
            return ", ".join(xs)
        else:
            xs = [str(x) for x in xs]
            return ", ".join(xs)
Seeing this, Alice in the neighboring cubicle tells you emphatically that you are overusing isinstance and that you need to catch any exceptions that might be raised. Oh, and don't use so many return statements. And your checks are leaky, she says. You check that xs[0] is of type bytes, implicitly assuming the list is homogeneous, but is it really? What if the function gets the list ['I-am-str', b'I-byte']? Well, checking for all that would not be pythonic, so let's throw all this into try-except statements:
def punctuate(xs):
    if not isinstance(xs, list):
        out = xs
    else:
        try:
            out = ", ".join(xs)
        except TypeError:
            try:
                out = b', '.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = ", ".join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
    return out
Great, so now it first tries to join strings, then it tries bytes, and if nothing works it converts everything to strings. But then Alice points out that the list type check is too specific: it fails on sets and tuples. Instead you should check that the input is iterable:
def punctuate(xs):
    if not hasattr(xs, '__iter__'):
        out = xs
    else:
        try:
            out = ", ".join(xs)
        except TypeError:
            try:
                out = b', '.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = ", ".join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
    return out
But more bugs appear. By returning anything that isn't an iterable collection, you sometimes return things that aren't even strings, which breaks downstream code. For example, "(" + punctuate(x) + ")" breaks when x=1, a single integer. Now, maybe they shouldn't be sending this in. So can we convert xs to a string? Sure, but some things stringify in weird ways; for example, the bytes value b'asdf' stringifies literally to "b'asdf'". So we need another check:
def punctuate(xs):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = ", ".join(xs)
        except TypeError:
            try:
                out = b', '.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = ", ".join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
    return out
Then Alice helpfully points out the repetition of “,”. Shouldn’t that come in as an argument? What if we want to punctuate without the space (as in CSV files) or we want to use TABs or semicolons instead? So … forward march:
def punctuate(xs, delimiter):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = delimiter.join(xs)
        except TypeError:
            try:
                out = delimiter.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = delimiter.join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
                except TypeError:
                    raise TypeError("The delimiter and list elements don't match. Maybe???")
    return out
Now this introduces a new type of possible error. The type of the delimiter may not match the type of the strings. So we need a try-except to catch this TypeError. Now should we correct the types? We could. Maybe someone should be able to specify the delimiter using a string value and still have it work for bytes? It is becoming hard to trace which errors might arise at which places.
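The delimiter/element mismatch is easy to trigger with Python's built-in join (a quick illustration):

", ".join([b"a", b"b"])   # TypeError: sequence item 0: expected str instance, bytes found
b", ".join(["a", "b"])    # TypeError: sequence item 0: expected a bytes-like object, str found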
def punctuate(xs, delimiter):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = delimiter.join(xs)
        except TypeError:
            if isinstance(delimiter, bytes) and isinstance(xs[0], str):
                delimiter = delimiter.decode('ascii')
            elif isinstance(delimiter, str) and isinstance(xs[0], bytes):
                delimiter = delimiter.encode('ascii')
            elif not isinstance(delimiter, str):
                delimiter = str(delimiter)
            try:
                out = delimiter.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = delimiter.join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
                except TypeError:
                    raise TypeError("The delimiter and list elements don't match. Maybe???")
    return out
Alice, after staring at the code for several long minutes, asks how empty lists are handled. They should be empty strings, but should they be byte strings or ascii strings? Maybe we can infer the output type by looking at the type of the delimiter?
def punctuate(xs, delimiter):
    if not hasattr(xs, '__iter__'):
        if isinstance(xs, bytes):
            out = xs
        else:
            try:
                out = str(xs)
                if out == "" and isinstance(delimiter, bytes):
                    out = out.encode("ascii")
            except ValueError:
                raise ValueError("Can only punctuate stringable things")
    else:
        try:
            out = delimiter.join(xs)
        except TypeError:
            try:
                if isinstance(delimiter, bytes) and isinstance(xs[0], str):
                    delimiter = delimiter.decode('ascii')
                elif isinstance(delimiter, str) and isinstance(xs[0], bytes):
                    delimiter = delimiter.encode('ascii')
                elif not isinstance(delimiter, str):
                    delimiter = str(delimiter)
            except IndexError:
                if isinstance(delimiter, bytes):
                    return b''
                else:
                    return ''
            try:
                out = delimiter.join(xs)
            except TypeError:
                try:
                    xs = [str(x) for x in xs]
                    out = delimiter.join(xs)
                except ValueError:
                    raise ValueError("Can only punctuate stringable things")
                except TypeError:
                    raise TypeError("The delimiter and list elements don't match. Maybe???")
    return out
Phew! OK. Is all that correct now? Is our enterprise punctuation finally online? There’s surely some cleaner way to refactor that. You show the final code to Alice.
“It’s ugly”, she says, “But you’ve already spent too much time on it. Just add some documentation and move onto the next thing.”
def punctuate(xs, delimiter):
    """
    Punctuate a list of string-like things with the given delimiter

    This will raise errors if xs contains byte strings mingled with anything
    else. If the delimiter is bytes and the list contains strings, or vice
    versa, then the delimiter will be automatically cast to match the list
    type.

    @param xs : homogeneous list of things that can be converted to a string
    @param delimiter : an ascii string (or something that can be converted to one) or a bytes string
    @return either a bytes string if xs contains bytes or a char string otherwise
    """
    ...
Ugh. Did that documentation catch everything? Will anyone read it? Alice raises an eyebrow and says everything is OK: if anyone needs to understand the function, they can always read the code.
Now, maybe this all seems a little absurd. Maybe only strings need to be supported. Maybe only strings and bytes. Maybe only bytes, if you are working with memory-sensitive ASCII data, such as biological sequence data. Maybe it is OK if the function doesn’t raise errors sometimes. Maybe we could drop the handling for mismatched delimiter/list types? Maybe there is a smart third party tool we could use for unifying strings?
There are many approaches. Might one say, there are many ways to do it?
What is the "design space" for punctuate? That is, how many different reasonable ways could we write it? If a programmer sees punctuate in a library, what hypotheses might they form about how it works? We can describe the design space as a decision tree.
Should the function just “do the right thing”? If that is our intent, then we should add handling for punctuating things that aren’t strings and recasting.
So what is the fundamental problem here?
- We are creating an ad hoc type system. Rather than allowing the compiler to figure out the types, we have to explicitly check and branch to the correct behaviour.
- We are creating ad hoc polymorphism.
- We are creating documentation of the types and behavior that may not accurately describe the function.
- Our function has an output type that depends on the input values. This means a programmer using the function cannot be sure what type will be returned, which hurts readability (see the snippet after this list).
- The performance of all these manual checks and branches is bad.
- The code is complex, unreadable, and bug prone. The particular way we implemented it is one of dozens of equally reasonable implementations.
- Since any argument to a function may take any type, you have to form a hypothesis about how the function will be used in the future.
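A quick demonstration of that value-dependent return type, assuming the final version of punctuate above (the inputs are only illustrative):

type(punctuate(["a", "b"], ", "))    # <class 'str'>
type(punctuate([b"a", b"b"], ", "))  # <class 'bytes'> (the delimiter was silently recast)
type(punctuate(7, ", "))             # <class 'str'> (nothing was joined at all)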
punctuate is a very trivial function. If the source code is available, it is not too hard to read through it and figure out what cases are handled. But more complex functions may have completely non-obvious type requirements. Constraints on the types may be set deep within libraries imported by the library you are importing the function from. The source code may not even be available, for instance if you are interacting with a function hidden behind an API.
You can handle the bytes and strings consistently by redefining join:
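A sketch of what such a redefined join might look like; the details here are assumed, reconstructed only from the two problems discussed next:

def join(xs, delimiter=None):
    out = None
    for x in xs:
        if out is None:
            # infer the output type (str or bytes) from the first element
            out = x
        else:
            if delimiter is None:
                delimiter = b'' if isinstance(x, bytes) else ''
            # 'out' is rebuilt from scratch on every iteration
            out = out + delimiter + x
    # if xs was empty we never learned the expected type, so return None
    return out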
But there are two problems with this. The first is that we are redefining out in each iteration of the loop. This leads to very bad performance. The second problem is that we have to return None when passed an empty list. This is because we do not know the expected return type. If we returned the string type "", then this could break code expecting bytes. If we returned b'', we would break string code. But now the returned None is forcing the calling context to handle this extra type.
In Python, if we decide we want to allow bytes to be handled in one function, how does this change propagate? Which library functions handle bytes? When do I need to convert? Which functions may raise exceptions, silently recast my data to strings, or return Nones? How might these Nones be handled?
With languages like Python and R, I have to experiment with the functions to figure out how they work. Understanding the code is a scientific endeavour. Understanding the code from first principles is like understanding biology given chemistry. There are little gotchas and inconsistencies everywhere. Ground truth is hard to find. So we program under a fuzzy model, a collection of assumptions about the code, and these assumptions undermine confidence in the correctness of our programs.
This is in sharp contrast to a strongly typed language like Haskell. Here, most assumptions are verified by the compiler. Not all assumptions are verified; there are limits to the expressiveness of Haskell's type system. But in practice, the unverified assumptions are limited to a small set of edge cases. This frees a Haskell programmer to focus on the design and logic of the program.
In Haskell,
punctuate1 :: [String] -> String -> String
punctuate1 []  _ = ""
punctuate1 [x] _ = x
punctuate1 xs  b = foldl1 (\x y -> x <> b <> y) xs
This is a total function. No exception handling is needed and there is no ambiguity about what it is doing. It is impossible for any developer elsewhere on the team to misuse this function. Further, there are few other ways to write this function. It is very near to a mathematically minimal implementation. However, in Haskell, it is common to generalize one step further:
punctuate :: (Foldable f, Monoid a) => f a -> a -> a
punctuate xs b = case length xs of
  0 -> mempty
  1 -> head (toList xs)   -- toList from Data.Foldable
  _ -> foldl1 (\x y -> x <> b <> y) xs
We define punctuate for any foldable collection of monoids.

This punctuate function is richer than what we set out to find. We could "punctuate" a tree. On our quest to add a few commas between strings, we stumbled on the abstract notion of folding a constant across a collection. What does this mean? How else might it be used? Is the problem of punctuation isomorphic to some seemingly unrelated computational problem?