When ahead-of-time type checking is useful to data science

My work in atmospheric composition falls in between typical data science and software engineering. Some of the time, I'm focused on data analysis to understand how the chemical species in the atmosphere change over time. Other times, I'm working on prototyping better algorithms to measure those changes. But then, once I've finished that prototyping, it's time to turn that algorithm into something that can run reliably over and over.

For the first two cases, languages like Python or R are quite useful because of their flexibility: I can hack together a small program to get done what I need without worrying overly much about how that program is designed. The problem comes when it's time to move on to that third step. Usually by that point, the algorithm has either grown large enough that it's hard to keep in mind how all the components interact, or it needs to be fitted into an existing framework. In either case, it puts a lot of onus on the person writing the code to remember how everything works together and to test the program well enough to ensure they haven't broken anything. This is where a stricter language helps: one that can make sure you haven't made any obvious mistakes, through ahead-of-time (AOT) type checking.

So, let's talk about what "ahead-of-time type checking" is, when it is useful, and why those of us in data science seem to have gotten away from using it.

What is ahead-of-time type checking?

This is when you have a tool - usually the compiler for a language - that can verify before you ever run your program that certain aspects of your code are correct. This includes things like function names, input and output types for functions, and fields on structures/classes. As a simple example, consider this pair of functions (using Rust syntax):

pub fn read_rh(rh_file: &str) -> Vec<f64> {
    // pretend this reads a file
    todo!()
}

pub fn demo_relative_humidity() {
    read_relative_humidity("humidity.txt");
}

Clearly when we wrote the second function, we forgot that the first one abbreviated "relative humidity" in its name. An easy mistake to make! If we ask cargo (the Rust package manager and compiler wrapper) to check this code, it tells us:

error[E0425]: cannot find function `read_relative_humidity` in this scope
  --> src/lib.rs:45:9
   |
45 |         read_relative_humidity("humidity.txt");
   |         ^^^^^^^^^^^^^^^^^^^^^^ not found in this scope

So we know we mistyped the function name.

A crucial aspect of AOT type checking is that it checks all your code, not just the parts that run. Why does this matter? Consider the following Python example:

def read_temperature(temperature_file):
    try:
        # Normally, this should return a list of floats
        return read_text_file(temperature_file)
    except IOError:
        return "0.0 0.0"

If we never tested a case where read_text_file raises an exception, then we'll never catch that this function returns a string when it shouldn't. (While that problem is blatant in this case, consider what would happen if the except block instead called a function that has its own complex logic so it wasn't clear what it returned.) If this were written in Rust:

pub fn read_temperature(temperature_file: &str) -> Vec<f64> {
    // this is Rust's equivalent of a simple try-except
    if let Ok(vals) = read_txt_file(temperature_file) {
        return vals;
    } else {
        return "0.0, 0.0";
    }
}

fn read_txt_file(p: &str) -> Result<Vec<f64>, String> {
    // pretend this reads a text file
    todo!()
}

and we asked cargo to check it, we would get:

error[E0308]: mismatched types
 --> src/lib.rs:6:20
  |
2 |     pub fn read_temperature(temperature_file: &str) -> Vec<f64> {
  |                                                        -------- expected `Vec<f64>` because of return type
...
6 |             return "0.0, 0.0";
  |                    ^^^^^^^^^^ expected `Vec<f64>`, found `&str`

Even if read_txt_file never failed in any of our testing, we still get this error.

When is AOT type checking useful?

There are downsides to AOT type checking. First, it requires more up front work to describe the expected types. You probably noticed in the Rust examples the type annotations like &str and Vec<f64>. Learning the vocabulary necessary to describe these types does raise the barrier to entry for these languages.

Second, it also requires more up front work to really think through how to represent your data. Do you define a structure with set fields, or use a dictionary/map with arbitrary keys? Does everything get packed into one structure, or split out among different ones? And so on.
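
To make that concrete, here is a minimal Rust sketch of the two options; the AtmosphericProfile struct, the ProfileMap alias, and their field names are hypothetical, not from any real library:

use std::collections::HashMap;

// Option 1: a structure with set fields. The compiler knows exactly
// which variables are present and what type each one has.
pub struct AtmosphericProfile {
    pub pressure_hpa: Vec<f64>,
    pub temperature_k: Vec<f64>,
    pub relative_humidity: Vec<f64>,
}

// Option 2: a map with arbitrary keys. More flexible, but the compiler
// can no longer tell you whether "temperature_k" is actually present,
// or whether you spelled the key correctly.
pub type ProfileMap = HashMap<String, Vec<f64>>;

With the struct, a typo in a field name is a compile error; with the map, it only shows up when a lookup fails at runtime.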

Third, AOT-checked languages are less flexible. In Python, if I wrote a function to calculate potential temperature, I could write:

def calc_pot_temp(t, p):
    return t * (1000.0/p)**0.286

and this would work if t and p were integers, floats, numpy arrays, pandas Series, xarray DataArrays, and probably a bunch of other types. This is great if I'm just focused on data analysis or prototyping, because I'm usually working with small enough programs that I can quickly check that calling calc_pot_temp with my t and p gives a reasonable result.

However, if this were part of a larger program, calls to it might be spread out over dozens or hundreds of .py files. Now I have to make sure that each time I call this function, the variables given to it are valid types (as well as the right units, but that's a rant for another time). If this were written in Rust instead, it might be:

fn calc_pot_temp(t: f64, p: f64) -> f64 {
    t * (1000.0 / p).powf(0.286)
}

In this case I've given up flexibility for a guarantee that any time this function is called, it will get only floating point numbers. If I want to call it with an array, I would have to iterate over the array elements, write an array version of the function, or get into generics to make it work.
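
As a minimal sketch of what the first two of those options might look like (assuming I only need it over slices of f64; this isn't the only or definitive way to do it):

// An array version of calc_pot_temp: walk over paired temperature and
// pressure values and apply the scalar formula to each pair. A generic
// version would instead put a trait bound on the numeric type.
fn calc_pot_temp_slice(t: &[f64], p: &[f64]) -> Vec<f64> {
    t.iter()
        .zip(p.iter())
        .map(|(&t, &p)| t * (1000.0 / p).powf(0.286))
        .collect()
}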

To generalize, I usually think of program development as being broken down into three phases:

  1. Initial development, where you're just trying to get to a working example.
  2. Testing and refinement, where you're taking that first prototype and making it reliable for all your use cases.
  3. Maintenance and extension, where you're now making sure that it keeps working with new data and giving it new capabilities.

How much time each of these takes depends on the language you use and what you're doing. For example, if you were doing a one-time data analysis, then once you get it working for your data, it's done. Using a flexible language (like Python), you can probably code this up much faster than with a rigorous, ahead-of-time checked language like Rust, and you only have a little testing to do:

A bar plot of Python and Rust vs. time, with Rust taking longer for initial development.

If this becomes an analysis that you do regularly, you start adding in maintenance time. As long as this stays small, then maintenance is probably about the same for both languages:

A bar plot of Python and Rust vs. time, with Rust taking longer for initial development, but both the same for testing and maintenance.

But now imagine that this tool needs to handle more edge cases: missing data, different inputs, sometimes matching with other files, etc. Initial development will take longer for both languages. It will be tempting to use the more flexible language, get through the initial development faster to have a working tool, then test all the current use cases. With the stricter language, you'd have to spend more time considering current and future uses and structuring the code to handle them. But, while it will take longer to get to a working prototype, often you spend less time testing because you've been considering the complexity all along. Plus, the AOT type checking means you won't spend any time dealing with the classic "oops forgot that function could return None" or "oh right that needs three inputs, not two" mistakes. Then, when it comes time to update the tool for a new type of data, you don't have to worry as much about changes you make breaking the existing uses. If you add a new required input to one of your functions, for example, the AOT checked language tells you automatically if you forgot to add that input anywhere.

A bar plot of Python and Rust vs. time, with Rust taking twice as long for initial development, but much less time in testing and maintenance.
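
To make that last point about adding a required input concrete, here is a hedged sketch building on the earlier read_temperature example (the new fill_value parameter is purely hypothetical):

// Suppose read_temperature grows a new (hypothetical) required argument...
pub fn read_temperature(temperature_file: &str, fill_value: f64) -> Vec<f64> {
    todo!()
}

pub fn demo() {
    // ...then this old call no longer compiles. cargo check reports an
    // error along the lines of "this function takes 2 arguments but 1
    // argument was supplied" at this and every other call site I missed.
    read_temperature("temperature.txt");
}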

In my experience, this pattern continues: the more complex the program we're making, the more benefit there is from AOT checking in making testing and maintenance easier. If you ever work on a program (like a model or similar) used by a whole community, or even one that just has a team working on and using it, the complexity of how everything interacts massively overwhelms the ease of prototyping you get from flexible languages.

A bar plot of Python and Rust vs. time, with Rust taking more time for initial development, but Python going off the chart for maintenance time.

So when is a strictly AOT-checked language useful? First, it has to be for a program you'll use more than once. For single-use programs (yes, we all write them) the longer initial development time for a strict AOT language never pays off. But if it's a program meant to be reused, and it has grown beyond the size where you can easily keep in mind how every part of it interacts with every other part, then an AOT language becomes really useful. This is especially true if many people are contributing to the program, not all of whom are equally well versed in it.

Another case is when you're distributing this program to some group of colleagues and it has to just work. This might be something as big as a community-driven model for your field, or as small as some convenience tools. Once it's out there, if something doesn't work, you'll get bug reports (and inevitably at the least convenient time for you). Plus, if this is something that generates data, you don't want versions to change too quickly. Otherwise, with each version, your users might have to regenerate their data, and their users have to redo their analysis... the more you can get right from the start, the happier everyone will be.

If AOT is so useful, why don't we learn it as data scientists?

I see two answers here. First, for a long time the options for a fully AOT type checked language were things like Fortran, C, C++, C#, and Java. For all their computational power and type checking, Fortran, C, and C++ had too many ways to shoot yourself in the foot through bad memory management to make the benefit from type checking worth it. C# and Java, while much less prone to memory errors (though still sometimes running into memory leaks), need runtimes (Mono or the JRE, respectively) that usually aren't available on the high performance computing clusters we use. Arguably, languages like Rust and Go are the earliest fully AOT type checked languages that both provide some form of automatic memory management and either compile directly to machine code or have a very lightweight runtime.

Second, probably the majority of data scientists will not be writing programs that require the level of maintenance that makes an AOT language worth it. I'll admit that my position is a little unusual in that respect. However, I will still argue that there are many, many cases where a data scientist (or scientist in general) needs to contribute to a large, complicated project. In these cases, trading longer initial development for easier long-term maintenance brings enormous benefits - not just to the software team maintaining the project, but to the person trying to figure out how to contribute. If you're constantly worrying about how your change interacts with the rest of the program, that distracts you from making sure the scientific part of your work is correct.

What about gradual typing?

Astute readers may have noticed a distinct lack of discussion of Julia, Python's mypy type checker, and gradual typing in general. This was not an oversight - I'm less convinced of the benefits of gradual typing than many of my colleagues.

The idea behind gradual typing is that you can start with completely dynamically typed code, as in:

def calc_pot_temp(t, p):
    return t * (1000.0/p)**0.286

and over time add type annotations that can be checked either at run time or ahead of time:

def calc_pot_temp(t: float, p: float) -> float:
    return t * (1000.0/p)**0.286

Python has the mypy tool, which can use the (relatively) new type annotations to check that calls to annotated functions have the right type. So if I did:

calc_pot_temp(273.0, "1000")

it would print error: Argument 2 to "calc_pot_temp" has incompatible type "str"; expected "float" [arg-type].

Julia is a bit different; the type annotations are enforced by the language itself, but at runtime. So if we converted this example to Julia:

function calc_pot_temp(t::Float64, p::Float64)::Float64 
    return t * (1000.0/p)^0.286
end

function demo()
    println(calc_pot_temp(273.0, "1000.0"))
end

then calling demo() would produce an error:

MethodError: no method matching calc_pot_temp(::Float64, ::String)
Closest candidates are:
  calc_pot_temp(::Float64, !Matched::Float64)

Julia does have tools for AOT type checking, but (like Python) not as part of the base language.

So why am I not sold on this? Two reasons.

First, gradual typing systems all have to accept a certain amount of ambiguity at the interface between their typed and untyped sections. If we go back to the Python example, consider another function that calls our calc_pot_temp function:

def calc_500mb_pot_temp(t):
    return calc_pot_temp(t, 500.0)

Here the input t is untyped, so a type checker can't provide any useful feedback. For all it knows, t could very well be a float, but it also might be a string, list, or any other type. But, because it could be the right type, it allows it. mypy does have a strict mode that disallows untyped arguments, but that mode still allows inputs with the type annotation Any, which leaves us with the same problem.

If we reversed this, and said that any case like this, where the type checker can't prove the types match, is rejected, then we could fully enforce AOT checking, but now every input and output must be type annotated. That's simply how a language built to be AOT checked works, but in this case we're probably dealing with a type annotation system not really meant for annotating every single type.
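
For comparison, in a language built around AOT checking the annotations simply aren't optional. A Rust version of calc_500mb_pot_temp won't compile without the parameter and return types written out; a minimal sketch:

// Unlike the Python version, the parameter and return types here are
// required by the language, not optional annotations.
fn calc_500mb_pot_temp(t: f64) -> f64 {
    calc_pot_temp(t, 500.0)
}

fn calc_pot_temp(t: f64, p: f64) -> f64 {
    t * (1000.0 / p).powf(0.286)
}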

There is an argument that this ambiguity is a good thing: allow the ambiguity at a specific part of the code so that parts on the untyped side can be flexible and easy to develop, parts on the typed side can be safer and cleaner, and all the messy untyped-to-typed handling happens in clear interfaces. That way, you have the dual benefit of letting the data specialists work in the flexible, untyped space and the software team move code into the persnickety, typed space when needed. My problem with this argument is that it's just too easy to allow an untyped value to slip back into the typed part. All it takes is one unannotated return value for a function, and bam, untyped value. (mypy does check for this in its strict mode, though, so I could be convinced this is less of an issue.)

My second gripe with gradual typing is that it encourages a looseness in structuring data that affects a whole program. Without AOT checking, there is a lot less incentive to build well-defined data structures rather than using dynamic structures like dictionaries. In prototyping or single-use code, that makes sense - a dictionary is pretty clear about what its elements are, and it saves time defining a custom class and keeping it up to date. But as a code base gets bigger, having a statically defined structure, whose fields can be AOT checked by the compiler to make sure no part of the code calls a nonexistent one, reaps huge benefits. Renaming a field, for one, is much easier - no worries about missing one use of the renamed field, because the compiler will tell you if you did. Additionally, if this structure needs different fields in different cases, an AOT checked language forces you to think about how to handle that. In a non-AOT checked language, this results in lots of checks for if key in dict or similar.
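
Here is a minimal Rust sketch of that contrast; the Observation struct and its field names are hypothetical:

use std::collections::HashMap;

pub struct Observation {
    // If this field were later renamed to, say, temperature_k, every
    // obs.temperature access below would become a compile error, so no
    // use of the old name can slip through.
    pub temperature: f64,
    pub pressure: f64,
}

pub fn pot_temp_from_struct(obs: &Observation) -> f64 {
    obs.temperature * (1000.0 / obs.pressure).powf(0.286)
}

// The dictionary-style version compiles no matter which keys the map
// actually holds; a misspelled or renamed key only shows up at runtime
// as a missing value from get.
pub fn pot_temp_from_map(obs: &HashMap<String, f64>) -> Option<f64> {
    let t = obs.get("temperature").copied()?;
    let p = obs.get("pressure").copied()?;
    Some(t * (1000.0 / p).powf(0.286))
}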

Where this affects gradual typing is that, if a data structure is defined in the untyped part, it's going to be very loose and hard to deal with in the typed part. If it's defined in the typed part, that's better, but it poses a problem if the developers working only on the untyped part need to add new data to it.

Conclusion

Generally, when a student going into a data-centric field asks what programming language to learn, I tell them to learn three:

  1. A general purpose, flexible language that can do analysis and modestly sized programs (Python, R, Matlab, Julia)
  2. A shell scripting language (so Bash), since the number of times we end up working in terminals is quite high
  3. A high performance, ahead-of-time checked language to use for bigger projects (Rust, Go, Fortran, C++)

I understand that many of the languages in category 3 are intimidating, but the benefits for projects of large enough scale far outweigh the downsides. For anyone in a data science field, once you've mastered the first two languages, I strongly encourage you to learn at least one language in the third category. Modern languages in that category, like Rust and Go, have addressed a lot of the issues with older such languages and, although they aren't yet that common in data science, I recommend you give one a try.