I think all data scientists know the pain of dealing with units. Most of the time, we have to rely on a function's documentation to tell us what units it expects its inputs in and what units its outputs come back in. Sometimes we're lucky and find packages that use a unit-aware package like Pint or Unitful. Even then, there's almost always a grinding of gears between those packages and the rest of the language ecosystem, which pretends units don't exist.
Having now dealt with the (lack of) units in programming for 10+ years, I have a pretty clear wishlist for how a new, STEM-focused programming language should deal with units. I'm mostly focused on what a compiled language might look like, since I worry more about units in big programs, where it's quite difficult to remember what units different parts of the code expect, but these points can apply to a scripting-style language as well. Note that for most of the examples, I'll focus on defining dimensionality (e.g. length, mass, pressure) rather than units.
This is the most basic requirement for me. While packages like Pint and Unitful do the best job they can representing units, as third-party packages they suffer from two limitations: their syntax can never be as ergonomic as built-in language support, and the rest of the ecosystem is free to ignore them.
My argument is that, if unit definition is built into the language, then you could have something like this pseudo-Python:
def gravity_force(r: float[Length], m1: float[Mass], m2: float[Mass]) -> float[Force]:
    ...
By designing the language syntax to include units, defining the dimensionality of our inputs can be as ergonomic as possible.
Additionally, if this is part of the type, then you can't have packages that throw away units because they're not designed to handle them (or, at least, this would be a lot less likely). For example, I use Pandas a lot in my work, but the best I can do for units in a Pandas dataframe is to store the values as objects; I certainly can't assign units to a column as a whole:
from pint import Quantity as Q
import pandas as pd
df = pd.DataFrame({'lengths': [Q(0.0, 'm'), Q(100.0, 'm')]})
df.lengths.units # raises: AttributeError: 'Series' object has no attribute 'units'
df.lengths.iloc[0].units # this returns <Unit('meter')>
# And this is allowed, though it doesn't really make sense!
pd.DataFrame({'mass': [Q(0.0, 'kg'), Q(5.0, 'seconds')]})
(Also I haven't benchmarked this, but I have to wonder if it loses performance since the datatype of the "lengths" column is now object instead of float64.)
This is the sort of making-it-sort-of-work issue that building units/dimensionality into a language from the start should limit.
Another potential advantage is that we might be able to worry less about units and more about dimensionality. For example if we want to calculate frequency from wavelength:
def freq_from_wavelen(wavelen: float[Length]) -> float[1/Time]:
    c = 2.998e8 m/s
    return c / wavelen
It doesn't matter what units wavelen is in; it can always be converted so that it and c can be divided appropriately.
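To make the idea concrete, here's a minimal sketch in plain Python of how a compiler might track dimensionality as integer exponents of the base dimensions, independent of whichever units a value arrived in. The `Dim` and `Qty` names are my own invention, not a real library:

```python
from dataclasses import dataclass

# Hypothetical sketch: dimensionality as exponents of (length, mass, time).
@dataclass(frozen=True)
class Dim:
    length: int = 0
    mass: int = 0
    time: int = 0

@dataclass(frozen=True)
class Qty:
    value: float   # stored internally in SI base units
    dim: Dim

    def __truediv__(self, other: "Qty") -> "Qty":
        # dividing values subtracts dimension exponents
        return Qty(self.value / other.value,
                   Dim(self.dim.length - other.dim.length,
                       self.dim.mass - other.dim.mass,
                       self.dim.time - other.dim.time))

c = Qty(2.998e8, Dim(length=1, time=-1))   # speed of light, m/s
wavelen = Qty(500e-9, Dim(length=1))       # 500 nm, already converted to meters
freq = c / wavelen
assert freq.dim == Dim(time=-1)            # frequency: 1/Time, whatever units came in
```

Any unit of length converts to the stored base representation before the division, so the function body only ever reasons about dimensions.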
Of my four points, this is the one most focused on a compiled language, but it can also apply to a gradually typed language.
This, to me, just makes sense. How often do we write code where we don't know the units (or at least the dimensionality) of an input ahead of time? If we're doing actual calculations, pretty much never. Once a numeric value enters our program, to know what it's useful for we have to know what quantity it represents. So, rather than relying on comments and other documentation, why not make that part of the type?
As with any ahead-of-time type, there will be an interface point where data first enters our program where we don't know its type.
For instance, if we read data from a .csv or netCDF file, the pure numeric values initially won't have a unit.
There would have to be a mechanism to cast or reinterpret raw values into ones with dimensionality.
Also, there would need to be a way to interpret strings as units, so that netCDF attributes, .csv headers, etc. can be used to determine the units of a value.
I don't see this as a significant issue, since we already have to deal with this in strongly typed languages.
After all, when you read a .csv file in, say, Rust or C++, there's no guarantee that the strings in a particular column can be interpreted as numbers.
So there has to be a step where numeric values without dimension get converted to known dimensionality values.
If we were working in a unit-aware version of Rust, that might look something like:
fn get_current_pressure() -> Result<f64[Pressure], DimError> {
    let p: f64[Dimless] = get_sensor_value();
    let unit: String = get_sensor_unit();
    let p_with_dim = DynamicUnits::new(p, &unit)?; // this might fail if we don't know how to interpret `unit`
    p_with_dim.try_into()
}
The idea is that, when we first read in some raw value, we store it as a unitless/dimensionless value. Then we parse the unit string into a "dynamic" (i.e. runtime) unit, whose dimensions we don't know until the program runs. But we do know what dimensionality we expect, so at the end we try to convert to that known dimensionality. This could fail, of course, so we allow for that. Once we're out of this function, the dynamic nature of the inputs has been resolved, and we have our nice type safety back.
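The same boundary-cast pattern can be sketched in plain Python. The unit table, names, and error type here are all hypothetical; the point is the shape of the operation: parse the unit string at runtime, then either convert to the expected dimensionality or fail loudly:

```python
# Hypothetical unit table: unit string -> (dimension, factor to SI base unit)
UNIT_TABLE = {
    'Pa':  ('pressure', 1.0),
    'hPa': ('pressure', 100.0),
    'm':   ('length', 1.0),
}

class DimError(ValueError):
    pass

def cast_to(value: float, unit: str, expected_dim: str) -> float:
    # runtime interpretation of the unit string; may fail
    if unit not in UNIT_TABLE:
        raise DimError(f'cannot interpret unit {unit!r}')
    dim, factor = UNIT_TABLE[unit]
    if dim != expected_dim:
        raise DimError(f'expected {expected_dim}, got {dim}')
    return value * factor   # now in SI base units of the expected dimension

p = cast_to(1013.25, 'hPa', 'pressure')   # -> 101325.0
```

Everything downstream of `cast_to` can then assume a known dimensionality, just as the Rust sketch above does after `try_into`.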
There might be other cases where a more generic approach is required.
An easy example is if we were just copying values from one file to another.
In that case, we never do calculations on the values, so we really don't care what dimensionality those values have.
We might still want to do some work, like converting to base units before writing them out, so we'd need some mechanism to define methods that are available no matter what dimensionality a value has.
How this is handled would probably need testing to determine what works best, either a distinct type like DynamicUnits above or some kind of generics system.
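The generics flavor of that idea can be sketched with Python's typing machinery. `Val` and `copy_to_base` are hypothetical names, with a dimension type parameter `D` standing in for what a real dimensionality-generics system would provide:

```python
from dataclasses import dataclass
from typing import Generic, TypeVar

D = TypeVar('D')   # hypothetical dimension parameter

@dataclass(frozen=True)
class Val(Generic[D]):
    magnitude: float
    scale: float   # conversion factor to the base unit of D

    def to_base(self) -> 'Val[D]':
        # available for every D: converting to base units needs no
        # knowledge of which dimension this actually is
        return Val(self.magnitude * self.scale, 1.0)

def copy_to_base(values: 'list[Val[D]]') -> 'list[Val[D]]':
    # generic over dimension: works whether these are pressures or lengths
    return [v.to_base() for v in values]

out = copy_to_base([Val(1013.25, 100.0), Val(2.0, 1000.0)])
```

The copy function never needs to know what the values measure; it only needs the operations defined for all dimensions.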
This also relates to making sure that all packages deal with units.
Basically, the idea is that there is no such thing as a plain float or decimal type.
Any numeric type, other than integers, will have dimensions, even if it is something like a ratio that is considered "dimensionless".
This would ensure that every package built for this language handles units.
In the Pandas + Pint example in the first section, this would mean that Pandas couldn't have a simple float64 column with no unit; it would have to be float64<D>, where D is some dimension.
I specifically exclude integers however, because (a) integers are usually associated with counting, which is essentially unitless and (b) they're needed for purely programmatic purposes like indices, which shouldn't have units. Since physical quantities are usually continuous, enforcing units on floats/decimals/other non-integer types but not integers should be a good compromise between programming needs and representing the physical world.
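A rough sketch of what a float64<D> column could look like, using only the standard library: the dimension tag lives on the column as a whole, while storage stays packed float64, avoiding the object-dtype boxing of the Pandas workaround above. `DimColumn` is a hypothetical name, not a real Pandas feature:

```python
import array
from dataclasses import dataclass

# Hypothetical sketch: one dimension tag for the whole column,
# with compact packed-float storage underneath.
@dataclass
class DimColumn:
    dim: str            # the D in float64<D>, e.g. 'length'
    data: array.array   # packed 'd' (float64) storage, no per-element boxing

lengths = DimColumn('length', array.array('d', [0.0, 100.0]))
lengths.dim             # the unit information is known for the column as a whole
lengths.data.itemsize   # 8 bytes per element, same as a plain float64 column
```

This also answers the performance worry from the Pandas example: tagging the column rather than each element costs nothing per value.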
This comes from one of my pet peeves working in carbon cycle science. When fluxes of carbon into or out of the atmosphere are defined, a lot of the time they're in units of "mass of carbon" or "mass of CO2". The problem is, comparing these requires a conversion: 12 g of carbon corresponds to 44 g of CO2. But if you try to represent this in, say, Pint or Unitful, both would simply have units of mass, so there's no way to handle that conversion automatically.
Now, you might say "Why not just define 'grams of carbon' and 'grams of CO2' as new units and define that conversion?" That's fine, except that now you also need to define every other unit of mass. Metric prefixes might be handled, sure, but pounds, tons, metric tons, stones and the rest aren't. If, instead, the dimensionality system allowed something like:
def net(emissions: float[Mass/CO2], uptake: float[Mass/C]):
    emissions = emissions * 12 g C / 44 g CO2
    ...
we should get all these conversions for free.
That is, the system should recognize that C and CO2 are specific "subtypes" of mass; they aren't directly comparable, but can each use any mass units.
If we want to compare them, we have to do the proper unit conversions, as shown in the example.
Whether that should be done by multiplication (as in the example) or a predefined conversion method is, to me, an open question (though I like the straightforwardness of the multiplication approach, and these conversions could always be defined as constants).
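The "subtypes of mass" behavior can be approximated today in plain Python. In this sketch (the `Mass` class, `convert` helper, and the rounded 12 and 44 g/mol molar masses from the text are my own), values carry a species tag, and mixing species without an explicit conversion is an error:

```python
from dataclasses import dataclass

# Hypothetical sketch: mass values tagged with a chemical species.
@dataclass(frozen=True)
class Mass:
    grams: float
    species: str   # e.g. 'C' or 'CO2'

    def __add__(self, other: 'Mass') -> 'Mass':
        # same dimension (mass), but species must match to combine
        if self.species != other.species:
            raise TypeError(f'cannot add {other.species} mass to {self.species} mass')
        return Mass(self.grams + other.grams, self.species)

MOLAR_MASS = {'C': 12.0, 'CO2': 44.0}   # g/mol, rounded as in the text

def convert(m: Mass, species: str) -> Mass:
    # explicit species conversion via the ratio of molar masses
    return Mass(m.grams * MOLAR_MASS[species] / MOLAR_MASS[m.species], species)

emissions = Mass(44.0, 'CO2')
uptake = Mass(6.0, 'C')
net_c = convert(emissions, 'C') + uptake   # -> Mass(grams=18.0, species='C')
```

In a language with built-in support, the conversion would come for free across every mass unit rather than being hand-coded in grams.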
Further, this isn't just useful for chemists dealing with different species. In atmospheric physics, we have different forms of temperature. There's the usual absolute temperature, but also potential temperature (the temperature a parcel of air would have if it were brought to surface pressure), virtual temperature, and - just to make our heads hurt - combinations of virtual and potential temperature. All of these have units of Kelvin, but shouldn't be compared or added together. Implementing secondary components for units/dimensionality would let quantities like these - with the same unit but different meanings - be differentiated by the type system, which is crucial for correct code.
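A static type checker can already approximate this distinction via newtypes. In this sketch (the type names are mine; the exponent R/cp ≈ 0.2857 is the standard dry-air value in Poisson's equation), both types are Kelvin underneath, but a checker like mypy would flag passing a potential temperature where an absolute one is expected:

```python
from typing import NewType

# Same underlying unit (Kelvin), distinct static types.
AbsoluteTemp = NewType('AbsoluteTemp', float)
PotentialTemp = NewType('PotentialTemp', float)

P0 = 100000.0   # reference surface pressure, Pa

def to_potential(t: AbsoluteTemp, pressure_pa: float) -> PotentialTemp:
    # Poisson's equation: theta = T * (p0 / p) ** (R / cp)
    return PotentialTemp(t * (P0 / pressure_pa) ** 0.2857)

theta = to_potential(AbsoluteTemp(300.0), P0)   # at p0, theta equals T
```

The runtime cost is zero (NewType is an identity function), but the conceptual difference between the temperature kinds becomes visible to tooling, which is the behavior I'd want built in.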