Joshua L. Laughner - Python: top 5 pros and cons

Python: top 5 pros and cons

I tend to use a number of different programming languages for my work and, unsurprisingly, have developed quite a few opinions on what the strengths and weaknesses of these languages are. Let's start with my primary language, Python.

Top 5 good things about Python:

#5: Python classes only have private members by convention
#4: Python's error handling encourages catching only what needs caught
#3: Python's iterators, context managers, and magic methods in general
#2: Python's syntax is minimal and clean
#1: Python has a mature family of open source data science packages

Top 5 bad things about Python:

#5: Terse code is often dense code
#4: There's at least two package managers
#3: There's no convention to indicate arguments are mutated
#2: There's a lot of types
#1: It takes effort to make Python fast

And to wrap up:

Honorable mentions
Conclusions

Top 5 pros

5. Python classes only have private members by convention

In some languages which implement object-oriented programming (such as C++ or Java), classes can formally define certain properties or methods as private. That means that those properties/methods can't be accessed outside the class that defines them. Now, in those languages that makes sense - being able to properly protect your internal methods and keep them firmly separate from the public interface makes for safe code. It does, however, mean that if you need to access those properties for some reason, you're out of luck.

In Python, objects can have "private" methods: by convention, any method or property starting with an underscore is private. Unlike C++, Java, or their kin, Python has no mechanism to enforce that privacy. You can access them if you need - though you are accepting the risk that doing so might cause things to break.

4. Python's error handling encourages catching only what needs caught

Python's error handling is really powerful. In any language that has the throw/catch exception approach (like Python), you usually only want to catch errors that you expect. If you just catch any error, you risk hiding bugs in your code because the error gets caught, rather than properly stopping the program.

There are certain some cases where it makes sense to catch all errors, such as when you have a server that absolutely cannot crash. But in data science and analysis, those sorts of cases are rare.

This is easy to do in Python, as all you have to do in 99% of cases is filter on the error type right in the try-except statement:

try
    function_that_fails_alot()
except TypeError
    # fallback
except ValueError
    # different fallback

Because the except blocks allow you to specify which kind of error triggers them, it cuts out a good bit of boilerplate. If I wrote this in Matlab, it might look something like:

try
    function_that_fails_alot()
catch err
    if strcmp(err.identifier, 'load_error:wrong_type')
        % fallback
    elseif strcmp(err.identifier, 'load_error:bad_value')
        % different fallback
    else
        rethrow(err)
    end
end

While it's not that much more code, it is clunkier. We have to repeatedly say which error we're checking against, and we have to remember to rethrow the error at the end - or it's like it never happened!

In theory, Python's way does have a weakness - error types can be reused quite a bit, so filtering on a ValueError might catch both the intended error and 1 or 2 others. However, that's still a lot better than not filtering at all, and in my opinion keeping the error handling concise enough to write quickly incentivizes coders to use it more.

Additionally, it's super easy to create new, custom error types. Simply doing:

class MyCustomError(Exception):
    pass

is all you need to have a new type that takes an error message as its positional argument.

As a bonus, Python's try blocks can have two optional parts: else (which holds code run only if there are no errors, but which should not be protected by the try) and finally (which holds code that has to run no matter what). I only use else rarely, as I don't often have cases where code should only run if there were no errors (usually if I recover from an error I want all the following code to run). But, when I need it, it's really useful. I use finally much more - there's lots of cases to e.g. ensure a file gets closed properly whether or not there's an error.

3. Python's iterators, context managers, and magic methods in general

Iteration in Python recognizes that you only sometimes actually need the index to your array/vector/whatever; often you just need each element in the sequence. Python lets you do that: for element in sequence will cause element to take on each value in sequence in turn. If you do need the index, the enumerate function provides both the index and element. This may seem like a minor thing, after all it's not that much more code to have:

for idx = 1:length(sequence)
    element = sequence[idx]
    ...
end

but, like the streamlined try-except syntax, when you need to code up an analysis quickly, any streamlining helps. Plus it usually makes code more readable - when every line does something meaningful, you can focus on the purpose of the code and not get caught up in the fiddly bits.

Python's context manages are another great shortcut as they let you worry less about making sure any clean up tasks happen. The most common example is opening a file:

with open(new_file, 'w') as f:
    ...

Any time you are writing to a file, you always want to make sure to close it properly so that all the data gets written and the file isn't left in a corrupted state. Even when just reading a file, you want to close it when done so that the OS knows you're done reading it and it doesn't need to worry if another process wants to write to it. You can do:

f = open(new_file, 'w')
...
f.close()

but if there's an error it won't get closed. To get around that, you can do:

try:
    f = open(new_file, 'w')
    ...
finally:
    f.close()

which is better (the finally block always gets called, whether or not there's an error), but you still have to know to call the close() method. Using with open(...) is behaviorally identical to try-finally, but it "knows" what method to call to clean up f. From a design standpoint, this is great. It lets package authors build in exactly what should happen to clean up resources into their objects, and you just have to use the with syntax to automatically get that correct behavior.

More generally (and why I lumped these two together), this kind of functionality is really easy to implement in Python (once you're familiar with OOP) through "magic methods:" to implement iteration, all you need to define are the __iter__ and __next__ methods on a class; to implement context management you just need the __enter__ and __exit__ methods. You can do some cool stuff with these, for example, I wrote a class that used context managers to automatically print how long a block of code took to run, which was a nice little progress tracker.

There is an even easier way to implement an iterator - the yield statement - but that's a separate topic.

2. Python's syntax is minimal and clean

If any of these sections is going to get me griped at, it's this one. Many people dislike Python's syntax, usually because it uses whitespace to define code blocks. But in my opinion, the whitespace is a good thing. You should be intenting your code anyway so why have extra characters cluttering up the source code? In general, Python avoids using characters it doesn't need, so if and for blocks don't need their conditions or iterators enclosed in parentheses (like C does), functions and classes just start with def or class, respectively, and modules are just defined by the file.

The other part of Python syntax that I like is it supports brevity in different ways. For example, list or dictionary comprehensions let you do things like:

numbers = [1, 2, 3, 4, 5]
facs = [factorial(n) for n in numbers]
facs_dict = {n: factorial(n) for n in numbers}

The second line is a list comprehension and the third line a dictionary comprehension. Basically these let you create a new list or dictionary in one line by iterating over another sequence and calling any code you need - they're incredibly flexible. While you could always do this:

numbers = [1, 2, 3, 4, 5]
facs_dict = dict()
for n in numbers:
    facs_dict[n] = factorial(n)

the dictionary comprehension condenses three lines into one, and (in my opinion) communicates clearly that we are creating a new dictionary. In the latter example, it takes a little extra time to parse that those last three lines are all about creating a new dictionary.

That's not to say list or dictionary comprehensions are always the right approach - sometimes breaking things down in a for loop is easier to read. On the other hand, as your code gets more complicated, figuring out which three lines out of even 50 deal with creating a new dictionary can be harder than seeing one line that does the same thing. It's all about the context!

1. Python has a mature family of open source data science packages

This one is pretty self explanatory: Python has been around, and been used in data science long enough, that the packages needed for data science have had time to mature. Numpy, SciPy, and Matplotlib are all staple packages that have had most of their rough edges sanded away after years of use. Newer packages like Pandas and xarray add new capabilities to connect data with coordinates (though xarray is still technically in its 0.x release cycle). And, Python has become a leading language for machine learning with packages like TensorFlow, Keras, and PyTorch.

Top 5 cons

5. Terse code is often dense code

Okay, this one is the flip side of the #2 pro. While I love how compact you can write Python code, you can end up with things like this:

x, y = [int(s) for s in line.split(',')]

I wrote that, and I'll freely admit it took me more than a few minutes to figure out what the heck I was doing there. In one line, its:

Spliting the string line on commas
Converting each substring to an integer
Assigning the first integer to x and the second to y (and there must be exactly two integers)

To an experienced Python coder, this is straightforward enough to parse, but for a newcomer, it's anything but straightforward. And that's certainty not the most complicated one-liner I've ever written!

4. There's at least two package managers

If you use Python, you probably know that the main package manager is pip. But if you're in data science, you've probably at least heard of the conda package manager. Now, while I use conda a lot - its ability to install non-Python packages into its environments is really useful - the fact that there are two separate package managers for installing Python packages is... confusing. It also means that Python developers have to put more effort into making their code public, which isn't great.

It's easy to excuse this shortfall. After all, Python was first released in 1991, early enough that the transition to the modern internet was still going on. But, when stacked up against modern languages like Go, Rust, or Julia that have a package manager built in from the beginning, having this split in package availability is a definite downside.

3. There's no convention to indicate arguments are mutated

If you use Python, you've probably been bitten by this at some point:

def double_it(a_list):
    for i, v in enumerate(a_list):
        a_list[i] = 2*v

    return a_list

list1 = [1, 2, 3]
list2 = double_it(list1)

assert list1[0] == 1  # not true

Without looking at the source code (or reading the documentation), you wouldn't know that double_it changes the input list. Since it returns a new list, you can easily be lulled into thinking that it won't touch the original. Or, if you think that's not fair (after all, it's returning a list, which implies it won't touch the original), what about list.sort() vs. numpy_array.reshape(...)? The former acts on list in place, the latter returns a new array. Even standard libraries disagree!

Given that Python is a dynamically typed language, we'll never see Rust-like explicit mutability declarations, but some indicator that arguments are modified would be nice. Julia has the convention that functions ending in ! mutate some of the arguments (usually the first). It's not enforced by the compiler, but it's usually applied consistently enough to tell you whether your input is going to be changed in place or not.

2. There's a lot of types

This may seem like an odd point to bring up for a dynamically typed language, but there's a lot of ways to represent data. You could use lists, tuples, dictionaries, and sets just within Python's core language. Then Numpy adds multi-dimensional arrays and masked arrays, Pandas adds series and dataframes, Xarray adds DataArrays and Datasets.... Each of these does have their place, but it's quite easy to create a bit of a mess where different parts of your code use different types, and you have to convert between them.

In an ideal world, you could ignore the types and use duck-typing principles: just try what you need to do and see if it works. In practice of course, different types have different methods (e.g. Pandas dataframes have `loc` and `iloc` to index while `xarray` data arrays have `sel` and `isel`), so you really do have to know the type.

This can be mitigated with coding discipline and foresight: pick a useful way to store the data and stick to that. Usually though, in an evolving project, you'll find yourself using other types where they are a more natural fit to just get things done.

1. It takes effort to make Python fast

Now, notice that I did not say Python is slow. Python can be fast, if code is written to take advantage of fast libraries. Multiply two 100-million element vectors for a for loop in Python, and it'll take 20-30 seconds. Multiple them as Numpy arrays, and it'll take 0.2 seconds (which compares favorably with C and Fortran). And before you say "But Numpy uses C/Fortran libraries," so what? All you see is the Python interface.

This does mean that what you can make fast is limited to what you can write as fast operations. Now, often, those are not only the fastest way to do something, but the easiest to write and read. Multiplying two arrays as just a * b is a lot easier than a for loop. Splitting a dataframe with groupby is way easier than doing it manually. The problem is when you need to do something more complicated, you either have to spend extra brain power to figure out how to write it in the fast way - or recognize that you can't. And when you can't - yes, Python is slow.

If you want to see how Python stacks up against other languages, I wrote some benchmark tests for that 100-million element vector multiplication exercise [over on GitHub](https://github.com/joshua-laughner/ArrayMultTiming).

Honorable mentions

Pros

PEP8 is a style guide that gives useful rules for keeping Python formatting consistent. It explicitly says that following it comes after being consistent within a function, module, or project, but having an officially recognized set of ideal format rules helps keep us all on the same page, which makes it easier to read each others' code.
Modules can be scripts by using if __name__ == '__main__' to protect code that should only run when the file is invoked as a script. I use this to give myself code accessible as a command line program (which I can run in screen overnight if it's long) and as a module in a notebook. Huzzah for less code duplication!
Flexible importing means you can import installed packaged, modules or packages in your current directory, or modules/packages in any of your sys.path directories. While eventually moving toward proper package organization is best, having flexibility is great when trying to develop quickly (or when just learning).

Cons

Parallelization isn't that easy in Python. The multiprocessing library is easy enough if you just need to spawn multiple tasks that never exchange data, but to have them distribute and reassemble data takes a moderate bit of effort to get right. This may have gotten easier with Python 3.8's shared memory functionality which I have yet to test, but is unlikely to be as easy as Matlab's parfor loops for example.
Multiple inheritance doesn't come up that often in data work, but since Python classes can inherit from multiple parents, you can end up with some complicated inheritance rules. Once you understand them, they make sense, but until you do there's some confusing edge cases.

Conclusions

I really like Python as a language. For data analysis, it has the right mix of flexibility and concision to make it perfect to handle things from a quick-and-dirty plotting script to a fleshed out, well-structure analysis package. It has its downsides, but most of them can (in my opinion) be mitigated with good discipline