I tend to use a number of different programming languages for my work and, unsurprisingly, have developed quite a few opinions on what the strengths and weaknesses of these languages are. Let's start with my primary language, Python.
Below are my top 5 good things about Python, my top 5 bad things, and a short wrap-up at the end.
In some languages which implement object-oriented programming (such as C++ or Java), classes can formally define certain properties or methods as private. That means that those properties/methods can't be accessed outside the class that defines them. Now, in those languages that makes sense - being able to properly protect your internal methods and keep them firmly separate from the public interface makes for safe code. It does, however, mean that if you need to access those properties for some reason, you're out of luck.
In Python, objects can have "private" methods: by convention, any method or property starting with an underscore is private. Unlike C++, Java, or their kin, Python has no mechanism to enforce that privacy. You can access them if you need - though you are accepting the risk that doing so might cause things to break.
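As a minimal sketch (the class here is made up purely for illustration), the underscore convention and its non-enforcement look like this:

class Counter:
    def __init__(self):
        self._count = 0          # "private" by convention (leading underscore)

    def increment(self):
        self._bump(1)            # the public interface uses the internal helper

    def _bump(self, n):
        # internal helper; nothing in the language stops outside callers
        self._count += n

c = Counter()
c.increment()
c._bump(10)                      # allowed, but you accept the risk of relying on internals
print(c._count)                  # 11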
Python's error handling is really powerful. In any language that has the throw/catch exception approach (like Python), you usually only want to catch errors that you expect. If you just catch any error, you risk hiding bugs in your code because the error gets caught, rather than properly stopping the program.
This is easy to do in Python, as all you have to do in 99% of cases is filter on the error
type right in the try-except
statement:
try:
    function_that_fails_alot()
except TypeError:
    pass  # fallback
except ValueError:
    pass  # different fallback
Because the except
blocks allow you to specify which kind of error triggers them, it cuts out
a good bit of boilerplate. If I wrote this in Matlab, it might look something like:
try
    function_that_fails_alot()
catch err
    if strcmp(err.identifier, 'load_error:wrong_type')
        % fallback
    elseif strcmp(err.identifier, 'load_error:bad_value')
        % different fallback
    else
        rethrow(err)
    end
end
While it's not that much more code, it is clunkier. We have to repeatedly say which error we're checking
against, and we have to remember to rethrow
the error at the end - or it's like it never happened!
In theory, Python's way does have a weakness - error types can be reused quite a bit, so filtering on a
ValueError
might catch both the intended error and 1 or 2 others. However, that's still a lot better than
not filtering at all, and in my opinion keeping the error handling concise enough to write quickly incentivizes
coders to use it more.
Additionally, it's super easy to create new, custom error types. Simply doing:
class MyCustomError(Exception):
    pass
is all you need to have a new type that takes an error message as its positional argument.
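For example (load_config here is just a hypothetical function to show the error in use), the custom error raises and catches like any built-in exception:

class MyCustomError(Exception):
    pass

def load_config(path):
    # hypothetical function, only here to demonstrate raising the custom error
    raise MyCustomError(f"could not read config file: {path}")

try:
    load_config("settings.ini")
except MyCustomError as err:
    print(err)  # could not read config file: settings.ini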
As a bonus, Python's try
blocks can have two optional parts: else
(which holds code run only if there are
no errors, but which should not be protected by the try
) and finally
(which holds code that has to run
no matter what). I only use else
rarely, as I don't often have cases where code should only run if there
were no errors (usually if I recover from an error I want all the following code to run). But, when I need it,
it's really useful. I use finally
much more - there are lots of cases where you need to, e.g., ensure a file gets closed properly
whether or not there's an error.
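Here's a small sketch of all four pieces together (the file name and the summarize helper are made up for illustration):

def summarize(text):
    # stand-in for real analysis code
    return f"{len(text.splitlines())} lines of results"

try:
    f = open('results.csv')  # may raise OSError if the file is missing
except OSError:
    print('no results yet, skipping')
else:
    # runs only if open() succeeded; deliberately not protected by the try
    print(summarize(f.read()))
    f.close()
finally:
    # runs no matter what - error or no error
    print('done checking for results')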
Iteration in Python recognizes that you only sometimes actually need the index to your array/vector/whatever;
often you just need each element in the sequence. Python lets you do that: for element in sequence
will cause
element
to take on each value in sequence
in turn. If you do need the index, the enumerate
function
provides both the index and element. This may seem like a minor thing; after all, it's not that much more
code to have:
for idx = 1:length(sequence)
    element = sequence(idx)
    ...
end
but, like the streamlined try-except
syntax, when you need to code up an analysis quickly, any streamlining
helps. Plus it usually makes code more readable - when every line does something meaningful, you can focus on
the purpose of the code and not get caught up in the fiddly bits.
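For comparison, the Python versions described above (sequence here is just a placeholder list):

sequence = ['a', 'b', 'c']

# just the elements
for element in sequence:
    print(element)

# the index as well, when you actually need it
for idx, element in enumerate(sequence):
    print(idx, element)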
Python's context managers are another great shortcut, as they let you worry less about making sure any clean-up tasks happen. The most common example is opening a file:
with open(new_file, 'w') as f:
    ...
Any time you are writing to a file, you always want to make sure to close it properly so that all the data gets written and the file isn't left in a corrupted state. Even when just reading a file, you want to close it when done so that the OS knows you're done reading it and it doesn't need to worry if another process wants to write to it. You can do:
f = open(new_file, 'w')
...
f.close()
but if there's an error it won't get closed. To get around that, you can do:
f = open(new_file, 'w')
try:
    ...
finally:
    f.close()
which is better (the finally
block always gets called, whether or not there's an error), but you still have to
know to call the close()
method. Using with open(...)
is behaviorally identical to try-finally
, but it "knows"
what method to call to clean up f
. From a design standpoint, this is great. It lets package authors build exactly what should happen to clean up
resources into their objects, and you just have to use the with
syntax to automatically get that correct behavior.
More generally (and why I lumped these two together), this kind of functionality is really easy to implement in Python
(once you're familiar with OOP) through "magic methods": to implement iteration, all you need to define are the __iter__
and __next__
methods on a class; to implement context management you just need the __enter__
and __exit__
methods.
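For iteration, a minimal made-up example is a class that counts down, needing nothing beyond those two methods:

class Countdown:
    """Iterate from n down to 1 using only __iter__ and __next__."""
    def __init__(self, n):
        self.n = n

    def __iter__(self):
        return self

    def __next__(self):
        if self.n <= 0:
            raise StopIteration
        self.n -= 1
        return self.n + 1

for i in Countdown(3):
    print(i)  # prints 3, 2, 1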
You can do some cool stuff with these; for example, I wrote a class that used context managers to automatically print
how long a block of code took to run, which made for a nice little progress tracker.
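As an illustration (this is a from-scratch sketch, not that original class), a timing context manager only needs __enter__ and __exit__:

import time

class Timer:
    """Print how long the body of a with-block took to run."""
    def __init__(self, label='block'):
        self.label = label

    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        elapsed = time.perf_counter() - self.start
        print(f'{self.label} took {elapsed:.3f} s')
        return False  # don't swallow any exception raised inside the block

with Timer('slow loop'):
    total = sum(i**2 for i in range(1_000_000))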
Iteration can also be implemented even more simply with generator functions and the yield statement - but that's a separate topic.
If any of these sections is going to get me griped at, it's this one. Many people dislike Python's syntax, usually because
it uses whitespace to define code blocks. But in my opinion, the whitespace is a good thing. You should be indenting your
code anyway, so why have extra characters cluttering up the source code? In general, Python avoids using characters it doesn't
need, so if
and for
blocks don't need their conditions or iterators enclosed in parentheses (like C does), functions and
classes just start with def
or class
, respectively, and modules are just defined by the file.
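A tiny sketch of those points (the function is made up): no parentheses around the condition or the iterator, and blocks defined purely by indentation:

def describe(values):
    if not values:
        return 'empty'
    for v in values:
        print(v)
    return f'{len(values)} values'

print(describe([1, 2, 3]))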
The other part of Python syntax that I like is that it supports brevity in different ways. For example, list and dictionary comprehensions let you do things like:
numbers = [1, 2, 3, 4, 5]
facs = [factorial(n) for n in numbers]
facs_dict = {n: factorial(n) for n in numbers}
The second line is a list comprehension and the third line a dictionary comprehension. Basically these let you create a new list or dictionary in one line by iterating over another sequence and calling any code you need - they're incredibly flexible. While you could always do this:
numbers = [1, 2, 3, 4, 5]
facs_dict = dict()
for n in numbers:
    facs_dict[n] = factorial(n)
the dictionary comprehension condenses three lines into one, and (in my opinion) communicates clearly that we are creating a new dictionary. In the latter example, it takes a little extra time to parse that those last three lines are all about creating a new dictionary.
Admittedly, if you've never seen a comprehension before, the explicit for loop is easier to read.
On the other hand, as your code gets more complicated, figuring out which three
lines out of even 50 deal with creating a new dictionary can be harder than seeing one line that does the same thing.
It's all about the context!
This one is pretty self explanatory: Python has been around, and been used in data science long enough, that the packages needed for data science have had time to mature. Numpy, SciPy, and Matplotlib are all staple packages that have had most of their rough edges sanded away after years of use. Newer packages like Pandas and xarray add new capabilities to connect data with coordinates (though xarray is still technically in its 0.x release cycle). And, Python has become a leading language for machine learning with packages like TensorFlow, Keras, and PyTorch.
Okay, this one is the flip side of the #2 pro. While I love how compactly you can write Python code, you can end up with things like this:
x, y = [int(s) for s in line.split(',')]
I wrote that, and I'll freely admit it took me more than a few minutes to figure out what the heck I was doing there. In one line, it's:
- splitting line on commas
- converting each piece to an integer
- assigning the first integer to x and the second to y (and there must be exactly two integers)

To an experienced Python coder, this is straightforward enough to parse, but for a newcomer, it's anything but straightforward. And that's certainly not the most complicated one-liner I've ever written!
If you use Python, you probably know that the main package manager is pip
. But if you're in data science, you've probably at
least heard of the conda
package manager. Now, while I use conda
a lot - its ability to install non-Python packages into
its environments is really useful - the fact that there are two separate package managers for installing Python packages is...
confusing. It also means that Python developers have to put more effort into making their code public, which isn't great.
It's easy to excuse this shortfall. After all, Python was first released in 1991, early enough that the transition to the modern internet was still going on. But, when stacked up against modern languages like Go, Rust, or Julia that have a package manager built in from the beginning, having this split in package availability is a definite downside.
If you use Python, you've probably been bitten by this at some point:
def double_it(a_list):
    for i, v in enumerate(a_list):
        a_list[i] = 2*v
    return a_list

list1 = [1, 2, 3]
list2 = double_it(list1)
assert list1[0] == 1  # fails - list1 was modified in place
Without looking at the source code (or reading the documentation), you wouldn't know that double_it
changes the input list. Since it returns a list, you can easily be lulled into thinking it's a new list and that the original
won't be touched. Or, if you think that's not fair
(after all, it's returning a list, which implies it won't touch the original), what about list.sort()
vs. numpy_array.reshape(...)
?
The former acts on list
in place, the latter returns a new array. Even standard libraries disagree!
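A quick illustration of the inconsistency:

import numpy as np

lst = [3, 1, 2]
result = lst.sort()           # sorts lst in place...
print(lst, result)            # [1, 2, 3] None  ...and returns None

arr = np.arange(6)
reshaped = arr.reshape(2, 3)  # returns a new (2, 3) array
print(arr.shape)              # (6,) - the original is untouched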
Given that Python is a dynamically typed language, we'll never see Rust-like explicit mutability declarations, but some indicator that
arguments are modified would be nice. Julia has the convention that functions ending in !
mutate some of the arguments (usually the
first). It's not enforced by the compiler, but it's usually applied consistently enough to tell you whether your input is going to be
changed in place or not.
This may seem like an odd point to bring up for a dynamically typed language, but there's a lot of ways to represent data. You could use lists, tuples, dictionaries, and sets just within Python's core language. Then Numpy adds multi-dimensional arrays and masked arrays, Pandas adds series and dataframes, Xarray adds DataArrays and Datasets.... Each of these does have their place, but it's quite easy to create a bit of a mess where different parts of your code use different types, and you have to convert between them.
This can be mitigated with coding discipline and foresight: pick a useful way to store the data and stick to that. Usually though, in an evolving project, you'll find yourself using other types where they are a more natural fit to just get things done.
Now, notice that I did not say Python is slow. Python can be fast, if code is written to take advantage of fast libraries. Multiply two 100-million element vectors with a for loop in Python, and it'll take 20-30 seconds. Multiply them as Numpy arrays, and it'll take 0.2 seconds (which compares favorably with C and Fortran). And before you say "But Numpy uses C/Fortran libraries" - so what? All you see is the Python interface.
This does mean that what you can make fast is limited to what you can write as fast operations. Now, often, those are not only the fastest
way to do something, but the easiest to write and read. Multiplying two arrays as just a * b
is a lot easier than a for
loop. Splitting
a dataframe with groupby
is way easier than doing it manually. The problem is when you need to do something more complicated, you either
have to spend extra brain power to figure out how to write it in the fast way - or recognize that you can't. And when you can't - yes, Python
is slow.
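A rough sketch of that comparison (10 million elements here to keep memory modest; exact timings will vary by machine):

import time
import numpy as np

n = 10_000_000
a = np.random.rand(n)
b = np.random.rand(n)

# element-by-element multiplication in pure Python
start = time.perf_counter()
slow = [x * y for x, y in zip(a, b)]
print('python loop:', time.perf_counter() - start, 's')

# the same multiplication as a single vectorized NumPy operation
start = time.perf_counter()
fast = a * b
print('numpy:      ', time.perf_counter() - start, 's')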
A few other things worth mentioning:
- You can use if __name__ == '__main__' to protect code that should only run when the file is invoked as a script. I use this to give myself code accessible as a command line program (which I can run in screen overnight if it's long) and as a module in a notebook. Huzzah for less code duplication!
- Python will happily import modules from any of the sys.path directories. While eventually moving toward proper package organization is best, having that flexibility is great when trying to develop quickly (or when just learning).
- The multiprocessing library is easy enough if you just need to spawn multiple tasks that never exchange data, but to have them distribute and reassemble data takes a moderate bit of effort to get right. This may have gotten easier with Python 3.8's shared memory functionality, which I have yet to test, but it's unlikely to be as easy as Matlab's parfor loops, for example.

I really like Python as a language. For data analysis, it has the right mix of flexibility and concision to make it perfect for handling everything from a quick-and-dirty plotting script to a fleshed out, well-structured analysis package. It has its downsides, but most of them can (in my opinion) be mitigated with good discipline.
loops for example.I really like Python as a language. For data analysis, it has the right mix of flexibility and concision to make it perfect to handle things from a quick-and-dirty plotting script to a fleshed out, well-structure analysis package. It has its downsides, but most of them can (in my opinion) be mitigated with good discipline