Introduction and starter code
Let's try to use count if aggregation in Pandas on a very simple table like this:
letter | number |
---|---|
A | 1 |
A | 2 |
A | 2 |
B | 1 |
B | 1 |
We may for example compute how many times number 2 occurs in all groups. The first thing we find out is that there is no count if function, so probably the easiest thing to accomplish our task would be to just use filter instead of an aggregation like:
mask = df["number"] == 2df.loc[mask].groupby(by=["letter"])["number"].count()
This is not ideal nor good at all! We do get number of occurences of 2 in A, but we get no result for B. Also, using count with filter disables us from e.g. counting sizes of groups or computing average at the same time. Let's improve this starter code.
Improvements
Named aggregations
When using SQL clause GROUP BY
, one is able to specify multiple aggregations
at the same time:
SELECTletter, COUNTIF(number = 1) AS cnt_ones, COUNTIF(number = 2) AS cnt_twos, COUNT(*) AS cnt_all, AVERAGE(number) AS avg_numberFROMdataset.tableGROUP BYletter
Exactly the same is possible to do in Python.
One does not have to group by multiple times using different filters and aggregations,
it is possible to group by only once and specify multiple aggregations exactly like in SQL.
This is done using function .agg
called on a DataFrame and as it is possible to specify
the column which should be aggregated, the aggregation function and the name which should appear
in the table, this is called named aggregations.
df.groupby(by=["letter"])\.agg(cnt_all=("number", "count"),avg_number=("number", "mean"))\.reset_index()
As some group byes may be very slow it would be waste of resources to group by multiple times and this way you group by only once. However, we still do not have the count if function.
Custom aggregation function
Apart from count
, mean
, sum
etc., it is possible to pass completely custom function
to named aggregation. So when there is no count if, let's create one.
Aggregation function needs to have only one argument and that is the array of values of a group.
With our example dataframe, there would be two arrays [1, 2, 2]
for A and [1, 1]
for B.
"Count if number equals two" could look like:
def count_if_equals_two(arr):return list(arr).count(2)
Count values using Numpy
The only way to count number of occurrences of an element in the list is to go through the list
and count all the values, there is no shortcut.
However, instead of changing the array to list, we can instead keep it array and use np.unique()
.
This function by itself returns unique elements in an array, however it is possible to pass return_counts=True
and it will return also counts for every unique value.
def count_if_equals_two(arr):unique, counts = np.unique(arr, return_counts=True)d = dict(zip(unique, counts))return d[2]
Use defaultdict
The current version of count_if_equals_two
has a bug.
If there was no number 2 in number
column, running the code would produce KeyError
as there is no key 2 in the dictionary.
This can be fixed by checking that the number is in dictionary:
def count_if_equals_two(arr):unique, counts = np.unique(arr, return_counts=True)d = dict(zip(unique, counts))if 2 in d.keys():return d[2]else:return 0
However, this way you need to open the dictionary twice, the first time you look if the key is in the dictionary, the second time you get value for that key.
Instead, we can use defaultdict
from collections
library.
Defauldict is basically the same as ordinary dictionary,
but you set default value for keys, so if anyone asks for a value
of a key that is not present, the default value is returned.
So the better version, using defaultdict
, where the default value is zero, looks like:
def count_if_equals_two(arr):unique, counts = np.unique(arr, return_counts=True)d = defaultdict(lambda: 0, zip(unique, counts))return d[2]
Use partial
The function is not really a general count if now, so let's fix it and write a more general version:
def count_if_equals(arr, number=0):unique, counts = np.unique(arr, return_counts=True)d = defaultdict(lambda: 0, zip(unique, counts))return d[number]
The problem is that aggregation functions needs to take only one argument
and it is not possible to pass more arguments in the .agg
function called on a DataFrame.
Luckily, it is possible to fix an argument of a function in place, create a new function in the process.
partial
is used exactly for that.
For example, let's have a simple addition:
def add(x, y):return x+y
For our imaginary usecase we would need to pass only one argument,
but always add 2, so using partial
:
add_two = partial(add, y=2)
This way, we created a new function called add_two
which takes only a single argument.
Similarly, we can set the value of argument of aggregation function
and still pass to .agg
a function with a single argument.
Final code
Let's look at the final code.
We need to import defaultdict
and partial
,
but we have general count if aggregation function and I think that the solution is pretty elegant.
from collections import defaultdictfrom functools import partialdef count_if_equals(arr, number=0):unique, counts = np.unique(arr, return_counts=True)d = defaultdict(lambda: 0, zip(unique, counts))return d[number]grouped = df.groupby(by=["letter"])\.agg(cnt_all=("number", "count"),cnt_ones=("number", partial(count_if_equals, number=1)),cnt_twos=("number", partial(count_if_equals, number=2)),avg_number=("number", "mean"))\.reset_index()