Introduction and starter code

Let's try to use count if aggregation in Pandas on a very simple table like this:

letter	number
A	1
A	2
A	2
B	1
B	1

We may for example compute how many times number 2 occurs in all groups. The first thing we find out is that there is no count if function, so probably the easiest thing to accomplish our task would be to just use filter instead of an aggregation like:

mask = df["number"] == 2
df.loc[mask].groupby(by=["letter"])["number"].count()

This is not ideal nor good at all! We do get number of occurences of 2 in A, but we get no result for B. Also, using count with filter disables us from e.g. counting sizes of groups or computing average at the same time. Let's improve this starter code.

Improvements

Named aggregations

When using SQL clause GROUP BY, one is able to specify multiple aggregations at the same time:

SELECT
  letter
  , COUNTIF(number = 1) AS cnt_ones
  , COUNTIF(number = 2) AS cnt_twos
  , COUNT(*) AS cnt_all
  , AVERAGE(number) AS avg_number
FROM
  dataset.table
GROUP BY
  letter

Exactly the same is possible to do in Python. One does not have to group by multiple times using different filters and aggregations, it is possible to group by only once and specify multiple aggregations exactly like in SQL. This is done using function .agg called on a DataFrame and as it is possible to specify the column which should be aggregated, the aggregation function and the name which should appear in the table, this is called named aggregations.

df.groupby(by=["letter"])\
  .agg(cnt_all=("number", "count"),
       avg_number=("number", "mean"))\
  .reset_index()

As some group byes may be very slow it would be waste of resources to group by multiple times and this way you group by only once. However, we still do not have the count if function.

Custom aggregation function

Apart from count, mean, sum etc., it is possible to pass completely custom function to named aggregation. So when there is no count if, let's create one. Aggregation function needs to have only one argument and that is the array of values of a group. With our example dataframe, there would be two arrays [1, 2, 2] for A and [1, 1] for B.

"Count if number equals two" could look like:

def count_if_equals_two(arr):
    return list(arr).count(2)

Count values using Numpy

The only way to count number of occurrences of an element in the list is to go through the list and count all the values, there is no shortcut. However, instead of changing the array to list, we can instead keep it array and use np.unique(). This function by itself returns unique elements in an array, however it is possible to pass return_counts=True and it will return also counts for every unique value.

def count_if_equals_two(arr):
    unique, counts = np.unique(arr, return_counts=True)
    d = dict(zip(unique, counts))
    return d[2]

Use defaultdict

The current version of count_if_equals_two has a bug. If there was no number 2 in number column, running the code would produce KeyError as there is no key 2 in the dictionary. This can be fixed by checking that the number is in dictionary:

def count_if_equals_two(arr):
    unique, counts = np.unique(arr, return_counts=True)
    d = dict(zip(unique, counts))
    if 2 in d.keys():
        return d[2]
    else:
        return 0

However, this way you need to open the dictionary twice, the first time you look if the key is in the dictionary, the second time you get value for that key.

Instead, we can use defaultdict from collections library. Defauldict is basically the same as ordinary dictionary, but you set default value for keys, so if anyone asks for a value of a key that is not present, the default value is returned.

So the better version, using defaultdict, where the default value is zero, looks like:

def count_if_equals_two(arr):
    unique, counts = np.unique(arr, return_counts=True)
    d = defaultdict(lambda: 0, zip(unique, counts))
    return d[2]

Use partial

The function is not really a general count if now, so let's fix it and write a more general version:

def count_if_equals(arr, number=0):
    unique, counts = np.unique(arr, return_counts=True)
    d = defaultdict(lambda: 0, zip(unique, counts))
    return d[number]

The problem is that aggregation functions needs to take only one argument and it is not possible to pass more arguments in the .agg function called on a DataFrame.

Luckily, it is possible to fix an argument of a function in place, create a new function in the process. partial is used exactly for that. For example, let's have a simple addition:

def add(x, y):
    return x+y

For our imaginary usecase we would need to pass only one argument, but always add 2, so using partial:

add_two = partial(add, y=2)

This way, we created a new function called add_two which takes only a single argument. Similarly, we can set the value of argument of aggregation function and still pass to .agg a function with a single argument.

Final code

Let's look at the final code. We need to import defaultdict and partial, but we have general count if aggregation function and I think that the solution is pretty elegant.

from collections import defaultdict
from functools import partial

def count_if_equals(arr, number=0):
    unique, counts = np.unique(arr, return_counts=True)
    d = defaultdict(lambda: 0, zip(unique, counts))
    return d[number]

grouped = df.groupby(by=["letter"])\
            .agg(cnt_all=("number", "count"),
                 cnt_ones=("number", partial(count_if_equals, number=1)),
                 cnt_twos=("number", partial(count_if_equals, number=2)),
                 avg_number=("number", "mean"))\
            .reset_index()