TLDR
When you save DataFrame with nested column (list of dicts, JSON) to a file
then the column is saved as a string and you need to use .apply(eval)
on the column when you load it from the file.
If you don't then it will stay a string, will not be an object and
you will not be able to .explode
the list and .apply(pd.Series)
to unnest it.
Loading JSON from a file
Recently, I was downloading JSON data using an API.
For this, I was using the requests
library.
There was a lot of data I needed and using an API I could download
only a few hundred rows at once.
So I iteratively called an API, loading the response to DataFrame,
concatenating it.
I ended up with a DataFrame that contained one string column and one
nested column (list of dicts).
Usually, when you want to unnest such DataFrame,
you successively use .explode("name_of_nested_column")
and .apply(pd.Series)
.
Exploding takes elements from the list and creates a row for each of them.
For example, if you had
index | nested_column |
---|---|
1 | [1, 2, 3] |
and you would explode the nested column, you would end up with:
index | nested_column |
---|---|
1 | 1 |
1 | 2 |
1 | 3 |
Applying Series on a dictionary column flattens the dictionary,
converting keys to column names and filling these new columns
with dictionary values. For example, applying Series to dictionary_column
in:
index | dictionary_column |
---|---|
1 | {"A":3, "B":5} |
2 | {"A":11, "B":15, "C":19} |
returns following table:
A | B | C |
---|---|---|
3 | 5 | NaN |
11 | 15 | 19 |
This is fine when you load the data e.g. using the earlier mentioned
requests
library. However, it won't work if you are loading the
data from .csv file (which is likely as you should save the data
after download). It won't work because the data are saved as a string
in the table. To convert it to a dictionary, you need to .apply(eval)
,
which says to the interpreter that anything inside the string is actually
Python code which should be run and as there is just list of dictionaries
inside the string, it is transferred from string to list of dictionaries.
Together, you can chain these methods for example like this:
df = pd.read_csv("../data/dataset.csv") # Load the datasetdf["nested_column"] = df["nested_column"].apply(eval) # Convert string to list of dictsdf = df.explode("nested_column") # Expand the list in individual rowsdf = pd.concat([df.drop(["nested_column"], axis=1), # All cols except the nested onedf["nested_column"].apply(pd.Series)], axis=1) # All newly created cols