TLDR

When you save DataFrame with nested column (list of dicts, JSON) to a file then the column is saved as a string and you need to use .apply(eval) on the column when you load it from the file. If you don't then it will stay a string, will not be an object and you will not be able to .explode the list and .apply(pd.Series) to unnest it.

Loading JSON from a file

Recently, I was downloading JSON data using an API. For this, I was using the requests library. There was a lot of data I needed and using an API I could download only a few hundred rows at once. So I iteratively called an API, loading the response to DataFrame, concatenating it. I ended up with a DataFrame that contained one string column and one nested column (list of dicts).

Usually, when you want to unnest such DataFrame, you successively use .explode("name_of_nested_column") and .apply(pd.Series). Exploding takes elements from the list and creates a row for each of them. For example, if you had

index	nested_column
1	[1, 2, 3]

and you would explode the nested column, you would end up with:

index	nested_column
1	1
1	2
1	3

Applying Series on a dictionary column flattens the dictionary, converting keys to column names and filling these new columns with dictionary values. For example, applying Series to dictionary_column in:

index	dictionary_column
1	{"A":3, "B":5}
2	{"A":11, "B":15, "C":19}

returns following table:

A	B	C
3	5	NaN
11	15	19

This is fine when you load the data e.g. using the earlier mentioned requests library. However, it won't work if you are loading the data from .csv file (which is likely as you should save the data after download). It won't work because the data are saved as a string in the table. To convert it to a dictionary, you need to .apply(eval), which says to the interpreter that anything inside the string is actually Python code which should be run and as there is just list of dictionaries inside the string, it is transferred from string to list of dictionaries.

Together, you can chain these methods for example like this:

df = pd.read_csv("../data/dataset.csv")    # Load the dataset
df["nested_column"] = df["nested_column"].apply(eval)   # Convert string to list of dicts
df = df.explode("nested_column")   # Expand the list in individual rows
df = pd.concat([df.drop(["nested_column"], axis=1),   # All cols except the nested one
                df["nested_column"].apply(pd.Series)], axis=1)   # All newly created cols