Creating a New Column in a Pandas DataFrame Based on Unique Values of an Existing Column
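In this article, we will explore how to create new columns in a pandas DataFrame based on the unique values of an existing column. This is commonly achieved through one-hot encoding, where each unique value in the original column becomes its own indicator column. Because each row in our example holds a list of values rather than a single value, a row can have a 1 in several of the new columns at once.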
Understanding One-Hot Encoding
One-hot encoding is a technique used in machine learning and data analysis to convert categorical variables into numerical variables. The idea behind this technique is to create a new column for each unique value in the original column, with a 1 indicating presence and 0 indicating absence.
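As a minimal sketch of the idea, assuming a hypothetical single-valued column named color (it is not part of the data used in this article), pd.get_dummies() creates one indicator column per unique value:

```python
import pandas as pd

# Hypothetical toy data: exactly one category per row.
colors = pd.DataFrame({'color': ['red', 'green', 'red', 'blue']})

# One indicator column per unique value; dtype=int yields 0/1 instead of booleans.
encoded = pd.get_dummies(colors['color'], dtype=int)
print(encoded)
```

Each row contains exactly one 1 here. The list-valued tokens column discussed next differs in that a row may contain several 1s, or none.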
Exploring the Data
Let’s start by examining the provided DataFrame, my_new_pd, which contains a column tokens with lists of multiple values:
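import ast
import pandas as pd
data = {'tokens': ['["Spain", "Germany", "England", "Japan"]',
                   '["Spain", "Germany"]',
                   '["Morocco"]',
                   '[]',
                   '["Japan"]',
                   '[]']}
my_new_pd = pd.DataFrame(data)
# Each entry is a string that looks like a list; parse it into a real list.
# The methods below refer to this parsed frame as df.
df = my_new_pd.assign(tokens=my_new_pd['tokens'].apply(ast.literal_eval))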
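As we can see, each row of the tokens column holds a list of values, and the same value (for example Spain) can appear in several rows. Note that each list is stored as a string, so it has to be parsed into a real Python list before it can be encoded. An empty list [] means the row has no tokens at all.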
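Before encoding, it can help to preview which columns will be created. The following is a small sketch, assuming the strings are valid Python list literals so that ast.literal_eval can parse them; the variable names raw, parsed, and unique_tokens are illustrative only:

```python
import ast
import pandas as pd

raw = pd.Series(['["Spain", "Germany", "England", "Japan"]',
                 '["Spain", "Germany"]',
                 '["Morocco"]',
                 '[]',
                 '["Japan"]',
                 '[]'], name='tokens')

# Turn each string such as '["Spain", "Germany"]' into a real Python list.
parsed = raw.apply(ast.literal_eval)

# explode() yields one row per token; empty lists become NaN, so drop them.
unique_tokens = sorted(parsed.explode().dropna().unique())
print(unique_tokens)
```

The five distinct tokens found this way are the labels that each method below turns into columns.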
Method 1: Using scikit-learn’s MultiLabelBinarizer
One common approach to achieve one-hot encoding is by using the MultiLabelBinarizer class from scikit-learn:
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
yourdf = pd.DataFrame(mlb.fit_transform(df['tokens']), columns=mlb.classes_, index=df.index)
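Here, we create an instance of MultiLabelBinarizer and apply it to the tokens column using the fit_transform() method. The resulting DataFrame contains one 0/1 column for each unique value found in the lists, named after the entries of mlb.classes_ (in sorted order, without a prefix). The column must hold real lists: if the entries are still strings, each character is treated as a separate label.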
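A self-contained sketch of this method is shown below. It assumes the tokens have already been parsed into real lists, and the final join step with a tokens_ prefix is an optional addition for illustration, not part of the original snippet:

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

df = pd.DataFrame({'tokens': [['Spain', 'Germany', 'England', 'Japan'],
                              ['Spain', 'Germany'],
                              ['Morocco'],
                              [],
                              ['Japan'],
                              []]})

mlb = MultiLabelBinarizer()
# fit_transform returns a 0/1 array with one column per label in mlb.classes_.
indicators = pd.DataFrame(mlb.fit_transform(df['tokens']),
                          columns=mlb.classes_, index=df.index)

# Optional: prefix the new columns and attach them to the original frame.
result = df.join(indicators.add_prefix('tokens_'))
print(result)
```

Keeping index=df.index is what makes the join line up row by row, which matters if the original frame does not use a default integer index.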
Method 2: Using Pandas’ explode() Function
Another approach is to use the explode() function to create a new row for each value in the original column:
df['tokens'].explode().str.get_dummies().groupby(level=0).sum().add_prefix('tokens_')
In this step, we first use explode() to split each list into separate rows that share the original row's index. Then str.get_dummies() creates an indicator column for each unique value in the exploded Series. Finally, groupby(level=0).sum() collapses the rows back to one per original index, and add_prefix() adds a prefix to the column names. (The older sum(level=0) shorthand was removed in pandas 2.0.)
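To make the chain easier to follow, here is the same pipeline split into named steps. This is a sketch that assumes the tokens column already holds real lists; the intermediate variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame({'tokens': [['Spain', 'Germany', 'England', 'Japan'],
                              ['Spain', 'Germany'],
                              ['Morocco'],
                              [],
                              ['Japan'],
                              []]})

# One row per token, repeating the original index; [] becomes a single NaN row.
exploded = df['tokens'].explode()

# One 0/1 column per distinct token; NaN rows are all zeros.
dummies = exploded.str.get_dummies()

# Collapse back to one row per original index label.
per_row = dummies.groupby(level=0).sum()

result = per_row.add_prefix('tokens_')
print(result)
```

Rows whose list was empty survive as all-zero rows because explode() keeps them as NaN rather than dropping them.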
Method 3: Using Pandas’ get_dummies() Function
We can also use the get_dummies() function to achieve one-hot encoding:
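pd.get_dummies(pd.DataFrame(df.tokens.tolist()), prefix='tokens', prefix_sep='_', dtype=int).T.groupby(level=0).sum().T
Here, we first spread each list across positional columns using tolist(), with shorter lists padded by missing values. Then get_dummies() creates an indicator column for every value at every position, all sharing the tokens_ prefix. Because the same value can occur at different positions, the result contains duplicate column names; transposing, grouping by name with groupby(level=0).sum(), and transposing back merges them into one column per value. (The older sum(level=0, axis=1) form was removed in pandas 2.0.)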
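The intermediate frames are worth seeing once, since the duplicate column names are easy to miss. The sketch below assumes parsed lists as input; wide and dummies are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({'tokens': [['Spain', 'Germany', 'England', 'Japan'],
                              ['Spain', 'Germany'],
                              ['Morocco'],
                              [],
                              ['Japan'],
                              []]})

# Positional columns 0..3; shorter lists are padded with missing values.
wide = pd.DataFrame(df['tokens'].tolist(), index=df.index)

# One indicator per (position, value) pair, all prefixed with 'tokens_'.
dummies = pd.get_dummies(wide, prefix='tokens', prefix_sep='_', dtype=int)

# 'tokens_Japan' appears twice (positions 0 and 3), so merge same-named columns.
result = dummies.T.groupby(level=0).sum().T
print(result)
```

This approach builds a column for every list position, so it is likely to be the least economical of the three when lists are long.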
Resulting DataFrames
Let’s examine the resulting DataFrames from each method. The outputs below cover all six rows of the example data and assume the tokens column holds real lists.
Method 1 (scikit-learn):
   England  Germany  Japan  Morocco  Spain
0        1        1      1        0      1
1        0        1      0        0      1
2        0        0      0        1      0
3        0        0      0        0      0
4        0        0      1        0      0
5        0        0      0        0      0
Method 2 (explode() and get_dummies()):
   tokens_England  tokens_Germany  tokens_Japan  tokens_Morocco  tokens_Spain
0               1               1             1               0             1
1               0               1             0               0             1
2               0               0             0               1             0
3               0               0             0               0             0
4               0               0             1               0             0
5               0               0             0               0             0
Method 3 (get_dummies()):
   tokens_England  tokens_Germany  tokens_Japan  tokens_Morocco  tokens_Spain
0               1               1             1               0             1
1               0               1             0               0             1
2               0               0             0               1             0
3               0               0             0               0             0
4               0               0             1               0             0
5               0               0             0               0             0
As we can see, all three methods produce the same indicator values. They differ only in column naming: Method 1 uses the raw labels, while Methods 2 and 3 add the tokens_ prefix. Rows 3 and 5, whose lists are empty, contain only zeros.
Conclusion
In this article, we explored how to create new columns in a pandas DataFrame based on the unique values of an existing column using one-hot encoding techniques. We examined three approaches: scikit-learn’s MultiLabelBinarizer, pandas’ explode() combined with str.get_dummies(), and pandas’ get_dummies() applied to a widened DataFrame. The three methods differ in how they get there and in how they name the columns, but they yield the same indicator values and all achieve the goal of converting a list-valued categorical column into numerical columns.
Last modified on 2025-04-23