Getting Top N Products per Customer with GroupBy and Value Counts in Pandas

Understanding GroupBy and Value Counts in Pandas

When working with data, it’s common to have grouping or aggregation tasks that require processing large datasets. The groupby function in pandas is a powerful tool for this purpose. However, when we’re dealing with multiple groups and want to extract specific information from each group, things can get more complex.

In this article, we’ll explore how to use the value_counts method in combination with the groupby function to achieve our desired result: getting the top 5 products for each customer in a dataframe.

Background

Before diving into the solution, let’s take a closer look at the groupby and value_counts functions:

The groupby function groups a dataframe by one or more columns. This allows us to perform aggregation operations on subsets of the data.
The value_counts method returns the count of each unique value in a series, most frequent first. When applied to a column of a grouped dataframe, it returns a Series indexed by group and value, with the counts sorted in descending order within each group.
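As a quick illustration of the two behaviours described above, here is a minimal sketch on toy data. The Series `s`, the dataframe `toy`, and their column names are made up for this example and are not part of the article's dataset.

```python
import pandas as pd

# Hypothetical toy data, used only to illustrate the two calls
s = pd.Series(['a', 'b', 'a', 'a', 'b', 'c'])

# Count of each unique value, most frequent first
counts = s.value_counts()

# The same count, computed separately inside each group of 'key'
toy = pd.DataFrame({
    'key': ['x', 'x', 'y', 'y', 'y'],
    'val': ['a', 'a', 'a', 'b', 'b'],
})
grouped_counts = toy.groupby('key')['val'].value_counts()

print(counts)
print(grouped_counts)
```

The grouped result carries a two-level index (group key, then value), which is what makes it possible to pick the most frequent values per group later on.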

Creating the DataFrame

To illustrate our solution, let’s first create a sample dataframe with customers and their purchased products:

import pandas as pd

# Create a sample dataframe
data = {
    'customer': ['John', 'John', 'John', 'John', 'John', 'Mary', 'Mary', 'Mary', 'Mary', 'Mary'],
    'product': ['Milk', 'Milk', 'Shoes', 'Shoes', 'Shoes', 'Milk', 'Milk', 'Milk', 'Milk', 'Milk']
}
df = pd.DataFrame(data)
print(df)

Output:

  customer product
0     John    Milk
1     John    Milk
2     John   Shoes
3     John   Shoes
4     John   Shoes
5     Mary    Milk
6     Mary    Milk
7     Mary    Milk
8     Mary    Milk
9     Mary    Milk

Grouping and Value Counts

Now that we have our dataframe, let’s group it by the customer column and calculate the value counts for each product:

# Group by customer and calculate value counts
value_counts = df.groupby('customer')['product'].value_counts()
print(value_counts)

Output:

customer  product
John      Shoes      3
          Milk       2
Mary      Milk       5
Name: product, dtype: int64

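As we can see, the result is a Series indexed by customer and product, with the counts sorted in descending order within each customer. Depending on the pandas version, the Series may be labelled count rather than product (this appears to be the case from pandas 2.0 onward).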
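Because the counts are already ordered within each customer, one way to keep only the most frequent products in this long format is to take the first rows of each group. The sketch below rebuilds the sample dataframe and uses `n = 1` purely for illustration; the variable names `counts`, `n`, and `top_n` are assumptions of this example.

```python
import pandas as pd

# Same sample data as above
df = pd.DataFrame({
    'customer': ['John'] * 5 + ['Mary'] * 5,
    'product': ['Milk', 'Milk', 'Shoes', 'Shoes', 'Shoes'] + ['Milk'] * 5,
})

counts = df.groupby('customer')['product'].value_counts()

# counts is sorted by descending count within each customer,
# so head(n) per customer keeps the n most frequent products
n = 1  # illustrative; use 5 for a "top 5"
top_n = counts.groupby(level='customer').head(n)

print(top_n)
```

This keeps the counts alongside the product names, which can be useful when the numbers matter as much as the ranking.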

Creating the Final DataFrame

To achieve our desired result, we need one row per customer with that customer’s top 5 products joined into a single string. We can do this with a lambda function that calls value_counts on each customer’s products and joins the first five index labels, and then pass that function to groupby and agg:

# For one customer's products: count them, then join the 5 most frequent names
f = lambda x: ' '.join(x.value_counts().index[:5])

# Group by customer, apply the lambda function, and name the resulting column
newdf = df.groupby('customer')['product'].agg(f).reset_index(name='Top 5')

print(newdf)

Output:

  customer       Top 5
0     John  Shoes Milk
1     Mary        Milk

Inside the lambda function, value_counts orders each customer’s products by descending count, so slicing the first five index labels gives the top 5 products (or fewer, if the customer bought fewer than five distinct products). Joining those labels with spaces produces the Top 5 column in our final dataframe.
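If a list per customer is more convenient than a space-joined string, the same idea can be wrapped in a small helper. This is only a sketch: the function name `top_products` and its `n` parameter are hypothetical, and it assumes the columns are called customer and product as in the sample data.

```python
import pandas as pd

def top_products(frame, n=5):
    """Return one row per customer with a list of their n most frequent products."""
    return (
        frame.groupby('customer')['product']
        # value_counts sorts by descending count, so the first n labels are the top n
        .agg(lambda s: s.value_counts().index[:n].tolist())
        .reset_index(name=f'Top {n}')
    )

# Same sample data as above
df = pd.DataFrame({
    'customer': ['John'] * 5 + ['Mary'] * 5,
    'product': ['Milk', 'Milk', 'Shoes', 'Shoes', 'Shoes'] + ['Milk'] * 5,
})

result = top_products(df)
print(result)
```

Keeping the products as a list avoids ambiguity when product names themselves contain spaces.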

Handling Sorting

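By default, groupby sorts the group keys, so customers appear in alphabetical order. If we would rather keep customers in the order in which they first appear in the dataframe, we can pass the sort=False parameter: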

# Group by customer, sort=False, apply the lambda function, and create a new column
newdf = df.groupby('customer', sort=False)['product'].agg(f).reset_index(name='Top 5')

print(newdf)

Output:

  customer       Top 5
0     John  Shoes Milk
1     Mary        Milk

With sort=False, customers keep their order of first appearance instead of being sorted alphabetically. In our sample data John already appears before Mary, so the output is the same as before.
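The effect of the sort parameter is easier to see on data where the first customer is not first alphabetically. The sketch below uses a hypothetical dataframe `orders` in which Mary appears before John; the helper name `top5` is also an assumption of this example.

```python
import pandas as pd

# Hypothetical data: Mary's rows come before John's
orders = pd.DataFrame({
    'customer': ['Mary', 'Mary', 'John', 'John', 'John'],
    'product': ['Milk', 'Milk', 'Shoes', 'Shoes', 'Milk'],
})

def top5(s):
    # Join the (up to) five most frequent products for one customer
    return ' '.join(s.value_counts().index[:5])

# Default sort=True: group keys are sorted, so John comes first
sorted_keys = (
    orders.groupby('customer')['product'].agg(top5).reset_index(name='Top 5')
)

# sort=False: groups keep their order of first appearance, so Mary comes first
first_seen = (
    orders.groupby('customer', sort=False)['product'].agg(top5).reset_index(name='Top 5')
)

print(sorted_keys)
print(first_seen)
```

Only the row order differs between the two results; the Top 5 value computed for each customer is the same.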

Conclusion

In this article, we explored how to use the groupby and value_counts functions together to extract specific information from each group. By calling value_counts inside a lambda function passed to agg, we were able to get the top 5 products for each customer and collect them in a new column of our final dataframe. The same pattern applies to other "top N per group" tasks in pandas.

Last modified on 2023-12-08