Pandas Multiindex Groupby Aggregation - Multiple Layers

Introduction

The Pandas library provides efficient and flexible data structures for handling tabular data. The DataFrame is a two-dimensional table with columns of potentially different types. One of its most powerful features is support for the MultiIndex, which allows multiple levels of indexing on rows or columns.

In this article, we will explore how to perform groupby aggregation on MultiIndex DataFrames using Pandas. We’ll also discuss how to keep the grouping keys in the index during aggregation and how to shape the aggregated result for plotting with Plotly.

Groupby Aggregation

The groupby() method groups a DataFrame by one or more columns or index levels. It returns a DataFrameGroupBy object, which records how the rows are split into groups; no values are computed until an aggregation is applied.

Let’s consider an example:

import pandas as pd

# Create a sample DataFrame
data = {
    'date': ['2020-05-01', '2020-05-02', '2020-05-03'],
    'loc': ['ABC', 'ABC', 'DEF'],
    'product': ['AFM', 'PRE', 'BZF'],
    'out_tonnes': [8000, 12000, 25000]
}
df = pd.DataFrame(data)

print("Original DataFrame:")
print(df)

Output:

         date  loc product  out_tonnes
0  2020-05-01  ABC     AFM        8000
1  2020-05-02  ABC     PRE       12000
2  2020-05-03  DEF     BZF       25000

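When we group the DataFrame by ‘product’, it creates a new DataFrameGroupBy object. Printing that object directly only shows its type and memory address, so we count the rows in each group instead.

groupby_result = df.groupby('product')
print("\nGroupby Result:")
print(groupby_result['out_tonnes'].count())

Output:

product
AFM    1
BZF    1
PRE    1
Name: out_tonnes, dtype: int64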
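
To look inside a grouped object without aggregating it, a few inspection helpers are useful. The following is a minimal sketch that rebuilds the sample data; the variable names (grouped, n_groups, group_sizes, afm_rows) are chosen here for illustration only.

```python
import pandas as pd

# Same sample data as in the example above
df = pd.DataFrame({
    'date': ['2020-05-01', '2020-05-02', '2020-05-03'],
    'loc': ['ABC', 'ABC', 'DEF'],
    'product': ['AFM', 'PRE', 'BZF'],
    'out_tonnes': [8000, 12000, 25000],
})

grouped = df.groupby('product')

# Number of distinct groups
n_groups = grouped.ngroups

# Number of rows in each group, as a Series indexed by product
group_sizes = grouped.size()

# All rows that belong to a single group
afm_rows = grouped.get_group('AFM')

print(n_groups)
print(group_sizes)
print(afm_rows)
```

Unlike count(), which skips missing values per column, size() simply counts the rows in each group.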

We can then aggregate the grouped values using methods like sum(), mean(), max(), etc. Selecting ‘out_tonnes’ first restricts the aggregation to the numeric column.

result = groupby_result['out_tonnes'].sum()
print("\nSum of Out Tonnes:")
print(result)

Output:

product
AFM     8000
BZF    25000
PRE    12000
Name: out_tonnes, dtype: int64
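
Several of these aggregations can also be computed in one pass with agg(). The sketch below assumes a slightly extended data set (the fourth row is hypothetical, added so that one product has more than one record), and the output column names total, average and largest are illustrative.

```python
import pandas as pd

# Sample data; the last row is a hypothetical extra record for 'AFM'
df = pd.DataFrame({
    'loc': ['ABC', 'ABC', 'DEF', 'DEF'],
    'product': ['AFM', 'PRE', 'BZF', 'AFM'],
    'out_tonnes': [8000, 12000, 25000, 4000],
})

# Named aggregation: each keyword becomes a column in the result
summary = df.groupby('product')['out_tonnes'].agg(
    total='sum',
    average='mean',
    largest='max',
)

print(summary)
```

The result is a DataFrame indexed by product with one column per requested statistic.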

Now, let’s consider the case where we have multiple levels in our index. Here we move ‘loc’ and ‘product’ into a two-level MultiIndex with set_index().

data = {
    'date': ['2020-05-01', '2020-05-02', '2020-05-03'],
    'loc': ['ABC', 'ABC', 'DEF'],
    'product': ['AFM', 'PRE', 'BZF'],
    'out_tonnes': [8000, 12000, 25000]
}
# Use 'loc' and 'product' as the two index levels
df = pd.DataFrame(data).set_index(['loc', 'product'])

print("\nOriginal DataFrame:")
print(df)

Output:

                   date  out_tonnes
loc product
ABC AFM      2020-05-01        8000
    PRE      2020-05-02       12000
DEF BZF      2020-05-03       25000

When we group this DataFrame by the ‘product’ level, it again creates a DataFrameGroupBy object.

groupby_result = df.groupby(level='product')
print("\nGroupby Result:")
print(groupby_result['out_tonnes'].count())

Output:

product
AFM    1
BZF    1
PRE    1
Name: out_tonnes, dtype: int64

The same aggregation methods (sum(), mean(), max(), etc.) work when grouping by an index level.

result = groupby_result['out_tonnes'].sum()
print("\nSum of Out Tonnes:")
print(result)

Output:

product
AFM     8000
BZF    25000
PRE    12000
Name: out_tonnes, dtype: int64
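
Grouping is not limited to a single index level. As a rough sketch, assuming a DataFrame indexed by the two levels ‘loc’ and ‘product’ (the rows below are illustrative and include a repeated loc/product pair so the aggregation has something to combine), the level argument accepts either one level name or a list of them:

```python
import pandas as pd

# Illustrative data with a two-level MultiIndex (loc, product)
df = pd.DataFrame({
    'date': ['2020-05-01', '2020-05-02', '2020-05-02', '2020-05-03'],
    'loc': ['ABC', 'ABC', 'ABC', 'DEF'],
    'product': ['AFM', 'AFM', 'PRE', 'BZF'],
    'out_tonnes': [8000, 2000, 12000, 25000],
}).set_index(['loc', 'product'])

# Aggregate over the outer level only: one total per location
by_loc = df.groupby(level='loc')['out_tonnes'].sum()

# Aggregate over both levels: the result keeps a two-level MultiIndex
by_loc_product = df.groupby(level=['loc', 'product'])['out_tonnes'].sum()

print(by_loc)
print(by_loc_product)
```

Grouping by every level collapses duplicate index entries while leaving the index structure unchanged.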

But what if we want to group by more than one key and keep those keys as index levels in the result?

To achieve this, we can reset the index before grouping, so that every key is an ordinary column, and then group by a list of keys. The aggregated result is indexed by a MultiIndex built from those keys.

df_reset_index = df.reset_index()
groupby_result = df_reset_index.groupby(['product', 'date'])[['out_tonnes']].sum()
print("\nGroupby Result with Reset Index:")
print(groupby_result)

Output:

                    out_tonnes
product date
AFM     2020-05-01        8000
BZF     2020-05-03       25000
PRE     2020-05-02       12000
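
Resetting the index is not the only option. A minimal sketch, assuming ‘product’ is an index level and ‘date’ is a regular column: groupby() can resolve both by name, and pd.Grouper makes the index level explicit when a name could be ambiguous. The variable names mixed and explicit are illustrative.

```python
import pandas as pd

# 'loc' and 'product' are index levels; 'date' stays a column
df = pd.DataFrame({
    'date': ['2020-05-01', '2020-05-02', '2020-05-03'],
    'loc': ['ABC', 'ABC', 'DEF'],
    'product': ['AFM', 'PRE', 'BZF'],
    'out_tonnes': [8000, 12000, 25000],
}).set_index(['loc', 'product'])

# Index level names and column names can be mixed in one key list
mixed = df.groupby(['product', 'date'])['out_tonnes'].sum()

# The same grouping with the index level spelled out via pd.Grouper
explicit = df.groupby([pd.Grouper(level='product'), 'date'])['out_tonnes'].sum()

print(mixed)
```

Both forms return a Series indexed by a (product, date) MultiIndex.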

Now, let’s consider a real-world scenario where we want to plot the data using Plotly.

import plotly.graph_objects as go

# Create a figure with one trace per product
fig = go.Figure(data=[
    go.Scatter(x=grp.index.get_level_values('date'), y=grp['out_tonnes'])
    for _, grp in groupby_result.groupby(level='product')
])

# Customize the figure
fig.update_layout(title_text='Out Tonnes by Product and Date',
                  xaxis_title='Date',
                  yaxis_title='Out Tonnes')

# Show the plot
fig.show()

Output:

A scatter plot showing the out tonnes for each product and date.

But the layers (traces, in Plotly’s terms) in this plot are unnamed, so the legend cannot tell us which product is which. One workaround is to create a separate figure for each layer.

import plotly.graph_objects as go

# Create one figure per product
for product, grp in groupby_result.groupby(level='product'):
    fig = go.Figure(data=[
        go.Scatter(x=grp.index.get_level_values('date'), y=grp['out_tonnes'])
    ])
    fig.update_layout(title_text=f'Out Tonnes by {product}')
    fig.show()

Output:

Three separate scatter plots, each showing the out tonnes for a different product.

But separate figures make the products hard to compare. To fix this, we can create a single figure with all three layers and use the name parameter to label each trace in the legend.

import plotly.graph_objects as go

# Create a figure with one named trace per product
fig = go.Figure(data=[
    go.Scatter(x=grp.index.get_level_values('date'), y=grp['out_tonnes'], name=product)
    for product, grp in groupby_result.groupby(level='product')
])
fig.update_layout(title_text='Out Tonnes by Product and Date',
                  xaxis_title='Date',
                  yaxis_title='Out Tonnes')

# Customize the legend
fig.update_layout(
    showlegend=True,
    legend=dict(font=dict(size=10))
)

# Show the plot
fig.show()

Output:

A scatter plot showing the out tonnes for each product and date, with a legend to distinguish between the three layers.
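
When there are many products, it can help to reshape the aggregated result so that each product becomes a column and each column becomes one trace. The following is a sketch under assumptions: the data is illustrative (two products over two dates), and the traces are built as plain dictionaries so that the block runs without Plotly installed.

```python
import pandas as pd

# Illustrative data: two products observed on two dates
df = pd.DataFrame({
    'date': ['2020-05-01', '2020-05-01', '2020-05-02', '2020-05-02'],
    'product': ['AFM', 'PRE', 'AFM', 'PRE'],
    'out_tonnes': [8000, 12000, 9000, 11000],
})

# Aggregate to a (product, date) MultiIndex, then pivot products into columns
totals = df.groupby(['product', 'date'])['out_tonnes'].sum()
wide = totals.unstack(level='product')

# One trace specification per product column
traces = [
    {'type': 'scatter', 'x': list(wide.index), 'y': wide[col].tolist(), 'name': col}
    for col in wide.columns
]

print(wide)
# With Plotly available, the list can be passed on as go.Figure(data=traces)
```

Each dictionary carries the same x, y and name fields that the go.Scatter calls above set explicitly.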

This is just a basic example of how you can handle multiple layers in your Plotly figures. The actual implementation may vary depending on the specifics of your use case.

Last modified on 2024-01-10