Stop Memorizing groupby by Rote: Mastering Pandas Grouped Aggregation with a Real Dataset (drinks.csv)

Unlocking the logic behind Pandas grouped aggregation with alcohol consumption data

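The first time you saw code like df.groupby('continent').agg({'wine_servings': 'max'}), did it leave you puzzled? Grouped aggregation looks simple, yet many learners never get past copying code mechanically. In this post we take a global alcohol consumption dataset, start from business questions, and work backwards through the reasoning behind groupby.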

1. From business questions to data exploration

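The drinks.csv dataset holds alcohol consumption figures for countries around the world. Its fields cover the country, beer, spirit, and wine servings, total litres of pure alcohol, and the continent. Suppose we need to analyze:

The range of wine consumption on each continent (maximum minus minimum), which reflects how much drinking habits differ
Total beer consumption on each continent, which reflects market size

A first look at the data: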

    import pandas as pd

    drinks = pd.read_csv('drinks.csv')
    print(drinks.head(3))

Example output:

country	beer_servings	wine_servings	spirit_servings	total_litres_of_pure_alcohol	continent
Albania	89	54	132	4.9	Europe
Andorra	245	312	138	12.4	Europe
UAE	5	5	42	0.7	Asia

Tip: drinks.continent.value_counts() gives a quick view of how many rows each continent has.
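To make the tip concrete, here is a minimal sketch on a made-up five-row frame (the `toy` data is hypothetical, not taken from drinks.csv). It also passes `dropna=False`, which is worth knowing about because rows with a missing continent are otherwise left out of the count.

```python
import pandas as pd

# Toy stand-in for drinks.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Europe", "Europe", "Asia", None, "Africa"],
})

# dropna=False keeps rows with a missing continent visible in the count.
counts = toy["continent"].value_counts(dropna=False)
print(counts)
```

Checking the distribution first tells you how many groups to expect and whether any rows would silently disappear from a later groupby.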

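2. Simulating the grouping process by hand

The most effective way to understand groupby is to set Pandas' grouping machinery aside first and implement the logic with plain Python loops and dictionaries: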

    # Manual grouping: collect each continent's values in a dict
    manual_groups = {}
    for _, row in drinks.iterrows():
        continent = row['continent']
        if continent not in manual_groups:
            manual_groups[continent] = {'wine': [], 'beer': []}
        manual_groups[continent]['wine'].append(row['wine_servings'])
        manual_groups[continent]['beer'].append(row['beer_servings'])

    # Compute the aggregate metrics
    result = {}
    for continent, data in manual_groups.items():
        wine_values = data['wine']
        result[continent] = {
            'wine_range': max(wine_values) - min(wine_values),
            'beer_total': sum(data['beer'])
        }

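This process makes the three stages explicit:

Split: build a dictionary of groups keyed by the continent field
Apply: compute the range and the sum for each group
Combine: merge the results into a new dictionary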
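The same three stages can be observed on a Pandas groupby object by iterating over it. The sketch below uses a made-up `toy` frame (hypothetical values) and is meant only to show where split, apply, and combine happen, not how Pandas implements them internally.

```python
import pandas as pd

# Toy stand-in for drinks.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "beer_servings": [100, 200, 10, 30, 50],
    "wine_servings": [300, 50, 5, 15, 20],
})

pieces = {}
# Split: iterating a groupby yields (key, sub-DataFrame) pairs.
for continent, group in toy.groupby("continent"):
    # Apply: compute the metrics on this continent's rows only.
    pieces[continent] = {
        "wine_range": group["wine_servings"].max() - group["wine_servings"].min(),
        "beer_total": group["beer_servings"].sum(),
    }

# Combine: assemble the per-group results into one table.
summary = pd.DataFrame.from_dict(pieces, orient="index")
print(summary)
```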

3. Three ways to implement grouped aggregation in Pandas

3.1 Basic: step by step

    # Step 1: create the groupby object
    grouped = drinks.groupby('continent')

    # Step 2: compute each metric separately
    wine_range = grouped['wine_servings'].apply(lambda x: x.max() - x.min())
    beer_total = grouped['beer_servings'].sum()

    # Step 3: combine the results
    result = pd.concat([wine_range, beer_total], axis=1)
    result.columns = ['wine_range', 'beer_total']

3.2 Intermediate: agg

    result = drinks.groupby('continent').agg({
        'wine_servings': lambda x: x.max() - x.min(),
        'beer_servings': 'sum'
    })

3.3 Professional: named aggregation

Pandas 0.25 and later support a clearer syntax:

    result = drinks.groupby('continent').agg(
        wine_range=('wine_servings', lambda x: x.max() - x.min()),
        beer_total=('beer_servings', 'sum')
    )

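Comparison of the three approaches:

Approach	Advantages	Disadvantages	Best suited for
Step by step	Clear logic	Verbose code	When intermediate results need checking
agg with a dict	Concise and efficient	Custom functions hurt readability	Simple aggregations
Named aggregation	Most readable	Requires Pandas 0.25+	Production code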
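A quick way to convince yourself that the three approaches are interchangeable is to run them side by side and compare the outputs. The sketch below does this on a made-up `toy` frame (hypothetical values); the dict-based result needs its columns renamed because it keeps the source column names.

```python
import pandas as pd

# Toy stand-in for drinks.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "beer_servings": [100, 200, 10, 30, 50],
    "wine_servings": [300, 50, 5, 15, 20],
})
grouped = toy.groupby("continent")

# Approach 1: step by step
step = pd.concat(
    [
        grouped["wine_servings"].apply(lambda s: s.max() - s.min()),
        grouped["beer_servings"].sum(),
    ],
    axis=1,
)
step.columns = ["wine_range", "beer_total"]

# Approach 2: agg with a dict (output columns keep the source names)
dict_agg = grouped.agg({
    "wine_servings": lambda s: s.max() - s.min(),
    "beer_servings": "sum",
})
dict_agg.columns = ["wine_range", "beer_total"]

# Approach 3: named aggregation (output names are chosen up front)
named = grouped.agg(
    wine_range=("wine_servings", lambda s: s.max() - s.min()),
    beer_total=("beer_servings", "sum"),
)

# Raises AssertionError if any pair of results differs in values or labels.
pd.testing.assert_frame_equal(step, dict_agg, check_dtype=False)
pd.testing.assert_frame_equal(dict_agg, named, check_dtype=False)
print(named)
```

Only named aggregation produces the final column names without a separate renaming step, which is a large part of its readability advantage.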

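4. Going deeper with grouped aggregation

4.1 Multi-level grouping

Suppose we want to compare countries at different alcohol consumption levels within each continent: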

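    # First bucket countries by consumption level.
    # include_lowest=True keeps countries with exactly 0 litres in 'low';
    # without it they fall outside (0, 3] and become NaN.
    drinks['alcohol_level'] = pd.cut(
        drinks['total_litres_of_pure_alcohol'],
        bins=[0, 3, 6, 20],
        labels=['low', 'medium', 'high'],
        include_lowest=True
    )

    # Multi-level grouping; observed=True skips empty continent/level pairs
    multi_result = drinks.groupby(['continent', 'alcohol_level'], observed=True).agg({
        'wine_servings': 'mean',
        'beer_servings': 'median'
    })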
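Two details of this step are easy to miss: how pd.cut treats a value sitting exactly on the lowest bin edge, and how to read a two-level result. The sketch below illustrates both on a made-up four-row `toy` frame (hypothetical values), using `unstack` to pivot the second grouping level into columns.

```python
import pandas as pd

# Toy data; all values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Europe", "Europe", "Asia", "Asia"],
    "total_litres_of_pure_alcohol": [0.0, 12.0, 1.0, 5.0],
    "wine_servings": [0, 300, 10, 50],
})

bins = [0, 3, 6, 20]
labels = ["low", "medium", "high"]
litres = toy["total_litres_of_pure_alcohol"]

# Default bins are open on the left, so 0.0 is not in (0, 3].
default_cut = pd.cut(litres, bins=bins, labels=labels)
# include_lowest=True closes the first bin on the left: [0, 3].
inclusive_cut = pd.cut(litres, bins=bins, labels=labels, include_lowest=True)

print(default_cut.isna().sum(), inclusive_cut.isna().sum())

toy["alcohol_level"] = inclusive_cut
# Mean wine servings per continent and level, with levels as columns.
wide = (
    toy.groupby(["continent", "alcohol_level"], observed=True)["wine_servings"]
    .mean()
    .unstack("alcohol_level")
)
print(wide)
```

Cells for combinations that do not occur in the data (here, a medium-level European country) show up as NaN in the wide table.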

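4.2 Filtering after grouping

Find the continents whose wine consumption range exceeds 200 (filter returns the original rows that belong to those continents):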

    def filter_func(group):
        return group['wine_servings'].max() - group['wine_servings'].min() > 200

    filtered = drinks.groupby('continent').filter(filter_func)
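It can help to see what filter returns and how it relates to a boolean mask built with transform. The sketch below uses a made-up `toy` frame (hypothetical values); the transform variant is an alternative formulation that I am adding for comparison, not something the original code uses.

```python
import pandas as pd

# Toy stand-in for drinks.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "wine_servings": [300, 50, 5, 15, 20],
})

def wine_range_exceeds_200(group):
    # One boolean per group: keep the group only if its range is above 200.
    return group["wine_servings"].max() - group["wine_servings"].min() > 200

# filter keeps every original row of the groups that pass the test.
filtered = toy.groupby("continent").filter(wine_range_exceeds_200)

# Alternative: broadcast each group's max and min back to its rows,
# then select rows with an ordinary boolean mask.
wine_by_continent = toy.groupby("continent")["wine_servings"]
mask = (wine_by_continent.transform("max") - wine_by_continent.transform("min")) > 200
masked = toy[mask]

print(filtered)
```

Both routes return rows rather than one line per continent, so filter is a row selector and not an aggregation.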

4.3 Grouped time series analysis

If the data has a time dimension, combining resample with groupby is very powerful:

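    # Assumes a 'year' column of datetime64 dtype (drinks.csv itself has none);
    # plain integer years must first be converted with pd.to_datetime.
    # On pandas 2.2+ the alias is spelled '5YE'; '5Y' is deprecated there.
    drinks.groupby('continent').resample('5Y', on='year')['beer_servings'].mean()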
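Since drinks.csv has no time column, here is a small self-contained sketch of the idea on invented yearly data. The `ts` frame, its values, and the two-year window are all assumptions for illustration; the frequency alias `2YS` (year start) is used because it is spelled the same way across recent pandas versions.

```python
import pandas as pd

# Invented yearly panel; 'year' must be datetime-like for resample to work.
ts = pd.DataFrame({
    "continent": ["Europe"] * 4 + ["Asia"] * 4,
    "year": pd.to_datetime(["2000", "2001", "2002", "2003"] * 2, format="%Y"),
    "beer_servings": [100, 110, 120, 130, 10, 20, 30, 40],
})

# Within each continent, average beer servings over two-year windows.
two_year = (
    ts.groupby("continent")
    .resample("2YS", on="year")["beer_servings"]
    .mean()
)
print(two_year)
```

The result is indexed by continent and window start, so each continent gets its own independent series of time buckets.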

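5. Performance tuning and common pitfalls

5.1 Performance comparison

Performance of different grouping methods (test dataset of 100,000 rows; exact figures vary with hardware and pandas version):

Method	Execution time (ms)	Memory usage
groupby	120	1.2 MB
pd.crosstab	180	1.8 MB
pivot_table	200	2.1 MB

Note: avoid using apply to process the whole DataFrame for each group, because it slows things down considerably. Prefer built-in aggregations such as sum or mean where they exist.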
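To check the apply warning on your own machine, a rough timing sketch along these lines may help. Everything here is an assumption for illustration: the synthetic `big` frame, the hypothetical `group_id` key with 1,000 distinct values, and the repeat count. I deliberately print the timings without quoting expected numbers, since they depend on hardware, pandas version, and the number of groups.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000

# Synthetic data: 100,000 rows spread over 1,000 hypothetical groups.
big = pd.DataFrame({
    "group_id": rng.integers(0, 1000, size=n),
    "beer_servings": rng.integers(0, 400, size=n),
})

def with_apply():
    # Calls a Python function once per group on a sub-DataFrame.
    # Grouping by an external key Series keeps the key column out of the
    # frame handed to apply, which sidesteps version differences there.
    return big[["beer_servings"]].groupby(big["group_id"]).apply(
        lambda g: g["beer_servings"].sum()
    )

def with_builtin():
    # Uses the built-in, vectorized sum.
    return big.groupby("group_id")["beer_servings"].sum()

apply_seconds = timeit.timeit(with_apply, number=3)
builtin_seconds = timeit.timeit(with_builtin, number=3)
print(f"apply:    {apply_seconds:.3f} s for 3 runs")
print(f"built-in: {builtin_seconds:.3f} s for 3 runs")
```

Both functions return the same totals, so the comparison isolates the cost of running Python code per group.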

5.2 Handling common errors

Problem 1: the grouping key contains NaN values (groupby drops those rows by default)

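    # Option 1: fill missing values (assign back; calling fillna with inplace=True
    # on a selected column is unreliable under copy-on-write)
    drinks['continent'] = drinks['continent'].fillna('UNKNOWN')

    # Option 2: drop rows whose continent is missing
    drinks = drinks.dropna(subset=['continent'])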
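The effect of a missing key is easiest to see by comparing totals. The sketch below uses a made-up `toy` frame (hypothetical values) and also shows a third option, the `dropna=False` argument of groupby (available from pandas 1.1), which keeps the missing key as its own group.

```python
import numpy as np
import pandas as pd

# Toy data; all values are invented for illustration.
toy = pd.DataFrame({
    "continent": ["Europe", np.nan, "Asia", np.nan],
    "beer_servings": [100, 20, 30, 40],
})

# Default behaviour: rows with a NaN key vanish from the result.
default_totals = toy.groupby("continent")["beer_servings"].sum()

# dropna=False keeps NaN as a group of its own.
kept_totals = toy.groupby("continent", dropna=False)["beer_servings"].sum()

# Filling first gives the missing rows an explicit label.
filled = toy.assign(continent=toy["continent"].fillna("UNKNOWN"))
filled_totals = filled.groupby("continent")["beer_servings"].sum()

print(default_totals.sum(), kept_totals.sum(), filled_totals["UNKNOWN"])
```

If your copy of the file encodes continents as two-letter codes, note that read_csv parses the literal string NA as missing by default; passing keep_default_na=False is one way to prevent that.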

Problem 2: non-numeric columns interfere with the aggregation

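    # Aggregating every column also touches the non-numeric 'country' column:
    # pandas 2.0+ raises a TypeError here, while older versions silently dropped it
    # result = drinks.groupby('continent').mean()

    # Better: select the numeric columns explicitly
    numeric_cols = ['beer_servings', 'wine_servings']
    result = drinks.groupby('continent')[numeric_cols].mean()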
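When the list of numeric columns is long, typing it by hand gets tedious. Two alternatives are sketched below on a made-up `toy` frame (hypothetical values): the `numeric_only=True` flag of the aggregation, and `select_dtypes` to build the column list programmatically.

```python
import pandas as pd

# Toy stand-in for drinks.csv; all values are invented for illustration.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "continent": ["Europe", "Europe", "Asia", "Asia", "Africa"],
    "beer_servings": [100, 200, 10, 30, 50],
    "wine_servings": [300, 50, 5, 15, 20],
})

# Option A: let the aggregation skip non-numeric columns.
by_flag = toy.groupby("continent").mean(numeric_only=True)

# Option B: derive the numeric column names, then select them explicitly.
numeric_cols = toy.select_dtypes("number").columns.tolist()
by_selection = toy.groupby("continent")[numeric_cols].mean()

print(by_flag)
```

Explicit selection documents exactly which columns go into the result, while the flag is shorter when you simply want every numeric column.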

6. From grouped aggregation to insight

Returning to the original business questions, the aggregation gives us:

continent	wine_range	beer_total
Africa	233	3258
Asia	123	1630
Europe	370	8720

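These numbers tell us:

Wine consumption varies most across European countries (range 370), possibly because of the diversity of wine cultures
Europe leads by far in total beer consumption, at about 2.7 times Africa's total
Among the three continents shown, Asian countries differ least in wine consumption, which may reflect relatively uniform drinking habits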
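The comparisons above can be reproduced directly from the aggregated table. The sketch below rebuilds the three rows shown in this section by hand (so it runs without drinks.csv) and derives the ratio and the extremes from them.

```python
import pandas as pd

# The three rows of the result table shown in this section, typed in by hand.
result = pd.DataFrame(
    {"wine_range": [233, 123, 370], "beer_total": [3258, 1630, 8720]},
    index=pd.Index(["Africa", "Asia", "Europe"], name="continent"),
)

# Europe's beer total relative to Africa's.
ratio = result.loc["Europe", "beer_total"] / result.loc["Africa", "beer_total"]
print(round(ratio, 1))  # 2.7

# Continents with the widest and narrowest wine range among these rows.
widest = result["wine_range"].idxmax()
narrowest = result["wine_range"].idxmin()
print(widest, narrowest)  # Europe Asia
```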

    # Visualize the results
    # ('result' is the wine_range / beer_total table built in section 3)
    import matplotlib.pyplot as plt

    result.plot(kind='bar', subplots=True, figsize=(10, 6))
    plt.tight_layout()
    plt.show()

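In a real project, this kind of analysis can help an alcoholic beverage company:

Develop a diversified wine product line for the European market
Increase marketing spend in regions with high beer consumption
Adopt a more uniform product strategy in Asian markets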
