Grouping Data
Grouping data is a powerful way to analyze subsets of data by splitting it into groups based on one or more columns. Pandas provides the groupby() method to perform grouping operations, such as aggregations and transformations. This tutorial explores how to use groupby() effectively.
Basic Grouping
To group data by a specific column, use the groupby() method. Once grouped, you can apply aggregation functions like sum(), mean(), or count(). Here’s an example:
import pandas as pd
# Create a sample DataFrame
data = {
"City": ["Chennai", "Madurai", "Chennai", "Madurai", "Trichy"],
"Sales": [100, 200, 150, 180, 90],
"Profit": [20, 40, 30, 35, 15]
}
df = pd.DataFrame(data)
# Group by City and calculate total Sales
grouped = df.groupby("City")["Sales"].sum()
print(grouped)
Output
| City | Sales |
|---|---|
| Chennai | 250 |
| Madurai | 380 |
| Trichy | 90 |
Explanation: The groupby("City") method groups the data by the City column. The sum() function is applied to calculate the total sales for each city.
Grouping with Multiple Aggregations
You can perform multiple aggregations on grouped data using the agg() method. Here’s an example:
# Group by City and calculate total Sales and average Profit
grouped = df.groupby("City").agg({
"Sales": "sum",
"Profit": "mean"
})
print(grouped)
Output
| City | Sales | Profit |
|---|---|---|
| Chennai | 250 | 25.0 |
| Madurai | 380 | 37.5 |
| Trichy | 90 | 15.0 |
Explanation: The agg() method allows you to perform multiple aggregation operations on grouped data. In this example, total sales and average profit are calculated for each city.
Grouping with Multiple Columns
You can group data by multiple columns by passing a list to the groupby() method. Here’s an example:
# Group by City and Sales, and calculate total Profit
grouped = df.groupby(["City", "Sales"])["Profit"].sum()
print(grouped)
Output
| City | Sales | Profit |
|---|---|---|
| Chennai | 100 | 20 |
| Chennai | 150 | 30 |
| Madurai | 180 | 35 |
| Madurai | 200 | 40 |
| Trichy | 90 | 15 |
Explanation: Grouping by multiple columns creates a hierarchical index, allowing you to analyze the data at multiple levels. In this example, profits are grouped by both city and sales values.
Key Takeaways
- Basic Grouping: Use
groupby()to group data by a single column and apply aggregation functions. - Multiple Aggregations: Use
agg()for performing multiple operations on grouped data. - Grouping by Multiple Columns: Analyze data at multiple levels by grouping with more than one column.
- Insights: Grouping helps summarize and analyze data effectively based on categorical variables.