import pandas as pd
import numpy as np
num_rows = 10
department = np.random.choice(['HR', 'IT', 'Finance','Sales'], num_rows)
department
array(['Sales', 'IT', 'HR', 'HR', 'IT', 'Sales', 'Finance', 'Sales',
'Finance', 'Sales'], dtype='<U7')
Nipun Batra
February 11, 2025
pandas, groupby, data aggregation, quiz, exercises, data manipulation, python
import pandas as pd
import numpy as np
num_rows = 10
department = np.random.choice(['HR', 'IT', 'Finance','Sales'], num_rows)
department
array(['Sales', 'IT', 'HR', 'HR', 'IT', 'Sales', 'Finance', 'Sales',
'Finance', 'Sales'], dtype='<U7')
This notebook contains practical exercises focusing on pandas GroupBy operations - one of the most powerful features in pandas for data aggregation and analysis. These exercises will test your understanding of grouping data, applying functions, and extracting insights from grouped datasets.
This quiz will assess your ability to: - Create and manipulate sample datasets with pandas - Apply GroupBy operations effectively - Use various aggregation functions (max, min, mean, etc.) - Access specific groups and their properties - Understand method chaining in pandas operations
We’ll work with a simulated employee dataset containing information about departments, employee IDs, salaries, and office locations.
Let’s examine our employee dataset. Notice the different departments, varying salaries, and multiple office locations. This diversity makes it perfect for practicing GroupBy operations.
When we call df.groupby('department')
, pandas creates a GroupBy object that represents the data split into groups based on the specified column.
The get_group()
method allows us to extract data for a specific group. This is useful for examining subsets of your data.
department | employee_id | salary | location | |
---|---|---|---|---|
0 | Sales | 1372 | 41751 | Pune |
1 | IT | 1109 | 43223 | Chennai |
2 | HR | 1981 | 49878 | Bangalore |
3 | HR | 1185 | 45469 | Chennai |
4 | IT | 1502 | 50338 | Pune |
5 | Sales | 1073 | 53113 | Hyderabad |
6 | Finance | 1291 | 70930 | Chennai |
7 | Sales | 1058 | 37430 | Bangalore |
8 | Finance | 1648 | 53010 | Chennai |
9 | Sales | 1401 | 52789 | Pune |
Pandas provides multiple ways to apply aggregation functions to grouped data. Let’s explore the different syntaxes and their implications.
The numeric_only
parameter helps control which columns are included in aggregation operations.
Different syntax approaches for focusing on specific columns during aggregation.
Understanding the parameters and options available for aggregation methods is crucial for effective data analysis.
From this exercise, we’ve learned several important concepts:
df.groupby('column')
splits data into groupsget_group('group_name')
retrieves specific group datamax()
, min()
, mean()
numeric_only
to control which columns are processedGroupBy operations are fundamental for: - Business Analytics: Analyzing performance by department, region, or time period - Scientific Research: Grouping experimental data by conditions or categories
- Financial Analysis: Summarizing transactions by account, date, or category - Marketing: Analyzing customer behavior by demographics or segments
To deepen your GroupBy skills, explore: - Multiple column grouping: df.groupby(['col1', 'col2'])
- Custom aggregation functions with agg()
- Transformation operations with transform()
- Filtering groups with filter()
- Advanced indexing with grouped data
department | employee_id | salary | location | |
---|---|---|---|---|
0 | Sales | 1372 | 41751 | Pune |
5 | Sales | 1073 | 53113 | Hyderabad |
7 | Sales | 1058 | 37430 | Bangalore |
9 | Sales | 1401 | 52789 | Pune |
department | employee_id | salary | location | |
---|---|---|---|---|
1 | IT | 1109 | 43223 | Chennai |
4 | IT | 1502 | 50338 | Pune |
employee_id | salary | location | |
---|---|---|---|
department | |||
Finance | 1648 | 70930 | Chennai |
HR | 1981 | 49878 | Chennai |
IT | 1502 | 50338 | Pune |
Sales | 1401 | 53113 | Pune |
employee_id | salary | |
---|---|---|
department | ||
Finance | 1648 | 70930 |
HR | 1981 | 49878 |
IT | 1502 | 50338 |
Sales | 1401 | 53113 |
employee_id | salary | |
---|---|---|
department | ||
Finance | 1648 | 70930 |
HR | 1981 | 49878 |
IT | 1502 | 50338 |
Sales | 1401 | 53113 |
Signature: g.max( numeric_only: 'bool' = False, min_count: 'int' = -1, engine: "Literal['cython', 'numba'] | None" = None, engine_kwargs: 'dict[str, bool] | None' = None, ) Docstring: Compute max of group values. Parameters ---------- numeric_only : bool, default False Include only float, int, boolean columns. .. versionchanged:: 2.0.0 numeric_only no longer accepts ``None``. min_count : int, default -1 The required number of valid values to perform the operation. If fewer than ``min_count`` non-NA values are present the result will be NA. engine : str, default None None * ``'cython'`` : Runs rolling apply through C-extensions from cython. * ``'numba'`` : Runs rolling apply through JIT compiled code from numba. Only available when ``raw`` is set to ``True``. * ``None`` : Defaults to ``'cython'`` or globally setting ``compute.use_numba`` engine_kwargs : dict, default None None * For ``'cython'`` engine, there are no accepted ``engine_kwargs`` * For ``'numba'`` engine, the engine can accept ``nopython``, ``nogil`` and ``parallel`` dictionary keys. The values must either be ``True`` or ``False``. The default ``engine_kwargs`` for the ``'numba'`` engine is ``{'nopython': True, 'nogil': False, 'parallel': False}`` and will be applied to both the ``func`` and the ``apply`` groupby aggregation. Returns ------- Series or DataFrame Computed max of values within each group. Examples -------- For SeriesGroupBy: >>> lst = ['a', 'a', 'b', 'b'] >>> ser = pd.Series([1, 2, 3, 4], index=lst) >>> ser a 1 a 2 b 3 b 4 dtype: int64 >>> ser.groupby(level=0).max() a 2 b 4 dtype: int64 For DataFrameGroupBy: >>> data = [[1, 8, 2], [1, 2, 5], [2, 5, 8], [2, 6, 9]] >>> df = pd.DataFrame(data, columns=["a", "b", "c"], ... index=["tiger", "leopard", "cheetah", "lion"]) >>> df a b c tiger 1 8 2 leopard 1 2 5 cheetah 2 5 8 lion 2 6 9 >>> df.groupby("a").max() b c a 1 8 5 2 6 9 File: ~/mambaforge/lib/python3.12/site-packages/pandas/core/groupby/groupby.py Type: method