Data Analysis Quiz - Pandas GroupBy Operations

Data Science
Python
Pandas
Data Analysis
Practical exercises and quiz problems focusing on pandas groupby operations and data aggregation techniques
Author

Nipun Batra

Published

February 11, 2025

Keywords

pandas, groupby, data aggregation, quiz, exercises, data manipulation, python

import pandas as pd
import numpy as np


num_rows = 10
department = np.random.choice(['HR', 'IT', 'Finance','Sales'], num_rows)
department
array(['Sales', 'IT', 'HR', 'HR', 'IT', 'Sales', 'Finance', 'Sales',
       'Finance', 'Sales'], dtype='<U7')
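Note that the values above will change on every run because no random seed is set. If you want reproducible output while working through the quiz, one option (not used in the original cells) is to fix NumPy's seed before generating the data:

np.random.seed(0)  # any fixed integer makes the random draws below repeatable
department = np.random.choice(['HR', 'IT', 'Finance', 'Sales'], num_rows)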


Introduction

This notebook contains practical exercises on pandas GroupBy operations, one of the most powerful features in pandas for data aggregation and analysis. The exercises test your understanding of grouping data, applying functions, and extracting insights from grouped datasets.

Learning Objectives

This quiz will assess your ability to:

  • Create and manipulate sample datasets with pandas
  • Apply GroupBy operations effectively
  • Use various aggregation functions (max, min, mean, etc.)
  • Access specific groups and their properties
  • Understand method chaining in pandas operations

Key Concepts Covered

  • GroupBy Mechanics: How pandas splits data into groups (a minimal example follows this list)
  • Aggregation Functions: Computing summary statistics per group
  • Group Selection: Accessing specific groups and their data
  • Method Syntax: Different ways to call aggregation methods
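To make the split-apply-combine mechanics concrete before turning to the employee data, here is a minimal, self-contained toy example (independent of the quiz dataset):

import pandas as pd

toy = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                    'points': [1, 2, 3, 4]})
# split the rows by 'team', apply sum() to each group's points, combine the results
toy.groupby('team')['points'].sum()
# team
# A    3
# B    7
# Name: points, dtype: int64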

Problem Setup

We’ll work with a simulated employee dataset containing information about departments, employee IDs, salaries, and office locations.

location = np.random.choice(['Bangalore', 'Chennai', 'Hyderabad','Pune'], num_rows)
employee_id = np.random.choice(range(1000, 2000), num_rows)
employee_id
array([1372, 1109, 1981, 1185, 1502, 1073, 1291, 1058, 1648, 1401])
salary = np.random.normal(50000, 10000, num_rows).astype(int)
salary
array([41751, 43223, 49878, 45469, 50338, 53113, 70930, 37430, 53010,
       52789])

Understanding the Dataset

Let’s examine our employee dataset. Notice the different departments, varying salaries, and multiple office locations. This diversity makes it perfect for practicing GroupBy operations.

GroupBy Fundamentals

Creating a GroupBy Object

When we call df.groupby('department'), pandas creates a GroupBy object that represents the data split into groups based on the specified column.

Accessing Individual Groups

The get_group() method allows us to extract data for a specific group. This is useful for examining subsets of your data.

df = pd.DataFrame({'department': department, 
                   'employee_id': employee_id, 
                   'salary': salary,
                   'location': location})
df
  department  employee_id  salary   location
0      Sales         1372   41751       Pune
1         IT         1109   43223    Chennai
2         HR         1981   49878  Bangalore
3         HR         1185   45469    Chennai
4         IT         1502   50338       Pune
5      Sales         1073   53113  Hyderabad
6    Finance         1291   70930    Chennai
7      Sales         1058   37430  Bangalore
8    Finance         1648   53010    Chennai
9      Sales         1401   52789       Pune
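Before grouping, a quick sanity check of the frame we just built is worthwhile; the exact numbers vary with the random data, but standard inspection methods such as these apply:

df.info()                         # column dtypes and non-null counts
df['department'].value_counts()   # employees per department
df['salary'].describe()           # distribution of the simulated salaries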

Aggregation Methods

Understanding Different Aggregation Approaches

Pandas provides multiple ways to apply aggregation functions to grouped data. Let’s explore the different syntaxes and their implications.

Controlling Numeric vs. All Columns

The numeric_only parameter helps control which columns are included in aggregation operations.
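For instance, max() happens to be defined for strings (lexicographic order), so it also works on the location column, but mean() is not; restricting the operation to numeric columns avoids an error in recent pandas versions. A small sketch, assuming the df built above:

df.groupby('department').mean(numeric_only=True)   # averages employee_id and salary per department
# df.groupby('department').mean()                  # without numeric_only, recent pandas raises because 'location' (strings) cannot be averaged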

g = df.groupby('department')
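The resulting object is lazy: nothing is computed until an aggregation is requested. A few standard attributes and methods help verify how the split turned out (exact values depend on the random data):

g.ngroups    # number of distinct departments (4 here)
g.size()     # Series giving the number of rows in each group
g.groups     # dict-like mapping of group label -> row index labels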

Column-Specific Aggregation

There are several syntaxes for focusing an aggregation on specific columns; the cells later in this notebook compare them.
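Beyond the single-function calls compared later in this notebook, the agg() method targets columns explicitly, including the "named aggregation" form; a sketch using the grouped object g created above:

g['salary'].agg(['min', 'max', 'mean'])    # several statistics for a single column
g.agg(max_salary=('salary', 'max'),        # named aggregation: output column = (input column, function)
      n_employees=('employee_id', 'count'))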

g.get_group('HR')
  department  employee_id  salary   location
2         HR         1981   49878  Bangalore
3         HR         1185   45469    Chennai
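Besides get_group(), a GroupBy object is iterable, yielding (group label, sub-DataFrame) pairs, which is handy for inspecting every group in turn:

for name, group in g:
    # 'name' is the department label, 'group' is the matching slice of df
    print(name, group.shape)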

Exploring Method Documentation

Understanding the parameters and options available for aggregation methods is crucial for effective data analysis.

Key Learning Points

GroupBy Operation Summary

From this exercise, we’ve learned several important concepts:

  1. GroupBy Object Creation: df.groupby('column') splits data into groups
  2. Group Access: get_group('group_name') retrieves specific group data
  3. Aggregation Methods: Multiple ways to apply functions like max(), min(), mean()
  4. Column Selection: Different syntax for targeting specific columns
  5. Parameter Control: Using numeric_only to control which columns are processed

Practical Applications

GroupBy operations are fundamental for:

  • Business Analytics: Analyzing performance by department, region, or time period
  • Scientific Research: Grouping experimental data by conditions or categories
  • Financial Analysis: Summarizing transactions by account, date, or category
  • Marketing: Analyzing customer behavior by demographics or segments

Best Practices

  1. Understand Your Data: Know the structure before grouping
  2. Choose Appropriate Aggregations: Select functions that make sense for your analysis
  3. Handle Missing Data: Consider how NaN values affect your groupings (see the sketch after this list)
  4. Performance: For large datasets, consider using vectorized operations
  5. Readability: Use clear, descriptive group column names
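On the missing-data point above: by default, groupby drops rows whose grouping key is NaN, which can silently shrink your results. A minimal, self-contained sketch of the dropna parameter (separate from the quiz data):

import numpy as np
import pandas as pd

tmp = pd.DataFrame({'dept': ['HR', 'IT', np.nan, 'HR'],
                    'salary': [100, 200, 300, 400]})
tmp.groupby('dept')['salary'].sum()                  # NaN key dropped: HR 500, IT 200
tmp.groupby('dept', dropna=False)['salary'].sum()    # NaN kept as its own group: HR 500, IT 200, NaN 300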

Further Exploration

To deepen your GroupBy skills, explore the following (a combined sketch appears after this list):

  • Multiple column grouping: df.groupby(['col1', 'col2'])
  • Custom aggregation functions with agg()
  • Transformation operations with transform()
  • Filtering groups with filter()
  • Advanced indexing with grouped data
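A combined sketch of those features, reusing the employee df from above (illustrative only; outputs depend on the random data):

# group by two keys at once
df.groupby(['department', 'location'])['salary'].mean()

# agg(): several summary statistics in one call
df.groupby('department')['salary'].agg(['count', 'mean', 'max'])

# transform(): broadcast a group-level statistic back onto every row
df['dept_avg_salary'] = df.groupby('department')['salary'].transform('mean')

# filter(): keep only groups meeting a condition, e.g. departments with at least 2 employees
df.groupby('department').filter(lambda grp: len(grp) >= 2)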

g.get_group('Sales')
  department  employee_id  salary   location
0      Sales         1372   41751       Pune
5      Sales         1073   53113  Hyderabad
7      Sales         1058   37430  Bangalore
9      Sales         1401   52789       Pune
g.get_group('IT')
  department  employee_id  salary location
1         IT         1109   43223  Chennai
4         IT         1502   50338     Pune
g.max()
            employee_id  salary location
department
Finance            1648   70930  Chennai
HR                 1981   49878  Chennai
IT                 1502   50338     Pune
Sales              1401   53113     Pune
g.max(numeric_only=True)
            employee_id  salary
department
Finance            1648   70930
HR                 1981   49878
IT                 1502   50338
Sales              1401   53113
g.max('salary')
            employee_id  salary
department
Finance            1648   70930
HR                 1981   49878
IT                 1502   50338
Sales              1401   53113

Note: in g.max('salary'), the positional argument is bound to the first parameter of max(), which is numeric_only (see the signature at the end of this notebook), not to a column selector. Because 'salary' is truthy, the call behaves like g.max(numeric_only=True); it does not restrict the aggregation to the salary column.
g['salary'].max()
department
Finance    70930
HR         49878
IT         50338
Sales      53113
Name: salary, dtype: int64
g.max()['salary']
department
Finance    70930
HR         49878
IT         50338
Sales      53113
Name: salary, dtype: int64
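The two cells above return identical Series, but they do different amounts of work: selecting the column first aggregates only salary, whereas g.max()['salary'] computes the max of every column (including the strings) and then discards all but one. On wide DataFrames the first form is usually preferable:

g['salary'].max()     # aggregate just the salary column
g.max()['salary']     # aggregate every column, then keep only salary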
g.max?
Signature:
g.max(
    numeric_only: 'bool' = False,
    min_count: 'int' = -1,
    engine: "Literal['cython', 'numba'] | None" = None,
    engine_kwargs: 'dict[str, bool] | None' = None,
)
Docstring:
Compute max of group values.

Parameters
----------
numeric_only : bool, default False
    Include only float, int, boolean columns.

    .. versionchanged:: 2.0.0

        numeric_only no longer accepts ``None``.

min_count : int, default -1
    The required number of valid values to perform the operation. If fewer
    than ``min_count`` non-NA values are present the result will be NA.

engine : str, default None
    * ``'cython'`` : Runs rolling apply through C-extensions from cython.
    * ``'numba'`` : Runs rolling apply through JIT compiled code from numba.
        Only available when ``raw`` is set to ``True``.
    * ``None`` : Defaults to ``'cython'`` or globally setting ``compute.use_numba``

engine_kwargs : dict, default None
    * For ``'cython'`` engine, there are no accepted ``engine_kwargs``
    * For ``'numba'`` engine, the engine can accept ``nopython``, ``nogil``
        and ``parallel`` dictionary keys. The values must either be ``True`` or
        ``False``. The default ``engine_kwargs`` for the ``'numba'`` engine is
        ``{'nopython': True, 'nogil': False, 'parallel': False}`` and will be
        applied to both the ``func`` and the ``apply`` groupby aggregation.

Returns
-------
Series or DataFrame
    Computed max of values within each group.

Examples
--------
For SeriesGroupBy:

>>> lst = ['a', 'a', 'b', 'b']
>>> ser = pd.Series([1, 2, 3, 4], index=lst)
>>> ser
a    1
a    2
b    3
b    4
dtype: int64
>>> ser.groupby(level=0).max()
a    2
b    4
dtype: int64

For DataFrameGroupBy:

>>> data = [[1, 8, 2], [1, 2, 5], [2, 5, 8], [2, 6, 9]]
>>> df = pd.DataFrame(data, columns=["a", "b", "c"],
...                   index=["tiger", "leopard", "cheetah", "lion"])
>>> df
          a  b  c
  tiger   1  8  2
leopard   1  2  5
cheetah   2  5  8
   lion   2  6  9
>>> df.groupby("a").max()
    b  c
a
1   8  5
2   6  9

File:      ~/mambaforge/lib/python3.12/site-packages/pandas/core/groupby/groupby.py
Type:      method