agrupacion-datos-group-by-sql

Grouping Data with GROUP BY in SQL

  • 4 min

Imagine you are the manager of a store. You are not so interested in knowing that “Juan bought a pencil at 10:00” and that “Ana bought an eraser at 10:05”.

What you really want to know is:

How much have we sold in total today?

How many products has each employee sold?

To answer these questions, we need to collapse hundreds of individual rows into a single summarized result. In T-SQL, the tool for this is the GROUP BY clause.

If you have ever used Pivot Tables in Excel, the concept is exactly the same: take a bunch of data and group it by a common characteristic.

Syntax and Logic

The GROUP BY clause is placed after WHERE (if any) and before ORDER BY.

SELECT GroupingColumn, AGGREGATE_FUNCTION(Column)
FROM Table
GROUP BY GroupingColumn;
Copied!

When SQL Server executes this, it does the following:

  1. Scans the table.
  2. Identifies the unique values of the column we are grouping by (e.g., City).
  3. Creates “buckets” or virtual groups for each unique value.
  4. Calculates the summary for each group.

Let’s see it with an example. Suppose we have an Employees table and we want to know how many employees are in each department.

SELECT Department, COUNT(*) AS NumberOfEmployees
FROM Employees
GROUP BY Department;
Copied!

The result would be as follows,

DepartmentNumberOfEmployees
Sales10
IT5
HR2

SQL Server has taken all the rows where Department is ‘Sales’, grouped them together, and counted them.

The Column invalid error

This is the most important concept in this article and the source of 90% of errors. The famous error:

Column is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

When you use GROUP BY, in the SELECT you can ONLY put:

  1. Columns that appear in the GROUP BY.
  2. Aggregate functions (COUNT, SUM, MAX, etc.).

Why?

It’s hard to see at first, but when you see it, it makes perfect sense.

When you group by Department, in the grouping the non-selected fields disappear. The group does not have them (it couldn’t have them, because there are several).

Suppose in the ‘Sales’ department you have Juan Gomez, Pedro García, and María Gutierrez… if you try to do this:

-- ⛔ THIS WILL ERROR
SELECT Department, LastName, COUNT(*)
FROM Employees
GROUP BY Department;
Copied!

SQL Server will tell you:

You ask me for the “Name” of the ‘Sales’ department

But I have several different names (Gomez, García, Gutierrez).

If you want to display the name, you either add it to the GROUP BY (breaking the group) or use an aggregate function (like MAX(Name)).

It won’t work even if all elements in the group have the same property equal.

Ultimately, it’s a conceptual problem. You are trying to fit a cell into several, and that cannot be done (without aggregation).

Grouping by Multiple Columns

We can create more specific groups by combining several columns. For example, we want to know how many employees are in each department, but broken down by City.

SELECT Department, City, COUNT(*) AS Total
FROM Employees
GROUP BY Department, City;
Copied!

Here, SQL Server will create a unique group for each combination.

  • Sales - Madrid
  • Sales - Barcelona
  • IT - Madrid
  • IT - Valencia

Only rows that share both values will be grouped together.

Execution Order of GROUP BY with WHERE

It is vital to understand the execution order we have mentioned in previous modules.

FROM (Loads data)

WHERE (Filters individual rows)

GROUP BY (Groups the surviving rows)

SELECT (Displays results)

This means that WHERE is applied before creating the groups.

For example, we want to count employees by department, but only counting those who are “Active”.

SELECT Department, COUNT(*) AS ActiveEmployees
FROM Employees
WHERE Active = 1  -- First we filter out the inactive ones
GROUP BY Department; -- Then we group those that remain
Copied!

Handling NULLs

In the world of GROUP BY, all NULLs are considered equal.

If you group by a column that has null values, SQL Server will create a single group for the NULLs, instead of ignoring or separating them.

SELECT ManagerID, COUNT(*)
FROM Employees
GROUP BY ManagerID;
Copied!

If there are 5 employees who have no manager (ManagerID is NULL), you will see a row in the result where ManagerID is NULL and the count is 5.