SQL 101 — Mastering Data Filtering: WHERE, HAVING, and QUALIFY Clause
As a BI analyst, I have spent countless hours working with SQL to extract meaningful insights from large datasets. During my journey, I have learned that one of the most important techniques in SQL is filtering, which allows you to extract relevant information from a dataset.
There are three main clauses that you can use to filter SQL results: the WHERE clause, the HAVING clause, and the QUALIFY clause. While the three of them are used to filter data, each clause work differently.
Where
The WHERE
clause in SQL is used to filter data based on specified conditions. It is added to a SELECT
, UPDATE
, or DELETE
statement and is used to retrieve or modify only those records that meet specific criteria. The WHERE
clause is used to specify the conditions that must be met for a record to be returned or affected by the statement.
For example, the following SQL statement uses the WHERE
clause to retrieve all records from the customers
table where the city
column is equal to “London”:
SELECT * FROM customers
WHERE city = 'London';
In this example, the WHERE
clause is used to specify the condition “city = ‘London’”, which means that only those records where the “city” column contains the value “London” will be returned.
Though WHERE
clause is powerful enough to filter data, it has some restriction that you have to put in mind:
- The use of indexes can affect the performance of the
WHERE
clause. If the columns referenced in theWHERE
clause are not indexed, the query can take longer to run, as the database has to scan the entire table for the relevant records. NULL
values can cause issues when using theWHERE
clause, asNULL
is not equal to any other value, including itself. To compareNULL
values in aWHERE
clause, you must use theIS NULL
orIS NOT NULL
operators.
Having
The HAVING
clause in SQL is used to filter groups of rows based on aggregate values. It is used in conjunction with the GROUP BY
clause, which groups rows from a table into aggregate values. The HAVING
clause is used to specify conditions that must be met by the groups, rather than individual rows.
For example, consider a table of sales data, with columns for the date, product, and sales amount. To find the total sales for each product, you would group the data by product and use the SUM
function to calculate the total sales for each group:
SELECT product, SUM(sales_amount) as total_sales
FROM sales_data
GROUP BY product
HAVING SUM(sales_amount) > 10000;
In this example, the HAVING
clause is used to filter the groups based on the condition SUM(sales_amount) > 10000
, which means that only those groups with a total sales greater than 10,000 will be returned.
Few things that you need to put in mind when using HAVING
clause:
NULL
values can cause issues when using theHAVING
clause, as aggregate functions may ignore or handleNULL
values differently. It’s important to understand how aggregate functions treatNULL
values, and to specify the appropriate conditions in theHAVING
clause to account for them.- The use of the
HAVING
clause can affect the performance of a query, as the database must perform the grouping and aggregation before applying the conditions specified in theHAVING
clause. Indexes, indexes on aggregate columns, and proper use of theWHERE
clause can improve query performance.
Qualify
The QUALIFY
clause in SQL is used to filter the results of window functions. It allows you to apply conditions to the results of aggregate and window functions, making it a powerful tool for summarizing and filtering data.
Please note that not all databases support the
QUALIFY
clause. It is important to check the database documentation for more information.
For example, consider a table of sales data, with columns for the date, product, and sales amount. To find the top 5 products by sales, you could group the data by product and use the SUM function to calculate the total sales for each group:
SELECT product, SUM(sales_amount) as total_sales
FROM sales_data
GROUP BY product
QUALIFY RANK() OVER (ORDER BY SUM(sales_amount) DESC) <= 5;
To get the top 5 products, the query will process as following:
- Get all
product
andsales_amount
fromsales_data
- Grouping the data by
product
- Summarize
sales_amount
per product astotal_sales
- Sorting
total_sales
from the biggest to the smallest amount - Assign rank for each row
- Filtering rank ≤ 5
All these filtering process (4–5) could be written into only single line QUALIFY RANK() OVER (ORDER BY SUM(sales_amount) DESC) <= 5;
Alternatively, if your database does not support QUALIFY
, you can use Common Table Expression (CTE) to achieve the same result:
with sales as (
SELECT
product,
SUM(sales_amount) as total_sales,
ROW_NUMBER() OVER (ORDER BY SUM(sales_amount) desc) rownum
FROM sales_data
GROUP BY product
)
SELECT *
FROM sales
where rownum <= 5;
or if you also don’t have window function:
with sales as (
SELECT
product,
SUM(sales_amount) as total_sales,
FROM sales_data
GROUP BY product
)
SELECT *
FROM sales
ORDER BY total_sales desc
LIMIT 5
Don’t worry if your database doesn’t support a function. There’s always a way to get the same result, just be creative!
In conclusion, filtering data in SQL is a crucial technique for extracting meaningful insights from large datasets. The WHERE, HAVING, and QUALIFY clauses are the primary methods used for filtering data in SQL, each with its own unique purpose and use case. Understanding the difference between these clauses is important for making efficient and accurate data filtering decisions.
P.S: If you have any questions or suggestions to make this article better, do not hesitate to comment!
Fun fact: This article is 80% written with the help of ChatGPT. I only gave the main idea of each paragraph and the AI turns it into paragraph.