Select first row in each GROUP BY group?
As the title suggests, I'd like to select the first row of each set of rows grouped with a
Specifically, if I've got a
purchases table that looks like this:
SELECT * FROM purchases;
id | customer | total ---+----------+------ 1 | Joe | 5 2 | Sally | 3 3 | Joe | 2 4 | Sally | 1
I'd like to query for the
id of the largest purchase (
total) made by each
customer. Something like this:
SELECT FIRST(id), customer, FIRST(total) FROM purchases GROUP BY customer ORDER BY total DESC;
FIRST(id) | customer | FIRST(total) ----------+----------+------------- 1 | Joe | 5 2 | Sally | 3
On Oracle 9.2+ (not 8i+ as originally stated), SQL Server 2005+, PostgreSQL 8.4+, DB2, Firebird 3.0+, Teradata, Sybase, Vertica:
WITH summary AS ( SELECT p.id, p.customer, p.total, ROW_NUMBER() OVER(PARTITION BY p.customer ORDER BY p.total DESC) AS rk FROM PURCHASES p) SELECT s.* FROM summary s WHERE s.rk = 1
Supported by any database:
But you need to add logic to break ties:
SELECT MIN(x.id), -- change to MAX if you want the highest x.customer, x.total FROM PURCHASES x JOIN (SELECT p.customer, MAX(total) AS max_total FROM PURCHASES p GROUP BY p.customer) y ON y.customer = x.customer AND y.max_total = x.total GROUP BY x.customer, x.total
In PostgreSQL this is typically simpler and faster (more performance optimization below):
SELECT DISTINCT ON (customer) id, customer, total FROM purchases ORDER BY customer, total DESC, id;
Or shorter (if not as clear) with ordinal numbers of output columns:
SELECT DISTINCT ON (2) id, customer, total FROM purchases ORDER BY 2, 3 DESC, 1;
total can be NULL (won't hurt either way, but you'll want to match existing indexes):
... ORDER BY customer, total DESC NULLS LAST, id;
DISTINCT ON is a PostgreSQL extension of the standard (where only
DISTINCT on the whole
SELECT list is defined).
List any number of expressions in the
DISTINCT ON clause, the combined row value defines duplicates. The manual:
Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.
Bold emphasis mine.
DISTINCT ON can be combined with
ORDER BY. Leading expressions in
ORDER BY must be in the set of expressions in
DISTINCT ON, but you can rearrange order among those freely. Example.
You can add additional expressions to
ORDER BY to pick a particular row from each group of peers. Or, as the manual puts it:
DISTINCT ONexpression(s) must match the leftmost
ORDER BYexpression(s). The
ORDER BYclause will normally contain additional expression(s) that determine the desired precedence of rows within each
id as last item to break ties:
"Pick the row with the smallest
id from each group sharing the highest
To order results in a way that disagrees with the sort order determining the first per group, you can nest above query in an outer query with another
ORDER BY. Example.
total can be NULL, you most probably want the row with the greatest non-null value. Add
NULLS LAST like demonstrated. See:
SELECT list is not constrained by expressions in
DISTINCT ON or
ORDER BY in any way. (Not needed in the simple case above):
You don't have to include any of the expressions in
You can include any other expression in the
SELECTlist. This is instrumental for replacing much more complex queries with subqueries and aggregate / window functions.
I tested with Postgres versions 8.3 – 13. But the feature has been there at least since version 7.1, so basically always.
The perfect index for the above query would be a multi-column index spanning all three columns in matching sequence and with matching sort order:
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
May be too specialized. But use it if read performance for the particular query is crucial. If you have
DESC NULLS LAST in the query, use the same in the index so that sort order matches and the index is applicable.
Effectiveness / Performance optimization
Weigh cost and benefit before creating tailored indexes for each query. The potential of above index largely depends on data distribution.
The index is used because it delivers pre-sorted data. In Postgres 9.2 or later the query can also benefit from an index only scan if the index is smaller than the underlying table. The index has to be scanned in its entirety, though.
For few rows per customer (high cardinality in column
customer), this is very efficient. Even more so if you need sorted output anyway. The benefit shrinks with a growing number of rows per customer.
Ideally, you have enough
work_mem to process the involved sort step in RAM and not spill to disk. But generally setting
work_mem too high can have adverse effects. Consider
SET LOCAL for exceptionally big queries. Find how much you need with
EXPLAIN ANALYZE. Mention of "Disk:" in the sort step indicates the need for more:
- Configuration parameter work_mem in PostgreSQL on Linux
- Optimize simple query using ORDER BY date and text
For many rows per customer (low cardinality in column
customer), a loose index scan (a.k.a. "skip scan") would be (much) more efficient, but that's not implemented up to Postgres 13. (An implementation for index-only scans is in development for Postgres 14. See here and here.)
For now, there are faster query techniques to substitute for this. In particular if you have a separate table holding unique customers, which is the typical use case. But also if you don't:
- Optimize GROUP BY query to retrieve latest row per user
- Optimize groupwise maximum query
- Query last N related rows per row
I had a simple benchmark here which is outdated by now. I replaced it with a detailed benchmark in this separate answer.
Read more... Read less...
Testing the most interesting candidates with Postgres 9.4 and 9.5 with a halfway realistic table of 200k rows in
purchases and 10k distinct
customer_id (avg. 20 rows per customer).
For Postgres 9.5 I ran a 2nd test with effectively 86446 distinct customers. See below (avg. 2.3 rows per customer).
CREATE TABLE purchases ( id serial , customer_id int -- REFERENCES customer , total int -- could be amount of money in Cent , some_column text -- to make the row bigger, more realistic );
I use a
serial (PK constraint added below) and an integer
customer_id since that's a more typical setup. Also added
some_column to make up for typically more columns.
Dummy data, PK, index - a typical table also has some dead tuples:
INSERT INTO purchases (customer_id, total, some_column) -- insert 200k rows SELECT (random() * 10000)::int AS customer_id -- 10k customers , (random() * random() * 100000)::int AS total , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int) FROM generate_series(1,200000) g; ALTER TABLE purchases ADD CONSTRAINT purchases_id_pkey PRIMARY KEY (id); DELETE FROM purchases WHERE random() > 0.9; -- some dead rows INSERT INTO purchases (customer_id, total, some_column) SELECT (random() * 10000)::int AS customer_id -- 10k customers , (random() * random() * 100000)::int AS total , 'note: ' || repeat('x', (random()^2 * random() * random() * 500)::int) FROM generate_series(1,20000) g; -- add 20k to make it ~ 200k CREATE INDEX purchases_3c_idx ON purchases (customer_id, total DESC, id); VACUUM ANALYZE purchases;
customer table - for superior query:
CREATE TABLE customer AS SELECT customer_id, 'customer_' || customer_id AS customer FROM purchases GROUP BY 1 ORDER BY 1; ALTER TABLE customer ADD CONSTRAINT customer_customer_id_pkey PRIMARY KEY (customer_id); VACUUM ANALYZE customer;
In my second test for 9.5 I used the same setup, but with
random() * 100000 to generate
customer_id to get only few rows per
Object sizes for table
Generated with a query taken from this related answer:
what | bytes/ct | bytes_pretty | bytes_per_row -----------------------------------+----------+--------------+--------------- core_relation_size | 20496384 | 20 MB | 102 visibility_map | 0 | 0 bytes | 0 free_space_map | 24576 | 24 kB | 0 table_size_incl_toast | 20529152 | 20 MB | 102 indexes_size | 10977280 | 10 MB | 54 total_size_incl_toast_and_indexes | 31506432 | 30 MB | 157 live_rows_in_text_representation | 13729802 | 13 MB | 68 ------------------------------ | | | row_count | 200045 | | live_tuples | 200045 | | dead_tuples | 19955 | |
row_number() in CTE, (see other answer)
WITH cte AS ( SELECT id, customer_id, total , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn FROM purchases ) SELECT id, customer_id, total FROM cte WHERE rn = 1;
row_number()in subquery (my optimization)
SELECT id, customer_id, total FROM ( SELECT id, customer_id, total , row_number() OVER(PARTITION BY customer_id ORDER BY total DESC) AS rn FROM purchases ) sub WHERE rn = 1;
DISTINCT ON (see other answer)
SELECT DISTINCT ON (customer_id) id, customer_id, total FROM purchases ORDER BY customer_id, total DESC, id;
4. rCTE with
LATERAL subquery (see here)
WITH RECURSIVE cte AS ( ( -- parentheses required SELECT id, customer_id, total FROM purchases ORDER BY customer_id, total DESC LIMIT 1 ) UNION ALL SELECT u.* FROM cte c , LATERAL ( SELECT id, customer_id, total FROM purchases WHERE customer_id > c.customer_id -- lateral reference ORDER BY customer_id, total DESC LIMIT 1 ) u ) SELECT id, customer_id, total FROM cte ORDER BY customer_id;
customer table with
LATERAL (see here)
SELECT l.* FROM customer c , LATERAL ( SELECT id, customer_id, total FROM purchases WHERE customer_id = c.customer_id -- lateral reference ORDER BY total DESC LIMIT 1 ) l;
ORDER BY (see other answer)
SELECT (array_agg(id ORDER BY total DESC)) AS id , customer_id , max(total) AS total FROM purchases GROUP BY customer_id;
Execution time for above queries with
EXPLAIN ANALYZE (and all options off), best of 5 runs.
All queries used an Index Only Scan on
purchases2_3c_idx (among other steps). Some of them just for the smaller size of the index, others more effectively.
A. Postgres 9.4 with 200k rows and ~ 20 per
1. 273.274 ms 2. 194.572 ms 3. 111.067 ms 4. 92.922 ms 5. 37.679 ms -- winner 6. 189.495 ms
B. The same with Postgres 9.5
1. 288.006 ms 2. 223.032 ms 3. 107.074 ms 4. 78.032 ms 5. 33.944 ms -- winner 6. 211.540 ms
C. Same as B., but with ~ 2.3 rows per
1. 381.573 ms 2. 311.976 ms 3. 124.074 ms -- winner 4. 710.631 ms 5. 311.976 ms 6. 421.679 ms
Here is a new one by "ogr" testing with 10M rows and 60k unique "customers" on Postgres 11.5 (current as of Sep. 2019). Results are still in line with what we have seen so far:
Original (outdated) benchmark from 2011
I ran three tests with PostgreSQL 9.1 on a real life table of 65579 rows and single-column btree indexes on each of the three columns involved and took the best execution time of 5 runs.
Comparing @OMGPonies' first query (
A) to the above
DISTINCT ON solution (
- Select the whole table, results in 5958 rows in this case.
A: 567.218 ms B: 386.673 ms
- Use condition
WHERE customer BETWEEN x AND yresulting in 1000 rows.
A: 249.136 ms B: 55.111 ms
- Select a single customer with
WHERE customer = x.
A: 0.143 ms B: 0.072 ms
Same test repeated with the index described in the other answer
CREATE INDEX purchases_3c_idx ON purchases (customer, total DESC, id);
1A: 277.953 ms 1B: 193.547 ms 2A: 249.796 ms -- special index not used 2B: 28.679 ms 3A: 0.120 ms 3B: 0.048 ms
This is common greatest-n-per-group problem, which already has well tested and highly optimized solutions. Personally I prefer the left join solution by Bill Karwin (the original post with lots of other solutions).
Note that bunch of solutions to this common problem can surprisingly be found in the one of most official sources, MySQL manual! See Examples of Common Queries :: The Rows Holding the Group-wise Maximum of a Certain Column.
In Postgres you can use
array_agg like this:
SELECT customer, (array_agg(id ORDER BY total DESC)), max(total) FROM purchases GROUP BY customer
This will give you the
id of each customer's largest purchase.
Some things to note:
array_aggis an aggregate function, so it works with
array_agglets you specify an ordering scoped to just itself, so it doesn't constrain the structure of the whole query. There is also syntax for how you sort NULLs, if you need to do something different from the default.
- Once we build the array, we take the first element. (Postgres arrays are 1-indexed, not 0-indexed).
- You could use
array_aggin a similar way for your third output column, but
DISTINCT ON, using
array_agglets you keep your
GROUP BY, in case you want that for other reasons.
The solution is not very efficient as pointed by Erwin, because of presence of SubQs
select * from purchases p1 where total in (select max(total) from purchases where p1.customer=customer) order by total desc;
SELECT purchases.* FROM purchases LEFT JOIN purchases as p ON p.customer = purchases.customer AND purchases.total < p.total WHERE p.total IS NULL
HOW DOES THAT WORK! (I've been there)
We want to make sure that we only have the highest total for each purchase.
Some Theoretical Stuff (skip this part if you only want to understand the query)
Let Total be a function T(customer,id) where it returns a value given the name and id To prove that the given total (T(customer,id)) is the highest we have to prove that We want to prove either
- ∀x T(customer,id) > T(customer,x) (this total is higher than all other total for that customer)
- ¬∃x T(customer, id) < T(customer, x) (there exists no higher total for that customer)
The first approach will need us to get all the records for that name which I do not really like.
The second one will need a smart way to say there can be no record higher than this one.
Back to SQL
If we left joins the table on the name and total being less than the joined table:
LEFT JOIN purchases as p ON p.customer = purchases.customer AND purchases.total < p.total
we make sure that all records that have another record with the higher total for the same user to be joined:
+--------------+---------------------+-----------------+------+------------+---------+ | purchases.id | purchases.customer | purchases.total | p.id | p.customer | p.total | +--------------+---------------------+-----------------+------+------------+---------+ | 1 | Tom | 200 | 2 | Tom | 300 | | 2 | Tom | 300 | | | | | 3 | Bob | 400 | 4 | Bob | 500 | | 4 | Bob | 500 | | | | | 5 | Alice | 600 | 6 | Alice | 700 | | 6 | Alice | 700 | | | | +--------------+---------------------+-----------------+------+------------+---------+
That will help us filter for the highest total for each purchase with no grouping needed:
WHERE p.total IS NULL +--------------+----------------+-----------------+------+--------+---------+ | purchases.id | purchases.name | purchases.total | p.id | p.name | p.total | +--------------+----------------+-----------------+------+--------+---------+ | 2 | Tom | 300 | | | | | 4 | Bob | 500 | | | | | 6 | Alice | 700 | | | | +--------------+----------------+-----------------+------+--------+---------+
And that's the answer we need.