How can I remove duplicate rows?
What is the best way to remove duplicate rows from a fairly large
SQL Server table (i.e. 300,000+ rows)?
The rows, of course, will not be perfect duplicates because of the existence of the
RowID identity field.
RowID int not null identity(1,1) primary key, Col1 varchar(20) not null, Col2 varchar(2048) not null, Col3 tinyint not null
Assuming no nulls, you
GROUP BY the unique columns, and
MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:
DELETE FROM MyTable LEFT OUTER JOIN ( SELECT MIN(RowId) as RowId, Col1, Col2, Col3 FROM MyTable GROUP BY Col1, Col2, Col3 ) as KeepRows ON MyTable.RowId = KeepRows.RowId WHERE KeepRows.RowId IS NULL
In case you have a GUID instead of an integer, you can replace
CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
Another possible way of doing this is
; --Ensure that any immediately preceding statement is terminated with a semicolon above WITH cte AS (SELECT ROW_NUMBER() OVER (PARTITION BY Col1, Col2, Col3 ORDER BY ( SELECT 0)) RN FROM #MyTable) DELETE FROM cte WHERE RN > 1;
I am using
ORDER BY (SELECT 0) above as it is arbitrary which row to preserve in the event of a tie.
To preserve the latest one in
RowID order for example you could use
ORDER BY RowID DESC
The execution plan for this is often simpler and more efficient than that in the accepted answer as it does not require the self join.
This is not always the case however. One place where the
GROUP BY solution might be preferred is situations where a hash aggregate would be chosen in preference to a stream aggregate.
ROW_NUMBER solution will always give pretty much the same plan whereas the
GROUP BY strategy is more flexible.
Factors which might favour the hash aggregate approach would be
- No useful index on the partitioning columns
- relatively fewer groups with relatively more duplicates in each group
In extreme versions of this second case (if there are very few groups with many duplicates in each) one could also consider simply inserting the rows to keep into a new table then
TRUNCATE-ing the original and copying them back to minimise logging compared to deleting a very high proportion of the rows.
Read more... Read less...
There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.
I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:
DELETE dupes FROM MyTable dupes, MyTable fullTable WHERE dupes.dupField = fullTable.dupField AND dupes.secondDupField = fullTable.secondDupField AND dupes.uniqueField > fullTable.uniqueField
The following query is useful to delete duplicate rows. The table in this example has
ID as an identity column and the columns which have duplicate data are
DELETE FROM TableName WHERE ID NOT IN (SELECT MAX(ID) FROM TableName GROUP BY Column1, Column2, Column3 /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially nullable. Because of semantics of NOT IN (NULL) including the clause below can simplify the plan*/ HAVING MAX(ID) IS NOT NULL)
The following script shows usage of
ORDER BY in one query, and returns the results with duplicate column and its count.
SELECT YourColumnName, COUNT(*) TotalCount FROM YourTableName GROUP BY YourColumnName HAVING COUNT(*) > 1 ORDER BY COUNT(*) DESC
delete t1 from table t1, table t2 where t1.columnA = t2.columnA and t1.rowid>t2.rowid
delete from table t1 using table t2 where t1.columnA = t2.columnA and t1.rowid > t2.rowid
DELETE LU FROM (SELECT *, Row_number() OVER ( partition BY col1, col1, col3 ORDER BY rowid DESC) [Row] FROM mytable) LU WHERE [row] > 1
This will delete duplicate rows, except the first row
DELETE FROM Mytable WHERE RowID NOT IN ( SELECT MIN(RowID) FROM Mytable GROUP BY Col1, Col2, Col3 )