Advertisement
Advertisement


How can I remove duplicate rows?


Question

What is the best way to remove duplicate rows from a fairly large SQL Server table (i.e. 300,000+ rows)?

The rows, of course, will not be perfect duplicates because of the existence of the RowID identity field.

MyTable

RowID int not null identity(1,1) primary key,
Col1 varchar(20) not null,
Col2 varchar(2048) not null,
Col3 tinyint not null
2018/03/28
1
1296
3/28/2018 10:48:58 AM

Accepted Answer

Assuming no nulls, you GROUP BY the unique columns, and SELECT the MIN (or MAX) RowId as the row to keep. Then, just delete everything that didn't have a row id:

DELETE FROM MyTable
LEFT OUTER JOIN (
   SELECT MIN(RowId) as RowId, Col1, Col2, Col3 
   FROM MyTable 
   GROUP BY Col1, Col2, Col3
) as KeepRows ON
   MyTable.RowId = KeepRows.RowId
WHERE
   KeepRows.RowId IS NULL

In case you have a GUID instead of an integer, you can replace

MIN(RowId)

with

CONVERT(uniqueidentifier, MIN(CONVERT(char(36), MyGuidColumn)))
2017/06/12
1149
6/12/2017 11:06:07 AM


There's a good article on removing duplicates on the Microsoft Support site. It's pretty conservative - they have you do everything in separate steps - but it should work well against large tables.

I've used self-joins to do this in the past, although it could probably be prettied up with a HAVING clause:

DELETE dupes
FROM MyTable dupes, MyTable fullTable
WHERE dupes.dupField = fullTable.dupField 
AND dupes.secondDupField = fullTable.secondDupField 
AND dupes.uniqueField > fullTable.uniqueField
2017/06/19

The following query is useful to delete duplicate rows. The table in this example has ID as an identity column and the columns which have duplicate data are Column1, Column2 and Column3.

DELETE FROM TableName
WHERE  ID NOT IN (SELECT MAX(ID)
                  FROM   TableName
                  GROUP  BY Column1,
                            Column2,
                            Column3
                  /*Even if ID is not null-able SQL Server treats MAX(ID) as potentially
                    nullable. Because of semantics of NOT IN (NULL) including the clause
                    below can simplify the plan*/
                  HAVING MAX(ID) IS NOT NULL) 

The following script shows usage of GROUP BY, HAVING, ORDER BY in one query, and returns the results with duplicate column and its count.

SELECT YourColumnName,
       COUNT(*) TotalCount
FROM   YourTableName
GROUP  BY YourColumnName
HAVING COUNT(*) > 1
ORDER  BY COUNT(*) DESC 
2013/01/04

delete t1
from table t1, table t2
where t1.columnA = t2.columnA
and t1.rowid>t2.rowid

Postgres:

delete
from table t1
using table t2
where t1.columnA = t2.columnA
and t1.rowid > t2.rowid
2015/11/07

DELETE LU 
FROM   (SELECT *, 
               Row_number() 
                 OVER ( 
                   partition BY col1, col1, col3 
                   ORDER BY rowid DESC) [Row] 
        FROM   mytable) LU 
WHERE  [row] > 1 
2014/10/15

This will delete duplicate rows, except the first row

DELETE
FROM
    Mytable
WHERE
    RowID NOT IN (
        SELECT
            MIN(RowID)
        FROM
            Mytable
        GROUP BY
            Col1,
            Col2,
            Col3
    )

Refer (http://www.codeproject.com/Articles/157977/Remove-Duplicate-Rows-from-a-Table-in-SQL-Server)

2017/03/17

Source: https://stackoverflow.com/questions/18932
Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Email: [email protected]