Remove duplicate rows in MySQL
A really easy way to do this is to add a UNIQUE
index on the 3 columns. When you write the ALTER
statement, include the IGNORE
keyword. Like so:
ALTER IGNORE TABLE jobsADD UNIQUE INDEX idx_name (site_id, title, company);
This will drop all the duplicate rows. As an added benefit, future INSERTs
that are duplicates will error out. As always, you may want to take a backup before running something like this...
If you don't want to alter the column properties, then you can use the query below.
Since you have a column which has unique IDs (e.g., auto_increment
columns), you can use it to remove the duplicates:
DELETE `a`FROM `jobs` AS `a`, `jobs` AS `b`WHERE -- IMPORTANT: Ensures one version remains -- Change "ID" to your unique column's name `a`.`ID` < `b`.`ID` -- Any duplicates you want to check for AND (`a`.`title` = `b`.`title` OR `a`.`title` IS NULL AND `b`.`title` IS NULL) AND (`a`.`company` = `b`.`company` OR `a`.`company` IS NULL AND `b`.`company` IS NULL) AND (`a`.`site_id` = `b`.`site_id` OR `a`.`site_id` IS NULL AND `b`.`site_id` IS NULL);
In MySQL, you can simplify it even more with the NULL-safe equal operator (aka "spaceship operator"):
DELETE `a`FROM `jobs` AS `a`, `jobs` AS `b`WHERE -- IMPORTANT: Ensures one version remains -- Change "ID" to your unique column's name `a`.`ID` < `b`.`ID` -- Any duplicates you want to check for AND `a`.`title` <=> `b`.`title` AND `a`.`company` <=> `b`.`company` AND `a`.`site_id` <=> `b`.`site_id`;
MySQL has restrictions about referring to the table you are deleting from. You can work around that with a temporary table, like:
create temporary table tmpTable (id int);insert into tmpTable (id)select idfrom YourTable ytwhere exists ( select * from YourTabe yt2 where yt2.title = yt.title and yt2.company = yt.company and yt2.site_id = yt.site_id and yt2.id > yt.id );delete from YourTablewhere ID in (select id from tmpTable);
From Kostanos' suggestion in the comments:
The only slow query above is DELETE, for cases where you have a very large database. This query could be faster:
DELETE FROM YourTable USING YourTable, tmpTable WHERE YourTable.id=tmpTable.id