What's the most efficient way to check for duplicates in an array of data using Perl? What's the most efficient way to check for duplicates in an array of data using Perl? arrays arrays

What's the most efficient way to check for duplicates in an array of data using Perl?


One of the things I love about Perl is it's ability to almost read like English. It just sort of makes sense.

use strict;use warnings;my @array = qw/yes no maybe true false false perhaps no/;my %seen;foreach my $string (@array) {    next unless $seen{$string}++;    print "'$string' is duplicated.\n";}

Output

'false' is duplicated.

'no' is duplicated.


Turning the array into a hash is the fastest way [O(n)], though its memory inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.

#!/usr/bin/perluse strict;use warnings;my %count;my %dups;for(@array) {    $dups{$_}++ if $count{$_}++;}

A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.

# not exactly sort in place, but Perl does a decent job optimizing it@array = sort @array;my $last;my %dups;for my $entry (@array) {    $dups{$entry}++ if defined $last and $entry eq $last;    $last = $entry;}

This is nlogn speed, because of the sort, but only needs to store the duplicates rather than a second copy of the data in %count. Worst case memory usage is still O(n) (when everything is duplicated) but if your array is large and there's not a lot of duplicates you'll win.

Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.


If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:

use strict;use warnings;use List::MoreUtils 'uniq';my @array = qw(1 1 2 3 fibonacci!);my @array_uniq = uniq @array;print ((scalar(@array) == scalar(@array_uniq)) ? "no dupes" : "dupes") . " found!\n";

Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:

my %uniq_elements;foreach my $element (@array){    die "dupe found!" if $uniq_elements{$element}++;}