What's the most efficient way to check for duplicates in an array of data using Perl?
One of the things I love about Perl is it's ability to almost read like English. It just sort of makes sense.
use strict;use warnings;my @array = qw/yes no maybe true false false perhaps no/;my %seen;foreach my $string (@array) { next unless $seen{$string}++; print "'$string' is duplicated.\n";}
Output
'false' is duplicated.
'no' is duplicated.
Turning the array into a hash is the fastest way [O(n)
], though its memory inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.
#!/usr/bin/perluse strict;use warnings;my %count;my %dups;for(@array) { $dups{$_}++ if $count{$_}++;}
A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.
# not exactly sort in place, but Perl does a decent job optimizing it@array = sort @array;my $last;my %dups;for my $entry (@array) { $dups{$entry}++ if defined $last and $entry eq $last; $last = $entry;}
This is nlogn
speed, because of the sort, but only needs to store the duplicates rather than a second copy of the data in %count
. Worst case memory usage is still O(n)
(when everything is duplicated) but if your array is large and there's not a lot of duplicates you'll win.
Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.
If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:
use strict;use warnings;use List::MoreUtils 'uniq';my @array = qw(1 1 2 3 fibonacci!);my @array_uniq = uniq @array;print ((scalar(@array) == scalar(@array_uniq)) ? "no dupes" : "dupes") . " found!\n";
Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:
my %uniq_elements;foreach my $element (@array){ die "dupe found!" if $uniq_elements{$element}++;}