What's the most efficient way to check for duplicates in an array of data using Perl?

arrays perl

One of the things I love about Perl is it's ability to almost read like English. It just sort of makes sense.

use strict;use warnings;my @array = qw/yes no maybe true false false perhaps no/;my %seen;foreach my $string (@array) {    next unless $seen{$string}++;    print "'$string' is duplicated.\n";}

Output

'false' is duplicated.
'no' is duplicated.

arrays perl

Turning the array into a hash is the fastest way [O(n)], though its memory inefficient. Using a for loop is a bit faster than grep, but I'm not sure why.

#!/usr/bin/perluse strict;use warnings;my %count;my %dups;for(@array) {    $dups{$_}++ if $count{$_}++;}

A memory efficient way is to sort the array in place and iterate through it looking for equal and adjacent entries.

# not exactly sort in place, but Perl does a decent job optimizing it@array = sort @array;my $last;my %dups;for my $entry (@array) {    $dups{$entry}++ if defined $last and $entry eq $last;    $last = $entry;}

This is nlogn speed, because of the sort, but only needs to store the duplicates rather than a second copy of the data in %count. Worst case memory usage is still O(n) (when everything is duplicated) but if your array is large and there's not a lot of duplicates you'll win.

Theory aside, benchmarking shows the latter starts to lose on large arrays (like over a million) with a high percentage of duplicates.

arrays perl

If you need the uniquified array anyway, it is fastest to use the heavily-optimized library List::MoreUtils, and then compare the result to the original:

use strict;use warnings;use List::MoreUtils 'uniq';my @array = qw(1 1 2 3 fibonacci!);my @array_uniq = uniq @array;print ((scalar(@array) == scalar(@array_uniq)) ? "no dupes" : "dupes") . " found!\n";

Or if the list is large and you want to bail as soon as a duplicate entry is found, use a hash:

my %uniq_elements;foreach my $element (@array){    die "dupe found!" if $uniq_elements{$element}++;}

CodeHunter

What's the most efficient way to check for duplicates in an array of data using Perl?

Output

Recent Posts

How can I color dots in a xy scatterplot according to column value?

How to update a claim in ASP.NET Identity?

What does {0} mean when initializing an object?

Accessing members of items in a JSONArray with Java

How to log SQL statements in Spring Boot?

Powershell Get-WebSite name parameter is ignored

How to detect scroll to bottom of html element

Java synchronized method

How to test controllers with CodeIgniter?

Detect Visual Composer

Matplotlib: Specify format of floats for tick labels

Rails join a list of strings with commas and "and" before the last