You may be assuming that for each line you iterate on, you need to compare it to all the other lines. So if you have ten lines, you do 10^2, or 100 comparisons. If you have 40,000 lines, you do 40,000^2: 1.6 billion comparisons. This is called O(n^2) time.
The reason it takes 1.6 billion comparisons is that you haven't found a way to look for a needle in a haystack without looking at every other straw each time you pick a straw. Your algorithm takes two linear searches and multiplies them together: O(n) * O(n) = O(n^2). But when people suggest using a hash, they are relying on a piece of knowledge you haven't considered as a component of your algorithm: with a hash lookup you can find any straw again as soon as you've seen it once. Hash lookups occur in O(1) time, and O(n) * O(1) = O(n).
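To make the contrast concrete, here is a minimal sketch of the O(n^2) version: a nested loop where each line is compared against every line before it. (The sub name and sample data are just for illustration.)

```perl
use strict;
use warnings;

# O(n^2) duplicate detection: every line is compared against every
# line before it, so n lines cost roughly n*n/2 comparisons.
sub find_duplicates_naive {
    my @lines = @_;
    my @dups;
    for my $i (0 .. $#lines) {
        for my $j (0 .. $i - 1) {
            if ($lines[$i] eq $lines[$j]) {
                push @dups, $lines[$i];
                last;    # No need to keep scanning this line's predecessors.
            }
        }
    }
    return @dups;
}

print "Duplicate: $_\n" for find_duplicates_naive(qw(alpha beta alpha gamma beta));
```

Every hash-based version below does the same job in a single pass.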
A hash-based approach isn't hard to conceptualize if you start with the idea of an array. We have to assume our data set is just integers in a reasonable range for now, say 1 to 10000:
my @seen;
foreach my $target (@integer_data_set) {
    if (defined($seen[$target])) {
        print "Duplicate: $target\n";
        next;
    }
    $seen[$target] = 1;
    print "New: $target\n";
}
In this example, we use the data point as an array index. I think everyone agrees that looking up an array index is really fast. Because it's just a math calculation (address + offset), it's constant time; no matter how big the seen array grows, you can find that millionth element in the array by doing the math problem base_address + 1000000. So with this approach, you can check if a value was previously seen without searching through your list of seen values. You have only a single outer loop, so linear time.
But your data isn't integers; it's strings, and an algorithm that converts a string to a unique index could quickly result in arrays too big to fit in memory (the offsets would become huge). Arrays only use integers as indexes, so this approach won't work for strings directly. You really want the ability to find a previously seen string in constant time, as easily as indexing into an array, but you need to limit the number of possible offsets (buckets) to a range that fits in memory. You can (slight oversimplification): a hash lets you treat a string as an index. So the algorithm above becomes:
my %seen;
foreach my $target (@string_data_set) {
    if (exists($seen{$target})) {
        print "Duplicate: $target\n";
        next;
    }
    $seen{$target} = 1;
    print "New: $target\n";
}
What is this magic that allows us to treat a string as an index? What we refer to as a hash in Perl is a data structure where the key's location is calculated, so that it is easy to find later. Furthermore, the calculation limits the number of buckets to a set of offsets that will fit in a reasonable amount of memory. Fortunately, Perl does this for us. But we can create our own poor man's version if we want. First we need a hashing algorithm that produces an offset value:
sub hash {
    my $string  = shift;
    my $buckets = 100;
    my $hash_value = 0;
    # Sum the ordinal values of the characters, folding the running
    # total back into the 0..99 range as we go.
    foreach my $ordinal (map { ord } split(//, $string)) {
        $hash_value = ($hash_value + $ordinal) % $buckets;
    }
    return $hash_value;
}
This hash function will take any string and turn it into a numeric value between 0 and 99. The method it uses, summing character ordinals under a modulus, is a very primitive hashing algorithm, but it is useful here because the modulus keeps the result cycling within a fixed range. It obviously makes no guarantees about collisions. But it does guarantee that the same string will always produce the same numeric value between 0 and 99, and for random strings the distribution is approximately balanced.
The next thing we need is a way to deal with collisions; all hashing implementations have this need. In many hash implementations, if two entities boil down to the same hash bucket, they're stored in a linked list. When the linked lists get too big to be efficient, the number of buckets is increased. Let's say the number of collisions we allow for is ten. After ten collisions we rehash with 200 buckets instead of 100; ten more collisions and we rehash at 400 buckets, and so on.
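Collisions are easy to produce with this particular hash function: because it just sums ordinals, any two anagrams land in the same bucket. A quick demonstration (the hash sub is repeated here so the snippet runs standalone):

```perl
use strict;
use warnings;

# Same ordinal-sum hashing sub as above, reproduced for a standalone run.
sub hash {
    my $string  = shift;
    my $buckets = 100;
    my $hash_value = 0;
    foreach my $ordinal (map { ord } split(//, $string)) {
        $hash_value = ($hash_value + $ordinal) % $buckets;
    }
    return $hash_value;
}

# Anagrams always collide: the sum of ordinals is order-independent.
printf "listen -> bucket %d\n", hash('listen');
printf "silent -> bucket %d\n", hash('silent');
```

Both strings hash to bucket 55, which is exactly the kind of collision the bucket chains below have to absorb.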
With this strategy, finding an element in our poorman's hash is a matter of doing this:
use List::MoreUtils qw(first_index);

sub find {
    my ($storage_aref, $needle) = @_;
    my $hashed = hash($needle);
    if (defined($storage_aref->[$hashed])) {
        # Linear search, but only within this one small bucket.
        my $found_ix = first_index { $needle eq $_->[0] }
                       @{ $storage_aref->[$hashed] };
        return $storage_aref->[$hashed][$found_ix][1] if $found_ix >= 0;
    }
    return;
}
This makes some assumptions, such as that undef is not in-band in your data set, and that your storage array looks something like this:
$storage_aref = [
    [ # Bucket 0.
        ['string'  => 'value'],
        ['another' => 'other value'],
    ],
    [ # Bucket 1.
        ['something' => 'its value'],
    ],
    [...], # Bucket 2.
    ...
];
So with this approach you can find the bucket just by calculating the hash value of the string: bucket = base_address + hash(string). And then you search for the string itself within that bucket, which will only have a few items in it. This is a really simplistic representation of how hash-based data structures work. Perl provides these built in, and they're a lot more solidly implemented than my example. But the underlying principle is the same: boil a key string down to a repeatable hash value, represented as an offset which can be reached immediately.
All this is to explain that it is possible to have constant-time lookups of strings by calculating their bucket, and by providing enough buckets that each one holds only an acceptable number of collisions to search through. All of this is hidden behind the humble %hash, so your algorithm can actually be a single loop.
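Tying the pieces together, here is a sketch of the single-loop duplicate check running against the poor man's hash instead of Perl's built-in %hash. The glue code and sample data are hypothetical; find() here uses List::Util's first rather than first_index, since we only need the value:

```perl
use strict;
use warnings;
use List::Util qw(first);

# Ordinal-sum hash, as above.
sub hash {
    my $string  = shift;
    my $buckets = 100;
    my $hash_value = 0;
    $hash_value = ($hash_value + $_) % $buckets for map { ord } split //, $string;
    return $hash_value;
}

# Chain a key/value pair onto its bucket.
sub insert {
    my ($storage_aref, $key, $value) = @_;
    push @{ $storage_aref->[ hash($key) ] }, [ $key => $value ];
}

# Look up a key: hash to its bucket, then scan that (short) chain.
sub find {
    my ($storage_aref, $needle) = @_;
    my $bucket = $storage_aref->[ hash($needle) ] or return;
    my $pair = first { $_->[0] eq $needle } @$bucket;
    return $pair ? $pair->[1] : undef;
}

# Single pass: one expected-O(1) lookup and at most one insert per line.
my $storage = [];
for my $line (qw(alpha beta alpha gamma)) {
    if (defined find($storage, $line)) {
        print "Duplicate: $line\n";
    }
    else {
        insert($storage, $line, 1);
        print "New: $line\n";
    }
}
```

Swap %seen back in for the storage array and the two subs, and you get the short version from earlier in the post.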
