http://qs321.pair.com?node_id=11115316


in reply to Re^2: How to check lines that start with the same word then delete one of them
in thread How to check lines that start with the same word then delete one of them

agnes00:

You can use a hash to help you decide what to do on later lines, something like this:

$ cat foo.pl
use strict;
use warnings;

# Read the input file.  Trim trailing whitespace
# and preserve the line number.
my $cnt = 0;
my @inp = map { s/\s+$//; [ ++$cnt, $_ ] } <DATA>;

print "INPUT LINES:\n";
print join(": ", @$_), "\n" for @inp;

# Process the file.  We'll keep the first record for
# each key we find and ignore all successive values
# with two exceptions:  First, we won't process a
# 'foo' record until we've handled a 'baz'.  Second,
# we won't handle a 'baz' record in the first five
# lines.
my %seen;
my @out;
for my $rLine (@inp) {
    # parse out the interesting fields
    my $line_num = $rLine->[0];
    my ($key, $val) = split /\s+/, $rLine->[1];

    # ignore keys we've already processed
    next if $seen{$key};

    # don't process 'foo' until we've handled 'baz'
    next if $key eq 'foo' and ! exists $seen{baz};

    # don't process 'baz' in the first five lines
    next if $key eq 'baz' and $line_num < 5;

    # process the line and remember the key
    push @out, $rLine->[1];
    ++$seen{$key};
}

print "\n\nOUTPUT LINES:\n";
print $_, "\n" for @out;

__DATA__
foo the
bar quick
baz red
bar fox
foo jumped
biz over
bar the
bim lazy
baz red
foo dog

As you process your file, you record the important decisions you've made in the hash to help guide future decisions.

In the example I cobbled together, I used three rules:

  1. Only process a 'foo' record if we've already processed a 'baz' record.
  2. Ignore 'baz' records occurring in the first five lines of the file.
  3. Otherwise, keep the first record of each type we find.

Using these rules, when we run the program we get:

$ perl foo.pl
INPUT LINES:
1: foo the
2: bar quick
3: baz red
4: bar fox
5: foo jumped
6: biz over
7: bar the
8: bim lazy
9: baz red
10: foo dog

OUTPUT LINES:
bar quick
biz over
bim lazy
baz red
foo dog

As you can see, we're able to handle all the rules with a single pass over the file with the help of a little bookkeeping.

As you've guessed in your original post, the nested loop can consume quite a bit of time for a large file. So it's worthwhile to think of ways you can do your processing without having to repeatedly scan the file.

What if you wanted to keep the *last* line starting with each key? One way would be to leave the logic the same, but to process the records in reverse order. Another way would be to change the way you handle the "seen" hash: Instead of checking whether you've processed the key or not, you could store the data you want to keep in it. That way, you can simply overwrite each record with a later record if you want, and then output them at the end. If you're keeping your data in memory, you can even come up with a method to process the data in *one* order and output the data in a *different* order to make your task simpler.
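
For example, here's a minimal sketch of that second approach (keeping the *last* line for each key by letting later lines overwrite earlier ones in the hash, and remembering the order in which keys first appeared):

use strict;
use warnings;

# Keep the *last* line for each key: a later line simply
# overwrites the earlier one stored in the hash.  @order
# remembers when each key first appeared so the output
# order stays stable.
my (%last, @order);
while (my $line = <DATA>) {
    $line =~ s/\s+$//;
    my ($key) = split /\s+/, $line;
    push @order, $key unless exists $last{$key};
    $last{$key} = $line;
}
print "$last{$_}\n" for @order;

__DATA__
foo the
bar quick
baz red
bar fox
foo jumped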

It's often a mistake to jump straight in and solve the problem before thinking about how to simplify things. Sometimes you'll find that a problem could easily be solved if the data came in a more convenient form or order. In those cases, it may be profitable to simply reshape or reorder the data to suit, and then solve the simpler problem.
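
For instance (just a sketch to illustrate the idea), sorting the lines by their first word makes all lines that share a key adjacent, so a single linear pass with no hash at all can drop the duplicate keys; the trade-off is that sorting changes the output order:

use strict;
use warnings;

# Reorder the data so lines with the same key sit next to
# each other, then one linear pass keeps one line per key.
my @lines = sort { (split ' ', $a)[0] cmp (split ' ', $b)[0] } <DATA>;
my $prev_key = '';
for my $line (@lines) {
    my ($key) = split ' ', $line;
    next if $key eq $prev_key;   # same key as the previous line: skip it
    print $line;
    $prev_key = $key;
}

__DATA__
foo the
bar quick
baz red
bar fox
foo jumped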

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re^4: How to check lines that start with the same word then delete one of them
by agnes00 (Novice) on Apr 10, 2020 at 13:20 UTC
    Thanks for your reply. If I tried to resolve my problem by making rules, I'd say that for the first line I won't process data until I find another line with the same first word. And if I don't find one, I won't treat the other lines because the loop has finished, you get me? My problem is to find, for each line, a line that has the same first word, and I should do this for all lines. I don't see how to do this in one loop.

      You may be assuming that for each line you iterate on, you need to compare it to all the other lines. So if you have ten lines, you do 10^2, or 100 comparisons. If you have 40000 lines, you do 40000^2, or 1,600,000,000 comparisons; 1.6 billion comparisons. This is called O(n^2) time.

      The reason it takes 1.6 billion comparisons is that you haven't found a way to look for a needle in a haystack without looking at every other straw every time you pick a straw. Your algorithm takes two linear searches and multiplies them together. O(n) * O(n) = O(n^2). But when people suggest using a hash, they are relying on a piece of knowledge that you haven't considered as a component of your algorithm: with a hash lookup you can find every straw again as soon as you've seen it once. Hash lookups occur in O(1) time. O(n) * O(1) = O(n).

      A hash-based approach isn't hard to conceptualize if you start with the idea of an array. We have to assume our data set is just integers in a reasonable range for now; say 1 to 10000:

      my @seen;
      foreach my $target (@integer_data_set) {
          if (defined($seen[$target])) {
              print "Duplicate: $target\n";
              next;
          }
          $seen[$target] = 1;
          print "New: $target\n";
      }

      In this example, we use the data point as an array index. I think everyone agrees that looking up an array index is really fast. Because it's just a math calculation (address + offset), it's constant time; no matter how big the seen array grows, you can find that millionth element in the array by doing the math problem base_address + 1000000. So with this approach, you can check if a value was previously seen without searching through your list of seen values. You have only a single outer loop, so linear time.

      But your data isn't integers; it's strings, and an algorithm that converts a string to a unique index could quickly result in arrays too big to fit in memory (the offsets would become huge). Arrays only use integers as indexes, so this approach won't work for strings. You really want the ability to find a previously seen string in constant time, as easily as indexing into an array, but you need to limit the number of possible offsets (buckets) to a range that fits in memory. You can (slight oversimplification): a hash lets you treat a string as an index. So the algorithm above becomes:

      my %seen;
      foreach my $target (@string_data_set) {
          if (exists($seen{$target})) {
              print "Duplicate: $target\n";
              next;
          }
          $seen{$target} = 1;
          print "New: $target\n";
      }

      What is this magic that allows us to treat a string as an index? What we refer to as a hash in Perl is a data structure where the key is calculated so as to be easy to find later. Furthermore, the calculation limits the number of buckets to a set of offsets that will fit in a reasonable amount of memory. Fortunately, Perl does this for us. But we can create our own poor-man's version if we want. First we need a hashing algorithm that produces an offset value:

      sub hash {
          my $string     = shift;
          my $buckets    = 100;
          my $hash_value = 0;
          foreach my $ordinal (map { ord } split(//, $string)) {
              $hash_value = ($hash_value + $ordinal) % $buckets;
          }
          return $hash_value;
      }

      This hash function will take any string and turn it into a numeric value between 0 and 99. The method it uses is a simple modulus operation, which is a very primitive hashing algorithm, but it is useful here because it is cyclical in nature. It obviously makes no guarantees about collisions. But it does guarantee that the same string will always result in the same numeric value between 0 and 99. For random strings, the distribution is approximately balanced, too. The next thing we need is a way to deal with collisions. All hashing implementations have this need. In many hash implementations if two entities boil down to the same hash bucket, they're stored in a linked list. When the linked lists get too big to be efficient, the number of buckets is increased. Let's say the number of collisions we allow for is 10. After ten collisions we re-hash with 200 buckets instead of 100. Ten more collisions and we re-hash at 400 buckets, and so on.

      With this strategy, finding an element in our poor-man's hash is a matter of doing this:

      use List::MoreUtils qw(first_ix);   # first_ix comes from List::MoreUtils

      sub find {
          my ($storage_aref, $needle) = @_;
          my $hashed = hash($needle);
          if (defined($storage_aref->[$hashed])) {
              # scan the (small) bucket; first_ix returns -1 when nothing matches
              my $found_ix = first_ix { $needle eq $_->[0] } @{ $storage_aref->[$hashed] };
              return $storage_aref->[$hashed][$found_ix][1] if $found_ix >= 0;
          }
          return;
      }

      This makes some assumptions, such as that undef is not in-band in your data set, and that your storage array looks something like this:

      $storage_aref = [
          [   # Bucket 0.
              ['string'  => 'value'],
              ['another' => 'other value'],
          ],
          [   # Bucket 1.
              ['something' => 'its value'],
          ],
          [...],   # Bucket 2, ...
      ];

      So with this approach you can find the bucket just by calculating the hash value of the string: bucket = base_address + hash(string). And then you search for the string itself within that bucket, which will only have a few items in it. This is a really simplistic representation of how hash based data structures work. Perl provides these built in, and they're a lot more solidly implemented than my example. But the underlying principle is the same; boil a key string down to a repeatable hash value that is represented as an offset which can be reached immediately.
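
      For completeness, here's a rough sketch (my own illustration, reusing the hash() sub above and skipping the re-hashing step) of what the matching store routine for this poor-man's hash could look like:

      sub store {
          my ($storage_aref, $key, $value) = @_;
          my $hashed = hash($key);

          # Each bucket holds [key => value] pairs; if the key is
          # already in its bucket, just overwrite the value.
          for my $pair (@{ $storage_aref->[$hashed] ||= [] }) {
              if ($pair->[0] eq $key) {
                  $pair->[1] = $value;
                  return;
              }
          }

          # New key: append a pair to the bucket (a collision if
          # the bucket already held other keys).
          push @{ $storage_aref->[$hashed] }, [ $key => $value ];
          return;
      }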

      All this is to explain that it is possible to have constant time lookups of strings by calculating their bucket and providing enough buckets that there are an acceptable number of collisions to search through, and this is all hidden behind the humble %hash. So your algorithm can actually be a single loop.


      Dave

      If lines with different beginnings never need to be compared with each other, then you can divide the problem: 1. group the lines by their beginnings, 2. then solve the problem in each group separately.
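
      A minimal sketch of that idea (my own illustration; it assumes the "beginning" is the first whitespace-separated word):

      use strict;
      use warnings;

      # Group lines by their first word into a hash of arrays;
      # each group can then be processed on its own.
      my @lines = ("foo the", "bar quick", "baz red", "bar fox", "foo jumped");
      my %group;
      for my $line (@lines) {
          my ($first) = split /\s+/, $line;
          push @{ $group{$first} }, $line;
      }

      for my $first (sort keys %group) {
          print "Group '$first':\n";
          print "    $_\n" for @{ $group{$first} };
      }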