http://qs321.pair.com?node_id=11115312


in reply to Re: How to check lines that start with the same word then delete one of them
in thread How to check lines that start with the same word then delete one of them

I didn't see how a hash can do it. About the check: I test whether the first word, the one before the first semicolon, matches the first word of another line in the same file. Example:
S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0
S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0
Here S_FER_SCAM1_ARRESTO matches.

Re^3: How to check lines that start with the same word then delete one of them
by hippo (Bishop) on Apr 10, 2020 at 12:38 UTC
    use strict;
    use warnings;
    use Test::More tests => 1;

    my @in = (
        'S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0',
        'S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0'
    );

    my @want = (
        'S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0',
    );

    my @have;
    my %seen;
    for (@in) {
        /^(\w+)/;
        if (exists $seen{$1}) {
            next if (/SU LI IR ST/);
            # More code here if it doesn't match - this section not described.
        }
        $seen{$1} //= $_;
        push @have, $_;
    }

    is_deeply \@have, \@want, 'Arrays match';

    See also How to ask better questions using Test::More and sample data.

      Pretty much what I meant, thanks!

      Minor nitpick, I'd assign the first match to a normal var.

      Special vars like $1 can get overwritten easily by "more code" before %seen is set.
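      For instance, a minimal sketch of the same loop body, just copying the capture into a lexical straight away ($key is an illustrative name, not from hippo's post):

      for (@in) {
          my ($key) = /^(\w+)/;      # copy the capture into a lexical immediately
          next unless defined $key;  # line didn't match at all
          if (exists $seen{$key}) {
              next if /SU LI IR ST/;
          }
          $seen{$key} //= $_;
          push @have, $_;
      }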

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Cool - glad we're in agreement.

        You are quite right about limiting the chances of stomping on $1 and friends, of course. The test script could be polished no doubt but I didn't want to spend/waste time on that before agnes00 confirmed that this does actually solve the problem. The requirements as stated were a little wooly.

      Thank you for your answer. The problem is that I've a big file (40,000 lines), that's why I didn't bring it to my post. Also, I can't specify all the wanted lines; I just want to check whether for every line there is another line which has the same first word, and then I check it to delete one of them. That's why I did two loops, but the algorithm is so slow.
        The problem is that I've a big file (40,000 lines), that's why I didn't bring it to my post.

        Everyone is glad you didn't. "Sample data" means just that. Pick maybe 6 lines - enough to be illustrative and to cover the bases. My script above is a test only. It illustrates that the algorithm and the code work, given the sample data.

        I just want to check whether for every line there is another line which has the same first word, and then I check it to delete one of them,

        This is precisely what my test shows. Would you not agree? To turn the test into a working script just replace @in with the code you already have which reads the input data from the file and similarly write @have to your file at the end.
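        For instance, an untested sketch of what that working script might look like ('input.txt' and 'output.txt' are placeholder filenames):

        use strict;
        use warnings;

        # Read the whole input file (placeholder name) into @in.
        open my $in_fh, '<', 'input.txt' or die "Cannot open input.txt: $!";
        chomp(my @in = <$in_fh>);
        close $in_fh;

        # Same de-duplication as in the test above.
        my (%seen, @have);
        for (@in) {
            /^(\w+)/;
            if (exists $seen{$1}) {
                next if /SU LI IR ST/;
            }
            $seen{$1} //= $_;
            push @have, $_;
        }

        # Write the kept lines back out (placeholder name).
        open my $out_fh, '>', 'output.txt' or die "Cannot open output.txt: $!";
        print {$out_fh} "$_\n" for @have;
        close $out_fh;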

        That's why I did two loops, but the algorithm is so slow.

        Your algorithm is O(n^2) whereas mine is O(n). Mine should therefore be thousands of times faster for a 40,000 line dataset.

        See also: Big O notation, SSCCE and Basic Testing Tutorial. HTH.

      I've tested your code with this input:
      my @in = (
          'S_FER_SCAM1_ARRESTO;ARRESTO;ST;0;ST;1;0;TS;0;0',
          'S_VINREU_RLIP1_ALLARMEZONA3;ALLARME ZONA 3;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
          'S_VINREU_RLIP1_ANOMBAT;ANOMALIA BATTERIA;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
          'S_FER_VENT1_ERRCOLINV;ERRORE PROFIBUS COLL INVERTER;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
          'S_VINREU_RLIP1_CIRCZONE1;CIRCUITO ZONA 1 FUNZONANTE;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
          'S_VINREU_RLIP1_CIRCZONE2;CIRCUITO ZONA 2 FUNZONANTE;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
          'S_FER_SCAM1_ARRESTO;ARRESTO;SU LI IR ST;0;SU LI IR ST;1;0;TS;0;0',
          'S_FER_VENT1_ERRCOLINV;ERRORE PROFIBUS COLL INVERTER;ST;0;ST;1;0;TS;0;0'
      );
      I print the @have array and it shows 7 lines, with S_FER_VENT1_ERRCOLINV duplicated (it should show only 6); it hides only one. In my data file I have lines that share the same id ($1) and others that are normal (no duplicate id).

        That's because you have changed the criteria. 'S_FER_VENT1_ERRCOLINV;ERRORE PROFIBUS COLL INVERTER;ST;0;ST;1;0;TS;0;0' does not match the check in your initial post of

        if($var eq $1 and $line2 =~ /(.*?);.*?;SU LI IR ST;.*?;SU LI IR ST;.*?;.*?;.*?;.*?;.*?(?:$)/)

        ... so it has not been removed. Did you not mean this? Was your initial post misleading?

Re^3: How to check lines that start with the same word then delete one of them
by roboticus (Chancellor) on Apr 10, 2020 at 12:53 UTC

    agnes00:

    You can use a hash to help you decide what to do on later lines, something like this:

    $ cat foo.pl
    use strict;
    use warnings;

    # Read the input file.  Trim trailing whitespace
    # and preserve the line number.
    my $cnt = 0;
    my @inp = map { s/\s+$//; [ ++$cnt, $_ ] } <DATA>;

    print "INPUT LINES:\n";
    print join(": ", @$_), "\n" for @inp;

    # Process the file.  We'll keep the first record for
    # each key we find and ignore all successive values
    # with two exceptions:  First, we won't process a
    # 'foo' record until we've handled a 'baz'.  Second,
    # we won't handle a 'baz' record in the first five
    # lines.
    my %seen;
    my @out;
    for my $rLine (@inp) {
        # Parse out the interesting fields
        my $line_num = $rLine->[0];
        my ($key, $val) = split /\s+/, $rLine->[1];

        # ignore keys we've already processed
        next if $seen{$key};

        # don't process 'foo' until we've handled 'baz'
        next if $key eq 'foo' and ! exists $seen{baz};

        # don't process 'baz' in the first five lines
        next if $key eq 'baz' and $line_num < 5;

        # process the line and remember the key
        push @out, $rLine->[1];
        ++$seen{$key};
    }

    print "\n\nOUTPUT LINES:\n";
    print $_, "\n" for @out;

    __DATA__
    foo the
    bar quick
    baz red
    bar fox
    foo jumped
    biz over
    bar the
    bim lazy
    baz red
    foo dog

    As you process your file, you record the important decisions you've made in the hash to help guide future decisions.

    In the example I cobbled together, I used three rules:

    1. Only process a 'foo' record if we've already processed a 'baz' record.
    2. Ignore 'baz' records occurring in the first five lines of the file.
    3. Otherwise, keep the first record of each type we find.

    Using these rules, when we run the program we get:

    $ perl foo.pl
    INPUT LINES:
    1: foo the
    2: bar quick
    3: baz red
    4: bar fox
    5: foo jumped
    6: biz over
    7: bar the
    8: bim lazy
    9: baz red
    10: foo dog

    OUTPUT LINES:
    bar quick
    biz over
    bim lazy
    baz red
    foo dog

    As you can see, we're able to handle all the rules with a single pass over the file with the help of a little bookkeeping.

    As you've guessed in your original post, the nested loop can consume quite a bit of time for a large file. So it's worthwhile to think of ways you can do your processing without having to repeatedly scan the file.

    What if you wanted to keep the *last* line starting with each key? One way would be to leave the logic the same, but to process the records in reverse order. Another way would be to change the way you handle the "seen" hash: Instead of checking whether you've processed the key or not, you could store the data you want to keep in it. That way, you can simply overwrite each record with a later record if you want, and then output them at the end. If you're keeping your data in memory, you can even come up with a method to process the data in *one* order and output the data in a *different* order to make your task simpler.
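    For example, a rough sketch of that overwrite variant, keyed on the first whitespace-separated word as above (@lines is a stand-in for whatever holds the input records):

    my %last;     # key => the most recent line seen for that key
    my @order;    # the order in which keys first appeared
    for my $line (@lines) {
        my ($key) = split /\s+/, $line;
        push @order, $key unless exists $last{$key};
        $last{$key} = $line;    # later lines overwrite earlier ones
    }
    my @out = map { $last{$_} } @order;   # last record per key, in first-seen order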

    It's often a mistake to immediately jump in and solve the problem before thinking about how to simplify things. Sometimes you'll find that a problem could easily be solved if the data came in a more convenient form or order. In those cases, it may be profitable to simply reshape or reorder the data to suit and then solve the simpler problem.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      Thanks for your reply. If I tried to resolve my problem by making rules, I'd say for the first line that I won't process the data until I find another line with the same first word. And if I don't find one, I won't treat the other lines because the loop has already finished, you get me? My problem is to find, for each line, another line that has the same first word, and I should do this for all lines. I don't see how to do this in one loop.

        You may be assuming that for each line you iterate on, you need to compare it to all the other lines. So if you have ten lines, you do 10^2, or 100 comparisons. If you have 40000 lines, you do 40000^2, or 1,600,000,000 comparisons; 1.6 billion comparisons. This is called O(n^2) time.

        The reason it takes 1.6 billion comparisons is that you haven't found a way to look for a needle in a haystack without looking at every other straw every time you pick a straw. Your algorithm takes two linear searches and multiplies them together: O(n) * O(n) = O(n^2). But when people suggest using a hash, they are relying on a piece of knowledge that you haven't considered as a component of your algorithm: with a hash lookup you can find every straw again as soon as you've seen it once. Hash lookups occur in O(1) time. O(n) * O(1) = O(n).

        A hash-based approach isn't hard to conceptualize if you start with the idea of an array. We have to assume our data set is just integers in a reasonable range for now; say 1 to 10000:

        my @seen;
        foreach my $target (@integer_data_set) {
            if (defined($seen[$target])) {
                print "Duplicate: $target\n";
                next;
            }
            $seen[$target] = 1;
            print "New: $target\n";
        }

        In this example, we use the data point as an array index. I think everyone agrees that looking up an array index is really fast. Because it's just a math calculation (address + offset), it's constant time; no matter how big the seen array grows, you can find that millionth element in the array by doing the math problem base_address + 1000000. So with this approach, you can check if a value was previously seen without searching through your list of seen values. You have only a single outer loop, so linear time.

        But your data isn't integers; it's strings, and an algorithm that converts a string to a unique index could quickly result in arrays too big to fit in memory (the offsets would become huge). Arrays only use integers as indexes, so this approach won't work for strings. You really want the ability to find a previously seen string in constant time, as easily as indexing into an array, but you need to limit the number of possible offsets (buckets) to a range that fits in memory. You can (this is a slight oversimplification): a hash lets you treat a string as an index. So the algorithm above becomes:

        my %seen;
        foreach my $target (@string_data_set) {
            if (exists($seen{$target})) {
                print "Duplicate: $target\n";
                next;
            }
            $seen{$target} = 1;
            print "New: $target\n";
        }

        What is this magic that allows us to treat a string as an index? What we refer to as a hash in Perl is a data structure where the key is calculated so as to be easy to find later. Furthermore, the calculation limits the number of buckets to a set of offsets that will fit in a reasonable amount of memory. Fortunately Perl does this for us. But we can create our own poor-man's version if we want. First we need a hashing algorithm that produces an offset value:

        sub hash {
            my $string = shift;
            my $buckets = 100;
            my $hash_value = 0;
            foreach my $ordinal (map { ord } split(//, $string)) {
                $hash_value = ($hash_value + $ordinal) % $buckets;
            }
            return $hash_value;
        }

        This hash function will take any string and turn it into a numeric value between 0 and 99. The method it uses is a simple modulus operation, which is a very primitive hashing algorithm, but it is useful here because it is cyclical in nature. It obviously makes no guarantees about collisions. But it does guarantee that the same string will always result in the same numeric value between 0 and 99. For random strings, the distribution is approximately balanced, too. The next thing we need is a way to deal with collisions. All hashing implementations have this need. In many hash implementations if two entities boil down to the same hash bucket, they're stored in a linked list. When the linked lists get too big to be efficient, the number of buckets is increased. Let's say the number of collisions we allow for is 10. After ten collisions we re-hash with 200 buckets instead of 100. Ten more collisions and we re-hash at 400 buckets, and so on.
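        To make that concrete, the store step of such a poor-man's hash might look something like this (a toy sketch of my own that ignores the re-hashing/bucket-growing part; it assumes the hash() sub above and the bucket-of-pairs layout shown further down):

        sub store {
            my ($storage_aref, $key, $value) = @_;
            my $bucket = hash($key);    # which bucket this key lands in
            push @{ $storage_aref->[$bucket] }, [ $key => $value ];
            return;
        }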

        With this strategy, finding an element in our poor-man's hash is a matter of doing this:

        sub find {
            my ($storage_aref, $needle) = @_;
            my $hashed = hash($needle);
            if (defined($storage_aref->[$hashed])) {
                # first_ix: index of the pair whose key matches $needle
                # (cf. firstidx / first_index in List::MoreUtils)
                my $found_ix = first_ix { $needle eq $_->[0] }
                                        @{ $storage_aref->[$hashed] };
                return $storage_aref->[$hashed]->[$found_ix]->[1];
            }
            return;
        }

        This makes some assumptions, such as one that undef is not in-band in your data set, and that your storage array looks something like this:

        $storage_aref = [
            [   # Bucket 0.
                ['string'  => 'value'],
                ['another' => 'other value'],
            ],
            [   # Bucket 1.
                ['something' => 'its value'],
            ],
            [...],   # Bucket 2, ...
        ];

        So with this approach you can find the bucket just by calculating the hash value of the string: bucket = base_address + hash(string). And then you search for the string itself within that bucket, which will only have a few items in it. This is a really simplistic representation of how hash based data structures work. Perl provides these built in, and they're a lot more solidly implemented than my example. But the underlying principle is the same; boil a key string down to a repeatable hash value that is represented as an offset which can be reached immediately.

        All this is to explain that it is possible to have constant time lookups of strings by calculating their bucket and providing enough buckets that there are an acceptable number of collisions to search through, and this is all hidden behind the humble %hash. So your algorithm can actually be a single loop.


        Dave

        If you never need to compare lines that have different beginnings, then you can divide the problem: 1. group the lines by their beginnings, 2. then solve the problem within each group separately.
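        A minimal sketch of that grouping step, assuming (as in the sample data above) that the "beginning" is the first semicolon-separated field and that @lines holds the input records:

        my %groups;
        for my $line (@lines) {
            my ($key) = split /;/, $line;    # first field, e.g. S_FER_SCAM1_ARRESTO
            push @{ $groups{$key} }, $line;
        }
        # Each $groups{$key} now holds only the lines sharing that first field,
        # so each group can be handled on its own (e.g. keep one line per group).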