lukka has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I am a newbie to Perl. I need to do something like the Unix grep in Perl. Say I have a large file BUFFER.dat (which has 322129 lines), and another large array @validbufs (which can have, say, 200000 elements). What I need to do is print out the lines in BUFFER.dat whose first column exactly matches any element of @validbufs. The problem is that this lookup seems to take far too much time. I am only showing a few lines of the code here, which I think are the problematic sections:
my %linecontainsbuf = ();
while ($line = <BUFFER>) {
    @fields      = split /\'/, $line;
    $searchfield = $fields[1];
    $linecontainsbuf{$searchfield} = $line for @validbufs;
}
foreach $validbuf (@validbufs) {
    print $linecontainsbuf{$validbuf};
}
It seems to work OK when the file and array are small: if BUFFER.dat has 10000 lines and @validbufs has 28 elements, it finishes in 2 minutes. But as soon as BUFFER.dat has a large number of lines (e.g. 322129) and @validbufs has 200000 elements, it takes hours! Please note: @validbufs is a unique list of strings, and in BUFFER.dat the first column is also always unique; there are no duplicates in either input. @validbufs can have anywhere from about 50 to 200000 elements. So if @validbufs has 50 elements, the script should print exactly the 50 lines of BUFFER.dat that match them, and if it has 200000 elements, it should print the 200000 matching lines. I tried splitting BUFFER.dat into chunks of 1000 lines each and doing the lookup on the split files, but even that is very slow (takes hours). Can you please suggest a fast way to do this lookup?

Replies are listed 'Best First'.
Re: how to speed lookups?
by ikegami (Pope) on Nov 11, 2008 at 21:06 UTC

    You are doing 64,425,800,000 assignments where you should be doing 322,129.

    $linecontainsbuf{$searchfield} = $line for @validbufs;
    should be
    $linecontainsbuf{$searchfield} = $line;

    Also, using split does more work than you need. Try replacing

    @fields = split /\'/, $line;
    $searchfield = $fields[1];

    with

    ($searchfield) = $line =~ /^'([^']*)/;
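    Putting the two fixes together, here is a minimal self-contained sketch. The @validbufs values and the in-memory stand-in for BUFFER.dat are made up for illustration; the real script would open the actual file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative stand-ins; the real script reads BUFFER.dat and
# gets @validbufs from elsewhere.
my @validbufs = ('buf_c', 'buf_a');
my $buffer_dat = <<'EOF';
'buf_a' first line
'buf_b' second line
'buf_c' third line
EOF

open my $buffer, '<', \$buffer_dat or die $!;

my %linecontainsbuf;
while ( my $line = <$buffer> ) {
    # One assignment per input line -- no "for @validbufs".
    my ($searchfield) = $line =~ /^'([^']*)/;
    next unless defined $searchfield;
    $linecontainsbuf{$searchfield} = $line;
}

for my $validbuf (@validbufs) {
    print $linecontainsbuf{$validbuf}
        if exists $linecontainsbuf{$validbuf};
}
```

    This does one hash insert per line of the file and one hash lookup per element of @validbufs, so the total work is linear in the sizes of the two inputs.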
      Thanks a million for pointing out the offending line. As you correctly said, I was doing those assignments unnecessarily. Just implementing your first suggestion (removing the for @validbufs part) brought the runtime down to about 2 minutes, from the more than 2 hours (probably even more) that my erroneous code was taking. Thanks again!
      Also, using split does more work than you need. Try replacing

      I personally believe that people do tend to overuse split. But at the same time, IIRC, split is optimized if an explicit LIMIT parameter is given, and maybe (I only have a vague memory of this, and I admit I may just have "invented" it...) even simply if a specific number of items are assigned, i.e.:

      my ($searchfield) = split /'/, $line;

      Update: as far as the last point is concerned, I definitely stand corrected, as per ikegami's remark, which I thoroughly trust.

      But at this point one should certainly run a benchmark to be sure, and I don't have the slightest intention of doing so, especially since the OP's question has bigger issues, to the point of not really being understandable at all, as far as I'm concerned...

      If you can't understand the incipit, then please check the IPB Campaign.

        While the OP mentions fetching the first field, you'll notice the OP used $fields[1], not $fields[0]. I believe that means he's using split to parse a single-quoted string. That's not what I'd call an appropriate use of split.

        So in this case, the optimal split would be

        my $searchfield = ( split /'/, $line, 3 )[1];

        Using split requires matching twice, creating three variables and copying the entire string.

        my ($searchfield) = $line =~ /^'([^']*)/;

        Using the match operator requires matching once, creating one variable and copying only the field. Actual performance aside, it's definitely a cleaner process.

        Finally, split will be very bad at handling escaping when the OP discovers the need for it.
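        For the curious, the two approaches are easy to compare with the core Benchmark module. A minimal sketch (the sample line is made up, and relative timings will vary by machine and input):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $line = q{'some_buffer_name' and the rest of a typical line};

# Run each sub for at least one CPU second and print a
# comparison table of iterations per second.
cmpthese( -1, {
    split_limit => sub {
        my $f = ( split /'/, $line, 3 )[1];
    },
    match_once => sub {
        my ($f) = $line =~ /^'([^']*)/;
    },
} );
```

        Both extract the same field; cmpthese just reports how many times per second each version runs.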

        maybe ([...]) even simply if a specific number of items are assigned

        An operand doesn't know what the caller will do with the returned list, so your "maybe" doesn't apply. If it behaves as you think, the following snippet will result in $c being assigned 2 when it should be assigned 3.

        $c = ($f1,$f2) = (4,5,6);
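        That snippet is easy to verify directly; nothing here is from the thread, it is just a sanity check of the claim:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my ($f1, $f2);
my $c = ( ($f1, $f2) = (4, 5, 6) );

# A list assignment in scalar context returns the number of
# elements on the right-hand side, not the number of variables
# actually assigned.
print "c=$c f1=$f1 f2=$f2\n";   # c=3 f1=4 f2=5
```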
Re: how to speed lookups?
by dragonchild (Archbishop) on Nov 11, 2008 at 21:01 UTC
    What is $linecontainsbuf{$searchfield} = $line for @validbufs; doing? You're also going to have memory issues, because you're essentially loading both files into RAM.

    Best would be to sort the two files on the relevant keys (using an external sort program), then iterate through the filehandles. Remember - once you've seen something in %valid_bufs, you can't see it again.
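    A sketch of that merge approach, using in-memory string filehandles to stand in for the two pre-sorted inputs (all data here is made up; in real use you would sort BUFFER.dat and the key list with an external sort(1) first):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the two inputs, each already sorted on the key.
my $buffer_sorted = <<'EOF';
'alpha' line for alpha
'bravo' line for bravo
'delta' line for delta
EOF
my $keys_sorted = <<'EOF';
alpha
delta
EOF

open my $buf,  '<', \$buffer_sorted or die $!;
open my $keys, '<', \$keys_sorted   or die $!;

my @kept;
my $line = <$buf>;
my $want = <$keys>;
chomp $want if defined $want;

# Classic merge join: advance whichever stream is behind.
while ( defined $line and defined $want ) {
    my ($key) = $line =~ /^'([^']*)/;
    if ( !defined $key or $key lt $want ) {
        $line = <$buf>;                 # buffer line is behind
    }
    elsif ( $key gt $want ) {
        $want = <$keys>;                # key list is behind
        chomp $want if defined $want;
    }
    else {                              # exact match: keep the line
        push @kept, $line;
        $line = <$buf>;
        $want = <$keys>;
        chomp $want if defined $want;
    }
}
print @kept;
```

    Since both inputs are read sequentially and never held in memory at once, this keeps memory use constant regardless of file size.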

    My criteria for good software:
    1. Does it work?
    2. Can someone else come in, make a change, and be reasonably certain no bugs were introduced?
Re: how to speed lookups?
by duckyd (Hermit) on Nov 12, 2008 at 00:45 UTC
    A couple of other changes you might consider are using a hash to check if a line is one that should be kept, and outputting lines you want to keep immediately, rather than accumulating them all in memory:
    my %validbufs = map { $_ => 1 } @validbufs;
    while ($line = <BUFFER>) {
        $searchfield = (split /\'/, $line)[1];
        print $line if $validbufs{$searchfield};
    }
      Thank you very much for the suggestions. I will try them out. This forum is awesome!
Re: how to speed lookups?
by blazar (Canon) on Nov 12, 2008 at 19:55 UTC

    I personally believe I have some problems understanding your question: I'm very tired so it may be just me, but I'm trying to recap from your description of the problem, and please tell me if I'm getting anything wrong...

    1. "@validbufs is a unique list of strings." (Taken verbatim!) Thus not regexen...
    2. You need to "print out the lines in BUFFER.dat that have the first column (in that line) exactly matching any element in the array @validbufs." (Also verbatim but for a typo.) Thus what does it mean to "exactly match" in this context?

    I think I eventually got it: you want to print those lines of your input file, i.e. BUFFER.dat, whose first field is an element of @validbufs. Is that a correct rephrasing of your problem? If so, the general rule is: always turn @validbufs into a hash, say %isvalid. (In 5.10 times you can take advantage of the smartmatch operator ~~, but I'm sure that for a problem of this size a hash is still better.) This can be as simple as:

    my %isvalid;   # outside of the loop over lines, of course!
    @isvalid{@validbufs} = (1) x @validbufs;

    Then you'd simply

    print $line if $isvalid{$field};

    with $field got as per ikegami's suggestion or perhaps not even created as an intermediate variable.

    Now, since, as you say, there is a 1-to-1 correspondence between matching lines in the file and entries of @validbufs, one may be tempted to delete each key from %isvalid as soon as it matches, to shrink the hash, because once a key has matched it won't match again. But given the work this would inflict on %isvalid, and how efficient hash lookups already are, I'm sure that both a big-O analysis and experimental evidence would show it takes more work overall than not doing it...
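    Putting the hash-slice initialization and the streaming print together, a minimal self-contained sketch (the data and @validbufs values are made up; the real script would read BUFFER.dat):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @validbufs = ('buf_c', 'buf_a');   # illustrative values
my $buffer_dat = <<'EOF';
'buf_a' first line
'buf_b' second line
'buf_c' third line
EOF

# Hash-slice initialization: every element of @validbufs becomes
# a key of %isvalid with value 1.
my %isvalid;
@isvalid{@validbufs} = (1) x @validbufs;

open my $buffer, '<', \$buffer_dat or die $!;
my @kept;
while ( my $line = <$buffer> ) {
    my ($field) = $line =~ /^'([^']*)/;
    push @kept, $line if defined $field and $isvalid{$field};
}
print @kept;   # the buf_a and buf_c lines, in file order
```

    Printing (or collecting) matches as they are read means only the matching lines are ever accumulated, rather than the whole file.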
