Hi, I am new to Perl. I need to do something like the Unix grep in Perl. Say I have a large file, BUFFER.dat (322129 lines), and a large array, @validbufs (which can have, say, 200000 elements). I need to print the lines in BUFFER.dat whose first column exactly matches any element of @validbufs. The problem is that this lookup takes far too long. I am showing only a few lines of the code here, the sections I think are the slow ones:
my %linecontainsbuf = ();
while ($line = <BUFFER>) {
    @fields = split /\'/, $line;
    $searchfield = $fields[1];
    $linecontainsbuf{$searchfield} = $line for @validbufs;
}
foreach $validbuf (@validbufs) {
    print $linecontainsbuf{$validbuf};
}
It works fine when the file and array are small. If BUFFER.dat has 10000 lines and @validbufs has 28 elements, it finishes in 2 minutes. But as soon as BUFFER.dat has a large number of lines (e.g. 322129) and @validbufs has 200000 elements, it takes hours!

Please note: @validbufs is a unique list of strings, and the first column in BUFFER.dat is also always unique. There are no duplicates in the input data (neither in BUFFER.dat nor in @validbufs). @validbufs can have anywhere from about 50 to 200000 elements. So if @validbufs has 50 elements, the script should print just the 50 lines in BUFFER.dat that match them. If @validbufs has 200000 elements, the script should print the 200000 matching lines.

I tried splitting BUFFER.dat into chunks of 1000 lines each and doing the lookup on the split files, but even that is very slow (it takes hours). Can you please suggest a fast way to do this lookup?

In reply to how to speed lookups? by lukka
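
One likely culprit in the posted loop is the statement modifier `for @validbufs`: it repeats the same hash assignment once per element of @validbufs for every line read, which at the sizes quoted is roughly 322129 × 200000 ≈ 6.4 × 10^10 assignments. A common alternative is to turn @validbufs into a hash once and do a single `exists` test per line. The following is only a sketch under assumptions: the helper name `matching_lines` and the sample data are made up for illustration, and the key is taken to be the text between the first pair of single quotes, as in the posted `split`.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): returns the lines read
# from $fh whose quote-delimited key is one of the names in @$validbufs.
sub matching_lines {
    my ($fh, $validbufs) = @_;

    # Build the lookup table once: one hash key per valid buffer name.
    my %is_valid = map { $_ => 1 } @$validbufs;

    my @matches;
    while (my $line = <$fh>) {
        # Same field extraction as the posted code: the text between the
        # first pair of single quotes (an assumption about the file layout).
        my @fields = split /'/, $line;
        my $key = $fields[1];
        next unless defined $key;

        # One hash lookup per line instead of a pass over the whole array.
        push @matches, $line if exists $is_valid{$key};
    }
    return @matches;
}

# Small in-memory sample standing in for BUFFER.dat (made-up data).
my $sample = "'buf1' 10 20\n'buf2' 30 40\n'buf3' 50 60\n";
open my $fh, '<', \$sample or die "Cannot open sample: $!";

my @validbufs = ('buf3', 'buf1');
my @out = matching_lines($fh, \@validbufs);
close $fh;

print @out;
```

With this approach the matching lines come out in file order, not in @validbufs order. If the array order matters, one option is to store each matching line in a second hash keyed by the field and then loop over @validbufs afterwards, as the posted code does in its final `foreach`.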
