
remove lines matching list of strings

by rootcho (Pilgrim)
on Nov 08, 2012 at 20:17 UTC ( [id://1002977]=perlquestion )

rootcho has asked for the wisdom of the Perl Monks concerning the following question:

Any idea what the fastest way is to remove lines from a file (hundreds of thousands of lines) that match any line from another file or array of strings (thousands of lines)?
Should I build one giant regex from the second file and test each line of the first file against it, or is there a faster way?

Replies are listed 'Best First'.
Re: remove lines matching list of strings
by TomDLux (Vicar) on Nov 08, 2012 at 21:51 UTC
    grep -v -F -f file2 file1

    English translation: using the Unix grep command, search file1 for lines which do not match ( -v ) the fixed strings ( -F ) ( as opposed to regular expressions ) found in file2 ( -f file2 ).

    There are versions of grep available for Windows. Mac OS has a Unix basis, so it already has it.

    If you absolutely have to do it in Perl, I would use the lines from file2 as the keys of a hash, assigning the number 1 as the value. Then, as I read the other file, it's trivial to check whether each line is present in the hash.
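A minimal sketch of the hash approach described above. It assumes the lines to remove must match whole lines exactly, which is stricter than grep -F (that matches substrings). The sub name remove_exact and the sample data are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$lines that do not appear
# verbatim in @$unwanted. Assumes exact, whole-line matches.
sub remove_exact {
    my ($lines, $unwanted) = @_;

    # Use the unwanted strings as hash keys, with 1 as the value.
    my %skip = map { $_ => 1 } @$unwanted;

    # Keep only the lines that are not present in the hash.
    return grep { !$skip{$_} } @$lines;
}

my @file1 = ('alpha', 'beta', 'gamma', 'delta');   # sample stand-in for file1
my @file2 = ('beta', 'delta');                     # sample stand-in for file2

my @kept = remove_exact(\@file1, \@file2);
print "$_\n" for @kept;
```

When the data comes from real files, either chomp the lines of both files or of neither, so that trailing newlines do not prevent a hash hit.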

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Yes, I was thinking along the same lines... if done in Perl, using a hash seems better than building a giant regex.
      Heh... didn't know about the -F option. Will check it out.

        Neither did I, but I scanned through man grep to make sure I was doing things right. I did have a vague recollection there was an option to search for strings rather than regex ... it helps if you know to search for something.

Re: remove lines matching list of strings
by roboticus (Chancellor) on Nov 08, 2012 at 21:37 UTC


    The fastest way? If you're on a Unix box, I'd try using grep. It's specialized for that sort of task.

    If you just want to do it with Perl, I think reading the entire file into a scalar and then building the giant regex may be the fastest. But you may want to use Benchmark to test which approach is actually faster.
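A rough sketch of the giant-regex idea, assuming the strings should be matched literally as substrings (the same semantics as grep -F). For simplicity it filters an array of lines rather than one slurped scalar, and the sub name remove_matching and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lines of @$lines that contain none of
# the strings in @$unwanted as a substring.
sub remove_matching {
    my ($lines, $unwanted) = @_;

    # quotemeta escapes regex metacharacters so each string is matched
    # literally; empty strings are skipped because they would match
    # every line.
    my @literals = map { quotemeta } grep { length } @$unwanted;

    # With nothing left to remove, keep every line.
    return @$lines unless @literals;

    # Join the escaped strings into one alternation and compile it once.
    my $alternation = join '|', @literals;
    my $re = qr/$alternation/;

    return grep { $_ !~ $re } @$lines;
}

my @file1 = ('keep this', 'drop foo.bar here', 'fooXbar stays', 'also a+b goes');
my @file2 = ('foo.bar', 'a+b');

my @kept = remove_matching(\@file1, \@file2);
print "$_\n" for @kept;
```

Whether this beats the hash lookup or plain grep depends on the data, so it is worth timing both on a realistic sample before committing to one.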


    When your only tool is a hammer, all problems look like your thumb.

Re: remove lines matching list of strings
by frozenwithjoy (Priest) on Nov 08, 2012 at 21:11 UTC
    A couple of questions to start:
    • Are the lines that you want to remove exact matches between the two files?
    • Are the lines in common in the same order for both files?
      - not exact match
      - no, the order is random

Approved by Athanasius