remove lines matching list of strings

by rootcho (Pilgrim)
Any idea what the fastest way is to remove lines from a file (hundreds of thousands of lines) that match any line from another file or array of strings (thousands of lines)?
Should I build one giant regex from the second file and test each line of the first file against it, or is there a faster way?

Re: remove lines matching list of strings
by TomDLux (Vicar) on Nov 08, 2012 at 21:51 UTC
    grep -v -F -f file2 file1

    English translation: using the Unix grep command, search file1 for lines which do not match (-v) the fixed strings (-F), as opposed to regular expressions, found in file2 (-f file2).

    There are versions of grep available for Windows. Mac OS is Unix-based, so it already ships with grep.

    If you absolutely have to do it in Perl, I would use the lines from file2 as the keys of a hash, assigning the number 1 as the value. Then, as I read the other file, it's trivial to check whether each line is present in the hash.
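
    A minimal sketch of that hash approach, assuming whole-line (exact) matches are wanted. The helper name filter_exact and the in-memory sample data are made up for illustration; real code would open file2 and file1 instead.

```perl
use strict;
use warnings;

# Hypothetical helper: copy lines from $in to $out, skipping any line that
# appears verbatim among the lines read from $list.
sub filter_exact {
    my ($list, $in, $out) = @_;

    # Load the unwanted lines as hash keys, with 1 as the value.
    my %skip;
    while (my $line = <$list>) {
        chomp $line;
        $skip{$line} = 1;
    }

    # Stream the big file; each lookup is a single hash access.
    while (my $line = <$in>) {
        chomp(my $key = $line);
        print {$out} $line unless $skip{$key};
    }
}

# Demo on in-memory data (assumed sample contents, not from the thread).
my $file2 = "beta\ndelta\n";
my $file1 = "alpha\nbeta\ngamma\ndelta\n";
open my $list, '<', \$file2 or die $!;
open my $in,   '<', \$file1 or die $!;
filter_exact($list, $in, \*STDOUT);   # prints the alpha and gamma lines
```

    Only the smaller file is held in memory, and the large file is read one line at a time. This does not help if the strings only need to appear somewhere inside a line.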

      Yes, I was thinking along the same lines... if done in Perl, using a hash seems better than building a giant regex.
      Heh... didn't know about the -F option. Will check it out.

        Neither did I, but I scanned through man grep to make sure I was doing things right. I did have a vague recollection there was an option to search for strings rather than regex ... it helps if you know to search for something.

Re: remove lines matching list of strings
by roboticus (Chancellor) on Nov 08, 2012 at 21:37 UTC


    The fastest way? If you're on a Unix box, I'd try using grep. It's specialized for that sort of task.

    If you just want to do it with Perl, I think reading the entire file into a scalar and then building the giant regex may be the fastest. But you may want to use Benchmark and test to find out which approach is actually faster.
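
    A rough sketch of the slurp-plus-giant-regex idea, assuming the strings are literal text that may appear anywhere in a line. The function name strip_matching_lines and the sample data are hypothetical, and whether this beats the alternatives is something only a benchmark on the real files can tell.

```perl
use strict;
use warnings;

# Hypothetical helper: given the whole file as one scalar plus a list of
# literal strings, return the text with every line containing any of the
# strings removed.
sub strip_matching_lines {
    my ($text, @strings) = @_;

    # Drop empty strings; an empty alternative would match every line.
    @strings = grep { length } @strings;
    return $text unless @strings;

    # quotemeta escapes regex metacharacters so each string matches literally.
    my $alt = join '|', map { quotemeta } @strings;

    # /m lets ^ match at the start of each line; . never crosses a newline.
    $text =~ s/^.*(?:$alt).*(?:\n|\z)//mg;
    return $text;
}

# Demo on assumed sample data.
my $text = "alpha\nfoo bar\ngamma\nbaz.qux\n";
print strip_matching_lines($text, 'bar', 'z.q');   # prints the alpha and gamma lines
```

    Without quotemeta, a string such as 'z.q' would be treated as a pattern and could remove lines it was never meant to match.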


Re: remove lines matching list of strings
by frozenwithjoy (Priest) on Nov 08, 2012 at 21:11 UTC
    A couple of questions to start:
    • Are the lines that you want to remove exact matches between the two files?
    • Are the lines in common in the same order for both files?
      - No, not exact matches.
      - No, the order is random.

