Beefy Boxes and Bandwidth Generously Provided by pair Networks
We don't bite newbies here... much
 
PerlMonks  

Filtering lines from one file from another

by antonn (Initiate)
on Feb 03, 2011 at 02:40 UTC ( #885883=perlquestion: print w/replies, xml ) Need Help??

antonn has asked for the wisdom of the Perl Monks concerning the following question:

Hi everyone, I'm trying to delete the lines from one file that appears in another file. It would be like this:
filter.pl file1 file2

And file1 contains:
cat dogwhisperer umbrella mother sun
file2:
dog sun moon umbrella rain
And the output should be:
cat dogwhisperer mother
I'm trying to read the whole first file and then loop reading lines from the other file doing grep, but I'm failing with this... Edit: I think now it kinda works, I was trying with $_ inside grep but it doesn't seem to work.
open(RANDOM,"<",$frandom) or die("could not open first file!"); my @r_contents = <RANDOM>; close(RANDOM); open(FILTER,"<",$ffilter) or die("could not open the second file!"); while(<FILTER>) { chomp; my $line = $_; @r_contents = grep(!/^$line$/,@r_contents); } close(FILTER); print @r_contents;

Replies are listed 'Best First'.
Re: Filtering lines from one file from another
by bellaire (Hermit) on Feb 03, 2011 at 03:31 UTC
    If you supply some code, you can get some Perl-specific help. Until then, as an aside, you can do this without perl at all if you're on a system with grep installed:
    $ grep -vf file2 file1 cat dogwhisperer mother
      bellaire,
      First, I believe your grep usage would need to be modified because as-is, the lines will be treated as regular expressions and not fixed strings. Second, I wonder how this would work if arbitrarily large files need to be supported. For instance, sort resolves this issue by using an on-disk divided and conquer strategy so assuming you have enough free filesystem, you can sort any size file. I just don't know how grep would do in this situation.

      Cheers - L~R

        I have tried with grep -xvf file2 file1 and seems to work
Re: Filtering lines from one file from another
by LanX (Sage) on Feb 03, 2011 at 02:46 UTC
    > I'm trying to read the whole first file and then loop reading lines from the other file doing grep, but I'm failing with this...

    sounds good, show us your code in order to help you! :)

    Cheers Rolf

      Thanks, I have edited and posted my solution that seems to work, could be any problem with that solution?
Re: Filtering lines from one file from another
by mertserger (Curate) on Feb 03, 2011 at 09:16 UTC
    Another non-Perl solution would be the Unix command "comm"
    comm -23 file1 file2 would do it.
    I like comm because you can get lines in file1 and not in 2, in 2 and not in 1 and lines common to both.
      mertserger,
      The problem with comm is that it requires the files to be sorted (which they are not). If antonn doesn't need to preserve order than this is a fine solution. It is actually a fairly difficult problem to solve if you self-impose a number of constraints such as order preservation and arbitrarily large files.

      Cheers - L~R

Re: Filtering lines from one file from another
by pavunkumar (Scribe) on Feb 03, 2011 at 05:37 UTC
    Hi, You can try this,
    
    # open file1, fil2 ; open F1 , "$file1" or die "Cant open : $! \n" ; open F2 , "$file2" or die "Cant open : $! \n" ; # Create the hash key with each line of the file2 while (<F2> ) { chomp; $file{$_}=''; } # Print the line , if key does not exist in the hash ; while (<F1> ) { chomp ; print $_ , "\n" unless(exists ( $file{$_} ) ) ; }
      pavunkumar,
      Setting aside that you are using bare word filehandles, this is a fine solution assuming one of the files can fit into memory as a perl hash. It may be able to be improved by determining which of the two files is smaller and loading that one into memory (assuming which file should be considered the filter isn't important to antonn). Unfortunately, this solution fails if you need to support arbitrarily large files. I know the description of the problem says one of the files is being read into memory, I am just thinking about the general problem.

      Cheers - L~R

Re: Filtering lines from one file from another
by Limbic~Region (Chancellor) on Feb 04, 2011 at 00:42 UTC
    antonn,
    I am sorry if your first post at the Monastery didn't go as well as you might have hoped or expected. I encourage you to stick around but also to read How (Not) To Ask A Question and How do I post a question effectively?.

    Since you already have a working solution, I am going to give you far more than you wanted to know about how this problem might be approached as a research project. First consider the most straight forward implementation. Read 1 file line by line while repeatedly opening/reading the lines of the second file. As a rule of thumb, don't reach for a regex when eq or index will do. This solution will work but is suboptimal because it does not scale well O(N * M) and IO is notoriously slow.

    Assuming you have enough memory, you could improve this solution by completely reading 1 file into an in-memory array. This still leaves an O(N * M) algorithm but you have dramatically reduced the IO. If instead we use an in-memory hash, we can reduce IO and effectively change the algorithm to O(N) or O(N + M) if we don't ignore the time to construct the hash is based on the size of the file.

    You may decide using an in-memory hash of the filter file is great until you run into a situation where the filter file will no longer fit into memory. An easy solution to this problem is to sort both of the files. You can then read them both at the same time. Essentially, you read one line from each file and compare them. Which ever is less than the other, you continue to read lines from that file comparing it until you find a line that is equal to or greater than your other line. I assume the rest of the algorithm is obvious, but if you need more help let me know. If you don't look too closely, this algorithm is O(N + M).

    Again, this may look like a fine solution. You can even use existing solutions like comm. What happens when the output needs to appear in the same order as in the original file? In this case, you need to find a way to have our cake and eat it too. One solution may be to load the filter file into a database with an index. This effectively takes us back to the in-memory hash solution but on disk. You are now doing extra IO but it is minimized because of the efficiency in finding the data you are looking for. Since an entire DB is overkill, you might instead use DB_File.

    If you have read my Necessity is the mother of invention meditation, perhaps you might want to further your skill level by trying to solve this problem without the aid of a DB as described above. Since we are contriving this situation, let's suppose environmental conditions prevent you. You could read the filter file line by line and generate a Digest::MD5 hash and write them out to a new file. Next, we can sort the file numerically - remember the hash digest can be treated as 128 bit integer. Now we read the file to be filtered line by line, generate a MD5 hash and then perform a binary search of the filter file. The binary search is possible because we know how many lines are in the file (wc for instance) and each line is fixed width. This way we can calculate where we need to seek to. I haven't conducted rigorous algorithm analysis but this would appear to be O((N * Log M) + M).

    As you can see, something you thought of simple question can turn into an expedition into learning. I hope this helps.

    Cheers - L~R

Re: Filtering lines from one file from another
by umasuresh (Hermit) on Feb 03, 2011 at 19:02 UTC
    Another simple solution:
    grep -v -w "$(cat file2)" file1 -v to eclude -w to match exact word
    caution: If the file sizes are too large, this may not work!

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://885883]
Approved by LanX
help
Chatterbox?
and the web crawler heard nothing...

How do I use this? | Other CB clients
Other Users?
Others wandering the Monastery: (1)
As of 2022-07-06 00:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found

    Notices?