Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl: the Markov chain saw
 
PerlMonks  

how to find the unique lines in a file?

by patric (Acolyte)
on Apr 22, 2009 at 04:24 UTC ( [id://759179]=perlquestion: print w/replies, xml ) Need Help??

patric has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks, i have a list of lines in a file. Each line is compared with the other line. If one line has a substring of the other line, the shortest of the two is deleted. Only the longest line is stored.
input file: mylist_1 sublist_153 sublist_87 sublist_876 sublist_78 mylist_6 sublist_8 mylist_2 sublist_12 sublist_34 sublist_09 mylist_3 sublist_87 sublist_09 mylist_7 sublist_8 sublist_9 mylist_9 sublist_56 in the above example, line 2 is a substring of line5. since line5 is l +onger than line2, only only line5 is taken for results. another examp +le is line4 is a substring of line3, since line3 is longer, i take on +ly that for results. another example: apple orange cake juice apple fruits car van bus jeep sumo hat people van car in the above example, apple in found in 2 lines, but i keep only the l +ine which has many elements compared the other. car is found in last +line and 3rd line. but i take only the 3rd line as hit because it has + many elements compared to last line. so my result would be: apple orange cake juice car van bus jeep sumo hat my program: #!/usr/bin/perl open(FH,"input_file.txt") or die "can not open input file"; while($line=<FH>){ @collect=split(/\s+/,$line); push(@aoa,join("#",@collect)); } my %h; for(@aoa){ push(@uaoa,$_ if !$h{join $;, @$_}++; } foreach(@uaoa){ print "$_\n"; } the desired output for this problem: mylist_1 sublist_153 sublist_87 sublist_876 sublist_78 mylist_7 sublist_8 sublist_9 mylist_2 sublist_12 sublist_34 sublist_09 mylist_9 sublist_56
please help :(

Replies are listed 'Best First'.
Re: how to find the unique lines in a file?
by ELISHEVA (Prior) on Apr 22, 2009 at 08:00 UTC

    This problem needs something more than an AoA. Whenever you have a problem needing one of anything, it is best to start thinking "HASH". In this case we will need three hashes. While reading in data we will use two: one to keep track of the array associated with each list name ("mylist_1", "mylist_2", etc) and one to keep track of the name of the longest list found so far for each list member. Once we have the two hashes, we still have another step: finding only the list names that show up as longest lists. That will require using values and map as well as yet another hash.

    The solution to this problem is a good illustration of three distinct, but very important uses of hashes:

    • as an association between a name and a bunch of data (a string, a list, another hash, an object, etc)
    • as a way of keeping track of the "most-est" - the longest, the smallest, the best, the worst, etc.
    • as a way of generating unique lists from a list of possibly duplicate values

    Here is the code. Please make sure you understand it. Feel free to post a reply if you have any questions: understanding how this works is far more important than the solution.

    #always good to start your scripts with these two lines #they let the compiler do about half of the bug catching #and can save you lots of work use strict; use warnings; #First we find which is the longest list for each list #member my %hLists; #associate list name w/ list members my %hLongest; #keep track of longest list for each mbr. while(my $line=<DATA>){ #read in the list and its name my @aListMembers=split(/\s+/,$line); next unless scalar(@aListMembers); #skip empty lines #remember the name of the list my $sList = shift @aListMembers; $hLists{$sList} = \@aListMembers; # now for each member of the list record the name of # the longest list found so far my $iCount = scalar(@aListMembers); foreach (@aListMembers) { #we only care if its longer than what we already have if (exists($hLongest{$_})) { my $kLists = $hLongest{$_}; my $aList = $hLists{$kLists}; next if ($iCount <= scalar(@$aList)); } $hLongest{$_} = $sList; } } #now we only care about lists that show up as the #longest list for at least one element so lets get #a unique "list" of those. We do that by creating a hash #whose keys are the thing we want to be unique. Then #when we want the "unique list" we extract the keys of #the hash. my %hListsWeCareAbout = map { $_ => 1 } values %hLongest; #now lets print out the results: foreach (sort keys %hListsWeCareAbout) { my $aList = $hLists{$_}; print "$_: @$aList\n"; } __DATA__ mylist_1 sublist_153 sublist_87 sublist_876 sublist_78 mylist_6 sublist_8 mylist_2 sublist_12 sublist_34 sublist_09 mylist_3 sublist_87 sublist_09 mylist_7 sublist_8 sublist_9 mylist_9 sublist_56

    Best, beth

    Update: clarification of terminology, cleaned up some misuses of the word "list". Added sort to final print out. Added reminder about understanding the code. Added comments about how this problem illustrates 3 important uses of hashes.

      Thanks for your effort in understanding my problem here. I tried your code, its works perfectly fine for the example which i had posted. I didnt understand this part
      foreach (@aListMembers) { #we only care if its longer than what we already have if (exists($hLongest{$_})) { my $kLists = $hLongest{$_}; my $aList = $hLists{$kLists}; next if ($iCount <= scalar(@$aList)); }
      hLongest is empty, how can we find if exists here? moreover, when i added a line to the input file like this:
      mylist_12 sublist153 sublist_34 sublist_123 sublist_345 sublist_245 mylist_1 sublist_153 sublist_87 sublist_876 sublist_78 mylist_6 sublist_8 mylist_2 sublist_12 sublist_34 sublist_09 mylist_3 sublist_87 sublist_09 mylist_7 sublist_8 sublist_9 mylist_9 sublist_56 the result should be: mylist_12 sublist_153 sublist_34 sublist_123 sublist_345 sublist_245 mylist_2 sublist_12 sublist_34 sublist_09 mylist_7 sublist_8 sublist_9 mylist_9 sublist_56 but in the result, even the shorter line which has sublist_153 gets ad +ded to result like this: mylist_12 sublist153 sublist_34 sublist_123 sublist_345 sublist_245 mylist_1 sublist_153 sublist_87 sublist_876 sublist_78 mylist_2 sublist_12 sublist_34 sublist_09 mylist_7 sublist_8 sublist_9 mylist_9 sublist_56 In the above result, sublist_153 is present in 2 lines.
      In my final output, all the lines should be unique. All the lines in the output file shouldnt have anything in common. In your program, Are you comparing each element in each line, with each element in other lines ???? can we arrange the lines by descending order first, and then start searching "one line" with all the other lines in the file. In that case, when a match(common elements) of that "one line" in present in some other lines, all the other lines having a duplicate element can be deleted. We need not worry about the length, because, the "one line" will always be longer than the other lines in the file since we have sorted it by length in decending order. So, when we read through the whole file, "one line" would be the current line in side a foreach loop or while loop, and we will encounter only the left out lines( because, we will delete the duplicate lines when we find them while matching/looking for common elements). Hope my explanation is clear to you. Thank you once again for your kind help :)
        hLongest is empty, how can we find if exists here?

        %hLongest is only empty when we read the very first line of the file. Thereafter it gets an entry for each and every list member we have found so far. How? Well, if you look at the line after the if (exists... statement, you'll see an assignment statement that creates an entry for the list member currently stored in $_. Or if an entry already exists, it updates it with the name of the longest list found so far.

        The purpose of the if statement is to make sure that we only get to that assignment line, if the current list is longer than any other list containing the list member $_. First, it checks to see if we already have an entry for the list member $_ in the hash. That is the purpose of exists($hLongest{$_}) statement. If the entry is missing we go straight to the assignment.

        If the entry isn't missing, then we look up the last list we found for the list member $_. If the list in the current line is shorter or the same size, then we skip to the top of the loop and start testing the next list member. next is what does the skipping for us. It insures that we never reach the assignment statement below it.

        but in the result, even the shorter line which has sublist_153 gets added to result like this

        That is happening because "mylist_1" is the longest line for "sublist_87" and "sublist_153" just happens to appear in both "mylist_12" and "mylist_1". This is part of the problem bart and Anno were trying to explain to you in the CB.

        For some datasets, there is no way to satisfy the dual goal of (a) having only one line per list member and (b) having the longest list containing that list member.

        Best, beth

Re: how to find the unique lines in a file?
by Anonymous Monk on Apr 22, 2009 at 13:15 UTC
    This solution guarantees that each 'sublist' that does appear will appear only once, and that you will get the longest lines. You may find, however, that some 'sublist's will not show up at all.
    use warnings; use strict; my %data; my %uniq; my @final; while (<DATA>) { $data{$_}=()=$_=~m/\s+\b/g; } OUTER: foreach my $keys (sort {$data{$b} <=> $data{$a}} keys %data) { my @list=split(/\s+/,$keys); foreach (@list) { if (exists $uniq{$_}) { next OUTER; } } $uniq{$_}++ foreach (@list); push @final, $keys; } print @final; __DATA__ mylist_12 sublist153 sublist_34 sublist_123 sublist_345 sublist_245 mylist_1 sublist_153 sublist_87 sublist_876 sublist_78 mylist_6 sublist_8 mylist_2 sublist_12 sublist_34 sublist_09 mylist_3 sublist_87 sublist_09 mylist_7 sublist_8 sublist_9 mylist_9 sublist_56
      brilliant..Awesome...thank you :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://759179]
Approved by almut
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (10)
As of 2024-03-28 10:50 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found