Re^2: how to find the unique lines in a file?

Thanks for your effort in understanding my problem here. I tried your code, and it works perfectly for the example I had posted. I didn't understand this part:

foreach (@aListMembers) {
    #we only care if its longer than what we already have
    if (exists($hLongest{$_})) {
      my $kLists = $hLongest{$_};
      my $aList = $hLists{$kLists};
      next if ($iCount <= scalar(@$aList));
    }

hLongest is empty, how can we find if exists here? Moreover, when I added a line to the input file like this:

mylist_12 sublist153  sublist_34 sublist_123 sublist_345 sublist_245
mylist_1  sublist_153  sublist_87  sublist_876  sublist_78
mylist_6  sublist_8
mylist_2  sublist_12  sublist_34  sublist_09
mylist_3  sublist_87  sublist_09
mylist_7  sublist_8  sublist_9
mylist_9  sublist_56 

the result should be:
mylist_12 sublist_153  sublist_34 sublist_123 sublist_345 sublist_245
mylist_2  sublist_12  sublist_34  sublist_09
mylist_7  sublist_8  sublist_9
mylist_9  sublist_56 

but in the result, even the shorter line which has sublist_153 gets added to result like this:

mylist_12 sublist153  sublist_34 sublist_123 sublist_345 sublist_245
mylist_1  sublist_153  sublist_87  sublist_876  sublist_78
mylist_2  sublist_12  sublist_34  sublist_09
mylist_7  sublist_8  sublist_9
mylist_9  sublist_56 

In the above result, sublist_153 is present in 2 lines.

In my final output, all the lines should be unique: no two lines in the output file should have anything in common. In your program, are you comparing each element in each line with each element in the other lines? Could we sort the lines by length in descending order first, and then compare "one line" against all the other lines in the file? When an element of that "one line" is also present in some other line, every other line containing that duplicate element can be deleted. We need not worry about the length, because the "one line" will always be at least as long as the remaining lines, since we sorted by length in descending order. So as we read through the whole file, the "one line" is the current line inside a foreach or while loop, and we only ever encounter the lines that are left (because we delete the duplicate lines as we find them while looking for common elements). I hope my explanation is clear. Thank you once again for your kind help :)
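A minimal sketch of the sort-then-filter idea described above, not a tested solution for every dataset. The helper name pick_disjoint and the sample lines are made up for illustration. It assumes each line is "listname member member ...", separated by whitespace, and that "length" means the number of members. Ties in length are broken by input order, which is also an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: takes lines of the form "name member member ..."
# and returns the lines that survive the greedy pass.
sub pick_disjoint {
    my @lines = @_;

    # Split each line into name and members; remember the input position.
    my @rows;
    for my $i (0 .. $#lines) {
        my ($name, @members) = split ' ', $lines[$i];
        next unless defined $name;    # skip blank lines
        push @rows, { pos => $i, line => $lines[$i], members => \@members };
    }

    # Longest member list first; equal lengths keep their input order.
    my @sorted = sort {
        scalar(@{ $b->{members} }) <=> scalar(@{ $a->{members} })
            or $a->{pos} <=> $b->{pos}
    } @rows;

    my (%seen, @kept);
    for my $row (@sorted) {
        # Drop the line if any of its members already belongs to a kept line.
        next if grep { $seen{$_} } @{ $row->{members} };
        $seen{$_} = 1 for @{ $row->{members} };
        push @kept, $row->{line};
    }
    return @kept;
}

# Illustrative data only, not taken from the thread.
my @kept = pick_disjoint(
    "mylist_a s1 s2 s3",
    "mylist_b s3 s4",
    "mylist_c s4 s5",
);
print "$_\n" for @kept;
```

Here mylist_b is dropped because it shares s3 with mylist_a, while mylist_c is kept: a dropped line does not block the lines after it. Every kept line is disjoint from the others, but nothing guarantees that each member ends up in the longest line that contains it.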



Re^3: how to find the unique lines in a file?
by ELISHEVA (Prior) on Apr 22, 2009 at 10:02 UTC

hLongest is empty, how can we find if exists here?

%hLongest is only empty when we read the very first line of the file. Thereafter it gets an entry for each and every list member we have found so far. How? Well, if you look at the line after the if (exists... statement, you'll see an assignment statement that creates an entry for the list member currently stored in $_. Or if an entry already exists, it updates it with the name of the longest list found so far.

The purpose of the if statement is to make sure that we only get to that assignment line if the current list is longer than any other list containing the list member $_. First, it checks to see if we already have an entry for the list member $_ in the hash. That is the purpose of the exists($hLongest{$_}) test. If the entry is missing we go straight to the assignment.

If the entry isn't missing, then we look up the last list we found for the list member $_. If the list in the current line is shorter or the same size, then we skip to the top of the loop and start testing the next list member. next is what does the skipping for us. It ensures that we never reach the assignment statement below it.
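The exists/next/assignment flow can be sketched in a self-contained form. The full program is not reproduced in this reply, so everything outside the quoted if block is an assumption: the record_line wrapper, the way %hLists is filled, the name $kList, and the exact form of the assignment line. The sample lines are shortened for illustration.

```perl
use strict;
use warnings;

my %hLists;      # list name   => array ref of its members
my %hLongest;    # list member => name of the longest list seen so far

# Assumed wrapper (not from the original code): process one input line.
sub record_line {
    my ($line) = @_;
    my ($kList, @aListMembers) = split ' ', $line;
    return unless defined $kList;    # skip blank lines

    $hLists{$kList} = \@aListMembers;
    my $iCount = scalar @aListMembers;

    foreach (@aListMembers) {
        # We only care if this list is longer than what we already have.
        if (exists $hLongest{$_}) {
            my $aList = $hLists{ $hLongest{$_} };
            next if $iCount <= scalar @$aList;
        }
        # Reached on the first sighting of $_, or when the current list is longer.
        $hLongest{$_} = $kList;
    }
}

record_line($_) for (
    "mylist_1 sublist_153 sublist_87",
    "mylist_3 sublist_87",
    "mylist_12 sublist_153 sublist_34 sublist_123",
);
print "$_ => $hLongest{$_}\n" for sort keys %hLongest;
```

On the first line nothing exists in %hLongest yet, so both members go straight to the assignment. On the second line sublist_87 already has an entry pointing at a longer list, so next skips the assignment. On the third line sublist_153 is reassigned to mylist_12, while sublist_87 still points at mylist_1.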

but in the result, even the shorter line which has sublist_153 gets added to result like this

That is happening because "mylist_1" is the longest line for "sublist_87" and "sublist_153" just happens to appear in both "mylist_12" and "mylist_1". This is part of the problem bart and Anno were trying to explain to you in the CB.

For some datasets, there is no way to satisfy the dual goal of (a) having only one line per list member and (b) having the longest list containing that list member.
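One way to see the conflict in a given result is to list the members that occur in more than one output line. The sketch below is only a checker, not a fix. The helper name shared_members is hypothetical, and the two sample lines follow the thread's data under the assumption that the first line was meant to read sublist_153.

```perl
use strict;
use warnings;

# Hypothetical checker: returns the members that occur in more than one line.
sub shared_members {
    my @lines = @_;
    my %owners;    # member => { list name => 1 }
    for my $line (@lines) {
        my ($name, @members) = split ' ', $line;
        next unless defined $name;    # skip blank lines
        $owners{$_}{$name} = 1 for @members;
    }
    return sort grep { scalar(keys %{ $owners{$_} }) > 1 } keys %owners;
}

my @dups = shared_members(
    "mylist_12 sublist_153 sublist_34 sublist_123 sublist_345 sublist_245",
    "mylist_1 sublist_153 sublist_87 sublist_876 sublist_78",
);
print "shared: @dups\n";
```

An empty return list means the output lines have nothing in common, which is goal (a). It says nothing about goal (b).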

Best, beth
