Are two lines in the text file equal

by prostoalex (Scribe)
on Nov 12, 2003 at 23:49 UTC

prostoalex has asked for the wisdom of the Perl Monks concerning the following question:

What's the fastest algorithm to check whether two lines in a given text file are equal?

We're talking about log files, and two equal lines could be anywhere in the file. The file contains several million lines.

Update: Memory is a concern. Since we're talking about a generic machine with 256 MB on board, building a giant hash whose memory consumption might kill everything is not really an option.

Thanks for the pointers on MD5 digests and simply re-sorting the log file alphabetically and then going through it two lines at a time to flag the dupes.

Replies are listed 'Best First'.
Re: Are two lines in the text file equal
by BrowserUk (Patriarch) on Nov 13, 2003 at 01:53 UTC

    With a file containing "millions" of lines, your biggest problem is memory consumption.

    1 million lines of 80 chars and you've got 80 MB if you load the file as a scalar. Put it into an array, and this rises to something like 104 MB. To detect duplicates, you really need to use a hash. For this, with keys of 80 characters, a quick experiment shows that you will require approximately 160 bytes per line, or 150 MB for a file of one million lines. And this is storing undef as the value in the hash, when you probably want to be storing either the line number or, preferably, the byte offset of the start of the line.

    If your lines are more than 80 characters, or your file contains multiple millions of lines, the memory consumption is likely to grow beyond the capacity of your machine. So you need to look for alternatives. One such possibility is to create a digest from each line. Using a binary MD5 digest generated from each line will give you 16-byte keys rather than 80, which sounds like it will allow you to store a 5x bigger file for a given amount of memory. Unfortunately, this isn't so: using 16-byte keys still requires approx. 120 bytes/line, or 115 MB. Again, this is using undef for the values, which isn't useful, as it would only tell you that you had duplicates and their MD5, but not where in the file they were, and would require another pass to regenerate the MD5s for each line until you found the duplicates. The MD5 algorithm itself isn't that quick either. There is also no guarantee that every line will generate a unique MD5, so you would need to retain the line numbers or byte offsets of each line that generated each MD5, so that you can go back and check that they are indeed duplicates. This will again increase the memory usage and reduce the size of the files you can handle.
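
    For illustration, a minimal sketch of the digest approach, keeping byte offsets so that suspected duplicates can be re-checked against the raw lines (the file name is an assumption):

    use Digest::MD5 qw(md5);

    my %seen;    # 16-byte binary digest => list of byte offsets
    open LOG, '<', 'somelog' or die $!;
    while (1) {
        my $offset = tell LOG;           # offset of the line about to be read
        my $line   = <LOG>;
        last unless defined $line;
        push @{ $seen{ md5($line) } }, $offset;
    }
    close LOG;
    # Any digest with more than one offset is only a *candidate*
    # duplicate; seek back to those offsets and compare the raw
    # lines to rule out MD5 collisions.
    for my $offsets ( grep { @$_ > 1 } values %seen ) {
        print "candidate duplicates at byte offsets: @$offsets\n";
    }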

    Ultimately, unless you have huge amounts of memory, you're probably better off sorting the file, then processing the sorted file comparing consecutive lines to find the duplicates, and then re-processing the original file to find their original context if that is a requirement.
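
    A minimal sketch of that second pass, assuming the file has already been sorted into somelog.sorted (the name is an assumption):

    open SORTED, '<', 'somelog.sorted' or die $!;
    my $prev = <SORTED>;
    while ( my $line = <SORTED> ) {
        # in sorted order, duplicates are always adjacent
        print "dup: $line" if defined $prev and $line eq $prev;
        $prev = $line;
    }
    close SORTED;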

    If you have a system sort facility that can handle the volumes of data involved, then use that; the uniq command will find the dups if you have it. This will almost certainly be quicker than anything you write in perl.

    However, if you don't have a sort facility that can handle the volume, then you might try the split/sort/merge approach I used in some code I wrote a while back. It is fairly simplistic and could definitely be improved from a performance standpoint, but it runs well enough that I have never needed to improve it. YMMV :)
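
    An illustrative sketch of the split/sort/merge idea (not the original script; the chunk size and file names are assumptions):

    my $infile     = 'somelog';
    my $chunk_size = 100_000;                 # lines per in-memory chunk

    # Split: sort the file a chunk at a time, writing each sorted
    # chunk to its own temporary file.
    my @chunks;
    open IN, '<', $infile or die $!;
    until ( eof IN ) {
        my @lines;
        while ( @lines < $chunk_size and defined( my $l = <IN> ) ) {
            push @lines, $l;
        }
        my $tmp = 'chunk_' . @chunks . '.tmp';
        open OUT, '>', $tmp or die $!;
        print OUT sort @lines;
        close OUT;
        push @chunks, $tmp;
    }
    close IN;

    # Merge: repeatedly emit the smallest head line across all chunks,
    # flagging consecutive equal lines as duplicates.
    my @fhs;
    for my $tmp (@chunks) {
        open my $h, '<', $tmp or die $!;
        push @fhs, $h;
    }
    my @head = map { scalar readline $_ } @fhs;
    my $prev;
    while ( grep { defined } @head ) {
        my $min;
        for my $i ( 0 .. $#head ) {
            next unless defined $head[$i];
            $min = $i if !defined $min or $head[$i] lt $head[$min];
        }
        print "dup: $head[$min]" if defined $prev and $head[$min] eq $prev;
        $prev = $head[$min];
        $head[$min] = readline $fhs[$min];
    }
    unlink @chunks;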


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

Re: Are two lines in the text file equal
by sauoq (Abbot) on Nov 13, 2003 at 01:24 UTC
    sort log.txt | uniq -d
    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Are two lines in the text file equal
by tachyon (Chancellor) on Nov 13, 2003 at 00:19 UTC

    If you want really fast (and have a lot of memory, like 1-4 GB), use a hash thusly:

    my $log1 = '/var/log/httpd/access_log';
    my $log2 = '/var/log/httpd/access_log.1';
    my %hash;
    open FH1, $log1 or die $!;
    open FH2, $log2 or die $!;
    $hash{$_}++ while <FH1>;
    $hash{$_}++ while <FH2>;
    close FH1;
    close FH2;
    for (keys %hash) {
        print if $hash{$_} > 1;
    }

    Any key with a count > 1 is a match.

    Expect this to use about 200-400 MB of memory per million lines as a ballpark (at least double, maybe even triple the raw data size). With Perl you spend memory for speed. Because of the timestamps, IPs, etc., you may want to parse out just the part you really want to compare (which will also save memory, as the keys will be much shorter).
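
    For instance, a minimal sketch of that parsing idea, assuming Apache common log format where the request is the first double-quoted field (the regex and file path are illustrative):

    my %hash;
    open FH, '/var/log/httpd/access_log' or die $!;
    while (<FH>) {
        my ($request) = /"([^"]*)"/;    # e.g. "GET /foo HTTP/1.1"
        next unless defined $request;
        $hash{$request}++;              # key on the request alone
    }
    close FH;
    print "$_\n" for grep { $hash{$_} > 1 } keys %hash;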

    cheers

    tachyon

      Any key with a count > 1...

      ...might just have appeared in one file more than once but never in the other.

      my $log1 = '/var/log/httpd/access_log';
      my $log2 = '/var/log/httpd/access_log.1';
      my %hash;
      open FH, $log1 or die $!;
      $hash{$_}++ while <FH>;
      close FH;
      open FH, $log2 or die $!;
      while ( <FH> ) {
          print if $hash{$_};
      }
      close FH;

      At least, that is how I interpret the question.

      Update: BrowserUk has more singular words than I have plural words to go on and so is probably right. It appears tachyon also thought two files were involved, so I won't feel too bad.

                      - tye

        Am I the only one who interprets this question to relate to finding duplicates in a single file?

        ...a given text file....anywhere in the file...The file contains...

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail

Re: Are two lines in the text file equal
by davido (Cardinal) on Nov 13, 2003 at 00:28 UTC
    Determining the fastest must be a matter of profiling and benchmarking. And ultimately, if speed is your greatest concern, you would drop into assembly language and hand-craft your routines. If speed isn't quite that critical, here is one way to do it:

    my %hash;
    while ( my $line = <DATA> ) {
        print "Found another duplicate of: \"", $line, "\" in line $.\n"
            if ++$hash{$line} > 1;
    }

    That will tell you if the file contains duplicate lines.

    On the other hand, if you have a particular line in mind, and want to know if it is repeated elsewhere, you pretty much just scan through the file looking for a duplicate. No shortcut there.
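
    A minimal sketch of that scan, with the target line and file name as assumptions:

    my $target = "GET /index.html HTTP/1.1";    # the line you care about
    open LOG, '<', 'somelog' or die $!;
    while ( my $line = <LOG> ) {
        chomp $line;
        print "Duplicate at line $.\n" if $line eq $target;
    }
    close LOG;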


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Are two lines in the text file equal
by QM (Parson) on Nov 13, 2003 at 00:30 UTC
    Update: He did say Algorithm, didn't he?

    Assuming not all-in-memory, and for certain values of "fastest"...

    Sort each log file, removing duplicates (perhaps into new files).

    Merge-sort the files together (adding some tag to each line indicating the original file and line, if desired).

    Scan the merge-sorted file, checking for identical "lines" (of original data).

    If duplicate lines within a file are interesting, handle that in the merge with a tag; a sketch of the merge step follows.
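
    A minimal sketch of the merge step, assuming log1.sorted and log2.sorted were each produced by sorting a file with duplicates removed (the file names are assumptions):

    open S1, '<', 'log1.sorted' or die $!;
    open S2, '<', 'log2.sorted' or die $!;
    my $x = <S1>;
    my $y = <S2>;
    while ( defined $x and defined $y ) {
        if    ( $x lt $y ) { $x = <S1> }
        elsif ( $x gt $y ) { $y = <S2> }
        else {                            # identical line in both files
            print "dup: $x";
            $x = <S1>;
            $y = <S2>;
        }
    }
    close S1;
    close S2;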

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Are two lines in the text file equal
by delirium (Chaplain) on Nov 13, 2003 at 00:45 UTC
    Here's an old gem from Merlyn covering a similar topic. Might not be the fastest, but it's pretty damn cool.

    Update: OK, I'm batting about 120 this week. Better spend some time on the bench for a while.

Re: Are two lines in the text file equal
by ptkdb (Monk) on Nov 13, 2003 at 14:05 UTC
    Worst case: Millions of lines, and the only two lines that are equal are the first and last ones.

    I'm surprised that no one has yet mentioned an on-disk database, given the constraint on memory.

    use DB_File;

    my $dbFname = 'dups.db';                  # illustrative database file name
    tie my %h, 'DB_File', $dbFname;
    open my $f, '<', 'somelog' or die $!;     # 'somelog' is illustrative
    while ( <$f> ) {
        if ( exists $h{$_} ) {
            # equal lines in file
        }
        $h{$_} = 1;    # or some piece of data
    }
    close $f;
    It's painful, and probably not nearly so fast as an in-mem hash, but it should work. However, now the next question is how much disk space have you got? The indexes that DB_File generates can get to be VERY large.

    As for speed, there are quite a few options to DB_File, and you may spend some time 'tuning' it. Sensitize yourself to the distinction between "fast", "fastest", and "working at all".
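
    For example, one knob that may help, as a minimal sketch assuming the defaults are too small (the cache size and file name are assumptions):

    use DB_File;
    use Fcntl;

    $DB_HASH->{'cachesize'} = 10_000_000;    # ~10 MB cache for the hash
    tie my %h, 'DB_File', 'dups.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie: $!";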

Re: Are two lines in the text file equal
by kelan (Deacon) on Nov 13, 2003 at 00:30 UTC

    How about this (untested):

    open LOG, '<', 'somelog' or die $!;
    my %tracker;
    while ( my $line = <LOG> ) {
        push @{ $tracker{$line} }, $.;
    }
    close LOG;
    for my $key (keys %tracker) {
        print "Lines ", join(', ', @{ $tracker{$key} }), " are the same.\n"
            if @{ $tracker{$key} } > 1;
    }

    kelan


    Perl6 Grammar Student
