Are two lines in the text file equal

by prostoalex (Scribe)
on Nov 12, 2003 at 23:49 UTC

prostoalex has asked for the wisdom of the Perl Monks concerning the following question:

What's the fastest algorithm to check whether two lines in a given text file are equal?

We're talking about log files, and two equal lines could be anywhere in the file. The file contains several million lines.

Update: Memory is a concern. Since we're talking about a generic machine with 256 MB on board, building a giant hash whose memory consumption might kill everything is not really an option.

Thanks for the pointers on MD5 digests and simply re-sorting the log file alphabetically and then going through it two lines at a time to flag the dupes.

Replies are listed 'Best First'.
Re: Are two lines in the text file equal
by BrowserUk (Patriarch) on Nov 13, 2003 at 01:53 UTC

    With a file containing "millions" of lines, your biggest problem is memory consumption.

    1 million lines of 80 chars and you've got 80 MB if you load the file as a scalar. Put it into an array, and this rises to something like 104 MB. To detect duplicates, you really need to use a hash. For this, with keys of 80 characters, a quick experiment shows that you will require approximately 160 bytes per line, or 150 MB for a file of one million lines. And this is storing undef as the value in the hash, when you probably want to be storing either the line number or, preferably, the byte offset of the start of the line.

    If your lines are more than 80 characters, or your file contains multiple millions of lines, the memory consumption is likely to grow beyond the capacity of your machine. So you need to look for alternatives. One such possibility is to create a digest from each line. Using a binary MD5 digest generated from each line will give you 16-byte keys rather than 80, which sounds like it will allow you to store a 5x bigger file for a given amount of memory. Unfortunately, this isn't so: using 16-byte keys still requires approx. 120 bytes/line, or 115 MB. Again, this is using undef for the values, which isn't useful, as it would only tell you that you had duplicates and their MD5, but not where in the file they were, and would require another pass to regenerate the MD5s for each line until you found the duplicates. The MD5 algorithm itself isn't that quick either. There is also no guarantee that every line will generate a unique MD5, so you would need to retain the line numbers or byte offsets of each line that generated each MD5, so that you can go back and check that they are indeed duplicates. This will again increase the memory usage and reduce the size of the files you can handle.
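
    For illustration, a minimal sketch of the digest approach, keeping byte offsets so that suspected duplicates can be re-checked against the raw lines (the file name is an assumption):

    use Digest::MD5 qw(md5);

    my %seen;    # 16-byte binary digest => list of byte offsets
    open LOG, '<', 'somelog' or die $!;
    while (1) {
        my $offset = tell LOG;           # offset of the line about to be read
        my $line   = <LOG>;
        last unless defined $line;
        push @{ $seen{ md5($line) } }, $offset;
    }
    close LOG;
    # Any digest with more than one offset is only a *candidate*
    # duplicate; seek back to those offsets and compare the raw
    # lines to rule out MD5 collisions.
    for my $offsets ( grep { @$_ > 1 } values %seen ) {
        print "candidate duplicates at byte offsets: @$offsets\n";
    }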

    Ultimately, unless you have huge amounts of memory, you're probably better off sorting the file, then processing the sorted file comparing consecutive lines to find the duplicates, and then re-processing the original file to find their original context if that is a requirement.
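
    A minimal sketch of that second pass, assuming the file has already been sorted into somelog.sorted (the name is an assumption):

    open SORTED, '<', 'somelog.sorted' or die $!;
    my $prev = <SORTED>;
    while ( my $line = <SORTED> ) {
        # in sorted order, duplicates are always adjacent
        print "dup: $line" if defined $prev and $line eq $prev;
        $prev = $line;
    }
    close SORTED;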

    If you have a system sort facility that can handle the volumes of data involved, then use that; the uniq command will find the dups if you have it. This will almost certainly be quicker than anything you write in perl.

    However, if you don't have a sort facility that can handle the volume, then you might try the split/sort/merge approach I used in some code I wrote a while back. It is fairly simplistic and could definitely be improved from a performance standpoint, but it runs well enough that I have never needed to improve it. YMMV :)
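
    An illustrative sketch of the split/sort/merge idea (not the original script; the chunk size and file names are assumptions):

    my $infile     = 'somelog';
    my $chunk_size = 100_000;                 # lines per in-memory chunk

    # Split: sort the file a chunk at a time, writing each sorted
    # chunk to its own temporary file.
    my @chunks;
    open IN, '<', $infile or die $!;
    until ( eof IN ) {
        my @lines;
        while ( @lines < $chunk_size and defined( my $l = <IN> ) ) {
            push @lines, $l;
        }
        my $tmp = 'chunk_' . @chunks . '.tmp';
        open OUT, '>', $tmp or die $!;
        print OUT sort @lines;
        close OUT;
        push @chunks, $tmp;
    }
    close IN;

    # Merge: repeatedly emit the smallest head line across all chunks,
    # flagging consecutive equal lines as duplicates.
    my @fhs;
    for my $tmp (@chunks) {
        open my $h, '<', $tmp or die $!;
        push @fhs, $h;
    }
    my @head = map { scalar readline $_ } @fhs;
    my $prev;
    while ( grep { defined } @head ) {
        my $min;
        for my $i ( 0 .. $#head ) {
            next unless defined $head[$i];
            $min = $i if !defined $min or $head[$i] lt $head[$min];
        }
        print "dup: $head[$min]" if defined $prev and $head[$min] eq $prev;
        $prev = $head[$min];
        $head[$min] = readline $fhs[$min];
    }
    unlink @chunks;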


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

Re: Are two lines in the text file equal
by sauoq (Abbot) on Nov 13, 2003 at 01:24 UTC
    sort log.txt | uniq -d
    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Are two lines in the text file equal
by tachyon (Chancellor) on Nov 13, 2003 at 00:19 UTC

    If you want really fast (and have a lot of memory, like 1-4 GB), use a hash thusly:

    my $log1 = '/var/log/httpd/access_log';
    my $log2 = '/var/log/httpd/access_log.1';
    my %hash;
    open FH1, $log1 or die $!;
    open FH2, $log2 or die $!;
    $hash{$_}++ while <FH1>;
    $hash{$_}++ while <FH2>;
    close FH1;
    close FH2;
    for (keys %hash) {
        print if $hash{$_} > 1;
    }

    Any key with a count > 1 is a match.

    Expect this to use about 200-400 MB of memory per million lines as a ballpark (at least double, maybe even triple the raw data size). With Perl you spend memory for speed. Because of the timestamps, IPs, etc., you may want to parse out just the part you really want to compare (which will also save memory, as the keys will be much shorter).
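
    For instance, a minimal sketch of that parsing idea, assuming Apache common log format where the request is the first double-quoted field (the regex and file path are illustrative):

    my %hash;
    open FH, '/var/log/httpd/access_log' or die $!;
    while (<FH>) {
        my ($request) = /"([^"]*)"/;    # e.g. "GET /foo HTTP/1.1"
        next unless defined $request;
        $hash{$request}++;              # key on the request alone
    }
    close FH;
    print "$_\n" for grep { $hash{$_} > 1 } keys %hash;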

    cheers

    tachyon

      Any key with a count > 1...

      ...might just have appeared in one file more than once but never in the other.

      my $log1 = '/var/log/httpd/access_log';
      my $log2 = '/var/log/httpd/access_log.1';
      my %hash;
      open FH, $log1 or die $!;
      $hash{$_}++ while <FH>;
      close FH;
      open FH, $log2 or die $!;
      while ( <FH> ) {
          print if $hash{$_};
      }
      close FH;

      At least, that is how I interpret the question.

      Update: BrowserUk has more singular words than I have plural words to go on and so is probably right. It appears tachyon also thought two files were involved, so I won't feel too bad.

                      - tye

        Am I the only one who interprets this question to relate to finding duplicates in a single file?

        ...a given text file....anywhere in the file...The file contains...

        Examine what is said, not who speaks.
        "Efficiency is intelligent laziness." -David Dunham
        "Think for yourself!" - Abigail

Re: Are two lines in the text file equal
by davido (Cardinal) on Nov 13, 2003 at 00:28 UTC
    Determining the fastest must be a matter of profiling and benchmarking. And ultimately, if speed is your greatest concern, you would drop into assembly language and hand-craft your routines. If speed isn't quite that critical, here is one way to do it:

    my %hash;
    while ( my $line = <DATA> ) {
        print "Found another duplicate of: \"", $line, "\" in line $.\n"
            if ++$hash{$line} > 1;
    }

    That will tell you if the file contains duplicate lines.

    On the other hand, if you have a particular line in mind, and want to know if it is repeated elsewhere, you pretty much just scan through the file looking for a duplicate. No shortcut there.
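
    A minimal sketch of that scan, with the target line and file name as assumptions:

    my $target = "GET /index.html HTTP/1.1";    # the line you care about
    open LOG, '<', 'somelog' or die $!;
    while ( my $line = <LOG> ) {
        chomp $line;
        print "Duplicate at line $.\n" if $line eq $target;
    }
    close LOG;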


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
Re: Are two lines in the text file equal
by QM (Parson) on Nov 13, 2003 at 00:30 UTC
    Update: He did say Algorithm, didn't he?

    Assuming not all-in-memory, and for certain values of "fastest"...

    Sort each log file, removing duplicates (perhaps into new files).

    Merge-sort the files together (adding some tag to each line indicating the original file and line, if desired).

    Scan the merge-sorted file, checking for identical "lines" (of original data).

    If duplicate lines within a file are interesting, handle that in the merge with a tag; a sketch of the merge step follows.
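
    A minimal sketch of the merge step, assuming log1.sorted and log2.sorted were each produced by sorting a file with duplicates removed (the file names are assumptions):

    open S1, '<', 'log1.sorted' or die $!;
    open S2, '<', 'log2.sorted' or die $!;
    my $x = <S1>;
    my $y = <S2>;
    while ( defined $x and defined $y ) {
        if    ( $x lt $y ) { $x = <S1> }
        elsif ( $x gt $y ) { $y = <S2> }
        else {                            # identical line in both files
            print "dup: $x";
            $x = <S1>;
            $y = <S2>;
        }
    }
    close S1;
    close S2;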

    -QM
    --
    Quantum Mechanics: The dreams stuff is made of

Re: Are two lines in the text file equal
by delirium (Chaplain) on Nov 13, 2003 at 00:45 UTC
    Here's an old gem from Merlyn covering a similar topic. Might not be the fastest, but it's pretty damn cool.

    Update: OK, I'm batting about 120 this week. Better spend some time on the bench for a while.

Re: Are two lines in the text file equal
by ptkdb (Monk) on Nov 13, 2003 at 14:05 UTC
    Worst case: Millions of lines, and the only two lines that are equal are the first and last ones.

    I'm surprised that no one has yet mentioned an on-disk database, given the constraint on memory.

    use DB_File;

    my $dbFname = 'dups.db';                  # illustrative database file name
    tie my %h, 'DB_File', $dbFname;
    open my $f, '<', 'somelog' or die $!;     # 'somelog' is illustrative
    while ( <$f> ) {
        if ( exists $h{$_} ) {
            # equal lines in file
        }
        $h{$_} = 1;    # or some piece of data
    }
    close $f;
    It's painful, and probably not nearly so fast as an in-mem hash, but it should work. However, now the next question is how much disk space have you got? The indexes that DB_File generates can get to be VERY large.

    As for speed, there are quite a few options to DB_File, and you may spend some time 'tuning' it. Sensitize yourself to the distinction between "fast", "fastest", and "working at all".
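
    For example, one knob that may help, as a minimal sketch assuming the defaults are too small (the cache size and file name are assumptions):

    use DB_File;
    use Fcntl;

    $DB_HASH->{'cachesize'} = 10_000_000;    # ~10 MB cache for the hash
    tie my %h, 'DB_File', 'dups.db', O_RDWR|O_CREAT, 0666, $DB_HASH
        or die "Cannot tie: $!";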

Re: Are two lines in the text file equal
by kelan (Deacon) on Nov 13, 2003 at 00:30 UTC

    How about this (untested):

    open LOG, '<', 'somelog' or die $!;
    my %tracker;
    while ( my $line = <LOG> ) {
        push @{ $tracker{$line} }, $.;
    }
    close LOG;
    for my $key (keys %tracker) {
        print "Lines ", join(', ', @{ $tracker{$key} }), " are the same.\n"
            if @{ $tracker{$key} } > 1;
    }

    kelan


    Perl6 Grammar Student
