Efficiency: Finding if a file contains a paragraph

C_T has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Efficiency: Finding if a file contains a paragraph by halley (Prior) on Jun 02, 2004 at 16:06 UTC
Two thoughts come to mind. One, how similar must two strings be (a proposed paragraph and an existing paragraph) to be considered identical? `Now is the time for all good Men to come to the aid of their Country.` [download] `Now is the time, for all good men, to come to the aid of their count +ry!` [download] Two, once you have a canonical paragraph, find a good way to hash or digest the paragraph to something that is quick and easy to compare later. For example, Digest::SHA1 or Digest::MD5. The digests are fast to compare, and you can even fit them into an in-memory hashtable or save them to a separate file. Remember to boil down the paragraph to the most canonical form possible, so that you won't get many false-positives that are different in irrelevant ways. Some examples of this might include changing all multiple whitespace to single spaces, lowercasing everything, and removing diacritical marks or some forms of punctuation. The resulting string is not ready to display anymore, but it is ready to hash or digest. -- `[ e d @ h a l l e y . c c ]`	[reply] [d/l] [select]
Re: Efficiency: Finding if a file contains a paragraph by BrowserUk (Patriarch) on Jun 02, 2004 at 17:42 UTC
I think halley has it right with building an index using an MD5 checksum of your paragraphs is the way to go. I though it worth mentioning that you can easily read a file, paragraph by paragraph by setting `local $/ = '';` The following script builds an in memory index (filename/byte offset) for every file in a list given on the command line. It indexed every paragraph in 512 html files in my perl distribution in under 10 seconds and 6MB. #! perl -slw use strict; use G; ## Expands command line wildcards use Data::Dumper; use Digest::MD5 qw[ md5_hex ]; local $/ = ''; ## paragraph mode. my( $pos, %index ) = 0; ## The first para start at offset 0 while( <> ) { ## build a HoAoAs, MD5 is the key ## The values are arrays of [ filename, offset ]. push @{ $index{ md5_hex( $_ ) } }, [ $ARGV, $pos ]; ## Getthe next offset $pos = tell ARGV; ## Back to 0 if we reached the EOF $pos = 0 if eof( ARGV ); } print Dumper \%index; __END__ C:\Perl\html>p:359522 .html \.html \\.html Processed 512 files and 5829 paragraphs into 3293 unique signatures. [download] By storing the offset, you can seek to the place n the file(s) that have a matching MD5 and verify that the paras are indeed identical. There is a small, but statistically possible chance of collisions. Examine what is said, not who speaks. "Efficiency is intelligent laziness." -David Dunham "Think for yourself!" - Abigail	[reply] [d/l] [select]
Re: Efficiency: Finding if a file contains a paragraph by dave_the_m (Monsignor) on Jun 02, 2004 at 16:03 UTC
Are any of your files huge? Are any of your paragraphs huge? Are your paragraphs delineated with a single blank line? Are you looking for an exact paragraph match (eg exact spaces, punctuation, and newlines)? Dave.	[reply]
Re: Re: Efficiency: Finding if a file contains a paragraph by C_T (Scribe) on Jun 02, 2004 at 16:08 UTC
>Are any of your files huge? No, there are just a lot of them. >Are any of your paragraphs huge? No. >Are your paragraphs delineated with a single blank line? Yes. >Are you looking for an exact paragraph match (eg exact spaces, punctuation, and newlines)? Yes. Basically I'm writing configuration paragraphs into configuration files. IF the paragraph already exists (a distinct possibility) I don't want to re-write it, but if it doesn't I want to add it. CT Charles Thomas Madison, WI	[reply]
Re: Re: Re: Efficiency: Finding if a file contains a paragraph by revdiablo (Prior) on Jun 02, 2004 at 20:03 UTC
Are you looking for an exact paragraph match (eg exact spaces, punctuation, and newlines)? Yes. Basically I'm writing configuration paragraphs into configuration files. The problem with an exact match is that two configuration paragraphs may be semantically equivalent, but syntactically different, e.g. in a different order, or with different whitespace. It seems, to me, that the only foolproof way to avoid duplicates would be an actual parser for the config files. Unless you can ensure that semantically equal paragraphs will be syntactically identical, you might be creating duplicates. And with a lot of files, as you say you have, this would make cleaning up any mistakes just as hard. Just something to think about...	[reply]
Re: Efficiency: Finding if a file contains a paragraph by TomDLux (Vicar) on Jun 03, 2004 at 01:30 UTC
I may be totally wrong, but I get the feeling the people suggesting MD5 and other hashing methods focused too much on the performing thousands of times. If it were being performed thousands of times on the same file, or a small number of files, hashing would be worthwile. But here it's thousands of files, each one once. Since hashing is more expensive than grepping, there's no benefit. Anyway, you would need to hash paragraph by paragraph, and thus store a number of hashes for each files, one for each paragraph. Tell you what I would do ... Save the paragraph you're looking for in a file. Read in the paragraph file at the beginning of the script. Then go through the other files, passing over any that are newer than the paragraph file, processing only older files. That way, if you have to stop the script or the system crashes, you don't have to re-do things. of course, processing the files may not actually take all that long, in which case the minor delay caused by checking creation date will not be significant. -- `TTTATCGGTCGTTATATAGATGTTTGCA`	[reply]


Perl-Sensitive Sunglasses
	PerlMonks