Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl-Sensitive Sunglasses
 
PerlMonks  

Efficiency: Finding if a file contains a paragraph

by C_T (Scribe)
on Jun 02, 2004 at 15:53 UTC ( [id://359522]=perlquestion: print w/replies, xml ) Need Help??

C_T has asked for the wisdom of the Perl Monks concerning the following question:

I've got a paragraph of text stored in a scalar. What I'd like to do is:
open a file see if that file already contains the paragraph if not, write the paragraph to the file (append mode)

I can think of at least one way to do this, but I'm concerned because this is an operation I'll be performing tens of thousands of times. I want to make sure I'm doing this the best way possible so as not to slow my script down unnecessarily.

Since there are so many experts here who deeply understand the inner workings of the language and how to make it do things efficiently, I thought I'd bounce it off you before proceeding.

Thanks!

Charles Thomas
Madison, WI

Replies are listed 'Best First'.
Re: Efficiency: Finding if a file contains a paragraph
by halley (Prior) on Jun 02, 2004 at 16:06 UTC
    Two thoughts come to mind.

    One, how similar must two strings be (a proposed paragraph and an existing paragraph) to be considered identical?

    Now is the time for all good Men to come to the aid of their Country.
    Now is the time, for all good men, to come to the aid of their count +ry!

    Two, once you have a canonical paragraph, find a good way to hash or digest the paragraph to something that is quick and easy to compare later. For example, Digest::SHA1 or Digest::MD5. The digests are fast to compare, and you can even fit them into an in-memory hashtable or save them to a separate file.

    Remember to boil down the paragraph to the most canonical form possible, so that you won't get many false-positives that are different in irrelevant ways. Some examples of this might include changing all multiple whitespace to single spaces, lowercasing everything, and removing diacritical marks or some forms of punctuation. The resulting string is not ready to display anymore, but it is ready to hash or digest.

    --
    [ e d @ h a l l e y . c c ]

Re: Efficiency: Finding if a file contains a paragraph
by BrowserUk (Patriarch) on Jun 02, 2004 at 17:42 UTC

    I think halley has it right with building an index using an MD5 checksum of your paragraphs is the way to go.

    I though it worth mentioning that you can easily read a file, paragraph by paragraph by setting local $/ = '';

    The following script builds an in memory index (filename/byte offset) for every file in a list given on the command line.

    It indexed every paragraph in 512 html files in my perl distribution in under 10 seconds and 6MB.

    #! perl -slw use strict; use G; ## Expands command line wildcards use Data::Dumper; use Digest::MD5 qw[ md5_hex ]; local $/ = ''; ## paragraph mode. my( $pos, %index ) = 0; ## The first para start at offset 0 while( <> ) { ## build a HoAoAs, MD5 is the key ## The values are arrays of [ filename, offset ]. push @{ $index{ md5_hex( $_ ) } }, [ $ARGV, $pos ]; ## Getthe next offset $pos = tell ARGV; ## Back to 0 if we reached the EOF $pos = 0 if eof( ARGV ); } print Dumper \%index; __END__ C:\Perl\html>p:359522 *.html *\*.html *\*\*.html Processed 512 files and 5829 paragraphs into 3293 unique signatures.

    By storing the offset, you can seek to the place n the file(s) that have a matching MD5 and verify that the paras are indeed identical. There is a small, but statistically possible chance of collisions.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: Efficiency: Finding if a file contains a paragraph
by dave_the_m (Monsignor) on Jun 02, 2004 at 16:03 UTC
    Are any of your files huge?
    Are any of your paragraphs huge?
    Are your paragraphs delineated with a single blank line?
    Are you looking for an exact paragraph match (eg exact spaces, punctuation, and newlines)?

    Dave.

      >Are any of your files huge?

      No, there are just a lot of them.

      >Are any of your paragraphs huge?

      No.

      >Are your paragraphs delineated with a single blank line?

      Yes.

      >Are you looking for an exact paragraph match (eg exact spaces, punctuation, and newlines)?

      Yes. Basically I'm writing configuration paragraphs into configuration files. IF the paragraph already exists (a distinct possibility) I don't want to re-write it, but if it doesn't I want to add it.

      CT

      Charles Thomas
      Madison, WI
        Are you looking for an exact paragraph match (eg exact spaces, punctuation, and newlines)?
        Yes. Basically I'm writing configuration paragraphs into configuration files.

        The problem with an exact match is that two configuration paragraphs may be semantically equivalent, but syntactically different, e.g. in a different order, or with different whitespace. It seems, to me, that the only foolproof way to avoid duplicates would be an actual parser for the config files. Unless you can ensure that semantically equal paragraphs will be syntactically identical, you might be creating duplicates. And with a lot of files, as you say you have, this would make cleaning up any mistakes just as hard.

        Just something to think about...

Re: Efficiency: Finding if a file contains a paragraph
by TomDLux (Vicar) on Jun 03, 2004 at 01:30 UTC

    I may be totally wrong, but I get the feeling the people suggesting MD5 and other hashing methods focused too much on the performing thousands of times. If it were being performed thousands of times on the same file, or a small number of files, hashing would be worthwile. But here it's thousands of files, each one once. Since hashing is more expensive than grepping, there's no benefit. Anyway, you would need to hash paragraph by paragraph, and thus store a number of hashes for each files, one for each paragraph.

    Tell you what I would do ...

    Save the paragraph you're looking for in a file. Read in the paragraph file at the beginning of the script. Then go through the other files, passing over any that are newer than the paragraph file, processing only older files. That way, if you have to stop the script or the system crashes, you don't have to re-do things. of course, processing the files may not actually take all that long, in which case the minor delay caused by checking creation date will not be significant.

    --
    TTTATCGGTCGTTATATAGATGTTTGCA

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://359522]
Approved by Happy-the-monk
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (4)
As of 2024-04-24 19:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found