Code Efficiency

by fourmi (Scribe)
on Mar 25, 2004 at 11:19 UTC

fourmi has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
I have a base directory containing (amongst other files/dirs) eleven subdirectories (Sub1-Sub11). Each of these contains approx 10,000 html files (amongst other files), which don't actually contain html, just a list of keywords (30 max). I wish to produce a sorted list of unique keywords. There is a lot of repetition, and new keywords can be added at any time to any file (hence this needs to be done dynamically). Worst case is that i might have to sort and uniq a list of (11 x 10,000 x 30) 3,300,000 items. I can't flatten the directory structure as the OS (Windows) has difficulty with directories containing such large numbers of small files.

My code is definitely not optimised; I've never really worried about optimisation for saving milliseconds, but given the magnitude here, I am happy to save minutes! Apologies for any offensive code!

Any solutions, pointers, references, or just thoughts and comments will be well received.

Thanks
ant
use strict;
use Cwd;

my ($PWD) = getcwd;
my (@DirList);    my ($DirItem);
my (@SUBDirList); my ($SUBDirItem);
my ($CurrentHtmlFile);
my (@Lines); my ($Line);
my (%een);   my (@KeyWords);

opendir(DIR, $PWD) || die "Cannot Open The Directory \"$PWD\"\n";
@DirList = readdir(DIR);
closedir DIR;

foreach $DirItem (@DirList) {
    if ($DirItem =~ /^Sub/) {
        opendir(SUBDIR, "$PWD\\$DirItem")
            || die "Cannot Open The Directory \"$PWD\\$DirItem\"\n";
        @SUBDirList = readdir(SUBDIR);
        closedir SUBDIR;
        foreach $SUBDirItem (@SUBDirList) {
            if ($SUBDirItem =~ /html$/) {
                $CurrentHtmlFile = "$PWD\\$DirItem\\$SUBDirItem";
                open(READ, "<$CurrentHtmlFile")
                    || die "Couldn't Read From $CurrentHtmlFile";
                $Line  = <READ>;              # each file holds one comma-separated line
                @Lines = split(/,/, $Line);
                close(READ);
                push(@KeyWords, @Lines);
            }
        }
    }
}

foreach (@KeyWords) { ++$een{$_}; }           # tally; the keys of %een are the unique keywords
print sort keys %een;




Replies are listed 'Best First'.
Re: Code Efficiency
by PodMaster (Abbot) on Mar 25, 2004 at 11:42 UTC
    What is this for -- what is your ultimate goal? Heard of Plucene? Heard of DB_File (it would help with memory overhead and the sorting)?

    Now looking at the code...

    • One easy way to speed this up is to implement caching. If the timestamp hasn't changed, no need to rescan the entire file/directory.
    • Don't build a giant array only so you can build a giant hash afterwards; just build the giant hash ($wordlist{$word}++). See the sketch after this list.
    • Why are you keeping a @DirList? You don't appear to be doing anything with it; might think about timestamps again.
    • You can write foreach my $SUBDirItem( @SUBDirList ){ ... }
    • Are you sure you wanna die (e.g. you've reached the last two files and can't read the one before last; why not just move on to the next one)?
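    A minimal sketch of the build-the-hash and don't-die points together (the glob pattern is a hypothetical stand-in for however the file list gets built):

    use strict;
    my %een;
    foreach my $file (glob "Sub*/*.html") {
        open(my $fh, "<$file") or do {
            warn "Couldn't read from $file: $!";   # skip it, don't abort the run
            next;
        };
        my $line = <$fh>;
        close($fh);
        next unless defined $line;
        $een{$_}++ foreach split /,/, $line;       # build the giant hash directly
    }
    print join(", ", sort keys %een), "\n";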

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      looking up Plucene now; the dying is just for this stage, will lose it on implementation. very nice point re the timestamp, and again with the giant hash & giant array
      thanks!
Re: Code Efficiency
by Jaap (Curate) on Mar 25, 2004 at 11:31 UTC
    1. Lose the parens () around a variable declaration (just my $PWD;). Edit, re Abigail's reply below: this is just a style issue; unless you need the parens, i'd omit them.
    2. Do not use @KeyWords, just put them in the hash as you find them:
    foreach (split (/,/,$Line)) { $een{$_}++; }
    3. Declare variables as late as possible (if you want the advantages of use strict;).
    4. I assume you use @DirList and @SUBDirList because you can't 'flatten' the directory structure? There is no real speed problem here, but my instinct tells me to lose the two arrays and open the subdirs in a while loop over readdir(DIR); a sketch follows below.

    These are all minor items, none of them speed up the most time-consuming part.
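
    Item 4 as a sketch (nesting the subdirectory read inside the readdir while loop, with no intermediate arrays; error handling kept minimal):

    use strict;
    use Cwd;

    my $PWD = getcwd;
    my %een;

    opendir(DIR, $PWD) or die "Cannot open $PWD: $!";
    while (defined(my $dir = readdir DIR)) {
        next unless $dir =~ /^Sub/;
        opendir(SUBDIR, "$PWD\\$dir") or next;        # skip unreadable subdirs
        while (defined(my $file = readdir SUBDIR)) {
            next unless $file =~ /html$/;
            open(my $fh, "<$PWD\\$dir\\$file") or next;
            my $line = <$fh>;
            close($fh);
            $een{$_}++ foreach split /,/, (defined $line ? $line : "");
        }
        closedir SUBDIR;
    }
    closedir DIR;

    print join(", ", sort keys %een), "\n";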
      Lose the parens () around a variable declaration (just my $PWD;).
      Why? It's utterly useless to give "suggestions" without motivating them. It's even more useless to give suggestions that appear to be general, but which are sometimes plain wrong. There's a difference between my $var and my ($var), and while sometimes it doesn't matter whether you place the parens (like in the code being discussed), it sometimes changes the meaning of the expression. You seem to think it matters in this case. Could you explain why?
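
      (For illustration, a minimal example of where the parens do change meaning: assignment to my ($var) happens in list context, assignment to my $var in scalar context.)

      sub f { return ('alpha', 'beta') }

      my $x   = f();    # scalar context: the comma operator yields 'beta'
      my ($y) = f();    # list context: $y gets the first element, 'alpha'

      my @kw  = ('a', 'b', 'c');
      my $n   = @kw;    # scalar context: $n is 3, the element count
      my ($m) = @kw;    # list context: $m is 'a'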

      Declare variables as late as possible (if you want the advantages of use strict;).
      Huh? While declaring variables in an as restrictive scope as necessary is usually a good thing, what does that have to do with any advantage of use strict?

      Abigail

        Declare variables as late as possible (if you want the advantages of use strict;). -- Huh? While declaring variables in an as restrictive scope as necessary is usually a good thing, what does that have to do with any advantage of use strict?

        The first use of a variable is usually important for subsequent uses. If you declare the variable as late as possible, you make sure the program breaks at compile time when you remove the first use but not the others.

        my $foo; # Predeclared: bad style
        # Usually done by C-coders, who predeclare a whole bunch of variables:
        # my ($foo, $bar, $baz, $quux, $whatever, $i, $think, $is, $needed, $anywhere);
        ...
        $foo = foo();
        ...
        print $foo;
        Now, I remove the assignment:
        my $foo;
        ...
        print $foo;   # No error, just a warning at run time.
        However, if you declare $foo as late as possible:
        my $foo = foo();   # Declared when needed
        ...
        print $foo;
        Removing the assignment again:
        ...
        print $foo;   # Error at compile time!

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

      cool, thanks a lot, useful stuff i wouldn't have thought about, i agree that the guts are going to take ages anyway though!
      cheers

      een

      What is een? Are you one of those people who have rrays and calars?

      Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        What is een? Are you one of those people who have rrays and calars?

        Actually, calar would be a very nice looking name (when compared to the other two) :D.
        yup ;o)
Re: Code Efficiency
by Abigail-II (Bishop) on Mar 25, 2004 at 13:10 UTC
    If I understand correctly, this can be done in the shell using less than 20 characters (yeah, shell!):
    sort -u Sub*/*html
    With the added benefit that sort knows how to sort huge lists efficiently. (This is assuming that your shell doesn't have problems with the number of arguments that result from the expansion - but even then, a shell solution is way smaller)

    But if you really want to save time, you shouldn't look at the sorting/uniquing phase. Look at the updating phase. Keep files sorted. Reduce the number of files. Reduce the number of files per directory.

    Abigail

      i WISH i was using *nix! just came to the same conclusion re: updating too. i can't change the tree/file structure, but the updating is a cinch now hopefully (cue BSOD)
      cheers!

        i WISH i was using *nix!

        Unixish utilities are available for many platforms, including MS Windows.

        Juerd # { site => 'juerd.nl', plp_site => 'plp.juerd.nl', do_not_use => 'spamtrap' }

        Windows also has a built-in sort; run help sort from the command line for options.
Re: Code Efficiency
by Hena (Friar) on Mar 25, 2004 at 12:27 UTC
    You could lose one if structure (the test for html in the last foreach loop) by using grep on the readdir.
    # instead of
    # @SUBDirList = readdir(SUBDIR);
    @SUBDirList = grep { /html$/ } readdir(SUBDIR);
    Don't know how much speed it gives though :).
      I'll take anything that goes, thanks a lot. It's also applicable to the other regexp if (see the one-liner below); less involved there, only 11 iterations, but all the same!!
      cheers!
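
      The same grep would tidy the top-level scan too, e.g.:

      @DirList = grep { /^Sub/ } readdir(DIR);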
Re: Code Efficiency
by fourmi (Scribe) on Mar 25, 2004 at 13:16 UTC
    a bit of background. i have an image library, keywords for image \d+.jpg are stored in \d+.html, and this can be updated
    Slight paradigm shift. I'm not going to run this dynamically each time a keyword file is updated, but instead just update the keyword list (trawl through the keyword list; if seen, do nothing, else append the new keyword to the end). Keywords can only be added, not removed, which should reduce the load considerably.
    Thanks to all that commented; it will definitely help with the initial collection, and i'm a better perl programmer now too!
    cheers
Re: Code Efficiency
by MidLifeXis (Monsignor) on Mar 25, 2004 at 18:16 UTC

    If you have 10000 files in a directory, you may also be running into problems with the OS being able to handle directories that large. See my writeup on this on a different thread.

    Depending on the OS, you are better off (from a memory / OS standpoint) making your directory structure deeper, so that when the OS is opening the file, it does not need to read more than necessary. The wider / flatter the directory structure, the more resources may be necessary to get to an individual file (on open() for example).
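
    For example, a deeper layout (hypothetical, not the OP's current scheme) might bucket files by digits of the name, so each directory stays small:

    use strict;

    # e.g. 123456.html lives at Keys\12\34\123456.html
    my $file = '123456.html';
    my ($d1, $d2) = $file =~ /^(\d\d)(\d\d)/;
    print "Keys\\$d1\\$d2\\$file\n";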

    --MidLifeXis

      Hi,
      yeah, i definitely found that windows tends to get screwy once a dir has more than 30,000 files in it, hence splitting into smaller chunks; even then it's still fairly screwy. that is a VERY interesting thread you link to, thanks very much, i didn't catch it initially.
      cheers!
Re: Code Efficiency
by TilRMan (Friar) on Mar 25, 2004 at 19:49 UTC

    You say "new keywords can be added at any time." What about changing and deleting? If you only have keywords added, then you could optimize (for time) very efficiently by just detecting what new keywords show up and then inserting them into the sorted list. For instance, make a cache (copy) of the tree and compare the old vs. the new.

    The next optimization step: Keep a timestamp (and/or hash) of every file instead of a copy. Since you are only adding keywords, any time a file changes, dump its whole contents into the sorted list.
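
    A minimal sketch of the timestamp idea (load_cache and save_cache are hypothetical helpers, e.g. thin wrappers around Storable):

    use strict;

    my %mtimes = %{ load_cache() };    # hypothetical: file path => last seen mtime
    my %keywords;                      # master keyword set

    foreach my $file (glob "Sub*/*.html") {
        my $mtime = (stat $file)[9];
        next if exists $mtimes{$file} && $mtimes{$file} == $mtime;

        # New or changed file: since keywords are only ever added,
        # dump its whole contents into the set.
        open(my $fh, "<$file") or next;
        my $line = <$fh>;
        close($fh);
        $keywords{$_}++ foreach split /,/, (defined $line ? $line : "");

        $mtimes{$file} = $mtime;
    }

    save_cache(\%mtimes);              # hypothetical persist step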

      Hi
      It's a photo archive, so hopefully previous keywords won't become incorrect (carrots shouldn't disappear from the image!).
      Basically what i have done now is made a file 'UBER1' containing
      PicRefID: Keyword1, Keyword2 ...
      The keyword list is basically another file, 'UBER2', identical but without the PicRefIDs, sorted and uniq'd. If a keyword is added to a picture, it is appended to the relevant PicRefID line in 'UBER1', and, if it does not already appear in 'UBER2', inserted there in the correct position.
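
      A sketch of the UBER2 update step (assuming one keyword per line, kept sorted; a simple linear scan for clarity):

      use strict;

      # Insert $new_kw into the sorted, unique list in UBER2 unless it's already there.
      sub add_to_uber2 {
          my ($new_kw) = @_;

          open(my $in, "<UBER2") or die "UBER2: $!";
          chomp(my @kws = <$in>);
          close($in);

          return if grep { $_ eq $new_kw } @kws;    # already present

          open(my $out, ">UBER2") or die "UBER2: $!";
          print $out "$_\n" foreach sort @kws, $new_kw;
          close($out);
      }

      add_to_uber2('carrot');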

      It's all much much nicer now. All the data in the 100,000ish files is in one file, and the keyword list only needs occasional updating.

      Basically my initial plan was based on a tiny subset running well, and i wasn't expecting such massive overheads. now it only takes a minute or two, and that's more the upload of 15M of keywords than the script load (and then the upload!)
      Cheers to all that helped, much appreciated!!
      ant
