Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Simply Too Slow Find and Replace

by guopan (Acolyte)
on Jun 02, 2002 at 22:40 UTC ( [id://171079]=perlquestion: print w/replies, xml ) Need Help??

guopan has asked for the wisdom of the Perl Monks concerning the following question:

Good afternoon,

I have to 'standardize' 1 million-ish records with demographic data that sit (alas) in SQL server.

The 'old-way' processes 6000 records/per hour on a good day.

Because I know Perl is the right way to go, but think that SQL server is a cow, I'm using this baby-Perl code within an activeX script within a DTS package to work with the list of words to find and replace (250 lines. example: ADMINISTRATION=ADMIN) and text file of demographics (1000000 lines example: C00006 DR. TOM SMITH VETERANS ADMINISTRATION BUFFALO NEW YORK) outside of the db environment. It's still too slow, though.

Target speed: 1 million records in an hour!

#*********************** # Perl ActiveX Script #*********************** sub Main() { use Win32; #allows for MsgBox functionality use OLE; $file1 = "c:/foo/sub_list.txt"; $file2 = "c:/foo/cust_names_city.txt"; $file3 = ">c:/foo/cust_names_city_new.txt"; open (SUBLIST, $file1) || die MsgBox("Can't open file"); open (CUSTLIST, $file2) || die MsgBox("Can't open file"); open (NEWCUSTLIST, $file3) || die MsgBox("Can't open file"); @pairs = <SUBLIST>; @names = <CUSTLIST>; foreach $pair (@pairs) { ($key,$value) = split (/=/, $pair); chomp ($value); foreach $name (@names) { $name =~s /\s$key/$value/ig; @newrecords = (@newrecords,"$name"); } } foreach $newrecord (@newrecords) { print NEWCUSTLIST $newrecord; } close (SUBLIST); close (CUSTLIST); close (NEWCUSTLIST); return 0; # DTSTaskExecResult_Success; } ******************************************

This humble and frustrated initiate thanks you in advance for your help!

guopan@earthlink.net

Replies are listed 'Best First'.
Re: Simply Too Slow Find and Replace
by samtregar (Abbot) on Jun 02, 2002 at 23:24 UTC
    It's hard to know exactly, but I think your problem is here:

    foreach $pair (@pairs) { ($key,$value) = split (/=/, $pair); chomp ($value); foreach $name (@names) { $name =~s /\s$key/$value/ig; @newrecords = (@newrecords,"$name"); } }

    This code recompiles the regular expression on every execution. Perl doesn't know that it only changes on every pass through the outer loop. I would replace it with something like this:

    foreach $pair (@pairs) { ($key,$value) = split (/=/, $pair); chomp $value; # compile pattern to match $key with qr// my $match = qr/\b\Q$key\E\b/; # serach and replace names and push onto @newrecords foreach (@names) { s/$match/$value/gi; push @newrecords, $_; } }

    This version also incorporates the eariler good advice concerning \b, \Q, \E and push().

    If that doesn't help you, consider slowing down and analyzing the performance of your program. Use Devel::DProf and/or Benchmark to get a better idea of how your code performs and where the time is being spent. Performance problems are very hard to solve by guessing.

    -sam

Re: Simply Too Slow Find and Replace
by tadman (Prior) on Jun 02, 2002 at 22:59 UTC
    You should be able to do it if you're working with plain text files. Still, your description doesn't really say what you want to do in specific terms. Just that you're searching and replacing.

    So, here's a few quick observations:
    @newrecords = (@newrecords,"$name");
    Could be replaced with:
    push(@newrecords, $name);
    Apart from that, you seem to have the right idea. Just make sure your substitution table doesn't contain any "fancy" characters such as brackets or question marks and so forth. If it might, probably best to handle them better:
    $name =~ s/\s\Q$key\E/$value/ig;
    The \Q and \E prevent any of those characters from being treated as special. If you had a $key of 'OPEN?' then it would match both 'OPE' and 'OPEN' which is clearly not what you want. Also, if you are matching words, this is probably a better regex:
    $name =~ s/\b\Q$key\E\b/$value/ig;
    The \b means to match on a word boundary. This way you can replace things like the "HAM" in "RALPH USED A HAM AS A HAMMER" without substituting on "HAMMER".
Re: Simply Too Slow Find and Replace
by Aristotle (Chancellor) on Jun 03, 2002 at 03:49 UTC
    All my comments are in the.. uh, comments. In the code. *grin*
    #*********************** # Perl ActiveX Script #*********************** use Win32; #allows for MsgBox functionality use strict; ## <---- get in the habit <----- ## "die MsgBox()" works, but it's more of a coincidence; ## this is better sub msgbox_die { my $problem = shift; MsgBox $problem; die $problem; } ****************************************** sub Main() { my $file1 = "c:/foo/sub_list.txt"; my $file2 = "c:/foo/cust_names_city.txt"; ## you don't want the > in the filename - just a matter of cleanliness my $file3 = "c:/foo/cust_names_city_new.txt"; my @search; my @replace; open (SUBLIST, $file1) || msgbox_die("Can't open file"); ## fetch only one line at a time rather than slurping the entire file +at once while(<SUBLIST>) { ## use a third parameter in split: ## specifies how in how many parts we want to break the string ## split() can quit looking after that number, ## and it also prevents loss of data my ($key,$value) = split (/=/, $_, 2); chomp ($value); push @search, qx/\b\Q$key/; ## put a precompiled regex in the arra +y push @replace, $value; } close (SUBLIST); open(CUSTLIST, $file2) || msgbox_die("Can't open file"); open(NEWCUSTLIST, ">$file3") || msgbox_die("Can't open file"); ## you +want the > here while(<CUSTLIST>) { ## ditto slurping my $i = 0; for my $re (@search) { ## go through all the stored regexes, s/$re/$replace[$i]/ig; ## and apply them, $i++; ## keeping track of which string we are replacing with } print NEWCUSTLIST $_; ## no need to keep the result in memory } close (CUSTLIST); close (NEWCUSTLIST); return 0; # DTSTaskExecResult_Success; } ******************************************
    ____________
    Makeshifts last the longest.
Re: Simply Too Slow Find and Replace
by dws (Chancellor) on Jun 02, 2002 at 23:49 UTC
    I'm thinking there's a problem with
    foreach $pair (@pairs) { ($key,$value) = split (/=/, $pair); chomp ($value); foreach $name (@names) { $name =~s /\s$key/$value/ig; @newrecords = (@newrecords,"$name"); } }
    Assuming M pairs and N names, won't this end up with NxM newrecords? I suspect that you either want to verify that the subsitution fired before appending to @newrecords, or, if multiple substitutions per record are possible, defer the appending to @newrecord until all @names are processed. In either case, you'll end up with M newrecords instead of NxM.

Re: Simply Too Slow Find and Replace
by theorbtwo (Prior) on Jun 03, 2002 at 00:44 UTC

    Also, you seem to use OLE; without needing it anywhere.

    You don't want to die MsgBox; that'd take the return value of MsgBox, and use it as a reason to die.

    But most importantly, you're making two passes over the data where only one is neccessary.

    Instead of @newrecords = (@newrecords,"$name");, just print NEWCUSTLIST $name;, and get rid of your second for loop.

    For that matter, instead of reading all of SUBLIST into @pairs, do while ($pair = <SUBLIST>) {

    The idea is that less passes is better, and less data in memory at once is better.


    We are using here a powerful strategy of synthesis: wishful thinking. -- The Wizard Book

Re: Simply Too Slow Find and Replace
by thor (Priest) on Jun 03, 2002 at 11:51 UTC
    Sometimes, you have to take the whole picture into account. In this case, it seems like you are doing all of your work in flat files, but you will probably have to upload your final results back into the database, which itself takes time. Which will be faster? Who knows. Also, how often will have to standardize this data? If you're only doing it once, your code is throw away and it could be argued that time spent optimizing could be time better spent. If you're not only doing it once, the larger question that begs to be asked is: why? If you don't have control over how the records are inserted into the db, fair enough. However, if you do, you should place a filter on the front end to keep your records in canonical form by doing these substitutions before they are even inserted into the db. This keeps you from having to iterate over the entire database, most of which you have already "fixed".

    thor

Re: Simply Too Slow Find and Replace
by ariels (Curate) on Jun 03, 2002 at 06:43 UTC

    The chief cause of your program wasting time is that it recompiles every pattern for every replacement. It's better to do it just once, before processing commences. The correct abstraction for a substitution is a code ref (an "anonymous sub"). You pre-compile all your subs, then blast each string through them all.

Re: Simply Too Slow Find and Replace
by Beatnik (Parson) on Jun 03, 2002 at 06:53 UTC
    Besides the stuff already covered... do you need the case insensitive matching? On that amount of records, using i when you dont need it can cost you quite alot of cycles.

    Greetz
    Beatnik
    ... Quidquid perl dictum sit, altum viditur.
Re: Simply Too Slow Find and Replace
by agentv (Friar) on Jun 03, 2002 at 05:22 UTC

    ...I'm thinking that there might be a little increase in speed here:

    ($key,$value) = split (/=/, $pair); chomp ($value);

    Instead of a call to split() and another to chomp(), it might work just to use a regular expression to get the terms.

    ($key,$value) = ($pair =~ /^([^=]+)=(.*)$/);

    I don't know if it helps in a significant way, but it's something I would have instinctively tried. (Of course, if you spend half a day bulletproofing the regular expression, it probably wouldn't have been worth it.)

    ---v

Re: Simply Too Slow Find and Replace
by mandog (Curate) on Jun 03, 2002 at 12:46 UTC

    thor has given you the best advice in this thread. You should look again here

    If I understand your problem correctly, you should also seriously consider whether or not Perl is the tool. SQL is designed precisely for this job. Even with 250 substituions to make, you can meet your million record per hour goal with the right indexes and halfway modern hardware.

    Also for the non Perl people that follow you this:
    UPDATE my_table SET name='Admin' WHERE name='Administrator';
    ..is easier to follow than the Perl.

    Another (maybe better?) more perlish way to take advantage of the indexes in your database would be to use DBI to create a cached SQL statement with placeholders. You could loop through, stuffing items from your list of substitutions into the placeholders..



    email: mandog
      And that will only work where the entire column "name" is "Admin". Whereas the user has described a case where the entire "name" contains "Admin" -- and your solution would do absolutely nothing to fix the data.
        But if he can use sql, surely you would agree that it would be much faster.
Re: Simply Too Slow Find and Replace
by I0 (Priest) on Jun 03, 2002 at 01:15 UTC
    foreach $pair ( @pairs ){ ($key,$value) = split (/=/, $pair); chomp ($value); $s{lc $key} = $value; } foreach $name ( @names ){ $name =~s/(\s(\w+))/$s{lc $2}||$1/ieg; }
      Yow! That's probably going to be a heck of a lot slower than what's he's doing already. That will do a substitution on every pass, whether it contains a match or not!

      -sam

Re: Simply Too Slow Find and Replace
by Sifmole (Chaplain) on Jun 03, 2002 at 11:48 UTC
    Since there were many answers on how to improve your regex handling, I thought perhaps a rearranging of the structure might help. This idea will be greatly affected by the size of the @pairs array; also it is just an idea and I haven't benchmarked it.
    # Get a hash of your translations, for quick lookup %pair_lookup = map { chomp; split(/=/, $_, 2); } <SUBLIST>; foreach $name (<CUSTLIST>) { print NEWCUSTLIST // split the line into words and check if there // is a translation for it. If there is, return // the translation otherwise return the original join(" ", map { (defined $pair_lookup{$_}) ? $pair_lookup{$_} : $_; } split (/ /, $name, 2) ); }
    Again this was just on the idea that there might be a way other than via a regex, and that it might be quicker.
Re: Simply Too Slow Find and Replace
by Abigail-II (Bishop) on Jun 03, 2002 at 15:13 UTC
    A problem I see is that for each of your lines you are going to perform each of your substitutions. That is, with N names and M pairs, you are doing N * M substitutions.

    Is that really necessary? Is there more about the data, be it the names or the pairs that can be used to speed up the process? Do you really need the N * M substitutions, or are you just replacing words with words?

    Abigail

Re: Simply Too Slow Find and Replace
by paulbort (Hermit) on Jun 03, 2002 at 15:51 UTC
    I think you're losing speed in the DTS/ASP/Perl mess. It's easy enough to dump a SQL Server table to a text file. Use that as the input for the Perl script. Write the output in the same format and use BULK INSERT to load it into the table. Half an hour learning these two steps will save a a lot of the time spent pushing records through DTS/ASP. Even 1 million records times 400 bytes is only 400Mb, so disk space for the text files shouldn't be a problem.

    ---
    Spring: Forces, Coiled Again!
Re: Simply Too Slow Find and Replace
by krazken (Scribe) on Jun 03, 2002 at 18:27 UTC
    I would go about it a little different. I would read your find/replace table into a hash with the find value as the key of the hash. Depending on the amount of memory on your box you are running this with, you may need to tie this has to a file in order to accommodate the data. With find/replace type stuff, I have found better performance with DB_BTREE instead of DB_HASH. Using this approach on a 4 processor 4GB ram server, I have processed close to 12 million an hour. By doing this way you would have something like:
    #pseudo code open lookup file while(<lookup>) { chomp; my ($find,$replace)=split /=/; $hash{$find}=$replace; } close lookup open file to work on while(file) { my $field_to_lookup_on=split/unpack whatever if(exists $hash{$field_to_lookup}) { $new_field=$hash{$field_to_lookup}; } else { $new_field=$field_to_lookup_on; } put field back in record print record.
    This method does away with looping through arrays and puts it into a little better of a data structure. let me know if this doesn't make sense. kudos. krazken

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://171079]
Approved by samtregar
Front-paged by graff
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (5)
As of 2024-04-18 02:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found