http://qs321.pair.com?node_id=1056645


in reply to regex for translation

I think your regex takes too long because, after a match, it tries to match the parameter as the first part again and fails (I'm not sure this is possible; it depends on where the regex continues after a match has happened, and I'm a little rusty on that part of the regex lore and too lazy to look it up). But to fail, it has to search through to the end of the file before it can be sure it failed, because of the evil /s modifier on your regex, which makes ".*" mean "rest of file" instead of just "rest of line".
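To make the effect of /s concrete, here is a minimal sketch. The sample text and the simplified pattern are made up for illustration and are not the regex from the original question. It shows that an unterminated marker fails quickly without /s, but with /s the lazy ".*?" keeps walking across lines until it finds some quote further down.

```perl
use strict;
use warnings;

# Hypothetical input: the marker on line 1 is never closed on that line,
# and a stray quote appears several lines later.
my $text = "__('unterminated\n"
         . ("filler line\n" x 3)
         . "a stray quote ' far away\n";

# Without /s, "." cannot cross the newline, so the attempt fails at line end.
my $matched_without_s = $text =~ /__\('.*?'/ ? 1 : 0;

# With /s, "." also matches newlines, so ".*?" keeps expanding through the
# following lines until it reaches the stray quote.
my $matched_with_s = $text =~ /__\('.*?'/s ? 1 : 0;
my $span_with_s    = $matched_with_s ? $+[0] - $-[0] : 0;

print "without /s: matched=$matched_without_s\n";
print "with /s:    matched=$matched_with_s, span=$span_with_s chars\n";
```

If no closing quote existed at all, the /s version would have to scan to the end of the input before giving up, which is the runaway behaviour suspected above.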

How to correct that depends. Maybe you can change the "__" in front of the parameter to something else. Or it might make sense to split on "__" and then work on the array piece by piece, avoiding the /g modifier on your regex. Or change the global matching to happen in a loop and make sure the matching starts after the parameter.
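A rough sketch of the split idea, under several assumptions: translate_stub() is a hypothetical stand-in for the real translation lookup, only single-quoted strings without parameters are handled, and nested translation strings are not considered at all.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real translation lookup.
sub translate_stub { my ($s) = @_; return uc $s }

sub translate_pieces {
    my ($text) = @_;

    # Split in front of every "__(" so each piece starts with at most one marker.
    my @pieces = split /(?=__\()/, $text;

    for my $piece (@pieces) {
        # Anchored at the start of the piece, no /g and no /s: if the marker
        # is malformed, the attempt is not retried further into the text.
        $piece =~ s/\A__\('((?:[^'\\]|\\.)*)'\)/translate_stub($1)/e;
    }
    return join '', @pieces;
}

print translate_pieces("<p>__('hello')</p>\n<b>__('world')</b>\n");
```

Because every piece is matched with an anchored pattern, a failed attempt costs at most the length of that one piece rather than the rest of the file.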

Re^2: regex for translation
by klayman (Initiate) on Oct 03, 2013 at 07:17 UTC
    The problem is that I need to go through the whole file, as it contains HTML markup, and that can contain translation strings anywhere in it. You are right about splitting it on the __ symbol; however, an issue arises if there is a translation string inside a translation string. I need to separate them somehow and make sure that a translation string, or anything within _(' .. '), doesn't contain __(' ... ') as well.

      As I said, you have more than one option. If my hypothesis is right, I would do the following:

      1) Make sure the hypothesis is right: if possible, call the code from a small test script that runs it a few hundred thousand times, and time that. Use two test files: one with a few translation strings at the beginning of a long file, the other with the same translation strings at the end of the file. If the first file takes much longer, you can be pretty sure that a runaway regex search is the culprit.
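Such a timing script could look roughly like the sketch below. The pattern is a simplified stand-in for the real regex, the sample markup is invented, and the run count is kept small so the script finishes quickly. Substitute your own regex and files, and raise $runs for a real measurement.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Time $runs repetitions of a global match over $text; returns elapsed seconds.
sub time_matches {
    my ($re, $text, $runs) = @_;
    my $start = time();
    for (1 .. $runs) {
        my $count = () = $text =~ /$re/g;   # list context: find all matches
    }
    return time() - $start;
}

# Stand-in pattern (assumption); replace with the regex from your code.
my $re = qr/__\('.*?[^\\']'\)/is;

my $marker = "<p>__('hello')</p>\n" x 3;
my $filler = "<div>plain markup without any marker</div>\n" x 2000;

my $early = $marker . $filler;   # translation strings at the start of the file
my $late  = $filler . $marker;   # the same strings at the end of the file

my $runs = 200;   # small demo value; use far more runs for real timings
printf "markers first: %.3fs\n", time_matches($re, $early, $runs);
printf "markers last:  %.3fs\n", time_matches($re, $late,  $runs);
```

Both inputs have the same length and the same number of markers, so a large gap between the two timings would point at the position of the markers rather than at the amount of work.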

      Another possibility would be to execute the code (either all of it or extracted parts, via a test script) with a newer perl version and use debugging features like use re 'debug';.
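As a small illustration of the re pragma mentioned above, the sketch below traces a single match. The sample string and pattern are placeholders. The pragma is lexically scoped, so only the regex compiled inside the block is traced, and the trace is written to STDERR.

```perl
use strict;
use warnings;

my $sample = "<p>__('hello')</p>";
my $matched;
{
    # Only regexes compiled inside this block produce debug output.
    use re 'debug';
    $matched = $sample =~ /__\('.*?'\)/ ? 1 : 0;
}
print "matched: $matched\n";
```

On a large input the trace gets long quickly, so it is usually easier to run it against a short extract that still shows the slow behaviour.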

      2) Change your program to do the search and replace in a loop. If you use a regex with the /g modifier in scalar context, it finds only one occurrence and stops, but it remembers where it left off (you can read that position with pos(), and you can assign to pos() to change where the search continues). What I would propose is something like this:

      my $result = "";
      # the two capture parens are gone; a non-capturing group keeps the
      # alternation between the '...' and "..." forms scoped
      while (m/__\((?:'.*?[^\\']'|".*?[^\\"]")(?:,,?.*?[^'"])?\)/gis) {
          my $end      = pos($_);        # offset just past the match
          my $start    = $-[0];          # offset where the match begins
          my $translen = $end - $start;
          # copy the untouched text in front of the match
          $result .= substr($_, 0, $start);
          my $transtext = substr($_, $start, $translen);
          # here $transtext has your complete translation string. Do the
          # substitution on $transtext; you can use the code you already
          # used or even simplify it
          $result .= $transtext;
          # remove the already translated part from $_
          substr($_, 0, $end) = '';
          # we reset search to begin at position 0 again
          pos($_) = 0;
      }
      $_ = $result . $_;

      Untested code but this should theoretically work. It has to parse the translation string twice, so it will naturally be twice as slow as your original simple regex. But it should not bring your webserver to its knees.

      Clarification update: "twice as slow" applies only to the parsing of the string, not to the complete regex execution. gettr() will still be called only once.
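The pos() behaviour that step 2 relies on can be checked in isolation. The following is a minimal sketch with an invented sample string and a simplified pattern: each scalar-context //g match records where it ended, and assigning to pos() changes where the next search starts.

```perl
use strict;
use warnings;

my $text = "a __('x') b __('y') c";
my @ends;

# In scalar context, each //g evaluation finds one match and leaves the
# offset just past it in pos().
while ($text =~ /__\('.*?'\)/g) {
    push @ends, pos($text);
}

# Assigning to pos() sets the starting point of the next //g search:
# start right after the first marker, so the next hit is the second one.
pos($text) = $ends[0];
my $next = $text =~ /__\('(.*?)'\)/g ? $1 : undef;

print "match ends: @ends\n";
print "next marker after the first: $next\n";
```

Note that a failed //g match resets pos() to undef, which is why the loop ends cleanly and a later search starts from the beginning unless pos() is set explicitly.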