
Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list

by hippo (Bishop)
on Oct 02, 2022 at 15:08 UTC (id://11147218)


in reply to Re^2: Need to speed up many regex substitutions and somehow make them a here-doc list
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

Thanks for providing all this, that gives us a lot more to work on. I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

    $ echo Washington werewolves are wasteful | perl -pe 's/w(as|ere)/be/gi;'
    behington bewolves are beteful
    $ echo No work was carried out on Thursday as that as a day of rest | perl -pe 's/\s.*work.*/ work /gi;'
    No work
    $ echo Did you swallow all that bacon | perl -pe 's/\s.*allow.*/ allow /gi;'
    Did allow

As there's no point optimising code which doesn't do what you want it would be good to clear this sort of thing up first.
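To make the first surprise concrete, here is a minimal sketch that contrasts the unanchored pattern with a whole-word variant. The \b anchors and the sub names loose and whole_word are assumptions for illustration, not the original poster's code.

```perl
use strict;
use warnings;

# Unanchored pattern from the post: also matches inside longer words.
sub loose {
    my ($s) = @_;
    $s =~ s/w(as|ere)/be/gi;
    return $s;
}

# Hypothetical stricter variant: \b restricts the match to whole words.
sub whole_word {
    my ($s) = @_;
    $s =~ s/\b(?:was|were)\b/be/gi;
    return $s;
}

for my $text ('Washington werewolves are wasteful',
              'They were late and it was cold') {
    printf "%-11s %s\n", 'loose:',      loose($text);
    printf "%-11s %s\n", 'whole_word:', whole_word($text);
}
```

With the anchors in place, the first sentence passes through unchanged while standalone "was" and "were" are still rewritten.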


🦛


Replies are listed 'Best First'.
Re^4: Need to speed up many regex substitutions and somehow make them a here-doc list
by xnous (Sexton) on Oct 02, 2022 at 19:42 UTC
    hippo> I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

    Yes, you're right, the actual match/subs are non-greedy. I just wanted to provide a simpler and beautified version of my ugly script, but the code structure is exactly the same.

    Corion> Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

    I don't need (yet) a full stemming solution, which might not be the ideal tool as I'd have to override numerous substitutions.

    hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

    AnomalousMonk> Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

    I tested your solution last but unfortunately it took 2'23" to complete. I'll be doing more tests in the following days and report back with any progress. Thank you all for your wisdom.

      hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

      Glad it's making some progress, at least. :)

      It occurs to me now that since you do not need the /.*/ "to end of line" behaviour, you also do not actually need to split the text on newlines: you could work directly on the full text. That would substantially reduce the number of ops you execute, which should give a further speedup.
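A rough sketch of the difference, assuming a small stand-in for %w1 (the real table is not shown in this part of the thread) and an in-memory string instead of the real input file:

```perl
use strict;
use warnings;

# Hypothetical lookup table; the thread's real %w1 is larger.
my %w1  = (was => 'be', were => 'be');
my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}i;

my $text = "It was late.\nThey were tired.\n";

# Per-line approach: one s///g call for every line.
my $by_line = '';
for my $line (split /^/, $text) {
    $line =~ s/$re1/$w1{lc $1}/ge;
    $by_line .= $line;
}

# Whole-text approach: a single s///g call over the full buffer.
(my $whole = $text) =~ s/$re1/$w1{lc $1}/ge;

print $whole;
print $by_line eq $whole ? "same result\n" : "results differ\n";
```

Both approaches give the same output here because none of the patterns depends on where a line ends; that equivalence would not hold for patterns using /.*/ to reach the end of the line.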

      The next step beyond that would be to combine the three substitutions into a single one, with a single hash. The idea here would be to concatenate the three regexps from the previous iteration, but wrapping the whole in (?|...) so the three distinct captures each get saved as $1, and make a single "master" lookup combining each of %w1, %w2, %w3. If we can combine "was/were" in there as well, I think we'd be starting to get properly competitive with the sed scripts.
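The combination described above might look roughly like this. The three tables and their contents are invented for illustration, and the branch-reset construct (?|...) needs Perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical tables standing in for the thread's %w1, %w2, %w3.
my %w1 = (was     => 'be',    were     => 'be');
my %w2 = (running => 'run',   swimming => 'swim');
my %w3 = (mice    => 'mouse', geese    => 'goose');

# One "master" lookup combining all three tables.
my %master = (%w1, %w2, %w3);

# Build one branch per table, each with its own capture group.
my @branches;
for my $table (\%w1, \%w2, \%w3) {
    push @branches, '\b(' . join('|', reverse sort keys %$table) . ')\b';
}
my $alternation = join '|', @branches;

# (?|...) resets group numbering per branch, so every capture lands in $1.
my $re = qr{(?|$alternation)}i;

my $text = 'The geese were running';
(my $out = $text) =~ s/$re/$master{lc $1}/ge;
print "$out\n";
```

Without the branch reset, the captures would be numbered $1, $2 and $3, and the replacement would have to work out which one was set.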

      It is also worth considering whether you need Unicode support (I have no idea whether your sed supports it). If you do not need Unicode, you should also be able to get further speed by adding aa to the regexp flags, like:

        my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}iaa;
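Before adding the flag it may help to see what it changes semantically. This small sketch does not measure speed; it only shows two matches that succeed under Unicode rules and fail under /aa (available since Perl 5.14). The variable names are illustrative.

```perl
use strict;
use warnings;

my $kelvin = "\x{212A}";   # KELVIN SIGN, case-folds to "k" under Unicode rules
my $alpha  = "\x{3B1}";    # GREEK SMALL LETTER ALPHA, a Unicode word character

my $k_unicode = $kelvin =~ /k/i   ? 1 : 0;   # Unicode case folding applies
my $k_ascii   = $kelvin =~ /k/iaa ? 1 : 0;   # ASCII/non-ASCII folding forbidden
my $w_unicode = $alpha  =~ /\w/   ? 1 : 0;   # \w matches any Unicode word char
my $w_ascii   = $alpha  =~ /\w/aa ? 1 : 0;   # \w restricted to ASCII

print "k/i=$k_unicode k/iaa=$k_ascii \\w=$w_unicode \\w/aa=$w_ascii\n";
```

If input like this can never occur in the data, the restriction costs nothing; if it can, the results of the substitutions may change.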

      First of all, please don't reply to different sub-threads in one post, this makes following the discussion much harder and is damaging your cause.

      Secondly, it's unlikely that the speed of the regex engine matters much once you add the overhead of reading that amount of data. Processing data in RAM is now many orders of magnitude faster than reading it from a file system.

      Benchmarking the whole workflow might give you a new perspective.
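One hedged way to do that is to time the read phase and the substitution phase separately. The sketch below uses a generated temporary file and a single stand-in substitution, both of which are assumptions; the real input and rule set would go in their place.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real input file.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "It was late and they were tired.\n" x 10_000;
close $fh or die "close: $!";

# Phase 1: read the whole file into memory.
my $t0 = time;
open my $in, '<', $path or die "open $path: $!";
my $text = do { local $/; <$in> };
close $in;
my $t1 = time;

# Phase 2: run the substitution; s///g returns the number of replacements.
my $count = ($text =~ s/\b(?:was|were)\b/be/gi);
my $t2 = time;

printf "read: %.4fs  substitute: %.4fs  (%d substitutions)\n",
    $t1 - $t0, $t2 - $t1, $count;
```

Comparing the two figures on the real data shows which phase is worth optimising first.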

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        First of all, please don't reply to different sub-threads in one post, this makes following the discussion much harder and is damaging your cause.

        Apologies, I scoured the FAQ on this and couldn't understand how to reply properly. I was also thoroughly baffled when my last reply wasn't listed under the initial node, and I still can't understand how it ended up elsewhere. If you don't mind me asking, what is the proper way to do it? Should I hit reply or comment on?
