PerlMonks  

Re^4: Need to speed up many regex substitutions and somehow make them a here-doc list

by xnous (Sexton)
on Oct 02, 2022 at 19:42 UTC ( [id://11147222] )


in reply to Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list
in thread Need to speed up many regex substitutions and somehow make them a here-doc list

hippo> I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

Yes, you're right, the actual matches/substitutions are non-greedy. I just wanted to provide a simpler, tidied-up version of my ugly script, but the code structure is exactly the same.

Corion> Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

I don't need (yet) a full stemming solution, which might not be the ideal tool as I'd have to override numerous substitutions.

hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

AnomalousMonk> Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

I tested your solution last but unfortunately it took 2'23" to complete. I'll be doing more tests in the following days and report back with any progress. Thank you all for your wisdom.

Re^5: Need to speed up many regex substitutions and somehow make them a here-doc list
by hv (Prior) on Oct 02, 2022 at 22:05 UTC

    hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

    Glad it's making some progress, at least. :)

    It occurs to me now that since you do not need the /.*/ "to end of line" behaviour, you also do not actually need to split the text on newlines: you could work directly on the full text. That would substantially reduce the number of ops you execute, which should give a further speedup.
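
    As a rough sketch of that point (the %w table and the sample text are made-up stand-ins, not the real data from this thread): when the pattern contains no /.*/ and no line anchors, one s///g over the whole text should give the same result as looping over lines, with far fewer ops.

    ```perl
    use strict;
    use warnings;

    # Hypothetical lookup table; keys are lowercase because we look up lc($1).
    my %w   = (colour => 'color', grey => 'gray');
    my $alt = join '|', map { quotemeta } reverse sort keys %w;
    my $re  = qr{\b($alt)\b}i;

    # Line-by-line version: one substitution call per line.
    sub by_line {
        my ($text) = @_;
        my @lines = split /^/, $text;      # split into lines, keeping the newlines
        s/$re/$w{lc $1}/ge for @lines;
        return join '', @lines;
    }

    # Whole-text version: a single substitution call over everything.
    sub whole_text {
        my ($text) = @_;
        $text =~ s/$re/$w{lc $1}/ge;
        return $text;
    }

    my $sample = "The colour grey\nGrey skies, COLOUR charts\n";
    print whole_text($sample);
    ```

    Both subs should return identical strings for input like this; the equivalence only holds while no pattern depends on where a line starts or ends.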

    The next step beyond that would be to combine the three substitutions into a single one, with a single hash. The idea here would be to concatenate the three regexps from the previous iteration, but wrapping the whole in (?|...) so the three distinct captures each get saved as $1, and make a single "master" lookup combining each of %w1, %w2, %w3. If we can combine "was/were" in there as well, I think we'd be starting to get properly competitive with the sed scripts.
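
    A minimal sketch of the combined form, assuming the three tables differ only in where the word boundary sits; the contents of %w1, %w2, %w3 and the name fix_text are invented for illustration. Inside (?|...) every alternative numbers its group from 1 again, so whichever branch matches, the capture lands in $1.

    ```perl
    use strict;
    use warnings;

    # Hypothetical tables standing in for %w1, %w2, %w3 (all keys lowercase).
    my %w1 = (colour  => 'color', grey => 'gray');   # whole words
    my %w2 = (isation => 'ization');                 # word endings
    my %w3 = (anaes   => 'anes');                    # word beginnings

    # Single "master" lookup; this assumes no key appears in more than one table.
    my %master = (%w1, %w2, %w3);

    # Build an alternation, longest-sorting keys first, metacharacters escaped.
    sub alternation {
        my ($table) = @_;
        return join '|', map { quotemeta } reverse sort keys %$table;
    }

    my ($alt1, $alt2, $alt3) = map { alternation($_) } \%w1, \%w2, \%w3;

    # Branch reset: each alternative's capture group is saved as $1.
    my $re = qr{(?|\b($alt1)\b|($alt2)\b|\b($alt3))}i;

    sub fix_text {
        my ($text) = @_;
        $text =~ s/$re/$master{lc $1}/ge;
        return $text;
    }

    print fix_text("The colour grey: organisation and anaesthesia"), "\n";
    # prints "The color gray: organization and anesthesia"
    ```

    Branch reset needs Perl 5.10 or later. If the same key could occur in two tables with different replacements, a single master hash would not be enough as written.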

    It is also worth considering whether you need Unicode support (I have no idea whether your sed supports it). If you do not need Unicode, you should also be able to get further speed by adding aa to the regexp flags, like this:

        my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}iaa;
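
    To make the trade-off concrete, here is a small sketch of what /aa gives up; it shows the change in matching rules, not the speed difference, and the helper name matches is made up. Under /aa, \w, \d, \s and friends match ASCII only, and /i will not fold a non-ASCII character onto an ASCII one. It needs Perl 5.14 or later.

    ```perl
    use strict;
    use warnings;

    my $kelvin = "\x{212A}";   # KELVIN SIGN, case-folds to "k" under Unicode rules
    my $alpha  = "\x{3B1}";    # GREEK SMALL LETTER ALPHA, a Unicode word character

    # Return 1 if the string matches the compiled regex, 0 otherwise.
    sub matches {
        my ($str, $re) = @_;
        return $str =~ $re ? 1 : 0;
    }

    print matches($kelvin, qr/k/i),   "\n";   # Unicode folding: matches
    print matches($kelvin, qr/k/iaa), "\n";   # ASCII-restricted: no match
    print matches($alpha,  qr/\w/),   "\n";   # Unicode \w: matches
    print matches($alpha,  qr/\w/aa), "\n";   # ASCII-only \w: no match
    ```

    If the input is plain ASCII, none of these differences can show up, which is the case where the flag is safe to add.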

Re^5: Need to speed up many regex substitutions and somehow make them a here-doc list
by LanX (Saint) on Oct 03, 2022 at 12:26 UTC
    First of all, please don't reply to different sub-threads in one post, this makes following the discussion much harder and is damaging your cause.

    Secondly, it's unlikely that the speed of the regex engine matters much once you add the overhead of reading that much data. Processing data in RAM is now many orders of magnitude faster than reading it from a file system.

    Benchmarking the whole workflow might give you a new perspective.
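
    One way to do that, sketched with Time::HiRes: time each stage separately and compare. The helper name timed, the sample data, and the single substitution are placeholders; the in-memory filehandle only keeps the sketch self-contained and should be swapped for the real input file to see actual I/O cost.

    ```perl
    use strict;
    use warnings;
    use Time::HiRes ();

    # Run a code ref and return (elapsed seconds, its result).
    sub timed {
        my ($code) = @_;
        my $start  = Time::HiRes::time();
        my $result = $code->();
        return (Time::HiRes::time() - $start, $result);
    }

    # Placeholder input; replace the \$input handle with a real path.
    my $input = "The colour grey\n" x 100_000;

    my ($t_read, $text) = timed(sub {
        open my $fh, '<', \$input or die "open: $!";
        local $/;                       # slurp mode: read everything at once
        return scalar <$fh>;
    });

    my ($t_subst, $out) = timed(sub {
        (my $copy = $text) =~ s/\bcolour\b/color/g;
        return $copy;
    });

    printf "read: %.3fs  substitute: %.3fs\n", $t_read, $t_subst;
    ```

    If the read stage dominates on the real data, tuning the regexes further will not move the total much.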

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      First of all, please don't reply to different sub-threads in one post, this makes following the discussion much harder and is damaging your cause.

      Apologies, I scoured the FAQ on this and couldn't understand how to properly reply. And I was thoroughly baffled when my last reply wasn't listed under the initial node. I still can't understand how my reply ended up elsewhere. If you don't mind me asking, what is the proper way to do it? Should I hit reply or comment on?

        Should I hit 'reply' or 'comment on'

        To clarify what LanX said: You use [comment on] if you want to reply to the top visible node of the thread you are currently reading. But you use [reply] next to a particular comment ("node", or what you might think of as a sub-post) if you want to reply to that node instead of the top node.

        Whichever of the links you choose, the forum will show the node you are replying to directly above your edit-the-new-post window, so you know where exactly in the conversation you are.

        So, when I am looking at your Re^6: Need to speed up many regex substitutions and somehow make them a here-doc list, I clicked on [comment on] to reply to your "Apologies" post -- and I can see the text of that post while I am creating my answer. (Once I [preview] my post, I lose that context and instead see the rendered version of my post above the editing box before [create]-ing it.) If I had wanted to contradict LanX directly, I could have clicked on [reply] next to his post in the threaded view (and in fact will be, soon).

        Use [comment on] in the bottom bar of the specific post you're replying to.

        This is a fully threaded forum, and every node of the tree displays the subtree of all replies made to it. But comments are always attached to one specific node.

        If you're used to list-oriented boards, this might be confusing at first.

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery