Re^3: Need to speed up many regex substitutions and somehow make them a here-doc list

Re^4: Need to speed up many regex substitutions and somehow make them a here-doc list by xnous (Sexton) on Oct 02, 2022 at 19:42 UTC
hippo> I am intrigued by some of your s/// operations - perhaps you could confirm that these give your intended outputs?

Yes, you're right, the actual match/subs are non-greedy. I just wanted to provide a simpler and beautified version of my ugly script but the code structure is exactly the same.

Corion> Regardless of the performance problems, you may be interested in using a proper stemmer to create a search index. See Lingua::Stem.

I don't need (yet) a full stemming solution, which might not be the ideal tool as I'd have to override numerous substitutions.

hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

AnomalousMonk> Here's something that may address your needs more closely. As always, the fine details of regex definition are critical. I still have no idea as to relative speed :)

I tested your solution last but unfortunately it took 2'23" to complete.

I'll be doing more tests in the following days and report back with any progress. Thank you all for your wisdom.
Re^5: Need to speed up many regex substitutions and somehow make them a here-doc list by hv (Prior) on Oct 02, 2022 at 22:05 UTC
xnous> hv: Your hash lookup implementation runs twice as fast (34" vs 1'05" for my here-doc regexes). Another difference is it runs faster when operating on lines compared to words. sed seems unbeatable at 6 seconds.

Glad it's making some progress, at least. :)

It occurs to me now that since you do not need the `/.*/` "to end of line" behaviour, you also do not actually need to split the text on newlines: you could work directly on the full text. That would substantially reduce the number of ops you execute, which should give a further speedup.

The next step beyond that would be to combine the three substitutions into a single one, with a single hash. The idea here would be to concatenate the three regexps from the previous iteration, but wrapping the whole in `(?|...)` so the three distinct captures each get saved as `$1`, and make a single "master" lookup combining each of `%w1, %w2, %w3`. If we can combine "was/were" in there as well, I think we'd be starting to get properly competitive with the sed scripts.

It is also worth considering whether you need Unicode support (I have no idea whether your sed supports it). If you do not need Unicode, you should also be able to get further speed by adding `aa` to the regexp flags, like `my $re1 = qr{\b(@{[ join '|', reverse sort keys %w1 ]})\b}iaa;`
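A minimal sketch of the combined substitution described above, under stated assumptions: the contents of `%w1`, `%w2` and `%w3` are invented here for illustration (the real tables are defined earlier in the thread), all three branches use the same `\b...\b` context for simplicity, and `alternation` and `substitute_all` are hypothetical helper names. Note that the replacement does not preserve the case of the matched word.

```perl
use strict;
use warnings;

# Hypothetical lookup tables; the real %w1, %w2, %w3 are not shown here.
my %w1 = (colour    => 'color', centre => 'center');
my %w2 = (organise  => 'organize');
my %w3 = (travelled => 'traveled');

# Build one alternation per table. quotemeta is an added precaution
# (an assumption, not part of the original one-liner).
sub alternation {
    my ($table) = @_;
    return join '|', map { quotemeta($_) } reverse sort keys %$table;
}

my ($alt1, $alt2, $alt3) = map { alternation($_) } \%w1, \%w2, \%w3;

# Branch reset (?|...) makes each branch's capture group land in $1.
# In a real script each branch could keep its own surrounding context.
my $re = qr{(?|\b($alt1)\b|\b($alt2)\b|\b($alt3)\b)}iaa;

# Single "master" lookup; this assumes the keys of the three tables are distinct.
my %master = (%w1, %w2, %w3);

# Works on the whole text at once, with no splitting on newlines.
sub substitute_all {
    my ($text) = @_;
    $text =~ s/$re/$master{lc $1}/g;   # lc because /i may match mixed case
    return $text;
}

print substitute_all("The Colour of the centre.\nThey travelled to organise.\n");
```

Because the alternation is built from the hash keys themselves, every match is guaranteed to have an entry in `%master`, so no "exists" check is needed in the replacement.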
Re^5: Need to speed up many regex substitutions and somehow make them a here-doc list by LanX (Saint) on Oct 03, 2022 at 12:26 UTC
First of all, please don't reply to different sub-threads in one post, this makes following the discussion much harder and is damaging your cause.

Secondly, it's unlikely that the speed of the regex engine matters much when combined with the overhead of reading that amount of data. Processing data in RAM is now many orders of magnitude faster than reading it from a file system.

Benchmarking the whole workflow might give you a new perspective.

Cheers Rolf
(addicted to the Perl Programming Language :)
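One way to act on the benchmarking suggestion is to time the read phase and the substitution phase separately. The following is only a rough sketch: the temporary input file, the `%map` table and the `timed` helper are all hypothetical stand-ins, and real numbers will depend heavily on file size and disk caching.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);
use File::Temp qw(tempfile);

# Hypothetical stand-in for the real input file.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "the colour of the centre\n" x 100_000;
close $fh or die "close: $!";

# Run a code ref, print its wall-clock time, and return its result.
sub timed {
    my ($label, $code) = @_;
    my $t0     = time();
    my $result = $code->();
    printf "%-12s %.3fs\n", $label, time() - $t0;
    return $result;
}

# Phase 1: read the whole file into memory.
my $text = timed('read', sub {
    open my $in, '<', $path or die "open $path: $!";
    local $/;                       # slurp mode
    my $content = <$in>;
    close $in;
    return $content;
});

# Phase 2: apply the substitutions to the in-memory text.
my %map = (colour => 'color', centre => 'center');   # hypothetical table
my $re  = qr{\b(colour|centre)\b};

my $count = timed('substitute', sub {
    return ($text =~ s/$re/$map{$1}/g) || 0;          # number of replacements
});

print "replacements: $count\n";
```

Comparing the two timings shows whether the regex work or the I/O dominates for a given input.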
Re^6: Need to speed up many regex substitutions and somehow make them a here-doc list by xnous (Sexton) on Oct 03, 2022 at 17:56 UTC
LanX> First of all, please don't reply to different sub-threads in one post, this makes following the discussion much harder and is damaging your cause.

Apologies, I scoured the FAQ on this and couldn't understand how to properly reply. And I was thoroughly baffled when my last reply wasn't listed under the initial node. I still can't understand how my reply ended up elsewhere; if you don't mind me asking, what is the proper way to do it? Should I hit "reply" or "comment on"?
Re^7: Need to speed up many regex substitutions and somehow make them a here-doc list by pryrt (Abbot) on Oct 03, 2022 at 19:17 UTC
Re^8: Need to speed up many regex substitutions and somehow make them a here-doc list by LanX (Saint) on Oct 03, 2022 at 19:47 UTC
Re^7: Need to speed up many regex substitutions and somehow make them a here-doc list by LanX (Saint) on Oct 03, 2022 at 18:24 UTC
Re^8: Need to speed up many regex substitutions and somehow make them a here-doc list by pryrt (Abbot) on Oct 03, 2022 at 19:20 UTC

