Growing strings in search

b4swine has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Growing strings in search by choroba (Cardinal) on Apr 15, 2020 at 13:56 UTC
By "growing", do you mean you have variables that contain the strings and their content gets longer over time? Or do you need to read the strings from outside (a file, socket)? The answer also depends on the regex. For a simple regex, just starting where you left off might be enough (e.g. `qr/X/`), but even for a bit more complex regex `qr/abc/`, you can't just continue with the newly added characters, as the original string might have end in "ab" and the new character might be "c" that completes the match. For even more complex regexes like `qr/a.b/` you might need to return unknown number of characters to the left (i.e. to the last "a"). `map{substr$_->[0],$_->[1]\|\|0,1}[\\|\|{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]`	[reply] [d/l] [select]
Re^2: Growing strings in search by b4swine (Pilgrim) on Apr 15, 2020 at 15:47 UTC
The strings are growing because data is coming in from the outside. The regexes are so complicated that they are software generated. The only other possibility seems to be to try to understand the regex and see if it can be expressed as something simpler, but that sounds too horrible to contemplate.	[reply]
Re: Growing strings in search by Corion (Patriarch) on Apr 15, 2020 at 19:31 UTC
This sounds to me like a network packet scanner. If your problem is really like scanning network packets with regular expressions, you might want to take a look at Intels Hyperscan regex engine. It is not as powerful as Perls regular expressions, but it is geared towards matching lots of strings against lots of regular expressions in parallel.	[reply]
Re: Growing strings in search by haukex (Archbishop) on Apr 15, 2020 at 15:31 UTC
Is there any chance you can provide an SSCCE? In theory, it's possible to both check and change the regex engine's position via pos, but as choroba said, in lots of cases you'd actually have to backtrack an unknown amount, and that may not be feasible with pos - depending on the specific case.	[reply]
Re: Growing strings in search by roboticus (Chancellor) on Apr 15, 2020 at 18:44 UTC
b4swine: You can't do it just using the perl regex engine. You might be able to turn your regex into a state machine, though, and then you could your strings through it character by character to find which one(s) match and which one(s) fail. Building a state machine from a regex isn't terribly difficult, but there are a couple sticking points. The main problem is that you'll have a tradeoff of speed vs. memory consumption. For example, if your regex has branches and multiple capture groups in it, then you may have a lot of bookkeeping to track possible starts/stops of capture groups. You may be able to considerably simplify things by making your state machine simpler so it can recognize when a match happens but without tracking your capture groups, and then you can turn the normal perl regex engine loose on the full string to get your capture groups or do a more refined match. I got distracted by your question and put together a simplified demo of the technique for fun. I'm still tuning it a little bit and I'll post it as a secondary reply later when I'm happy enough with it. If you post the regex you're using, I can tweak my demo accordingly. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply]
Re^2: Growing strings in search by NERDVANA (Deacon) on Apr 16, 2020 at 02:01 UTC
A particularly good tool for this is Ragel http://www.colm.net/open-source/ragel/ It compiles character-fed state machines and is a more modern and flexible design than lex or flex.	[reply]
Re^2: Growing strings in search by tybalt89 (Monsignor) on Apr 15, 2020 at 21:28 UTC
Sounds like you are describing "flex" or "lex". That makes this problem simple, just run a "flex" process against each individual string doing incremental reads -> problem solved :)	[reply]
Re^3: Growing strings in search by roboticus (Chancellor) on Apr 15, 2020 at 23:03 UTC
tybalt89: Yes, it's very much like that. ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply]
Re^3: Growing strings in search by AnomalousMonk (Archbishop) on Apr 16, 2020 at 03:00 UTC
I've never done anything significant with `f?lex` and it's been a long, long time since I've even read anything about its/their capabilities, but don't they have the same inherent "time-forward", if you will, orientation that regex compilers have? IOW, if you had a flex process that parsed backwards, it would be easy to use, but isn't the big trick to design such a process in the first place? Can you expand on the capabilities of flex in this regard? Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: Growing strings in search by AnomalousMonk (Archbishop) on Apr 15, 2020 at 19:27 UTC
If one could run the match engine "backwards" and match a string in reverse, i.e., from end to beginning, that might solve your problem. Unfortunately, there's no `m//r` modifier that I'm aware of. One might do a match against a reversed string: `c:\@Work\Perl\monks>perl -wMstrict -le "my $s = ''; my $r = ''; my @add = qw(a b c X d e Y f g h); ;; my $xeger = qr{ \A Y .* X }xms; ;; while (@add){ print qq{'$s' }, $r =~ $xeger ? 'MATCH' : \"no match\"; $r = qq{$add[0]$r}; $s .= shift @add; } " '' no match 'a' no match 'ab' no match 'abc' no match 'abcX' no match 'abcXd' no match 'abcXde' no match 'abcXdeY' MATCH 'abcXdeYf' no match 'abcXdeYfg' no match` [download] Incrementally building and maintaining a reversed string for each "forward" string might not be too expensive even for hundreds of strings of thousands of characters. Unfortunately, you now have the problem of finding some way to persuade the "software" to generate an arbitrary regex backwards, so that, e.g., `X .* Y` becomes `Y .* X` (but that's just a small matter of programming, right? :). BTW: Is it possible for you to suppress re-compilation of the regex on every match, perhaps with the `m//o` modifier? That might speed up matching in general and at least alleviate your problem. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: Growing strings in search by belg4mit (Prior) on Apr 20, 2020 at 13:33 UTC
One precompiles an expression using qr// to avoid repeated compilations later. `my $RE = qr/foo.+?bar/; foreach (@str) { warn "Huzzah! $str" if m/$RE/; }` [download] `-- In Bob We Trust, All Others Bring Data.`	[reply] [d/l]
Re^3: Growing strings in search by AnomalousMonk (Archbishop) on Apr 21, 2020 at 00:35 UTC
What I had in mind with my comment about the `m//o` modifier is that b4swine might be receiving a complex regex and then interpolating it a la `m/...$re.../` during actual matching (as, indeed, you show in your example). This might be done if some additional match condition(s) were to be imposed on every received regex, e.g., adding a `\z` anchor at the end to force end-of-string matching. Of course, I don't know that anything like this is actually happening; it was just a thought. For regexes interpolated into `m//` matches, use of the `/o` modifier seems beneficial in some cases, apparently depending on Perl version. (I know that work is being done continuously to refine and optimize regex matching.) For instance, under two older Perl versions that I have access to ATM (ActiveState 5.8.9 and Strawberry 5.14.4.1), benchmarking shows a significant benefit in the newer Perl for using `/o` when interpolating into an `m//` match (assuming this is a valid benchmark, and benchmarks can be tricky :). `use strict; use warnings; use Benchmark qw(cmpthese); print "perl version $] \n"; my $rx = qr{X}xms; cmpthese(-1, { 'qr_bound' => sub { 'Y' =~ $rx ; }, 'qr_interp' => sub { 'Y' =~ / $rx / ; }, 'qr_interp_o' => sub { 'Y' =~ / $rx /o; }, 'm_empty' => sub { 'Y' =~ //; }, 'm_empty_o' => sub { 'Y' =~ //o; }, });` [download] Output: c:\@Work\Perl\monks\belg4mit>perl cmp_qr_compilation_1.pl perl version 5.008009 Rate qr_interp qr_interp_o qr_bound m_empty + m_empty_o qr_interp 2880477/s -- -9% -43% -55% + -67% qr_interp_o 3181791/s 10% -- -37% -51% + -64% qr_bound 5078627/s 76% 60% -- -21% + -42% m_empty 6452035/s 124% 103% 27% -- + -26% m_empty_o 8775008/s 205% 176% 73% 36% + -- c:\@Work\Perl\monks\belg4mit>perl cmp_qr_compilation_1.pl perl version 5.014004 Rate qr_bound qr_interp m_empty m_empty_o q +r_interp_o qr_bound 1004609/s -- -70% -76% -80% + -92% qr_interp 3378700/s 236% -- -20% -33% + -74% m_empty 4237856/s 322% 25% -- -16% + -68% m_empty_o 5068698/s 405% 50% 20% -- + -61% qr_interp_o 13135439/s 1208% 289% 210% 159% + -- [download] The `qr//` comparisons are between failing matches because that's what I imagine would happen most often in the actual application. I just threw in some successful `//` null matches out of curiosity. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re: Growing strings in search by roboticus (Chancellor) on Apr 16, 2020 at 19:33 UTC
b4swine: Here's the demo I put together. It still has shortcomings, but I got tired of playing with it, so I thought I'd post it here in case it would be interesting to you or another perlmonk at some time. It has a few shortcomings due to me wanting to keep it a short demo, and I'm not happy about how the abstractions are factored, but I've got other things I want to do, so I'm leaving it as-is for you or anyone to play with. It does, however, demonstrate what I described in my earlier note. Code: Read more... (10 kB) Sample output: Read more... (3 kB) ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]


good chemistry is complicated, and a little bit messy -LW
	PerlMonks