| [reply] |
I sure was hoping someone would be able to suggest a regexp secret that I had not yet learned. I was hoping there would be some way of doing this. I may have to just pre-parse looking for the false positives, and exchange them temporarily for a marker of some sort before parsing a second time. I'm not even sure if that would work. I'll have to ponder that some more. I need to be able to reorder the sentences following a specific ruleset and in a specific order, by order of appearance in the sentence.
Sigh. Too bad regex can't do everything!
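For what it's worth, the pre-parse idea could be sketched roughly as below. This is only a minimal sketch under assumptions of my own: the false positives are taken to be periods inside double-quoted spans, the marker is a control character assumed never to occur in the input, and split_with_masking is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical marker: any character that cannot occur in the input.
my $MARK = "\x{1}";

sub split_with_masking {
    my ($text) = @_;

    # Pass 1: hide periods inside double-quoted spans (the assumed
    # class of false positives) behind the marker.
    (my $masked = $text) =~ s{("[^"]*")}{ (my $q = $1) =~ s/\./$MARK/g; $q }ge;

    # Pass 2: split after any remaining period that is followed by whitespace.
    my @parts = split /(?<=\.)\s+/, $masked;

    # Pass 3: put the hidden periods back.
    s/\Q$MARK\E/./g for @parts;

    return @parts;
}

print "$_\n" for split_with_masking('One. "Two. Three" four. Five.');
```

The second pass never sees the masked periods, so it can stay a simple split; the cost is an extra pass over the text and the need to pick a marker that is guaranteed not to collide with real content.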
| [reply] |
I was hoping there would be some way of doing this. ... Sigh. Too bad regex can't do everything!
Be aware of the "if all you have is a hammer, everything looks like a nail" effect. Doing everything in a single regex is nice, but shouldn't be a requirement - sometimes, things can be expressed much more cleanly with a few regexes and some code. And be aware of premature optimization as well - sure, oftentimes a single regex is faster than multiple, but usually it's better to get things working first instead of trying to bend over backwards and trying to wrap your head around a complex regex. Especially in the case you describe, IMHO the brainpower is much better spent on writing up test cases first!
use warnings;
use strict;
use Test::More;

sub my_sentence_splitter {
    my $input = shift;
    my @output;
    # ... magic ...
    return \@output;
}

is_deeply my_sentence_splitter(<<END),
I'm looking for the end of a sentence, where possible. However, in some cases, I'll need to go with a non-conventional "end" to it, such as: "Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style." (Famous, 1999) Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.
END
    [
        q#I'm looking for the end of a sentence, where possible.#,
        q#However, in some cases, I'll need to go with a non-conventional "end" to it, such as:#,
        q#"Here's a quote by a famous person which is supposed to exceed forty words and is therefore required to be set apart as a separate, indented paragraph per APA style."#,
        q#(Famous, 1999)#,
        q#Note that the regex needs to look for the full end of the sentence, if it exists: it cannot simply stop at the colon unless there is no further part to the sentence provided in that paragraph.#,
    ];

# TODO: Many more test cases here!
done_testing;
| [reply] [d/l] |
Could you perhaps match repeatedly within the same string, in a loop, and then manually select what you consider to be the most appropriate match?
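A rough sketch of what I mean, with the caveat that the selection rule is my own guess at the requirement (prefer a real sentence end, and stop at a colon only when nothing else follows); candidate_ends and first_sentence are made-up names:

```perl
use strict;
use warnings;

# Collect every position where a sentence could plausibly end.
sub candidate_ends {
    my ($text) = @_;
    my @candidates;
    # In scalar context, each /g iteration resumes where the previous
    # match ended, and pos() reports the offset just past the match.
    while ($text =~ /([.:?!])(?=\s|\z)/g) {
        push @candidates, { char => $1, end => pos($text) };
    }
    return @candidates;
}

# Hypothetical selection rule: take the first candidate that is not a
# colon; fall back to the first candidate of any kind.
sub first_sentence {
    my ($text) = @_;
    my @c = candidate_ends($text);
    return $text unless @c;
    my ($pick) = grep { $_->{char} ne ':' } @c;
    $pick //= $c[0];
    return substr $text, 0, $pick->{end};
}

print first_sentence('Such as: this one. Next.'), "\n";
```

Keeping the matching and the choosing separate means the regex stays trivial and the "most appropriate match" logic lives in ordinary code, where it is easier to test.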
| [reply] |
I'd say use Hippo's SSCCE template (see "Re: Matching a string in a parenthesized block (regex help)") to write some tests for
- what you want and
- what you don't want.
This would certainly be beneficial for you too.
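Other than that, an alternation (|) whose first branch swallows a whole region can give priority to areas such as "quoted" ones: the quoted branch is tried first, so a period inside the quotes is consumed along with the quote and never matches as a separator on its own.
Demo: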
DB<132> $_ = 'phrase. "phrase1.phrase2" phrase. phrase'
0 'phrase. "phrase1.phrase2" phrase. phrase'
DB<133> split /(".*?"|\.)/
0 'phrase'
1 '.'
2 ' '
3 '"phrase1.phrase2"'
4 ' phrase'
5 '.'
6 ' phrase'
DB<134>
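The token list from that split still has to be glued back into sentences. One possible way, as a sketch only (sentences_from_tokens is a made-up name, and treating every bare '.' token as a sentence end is an assumption):

```perl
use strict;
use warnings;

sub sentences_from_tokens {
    my ($text) = @_;
    my @sentences;
    my $current = '';

    # Same split as in the demo: quoted spans come back as single
    # tokens, so only a bare '.' token closes a sentence.
    for my $tok (split /(".*?"|\.)/, $text) {
        $current .= $tok;
        if ($tok eq '.') {
            push @sentences, $current;
            $current = '';
        }
    }

    # Keep a trailing fragment that has no closing period.
    push @sentences, $current if $current =~ /\S/;

    # Trim surrounding whitespace left over from the split.
    s/^\s+|\s+$//g for @sentences;
    return @sentences;
}

print "[$_]\n" for sentences_from_tokens('phrase. "phrase1.phrase2" phrase. phrase');
```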
| [reply] [d/l] |