Extract sequence of UC words?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Re: Extract sequence of UC words? by gaal (Parson) on Aug 18, 2008 at 13:57 UTC
`\b([A-Z\s]+)\b`, though you should note that `A-Z` misses out on accented characters. This is a little more i18n-friendly (not tested): `use charnames ":full"; \b([\p{IsUpper}\s]+)\b`
Re^2: Extract sequence of UC words? by BrowserUk (Patriarch) on Aug 18, 2008 at 14:10 UTC
`\b([A-Z\s]+)\b` This doesn't work because the whitespace in the character class lets it match the first single space in the line and return that. You need to ensure that the match starts with an upper-case letter and then continues with upper-case letters or spaces: `print $data =~ m/(\b[A-Z][A-Z ]+\b)/;` prints `TEST SENTENCE`.
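A minimal sketch of the difference between the two patterns above. The sample string is an assumption, since the original question's data is not shown in this thread; the brackets in the output make whitespace visible.

```perl
use strict;
use warnings;

# Assumed sample input; the original question's data is not shown here.
my $data = 'this is a TEST SENTENCE Foo bar';

# Whitespace is allowed as the first character, so the first match
# is the lone space right after "this".
my ($loose) = $data =~ /\b([A-Z\s]+)\b/;

# The first character must be an upper-case letter.
my ($anchored) = $data =~ /(\b[A-Z][A-Z ]+\b)/;

print "loose:    [$loose]\n";      # loose:    [ ]
print "anchored: [$anchored]\n";   # anchored: [TEST SENTENCE ]
```

The brackets also show that the second capture still ends with a space, because `[A-Z ]+` may finish on a space and `\b` holds between that space and the `F` of `Foo`.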
Re^3: Extract sequence of UC words? by monarch (Priest) on Aug 18, 2008 at 17:01 UTC
Unfortunately this would also match "`TEST SENTENCE `" (note the trailing whitespace). The following test illustrates another method: `#!/usr/bin/perl -w my $data = <<'EOF'; This is a sentence. THIS \ IS A SENTENCE. This is \ a SEQUENCE OF UPPER WORDS and \ this is not. EOF while ( $data =~ m/(\b(?:[A-Z]+(?:\s+[A-Z]+)*)+\b)/g ) { print "Upper Sentence: \"$1\"\n"; }` Outputs: `Upper Sentence: "THIS IS A SENTENCE" Upper Sentence: "SEQUENCE OF UPPER WORDS"`
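To compare the two approaches side by side, here is a small sketch run on a hypothetical one-line input (not the heredoc from the post). Both patterns are taken from the replies above; only the test string and variable names are made up.

```perl
use strict;
use warnings;

# Hypothetical one-line input, chosen for illustration only.
my $data = 'This is a SEQUENCE OF UPPER WORDS and this is a TEST SENTENCE Foo';

# Pattern from the earlier reply: the class [A-Z ] lets the match end on a space.
my @with_space = $data =~ /(\b[A-Z][A-Z ]+\b)/g;

# Pattern from this reply: words joined by whitespace, so the match ends on a letter.
my @clean = $data =~ /(\b(?:[A-Z]+(?:\s+[A-Z]+)*)+\b)/g;

print "[$_]\n" for @with_space;   # [SEQUENCE OF UPPER WORDS ] then [TEST SENTENCE ]
print "[$_]\n" for @clean;        # [SEQUENCE OF UPPER WORDS] then [TEST SENTENCE]
```

Neither pattern picks up the `T` of `This` or the `F` of `Foo`, since `\b` fails between two letters.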
Re^4: Extract sequence of UC words? by BrowserUk (Patriarch) on Aug 18, 2008 at 17:13 UTC
Re^5: Extract sequence of UC words? by monarch (Priest) on Aug 18, 2008 at 17:58 UTC
Re^4: Extract sequence of UC words? by johngg (Canon) on Aug 19, 2008 at 14:05 UTC
Re^3: Extract sequence of UC words? by dHarry (Abbot) on Aug 18, 2008 at 14:26 UTC
Thou art wise, brother BrowserUk. I was just about to comment that I see a lot of non-working solutions ;-) Alas, my votes for today are gone.
Re^3: Extract sequence of UC words? by gaal (Parson) on Aug 18, 2008 at 15:58 UTC
Thanks for the correction!
Re: Extract sequence of UC words? by broomduster (Priest) on Aug 18, 2008 at 14:30 UTC
Do you really mean "a sequence of upper case words"? Or do you mean "all of the upper case words in a string"? To see what I'm getting at, consider: `my $data = 'THIS IS a TEST SENTENCE Foo BAR';` If you mean "a sequence of upper case words", the answer would be the following three strings (maybe not including the last, if your definition of sequence means "strictly more than one"): `THIS IS`, `TEST SENTENCE`, `BAR`. Other responses in this thread point you at a solution for this. OTOH, if you mean all upper case words (and want to collect them separately), then you want the following five words: `THIS`, `IS`, `TEST`, `SENTENCE`, `BAR`. In that case, look into the `g` modifier for regular expressions and how to capture multiple matches into an array (using parens to capture the matches you like; see perlretut for a nice introduction). The basic idea (you need to supply the right regex for your needs) is: `my @uc_words = $data =~ /(appropriate_regex_goes_here)/g;`
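As a rough illustration of the two readings, the sketch below fills in the placeholder regex in two ways. Both patterns and the array name `@uc_runs` are assumptions made for this example, not something the poster supplied.

```perl
use strict;
use warnings;

my $data = 'THIS IS a TEST SENTENCE Foo BAR';

# Reading 1: runs of one or more upper-case words separated by whitespace.
my @uc_runs = $data =~ /\b([A-Z]+(?:\s+[A-Z]+)*)\b/g;

# Reading 2: every upper-case word collected on its own.
my @uc_words = $data =~ /\b([A-Z]+)\b/g;

print join('|', @uc_runs), "\n";    # THIS IS|TEST SENTENCE|BAR
print join('|', @uc_words), "\n";   # THIS|IS|TEST|SENTENCE|BAR
```

In list context with `/g`, the match returns every capture, so no explicit loop is needed. To require "strictly more than one" word per run, changing `*` to `+` in the first pattern should do it.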
Re: Extract sequence of UC words? by amarquis (Curate) on Aug 18, 2008 at 13:56 UTC
It grabs only one word because you are matching for a sequence of upper case letters only. To have it match 'TEST SENTENCE' you'll have to have it match upper case letters OR spaces. But wait! The regex will then actually match ' TEST SENTENCE ' (including the space before and after the capitalized sequence). So what you really need is to make a match of: (1) one upper case letter, (2) any number of upper case letters/spaces, (3) one upper case letter. The requirement to match a beginning and ending upper case letter will also make it not match just the 'F' of 'Foo'. Edit: gaal is smarter than I, heh.
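One way to write the three-part match described above, as a sketch only. The sample string is assumed, and the `\b` anchors in the second pattern are an addition that the post does not mention.

```perl
use strict;
use warnings;

# Assumed sample input; the original question's data is not shown here.
my $data = 'this is a TEST SENTENCE Foo bar';

# The three parts as described: upper-case letter, letters or spaces, upper-case letter.
my ($bare) = $data =~ /([A-Z][A-Z ]*[A-Z])/;

# Same three parts with \b on both sides, so the match must end on a whole word.
my ($bounded) = $data =~ /(\b[A-Z][A-Z ]*[A-Z]\b)/;

print "[$bare]\n";      # [TEST SENTENCE F]
print "[$bounded]\n";   # [TEST SENTENCE]
```

The bare version never matches the `F` of `Foo` on its own, but it does absorb that `F` as the closing upper-case letter of the run; the word boundaries prevent that. Both versions need at least two characters, so a single-letter word such as `A` would not match alone.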
Re^2: Extract sequence of UC words? by Anonymous Monk on Aug 18, 2008 at 14:04 UTC
Thanks, I modified it like so: `/([A-Z\|\s+]+)+/`
Re^3: Extract sequence of UC words? by FunkyMonk (Chancellor) on Aug 18, 2008 at 14:15 UTC
`|` and `+` inside a character class aren't special, they're just regular characters, so your regex would match "FO O|B+++A R". `/[A-Z ]+/` (which is what I think you probably meant) won't work either. Bonus points will be given if you tell us why! Update: BrowserUk has already seen what was missing. You missed out, AnonyMonk.
Re^3: Extract sequence of UC words? by Anonymous Monk on Aug 18, 2008 at 14:20 UTC
Note that the regex `[A-Z\|\s+]` defines a set of characters that includes the '`|`' ('pipe') character. Within a character set, the pipe has no special meaning; i.e., it is not the regex alternation metacharacter.
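A short check of the point about `|` and `+` inside a character class. The test strings other than "FO O|B+++A R" and the variable names are made up for this sketch.

```perl
use strict;
use warnings;

# Inside [...], "|" and "+" are ordinary members of the set.
my $class = qr/^[A-Z|\s+]+$/;

for my $s ('FO O|B+++A R', '|+|', 'TEST') {
    print "'$s': ", ($s =~ $class ? 'matches' : 'no match'), "\n";   # all three match
}

# Real alternation needs a group, not a class.
my $alt = qr/^(?:[A-Z]|\s)+$/;
print "'|+|' with alternation: ", ('|+|' =~ $alt ? 'matches' : 'no match'), "\n";   # no match
```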
Re: Extract sequence of UC words? by massa (Hermit) on Aug 18, 2008 at 14:27 UTC
Use the `/g` modifier. `$ perl -Mutf8 -mopen=:locale -e ' my $data = "this is a TEST SENTENCE Foo Bar GRUB jo �� co"; my @uc_strings = $data =~ /\b\p{Lu}+\b/g; print "@uc_strings\n"; '` prints `TEST SENTENCE GRUB ��`. []s, HTH, Massa (κς,πμ,πλ)
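For comparison, here is a sketch of how `[A-Z]` and `\p{Lu}` differ on accented text. The sample string is hypothetical (it is not the data from the post), and the script assumes it is saved as UTF-8.

```perl
use strict;
use warnings;
use utf8;                              # the string literal below is UTF-8 source text
binmode STDOUT, ':encoding(UTF-8)';    # encode wide characters on output

# Hypothetical sample text with accented upper-case words.
my $data = 'this is a TEST SENTENCE Foo Bar ÉTÉ école ÑANDÚ';

my @ascii   = $data =~ /\b[A-Z]+\b/g;     # ASCII upper-case letters only
my @unicode = $data =~ /\b\p{Lu}+\b/g;    # any Unicode upper-case letter

print "ASCII:   @ascii\n";     # ASCII:   TEST SENTENCE
print "Unicode: @unicode\n";   # Unicode: TEST SENTENCE ÉTÉ ÑANDÚ
```

The ASCII pattern skips the accented words entirely rather than matching the `T` inside `ÉTÉ`, because the accented letters count as word characters and so no `\b` occurs next to them.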
Re: Extract sequence of UC words? by logie17 (Friar) on Aug 18, 2008 at 16:38 UTC
I think you're close, but you'll probably have better luck using the `/g` modifier in list context. The following gets the results for me: `my @uc_string = ($data =~ /(\b[A-Z]+\b)/g);`
Re: Extract sequence of UC words? by JavaFan (Canon) on Aug 18, 2008 at 16:22 UTC
What I would use depends on your definition of "word" and "sequence". Clearly, in your example, "TEST" and "SENTENCE" are capitalized words. But is "STRATFORD-UPON-AVON" a capitalized word? How about "HE'S"? Is that a capitalized word? Or a sequence of capitalized words? What about "This is a TEST. THE test ends here."? Is "TEST. THE" a sequence of words?
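To make these questions concrete, the sketch below runs the same text through two possible definitions. Both definitions, the combined sample string, and the names `@strict`, `@loose` and `$word` are assumptions for illustration; neither is claimed to be the right answer.

```perl
use strict;
use warnings;

# Sample text assembled from the examples in the post.
my $data = q{STRATFORD-UPON-AVON is where HE'S from. This is a TEST. THE test ends here.};

# Definition A: a word is letters only; a sequence is words joined by whitespace.
my @strict = $data =~ /\b([A-Z]+(?:\s+[A-Z]+)*)\b/g;

# Definition B (an assumption): hyphens and apostrophes may join letters inside a word.
my $word  = qr/[A-Z]+(?:[-'][A-Z]+)*/;
my @loose = $data =~ /(?<![A-Za-z'-])($word(?:\s+$word)*)(?![A-Za-z'-])/g;

print join('|', @strict), "\n";   # STRATFORD|UPON|AVON|HE|S|TEST|THE
print join('|', @loose), "\n";    # STRATFORD-UPON-AVON|HE'S|TEST|THE
```

Under both definitions the full stop splits "TEST. THE" into two matches; joining them would need punctuation added to the separator, which is exactly the kind of decision the questions above are asking for.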

