regex: only want [a-zA-Z] and comma chars in a string

heezy has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: regex: only want [a-zA-Z] and comma chars in a string by tilly (Archbishop) on Oct 12, 2003 at 18:36 UTC
If you are trying to figure out regular expressions, I highly recommend japhy's module YAPE::Regex::Explain. You will have trouble installing it unless you install YAPE::Regex first. There is a dependency there, but he didn't properly indicate it. I reported that bug. If you have trouble installing things (e.g. you are on Windows and aren't familiar with CPAN and CPANPLUS) you can get the sources by following these links to .\YAPE\Regex.pm and .\YAPE\Regex\Explain.pm. Save those files with those paths (I assumed a Windows delimiter) and then write the following script:

    #! perl
    use strict;
    use YAPE::Regex::Explain;
    print YAPE::Regex::Explain->new(shift)->explain;

And now you can get explanations like the following:

    tilly@localhost:~$ perl re-explain foo
    The regular expression:

    (?-imsx:foo)

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      foo                      'foo'
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------
    tilly@localhost:~$ perl re-explain '(foo|bar)'
    The regular expression:

    (?-imsx:(foo|bar))

    matches as follows:

    NODE                     EXPLANATION
    ----------------------------------------------------------------------
    (?-imsx:                 group, but do not capture (case-sensitive)
                             (with ^ and $ matching normally) (with . not
                             matching \n) (matching whitespace and #
                             normally):
    ----------------------------------------------------------------------
      (                        group and capture to \1:
    ----------------------------------------------------------------------
        foo                      'foo'
    ----------------------------------------------------------------------
       |                        OR
    ----------------------------------------------------------------------
        bar                      'bar'
    ----------------------------------------------------------------------
      )                        end of \1
    ----------------------------------------------------------------------
    )                        end of grouping
    ----------------------------------------------------------------------

(I actually ran this under Linux. On Windows you will want to quote the RE with ", not '.) This makes it easier for beginners to understand what a given regular expression should do.
Re: regex: only want [a-zA-Z] and comma chars in a string by davido (Cardinal) on Oct 12, 2003 at 17:34 UTC
You're close.

    unless ( $tax_collection =~ /^[a-zA-Z]+(?:,[a-zA-Z]+)*?$/ ) {
        print "<font color=\"#ff0000\"><i>Incorrect format</i></font>";
        $errors++;
    }

The pattern matches `[a-zA-Z]` as many times as possible (the first word), followed by a group that may repeat as many times as necessary, or not at all. Each repetition of that group must start with a comma and continue with one or more `[a-zA-Z]` characters. The match is anchored from the front of the string to the end (assuming a single-line string).

Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
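As a quick check of the explanation above, here is a minimal sketch that runs davido's pattern against a few sample strings. The helper name `is_word_list` and the sample list are assumptions made for illustration; only the pattern itself comes from the post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 1 if the string is one or more
# [a-zA-Z] words separated by single commas, 0 otherwise.
sub is_word_list {
    my ($s) = @_;
    # Pattern taken from the reply above.
    return $s =~ /^[a-zA-Z]+(?:,[a-zA-Z]+)*?$/ ? 1 : 0;
}

for my $s ('notalot', 'england,rugby', ',southampton', 'bristol,', 'iran,,canada', '') {
    printf "%-16s %s\n", "'$s'", is_word_list($s) ? 'ok' : 'Incorrect format';
}
```

Because `$` also matches just before a trailing newline, input read from a file or form may need `chomp` first, or `\z` in place of `$`, if a stray newline should count as an error.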
Re: regex: only want [a-zA-Z] and comma chars in a string (Don't use a regex!) by dragonchild (Archbishop) on Oct 12, 2003 at 19:22 UTC
Try this:

    sub is_valid_string {
        my $string = shift;

        # Handle the empty case
        return 0 unless length $string;

        # Preserve trailing empty fields
        my @string = split ',', $string, -1;

        # No commas at the beginning, end, or next to one another.
        return 0 if grep { !$_ } @string;

        return 0 if grep { /[^a-zA-Z]/ } @string;

        return 1;
    }

It was tested with your test strings and passed with flying colors.

------
We are the carpenters and bricklayers of the Information Age.
The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6
Please remember that I'm crufty and crochety. All opinions are purely mine and all code is untested, unless otherwise specified.
Re: regex: only want [a-zA-Z] and comma chars in a string by Anonymous Monk on Oct 12, 2003 at 18:03 UTC
How about just: `unless ($tax_collection =~ /^([a-zA-Z]+|\b,\b)+$/) {`
Re: Re: regex: only want [a-zA-Z] and comma chars in a string by davido (Cardinal) on Oct 12, 2003 at 18:31 UTC
`unless ($tax_collection =~ /^([a-zA-Z]+|\b,\b)+$/) {...` No, that's not going to work because it allows the comma to appear at the beginning and/or end of the string, which doesn't meet the original post's spec. The beginning and end of the string count as word boundaries, so your method fails.

Update: Now it's a matter of public record: I made a simple mistake in interpreting Anonymous Monk's regexp. (S)He is correct in his assertion. The method works.

Dave
Re: Re: Re: regex: only want [a-zA-Z] and comma chars in a string by Anonymous Monk on Oct 12, 2003 at 18:34 UTC
Yes, the beginning and end of the string count as word boundaries, but only when they are adjacent to a \w character, and ',' is not a \w character.
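The point about `\b` can be checked directly. The sketch below wraps the Anonymous Monk pattern in a helper (the name `matches_list` is an assumption for illustration) and tries a comma in each of the disputed positions.

```perl
use strict;
use warnings;

# Hypothetical helper around the pattern discussed in this sub-thread.
sub matches_list {
    my ($s) = @_;
    # \b,\b only succeeds when the comma has a \w character on both sides:
    # at the start or end of the string, or next to another comma,
    # one side is not \w, so there is no word boundary there.
    return $s =~ /^([a-zA-Z]+|\b,\b)+$/ ? 1 : 0;
}

for my $s ('a,b', ',a', 'a,', 'a,,b') {
    printf "%-6s %s\n", $s, matches_list($s) ? 'match' : 'no match';
}
```

One caveat: `\b` is defined in terms of `\w`, which also includes digits and underscore, so it is the `[a-zA-Z]+` alternative together with the anchors that keeps those characters out.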
Re: Re: Re: Re: regex: only want [a-zA-Z] and comma chars in a string by davido (Cardinal) on Oct 12, 2003 at 18:36 UTC
Re: regex: only want [a-zA-Z] and comma chars in a string by delirium (Chaplain) on Oct 12, 2003 at 20:56 UTC
If you're new to regexes, it might be helpful (although it will slow down processing slightly) to break this down into a separate test for each condition you want to check. That may help you later when you come back to the code and need to figure out what you were testing for. For example: `print "$_\n" if !/^,/ && !/,$/ && !/,,/ && /^[a-zA-Z,]+$/;`
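One side benefit of splitting the checks up is that each one can report its own reason for failure. The following is a minimal sketch of that idea, not delirium's code: the helper name `check` and the message strings are assumptions, and the character class carries a `+` so that it covers the whole string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns 'ok' or a short description of the first
# rule the string breaks.
sub check {
    my ($s) = @_;
    return 'leading comma'              if $s =~ /^,/;
    return 'trailing comma'             if $s =~ /,$/;
    return 'consecutive commas'         if $s =~ /,,/;
    return 'invalid character or empty' unless $s =~ /^[a-zA-Z,]+$/;
    return 'ok';
}

for my $s ('england,rugby', ',southampton', 'bristol,', 'iran,,canada', 'iran, canada', '') {
    printf "%-16s %s\n", "'$s'", check($s);
}
```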
Re: regex: only want [a-zA-Z] and comma chars in a string by Anonymous Monk on Oct 12, 2003 at 17:27 UTC
    @strings = (
        "notalot",
        "england,rugby",
        "chicken,egg,duck,feet",
        ",southampton",
        "bristol,",
        "iran, canada,france",
        "",
    );

    for (@strings) {
        print "$_\n" if /^[a-zA-Z]([a-zA-Z]+|,(?!,|$))*$/;
    }
Re: regex: only want [a-zA-Z] and comma chars in a string by jeroenes (Priest) on Oct 12, 2003 at 17:36 UTC
Hi heezy,

Why not try without a regex?

    print "yes" if not tr/a-zA-Z,//c and not ( str( $_, ',', 0) or str $_, ',', -1);

Did not check it but should work.

Cheers, Jeroen
"We are not alone"(FZ)

After some self-flagellation:

    for (qw/ ,asd asd:asd asd, asd,asd asd,,asd/) {
        my $test = $_;
        print "$test: ";
        print "yes"
            unless $test =~ tr/a-zA-Z,//c                    # any character outside a-zA-Z and comma
                or index( $test, ',' ) == 0                   # leading comma
                or rindex( $test, ',' ) == -1 + length($test) # trailing comma (also rejects the empty string)
                or index( $test, ',,' ) > 0;                  # consecutive commas
        print "\n";
    }

My Perl clearly is rusty, and I don't even know anymore in which language str is valid ;). However, it is still possible without regexes {grin}.
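The `tr/a-zA-Z,//c` part of the test may be unfamiliar: with the `c` flag and an empty replacement list, `tr` leaves the string unchanged and returns how many characters fall outside the listed set. A small sketch, with `count_invalid` as a made-up helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: number of characters that are neither a-z, A-Z
# nor a comma. The string itself is not modified.
sub count_invalid {
    my ($s) = @_;
    return ( $s =~ tr/a-zA-Z,//c );
}

for my $s ('asd,asd', 'asd:asd', 'a b1', '') {
    printf "%-10s %d\n", "'$s'", count_invalid($s);
}
```

A count of zero only says that every character is allowed. The comma placement still has to be checked separately, which is what the `index` and `rindex` tests are for.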
Re: Re: regex: only want [a-zA-Z] and comma chars in a string by Anonymous Monk on Oct 12, 2003 at 17:42 UTC
bzzt! Why post guesses when 10 seconds would have allowed you to check? The strings are not allowed to have consecutive commas, and `str($_,',',0)` isn't even a valid Perl function.
Re: regex: only want [a-zA-Z] and comma chars in a string by mshiltonj (Sexton) on Oct 13, 2003 at 22:02 UTC
This example...

    #!/usr/bin/perl -w
    use strict;

    my %commas = (
        'notalot'               => 1,
        'england,rugby'         => 1,
        'chicken,egg,duck,feet' => 1,
        ',southampton'          => 0,
        'bristol,'              => 0,
        'iran,,canada,france'   => 0,
    );

    foreach my $key (keys %commas) {
        unless (($key =~ /^,|,,|,$/)) {
            print "PASS";
        }
        else {
            print "FAIL";
        }
        print ": $key\n";
    }

... seems to work for me. This was its output:

    FAIL: bristol,
    PASS: notalot
    PASS: england,rugby
    FAIL: iran,,canada,france
    PASS: chicken,egg,duck,feet
    FAIL: ,southampton
Re: Re: regex: only want [a-zA-Z] and comma chars in a string by dragonchild (Archbishop) on Oct 13, 2003 at 23:31 UTC
Ahhh ... the pitfalls of coding to the test cases instead of the specification. The OP specifically stated that he needed the characters to be in the class `[a-zA-Z]`, which your code doesn't check for. Try the string "1", which should fail.
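dragonchild's objection is easy to reproduce. The sketch below compares the comma-only test with a test that also checks the character class. Both helper names are invented for this illustration, and the second pattern is just one of several from this thread that would do.

```perl
use strict;
use warnings;

# Hypothetical helper: the negative test alone, which only looks at
# where the commas are.
sub commas_ok {
    my ($s) = @_;
    return $s !~ /^,|,,|,$/ ? 1 : 0;
}

# Hypothetical helper: a positive test that also requires every field
# to consist of [a-zA-Z] characters.
sub fields_ok {
    my ($s) = @_;
    return $s =~ /^[a-zA-Z]+(?:,[a-zA-Z]+)*$/ ? 1 : 0;
}

for my $s ('england,rugby', '1', 'a b', '') {
    printf "%-16s commas_ok=%d fields_ok=%d\n", "'$s'", commas_ok($s), fields_ok($s);
}
```

The two tests agree on the original sample strings and differ on inputs such as "1", "a b" and the empty string, none of which were in the test set.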
Re: Re: regex: only want [a-zA-Z] and comma chars in a string by nevyn (Monk) on Oct 14, 2003 at 19:39 UTC
`unless (($key =~ /^,|,,|,$/))`

Cute, testing for the negative is much more readable here ... but probably something I wouldn't have tried. You need to check that the characters are in the valid class though (and I'd also put brackets around the anchors), so that would make it...

    unless (($key =~ /(?:^,)|,,|[^a-zA-Z,]|(?:,$)/))

Which isn't quite as nice anymore :(. So I'd probably still use the "obvious" non-negative...

    if (($key =~ /^[a-zA-Z]+(?:,[a-zA-Z]+)*$/))

Or if I was feeling really special, I might even do...

    my $field = qr([a-zA-Z]+);
    if (($key =~ /^$field(?:,$field)*$/))

...which is slightly more readable IMO.

update (broquaint): changed `<pre>` to `<code>` tags
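The `qr//` approach can be taken one step further by building the list pattern from whatever field pattern is passed in. This is only a sketch of that design choice: `list_of` is a hypothetical helper, not something from nevyn's post.

```perl
use strict;
use warnings;

# Hypothetical helper: given a compiled pattern for one field, return an
# anchored pattern for one or more such fields separated by single commas.
sub list_of {
    my ($field) = @_;
    return qr/^$field(?:,$field)*$/;
}

my $words = list_of(qr/[a-zA-Z]+/);

for my $s ('chicken,egg,duck,feet', 'bristol,', 'iran,,canada') {
    printf "%-24s %s\n", $s, $s =~ $words ? 'ok' : 'bad';
}
```

Keeping the field definition in one place means a change to what counts as a field (say, allowing digits) only has to be made once.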
Re: regex: only want [a-zA-Z] and comma chars in a string by Anonymous Monk on Oct 15, 2003 at 03:00 UTC
Hello Monks,

I really learned some things here. My solution forgot to include the ^ and $ anchors at first and didn't work. After reading dragonchild's reply, I went back to examine the behavior of split (it was instructive). By supplying -1 as the limit, which puts no cap on the number of fields and keeps trailing empty ones, he preserved the trailing item. I really liked nevyn's solution: very straightforward and clear, easy to understand. Once I put in the anchors, my regex worked correctly - surprise!

    my $pat = qr{^([A-Za-z]|[A-Za-z],(?=[A-Za-z]))+$};

Nice to meet all,
Chris
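The `split` behavior mentioned above is easy to see by counting fields with and without the negative limit. A minimal sketch, with `field_count` as an assumed helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: number of fields split returns for a string,
# with or without a negative LIMIT.
sub field_count {
    my ($s, $keep_trailing) = @_;
    my @fields = $keep_trailing ? split( /,/, $s, -1 ) : split( /,/, $s );
    return scalar @fields;
}

for my $s ('england,rugby', 'bristol,', ',southampton', 'a,,b', '') {
    printf "%-16s default=%d  limit -1=%d\n",
        "'$s'", field_count( $s, 0 ), field_count( $s, 1 );
}
```

By default `split` drops trailing empty fields, so 'bristol,' yields one field unless the limit is negative. Leading and interior empty fields are kept either way, and an empty input string yields no fields at all, which is why dragonchild's routine handles that case before splitting.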

