How do I keep anything other than alphanumeric out of a variable?

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: How do I keep anything other than alphanumeric out of a variable? by DrHyde (Prior) on Aug 26, 2003 at 12:48 UTC
The following gets rid of non-alphanumerics and underscores: `$user_name =~ s/[\W_]//g;`
The pattern `[\W_]` breaks down as follows:
`[...]` any one of the characters from the character class consisting of ...
`\W` any "non-word" character ...
`_` or an underscore.
However, instead of just silently cleaning data, I'd prefer to check the string for undesirable characters and notify the user if it is bad, so that they can fix it: `$user_name =~ /[\W_]/ and warn "user name is bad\n"`
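To make the difference between the two approaches concrete, here is a minimal sketch. The helper names `strip_non_alnum` and `has_bad_chars` and the sample user names are made up for illustration; only the two regexes come from the reply above.

```perl
use strict;
use warnings;

# Hypothetical helper: silently remove everything matched by [\W_].
sub strip_non_alnum {
    my ($name) = @_;
    (my $clean = $name) =~ s/[\W_]//g;
    return $clean;
}

# Hypothetical helper: report whether the name contains a character
# that strip_non_alnum would have removed.
sub has_bad_chars {
    my ($name) = @_;
    return $name =~ /[\W_]/ ? 1 : 0;
}

for my $name ('joe_user!', 'joe99') {
    if (has_bad_chars($name)) {
        warn "user name '$name' is bad\n";
    }
    else {
        print "user name '$name' is ok\n";
    }
}
```

With the check, 'joe_user!' gets reported back to the user instead of quietly turning into 'joeuser'.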
Re: Answer: How do I keep anything other than Alpha/Numeric data out of a variable? by davido (Cardinal) on Aug 26, 2003 at 19:01 UTC
One caveat here: POSIX. POSIX can, on some systems, alter the definition of `\W` so that its conventional meaning, `[^a-zA-Z0-9_]`, is not exactly what you expect it to be.
According to Friedl (the Owl book, "Mastering Regular Expressions", 1st edition, pp. 65-66 and 257), paraphrasing: a POSIX locale can alter the meaning of `\w` and `\W` to include what other languages consider to be word characters. "Locales can influence many tools that do not aspire to POSIX compliance, sometimes without their knowledge! ... If the non-POSIX utility is compiled on a system with a POSIX-compliant C library, some support can be bestowed, although the exact amount can be hit or miss. For example, the tool's author might have used the C library functions for capitalization issues, but not for \w support."
It is sometimes necessary to use `[a-zA-Z0-9_]` rather than `\w`. According to Friedl: "...a friend ran into a problem in which his version of Perl treated certain non ASCII bytes as [accented characters]..."
Therefore, it is in some cases advisable to use the following construction to accomplish the task described in the subject line of this thread: `$user_name =~ s/[^a-zA-Z0-9]//g;`
Or with case insensitivity: `$user_name =~ s/[^a-z0-9]//gi;`
Of course this solution more accurately answers the question: "How do I purge anything other than Alpha/Numeric data from a variable?"
Dave
"If I had my life to do over again, I'd be a plumber." -- Albert Einstein
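The same kind of surprise is easy to reproduce without touching locales, because Perl also applies Unicode rules to `\w` when a string contains code points above 255. The sketch below uses a made-up input containing a Greek letter to show the two character classes disagreeing; it illustrates the general point and is not the exact situation Friedl describes.

```perl
use strict;
use warnings;

# Illustrative input: "\x{3b1}" is GREEK SMALL LETTER ALPHA. Because the
# code point is above 255, Perl uses Unicode rules and \w matches it.
my $input = "\x{3b1}bc-123";

(my $by_shorthand = $input) =~ s/[\W_]//g;         # keeps the Greek letter
(my $by_explicit  = $input) =~ s/[^a-zA-Z0-9]//g;  # ASCII letters and digits only

# Report lengths rather than the strings themselves, to avoid
# "Wide character" warnings on an unencoded STDOUT.
printf "shorthand kept %d characters, explicit class kept %d\n",
    length($by_shorthand), length($by_explicit);
# prints "shorthand kept 6 characters, explicit class kept 5"
```

On Perl 5.14 or later, adding the `/a` modifier (`s/[\W_]//ga`) should also restrict `\W` to its ASCII meaning, but spelling out the class works on any version.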
Re: How do I keep anything other than alphanumeric out of a variable? by turnstep (Parson) on Apr 22, 2000 at 00:56 UTC
As long as you are willing to concede that an underscore `'_'` is alphanumeric, you can use this: `$user_name =~ s/\W//g;`
`\w` is shorthand for the character class `[A-Za-z0-9_]` and `\W` is the inverse of that set, i.e. `[^\w]`.
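A quick check of what that substitution keeps, using a made-up user name:

```perl
use strict;
use warnings;

my $user_name = 'joe_user-99!';   # sample value, for illustration only
$user_name =~ s/\W//g;            # removes '-' and '!', keeps '_'
print "$user_name\n";             # prints "joe_user99"
```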
Re: How do I keep anything other than alphanumeric out of a variable? by btrott (Parson) on Apr 21, 2000 at 19:31 UTC
Your regex says, "find the alphanumeric characters in $user_name, and replace them with nothing." You want the opposite: `$user_name =~ s/[^a-zA-Z0-9]//g;`
The `^` at the beginning of the character class inverts the set, i.e. "all things not in this character class".
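Side by side, the effect of the leading `^` looks like this. The first substitution is only a guess at the general shape of the pattern being described (a class without the `^`), and the sample value is invented.

```perl
use strict;
use warnings;

my $input = 'joe!99';   # sample value, for illustration only

# Without ^ : the alphanumerics are the ones that get deleted.
(my $wrong = $input) =~ s/[a-zA-Z0-9]//g;

# With ^ : everything except the alphanumerics gets deleted.
(my $right = $input) =~ s/[^a-zA-Z0-9]//g;

print "without ^: '$wrong'\n";   # prints "without ^: '!'"
print "with ^:    '$right'\n";   # prints "with ^:    'joe99'"
```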
Re: How do I keep anything other than alphanumeric out of a variable? by Roy Johnson (Monsignor) on Nov 04, 2003 at 14:53 UTC
The right tool for character classes is `tr///`, not `s///`: `$user_name =~ tr/0-9a-zA-Z//dc;`
(You can add the underscore character, or any others you like, of course.)
If you wanted the username to look like a valid Perl identifier (i.e., begin with a letter, then alphanumerics + underscores), you would then want to strip off the leading non-letters: `$user_name =~ s/^[^a-z]*//i;`
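Put together, the two steps might look like the sketch below (the sample value is invented). The `c` flag complements the listed set and `d` deletes whatever matches, so everything that is not a digit or an ASCII letter is removed; `tr///` also returns the number of characters it matched, which is handy if you want to know whether anything was stripped.

```perl
use strict;
use warnings;

my $user_name = '9_lives!';   # sample value, for illustration only

# Step 1: delete (d) the complement (c) of 0-9a-zA-Z.
my $removed = ($user_name =~ tr/0-9a-zA-Z//dc);
print "after tr: $user_name ($removed removed)\n";   # prints "after tr: 9lives (2 removed)"

# Step 2: strip leading non-letters so the name starts with a letter.
$user_name =~ s/^[^a-z]*//i;
print "after s:  $user_name\n";                      # prints "after s:  lives"
```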

