Can you identify the token separators, and break the input up into stuff which isn't a problem and stuff which might be?
Starting by tidying up:
$query =~ s/\s+/ /g ; # that's the whitespace
$query =~ s/\A\s// ; # strip leading
$query =~ s/\s\Z// ; # strip trailing
$query = lc($query) ; # all lower case
$query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg ;
# Eliminate separators from quoted strings
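sub mash_s {
  my ($q, $s) = @_ ;          # $q is the quote character, $s the string body
  $s =~ tr/0-9a-z/\\/c ;      # everything that is not 0-9 or a-z becomes a backslash
  return $q.$s.$q ;
} ;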
which, in particular, leaves all "..." or '...' strings containing only [0-9a-z\\]. That means you can then attack anything between separator characters:
$query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg ;
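sub mash_l {
  my ($s) = @_ ;
  # a bare word, or a lone + or -, is left as it is
  return $s  if $s =~ /^(?:[a-z]+|\+|\-)$/ ;
  # an optionally signed number (decimal, 0x.., 0b.., x'..' or b'..') becomes N
  return 'N' if $s =~ /^[+-]?(?:
                           (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+)?
                          |(?:0(?:
                                  x[0-9a-f]+
                                 |b[01]+
                                )
                           )
                          |x'[0-9a-f]+'
                          |b'[01]+'
                         )$/x ;
  # a (mashed) quoted string becomes S
  return 'S' if $s =~ /^(["']).*?\1$/ ;
  # anything else is passed through untouched
  return $s ;
} ;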
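To see what the two passes do to an actual query, here is a rough, self-contained run. The wrapper names tidy and normalise and the sample query are mine, purely for illustration; the substitutions and the two mash subs are the ones above.

```perl
use strict ;
use warnings ;

sub mash_s {
  my ($q, $s) = @_ ;
  $s =~ tr/0-9a-z/\\/c ;      # non-alphanumerics inside the string become backslashes
  return $q.$s.$q ;
} ;

sub mash_l {
  my ($s) = @_ ;
  return $s  if $s =~ /^(?:[a-z]+|\+|\-)$/ ;
  return 'N' if $s =~ /^[+-]?(?:
                           (?:\d+(?:\.\d*)? | \.\d+) (?:e[+-]\d+)?
                          |0(?:x[0-9a-f]+|b[01]+)
                          |x'[0-9a-f]+'
                          |b'[01]+'
                         )$/x ;
  return 'S' if $s =~ /^(["']).*?\1$/ ;
  return $s ;
} ;

# hypothetical wrapper: whitespace, case and quoted strings only
sub tidy {
  my ($query) = @_ ;
  $query =~ s/\s+/ /g ;
  $query =~ s/\A\s// ;
  $query =~ s/\s\Z// ;
  $query = lc($query) ;
  $query =~ s/(["'])((?:\\\1|\1\1|.)*?)\1/mash_s($1, $2)/eg ;
  return $query ;
} ;

# hypothetical wrapper: tidy, then classify every token between separators
sub normalise {
  my ($query) = tidy($_[0]) ;
  $query =~ s/([^ !#\$%()*,\/:;<=>?\@[\]^{|}~]+)/mash_l($1)/eg ;
  return $query ;
} ;

my $sample = q{SELECT *  FROM t WHERE id = 42 AND name = 'Smith, J.'} ;
print tidy($sample), "\n" ;       # select * from t where id = 42 and name = 'smith\\j\'
print normalise($sample), "\n" ;  # select * from t where id = N and name = S
```

The intermediate line shows why the string pass comes first: the comma and space inside 'Smith, J.' are gone before the separator split ever sees them, so the whole string arrives at mash_l as one token.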
Sadly, what this shows most clearly is that distinguishing unary and binary '+' and '-' is tricky. Neither character is a separator, so they stay glued to whatever they touch. The above will cope with 12 + -17 and 12*-5, but will fail on 12+13 or 12 +-13 and so on...
...using a parser, where somebody else has done all the hard work, looks like a good trick!
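For what it's worth, the thing the regex approach lacks is context: a '+' or '-' is binary if the previous token was an operand, and a sign otherwise. Below is a minimal sketch of that idea only, not a substitute for a parser. The name mash_signs is made up, it assumes the input is already tidied and lower-cased, it knows only decimal numbers (no hex, binary or strings), and unlike mash_l above it lets the exponent sign be optional.

```perl
use strict ;
use warnings ;

# unsigned decimal number; the exponent sign is optional here (my assumption)
my $unsigned = qr/(?:\d+(?:\.\d*)?|\.\d+)(?:e[+-]?\d+)?/ ;

# hypothetical helper: walk the string once, remembering what came before
sub mash_signs {
  my ($query) = @_ ;
  my $out = '' ;
  my $after_operand = 0 ;     # true if the previous token was a number, a word or ')'
  pos($query) = 0 ;
  while (pos($query) < length($query)) {
    if ($query =~ /\G\s+/gc) {                          # whitespace: copy, keep the state
      $out .= ' ' ;
    }
    elsif ($after_operand && $query =~ /\G([+-])/gc) {  # after an operand: binary operator
      $out .= $1 ;
      $after_operand = 0 ;
    }
    elsif ($query =~ /\G[+-]?$unsigned/gc) {            # otherwise a sign belongs to the number
      $out .= 'N' ;
      $after_operand = 1 ;
    }
    elsif ($query =~ /\G([a-z_]\w*|\))/gc) {            # a word or ')' also counts as an operand
      $out .= $1 ;
      $after_operand = 1 ;
    }
    else {                                              # any other single character
      $query =~ /\G(.)/gcs ;
      $out .= $1 ;
      $after_operand = 0 ;
    }
  }
  return $out ;
} ;

print mash_signs('12+13'), "\n" ;     # N+N
print mash_signs('12 +-13'), "\n" ;   # N +N
print mash_signs('12 + -17'), "\n" ;  # N + N
print mash_signs('12*-5'), "\n" ;     # N*N
```

Even this small state machine is already halfway to a hand-written lexer, which is rather the point: by the time strings, comments and the other literal forms are added back in, a ready-made parser is the less painful route.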