Re^2: Special character not being captured

in reply to Re: Special character not being captured
in thread Special character not being captured

choroba, it confuses me that make_hash returns the correct strings as keys and values without my having to specify an encoding, but when I go to get the first character with either the first_alpha subroutine or substr, I suddenly need to specify the encoding. All of these subroutines are in the same module, where no encoding is specified anywhere. It is confusing that some subroutines return the correct strings without an encoding while others do not.

If this helps, I am including my locale.

me@office:~$ locale
LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE=C
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

As an aside, I rewrote first_alpha. The original horrified me. I hope the rewrite is cleaner.

Original:

sub first_alpha {
  my $alpha = shift;
  $alpha = ucfirst($alpha) if $alpha =~ /^\l./;
  $alpha =~ s/\s*\b(A|a|An|an|The|the)(_|\s)//xi;
  if ($alpha =~ /^\d/) {
    $alpha = '#';
  }
  elsif ($alpha !~ /^\p{uppercase}/) {
    $alpha = '!';
  }
  else {
    $alpha =~ s/^(.)(\w|\W)+/$1/;
  }
  return $alpha;
}

Rewrite:

sub first_alpha {
  my $string = shift;
  $string =~ s/\s*\b(A|a|An|an|The|the)(_|\s)//xi;

  my $alpha = uc substr($string, 0, 1);
  if ($alpha =~ /^\d/) {
    $alpha = '#';
  }
  elsif ($alpha !~ /^\p{uppercase}/) {
    $alpha = '!';
  }
  return $alpha;
}

No matter how hysterical I get, my problems are not time sensitive. So, relax, have a cookie, and a very nice day!

Lady Aleena

Re^3: Special character not being captured by choroba (Cardinal) on Jun 21, 2019 at 07:15 UTC
> when I go to get the first character (...) I suddenly need to specify the encoding

UTF-8 is a multi-byte encoding, which means that some characters, Æ being one of them, are encoded by more than one byte (in this case, two bytes: 0xC3 0x86). If a string starts with such a character but Perl doesn't know the encoding, it assumes Latin-1, which is a single-byte encoding. The first character then corresponds to the first byte only, which is 0xC3. On its own, that byte is not a valid UTF-8 sequence, so it's displayed as �, the replacement character.

map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
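
To make the byte-versus-character difference concrete, here is a minimal sketch. It assumes the input arrives as raw UTF-8 bytes; the sample string "Æon" and the variable names are illustrative only and not taken from the module in question. It also shows why whole keys and values can look right without any decoding: the bytes pass through unchanged, and only operations that count characters, such as substr, split them.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The UTF-8 bytes for "Æon", written as explicit bytes so the example
# does not depend on the encoding of this source file.
my $bytes = "\xC3\x86on";

# Undecoded: Perl sees four one-byte characters, so substr returns only 0xC3.
my $first_byte = substr($bytes, 0, 1);

# Decoded: Perl sees three characters, and substr returns the whole Æ (U+00C6).
my $chars      = decode('UTF-8', $bytes);
my $first_char = substr($chars, 0, 1);

printf "undecoded: length %d, first = 0x%02X\n", length $bytes, ord $first_byte;
printf "decoded:   length %d, first = U+%04X\n", length $chars, ord $first_char;
# prints:
# undecoded: length 4, first = 0xC3
# decoded:   length 3, first = U+00C6
```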
Re^4: Special character not being captured by Lady_Aleena (Priest) on Jun 23, 2019 at 17:47 UTC
One last thing, I've been trying to figure out how to add utf8 to `first_alpha`, which I posted earlier. I am not having any success with it. So, how should I add it to that subroutine?
Re^5: Special character not being captured by choroba (Cardinal) on Jun 24, 2019 at 07:19 UTC
It doesn't belong there. You should always decode the input as soon as possible, and similarly encode the output immediately before sending it out. `first_alpha` should receive an already decoded string.
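
A minimal sketch of that decode-early, encode-late flow, using the rewritten `first_alpha` from earlier in the thread unchanged. The raw input here is a hypothetical stand-in (the UTF-8 bytes of "Æon Flux"); in the real module the bytes would come from wherever the data is read.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Rewritten first_alpha from earlier in the thread, unchanged.
sub first_alpha {
  my $string = shift;
  $string =~ s/\s*\b(A|a|An|an|The|the)(_|\s)//xi;

  my $alpha = uc substr($string, 0, 1);
  if ($alpha =~ /^\d/) {
    $alpha = '#';
  }
  elsif ($alpha !~ /^\p{uppercase}/) {
    $alpha = '!';
  }
  return $alpha;
}

# Hypothetical raw input: the UTF-8 bytes of "Æon Flux".
my $raw = "\xC3\x86on Flux";

my $title  = decode('UTF-8', $raw);   # decode as soon as the input arrives
my $letter = first_alpha($title);     # works on characters, returns Æ
print encode('UTF-8', "$letter\n");   # encode only on the way out
```

When the data comes from a file, an I/O layer such as `open my $fh, '<:encoding(UTF-8)', $file` does the decoding at read time, and `binmode(STDOUT, ':encoding(UTF-8)')` does the encoding at print time, so the explicit decode and encode calls are not needed.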
