http://qs321.pair.com?node_id=644786

Note: the following is not about a bug in Perl (it concerns a misused feature), but about a bug that might be in your Perl program. What follows is a detailed account of what was discovered and shared with the audience.

At the Technical Dutch Open Source Event, T-DOSE, during the workshop "Exploiting Open Source" led by security consultant Tim Hemel, a flaw was discussed that exists in several Perl programs.

Technical background

Perl has Unicode strings that are internally encoded as either ISO-8859-1 or UTF8. A flag, called "SvUTF8", a.k.a. "the UTF8 flag", is set to 1 for strings that are UTF-8 internally, and to 0 for strings that are ISO-8859-1 (or raw binary) internally. On the Perl side of things, regardless of the internal encoding, you have a string that consists of characters (not bytes).
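As a small illustration of the point above, the following sketch stores the same four characters in both internal encodings and shows that only the flag differs. The variable names are made up for this example; utf8::upgrade and utf8::is_utf8 are core functions that need no module to be loaded.

```perl
use strict;
use warnings;

my $latin = "caf\xE9";      # 4 characters, held as ISO-8859-1 internally
my $wide  = $latin;
utf8::upgrade($wide);       # same 4 characters, now held as UTF-8 internally

printf "UTF8 flag: %d and %d\n",
    (utf8::is_utf8($latin) ? 1 : 0), (utf8::is_utf8($wide) ? 1 : 0);
printf "length:    %d and %d\n", length($latin), length($wide);
print  "the two strings compare equal\n" if $latin eq $wide;
```

The flag differs (0 and 1), but the length and the result of eq do not, which is what "characters, not bytes" means in practice.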

Once the UTF8 flag is set, Perl does not check the validity of the UTF8 sequences further. Typically, this is okay, because it was Perl that set the flag in the first place. However, some people set the UTF8 flag manually. They circumvent protection built into encoding/decoding functions and PerlIO layers, either because it's easier (less typing), for performance reasons, or even because they don't know they're doing something wrong.

The :utf8 PerlIO layer sets the UTF8 flag, without checking the byte sequences, on incoming data. This is not a bug or a flaw, but the very function of this PerlIO layer. It is used internally by other layers (most importantly the :encoding layer), after they have (safely) converted the input to UTF8. A function that sets the UTF8 flag, _utf8_on, is available from the Encode module.
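To make the "flag without checking" behaviour concrete, here is a minimal sketch that forces the flag onto the malformed bytes used later in this post and then asks Perl whether the result is well-formed. The variable names are illustrative only. The string is deliberately never printed or measured, since operating on a malformed string may warn or die depending on the Perl version.

```perl
use strict;
use warnings;
use Encode ();

my $bytes  = "foo\xC9;id";     # C9 followed by 3B is not valid UTF-8
my $forced = $bytes;
Encode::_utf8_on($forced);     # flips the flag, performs no validation

printf "UTF8 flag set: %s\n", utf8::is_utf8($forced) ? 'yes' : 'no';
printf "well-formed:   %s\n", utf8::valid($forced)   ? 'yes' : 'no';
```

The flag ends up set even though utf8::valid reports that the internal representation is inconsistent.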

Several XS modules set the UTF8 flag on incoming data from a file or a socket (think of databases and network protocols), sometimes without checking the validity of the UTF8 sequences.

Perl's functions use Unicode semantics by default (except for some bug, but see Unicode::Semantics for a workaround), which means that \w matches any alphanumeric character or underscore. That covers a very large number of Unicode characters. Similar semantics are in effect for \d and \s, but many people assume that \w is short for [A-Za-z0-9_], that \d is short for [0-9], and that \s is short for [ \f\t\r\n]. This is not true. Since 5.8, released more than 5 years ago, they match with Unicode semantics.
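A quick way to see the difference is to test two non-ASCII characters against both the shorthand classes and the literal classes people assume they stand for. This is only a sketch; the two code points were picked for illustration (U+027B is the character that appears in the exploit below).

```perl
use strict;
use warnings;

my $arabic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $turned_r     = "\x{027B}";   # LATIN SMALL LETTER TURNED R WITH HOOK

print "U+0663 matches \\d\n"              if $arabic_three =~ /^\d$/;
print "U+0663 does not match [0-9]\n"     if $arabic_three !~ /^[0-9]$/;
print "U+027B matches \\w\n"              if $turned_r     =~ /^\w$/;
print "U+027B does not match [A-Za-z0-9_]\n"
    if $turned_r !~ /^[A-Za-z0-9_]$/;
```

All four lines are printed: the shorthand classes accept both characters, the literal ASCII classes reject them.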

Proof of concept exploit

The (contrived) proof of concept exploit:

test.bin is a file containing the following 7 bytes:

66 6f 6f c9 3b 69 64
f  o  o  ***** i  d

***** represents an invalid UTF8 byte sequence, with a starting byte indicating a character length of 2 bytes, and a byte that in ASCII is a semicolon (!).
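For anyone who wants to reproduce this, one way to create the file is sketched below. It simply writes the seven bytes listed above in binary mode; the use of pack with the 'H*' template is my choice for this example and not part of the original demonstration.

```perl
use strict;
use warnings;

# Write the seven bytes 66 6f 6f c9 3b 69 64 without any encoding layer.
open my $out, '>:raw', 'test.bin' or die "cannot create test.bin: $!";
print {$out} pack('H*', '666f6fc93b6964');
close $out or die "cannot close test.bin: $!";

printf "wrote %d bytes\n", -s 'test.bin';
```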

sploit.pl is the following simple Perl program:

    #!/usr/bin/perl -T
    use strict;

    %ENV = ( PATH => '/usr/bin' );

    open my $filehandle, "< :utf8", "test.bin" or die $!;
    my $word = readline $filehandle;

    my ($untainted) = $word =~ /^(\w+)$/;

    if ($untainted) {
        # It passed the regex, so it is "safe".
        system "echo $untainted";
    }

When this program is executed, the C9 3B together will be interpreted as the Unicode character U+027B (which when UTF8 encoded properly would have been C9 BB), but the shell sees a semicolon and executes not only echo, but also id.
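The arithmetic behind that interpretation can be sketched as follows. The helper lax_two_byte is hypothetical and only models the assumption that the low six bits of the second byte are used without first checking that it has the 10xxxxxx continuation form; it is not Perl's actual decoder.

```perl
use strict;
use warnings;

# Hypothetical model of lax two-byte decoding: keep the low 5 bits of the
# start byte and the low 6 bits of the second byte, with no check that the
# second byte really is a continuation byte.
sub lax_two_byte {
    my ($first, $second) = @_;
    return (($first & 0x1F) << 6) | ($second & 0x3F);
}

printf "C9 3B -> U+%04X\n", lax_two_byte(0xC9, 0x3B);   # malformed pair
printf "C9 BB -> U+%04X\n", lax_two_byte(0xC9, 0xBB);   # well-formed pair
printf "3B is a continuation byte: %s\n",
    (0x3B & 0xC0) == 0x80 ? 'yes' : 'no';
```

Both pairs come out as U+027B under this model, while the last line shows that 3B does not have the bit pattern a validating decoder would require.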

For some reason, with warnings enabled, this program throws a fatal exception (not a warning) "Malformed UTF-8 character (unexpected non-continuation byte 0x3b, immediately after start byte 0xc9)". Because this is probably a side effect of something, and because warnings are often disabled dynamically (at a distance), this does not provide sufficient protection.

The solution is very simple: do not use :utf8, but use :encoding(UTF8) (or for strict Unicode compliant UTF-8, use :encoding(UTF-8) (same, but with a hyphen)), as should have been done in the first place.

More subtle vulnerabilities exist when a module like a database library assumes that data (e.g. from the database) is valid UTF8, but it isn't (for example, because the database engine allows inserting arbitrary binary data into the field). This was not tested at T-DOSE, but a quick look at the source code makes me think that while DBD::SQLite may be vulnerable (uses SvUTF8_on without checking), DBD::mysql (uses sv_utf8_decode) and DBD::Pg (uses is_utf8_string) are probably not.

The security vulnerability is the result of naive use of Perl's API, possibly inspired by misleading documentation. It is not a bug in perl itself.

There may be other vectors of attack for abusing malformed UTF8 sequences.

Recommendations

Please do not set the UTF8 flag unless you are fully convinced that your data is actually valid UTF8, and remember that :utf8 sets the UTF8 flag without checking.

Instead of the :utf8 PerlIO layer, use :encoding(UTF8) or :encoding(UTF-8).

Instead of _utf8_on, use utf8::decode or Encode::decode_utf8 or Encode::decode("UTF8", ...), or Encode::decode("UTF-8", ...).

Instead of SvUTF8_on, use sv_utf8_decode, or check validity first, with is_utf8_string.

Instead of writing \w, \d, or \s, write a literal character class if you do not want non-ASCII parts to match, or filter/forbid non-ASCII characters (those with a codepoint (numeric value) greater than 127) beforehand.
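Two of these recommendations (validating while decoding, and using a literal character class) can be combined as in the following sketch. The helper name ascii_word is made up for this example, and the choice of strict "UTF-8" with Encode::FB_CROAK is one reasonable option rather than the only correct one.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the validated word, or undef when the input
# is not well-formed UTF-8 or contains anything outside [A-Za-z0-9_].
sub ascii_word {
    my ($bytes) = @_;
    my $copy = $bytes;   # decode() with a CHECK value may modify its argument
    my $text = eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK) };
    return undef unless defined $text;
    my ($word) = $text =~ /^([A-Za-z0-9_]+)$/;
    return $word;
}

for my $input ("foo_1\n", "foo\xC9;id", "foo\xC9\xBBid") {
    my $word = ascii_word($input);
    print defined $word ? "accepted: $word\n" : "rejected\n";
}
```

The first input is accepted. The second is rejected because it is malformed, and the third because it is valid UTF-8 that contains a non-ASCII character.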

Perl documentation flaws

Several official Perl documents use :utf8 in code examples. This has already been changed in the current development version earlier this year, and will be updated in the next release. My own document perlcheat is wrong about equivalencies for \w, \d and \s, and I will try to have this repaired soon.

Update: license added (requested): This report (© 2007 Juerd Waalboer <#####@juerd.nl>) may be copied with attribution, under the CC:by license.
Update: system() with a unicode string is a violation of text/binary separation, but encoding $untainted to UTF8 or UTF-8 explicitly (as should have been done) does not solve the security problem because these are optimized and use the internal value when Perl believes it is valid UTF(-)8.

Re: UTF8 related proof of concept exploit released at T-DOSE
by graff (Chancellor) on Oct 14, 2007 at 22:26 UTC
    Given that the exploit relies on using byte sequences that cannot be interpreted as valid utf8 strings, I would think that anyone writing a script that uses the "-T" flag, and expects to handle utf8 data from a tainted source, would prefer to read such input as ":raw", and always use Encode::decode() to convert it to perl-internal utf8 form.

    And in doing so, it would usually be prudent to do it like this (adapting the sample code given in the OP):

    #!/usr/bin/perl -T
    use strict;
    use Encode;

    %ENV = ( PATH => '/usr/bin' );

    open my $filehandle, "< :raw", "test.bin" or die $!;
    my $word = readline $filehandle;

    eval { $word = decode( "utf8", $word, Encode::FB_CROAK ) };
    if ( $@ ) {
        warn "unusable input from test.bin\n";
    }
    else {
        my ($untainted) = $word =~ /^(\w+)$/;
        if ($untainted) {
            # It passed the regex, so it is "safe".
            system "echo $untainted";
        }
    }

      I would think that anyone writing a script that uses the "-T" flag, and expects to handle utf8 data from a tainted source, would prefer to read such input as ":raw", and always use Encode::decode() to convert it to perl-internal utf8 form.

      Why go through that trouble if ":encoding(UTF-8)" does exactly the same thing, the same safe way, only with less code?

      Using :raw with decode is exactly as safe as using :encoding(UTF-8), because it literally does the same things internally, only through a different wrapper :)

      Now, :utf8 is unsafe (when reading), but this has nothing to do with taint mode. Of course, in the contrived example in the root node, an informed careful programmer would have done two things differently: they would have used :encoding and they would not have used \w. The scary part, however, is that many careful programmers don't know that what they're doing is dangerous!

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        Why go through that trouble if ":encoding(UTF-8)" does exactly the same thing, the same safe way, only with less code?

        If it is sufficient that the app simply never gets to see a malformed byte sequence (or anything following a malformed character) when reading from a source that is expected to be utf8, you're right -- better to handle it via the ":encoding(utf8)" layer in PerlIO.

        But if there's any need to diagnose the nature of the malformedness, or to recover any amount of usable data following a bad byte sequence within a given input record, then the extra steps involving "decode('utf8',$string,...)" are the only way to do that, I think.
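A rough sketch of that kind of diagnosis, assuming the documented behaviour of Encode::FB_QUIET (decode returns what it could process and leaves the unprocessed remainder in the source variable). The helper name decode_with_report and the policy of skipping exactly one byte per error are assumptions made for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: decodes as much as possible and reports the byte
# offset of every byte that had to be skipped.
sub decode_with_report {
    my ($bytes) = @_;
    my $total = length $bytes;
    my $text  = '';
    my @bad_offsets;
    while (length $bytes) {
        # FB_QUIET: return the cleanly decoded part, leave the rest in $bytes.
        $text .= Encode::decode('UTF-8', $bytes, Encode::FB_QUIET);
        last unless length $bytes;
        push @bad_offsets, $total - length $bytes;
        substr($bytes, 0, 1, '');    # drop one offending byte and retry
    }
    return ($text, \@bad_offsets);
}

my ($text, $bad) = decode_with_report("foo\xC9;id");
print "recovered text: $text\n";
print "bad byte offsets: @$bad\n";
```

Note that recovering data this way yields "foo;id", semicolon included, so the recovered text still has to pass a strict check before it goes anywhere near a shell.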

Re: UTF8 related proof of concept exploit released at T-DOSE
by demerphq (Chancellor) on Oct 15, 2007 at 23:50 UTC

    Because this is probably a side effect of something

    I'm not sure what you mean. My guess is that it comes from the internals when the regex engine tries to read a codepoint from the string; since it's not valid, it dies.

    The solution is very simple: do not use :utf8, but use :encoding(UTF8) (or for strict Unicode compliant UTF-8, use :encoding(UTF-8) (same, but with a hyphen)), as should have been done in the first place.

    That's really crappy. It's huffman coded all wrong. IMO this should be raised on perl5porters with some thought to changing it for the better.

    ---
    $world=~s/war/peace/g

      Because this is probably a side effect of something
      I'm not sure what you mean.

      I mean that I find it surprising that enabling warnings suddenly makes the program die. It should warn, not die. Or, alternatively, it should die even without "use warnings".

      "use warnings" without FATAL argument should not introduce fatal errors to the language. I suspect that the fatal exception is a side effect, not the intended behaviour.

      The solution is very simple: do not use :utf8, but use :encoding(UTF8) (or for strict Unicode compliant UTF-8, use :encoding(UTF-8) (same, but with a hyphen)), as should have been done in the first place.
      That's really crappy. It's huffman coded all wrong. IMO this should be raised on perl5porters with some thought to changing it for the better.

      I agree that the huffman coding here is entirely wrong. Everything surrounding identifiers for the UTF8 flag, including its own names "SvUTF8" and "the UTF8 flag", is very unfortunate. The very short name for the :utf8 PerlIO layer is downright dangerous, if :encoding(utf8) is the correct style.

      However, I insist that :utf8 must not be made an abbreviation for :encoding(UTF-8), because that would encourage people to use :utf8, which in 5.8.0 thru 5.8.8 is a security risk, and these versions will stay around for a long time.

      One solution that comes to mind is:

      1. Rename :utf8 to :_svUTF8. It is a direct interface to internals and should look like that.
      2. Keep support for :utf8 for backwards compatibility, but issue a mandatory warning.

      Optionally:
        3. Allow ":enc" as an abbreviation for ":encoding"
        4. Allow "=foo" as an abbreviation for "(foo)" so you can have ":enc=utf8" which is doable

      1 and 2 are, IMO, a good solution for a real problem. I'm not so sure 3 and 4 would be good: they'd make programs and modules depend on a new version of Perl only for syntactic sugar.

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }