http://qs321.pair.com?node_id=632683

tinita has asked for the wisdom of the Perl Monks concerning the following question:

hello monks,

i recently wondered why some of my utf8 strings missed their utf8 flag. i found the point where they were used as arguments to Digest::MD5::md5_hex.

$ perl -wle'
use Digest::MD5 qw(md5_hex);
use Devel::Peek;
use Encode;
my $string = "äöü";
Encode::_utf8_on($string);
Dump $string;
my $md5 = md5_hex($string);
Dump $string
'
SV = PV(0x8153b00) at 0x8153684
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
  PV = 0x8174d48 "\303\244\303\266\303\274"\0 [UTF8 "\x{e4}\x{f6}\x{fc}"]
  CUR = 6
  LEN = 8
SV = PVMG(0x81ee3e0) at 0x8153684
  REFCNT = 1
  FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK)
  IV = 0
  NV = 0
  PV = 0x8174d48 "\344\366\374"\0
  CUR = 3
  LEN = 8
  MAGIC = 0x81cbca0
    MG_VIRTUAL = &PL_vtbl_utf8
    MG_TYPE = PERL_MAGIC_utf8(w)
    MG_LEN = 3

shouldn't the function leave its arguments alone?

Re: md5_hex changes its argument
by Joost (Canon) on Aug 15, 2007 at 10:14 UTC
      this has been reported in rt.cpan.org over a year ago
      oh thanks =)
      i should have been looking there myself.
      too bad though the bug doesn't seem to get solved.
        In the meantime you could use my $md5 = md5_hex("$string"); to work around the problem.
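
For anyone who wants to try the quoting workaround, here is a minimal sketch. It assumes the test string is the three characters U+00E4, U+00F6, U+00FC (as in the Dump output of the question) and uses utf8::upgrade and utf8::is_utf8 instead of Devel::Peek to set and inspect the flag. The quotes make perl hand a temporary copy to md5_hex, so only the copy gets downgraded.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Assumed test data: a character string with the UTF8 flag switched on.
my $string = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($string);

# "$string" stringifies into a temporary copy; md5_hex only ever sees the copy.
my $md5 = md5_hex("$string");

print "md5: $md5\n";
print utf8::is_utf8($string) ? "flag still on\n" : "flag lost\n";
```

The digest is still the one for the downgraded (Latin-1) bytes, so it matches what the unquoted call returns; only the side effect on $string goes away.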
        too bad though the bug doesn't seem to get solved.

        "Patches speak louder than words", as the saying goes...

        I have looked into the thing and made up a tentative patch:

        that seems to fix it. There are still problems with two failing tests. One of the fails can be fixed in a straightforward manner, the other one I haven't followed up yet.

        If you want the fix quickly you're invited to pick it up for submission at p5p. You can mail me through my berlin.pm address for more details if you want to. Otherwise I'll come back to it later, which may be much later.

        Anno

Re: md5_hex changes its argument
by graff (Chancellor) on Aug 16, 2007 at 04:48 UTC
    Until the bug is fixed, you might want to consider a small change in how you use the "md5_hex" function. There are a variety of ways to do this, depending on your preference, but they would all boil down to something like:
    my $md5 = md5_hex( encode( 'utf8', $string ));
    (update: the right function to use here is "encode", not "decode" as originally posted -- sorry for the confusion)

    That will pass a copy of the original string to md5_hex, and the copy will have the utf8 flag already turned off.
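
    A minimal, self-contained sketch of that approach, assuming the same three-character test string as in the question (the variable names are only for illustration):

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use Encode qw(encode);

# Assumed test data: a character string with the UTF8 flag switched on.
my $string = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($string);

# encode() returns a new byte string; $string itself is left alone.
my $octets = encode('utf8', $string);
my $md5    = md5_hex($octets);

print "md5: $md5\n";
printf "original flag: %s\n", utf8::is_utf8($string) ? "on" : "off";
printf "bytes hashed:  %d\n", length $octets;
```

    One thing to keep in mind: this hashes the UTF-8 octets, so the digest differs from what md5_hex($string) returns for the same text, because that call hashes the downgraded Latin-1 bytes. If digests are stored or compared elsewhere, every place has to use the same convention.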

    (update: probably the best way to do this is to write your own "wrapper" module for Digest::MD5 -- the functions in "MyMD5.pm" would check the string being passed in, and only encode() it if the utf8 flag is on. Then you just need to change the module name in the scripts that run md5 on utf8 strings.)
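
    A rough sketch of what such a wrapper could look like. Only the name "MyMD5" comes from the text above; the function body and the single-argument interface are assumptions, and the package is kept inline here so the example runs as one file rather than as a separate MyMD5.pm.

```perl
use strict;
use warnings;

package MyMD5;    # hypothetical wrapper; only the name is taken from the post

use Digest::MD5 ();
use Encode qw(encode);

# Hash one string without modifying the caller's variable.
sub md5_hex {
    my ($string) = @_;                  # work on a copy of the argument
    $string = encode('utf8', $string)   # flagged strings become UTF-8 octets
        if utf8::is_utf8($string);
    return Digest::MD5::md5_hex($string);
}

package main;

# Assumed test data: a character string with the UTF8 flag switched on.
my $text = "\x{e4}\x{f6}\x{fc}";
utf8::upgrade($text);

my $md5 = MyMD5::md5_hex($text);
print "md5: $md5\n";
print utf8::is_utf8($text) ? "flag still on\n" : "flag lost\n";
```

    With a check on the flag like this, the same characters give different digests depending on whether the flag happens to be on (UTF-8 octets) or off (the bytes as they are), so it probably only makes sense when the unflagged strings passed in are already encoded bytes.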