PerlMonks  

The module version works, but the standalone version crashes with "Malformed UTF-8 character"

by Perlfan52 (Novice)
on Mar 09, 2020 at 15:56 UTC ( [id://11114010]=perlquestion )

Perlfan52 has asked for the wisdom of the Perl Monks concerning the following question:

I have two versions of the same Perl search script: one is standalone, and the other is a three-line Perl script that calls the main function in a Perl module. Both versions have the same code. The module version works as expected, while the standalone version crashes with the error "Malformed UTF-8 character" at the line with the regex match /Romanti[cqk]/io.

This must be an internal bug of perl. It is well known that perl had (and probably has) unicode/utf8 issues. I am using Strawberry Perl 5.30.0 (built for MSWin32-x64-multi-thread) on a Windows 10 Pro client with recent updates.

Here are the code and the files needed to reproduce the problem.

- The files to be searched are in http://ftp.freedb.org/pub/freedb/freedb-update-20200201-20200301.tar.bz2
I extract this archive into C:/MyScripts/freedb-update-20200201-20200301

- The module version of the script C:/MyScripts/search_script_with module.pl is:

    BEGIN{push(@INC,'C:/MyScripts');}
    use searchFreedb;
    mainSearchFreedb('C:/MyScripts/freedb-update-20200201-20200301');
    print "End of script\n";

- The corresponding module C:/MyScripts/searchFreedb.pm is:

    package searchFreedb;
    use strict;
    use utf8;
    use vars qw($VERSION @ISA @EXPORT @EXPORT_OK);
    require Exporter;
    @ISA = qw(Exporter);
    @EXPORT = qw(mainSearchFreedb);
    @EXPORT_OK = qw(mainSearchFreedb);
    $VERSION = 1.0;
    $| = 1;

    #############################################################
    # mainSearchFreedb
    #############################################################
    sub mainSearchFreedb {
        my ($searchdir) = @_;
        open(FILE, ">C:/MyScripts/sresult_module.txt") || die "$!\n";
        binmode FILE, ":utf8";
        recursivSearchFreedb($searchdir);
        close(FILE);
    }

    #############################################################
    # recursivSearchFreedb
    #############################################################
    sub recursivSearchFreedb {
        my ($dir) = @_;
        die "dir $dir!\n" if(!$dir || !(-e $dir && -d $dir));
        $dir =~ s/[\/\\]+/\//og;
        $dir = $dir . '/' if( $dir !~ /\/$/o );
        my ($dirname) = ( $dir =~ /^.*\/([^\/]+?)\/*$/o );
        opendir(DIR,$dir) || warn __LINE__."$!\n";
        my @all_dir_files = readdir(DIR);
        closedir(DIR);
        print "Folder: $dir => $dirname\n";
        foreach my $dir_file ( sort @all_dir_files ) {
            $dir_file =~ /^\.+$/o && next;
            my $abspath = $dir . $dir_file;
            if( -d $abspath ) {
                recursivSearchFreedb($abspath);
            }
            else {
                if($dir_file =~ /(^COPYING$|^README$)/io) {
                    print "skipping $dir_file\n";
                    next;
                }
                elsif(-z $abspath) {
                    next;
                }
                my ($content);
                open(IN, "<$abspath") || die "$!\n";
                while(my $line = <IN>) {
                    next if not $line =~ /^#\s+xmcd/o;
                    $content .= $line;
                    my ($TITLEALL,$DISCID,$GENRE);
                    for(;;) {
                        my $line2 = <IN>;
                        if($line2=~/^\s*DTITLE\s*=(.*)$/o) {$TITLEALL .= $1;}
                        if($line2=~/^\s*DISCID=\s*(.+?)\s*$/o) {$DISCID = $1;}
                        if($line2=~/^\s*DGENRE\s*=(.*)$/o) {$GENRE .= $1;}
                        $content .= $line2;
                        if($line2 =~ /^PLAYORDER=/o) {
                            if( $TITLEALL =~ /Romanti[cqk]/io ) {
                                print FILE "$content\n";
                            }
                            last;
                        }
                    }
                }
                close(IN);
            }
        }
    }

    ##############################################################
    # end of package
    ##############################################################
    1;

- The standalone version of the script C:/MyScripts/search_script_standalone.pl is:
    use strict;
    use utf8;
    $| = 1;

    #############################################################
    # recursivSearchFreedb
    #############################################################
    sub recursivSearchFreedb {
        my ($dir) = @_;
        die "dir $dir\n" if(!$dir || !(-e $dir && -d $dir));
        $dir =~ s/[\/\\]+/\//og;
        $dir = $dir . '/' if( $dir !~ /\/$/o );
        my ($dirname) = ( $dir =~ /^.*\/([^\/]+?)\/*$/o );
        opendir(DIR,$dir) || warn __LINE__."$!\n";
        my @all_dir_files = readdir(DIR);
        closedir(DIR);
        print "Folder: $dir => $dirname\n";
        foreach my $dir_file ( sort @all_dir_files ) {
            $dir_file =~ /^\.+$/o && next;
            my $abspath = $dir . $dir_file;
            if( -d $abspath ) {
                recursivSearchFreedb($abspath);
            }
            else {
                if($dir_file =~ /(^COPYING$|^README$)/io) {
                    print "skipping $dir_file\n";
                    next;
                }
                elsif(-z $abspath) {
                    next;
                }
                my ($content);
                open(IN, "<$abspath") || die "$!\n";
                while(my $line = <IN>) {
                    next if not $line =~ /^#\s+xmcd/o;
                    $content .= $line;
                    my ($TITLEALL,$DISCID,$GENRE);
                    for(;;) {
                        my $line2 = <IN>;
                        if($line2=~/^\s*DTITLE\s*=(.*)$/o) {$TITLEALL .= $1;}
                        if($line2=~/^\s*DISCID=\s*(.+?)\s*$/o) {$DISCID = $1;}
                        if($line2=~/^\s*DGENRE\s*=(.*)$/o) {$GENRE .= $1;}
                        $content .= $line2;
                        if($line2 =~ /^PLAYORDER=/o) {
                            if( $TITLEALL =~ /Romanti[cqk]/io ) {
                                print FILE "$content\n";
                            }
                            last;
                        }
                    }
                }
                close(IN);
            }
        }
    }

    ############################################################
    # main starts here
    ############################################################
    open(FILE, ">C:/MyScripts/sresult_standalone.txt") || die "$!\n";
    binmode FILE, ":utf8";
    recursivSearchFreedb('C:/MyScripts/freedb-update-20200201-20200301');
    close(FILE);
    print "End of script\n";

I am starting the module version with
"perl -CDS search_script_with module.pl"

The result is:

    Folder: C:/MyScripts/freedb-20200201-20200301/
    Folder: C:/MyScripts/freedb-20200201-20200301/blues/
    Folder: C:/MyScripts/freedb-20200201-20200301/classical/
    Folder: C:/MyScripts/freedb-20200201-20200301/country/
    Folder: C:/MyScripts/freedb-20200201-20200301/data/
    Folder: C:/MyScripts/freedb-20200201-20200301/folk/
    Folder: C:/MyScripts/freedb-20200201-20200301/jazz/
    Folder: C:/MyScripts/freedb-20200201-20200301/misc/
    Folder: C:/MyScripts/freedb-20200201-20200301/newage/
    Folder: C:/MyScripts/freedb-20200201-20200301/reggae/
    Folder: C:/MyScripts/freedb-20200201-20200301/rock/
    Folder: C:/MyScripts/freedb-20200201-20200301/soundtrack/
    End of script

I am starting the standalone version with
"perl -CDS search_script_standalone.pl"

The result is (it crashes very quickly):

    Folder: C:/MyScripts/freedb-20200201-20200301/
    Folder: C:/MyScripts/freedb-20200201-20200301/blues/
    Malformed UTF-8 character: \xf6\x6e\x20\x26 (unexpected non-continuation byte 0x6e, immediately after start byte 0xf6; need 4 bytes, got 1) in pattern match (m//) at C:\MYSCRI~1\SEARCH~2.PL line 55, <IN> line 67.
    Malformed UTF-8 character (fatal) at C:\MYSCRI~1\SEARCH~2.PL line 55, <IN> line 67.

Any ideas why the standalone version crashes? Can you reproduce the problem on your own PC? Thank you for your answers and ideas.

Replies are listed 'Best First'.
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character"
by choroba (Cardinal) on Mar 09, 2020 at 21:51 UTC
    > perl -CDS

    "D" corresponds to "i + o" whose documentation in perlrun states (emphasis mine):

    > The "io" options mean that any subsequent open() (or similar I/O operations) in the current file scope will have the ":utf8" PerlIO layer implicitly applied to them, in other words, UTF-8 is expected from any input stream, and UTF-8 is produced to any output stream.

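    If you put the call to open into a module, it falls out of the current file scope, so the implicit ":utf8" layer is not applied there: the module version reads raw bytes and never validates them, while the standalone version opens its files in the main script's scope and dies on the first input that is not valid UTF-8.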

    -C is intended for one-liners; in larger programs and modules, use binmode, explicit layers with 3-arg open, or open.pm.
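
As a minimal sketch of the "explicit layers with 3-arg open" option: the helper name read_lines and the temporary test file are assumptions made for illustration and are not part of the scripts in the question. The point is only that the encoding is named at the open call itself, so it behaves the same in a script and in a module, with or without -C.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: read all lines of a file, decoding them with an
# explicitly named encoding instead of relying on -CD.
sub read_lines {
    my ($path, $encoding) = @_;
    open(my $in, "<:encoding($encoding)", $path)
        or die "Cannot open $path: $!\n";
    my @lines = <$in>;
    close($in);
    chomp @lines;
    return @lines;
}

# Write the Latin-1 bytes of the problematic line to a temporary file ...
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;    # raw bytes, no encoding layer
print {$fh} "TTITLE2=Jung, sch\xf6n & stylish feat. Justus\n";
close($fh);

# ... and read them back with the layer that matches the file's encoding.
my ($line) = read_lines($path, 'ISO-8859-1');
print "matched\n" if $line =~ /sch\x{f6}n/;
```

Which encoding to pass is a per-file decision; the freedb data in this thread appears to mix UTF-8 and Latin-1 files, so a single hard-coded layer would not fit every file.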

    map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      You are right, my friend. I must have overlooked it. If I start with "perl -CS" instead of "perl -CDS", both versions work. Thank you very much.
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character"
by 1nickt (Canon) on Mar 09, 2020 at 16:11 UTC

    Hi, since you are on Windows I would pay attention to the encoding of your file. MSFT often uses UTF-16, and if you try to decode that as if it were UTF-8 you could see that error IIUC.

    Related info on "middle byte" that has bitten me with JSON data: https://tools.ietf.org/html/rfc4627#section-3.

    Hope this helps!


    The way forward always starts with a minimal test.
Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character"
by jo37 (Deacon) on Mar 09, 2020 at 18:52 UTC

    The error message states:

    Malformed UTF-8 character: \xf6\x6e\x20\x26

    In Latin-1 encoding this would be "ön &". This string (or something similar in another encoding) apparently occurs in one of your files, and at least that file is not UTF-8 encoded.
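
The reading of those four bytes can be checked with a short sketch. It uses only the bytes quoted in the error message; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xf6\x6e\x20\x26";    # the bytes quoted in the error message

# utf8::decode works in place and returns false if the octets are not
# well-formed UTF-8, so it is applied to a copy here.
my $copy    = $bytes;
my $is_utf8 = utf8::decode($copy) ? 1 : 0;

# Interpreting the same bytes as Latin-1 always succeeds.
my $latin1 = decode('ISO-8859-1', $bytes);

printf "valid UTF-8: %s\n", $is_utf8 ? 'yes' : 'no';
printf "as Latin-1: %s\n",
    join ' ', map { sprintf('U+%04X', ord($_)) } split //, $latin1;
# prints:
# valid UTF-8: no
# as Latin-1: U+00F6 U+006E U+0020 U+0026
```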

    -jo

      Yes, it's file blues/020d9511 which is Latin-1 encoded. In UTF-8, the problematic line would be
      TTITLE2=Jung, schön & stylish feat. Justus

      map{substr$_->[0],$_->[1]||0,1}[\*||{},3],[[]],[ref qr-1,-,-1],[{}],[sub{}^*ARGV,3]
      The question is not why or where the standalone version fails. If you delete blues/020d9511, it crashes immediately at the next UTF-8 encoded file.
      The question is why the standalone version fails while the module version works without any problem, although the code is exactly the same. What makes a module so different internally that perl interprets the two cases differently? For me as a developer, a given piece of code must always give the same result, but in this case I am really helpless.

        Here I disagree. The input is malformed and a crashing program is the right thing™ here. I would not care why the other version does not crash but instead correct the input data.
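
If the input cannot be corrected up front, one possible way to handle it is to read raw bytes and decode each line explicitly. This is a sketch only: the helper name decode_line is hypothetical, and it assumes that anything that is not valid UTF-8 is Latin-1, which the thread confirms for one file but not for the whole archive.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try strict UTF-8 first, fall back to Latin-1.
# ASSUMPTION: every non-UTF-8 line is Latin-1 encoded.
sub decode_line {
    my ($bytes) = @_;
    # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps
    # $bytes intact so it can be decoded again in the fallback.
    my $text = eval {
        decode('UTF-8', $bytes, Encode::FB_CROAK() | Encode::LEAVE_SRC());
    };
    return defined $text ? $text : decode('ISO-8859-1', $bytes);
}

# Both encodings of "schön" end up as the same character string.
my $from_utf8   = decode_line("sch\xc3\xb6n");
my $from_latin1 = decode_line("sch\xf6n");
print "same text\n" if $from_utf8 eq $from_latin1;
```

A Latin-1 byte sequence that happens to be valid UTF-8 would be misread by this approach, so fixing the data itself remains the more reliable route.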

        -jo

Re: The module version works, but the standalone version crashes with "Malformed UTF-8 character"
by LanX (Saint) on Mar 09, 2020 at 16:06 UTC
    > This must be an internal bug of perl. It is well known that perl had (and probably has) unicode/utf8 issues.

    you mean "well known" among first time posters who still use the /o modifier and can't boil down their problem to an SSCCE ?

    edit

    did you really save both files as UTF8?

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      All three files are saved as UTF-8 without BOM and with Unix line terminators (with UltraEdit).
      All scripts contain only ASCII characters, which means I could also save them as ANSI/ASCII. I tested that too, and the result does not change.
      Without the /o modifier the result is the same.
