PerlMonks  

HTML Entities segfault - malformed utf8

by qq (Hermit)
on Jun 29, 2004 at 18:15 UTC ( [id://370565]=perlquestion )

qq has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

The following script causes a segmentation fault on Red Hat but runs fine on OS X (see the perl -V output for both below). I'm not sure where to start looking, so any advice would be much appreciated.

I know that when you install HTML::Entities you are asked if you want it to encode Unicode characters. I have no idea if that's relevant, but while I know it was compiled with that support on the OS X box, I don't know whether it was on the Red Hat box.
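One way to compare the two boxes is to ask the installed module directly. The following is a minimal sketch, not a definitive check: it assumes the installed HTML::Entities exposes a `UNICODE_SUPPORT` function (later HTML::Parser releases do, but a 2004-era install may not), so the code tests for it before calling. The helper name `entities_unicode_status` is made up for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Report how the installed HTML::Entities was built.
# Returns a short human-readable status string.
sub entities_unicode_status {
    # Load at run time so a missing module is reported instead of dying.
    return 'HTML::Entities is not installed'
        unless eval { require HTML::Entities; 1 };

    my $version = HTML::Entities->VERSION;
    $version = 'unknown' unless defined $version;

    # Assumption: UNICODE_SUPPORT exists only in some releases,
    # so look it up before calling it.
    my $probe = HTML::Entities->can('UNICODE_SUPPORT');
    return "HTML::Entities $version: no UNICODE_SUPPORT probe available"
        unless $probe;

    return sprintf 'HTML::Entities %s: Unicode support %s',
        $version, ( $probe->() ? 'enabled' : 'disabled' );
}

print entities_unicode_status(), "\n";
```

Running this on both machines and comparing the two lines of output would show whether the builds differ in version or in Unicode support.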

    #!/usr/bin/perl

    use strict;
    use warnings;

    use XML::RAI;
    use HTML::Entities;

    my $xml = do { local $/; <DATA> };
    my $r = XML::RAI->parse( $xml );

    foreach ( @{$r->items} ) {
      my $t = $_->title;
      print "$t\n";
      $t = decode_entities($t);
      print "$t\n";
      $t = encode_entities($t);
      print "$t\n";
    }

    __DATA__
    <?xml version="1.0" ?>
    <rss version="0.91">
      <channel>
        <title>Smartmoney.com - Consumer Action</title>
        <link>http://www.smartmoney.com/consumer/?nav=RSS091</link>
        <description>Investing, Saving and Personal Finance</description>
        <language>en-us</language>
        <copyright>Copyright 2004 Smartmoney.com, joint venture of Dow Jones &amp; Co. and Hearst Communications, Inc.</copyright>
        <item>
          <title>The Modern R&amp;eacute;sum&amp;eacute;</title>
          <link>http://www.smartmoney.com/consumer/index.cfm?story=20040505&amp;nav=RSS091</link>
          <description>R&amp;eacute;sum&amp;eacute;s that worked even a few years ago aren&apos;t effective today. Here are five essential updates.</description>
        </item>
      </channel>
    </rss>

#### output

    [jollyr@devbox jollyr]$ ./test.pl
    The Modern R&eacute;sum&eacute;
    Wide character in print at ./test.pl line 16, <DATA> line 1.
    The Modern R?sum?
    Malformed UTF-8 character (unexpected end of string) at /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm line 435, <DATA> line 1.
    Malformed UTF-8 character (unexpected non-continuation byte 0x73, immediately after start byte 0xe9) in substitution iterator at /usr/lib/perl5/site_perl/5.8.3/i386-linux-thread-multi/HTML/Entities.pm line 435, <DATA> line 1.
    Segmentation fault
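The second warning names start byte 0xe9 followed by 0x73, which is what "és" looks like as Latin-1 bytes. One possible reading, offered as a guess rather than a diagnosis, is that the string carries Perl's internal UTF-8 flag while its buffer holds Latin-1 bytes. The sketch below uses only core functions (`utf8::is_utf8`, `utf8::valid`) to inspect a scalar; the helper name `describe_string` is made up here, and `Encode::_utf8_on` appears only to simulate a mis-flagged string for the demonstration.

```perl
use strict;
use warnings;
use Encode ();

# Summarise a scalar's internal state without touching its characters:
# utf8_flag is Perl's internal UTF-8 marker, valid says whether the
# buffer is consistent with that marker.
sub describe_string {
    my ($s) = @_;
    return sprintf 'utf8_flag=%d valid=%d',
        ( utf8::is_utf8($s) ? 1 : 0 ),
        ( utf8::valid($s)   ? 1 : 0 );
}

# Latin-1 bytes with the flag off.
my $bytes = "R\xE9sum\xE9";

# The same text, properly converted to Perl's internal UTF-8 form.
my $upgraded = $bytes;
utf8::upgrade($upgraded);

# Simulated damage: flag forced on without converting the bytes.
# This is for illustration only and should not be done in real code.
my $broken = $bytes;
Encode::_utf8_on($broken);

printf "%-9s %s\n", 'bytes:',    describe_string($bytes);
printf "%-9s %s\n", 'upgraded:', describe_string($upgraded);
printf "%-9s %s\n", 'broken:',   describe_string($broken);
```

Calling `describe_string($t)` after each step in the loop of the test script (after `title`, after `decode_entities`) would show at which point, if any, the string stops being valid.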

#### broken on this

    > perl -V
    Summary of my perl5 (revision 5.0 version 8 subversion 3) configuration:
      Platform:
        osname=linux, osvers=2.4.21-9.elsmp, archname=i386-linux-thread-multi
        uname='linux bugs.devel.redhat.com 2.4.21-9.elsmp #1 smp thu jan 8 17:08:56 est 2004 i686 i686 i386 gnulinux '
        config_args='-des -Doptimize=-O2 -g -pipe -march=i386 -mcpu=i686 -Dversion=5.8.3 -Dmyhostname=localhost -Dperladmin=root@localhost -Dcc=gcc -Dcf_by=Red Hat, Inc. -Dinstallprefix=/usr -Dprefix=/usr -Darchname=i386-linux -Dvendorprefix=/usr -Dsiteprefix=/usr -Duseshrplib -Dusethreads -Duseithreads -Duselargefiles -Dd_dosuid -Dd_semctl_semun -Di_db -Ui_ndbm -Di_gdbm -Di_shadow -Di_syslog -Dman3ext=3pm -Duseperlio -Dinstallusrbinperl -Ubincompat5005 -Uversiononly -Dpager=/usr/bin/less -isr -Dinc_version_list=5.8.2 5.8.1 5.8.0'
        hint=recommended, useposix=true, d_sigaction=define
        usethreads=define use5005threads=undef useithreads=define usemultiplicity=define
        useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
        use64bitint=undef use64bitall=undef uselongdouble=undef
        usemymalloc=n, bincompat5005=undef
      Compiler:
        cc='gcc', ccflags ='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -I/usr/include/gdbm',
        optimize='-O2 -g -pipe -march=i386 -mcpu=i686',
        cppflags='-D_REENTRANT -D_GNU_SOURCE -DTHREADS_HAVE_PIDS -DDEBUGGING -fno-strict-aliasing -I/usr/local/include -I/usr/include/gdbm'
        ccversion='', gccversion='3.3.2 20031218 (Red Hat Linux 3.3.2-5)', gccosandvers=''
        intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=1234
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=12
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=4, prototype=define
      Linker and Libraries:
        ld='gcc', ldflags =' -L/usr/local/lib'
        libpth=/usr/local/lib /lib /usr/lib
        libs=-lnsl -lgdbm -ldb -ldl -lm -lcrypt -lutil -lpthread -lc
        perllibs=-lnsl -ldl -lm -lcrypt -lutil -lpthread -lc
        libc=/lib/libc-2.3.2.so, so=so, useshrplib=true, libperl=libperl.so
        gnulibc_version='2.3.2'

#### fine on this

    11 ~>perl -V
    Summary of my perl5 (revision 5 version 8 subversion 4) configuration:
      Platform:
        osname=darwin, osvers=7.3.0, archname=darwin-2level
        uname='darwin noras-computer.local 7.3.0 darwin kernel version 7.3.0: fri mar 5 14:22:55 pst 2004; root:xnuxnu-517.3.15.obj~4release_ppc power macintosh powerpc '
        config_args='-des -Dprefix=/opt/local -Dccflags=-I'/opt/local/include' -Dldflags=-L/opt/local/lib -Dvendorprefix=/opt/local'
        hint=recommended, useposix=true, d_sigaction=define
        usethreads=undef use5005threads=undef useithreads=undef usemultiplicity=undef
        useperlio=define d_sfio=undef uselargefiles=define usesocks=undef
        use64bitint=undef use64bitall=undef uselongdouble=undef
        usemymalloc=n, bincompat5005=undef
      Compiler:
        cc='cc', ccflags ='-I/opt/local/include -pipe -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -I/usr/local/include -I/opt/local/include',
        optimize='-Os',
        cppflags='-no-cpp-precomp -I/opt/local/include -pipe -fno-common -DPERL_DARWIN -no-cpp-precomp -fno-strict-aliasing -I/usr/local/include -I/opt/local/include'
        ccversion='', gccversion='3.3 20030304 (Apple Computer, Inc. build 1495)', gccosandvers=''
        intsize=4, longsize=4, ptrsize=4, doublesize=8, byteorder=4321
        d_longlong=define, longlongsize=8, d_longdbl=define, longdblsize=8
        ivtype='long', ivsize=4, nvtype='double', nvsize=8, Off_t='off_t', lseeksize=8
        alignbytes=8, prototype=define
      Linker and Libraries:
        ld='env MACOSX_DEPLOYMENT_TARGET=10.3 cc', ldflags ='-L/opt/local/lib -L/usr/local/lib'
        libpth=/usr/local/lib /opt/local/lib /usr/lib
        libs=-lgdbm -ldbm -ldl -lm -lc
        perllibs=-ldl -lm -lc
        libc=/usr/lib/libc.dylib, so=dylib, useshrplib=false, libperl=libperl.a
        gnulibc_version=''
      Dynamic Linking:
        dlsrc=dl_dyld.xs, dlext=bundle, d_dlsymun=undef, ccdlflags=' '
        cccdlflags=' ', lddlflags='-L/opt/local/lib -bundle -undefined dynamic_lookup -L/usr/local/lib'

    Characteristics of this binary (from libperl):
      Compile-time options: USE_LARGE_FILES
      Built under darwin
      Compiled at Jun 24 2004 19:12:14
      %ENV:
        PERL5LIB="/opt/local/lib/perl5/site_perl/5.8.2/"
      @INC:
        /opt/local/lib/perl5/site_perl/5.8.2/
        /opt/local/lib/perl5/5.8.4/darwin-2level
        /opt/local/lib/perl5/5.8.4
        /opt/local/lib/perl5/site_perl/5.8.4/darwin-2level
        /opt/local/lib/perl5/site_perl/5.8.4
        /opt/local/lib/perl5/site_perl
        /opt/local/lib/perl5/vendor_perl/5.8.4/darwin-2level
        /opt/local/lib/perl5/vendor_perl/5.8.4
        /opt/local/lib/perl5/vendor_perl
        .

thanks, qq

(cross posted to perl-unicode, but no answers yet)

Replies are listed 'Best First'.
Re: HTML Entities segfault - malformed utf8
by matija (Priest) on Jun 29, 2004 at 19:08 UTC
    It works OK on perl-5.8.4 (Debian Unstable). Some Red Hat releases have very bad problems with UTF-8; you should Super Search for that. I think you need to set up the locale correctly before it will work on Red Hat.
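    To see what locale Perl actually ends up with on each box, something like the following could be run on both. It is only a sketch using the core POSIX and I18N::Langinfo modules; the helper name `locale_report` is made up for this example, and `langinfo(CODESET)` depends on the platform providing `nl_langinfo`.

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);
use I18N::Langinfo qw(langinfo CODESET);

# Collect the locale-related environment variables and the character
# set that the C library derives from them.
sub locale_report {
    my @lines;
    for my $var (qw(LC_ALL LC_CTYPE LANG)) {
        push @lines, sprintf '%-8s = %s', $var,
            defined $ENV{$var} ? $ENV{$var} : '(unset)';
    }

    # Adopt the locale named by the environment; undef means it
    # could not be set (for example, the locale is not installed).
    my $active = setlocale( LC_CTYPE, '' );
    push @lines, sprintf 'LC_CTYPE in effect = %s',
        defined $active ? $active : '(could not be set)';

    # Ask the C library which codeset that locale implies.
    push @lines, sprintf 'codeset = %s', langinfo(CODESET);

    return @lines;
}

print "$_\n" for locale_report();
```

    A UTF-8 codeset on the Red Hat box and a non-UTF-8 one on the other machine would at least be consistent with the locale theory.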

      Thanks. Super Search points to problems with Red Hat and 5.8.0, particularly with the locale settings. But those seem to have been fixed after 5.8.0, so I'm assuming it's something different.
