
First, thanks A LOT for the insightful answers!

Both of you are talking about filehandles, I wish I were using some... In fact I am wrangling here with the output of various modules, in this case the LWP lib.

"decode it properly to get 'perl's internal format' "
means that I use the $mess->decoded_content() method of HTTP::Message, about which the doc says: "Returns the content with any Content-Encoding undone and strings mapped to perl's Unicode strings." For me this means "internal format", which is in fact a utf8 encoding dialect (but I should forget about that anyway...)

The utf8 flag of the module's output is ON, and this is where the confusion happened: I thought that the utf8 flag was set although the data was really unicode octets. But now I understand it as:

decoded_content() returns utf8 encoded unicode (step 1),

my perl script and its regexes should handle utf8 encoded unicode (step 2) - so everything is fine.

And the output should also be utf8 encoded unicode, which it already is, so I modified the pipeline to skip the wrong encode step (new step 3). Am I doing it right now?
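To make the three steps checkable, here is a minimal sketch using only the core Encode module. It is not the real LWP code: decode('UTF-8', ...) merely stands in for what decoded_content() does, and the sample string and all variable names are made up for illustration. It shows the character string being longer in octets than in characters, the utf8 flag being on after decoding, a regex working on characters, and what an explicit encode would produce if the string were printed directly rather than serialized.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumption: stand-in for the raw response body, the UTF-8 octets of "Mueller" with u-umlaut
my $octets = "M\xC3\xBCller";

# Step 1: stand-in for decoded_content(), turning octets into a Perl character string
my $chars = decode('UTF-8', $octets);

my $octet_len = length $octets;                  # counts bytes
my $char_len  = length $chars;                   # counts characters
my $flag_on   = utf8::is_utf8($chars) ? 1 : 0;   # internal flag, for inspection only

# Step 2: the regex sees characters, so \w+ covers the whole word including the umlaut
my ($word) = $chars =~ /^(\w+)$/;
my $regex_ok = (defined $word && $word eq $chars) ? 1 : 0;

# Step 3 (only when writing text directly): characters back to UTF-8 octets
my $out = encode('UTF-8', $chars);
my $roundtrip_ok = $out eq $octets ? 1 : 0;

print "octets=$octet_len chars=$char_len flag=$flag_on regex=$regex_ok roundtrip=$roundtrip_ok\n";
```

Comparing length before and after decoding is usually a more reliable check than looking at the flag alone, since the flag only describes Perl's internal storage.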

For the interested reader: in fact I use Storable to serialize my resulting data structure as a whole, then I gzip the freeze'd data and write it to disk through a filehandle with a simple binmode (and thus not :utf8). Any problems here? The utf8 data and the utf8 flag should stay intact over the pipeline.
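The pipeline described above can be tried out as a round trip with core modules only. This is a sketch under assumptions: the hash, its key and the temporary file are hypothetical, and IO::Compress::Gzip is used here as one possible way to gzip the frozen data (the original post does not say which module does the compression).

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
use File::Temp qw(tempfile);

# Hypothetical data: a character string containing a code point above 255
my $data = { title => "caf\x{e9} \x{263A}" };

# Serialize, then compress; both steps work on byte strings
my $frozen = freeze($data);
my $zipped;
gzip(\$frozen => \$zipped) or die "gzip failed: $GzipError";

# Write through a plain binmode handle (no :utf8 layer)
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh;
print {$fh} $zipped;
close $fh or die "close failed: $!";

# Read back the same way
open my $in, '<', $path or die "open failed: $!";
binmode $in;
my $read = do { local $/; <$in> };
close $in;

# Decompress and deserialize
my $unzipped;
gunzip(\$read => \$unzipped) or die "gunzip failed: $GunzipError";
my $copy = thaw($unzipped);

my $same = $copy->{title} eq $data->{title} ? 1 : 0;
my $flag = utf8::is_utf8($copy->{title}) ? 1 : 0;
print "same=$same flag=$flag\n";
```

The binmode handle matters here because the gzip output is binary data; an :utf8 layer on that handle would re-encode the bytes and corrupt the file.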

In reply to Re: The unicode / utf8 struggle, part 2: regexes by isync
in thread The unicode / utf8 struggle, part 2: regexes by isync
