PerlMonks  

Re: Handling utf-8 characters when scraping

by haukex (Archbishop)
on Dec 26, 2018 at 08:35 UTC ( [id://1227702] )


in reply to Handling utf-8 characters when scraping

Assuming your source file is correctly encoded in UTF-8, the output you've shown is correct: \x{2026} is U+2026 HORIZONTAL ELLIPSIS. Could you show an SSCCE of the code you're having trouble with?
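
To make the point about \x{2026} concrete, here is a minimal sketch (not the original poster's code) showing that the escape is just another spelling of a single character, and what that character looks like once encoded as UTF-8. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# "\x{2026}" is an escape notation for one character,
# U+2026 HORIZONTAL ELLIPSIS.
my $ellipsis = "\x{2026}";

my $length    = length $ellipsis;                 # number of characters
my $codepoint = sprintf 'U+%04X', ord $ellipsis;  # code point in hex
my $bytes     = encode('UTF-8', $ellipsis);       # the encoded octets
my $hex       = join ' ', map { sprintf '%02X', ord } split //, $bytes;

# prints: 1 character, U+2026, UTF-8 bytes: E2 80 A6
print "$length character, $codepoint, UTF-8 bytes: $hex\n";
```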

Replies are listed 'Best First'.
Re^2: Handling utf-8 characters when scraping
by nysus (Parson) on Dec 26, 2018 at 09:44 UTC

    The code I'm having trouble with works just like the sample code above. I don't want the ellipsis to get encoded into \x{2026}.

    $PM = "Perl Monk's";
    $MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
    $nysus = $PM . ' ' . $MCF;
    Click here if you love Perl Monks

      I don't want the ellipsis to get encoded into \x{2026}.

      That's Data::Dumper doing that, and I don't think there's a way to turn it off. Data::Dump seems to be similar. Data::Printer does seem to do what you want:

      use warnings;
      use strict;
      use open qw/:std :utf8/;
      use Data::Printer { print_escapes=>1 };
      my $str = "(\N{U+2026}\n)";
      p $str;

      Gives me: "(…\n)"

      ... although on the other hand, these modules are all debugging tools, and not really tools for generating consistently formatted output. For that, other formats are better - if you could explain the application, then perhaps we could make other suggestions.
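
As a rough illustration of the Data::Dumper behaviour described above: the \x{2026} in the dump is only how Dumper spells the string as Perl source, not a change to the data. The sketch below assumes a plain string rather than the original poster's scraped structure, and sets Useqq explicitly so the escaped form is produced regardless of which Dumper implementation is loaded.

```perl
use strict;
use warnings;
use Data::Dumper;

# Compact, eval-able dump: no "$VAR1 = " prefix, no indentation,
# double-quoted strings with escapes.
$Data::Dumper::Terse  = 1;
$Data::Dumper::Indent = 0;
$Data::Dumper::Useqq  = 1;

my $original = "(\x{2026})";       # three characters: "(", U+2026, ")"
my $dumped   = Dumper($original);  # Perl source text describing the string
my $restored = eval $dumped;       # evaluate that source text again

# The dump itself is plain ASCII, so printing it needs no output layer.
print "dumped: $dumped\n";
print "same string after eval: ", ($restored eq $original ? "yes" : "no"), "\n";
print "length after eval: ", length($restored), "\n";
```

The escaped form and the literal character are the same string; only the debugging output differs.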

        OK, I thought I had ruled out the possibility that it was Dumper doing the encoding, but looking at it again, I think you are right.

        For now, I'm just storing the data in a file with Storable. The data will probably end up in a simple database like SQLite eventually. I've been bitten in the past by UTF-8 encoding, and my objective for now is to nip the problem in the bud by ensuring the data stays in UTF-8 from start to finish.

        Thanks for looking at this and providing assistance.
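
A minimal sketch of the Storable round trip mentioned above, assuming the scraped text has already been decoded into Perl character strings. The record and its field name are hypothetical, and a temporary file stands in for the real data file.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempfile);

# Hypothetical scraped record; the field name is made up for illustration.
my %record = (title => "Read more\x{2026}");

# Write the record to a temporary file and read it back.
my ($fh, $filename) = tempfile(UNLINK => 1);
close $fh;
nstore(\%record, $filename);
my $copy = retrieve($filename);

# The ellipsis should come back as the same single character.
print "round trip ok\n" if $copy->{title} eq $record{title};
printf "last character: U+%04X\n", ord substr($copy->{title}, -1);
```

Checking the stored data with eq or ord, rather than by eyeballing Dumper output, avoids mistaking an escaped display for a changed string.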

