Re^4: Handling utf-8 characters when scraping

by nysus (Vicar)
on Dec 26, 2018 at 18:02 UTC (#1227720)


in reply to Re^3: Handling utf-8 characters when scraping
in thread Handling utf-8 characters when scraping

OK, I thought I had ruled out the possibility that it was Dumper doing the encoding, but looking at it again, I think you are right.

For now, I'm just storing the data in a file with Storable. The data will probably end up in a simple database like SQLite eventually. I've been bitten by UTF-8 encoding issues in the past, so my objective for now is to nip the problem in the bud by ensuring the data stays in UTF-8 from start to finish.
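
A minimal sketch of that start-to-finish approach, assuming the scraped input arrives as UTF-8 encoded bytes: decode once on the way in, keep character strings in the data structure that Storable writes, and encode once on the way out. The hash key `price` and the temp file are made up for illustration and are not from the actual scraper.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Bytes as they might arrive from a scraped page (UTF-8 encoded euro sign).
my $bytes = "\xE2\x82\xAC";

# Decode once, at the input boundary, into a Perl character string.
my $chars = decode('UTF-8', $bytes);

# Write the structure to disk and read it back (hypothetical key name).
my (undef, $file) = tempfile(UNLINK => 1);
store({ price => $chars }, $file);
my $data = retrieve($file);

# Encode once, at the output boundary.
my $out = encode('UTF-8', $data->{price});

# One character inside the program, three bytes outside it.
printf "chars: %d, bytes: %d\n", length($data->{price}), length($out);
```

If the retrieved string still has length 1 and encodes back to the same three bytes, nothing in between has double-encoded it.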

Thanks for looking at this and providing assistance.

$PM = "Perl Monk's";
$MCF = "Most Clueless Friar Abbot Bishop Pontiff Deacon Curate Priest Vicar";
$nysus = $PM . ' ' . $MCF;

Re^5: Handling utf-8 characters when scraping
by haukex (Bishop) on Dec 27, 2018 at 08:17 UTC

    One way to verify what Perl is really storing internally is Devel::Peek. What I would look for is that the UTF8 flag is on, and the string when shown as UTF-8 is correct:

    use Devel::Peek;
    my $str = "\x{20AC}";
    Dump($str);
    __END__
    SV = PV(0x15c0d70) at 0x15e0440
      REFCNT = 1
      FLAGS = (POK,IsCOW,pPOK,UTF8)
      PV = 0x15e4e10 "\342\202\254"\0 [UTF8 "\x{20ac}"]
      CUR = 3
      LEN = 10
      COW_REFCNT = 1

    Here, [UTF8 "\x{20ac}"] is correct. There's also utf8::is_utf8($str) to check for the UTF8 flag, although I'd recommend only using that for debugging as well. If you don't want all the extra output, you might just say:

    use Data::Dump;
    my $str = "\x{20AC}";
    dd $str, utf8::is_utf8($str);
    __END__
    ("\x{20AC}", 1)
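
For comparison, here is a rough sketch of how the same euro sign looks before and after decoding, assuming the input is UTF-8 bytes. The variable names are only illustrative. As above, the flag check is meant for debugging, not for program logic.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $raw     = "\xE2\x82\xAC";          # three UTF-8 bytes, not yet decoded
my $decoded = decode('UTF-8', $raw);   # one character, U+20AC

# Print length, UTF8 flag state, and first code point of each string.
for my $s ($raw, $decoded) {
    printf "length=%d flag=%s first=U+%04X\n",
        length($s), (utf8::is_utf8($s) ? 'on' : 'off'), ord($s);
}
```

The undecoded string reports a length of 3 with the flag off, while the decoded one reports a length of 1 with the flag on.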
