A problem with dash typography

hsmyers has asked for the wisdom of the Perl Monks concerning the following question:

I have a great deal of text that has both endashes and emdashes (– and — respectively) within HTML files as plain text. Since my editor gladly converts this (nary a complaint), I usually don't pay any attention. However, I recently noticed a problem with the HTML::Entities encode_entities function; i.e.

encode_entities("How the Chimney–sweeper's cry,")

produces:

How the Chimney&#150;sweeper&#39;s cry,

rather than:

How the Chimney&#8221;sweeper&#39;s cry,

Now that I've spotted the problem, I can easily do the necessary regex massage and have it go away, but I was wondering if anyone knows the necessary Unicode/UTF-8 incantation magic to avoid the problem in the first place (if in fact that is what it is)? Note that the emdash is translated to &#151; instead of &#8222;. I have not checked the other typical HTML typographical elements as yet; these are so common that the problem surfaced fairly quickly.

Note: I have left the typos as written, but I really meant &#8212 and &#8211 *sigh*

Note: https://stackoverflow.com/questions/631406/what-is-the-difference-between-em-dash-151-and-8212 seems pertinent...
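
One possible reading of the 150/151 versus 8211/8212 difference is that the text arrives as undecoded Windows-1252 bytes. That is an assumption, not something the post confirms. The sketch below uses only the core Encode module; `numeric_entities` is a hypothetical helper written for this illustration and is not part of HTML::Entities.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: replace every non-ASCII character with a decimal
# numeric entity, so the code point Perl actually sees becomes visible.
sub numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7F])/sprintf '&#%d;', ord $1/ge;
    return $text;
}

# Assumption: the input is Windows-1252, where an en dash is the byte 0x96.
my $raw = "Chimney\x96sweeper";

# Undecoded, the byte is treated as code point 150.
print numeric_entities($raw), "\n";        # Chimney&#150;sweeper

# Decoded from cp1252, the same byte becomes U+2013 (decimal 8211).
my $decoded = decode('cp1252', $raw);
print numeric_entities($decoded), "\n";    # Chimney&#8211;sweeper
```

If that assumption holds for the real files, the decoding step would need to name cp1252 rather than UTF-8.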

--hsm

"Never try to teach a pig to sing...it wastes your time and it annoys the pig."

Re: A problem with dash typography by 1nickt (Canon) on Sep 08, 2015 at 18:57 UTC
Hm, the correct numerical entity code for the em-dash is `&#8212` ... otherwise `&mdash;`.

It works for me with the correct character encoding in and out:

    [12:04][nick:~/monks]$ perl -Mstrict -Mutf8 -MHTML::Entities -E ' binmode STDOUT,":utf8";
    > say encode_entities("Chimney—sweeper");
    > say encode_entities("Chimney–sweeper");
    > say decode_entities("Chimney&mdash;sweeper");
    > say decode_entities("Chimney&ndash;sweeper");
    > '
    Chimney&mdash;sweeper
    Chimney&ndash;sweeper
    Chimney—sweeper
    Chimney–sweeper

Hope this helps!

Edit: Decoded characters may not display properly here ...

The way forward always starts with a minimal test.
Re^2: A problem with dash typography by hsmyers (Canon) on Sep 09, 2015 at 15:03 UTC
Sorry about the typos... that aside, I believe you have nailed the necessary magic with the ':utf8', except that in this case it is required before I read the file. Will see what happens, thanks!

--hsm
Re^3: A problem with dash typography by 1nickt (Canon) on Sep 09, 2015 at 16:03 UTC
If you put:

    use utf8;

at the top of the script, this tells Perl that your source code contains UTF-8-encoded Unicode characters. If you want to read and write UTF-8, do this at the top of the script:

    binmode STDIN, ':utf8';
    binmode STDOUT, ':utf8';

Hope this helps!
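
For the case raised earlier in the thread, where the text comes from a file rather than STDIN, the same idea can be applied when the file is opened. This is a minimal sketch, assuming the file really is UTF-8. `encode_file` is a hypothetical helper, the temporary file exists only to make the demo self-contained, and the stricter `:encoding(UTF-8)` layer is used here in place of the `:utf8` layer shown above.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);
use HTML::Entities qw(encode_entities);

# Hypothetical helper: read a UTF-8 file as characters, then entity-encode it.
sub encode_file {
    my ($path) = @_;
    open my $in, '<:encoding(UTF-8)', $path
        or die "Cannot open $path: $!";
    local $/;                 # slurp the whole file in one read
    my $text = <$in>;
    close $in;
    return encode_entities($text);
}

# Demo input: a temporary file containing an en dash (U+2013), written as UTF-8.
my ($fh, $path) = tempfile(UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} "Chimney\x{2013}sweeper's cry,\n";
close $fh;

print encode_file($path);     # Chimney&ndash;sweeper&#39;s cry,
```

Because the layer decodes the bytes on read, encode_entities receives a single U+2013 character and can map it to its named entity.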
Re^4: A problem with dash typography by hsmyers (Canon) on Sep 10, 2015 at 16:54 UTC
Re: A problem with dash typography by kcott (Archbishop) on Sep 09, 2015 at 02:57 UTC
G'day hsmyers,

When your source code contains UTF-8, you need to tell Perl about this. You do this with the utf8 pragma. See that documentation for more complete details on that (somewhat oversimplified) advice.

Here's my test:

    #!/usr/bin/env perl -l
    use strict;
    use warnings;
    #use utf8;
    use HTML::Entities qw{encode_entities};

    my $dash = 'DASH: "-"';
    my $emdash = 'EMDASH: "—"';
    my $endash = 'ENDASH: "–"';

    print encode_entities($_) for ($dash, $emdash, $endash);

Output from this code:

    DASH: &quot;-&quot;
    EMDASH: &quot;&acirc;&#128;&#148;&quot;
    ENDASH: &quot;&acirc;&#128;&#147;&quot;

Output after uncommenting "`#use utf8;`":

    DASH: &quot;-&quot;
    EMDASH: &quot;&mdash;&quot;
    ENDASH: &quot;&ndash;&quot;

— Ken
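
The "â" in the first output can be reproduced without relying on how the source file is saved. The sketch below, using only the core Encode module, spells out the UTF-8 bytes of an em dash explicitly. It is an illustration of the byte-versus-character distinction, not a replacement for the utf8 pragma.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The three bytes that encode an em dash (U+2014) in UTF-8.
my $bytes = "\xE2\x80\x94";

# Undecoded, Perl sees three separate characters. The first one,
# 0xE2 (decimal 226), is "â" in Latin-1.
printf "undecoded: length %d, first code point %d\n",
    length($bytes), ord($bytes);

# Decoded, the same bytes become one character, U+2014 (decimal 8212).
my $chars = decode('UTF-8', $bytes);
printf "decoded: length %d, code point %d\n",
    length($chars), ord($chars);
```

With `use utf8` in effect, a literal em dash in the source is handed to Perl in the decoded form, which is why encode_entities can then recognise it.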
Re^2: A problem with dash typography by hsmyers (Canon) on Sep 09, 2015 at 14:58 UTC
I suspicioned as much, will fiddle with this... thanks!

--hsm

