http://qs321.pair.com?node_id=11137077


in reply to Re^2: Parsing/regex help required
in thread Parsing/regex help required

Perhaps you mean "em dash" instead of "en dash"?
This is called "em" because it is roughly the width of the letter "M" in a variable-width font.
An en dash is shorter, like the width of the letter "n".

In any event, you will have to read the input with UTF-8 decoding. My Perl dev environment can only handle ASCII, so I cannot easily write code for this.

As far as the regex goes:
You need to group an or'd expression, something like this: (-|em_dash)
To make it non-capturing: (?:-|em_dash)
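
A minimal sketch of that grouping, assuming "em_dash" turns out to be the code point U+2014 and that the strings are already decoded text. The sample strings and variable names are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical samples: one separated by a hyphen, one by an em dash
# (written as an escape so this source file stays pure ASCII).
my @samples = ("alpha-beta", "gamma\x{2014}delta");

# (?:...) groups the alternation without creating a capture group,
# so $1 and $2 are the words on either side of the separator.
my $sep = qr/(?:-|\x{2014})/;

my @pairs;
for my $s (@samples) {
    if ($s =~ /^(\w+)$sep(\w+)$/) {
        push @pairs, [$1, $2];
    }
}

# Only the ASCII words are printed, so no output layer is needed here.
print join(' / ', @$_), "\n" for @pairs;
```

Since both alternatives are single characters, a character class such as [-\x{2014}] should work equally well.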

The question is what "em_dash" should be, and how that relates to the decoding that was used when the data was read.

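Update: once the data has been decoded to characters, an em dash is \x{2014} (Unicode code point U+2014).
I first thought you need "use utf8;" for that to work, but that pragma only matters when literal non-ASCII characters are typed into the source file; the \x{2014} escape itself should not need it.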
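
A small sketch of why the decoding step matters, using the core Encode module. The byte string is a made-up stand-in for data read without any decoding; U+2014 is E2 80 94 in UTF-8:

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw UTF-8 bytes for "a", em dash, "b".
my $bytes = "a\xE2\x80\x94b";

# Undecoded, the dash is three separate bytes, none of which is U+2014.
my $raw_matches = ($bytes =~ /\x{2014}/) ? 1 : 0;

# After decoding, those three bytes become the single character U+2014.
my $text = decode('UTF-8', $bytes);
my $decoded_matches = ($text =~ /\x{2014}/) ? 1 : 0;

print "raw: $raw_matches, decoded: $decoded_matches\n";
```

When reading from a file, opening it with a decoding layer, for example open(my $fh, '<:encoding(UTF-8)', $path), should give the same decoded text without calling decode by hand.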

Some Monks here are quite experienced with utf8 encoding.
Bring it on!