Jim
<blockquote><b><i>How much speed are you willing to sacrifice for a new feature?</i></b></blockquote>
<p>As much as it takes, ’cause I <em>need</em> the damn feature! ☺</p>
<p>IMHO, a modern CSV parser must be able to parse Unicode text encoded in any Unicode [http://www.unicode.org/glossary/#character_encoding_scheme|character encoding scheme] and with any arbitrary Unicode characters (code points, or even &lt;<em>gulp</em>&gt; [http://www.unicode.org/glossary/#extended_grapheme_cluster|extended grapheme clusters]) used for CSV metacharacters. And it must properly handle the Unicode [http://www.unicode.org/glossary/#byte_order_mark|byte order mark] as prescribed by the Unicode Standard.</p>
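<p>For what "properly handle the byte order mark" could look like in practice, here's a minimal sketch of BOM sniffing in plain Perl (no CPAN modules). The <c>sniff_bom</c> function name and the UTF-8 default are my own choices, not anything prescribed by a particular parser:</p>

```perl
use strict;
use warnings;

# Encoded BOM bytes and the character encoding scheme each one implies.
# Order matters: the UTF-32 BOMs must be checked before the UTF-16 BOMs
# that are their prefixes.
my @boms = (
    [ "\x00\x00\xFE\xFF" => 'UTF-32BE' ],
    [ "\xFF\xFE\x00\x00" => 'UTF-32LE' ],
    [ "\xEF\xBB\xBF"     => 'UTF-8'    ],
    [ "\xFE\xFF"         => 'UTF-16BE' ],
    [ "\xFF\xFE"         => 'UTF-16LE' ],
);

# Sniff a BOM at the start of a byte string. Returns the detected
# encoding (or a caller-supplied default) and the bytes with the
# BOM stripped, ready to be passed to Encode::decode().
sub sniff_bom {
    my ($bytes, $default) = @_;
    for my $bom (@boms) {
        my ($prefix, $encoding) = @$bom;
        if (substr($bytes, 0, length $prefix) eq $prefix) {
            return ($encoding, substr($bytes, length $prefix));
        }
    }
    return ($default // 'UTF-8', $bytes);
}

my ($encoding, $payload) = sniff_bom("\xEF\xBB\xBFBEGDOC\x14ENDDOC");
print "$encoding\n";  # UTF-8
```

<p>A BOM-less file is ambiguous, of course, which is why the function falls back to a default rather than guessing from content.</p>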
<p>I'm the monk responsible for these related posts and threads:</p>
<ul>
<li>[1007942|Best Way To Parse Concordance DAT File Using Modern Perl?]</li>
<li>[1008017|Peculiar Reference To U+00FE In Text::CSV_XS Documentation] (especially [1008165|this] final post in the thread)</li>
<li>[1062031|Re: Text::CSV and Unicode]</li>
</ul>
<p>In the case of the Concordance DAT file, the <c>sep_char</c> separator character, U+0014, is encoded in one byte in UTF-8: <c>"\x14"</c>. It's the <c>quote_char</c> quote character (and consequently also the <c>escape_char</c> quote escape character), U+00FE, that happens to be encoded in two bytes in UTF-8: <c>"\xC3\xBE"</c>.</p>
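<p>The usual way around the two-byte problem is to decode the bytes to characters first, so that U+00FE reaches the parser as a single character. A minimal sketch with <c>Text::CSV_XS</c> (the <c>BEGDOC</c>/<c>ENDDOC</c> field values are made-up sample data, and this assumes a <c>Text::CSV_XS</c> recent enough to accept these metacharacters):</p>

```perl
use strict;
use warnings;
use Encode qw(decode);
use Text::CSV_XS;

# Raw UTF-8 bytes of one Concordance DAT record: U+00FE is the
# two-byte sequence "\xC3\xBE", U+0014 is the one-byte "\x14".
my $bytes = "\xC3\xBE" . "BEGDOC" . "\xC3\xBE" . "\x14"
          . "\xC3\xBE" . "ENDDOC" . "\xC3\xBE";

# Decode first, so the parser sees U+00FE as ONE character,
# not two bytes.
my $line = decode('UTF-8', $bytes);

my $csv = Text::CSV_XS->new({
    binary      => 1,
    sep_char    => "\x{14}",  # U+0014, the Concordance field separator
    quote_char  => "\x{FE}",  # U+00FE (thorn), the Concordance quote
    escape_char => "\x{FE}",
});

$csv->parse($line) or die $csv->error_diag;
my @fields = $csv->fields;
print join(', ', @fields), "\n";  # BEGDOC, ENDDOC
```

<p>Feeding the parser the raw bytes instead would make it see <c>"\xC3"</c> and <c>"\xBE"</c> as two unrelated characters, and the quoting would fall apart.</p>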
<p>Jim</p>