
Heck, if I were implementing a CSV parsing module, I'd probably have separate code for the case of single-character separators, quotes, and escapes. Because the reasonable way to implement CSV parsing efficiently is rather different between when "quote" is a single character and when it is more than 1 character.

So I see no problem having a whole separate module for dealing with multi-character quotes. Use the standard module if you don't have to deal with such. Use the other module when you do. Each module is simpler because the multi-character one doesn't have to also try to include code to maximize efficiency for when a quote is a single character.

Do you mean character or byte?

I think you're using "multi-character" when what you actually mean is a single character (i.e., a single Unicode code point) that is encoded using multiple bytes in any one of the Unicode character encoding schemes: UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, and UTF-32LE. I don't think you truly mean a user-perceived character that consists of two or more Unicode code points (e.g., g̈ — U+0067 LATIN SMALL LETTER G + U+0308 COMBINING DIAERESIS).
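
Here's a rough sketch of the distinction I'm drawing, assuming a Perl recent enough (5.12 or later) that `\X` matches an extended grapheme cluster. The variable names are just mine for illustration.

```perl
use strict;
use warnings;

# One user-perceived character built from two code points:
# U+0067 LATIN SMALL LETTER G + U+0308 COMBINING DIAERESIS
my $g_diaeresis = "g\x{0308}";

# One character that is also exactly one code point:
# U+1F3AC CLAPPER BOARD
my $clapper = "\x{1F3AC}";

# length() on a character string counts code points, not bytes.
my $g_code_points       = length $g_diaeresis;
my $clapper_code_points = length $clapper;

# \X matches one extended grapheme cluster; count the matches.
my $g_clusters = () = $g_diaeresis =~ /\X/g;

printf "g + diaeresis: %d code points, %d grapheme cluster\n",
    $g_code_points, $g_clusters;
printf "clapper board: %d code point\n", $clapper_code_points;
```

The first string is two code points but one grapheme cluster; the second is a single code point no matter how many bytes an encoding later spends on it.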

In my Academy Award Best Picture winners example, every CSV metacharacter is a single character. The field separator character is 🎬 (U+1F3AC CLAPPER BOARD), and both the string delimiter character and the string delimiter escape character are 🎥 (U+1F3A5 MOVIE CAMERA). These two characters are, of course, encoded using multiple bytes in every one of the Unicode character encoding schemes. In UTF-8, they're encoded using four bytes. In UTF-16, they're also encoded using four bytes (a surrogate pair of two 16-bit code units). And in UTF-32, they're encoded using four bytes, naturally.
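
The byte counts are easy to check with the core Encode module. This is only a quick sketch; the `byte_length` helper is my own name, not part of Encode. I use the explicit big-endian scheme names because plain `UTF-16` and `UTF-32` prepend a byte order mark when encoding, which would inflate the counts.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: number of bytes one string occupies in an encoding.
sub byte_length {
    my ($encoding, $string) = @_;
    my $copy = $string;    # encode() on a copy, leaving the original untouched
    return length encode($encoding, $copy);
}

my %metachar = (
    'U+1F3AC CLAPPER BOARD' => "\x{1F3AC}",    # field separator
    'U+1F3A5 MOVIE CAMERA'  => "\x{1F3A5}",    # string delimiter and its escape
);

my %bytes;
for my $name (sort keys %metachar) {
    for my $encoding (qw(UTF-8 UTF-16BE UTF-32BE)) {
        $bytes{$name}{$encoding} = byte_length($encoding, $metachar{$name});
        printf "%-22s %-8s %d bytes\n",
            $name, $encoding, $bytes{$name}{$encoding};
    }
}
```

Each metacharacter is one code point (`length` returns 1 on the decoded string) and four bytes in each of the three encodings.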

I'd like to see a truly Unicode-conformant CSV parser/generator module in Perl 5. It would leverage Perl's existing Unicode and character encoding capabilities; it wouldn't roll its own encoding handling. It would parse already-decoded CSV records. The input to the finite-state machine would be Unicode code points, not bytes. (More ambitiously, the input to the FSM might be any arbitrary user-perceived character, or extended grapheme cluster.)
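
To make that concrete, here's a minimal sketch of the kind of finite-state machine I have in mind. It is not any existing module's code: `parse_record` is a hypothetical function, it handles a single already-decoded record with no embedded record separators, and it assumes the quote character doubles as its own escape. The sample record is made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical sketch: parse one decoded CSV record, one code point at a time.
# $sep and $quote are each a single character (a single code point).
sub parse_record {
    my ($record, $sep, $quote) = @_;
    my @fields;
    my $field = '';
    my $state = 'start';    # start | unquoted | quoted | quote_seen

    for my $char (split //, $record) {    # characters, not bytes
        if ($state eq 'quoted') {
            if ($char eq $quote) { $state = 'quote_seen' }
            else                 { $field .= $char }
        }
        elsif ($state eq 'quote_seen') {
            if ($char eq $quote) {        # doubled quote: literal quote
                $field .= $quote;
                $state = 'quoted';
            }
            elsif ($char eq $sep) {       # the quote closed the field
                push @fields, $field;
                $field = '';
                $state = 'start';
            }
            else { die "Unexpected character after closing quote\n" }
        }
        elsif ($char eq $sep) {           # start or unquoted: end of field
            push @fields, $field;
            $field = '';
            $state = 'start';
        }
        elsif ($char eq $quote && $state eq 'start') {
            $state = 'quoted';
        }
        else {
            $field .= $char;
            $state = 'unquoted';
        }
    }
    die "Unterminated quoted field\n" if $state eq 'quoted';
    push @fields, $field;
    return @fields;
}

my ($sep, $quote) = ("\x{1F3AC}", "\x{1F3A5}");
my $record = join $sep,
    '1',
    "${quote}Wings${quote}",
    "${quote}contains ${sep} and ${quote}${quote}${quote}";
my @fields = parse_record($record, $sep, $quote);

binmode STDOUT, ':encoding(UTF-8)';       # encode only at the I/O boundary
print scalar(@fields), " fields\n";
print "[$_]\n" for @fields;
```

Nothing in the state machine knows or cares how many bytes the metacharacters take. Decoding and encoding stay in Perl's I/O layers, where they belong.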

Why not?

In reply to Re^8: Speeds vs functionality by Jim
in thread Speeds vs functionality by Tux
