http://qs321.pair.com?node_id=198828

samurai has asked for the wisdom of the Perl Monks concerning the following question:

My ex-supervisor and I recently had a conversation about what the best way to delimit data is in relation to OTHER languages, not just perl.

I argued any character would be as "optimal" as any other character (although I prefer \t). Heck, "a" could be used as the delimiter as long as you backslash-escape it. He argued that tab should ALWAYS be used simply because it's "the standard" and it's not used as much in data as "a".

Does everyone agree on what the "optimal" (or easiest, if you'd rather) delimited file format is? Is there one format that's better for perl than for other languages? Is there one delimited format that's easier for every language to play with?

--
perl: code of the samurai

Replies are listed 'Best First'.
Re: Best X-delimited format?
by Abigail-II (Bishop) on Sep 18, 2002 at 16:59 UTC
    I doubt everyone will agree what the "optimal" delimited format is. A few points:
    • For a program, it doesn't matter what the delimiter is - an "a" is as easy as a tab or a comma.
    • For humans, it matters.
    • I give much kudos to things that are debuggable with vi and telnet.
    • Tabs lose points, because they are not always easy to distinguish from spaces. Furthermore, it's not uncommon to configure an editor to expand tabs to spaces.
    • Printable punctuation characters are better than letters, digits or control characters.
    • The delimiter should be chosen in such a way that it's not a common character in the data, to avoid use of a backslash. Don't use a dot as a delimiter when delimiting decimal numbers.
    • I've preferences for colons (because important files in /etc do so), semi-colons, dots, hyphens (all three because it's natural) and "horizontal whitespace", that is, any sequence of one or more spaces or tabs. Then you can make columns.
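
As a rough sketch of the last point, here is how colon-delimited and "horizontal whitespace"-delimited lines might be split in Python. The sample lines and function names are hypothetical, made up for illustration, and no escaping is handled.

```python
import re

# Hypothetical sample lines, made up for illustration.
passwd_style = "samurai:x:1001:1001:Code Samurai:/home/samurai:/bin/sh"
column_style = "samurai \t 1001   /home/samurai"

def split_colons(line):
    """Split a colon-delimited record into fields (no escaping handled)."""
    return line.split(":")

def split_horizontal_whitespace(line):
    """Split on any run of one or more spaces or tabs."""
    return re.split(r"[ \t]+", line.strip())

print(split_colons(passwd_style))
print(split_horizontal_whitespace(column_style))
```

Note that the whitespace variant only works when no field can itself contain a space or be empty, which is the price of being able to line the data up in columns.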

    Abigail

Re: Best X-delimited format?
by katgirl (Hermit) on Sep 18, 2002 at 14:18 UTC
    I use "|" (pipe) for most of mine... another question:

    Is there any delimiter that should definitely not be used? Ever ever under any circumstances? On pain of... er... pain?

      That probably depends on the nature of the data that you plan to delimit. I find that when I'm delimiting numbers that have decimal points or commas, a dot or comma delimiter tends to get really confusing :^). The same goes for delimiting strings with any common punctuation. I tend to use Text::ParseWords and don't run into much trouble.
      Jason
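
The reply above relies on Perl's Text::ParseWords. A comparable idea in Python, offered only as a sketch and not as an equivalent of that module, is to let the standard csv module quote any field that contains the delimiter. The sample record and the helper names are assumptions for illustration.

```python
import csv
import io

# Hypothetical record: a name, a price with a thousands comma, a free-text note.
row = ["widget", "1,234.50", 'says "hi", then leaves']

def write_line(fields):
    """Serialize one record as a comma-delimited line, quoting where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerow(fields)
    return buf.getvalue()

def read_line(line):
    """Parse one comma-delimited line back into a list of fields."""
    return next(csv.reader(io.StringIO(line)))

line = write_line(row)
print(line, end="")
print(read_line(line))
```

With quoting in place, a comma inside the data no longer collides with the comma used as delimiter, although the line is still harder for a human to scan.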

      I'd never use a  \ because it hurts.

       - and . are some other bad ideas.

Re: Best X-delimited format?
by mce (Curate) on Sep 18, 2002 at 14:23 UTC
    Hi,
    There are plenty of standards for delimiters.
    The easiest is CSV (with a ,) or tabs, as you mention.

    But the de facto standard nowadays for delimiting data is XML or, more generally, SGML.
    ---------------------------
    Dr. Mark Ceulemans
    Senior Consultant
    IT Masters, Belgium

      XML is a bit of overkill, isn't it? I mean, a delimiter is usually one byte, whereas XML tags are much larger in bytes by comparison.

      --
      perl: code of the samurai

        XML does result in larger files. That said, larger files are very rarely a problem at this stage. If you are involved in flat files that would be so large that XML encoding them would be prohibitive, then the flat files are probably too big already. Also, XML allows much flexibility in adding, moving, and rearranging the data. It's obviously not the one answer for every situation, but it's the best answer for most situations.
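
To make the size trade-off concrete, here is a small sketch that serializes one record both ways and prints the lengths. The record, its field names, and the `record` element name are hypothetical; real data and a real schema will give different ratios.

```python
import xml.etree.ElementTree as ET

# Hypothetical record; the field names are made up for illustration.
record = {"name": "samurai", "uid": "1001", "shell": "/bin/sh"}

def as_delimited(rec, delim="\t"):
    """Join the values with a one-character delimiter (field names are lost)."""
    return delim.join(rec.values())

def as_xml(rec):
    """Wrap each value in an element named after its field."""
    root = ET.Element("record")
    for key, value in rec.items():
        ET.SubElement(root, key).text = value
    return ET.tostring(root, encoding="unicode")

flat = as_delimited(record)
xml = as_xml(record)
print(len(flat), repr(flat))
print(len(xml), xml)
```

The XML form is longer, but it carries the field names with it, which is where the flexibility in adding and rearranging fields comes from.
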
Re: Best X-delimited format?
by sauoq (Abbot) on Sep 18, 2002 at 16:50 UTC
    I argued any character would be as "optimal" as any other character (although I prefer \t).

    I tend to agree with this in theory but in real life, commas, tabs, and pipes are probably used most frequently.

    You can get more efficiency out of the case where you are positive that your delimiter won't show up in your data. Otherwise, you really need at least two characters: the delimiter and a quoting (or escaping) character. The less either of those shows up in your data, the less processing you will have to do and the smaller your input will be. So, it is better to use a less frequently used character as the delimiter.
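
A minimal sketch of the two-character scheme described above, assuming a backslash as the escaping character and, to echo the original question, "a" as the default delimiter. The function names and the sample record are hypothetical.

```python
def join_fields(fields, delim="a", esc="\\"):
    """Join fields, escaping any delimiter or escape character in the data."""
    out = []
    for field in fields:
        # Escape the escape character first, then the delimiter.
        escaped = field.replace(esc, esc + esc).replace(delim, esc + delim)
        out.append(escaped)
    return delim.join(out)

def split_fields(line, delim="a", esc="\\"):
    """Split a line produced by join_fields, undoing the escaping."""
    fields, current, chars = [], [], iter(line)
    for ch in chars:
        if ch == esc:
            # Take the next character literally; a trailing escape is kept as-is.
            current.append(next(chars, esc))
        elif ch == delim:
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields

# Hypothetical record containing the delimiter, a tab, and a backslash.
record = ["banana", "tab\there", "back\\slash"]
for delim in ("a", "\t"):
    line = join_fields(record, delim=delim)
    print(repr(delim), len(line), repr(line))
    assert split_fields(line, delim=delim) == record
```

Both delimiters round-trip correctly, but for this record the "a" version needs more escapes and produces a longer line, which illustrates why a rarely used character is the better choice.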

    I disagree with your ex-supervisor's assertion that tab is "the standard." There simply is no standard.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Best X-delimited format?
by fglock (Vicar) on Sep 18, 2002 at 14:14 UTC

    it's not used as much in data

    That's it: it depends on what data you have. Anyway, I like both \t and #, but I also use ',' and ';' sometimes.

    update: \t is not so good because some text editors will change it to spaces. It might get very hard to debug that.

Re: Best X-delimited format?
by Zaxo (Archbishop) on Sep 18, 2002 at 18:57 UTC

    If da boss wants standards, you can go retro. There is the ASCII control set, chars 0..31. It has RS == chr(30), FS == chr(28), ESC == chr(27), and more to choose from for fancier formats.

    That violates much of the good sense in the other replies: these characters aren't printable, so text editors and human readers will have trouble. Efficiency for text data is great, though.
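
As a sketch of the retro approach, the following uses RS (chr(30)) between records, as named above. For the separator between fields it assumes US, the "unit separator" at chr(31) in the same control set; that choice, the function names, and the sample data are assumptions for illustration.

```python
# ASCII control characters. RS is named in the post; US (chr(31)) is the
# "unit separator" from the same control set and is an assumption here.
RS = chr(30)  # record separator, placed between records
US = chr(31)  # unit separator, placed between fields

def dump(records):
    """Serialize a list of records (lists of strings) with no escaping."""
    return RS.join(US.join(fields) for fields in records)

def load(blob):
    """Parse a blob produced by dump back into a list of records."""
    return [record.split(US) for record in blob.split(RS)]

# Hypothetical data full of characters that trouble other delimiters.
records = [
    ["Tom's junk.", "13 @ $4.32 each"],
    ["tab\tinside", "comma, pipe | caret ^"],
]
blob = dump(records)
print(repr(blob))
assert load(blob) == records
```

No escaping is needed as long as the data is ordinary text, but the output is unreadable in most editors, which is the drawback already noted.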

    After Compline,
    Zaxo

Re: Best X-delimited format?
by zengargoyle (Deacon) on Sep 19, 2002 at 05:22 UTC

    I just want to jump in for my favorite, "^".

    It seems to be rarely used. If you have a "comment" field for humans to use, almost all of the other punctuation on the keyboard will be used by somebody.

    • Tom's junk.
    • 13 @ $4.32 each
    • primary john; secondary phil;
    • upstairs (behind the door)

    But rarely 'tell mark ^ paul'. Plus it hangs high in the line and has space below for visual hooking.