This can get rather complicated to parse. Quoted, delimited data has edge cases that can throw a wrench into your parsing methods, and I haven't found a good module that covers all the subtleties. Just as an example, if your data looks like you describe:

121212, "Simpson, Bart", Springfield

this is a trivial matter to parse, but what if your data looks like:

121212,"2" tape, white", springfield

If you'll never encounter quotes embedded within your fields, it is less of a problem. If you are dead set against using some of the fine CPAN modules, or even Text::Balanced (a core module) as previously suggested, you could do something like this:

Untested pseudo-code

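RECORD:
while (my $line = <DATA>) {
    chomp $line;
    my $quoting = 0;      # true while inside protecting quotes
    my @fields  = ('');   # fields collected so far for this record
    # walk the record 1 byte at a time
    for (my $i = 0; $i < length $line; $i++) {
        my $byte = substr($line, $i, 1);
        # peek ahead; treat end of record like a delimiter
        my $next_byte = $i + 1 < length $line ? substr($line, $i + 1, 1) : ',';
        if ($byte eq '"') {
            if (!$quoting && $fields[-1] =~ /^\s*$/) {
                $quoting = 1;           # opening quote at the start of a field
                $fields[-1] = '';
            } elsif ($quoting && $next_byte eq ',') {
                $quoting = 0;           # closing quote: a delimiter follows
            } else {
                $fields[-1] .= $byte;   # embedded quote, keep it as data
            }
        } elsif ($byte eq ',' && !$quoting) {
            push @fields, '';           # unprotected delimiter ends the field
        } else {
            $fields[-1] .= $byte;       # ordinary byte, or a protected comma
        }
    }
    print join('|', @fields), "\n";
}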

The idea is to read a record, then walk through it 1 byte at a time, trying to determine whether a delimiter is inside a set of protecting quotes.
It gets more difficult if your data is more complex than the examples above.

One other thing: the above method is not very fast, so if you have tons of data (hundreds of megs, gigs, or teras) you may have to wait a while.

In the end, you're probably best off using a module.
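
For what it's worth, here is a minimal sketch of the module route using Text::ParseWords, which ships with Perl. That module is my pick for illustration, not one named earlier in this thread, and split_record is just a hypothetical helper name. It assumes well-formed input like the first example (balanced quotes, comma delimiters); I would not count on it for the embedded-quote case.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# split_record is a hypothetical helper name, not part of any module.
# Splits one record on commas, ignoring commas inside double quotes.
sub split_record {
    my ($line) = @_;
    chomp $line;
    # \s*,\s* also eats the whitespace around each delimiter;
    # the 0 tells parse_line to drop the protecting quotes.
    return parse_line('\s*,\s*', 0, $line);
}

my @fields = split_record('121212, "Simpson, Bart", Springfield');
print join('|', @fields), "\n";   # prints 121212|Simpson, Bart|Springfield
```

Passing 1 instead of 0 as the second argument keeps the quotes in the returned fields, if you need to write the record back out unchanged.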

In reply to Re: regular expression (search and destroy) by sweetblood
in thread regular expression (search and destroy) by data67
