PerlMonks  

Re: regular expression (search and destroy)

by sweetblood (Prior)
on Nov 12, 2003 at 21:39 UTC ( [id://306645] )


in reply to regular expression (search and destroy)

This can get rather complicated to parse. Data of this type can throw a wrench into your parsing methods, and I haven't found a good module that covers all the subtleties of quoted, delimited data. Just as an example, if your data looks like you describe:

121212, "Simpson, Bart", Springfield

this is a trivial matter to parse, but what if your data looks like:

121212,"2" tape, white", springfield
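To see where the trouble starts, here is a rough sketch of a parser that treats anything between a pair of quotes as one field. The helper name simple_fields and the pattern are only illustrative assumptions. It copes with the first record but gets stuck on the second, where a quote shows up inside the field:

```perl
use strict;
use warnings;

# Hypothetical helper: split a record on commas, treating "..." as one field.
# Returns an empty list when the pattern cannot consume the whole record.
sub simple_fields {
    my ($line) = @_;
    my @fields;
    while ($line =~ / \G \s* (?: "([^"]*)" | ([^,"]*) ) \s* (,|\z) /gx) {
        push @fields, defined $1 ? $1 : $2;
        return @fields if $3 eq '';    # reached the end of the record cleanly
    }
    return;    # stuck part way through: the quotes were not where we expected
}

for my $rec ('121212, "Simpson, Bart", Springfield',
             '121212,"2" tape, white", springfield') {
    my @fields = simple_fields($rec);
    my $out = @fields ? join('|', @fields) : 'could not parse';
    print "$out\n";
}
```

The pattern assumes a quoted field never contains a quote, which is exactly the assumption the second record breaks.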

If you'd never encounter quotes embedded within your fields, then it is less of a problem. If you are dead set against using some of the fine CPAN modules, or even Text::Balanced (a core module) as previously suggested, you could do something like this:

Untested pseudo-code

my $quoting = 0;    # true while inside a quoted field

RECORD: while (my $line = <DATA>) {
    # walk the record 1 byte at a time
    for (my $i = 0; $i < length($line); $i++) {
        my $byte = substr($line, $i, 1);
        if ($byte eq '"') {
            # peek ahead: a quote followed by a comma closes the field
            my $next_byte = substr($line, $i + 1, 1);
            $quoting = $next_byte eq ',' ? 0 : 1;
            print $byte;
        }
        elsif ($byte eq "\n") {
            $quoting = 0;
            print $byte;
            next RECORD;
        }
        else {
            # a comma seen while $quoting is set is data, not a delimiter
            print $byte;
        }
    }
}

The idea is to read a record, then walk through it 1 byte at a time, trying to determine whether a delimiter is inside a set of protecting quotes.
It gets more difficult with more complex data, like the examples above or worse.
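As a rough sketch of the same walk that collects fields instead of printing bytes, something like the following might do. The name split_record is hypothetical, and the whole thing rests on one assumption: a quote only closes a field when the next character is a comma or the record ends.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into fields, one character at a time.
# Assumption: a closing quote is always followed by a comma or end of record.
sub split_record {
    my ($line) = @_;
    chomp $line;
    my @fields;
    my $field   = '';
    my $quoting = 0;                 # true while inside protecting quotes
    my $len     = length $line;

    for (my $i = 0; $i < $len; $i++) {
        my $byte = substr($line, $i, 1);
        my $next = $i + 1 < $len ? substr($line, $i + 1, 1) : '';

        if ($byte eq '"') {
            if (!$quoting && $field eq '') {
                $quoting = 1;        # opening quote at the start of a field
            }
            elsif ($quoting && ($next eq ',' || $next eq '')) {
                $quoting = 0;        # closing quote, not part of the data
            }
            else {
                $field .= $byte;     # embedded quote, keep it as data
            }
        }
        elsif ($byte eq ',' && !$quoting) {
            push @fields, $field;    # unprotected comma ends the field
            $field = '';
        }
        else {
            # skip whitespace before a field starts, keep everything else
            next if !$quoting && $field eq '' && $byte =~ /\s/;
            $field .= $byte;
        }
    }
    push @fields, $field;
    return @fields;
}

print join('|', split_record('121212,"2" tape, white", springfield')), "\n";
```

This gets past the embedded quote in the second sample record, but an embedded quote that happens to sit right before a comma would still fool it.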

One other thing: the above method is not very fast, so if you have tons of data (hundreds of megabytes, gigabytes, or terabytes) you may have to wait a while.

In the end, you're probably best off using a module.
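For the well-behaved case, a minimal sketch with Text::ParseWords might look like this. That module is my own pick for illustration, not one named above; it ships with Perl. The wrapper name parse_record is hypothetical.

```perl
use strict;
use warnings;
use Text::ParseWords qw(parse_line);

# Hypothetical wrapper around parse_line from Text::ParseWords.
sub parse_record {
    my ($line) = @_;
    chomp $line;
    # 1st argument: delimiter pattern, a comma with optional whitespace around it
    # 2nd argument: 0 means strip the protecting quotes from the returned fields
    return parse_line('\s*,\s*', 0, $line);
}

my @fields = parse_record('121212, "Simpson, Bart", Springfield');
print join('|', @fields), "\n";
```

Keep in mind that Text::ParseWords also treats single quotes and backslashes as special, so an apostrophe in the data can trip it up, and it is not built for records with stray embedded quotes like the second example.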
