http://qs321.pair.com?node_id=306618

data67 has asked for the wisdom of the Perl Monks concerning the following question:

Background
I have a CSV file and I am reading it in using split, something like this:

while (my $line = <>) {
    my @array = split /,/, $line;
}

Now the problem is that one of my CSV records contains a "," inside a field, and this confuses split into breaking the line in the wrong place.
The data looks like this:
121212, "Simpson, Bart", springfield

I am trying to get it to look like:
121212, Simpson Bart, springfield

Notice that "Simpson, Bart" is quoted in the original; that is a function of MS Excel, which adds quotes around any exported field that contains a comma.
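To make the failure concrete, here is roughly what the plain split does to that line (a small demo, not part of my real script):

my $line   = '121212, "Simpson, Bart", springfield';
my @fields = split /,/, $line;
print "<$_>\n" for @fields;
# <121212>
# < "Simpson>
# < Bart">
# < springfield>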
I really can't use modules like Text::CSV or DBD::CSV. As a matter of fact, I can't use any modules at all, so I am trying to do this with regular expressions.
All I need to do is take out the extra comma that is appearing inside the field.

Again, I already know that I can do this with modules, but I don't have root on the machine right now and it is not possible for me to install anything new, so I am trying to do it with regular expressions. Here is what I have so far.

open (SYST, "$upload_dir/$filename");
while (defined ($line = <SYST>)) {
    next if ($line =~ /^\s+/);
    if ($line =~ /"(.*)"/) {
        print "found line= $line\n";
        $line =~ s! "(.*)" ! (s/,/sprintf("_", $&)) !ee;
        print "now line= $line\n";
    }
}
close (SYST);

Here I am searching for a line that contains "(.*)" (something with quotes around it).
Then I run a sub-expression on it that takes whatever is in $1 and replaces just one character (the comma).
My regular expression skills are not great, and from what I can tell the code above only needs a slight tweak.
I need to know how I can make the second half work!

Here is an example of what I am trying to do:

# line = 121212, "Simpson, Bart", Springfield
if ($line =~ /"(.*)"/) {
    # $1 now contains:  Simpson, Bart
    # now run the replace on it:  $1 =~ s/,/_/g;
    # $1 now contains:  Simpson_ Bart
    # now put everything back together:
    # 121212, Simpson_ Bart, Springfield
}
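A minimal sketch of that substitution-within-a-substitution idea (my own guess at the shape of it, assuming a quoted field never contains escaped quotes) would be something like:

my $line = '121212, "Simpson, Bart", Springfield';
# /e evaluates the replacement as Perl code, so it can run its own tr/// on the capture
$line =~ s{"([^"]*)"}{ (my $field = $1) =~ tr/,/_/; $field }ge;
print "$line\n";    # 121212, Simpson_ Bart, Springfield

Using [^"]* instead of .* keeps the match from swallowing everything between the first and last quote on a line with several quoted fields.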

Thanks in advance!

Replies are listed 'Best First'.
Re: regular expression (search and destroy)
by sauoq (Abbot) on Nov 12, 2003 at 20:29 UTC
    As a matter of fact, I can't use any modules.

    Why not?

    Not even core modules? Text::Balanced may come in handy.
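    For instance, a rough sketch (not a full CSV parser) of how its extract_delimited function could be used here: it pulls out one double-quoted field and hands back the rest of the string.

use Text::Balanced qw(extract_delimited);

my $line = '121212, "Simpson, Bart", springfield';
# the third argument is a prefix pattern; here it skips ahead to the first quote
my ($quoted, $rest, $skipped) = extract_delimited($line, '"', '[^"]*');
print "skipped: $skipped\n";    # 121212,
print "quoted:  $quoted\n";     # "Simpson, Bart"  (delimiters included)
print "rest:    $rest\n";       # , springfield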

    If you really can't use modules for some unknown reason, you could just take the code from Text::CSV and paste it into your script.

    Trying to handle CSV robustly with regular expressions alone isn't a good idea, especially for a problem that already has a well-tested solution.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: regular expression (search and destroy)
by davido (Cardinal) on Nov 12, 2003 at 20:56 UTC
    Insisting on using a Regular Expression to parse CSV data, and refusing to look at modules that have been well written for such tasks, is a bit like going to the mechanic and insisting on having your car's engine removed using sky hooks and telekinesis instead of engine lifts and elbow grease.

    If the core module (that comes with Perl) for handling balanced text is unavailable to you because your professor requires that you not use it, at the very least look under its hood at its source code so that you can understand how it works, rather than trying to build your own internal combustion engine with coke cans and superglue.

    "Junkyard Wars" is on a different channel, this is Perl. ;)


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      a bit like going to the mechanic and insisting on having your car's engine removed using sky hooks and telekinesis instead of engine lifts and elbow grease.
      Genius!
      I actually agree with you, surprisingly. But maybe you didn't read the question properly: I was not able to install them! My suggestion is to stop looking at this as a CSV problem; all I need to know is how to do a substitution in the middle of a regex.

      example :

      s/ "(.*)" #search for something that is enclosed by quotes. / Now take $1 and run a sub search replace on it. /e;

      What is so bad about doing this? I already know that if there is quoted data in my file, it is bound to have a comma in it. I'll say this, though: if I can't do it this way, then I will have to wait for our sysadmin to come back and install the relevant modules.

        I cannot enumerate all the difficulties associated with trying to parse a string where delimiters between quotes should be treated as text and not delimiters. But here are a few problems...

        • How do you provide for escaping delimiters aside from the quoting?
        • How do you provide for allowing quotes that are acting as text items rather than as quoting mechanisms?
        • How do you allow for nested levels of quote-like characters?
        • Is ' the same as ", the same as `, the same as ....?

        Those reasons, among others, make it much better to parse such data with a "balanced text" module. And what sauoq was telling you is that Text::Balanced is a "core" module, meaning a module that you already have if you have Perl, without waiting for a sysadmin to install it.

        And what others have also suggested is that even if you can't use a CORE module (a module that comes with Perl, just like stdio.h comes with C), at the very least you can use your web browser to view the source of a module on the CPAN website. Then you can have a look at what tools are used to accomplish your objective.

        Otherwise, you're wasting your efforts and building a broken solution.


        Dave


        "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
        Just to be crystal-clear: Do you realize that a standard perl installation comes with some modules already installed? They are called 'core' modules, because they're there by default.

        Pardon me for saying so, but you don't seem to be listening to what people are telling you. Also, as hardburn correctly pointed out, installing modules in user directories is a FAQ.

Re: regular expression (search and destroy)
by hardburn (Abbot) on Nov 12, 2003 at 21:01 UTC

    I don't have root on the machine right now and it is not possible for me to install any new modules.

    Non-root module installation is covered in the CPAN FAQ (point 5).
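    In short (the paths below are only examples, not taken from the FAQ), you build the module with a PREFIX under your home directory and then point your script at it:

# built earlier with something like:
#   perl Makefile.PL PREFIX=~/myperl && make && make test && make install
# the exact subdirectory depends on your Perl version and platform
use lib "$ENV{HOME}/myperl/lib/perl5/site_perl";
use lib "$ENV{HOME}/myperl/lib";
use Text::CSV;    # now found without root access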

    ----
    I wanted to explore how Perl's closures can be manipulated, and ended up creating an object system by accident.
    -- Schemer

    : () { :|:& };:

    Note: All code is untested, unless otherwise stated

Re: regular expression (search and destroy)
by BrowserUk (Patriarch) on Nov 13, 2003 at 02:28 UTC

    As you have had all the obligatory warnings about not using a module, or at least copying from a module to do this, let me be the one to caution you that if you do opt to use a module, look at the candidates very carefully. They are not all equal.

    The first thing to check is that the module's idea of what constitutes CSV data is the same as Excel's. For example, Excel can generate CSV data with quoted fields that contain embedded newlines. And don't blame MS for this extension to the standard (if you can find a standard definition of CSV); many other spreadsheets do the same, going right back to the once-ubiquitous Lotus 1-2-3, I believe. To date, tilly's Text::xSV is the only module I have found that will handle this.

    If you have large volumes of CSV to parse, many of the CSV modules around are less than sparkling in the performance department. The best performer I have found is Text::CSV_XS, but it fails to handle embedded newlines. In any case, being XS, it will not be useful to you if you cannot or will not install modules.

    It is possible to do this yourself with regexes, but it is quite difficult to get it right and cover all the edge cases.


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail

      Text::CSV_XS does indeed handle embedded newlines; you just have to configure the object properly, using the binary option.

      I quote from the man page:

      binary
      
          If this attribute is TRUE, you may use binary characters in quoted fields,
          including line feeds, carriage returns and NUL bytes. (The latter must
          be escaped as "0.) By default this feature is off.
      

      I find Text::CSV_XS to be a nice solution; I have never had problems once I set the object attributes up correctly. Performance-wise it is lightning fast, especially using the print and getline methods.
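      A short example of that setup (the file name here is just a placeholder):

use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1 });   # allow embedded newlines etc.
open my $fh, '<', 'export.csv' or die "export.csv: $!";   # placeholder file name
while (my $row = $csv->getline($fh)) {          # one arrayref per record, even multi-line ones
    print join(' | ', @$row), "\n";
}
close $fh;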


      $|=$_="1g2i1u1l2i4e2n0k",map{print"\7",chop;select$,,$,,$,,$_/7}m{..}g

Re: regular expression (search and destroy)
by Roger (Parson) on Nov 12, 2003 at 21:09 UTC
    You surely *can* split the record in one go -

    Version 1 - '\' escaped quotes:
use strict;
use Data::Dumper;

while (<DATA>) {
    chomp;
    my @rec;
    # was - foreach (split /"(.*,.*)"|,/) ...
    foreach (split /"((?:\\"|.)*?)"|,/) {
        push @rec, $_ if $_;
    }
    print Dumper(\@rec);
}

__DATA__
1,"Hello, world",This is good,2
121212,"Simpson, Bart",Springfield,"Roger"
121212,"2\" tape, \"white",springfield,"Roger"
121212,"Simpson \", Bart",Springfield,"Roger"
    And the output is -
$VAR1 = [ '1', 'Hello, world', 'This is good', '2' ];
$VAR1 = [ '121212', 'Simpson, Bart', 'Springfield', 'Roger' ];
$VAR1 = [ '121212', '2\\" tape, \\"white', 'springfield', 'Roger' ];
$VAR1 = [ '121212', 'Simpson \\", Bart', 'Springfield', 'Roger' ];
    Update: There was a minor flaw in the original solution; I did not allow for escaped quotes inside the quoted field. Here is the enhanced version.

    Version 2 - '"' escaped quotes:
use strict;
use Data::Dumper;

while (<DATA>) {
    chomp;
    my @rec;
    foreach (split /"(.*?)(?:(?<!")"(?!")|(?<="")"(?!"))|,/) {
        s/""/"/g, push @rec, $_ if $_;
    }
    print Dumper(\@rec);
}

__DATA__
1,"Hello, world",This is good,2
121212,"Simpson, Bart",Springfield,"Roger"
121212,"2"" tape, ""white",springfield,"Roger"
121212,"Simpson "", Bart",Springfield,"Roger"
121212,"2""",springfield,"Roger"
    And output is -
$VAR1 = [ '1', 'Hello, world', 'This is good', '2' ];
$VAR1 = [ '121212', 'Simpson, Bart', 'Springfield', 'Roger' ];
$VAR1 = [ '121212', '2" tape, "white', 'springfield', 'Roger' ];
$VAR1 = [ '121212', 'Simpson ", Bart', 'Springfield', 'Roger' ];
$VAR1 = [ '121212', '2"', 'springfield', 'Roger' ];
    Update: Thanks to antirice for pointing out that quotes inside a quoted field are escaped with a quote ("), not a backslash (\).

      If I recall correctly, you can have empty fields and the way you actually escape a double quote is as "" instead of \". In other words, these are allowed:

      1,"Hello world",,2 2,"""Hello world""",,2

      I also believe that if a field contains a double quote, then the entire field must be enclosed in double quotes so that the escaping rules can apply. There are more rules, but they mostly concern allowed values and input record separators, and they matter more for writing files and for verifying that a file really is CSV. I think you can read more about them at the end of the documentation here.

      Of course, this is only if you want valid CSV or CSV produced by programs such as Excel and whatnot.
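      A small sketch of those two rules (my own illustration, not lifted from any documentation): quote a field when it contains a comma, a quote or a newline, and double any embedded quotes.

sub csv_field {
    my $f = shift;
    return $f unless $f =~ /[",\n]/;    # plain fields pass through untouched
    (my $q = $f) =~ s/"/""/g;           # double the embedded quotes
    return qq{"$q"};
}

print csv_field('Hello world'),   "\n";   # Hello world
print csv_field('"Hello world"'), "\n";   # """Hello world"""
print csv_field('Simpson, Bart'), "\n";   # "Simpson, Bart"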

      Hope this helps.

      antirice    
      The first rule of Perl club is - use Perl
      The ith rule of Perl club is - follow rule i - 1 for i > 1

Re: regular expression (search and destroy)
by sweetblood (Prior) on Nov 12, 2003 at 21:39 UTC
    This can get rather complicated to parse. The problems I've seen with this type of data can throw a wrench into your parsing methods, and I haven't found a good module that covers all the subtleties of quoted, delimited data. Just as an example, if your data looks like you describe:

    121212, "Simpson, Bart", Springfield

    this is a trivial matter to parse, but what if your data looks like:

    121212,"2" tape, white", springfield

    If you will never encounter quotes embedded within your fields, then it is less of a problem. If you are dead set against using one of the fine CPAN modules, or even the previously suggested Text::Balanced (a core module), you could do something like this:

    Untested pseudo-code

my $quoting = 0;
RECORD:
while (<DATA>) {
    # read the record one byte at a time
    for (my $i = 0; $i < length($_); $i++) {
        my $byte = substr($_, $i, 1);
        if ($byte eq '"') {
            # peek at the next byte: a quote not followed by a comma opens a quoted field
            $i++;
            my $next_byte = substr($_, $i, 1);
            $quoting = ($next_byte ne ',') ? 1 : 0;
            print $byte, $next_byte;
        }
        elsif ($byte =~ /\n/) {
            $quoting = 0;
            next RECORD;
        }
        else {
            print $byte;
        }
    }
    $quoting = 0;
}

    The idea is to read a record, then walk through it one byte at a time, trying to determine whether a delimiter is inside a set of protecting quotes.
    It gets more difficult if you have more complex data like the above examples and worse.
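    A tidier sketch of the same quote-state idea (my rewrite, equally untested in spirit): only treat a comma as a field separator when we are outside quotes.

sub split_quoted {
    my $line = shift;
    my @fields;
    my $field     = '';
    my $in_quotes = 0;
    for my $ch (split //, $line) {
        if    ($ch eq '"')                { $in_quotes = !$in_quotes }        # toggle quoting, drop the quote
        elsif ($ch eq ',' && !$in_quotes) { push @fields, $field; $field = '' }
        else                              { $field .= $ch }
    }
    push @fields, $field;
    return @fields;
}

# split_quoted('121212, "Simpson, Bart", springfield')
#   -> ('121212', ' Simpson, Bart', ' springfield')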

    One other thing: the above method is not very fast, so if you have tons of data (hundreds of megs, gigs, or terabytes) you may have to wait a while.

    In the end, you're probably best off using a module.

Re: regular expression (search and destroy)
by injunjoel (Priest) on Nov 12, 2003 at 22:46 UTC
    Greetings all,
    Let me preface this post with the fact that I am only answering the question posted, namely the 'All I need to do is take out the extra comma that is appearing inside the field' part. The following code is not as universal or portable as most code should be, but it will work given the example line of input from the original post.
    That being said... Here is what I would suggest.
#!/usr/bin/perl -w
use strict;    # Always

### the sample line of input
my $line = qq*121212, "Simpson, Bart", springfield*;

### capture all quoted strings and place them in @elms
my @elms = $line =~ /("[^"]*")/g;

### go through the captures you found in the string $line
for (@elms) {
    ### make two copies for later use
    my $original_elm = $_;
    my $new_elm      = $original_elm;
    ### clean-up time: strip the commas and quotes
    $new_elm =~ s/[,"]//g;
    ### replace the old element with the new one
    $line =~ s/\Q$original_elm\E/$new_elm/;
}

my @elements = split(/,/, $line);
print "@elements" . "\n";
exit;

___OUTPUT___
121212  Simpson Bart  springfield
    Of course this logic will need to be grouped in such a way as to deal with each line of input.
    Not all that elegant but you get the general idea.
    Hope that helps.
    -injunjoel
Re: regular expression (search and destroy)
by Lhamo Latso (Scribe) on Nov 12, 2003 at 21:06 UTC

    Here are my 2 cents. This code would do what you are asking, but wouldn't handle other problems like a " in the quoted string.

#!/usr/bin/perl -w

my $line = '121212, "Simpson, Bart", Springfield<br>';

print "Before: $line\n";
$line =~ s/"(.*?),(.*?)"/$1_$2/;
print "After: $line\n";

    The output is:

Before: 121212, "Simpson, Bart", Springfield<br>
After: 121212, Simpson_ Bart, Springfield<br>

    To explain, adding a ? after *, as in .*?, causes minimal matching. The regex matches the first ", then as few characters as possible, then a comma, then more of the same until another " is found, and the whole match gets replaced by the s///.
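    A quick illustration of the difference between greedy and minimal matching (my own example, not from the post above):

my $line = '1,"Simpson, Bart",2,"Flanders, Ned",3';

(my $greedy = $line) =~ s/"(.*)"/<$1>/;    # greedy: first quote to the LAST quote on the line
(my $lazy   = $line) =~ s/"(.*?)"/<$1>/;   # minimal: first quote to the very next quote

print "$greedy\n";   # 1,<Simpson, Bart",2,"Flanders, Ned>,3
print "$lazy\n";     # 1,<Simpson, Bart>,2,"Flanders, Ned",3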

      This code would do what you are asking

      It wouldn't really do what he is asking; it would just work on the example he gave. Consider other possible input... A CSV format generally defines an escape character, often either a double quote or a backslash. There may be more than one quoted field. Etc. So, what would your code do to a line like:

      42, "Simpson, Homer ""Mr. Donuts"", "Springfield"
      And how would you fix it?

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: regular expression (search and destroy)
by inman (Curate) on Nov 13, 2003 at 12:41 UTC
    Quoting from the excellent "Perl Cookbook" -

    Comma-separated input is a deceptive and complex format. It sounds simple but involves a fairly complex escaping system, because the fields themselves can contain commas. This makes a pattern-matching solution complex and rules out a simple split /,/.

    This sums up the situation pretty well. There is no single definition of the CSV format, and something that works with one data set may not work with another. The book goes on to provide the following example code, which I have modified to trim whitespace. This should work with Excel-generated CSV files.

#! /usr/bin/perl
#
use strict;

while (<DATA>) {
    chomp;    # avoid matching against newline
    my $words = join "=>", parse_csv($_);
    print "$words\n";
}

sub parse_csv {
    my $text = shift;    # record containing comma-separated values
    my @new  = ();
    push @new, $+ while $text =~ m{
        \s*"([^\"\\]*(?:\\.[^\"\\]*)*)"\s*,?   # matches a phrase that may contain commas
      | \s*([^,]+)\s*,?                        # something that is not a comma
      | \s*,                                   # just a comma - no data
    }gx;
    push(@new, undef) if substr($text, -1, 1) eq ',';
    return @new;
}

__DATA__
1,the,simple,case
"with","quoted" , "strings that contain spaces"
"with","quoted" , "comma, internally"
"with","quoted" , "comma, internally, with null data",,,

    inman

      This should work with Excel generated CSV files.

      Nope. That code doesn't handle Excel CSV files at all. Excel CSV files are composed of records terminated by an unquoted newline, and those records are composed of fields that are either quoted with double quotes, with embedded quotes doubled as the only form of escaping, or unquoted runs of non-separator, non-quote data.

      This means that Excel CSV files cannot be correctly parsed by a simple "read a line and separate out the fields" approach. You need to know the state of the previous line to determine whether the current line is in fact part of a multi-line record. The following code will, as far as I can tell, handle such CSV files correctly. Personally, however, I would use a module, or if I really couldn't install one (as anybody experienced knows, this is almost never really true for pure Perl code), I would steal tilly's code from Text::xSV and cut and paste it. But anyway, this was a fun exercise while I waited for something else to complete.

use strict;
use warnings;
use Data::Dumper;

my $text = "";
while (<DATA>) {
    $text .= $_;
    my @new;
    pos($text) = 0;
    my $last = 0;
    while (
        $text =~ m{
            \G
            (?: "((?:[^"]+|"")*)"   # quoted section
              | ([^",]+)            # unquoted
              | ((?=\s*,))
            )
            (?:\s*,\s*|\s*\z)
        }gmx
      )
    {
        push(@new, $+);
        $last = $+[0];
        $new[-1] =~ s/""/"/g;
    }
    if ($last == length($text)) {
        push(@new, "") if $text =~ /,$/;
        print "Rec:"
          . Data::Dumper->new([ \@new ])->Terse(1)
                        ->Useqq(1)->Indent(0)->Dump() . "\n";
        $text = "";
    }
}
print "Error: $text\n" if $text;

__DATA__
1,the,simple,"case
"
"with","quoted" , "strings that contain spaces"
"with","quoted" , "comma,
internally"
"with","quoted" , "comma, internally, with ""null"" data",,,
""""""""

      Incidentally it outputs:

Rec:[1,"the","simple","case\n"]
Rec:["with","quoted","strings that contain spaces"]
Rec:["with","quoted","comma,\ninternally"]
Rec:["with","quoted","comma, internally, with \"null\" data","","",""]
Rec:["\"\"\""]

      ---
      demerphq

        First they ignore you, then they laugh at you, then they fight you, then you win.
        -- Gandhi


Re: regular expression (search and destroy)
by Anonymous Monk on Nov 13, 2003 at 17:10 UTC
    The answer to this is in the Perl Cookbook, recipe 1.15, page 31, but it is taken from Mastering Regular Expressions.
sub parse_csv {
    my $text = shift;    # record containing comma-separated values
    my @new  = ();
    push(@new, $+) while $text =~ m{
        # the first alternative groups the phrase inside the quotes
        # (see the explanation of this pattern in MRE)
        "([^\"\\]*(?:\\.[^\"\\]*)*)",?
      | ([^,]+),?
      | ,
    }gx;
    push(@new, undef) if substr($text, -1, 1) eq ',';
    return @new;    # list of values that were comma separated
}

    Or, with a standard module:

use Text::ParseWords;

sub parse_csv {
    return quotewords(",", 0, $_[0]);
}

    Hope my typing is okay