http://qs321.pair.com?node_id=11116857

anita2R has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I am trying to construct a regex to use in a split command.

The string to be split consists of one to four comma-separated fields. The second field can contain text which may use backslashes to escape non-separating commas '\,', the backslash itself '\\', and hex codes '\0x2B' (originally described as 2 character hex codes '\x2B'). Here is an example:

$text = '1,Something\,\\text\\text\0x2B,X,99';

I can split this with a regex using a negative lookbehind:

my $regex = '(?<!\\\),';
split( /$regex/, "$text" );

so it splits on a comma unless it is preceded by a backslash.

Actual outcome:

1
Something\,\text\text\0x2B
X
99

(In a terminal the double backslash appears as a single \, but this is not an issue as the split data is processed further in the perl script).

Now my problem:

If an escaped backslash comes before the next separating comma the regex 'sees' a backslash before the separating comma and does not split.

$text = '1,This is a problem->\\,B,2';

Actual output:

1
This is a problem->\,B
2

I have tried several regexes based on the idea that a comma should match either when it is not preceded by a backslash (negative lookbehind) or when it is preceded by two backslashes (positive lookbehind). Here are two that I have tried:

$regex = '(?<!\\\),|(?<=\\\\),';

$regex = '(?<!\\\),|(?<=[\\\]{2}),';

Neither gives the required output, so maybe I am lost in ever more escaped escaped escaped backslashes! I have tried variations with more backslashes in the positive lookbehind section of the regex. I also tried \Q...\E to avoid escaping the backslashes, but this results in an error: 'Unrecognized escape \Q passed through in regex'. I tried building the regex with qr like this:

my $regex = qr /(?<!\\),|(?<=\\\\),/;

but it still didn't split on '\\,'.

As a test of my concept I replaced all backslashes with colons

my $regex = '(?<!:),|(?<=[:]{2}),';
my $text = '1,This:, is not a problem->::,B,2';
my @test = split( /$regex/, "$text" );
foreach( @test ) {
    print "$_\n";
}

The output was 'correct'

1
This:, is not a problem->::
B
2

In summary: I need to split a string at each comma or comma preceded by two backslashes but don't split at a comma preceded by only one backslash.

Any suggested approaches to this problem would be appreciated.

Re: Regex with Backslashes
by tybalt89 (Monsignor) on May 17, 2020 at 17:02 UTC

    Your problem is confusing single quotes and double quotes.

    '\,' is exactly equal to '\\,' and so is "\\,"

    $text = '1,This is a problem->\\,B,2';

    has only one backslash before the comma and therefore should *not* split there.
    my $regex = qr/(?<!\\),|(?<=\\\\),/;
    is correct and should not split at the '\\,' (which is exactly equal to '\,')

    I'd recommend you play around with single quotes and backslashes to understand how they work.

      Thanks

      That helps my understanding of the problem and suggests to me that I am not going to be able to parse my data using a single regex

      I could change the data, perhaps using ',,' for a non-splitting comma instead of '\,' which I can't differentiate from '\\,'

        No.

        Look again at your output

        1 This is a problem->\,B 2
        this output is correct, it is *not* a problem...

        For the last part of your comment
        instead of '\,' which I can't differentiate from '\\,'
        If you would restate that as double-quoted strings it would look like this
        instead of "\\," which I can't differentiate from "\\,"
        which is true, but not what you meant, you meant to say (in double-quoted strings)
        instead of "\\," which I can't differentiate from "\\\\,"
        which you can and have done.

        In a single-quoted string, the backslash ONLY does quoting if it occurs before a ' or a \
        otherwise it stands for itself. That's why '\,' and '\\,' actually represent the same string.

        Try looking at the printed string instead of the perl form of the string to see what you actually have.
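A minimal sketch of that check (the variable names are made up for illustration): write the same text with different quoting and compare the strings Perl actually ends up with.

```perl
use strict;
use warnings;

# Three ways of writing a backslash followed by a comma.
my $sq_one = '\,';     # a backslash before a comma stands for itself
my $sq_two = '\\,';    # \\ collapses to one backslash in single quotes
my $dq     = "\\,";    # double quotes always need the backslash doubled

print "all three are the same string\n"
    if $sq_one eq $sq_two and $sq_two eq $dq;
print "length: ", length($sq_one), "\n";    # 2: one backslash, one comma

# Two real backslashes before the comma need four in the source.
my $sq_four = '\\\\,';
print "printed: $sq_four (length ", length($sq_four), ")\n";    # 3 characters
```

Printing the string and its length, rather than reading the literal, shows how many backslashes are really there.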

Re: Regex with Backslashes
by haukex (Archbishop) on May 17, 2020 at 15:35 UTC

    I'm having trouble understanding your inputs because I'm not sure how many backslashes the strings actually contain, for example in '1,This is a problem->\\,B,2' this string actually contains only one backslash, where you probably meant two. And in my $regex = '(?<!\\\),';, the string actually only contains two backslashes because '\\' becomes \ but '\)' remains as \) (see Quote Like Operators).

    My suggestion is to use double quotes for strings, since those will force you to escape all backslashes that you want to appear in the string, and so it'll be less confusing. For regexes, definitely use qr// instead of quotes (that's the reason for your "Unrecognized escape \Q passed through in regex" problem). For looking at the strings you've got and showing them to us, use either Data::Dumper with $Data::Dumper::Useqq=1;, or Data::Dump.
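As a sketch of that suggestion, here is how Data::Dumper (core) with Useqq could be used to inspect the two single-quoted literals from the question. The Terse and Indent settings are only there to keep each dump on one line, and the counting lines are an extra cross-check, not something the question used.

```perl
use strict;
use warnings;
use Data::Dumper;

$Data::Dumper::Useqq  = 1;    # print strings in double-quoted form
$Data::Dumper::Terse  = 1;    # no "$VAR1 = " prefix
$Data::Dumper::Indent = 0;    # keep each dump on one line

# The two single-quoted literals from the question.
my $text      = '1,This is a problem->\\,B,2';
my $regex_str = '(?<!\\\),';

# Dumper writes every backslash doubled, so halve what you see to count them.
print Dumper($text),      "\n";
print Dumper($regex_str), "\n";

# Counting directly avoids any doubt about the quoting.
my $text_bs  = () = $text      =~ /\\/g;
my $regex_bs = () = $regex_str =~ /\\/g;
print "backslashes in text: $text_bs, in regex string: $regex_bs\n";
```

The counts come out as one backslash in the text and two in the regex string, which matches the reading of the literals given above.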

    Your question is also inconsistent in that you say "2 character hex codes '\x2B'" but then show '\0x2B' in the string.

    Anyway, one approach to this task is Text::CSV, like what jo37 showed. However, if I understand your requirement "2 character hex codes" correctly, does this mean that your input string could be "1,Something\\,\\\\text\\\\text\\x2B\\\\,X,99" and you want the output to be ("1","Something,\\text\\text+\\","X","99")? (Next time please show your expected output as well!)

    If we're lucky, Tux might enlighten us as to the correct options for Text::CSV_XS to handle this case. In the meantime, one possible solution is to write a somewhat-decent parser to your specification.

    use warnings;
    use strict;

    sub parse {
        local $_ = shift;
        my @out = ('');
        pos=undef;
        while (1) {
            if ( m{\G , }xgc )
                { push @out, '' }
            elsif ( m{\G \\x([0-9a-fA-F]{2}) }xgc )
                { $out[-1] .= chr hex $1 }
            elsif ( m{\G (?| \\([,\\]) | ([^,\\]+) ) }xgc )
                { $out[-1] .= $1 }
            else { last }
        }
        pos==length or die "parse of '$_' failed at pos ".pos;
        return @out;
    }

    use Test::More tests=>1;
    my @o = parse "1,Something\\,\\\\text\\\\text\\x2B\\\\,X,99";
    is_deeply \@o, ["1","Something,\\text\\text+\\","X","99"];

      Thanks for your reply. The data that is received consists of a single quoted string of characters. This '1,//Text//Text,C,150' is 22 characters long, as shown when I use print length($cmdValues) . "\n";, so double backslash is really two characters in the data I am processing. Later in the script if a backslash is followed by another backslash it is replaced with a single character that is displayed on an lcd screen.

      The hex code was meant to mean two hex characters, rather than literally two characters, in the form 0xFF, sorry for any confusion.

      I will use qr for all further work, as suggested.

      As to

      "1,Something\\,\\\\text\\\\text\\0x2B\\\\,X,99"

      , no, the input data is

      '1,Something\,\\text\\text\\0x2B\\,X,99'

      The expected outcome is an array containing the following:

      1
      Something\,\\text\\text\\0x2B\\
      X
      99

      The string should be split on every comma and every comma preceded by two backslash characters, but not on a comma preceded by a single backslash.

      I have looked at the quote-like operators and had already been through an interesting discussion starting at Ways of quoting

      ..."in my $regex = '(?<!\\\),';, the string actually only contains two backslashes because '\\' becomes \ but '\)' remains as \)" ... Using Data::Dumper, I can now see (I think) why my original regex worked:

      The string I was splitting looks like this

      my $text = '1,This\, is a problem->\\,B,99';

      but when printed with Dumper it looks like this

      $VAR1 = "1,This\\, is a problem->\\,B,99";

      So both '\,' and '\\,' appear the same during processing. Is there a way I can stop '\,' being processed as '\\,'?

      I may have to go down the route of a custom parser as you have suggested

        The string I was splitting looks like this

        my $text = '1,This\, is a problem->\\,B,99';

        but when printed with Dumper it looks like this

        $VAR1 = "1,This\\, is a problem->\\,B,99";

        So both '\,' and '\\,' appear the same during processing. Is there a way I can stop '\,' being processed as '\\,'.

        Both Data::Dumper, which is core, and Data::Dump, which I prefer, but it's not core, represent a string in the form of the double-quote constructor needed to reproduce that string, not as the "actual" string. I think this is one source of your confusion.

        I think the critical point you're missing is that there is a fundamental difference between a single- or double-quoted string constructor, e.g., '...' or "...", and the string that is constructed.

        So both '\,' and '\\,' appear the same during processing.

        No. A string may have one or two or any number of uniquely distinguishable sequential backslashes. The question is how to construct the desired string. Consider

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $sq = '\ \\ \\\ \\\\ \\\\\ \\\\\\';
         print qq{<$sq> \n};
         ;;
         my $dq = qq{\\ \\\\ \\\\\\};
         print qq{>$dq< \n};
        "
        <\ \ \\ \\ \\\ \\\>
        >\ \\ \\\<

        In a single-quoted string constructor,  \ and  \\ are different representations of the same constructed character. This peculiarity of single-quoted string constructors allows a string so constructed (update: to have a single-quote character in a '...'-quoted string, or) to end in a single-quote or backslash character:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $sqsq = 'abc\'';
         print qq{<$sqsq> \n};
         ;;
         my $sqbs = 'abc\\';
         print qq{>$sqbs< \n};
        "
        <abc'>
        >abc\<

        (Note that in my code examples, I use  qq{...} as the double-quote "constructor," as I will call it in this reply, due to peculiarities of the Windoze command line interpreter.)

        It's possible to (fairly easily) split the double-quoted string you give as an example and get your desired result:

        c:\@Work\Perl\monks>perl -wMstrict -le
        "my $s = qq{1,Something\\,\\\\text\\\\text\\0x2B\\\\,X,99};
         print qq{<$s> \n};
         ;;
         my @ra = split qr{ (?<! (?<! \\) \\) , }xms, $s;
         print qq{[$_]} for @ra;
        "
        <1,Something\,\\text\\text\0x2B\\,X,99>
        [1]
        [Something\,\\text\\text\0x2B\\]
        [X]
        [99]

        The string is split on the pattern "comma that is not preceded by a backslash that is not preceded by a backslash." This sort of tricksy, double-negative logic is part of the reason that a well-tested module like Text::CSV is so often and enthusiastically recommended for this seemingly-simple parsing application. (I hope this module or one like it is what you're referring to when you write about going "the route of a custom parser.")
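The lookbehind above looks back at most two characters, so three backslashes before a comma would be treated like two. If longer runs can occur, one possible alternative is to walk the string token by token and treat a backslash plus the following character as a unit. This is only a sketch: split_escaped is a made-up name, and it assumes that a backslash always escapes exactly the next character and that fields should be returned in raw form, escapes included.

```perl
use strict;
use warnings;

# Hypothetical helper: split on commas, treating backslash + any character
# as one unit, so only unescaped commas end a field. Escapes are kept raw.
sub split_escaped {
    my ($s) = @_;
    my @fields = ('');
    while ( $s =~ m{ \G (?: (\\.) | (,) | ([^\\,]+) ) }xgcs ) {
        if ( defined $2 ) { push @fields, '' }                     # separator
        else              { $fields[-1] .= defined $1 ? $1 : $3 }  # escape pair or plain text
    }
    # With /c, pos() stays where matching stopped, e.g. at a trailing lone backslash.
    ( pos($s) // 0 ) == length($s)
        or die "cannot parse '$s' at position " . ( pos($s) // 0 ) . "\n";
    return @fields;
}

# Double-quoted literal, so every backslash in the data is written twice.
my @fields = split_escaped("1,Something\\,\\\\text\\\\text\\0x2B\\\\,X,99");
print "[$_]\n" for @fields;
```

For this input it yields the same four fields as the split above. The difference only shows up with an odd run of three or more backslashes before a comma, which this version keeps inside the field.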


        Give a man a fish:  <%-{-{-{-<

Re: Regex with Backslashes (updated)
by haukex (Archbishop) on May 17, 2020 at 21:54 UTC

    To avoid further confusion, I suggest we take a step back and agree on how to communicate the strings appropriately. I think what is causing confusion here is that you are using single quotes to show strings*, and we, being Perl programmers, are assuming that Perl's rules for single-quoted string literals apply, but based on what you've written I don't think that's the definition you're using. So:

    1. When you write 'foo \x \\ \' \ bar', due to Perl's rules for single-quoted strings (Quote Like Operators: "A backslash represents a backslash unless followed by the delimiter or another backslash, in which case the delimiter or backslash is interpolated."), this string is actually the 16-character string «foo \x \ ' \ bar», as you can see when you execute the Perl code print 'foo \x \\ \' \ bar', "\n";.
      • Note: I'm using these special quoting characters here to make it clear that I don't mean Perl's quotes. In PerlMonks' HTML, what I've written is &laquo;<c>my string here</c>&raquo;. This is not an established standard, just something I'm doing in this node to differentiate between "", '', and "the characters the string literals actually represent".
    2. When you write "foo \x22 \\ \' \" bar", due to Perl's rules (same link as above), this string is actually the 15-character string «foo " \ ' " bar» (try print "foo \x22 \\ \' \" bar", "\n";). This is the format that tools like Data::Dump and Data::Dumper (with $Data::Dumper::Useqq=1; turned on, which I always recommend) will output. Because of this, I suggested you use this format to show us what strings you're working with.
    3. When you want to show us a string without any quoting/escaping/interpolation, then don't use '''s or ""'s. Just show us the string in PerlMonks' <code> tags, as in: My input is the 14-character string <code>my string here</code>., optionally add some special quotes like I showed above, and tell us the actual length of the string so we can verify.
      • * Update: Another option is heredocs, as tybalt89 showed here; just make sure to put the heredoc marker into single quotes, as in my $str = <<'END'; ... END, to disable interpolation inside the heredoc. This might be useful because from your reply here, I seem to understand the single quotes are actually part of the string, which would also help explain the confusion we've been having. (Note the other quoting methods still work too, as in '\'...\'' and "'...'".)
    4. When you want to show us a regex, show us the Perl code and use a qr// operator, don't use quotes (and don't use qr'' either). Again, this is the least ambiguous format. (See also Regexp Quote Like Operators.)
    5. If you wanted to be really, really thorough, or there is some real confusion as to what your inputs are, then you could also show us the output of Devel::Peek's Dump(), or, for files, show us a hex dump of the file: On Linux, either hexdump -C filename or od -tx1c filename (see also).
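As a sketch of points 3 and 5, a heredoc with a single-quoted marker takes the text exactly as typed, and a per-character hex listing shows what is really in the string. The sample data below is made up for illustration.

```perl
use strict;
use warnings;

# Single-quoted heredoc marker: no interpolation, no backslash processing.
my $str = <<'END';
1,Something\,\\text,X
END
chomp $str;    # drop the newline that the heredoc adds

# Two hex digits per character; 5c is a backslash, 2c is a comma.
my $hex = join ' ', map { sprintf '%02x', ord($_) } split //, $str;
printf "%d characters: %s\n", length($str), $hex;
```

Posting the length together with the hex listing lets readers verify the backslash count without guessing at the quoting rules.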

    I think once we've got that cleared up and we understand what your actual strings are, we'll be able to help much more effectively :-)

      Thanks for taking the time to point out the issues with my presentation of strings, which has caused confusion.

      If I post again I will take your advice on the presentation and the use of quoting.

      Having considered the problem I originally posted, I have decided that I should take a slightly different approach which I touched on in a response to another monk, and my data will use two commas where a non-splitting comma is required and two backslashes where a backslash is required. This changes the regex requirements substantially.

      My data would look like this: 1,Text,,with,,commas,X,99 and my regex is: my $regex = qr /(?<!,),(?!,)|(?<=,,),/;

      This is working in my script with this output:

      1
      Text,,with,,commas
      X
      99

      Thank you to all who responded.

      Maybe I will take the plunge and post my 'lcd daemon with battery meter script' once it is completed. Not exactly an Earth-shattering piece of work, but quite fun.

        If you have control over the format the string is generated in, then why not use a well-established format like CSV? The defaults of Text::CSV are that fields are separated by commas, if a field contains commas (or whitespace), it is surrounded by double quotes, and if a double quote needs to be escaped, then it is doubled up. For example:

        use warnings;
        use strict;
        use Text::CSV;

        my $data = <<'END';
        1,"Text,with,commas and ""quotes""",X,99
        END
        open my $fh, '<', \$data or die $!;
        my $csv = Text::CSV->new({ binary=>1, auto_diag=>2 });
        while ( my $row = $csv->getline($fh) ) {
            print "<<$_>>\n" for @$row;
        }
        $csv->eof or $csv->error_diag;
        close $fh;

        __END__

        <<1>>
        <<Text,with,commas and "quotes">>
        <<X>>
        <<99>>

        Maybe I will take the plunge and post my 'lcd daemon with battery meter script' once it is completed.

        Yes, that'd be interesting!

        How does that work if you have a null (absolutely empty) comma-separated field? Can you have such fields in your application? Why not just split the original non-escaped commas a la this or some similar approach if you do not want to use a module?
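A quick sketch of why the empty-field question matters, using the regex from the post above; the first sample string is made up. With doubled commas as the escape, an empty field between two separators looks exactly like an escaped comma.

```perl
use strict;
use warnings;

# Regex from the post above (doubled comma = literal comma).
my $regex = qr/(?<!,),(?!,)|(?<=,,),/;

# Intended as three fields with an empty middle one.
my @empty_middle = split $regex, '1,,X';
print scalar(@empty_middle), " field(s): [", join( '][', @empty_middle ), "]\n";

# The sample from the post, which has no empty fields, still splits as intended.
my @sample = split $regex, '1,Text,,with,,commas,X,99';
print scalar(@sample), " field(s): [", join( '][', @sample ), "]\n";
```

The first split returns the whole string as a single field, so this format cannot represent an empty field.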



Re: Regex with Backslashes
by jo37 (Deacon) on May 17, 2020 at 15:01 UTC

    Text::CSV may be helpful for your problem.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Text::CSV;

    my $data = <<'EOF';
    1,Something\,\text\text\0x2B,X,99
    1,This is a problem->\\,B,2
    EOF
    open my $fh, '<', \$data;
    my $csv = Text::CSV->new({sep_char => ','});
    while (my $row = $csv->getline($fh)) {
        print "<$_> " foreach @$row;
        print "\n";
    }

    __DATA__
    <1> <Something\> <\text\text\0x2B> <X> <99>
    <1> <This is a problem->\\> <B> <2>

    Greetings,
    -jo

    $gryYup$d0ylprbpriprrYpkJl2xyl~rzg??P~5lp2hyl0p$

      Good point, although the way I understand the question, you'd need to supply the escape_char=>"\\" option.

        Looks like I confused input, actual output and desired output.

        Greetings,
        -jo
