Re: Regex with Backslashes

by haukex (Archbishop)
on May 17, 2020 at 15:35 UTC ( [id://11116862]=note )


in reply to Regex with Backslashes

I'm having trouble understanding your inputs because I'm not sure how many backslashes the strings actually contain. For example, the string '1,This is a problem->\\,B,2' actually contains only one backslash, where you probably meant two. And in my $regex = '(?<!\\\),';, the string contains only two backslashes, because '\\' becomes \ but '\)' remains as \) (see Quote Like Operators).
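To make the counts above concrete, here is a minimal sketch that counts the backslashes each single-quoted literal really produces. The variable names $input_bs and $regex_bs are mine, just for illustration.

```perl
use strict;
use warnings;

# In a single-quoted literal only \\ and \' are escapes;
# every other backslash is kept as-is.
my $input = '1,This is a problem->\\,B,2';
my $regex = '(?<!\\\),';

# Count the backslashes each string actually contains.
my $input_bs = () = $input =~ /\\/g;
my $regex_bs = () = $regex =~ /\\/g;

print "input: $input (backslashes: $input_bs)\n";
print "regex: $regex (backslashes: $regex_bs)\n";
```

Printing the string next to its count is usually enough to spot a literal that holds fewer backslashes than its source code suggests.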

My suggestion is to use double quotes for strings, since those will force you to escape all backslashes that you want to appear in the string, and so it'll be less confusing. For regexes, definitely use qr// instead of quotes (that's the reason for your "Unrecognized escape \Q passed through in regex" problem). For looking at the strings you've got and showing them to us, use either Data::Dumper with $Data::Dumper::Useqq=1;, or Data::Dump.
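As a rough sketch of that advice, assuming a made-up input string and the simplest possible rule ("a comma not preceded by a backslash"), a double-quoted string plus a qr// pattern could look like this:

```perl
use strict;
use warnings;

# Double quotes: every backslash meant for the string must be written as \\.
my $input = "a\\,b,c";        # the string is: a\,b,c

# qr// compiles the pattern once, with regex quoting rules
# instead of string quoting rules.
my $re = qr/(?<!\\),/;        # a comma not preceded by a backslash

my @fields = split $re, $input;
print "[$_]\n" for @fields;
```

This only illustrates the quoting. It does not handle a doubled backslash in front of a comma.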

Your question is also inconsistent in that you say "2 character hex codes '\x2B'" but then show '\0x2B' in the string.

Anyway, one approach to this task is Text::CSV, like what jo37 showed. However, if I understand your requirement "2 character hex codes" correctly, does this mean that your input string could be "1,Something\\,\\\\text\\\\text\\x2B\\\\,X,99" and you want the output to be ("1","Something,\\text\\text+\\","X","99")? (Next time please show your expected output as well!)

If we're lucky, Tux might enlighten us as to the correct options for Text::CSV_XS to handle this case. In the meantime, one possible solution is to write a somewhat-decent parser to your specification:

    use warnings;
    use strict;

    sub parse {
        local $_ = shift;
        my @out = ('');
        pos=undef;
        while (1) {
            # unescaped comma: start a new field
            if ( m{\G , }xgc )
                { push @out, '' }
            # \xHH: append the character with that hex code
            elsif ( m{\G \\x([0-9a-fA-F]{2}) }xgc )
                { $out[-1] .= chr hex $1 }
            # escaped comma or backslash, or a run of ordinary characters
            elsif ( m{\G (?| \\([,\\]) | ([^,\\]+) ) }xgc )
                { $out[-1] .= $1 }
            else { last }
        }
        # anything left unconsumed means the input broke the rules above
        pos==length or die "parse of '$_' failed at pos ".pos;
        return @out;
    }

    use Test::More tests=>1;
    my @o = parse "1,Something\\,\\\\text\\\\text\\x2B\\\\,X,99";
    is_deeply \@o, ["1","Something,\\text\\text+\\","X","99"];

Re^2: Regex with Backslashes
by anita2R (Scribe) on May 17, 2020 at 18:44 UTC

    Thanks for your reply. The data that is received consists of a single-quoted string of characters. This '1,//Text//Text,C,150' is 22 characters long, as shown when I use print length($cmdValues) . "\n";, so a double backslash really is two characters in the data I am processing. Later in the script, if a backslash is followed by another backslash, it is replaced with a single character that is displayed on an LCD screen.

    The hex code was meant to mean two hex characters, rather than literally two characters, in the form 0xFF, sorry for any confusion.

    I will use qr for all further work, as suggested.

    As to

    "1,Something\\,\\\\text\\\\text\\0x2B\\\\,X,99"

    , no, the input data is

    '1,Something\,\\text\\text\\0x2B\\,X,99'

    The expected outcome is an array containing the following:

    1 Something\,\\text\\text\\0x2B\\ X 99

    The string should be split on every comma and every comma preceded by two backslash characters, but not on a comma preceded by a single backslash.

    I have looked at the quote-like operators and had already been through an interesting discussion starting at Ways of quoting

    ..."in my $regex = '(?<!\\\),';, the string actually only contains two backslashes because '\\' becomes \ but '\)' remains as \)" ... Using Data::Dumper, I can now see (I think) why my original regex worked:

    The string I was splitting looks like this

    my $text = '1,This\, is a problem->\\,B,99';

    but when printed with Dumper it looks like this

    $VAR1 = "1,This\\, is a problem->\\,B,99";

    So both '\,' and '\\,' appear the same during processing. Is there a way I can stop '\,' being processed as '\\,'?

    I may have to go down the route of a custom parser, as you have suggested.

      The string I was splitting looks like this

      my $text = '1,This\, is a problem->\\,B,99';

      but when printed with Dumper it looks like this

      $VAR1 = "1,This\\, is a problem->\\,B,99";

      So both '\,' and '\\,' appear the same during processing. Is there a way I can stop '\,' being processed as '\\,'?

      Both Data::Dumper, which is core, and Data::Dump, which I prefer, but it's not core, represent a string in the form of the double-quote constructor needed to reproduce that string, not as the "actual" string. I think this is one source of your confusion.
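      One way to see this for yourself is to count the backslashes in the string and in its dump, then eval the dump back into a string. This is only a sketch. The variable names are mine, and $Data::Dumper::Terse is set just to drop the "$VAR1 = " prefix so the dump can be eval'ed directly.

      ```perl
      use strict;
      use warnings;
      use Data::Dumper;

      $Data::Dumper::Useqq = 1;   # dump strings as double-quoted literals
      $Data::Dumper::Terse = 1;   # omit the "$VAR1 = " prefix

      my $text   = '1,This\, is a problem->\\,B,99';
      my $dumped = Dumper($text);

      # Count backslashes in the real string and in its dumped representation.
      my $in_string = () = $text   =~ /\\/g;
      my $in_dump   = () = $dumped =~ /\\/g;

      print "string: $text\n";
      print "dump:   $dumped";
      print "backslashes: $in_string in the string, $in_dump in the dump\n";

      # The dump is Perl source; evaluating it rebuilds the original string.
      my $round_trip = eval $dumped;
      print "round trip ok\n" if $round_trip eq $text;
      ```

      The dump shows twice as many backslashes because it is source code for the string, not the string itself.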

      I think the critical point you're missing is that there is a fundamental difference between a single- or double-quoted string constructor, e.g., '...' or "...", and the string that is constructed.

      So both '\,' and '\\,' appear the same during processing.

      No. A string may have one or two or any number of uniquely distinguishable sequential backslashes. The question is how to construct the desired string. Consider

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $sq = '\ \\ \\\ \\\\ \\\\\ \\\\\\';
       print qq{<$sq> \n};
       ;;
       my $dq = qq{\\ \\\\ \\\\\\};
       print qq{>$dq< \n};
      "
      <\ \ \\ \\ \\\ \\\>
      >\ \\ \\\<

      In a single-quoted string constructor,  \ and  \\ are different representations of the same constructed character. This peculiarity of single-quoted string constructors allows a string so constructed (update: to have a single-quote character in a '...'-quoted string, or) to end in a single-quote or backslash character:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $sqsq = 'abc\'';
       print qq{<$sqsq> \n};
       ;;
       my $sqbs = 'abc\\';
       print qq{>$sqbs< \n};
      "
      <abc'>
      >abc\<

      (Note that in my code examples, I use  qq{...} as the double-quote "constructor," as I will call it in this reply, due to peculiarities of the Windoze command line interpreter.)
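      Applied to the '\,' versus '\\,' question, a small script (a sketch with variable names of my own choosing) shows that both single-quoted constructors build the same two-character string, and that a string with two backslashes before the comma needs four in the constructor:

      ```perl
      use strict;
      use warnings;

      my $one_a = '\,';       # \ before a comma is kept literally: \,
      my $one_b = '\\,';      # \\ collapses to one backslash:      \,
      my $two   = '\\\\,';    # two escaped backslashes:            \\,

      printf "%-4s length %d\n", $_, length $_ for $one_a, $one_b, $two;
      print "one_a and one_b are the same string\n" if $one_a eq $one_b;
      ```

      So the two spellings are not "processed the same" later on. They are already the same string the moment the literal is evaluated.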

      It's possible to (fairly easily) split the double-quoted string you give as an example and get your desired result:

      c:\@Work\Perl\monks>perl -wMstrict -le
      "my $s = qq{1,Something\\,\\\\text\\\\text\\0x2B\\\\,X,99};
       print qq{<$s> \n};
       ;;
       my @ra = split qr{ (?<! (?<! \\) \\) , }xms, $s;
       print qq{[$_]} for @ra;
      "
      <1,Something\,\\text\\text\0x2B\\,X,99>
      [1]
      [Something\,\\text\\text\0x2B\\]
      [X]
      [99]

      The string is split on the pattern "comma that is not preceded by a backslash that is not preceded by a backslash." This sort of tricksy, double-negative logic is part of the reason that a well-tested module like Text::CSV is so often and enthusiastically recommended for this seemingly-simple parsing application. (I hope this module or one like it is what you're referring to when you write about going "the route of a custom parser.")
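      For anyone not on a Windows command line, the same split can be sketched as a plain script. The names $s, $sep and @fields are just illustrative, and the pattern assumes that no more than two backslashes ever precede a comma. A run of three would also be split, which is probably not what an escaping scheme intends.

      ```perl
      use strict;
      use warnings;

      # Double-quoted constructor: each \\ below is ONE backslash in the string.
      my $s = "1,Something\\,\\\\text\\\\text\\0x2B\\\\,X,99";

      # A comma that is not preceded by a lone backslash, where "lone"
      # means a backslash that is itself not preceded by a backslash.
      my $sep = qr{ (?<! (?<! \\ ) \\ ) , }x;

      my @fields = split $sep, $s;
      print "<$s>\n";
      print "[$_]\n" for @fields;
      ```

      A real escaping scheme with arbitrary runs of backslashes is better served by a small parser or a module such as Text::CSV than by stacking more lookbehinds.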


      Give a man a fish:  <%-{-{-{-<
