http://qs321.pair.com?node_id=607740

mattford63 has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

By foul means, an evilly generated report that I have no control over must be parsed; it takes the form:

"a value with ""quotes"" in"

As the report generator encloses text within double quotes, if it encounters any double quotes in the text it's enclosing, it simply doubles them up. But this leaves me very stuck when trying to pull the data back out from between the single (undoubled) quotes via a regex tokenizer. The regex spirit is not lifting her filthy skirt for me tonight. How can I return everything between the single "double" quotes? That is, make

$1='a value with ""quotes"" in'

I toyed around with lookaheads/lookbehinds but don't really get them. Maybe it's easier than I think it is and I'm just confused.

my $s = qr/"(.*?)"(?!")\s*/xms;

Don't look at the above too long; it's just my latest desperate attempt.

Is it possible in the general sense? Any other nice solutions to the problem (other than regex)?

These cases should work:

""

"a"""""

"a""b""c""d""f"

Thanks for any help!

Matt.


Replies are listed 'Best First'.
Re: Help with Double Double Quotes regular expression (precise)
by tye (Sage) on Apr 02, 2007 at 03:20 UTC

    The proper unambiguous way to parse this is:

    /"((?:[^"]+|"")*)"/

    (followed, of course, by s/""/"/g on the variable you saved $1 into).

    I guess the extra test case that proves this is the right approach is that it knows that

    "the ""quotes"" aren't closed

    is an error.

    - tye        

      At least in Perl, I'd do this instead:
      /"((?>[^"]+|"")*)"/
      As soon as you can see nested quantifiers, it should raise a red flag: in the worst case, this can get very slow.

      Only last week, I had to deal with a similar nested-quantifier regexp in JavaScript, where the processing time of a function using the regexp went from 15ms in the typical case to 8 seconds in a bad case. That's a slowdown factor of 500. And it could have been even worse.

      And unfortunately, JavaScript doesn't know (?>pattern). Perl does. Use it when appropriate.
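To make the (?>...) syntax concrete, here is a minimal sketch comparing the plain and atomic forms on the test cases from the question. The \A and \z anchors and the capture helper are assumptions added for illustration (each input is taken to be exactly one quoted field); the sketch only shows that both forms capture the same text on well-formed input and makes no timing claims.

```perl
use strict;
use warnings;

# Both patterns are anchored with \A and \z here (an assumption: each
# input string is exactly one quoted field).
my $plain  = qr/\A"((?:[^"]+|"")*)"\z/;   # ordinary non-capturing group
my $atomic = qr/\A"((?>[^"]+|"")*)"\z/;   # atomic group, as suggested above

# Hypothetical helper: returns the captured text, or undef on no match.
sub capture {
    my ($pattern, $input) = @_;
    my ($inner) = $input =~ $pattern;     # list context yields ($1) on success
    return $inner;
}

for my $input ('""', '"a"""""', '"a""b""c""d""f"') {
    my $p = capture($plain,  $input);
    my $a = capture($atomic, $input);
    print "$input: plain >$p< atomic >$a<\n";
}
```

On these inputs the two patterns capture identical text; the atomic group only changes what the engine is allowed to retry after a failure.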

        Actually, part of the point of being precise in a regex is that it prevents pathological performance problems even in the face of nested quantifiers; even they can't get really slow. If the regex is "precise", then it can only match one way and it will never thrash trying a ton of different combinations, even if used as part of some larger regex. It also means that it won't surprise you by matching in some unexpected way.

        The regex can backtrack out of what it tried to match but each step will find no alternative way to move forward and so it will backtrack out very directly. The use of (?>...) will only slightly speed up such a case.

        To verify that a regex is precise, look at each decision point in the regex and check that there is only one way to move forward from there. So if you have /A(B|C)*D/, after you've matched A your regex will try the following ways to move forward (in this order): 1) Try to match a longer version of A, 2) Match B, 3) Match C, 4) Match D.

        If I'd not been in a bit of a hurry when I'd posted, I would have done this and noticed my mistake. So let's look at the well-trodden territory of matching quotes when \" is used to escape the quote instead of "".

        So in the case of /"((?:[^\\"]+|\\")*)"/, A is ", B is [^\\"]+, C is \\", and D is ". So A can't be matched in any longer way (since it has no quantifiers). C can only match if the next character is a backslash. D can only match if the next character is a double quote. Finally, B can only match if the next character is neither \ nor ", so that decision point is unambiguous. Note that the use of + is important here. /"((?:[^\\"]*|\\")*)"/ (note the + was replaced with *) is not precise.
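As a small illustration of the backslash-escaped pattern just analysed, the sketch below applies it to one well-formed and one unclosed string. The helper name, the \A anchor, and the sample inputs are assumptions made for the example, not part of the original post.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unescaped contents of a quoted string
# that uses \" for embedded quotes, or undef if the string does not
# start with a well-formed quoted value.
sub parse_backslash_quoted {
    my ($text) = @_;

    # A is ", B is [^\\"]+, C is \\", D is " (the \A anchor is added here).
    return undef unless $text =~ /\A"((?:[^\\"]+|\\")*)"/;

    my $inner = $1;
    $inner =~ s/\\"/"/g;    # turn each \" back into a plain "
    return $inner;
}

for my $input (q{"say \"hi\" now"}, q{"never \"closed}) {
    my $inner = parse_backslash_quoted($input);
    print "$input => ", (defined $inner ? ">$inner<" : 'no match'), "\n";
}
```

At every position only one of B, C, or D can match the next character, which is the property described above.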

        Some who have read Mastering Regular Expressions would advise you to avoid the nested quantifiers and instead use /"((?:[^\\"]|\\")*)"/. That version is also precise. The main problem with it is a Perl quirk that prevents a regex part from matching more than 32K times and so that regex will die if used to try to match a string of more than 32K characters inside quotes. I also guess that it would be slightly slower.

        Now, the problem with /"((?:[^"]+|"")*)"/ is that C ("") and D (") lead to an ambiguity. If I'd bothered to test my own test case, I would have also seen this. But coming up with sufficient test cases is quite a challenge so I find it works better to also check that my regex is precise. So I should have posted:

        /"((?:[^"]+|"")*)"(?!")/

        which is precise but in one case must look at the next two characters before knowing which route must be taken.
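Here is a minimal sketch of how the corrected pattern might be used, run against the test cases from the question plus the unclosed example. The helper name and the \A anchor are assumptions added for illustration; without an anchor, a failed attempt at the first quote lets the engine retry from a later quote in the string.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the unescaped contents of a quoted field at
# the start of $text, or undef if no well-formed field is found there.
sub parse_quoted {
    my ($text) = @_;

    # The corrected pattern from the post, with \A added so that a failed
    # parse cannot be "rescued" by matching further along the string.
    return undef unless $text =~ /\A"((?:[^"]+|"")*)"(?!")/;

    my $field = $1;
    $field =~ s/""/"/g;    # undo the generator's quote doubling
    return $field;
}

for my $input ('""', '"a"""""', '"a""b""c""d""f"',
               q{"the ""quotes"" aren't closed}) {
    my $field = parse_quoted($input);
    printf "%-32s => %s\n", $input, defined $field ? ">$field<" : 'no match';
}
```

The three well-formed inputs yield an empty string, a"" and a"b"c"d"f respectively, and the unclosed input is rejected.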

        I prefer to not use (?>...) because avoiding it "forces" me to ensure that my regex is precise. But (?>...) can be useful both for preventing pathological performance problems and for preventing some surprises (though I haven't seen it used enough to get a feel for how often it will lead to other types of surprises).

        - tye        

Re: Help with Double Double Quotes regular expression
by GrandFather (Saint) on Apr 02, 2007 at 02:44 UTC

    Two steps is the magic:

    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        print "$_: ";
        s/"(.*)"/$1/g;
        s/""/"/g;
        print ">$_<\n";
    }

    __DATA__
    ""
    """"
    "a"""""
    "a""b""c""d""f"

    Prints:

    "": ><
    """": >"<
    "a""""": >a""<
    "a""b""c""d""f": >a"b"c"d"f<

    Note that this fails if there is more than one quoted string in the string being processed.
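For lines that do hold several quoted strings, one possible approach is a \G-anchored tokenizer built on the /"((?:[^"]+|"")*)"(?!")/ pattern from the earlier reply. This is only a sketch: the helper name is hypothetical, and it assumes fields are separated by optional whitespace, which the question does not actually specify.

```perl
use strict;
use warnings;

# Hypothetical helper: returns an array ref of unescaped fields, or undef
# if the line contains anything other than well-formed quoted fields
# separated by optional whitespace (an assumed separator).
sub quoted_fields {
    my ($line) = @_;
    my @fields;

    # \G continues from where the previous match ended; /c keeps that
    # position when a match attempt finally fails.
    while ($line =~ /\G\s*"((?:[^"]+|"")*)"(?!")/gc) {
        my $field = $1;
        $field =~ s/""/"/g;    # undo the quote doubling
        push @fields, $field;
    }

    # Only trailing whitespace may remain after the last field.
    return undef unless $line =~ /\G\s*\z/gc;
    return \@fields;
}

my $fields = quoted_fields('"a""b" "c" ""');
print join('|', map { ">$_<" } @$fields), "\n";
```

A line with an unterminated field makes the helper return undef rather than a partial list.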


    DWIM is Perl's answer to Gödel
Re: Help with Double Double Quotes regular expression
by MonkE (Hermit) on Apr 02, 2007 at 02:57 UTC
    Why not just change the input string with s/""/"/g and the like? Then you can extract the text between the single quotes. Below is my attempt to extract the quoted data (tested).

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        chomp;
        # remove the pesky quotes at the beginning and end
        s/^"//;
        s/"$//;
        # change all doubled quotes into just singles
        s/""/"/g;
        print $_ . "\n";
    }

    __DATA__
    ""
    "a"""""
    "a""b""c""d""f"
    "This is a ""test"""
    "Here we have """"a nested set of double-quoted quotes"" or whatever."""
    "Harvey ""the Screwdriver"" Ledbetter"

    Output (the first line, for "", is empty):

    a""
    a"b"c"d"f"
    This is a "test"
    Here we have ""a nested set of double-quoted quotes" or whatever."
    Harvey "the Screwdriver" Ledbetter
Re: Help with Double Double Quotes regular expression
by izut (Chaplain) on Apr 02, 2007 at 14:58 UTC

    I had that problem when writing SQL::Tokenizer. I ended up with this:

    ".*?(?:(?:""){1,}"|(?<!["\\])"(?!")|\\"{2})

    I also had to handle escaped quotes (\") and all the other cases you mentioned.

    Igor 'izut' Sutton
    your code, your rules.