http://qs321.pair.com?node_id=1133403

uhClem has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks -- longtime listener; first time caller. I just tried this one over at StackOverflow but the one responder apparently got upset when I suggested his quite nice solution wasn't right for my situation -- and deleted his own answer and all comments! His ball and he's going home! Oh well.

So. I've got big files of hand-entered data in various formats that need cleaning and rearranging; the current one looks like this:

C31 6 3 2.4 1.5 2.6
C32 2 7 3 1.0
H31 1 1 0 21.0 11.2 5.3 1.4
T11 2 1 0 6.0 1.1 2.2
L06 1 1 0 1.0 3.3
L06 1 4 0 1.1 1.8

That first line is bad -- missing its fourth field, which should be [0-3]; all sorts of typo-like errors like that. Catch those, send them to the Bad file, cut up good lines into a hash for redistribution. I've got this one matched like so:

($t, $p, $s, $d) = (/^([A-Z]\d\d) +(1?\d) +(\d?\d(?:\.1)?) +([0-3]) /)
    or ($bad{$line++} = $_) && next;
@cts = ($' =~ /(\d?\d\.\d)/g);

That works but I'd really like to do it in a single pattern so I can simply swap patterns for the many differently similar files still to come. I couldn't figure out anything for this that would do both the careful pattern matching and the variable-length lines all in one go (It's easy to catch every field with just /(\S+)\s+/g but then I have to check each catch separately for its proper form, which makes it messy when I retool the script for the next stinking input file).

At this point I'm mainly interested in the theoretico-mechanical question of whether what I want is *possible*. Can you do a match like

@allFields = (/patt1 patt2 patt3 patt19+/);
where the first three patts each occur once and patt19 occurs {1,n} times, you validate all catches with picky matching or next, and however many patt19s there are in a given line everything winds up in @allFields? Everything I tried got the first three fields and either the first patt19 or the last but I could never get them all.

Thanks!
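
The symptom described in the question (the fixed fields plus only one of the repeated fields) can be reproduced in a few lines. This is a minimal sketch, assuming the sample format from the post with the `(?:\.1)?` part of the third field dropped for brevity; a capture group under a quantifier keeps only the text from its last repetition, so the list-context match returns five values rather than eight.

```perl
use strict;
use warnings;

my $line = 'H31 1 1 0 21.0 11.2 5.3 1.4';

# Four fixed captures, then one capture group repeated by the outer (?: ... )+.
# The whole line is validated, but group 5 is overwritten on every repetition.
my @fields = $line =~ /^([A-Z]\d\d) +(1?\d) +(\d?\d) +([0-3])(?: +(\d?\d\.\d))+ *$/;

print join('|', @fields), "\n";   # prints: H31|1|1|0|1.4
```

The match itself succeeds only when every field is well-formed, so validation works; it is the extraction of the repeated fields that needs a second step or one of the workarounds in the replies.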

Re: validate variable-length lines in one regex?
by poj (Abbot) on Jul 06, 2015 at 19:32 UTC

    I'm not sure this is of any use but I'll offer it anyway. The idea is to create a mask of codes which is then used to select the correct regex for that column.

    #!perl
    use strict;

    my %REGEX = (
      'A'  => qr'^[A-Z]\d\d$',
      'N'  => qr'^[0-9]+$',
      'N3' => qr'^[0-3]+$',
      'D'  => qr'^\d+\.\d+$',
    );
    my @p = qw(A N N N3 D D D D D D);

    while (<DATA>){
      chomp;
      my @f = split '\s+';
      my $chk = 'OK ';
      for my $i (0..$#f){
        if ($f[$i] !~ $REGEX{$p[$i]}){
          $chk = 'ERR';
          $f[$i] = '**'.$f[$i]." $REGEX{$p[$i]}**";
        }
      }
      print join ' ',$chk,@f,"\n";
    }
    __DATA__
    C3 6 3 2.4 1.5 2.6
    C32 2 7 3 1.0
    H31 1 1 0 21.0 11.2 5.3 1.4
    T11 2 1 0 6.0 1.1 2.2
    L06 1 1 0 1.0 3.3
    L06 1 4 0 1.1 1.8
    poj

      Oho! Now that is slick, in a gruesome way. I like it; just might use that. Bonus for doing exactly what I want in an almost unrelated way. It even seems like that could be the basis of a script that could figure out for itself what the probable pattern for each lousy file is, and just yank any outliers... Let's see how big a mess I can make with THAT!

      And just the same, there is still that "D D D D D D" -- a non-indeterminate sequence so you just have to hope you don't run into any lines with seven Ds. I bet there's a way around that (and I know I'll never have more than nine -- in this file...) but that does point back to my original question: Can you make a single regex carefully validate a variable number of fields (and return all matches)? Will perl regex do that, or does it exceed the possibilities?

      Anyway, thanks!
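
One possible way around the fixed "D D D D D D" tail is to let the last code in the mask repeat. The sketch below is not poj's code: the trailing `+` in the mask, the `check_line` name, and the narrowing of `N3` to a single digit are all assumptions made for illustration.

```perl
use strict;
use warnings;

# Same codes as the mask reply; N3 is narrowed to one digit here (an assumption).
my %REGEX = (
    A  => qr/^[A-Z]\d\d$/,
    N  => qr/^[0-9]+$/,
    N3 => qr/^[0-3]$/,
    D  => qr/^\d+\.\d+$/,
);

# Hypothetical mask syntax: a trailing '+' on the last code means "one or more".
sub check_line {
    my ($mask, $line) = @_;
    my @codes  = split ' ', $mask;
    my @fields = split ' ', $line;
    my $repeat = $codes[-1] =~ s/\+$// ? $codes[-1] : undef;

    return 0 if @fields < @codes;                       # too few fields
    return 0 if @fields > @codes && !defined $repeat;   # too many fields

    for my $i (0 .. $#fields) {
        # Fields past the end of the mask reuse the repeating code.
        my $code = $i <= $#codes ? $codes[$i] : $repeat;
        return 0 unless $fields[$i] =~ $REGEX{$code};
    }
    return 1;
}

my $mask  = 'A N N N3 D+';
my @lines = (
    'C31 6 3 2.4 1.5 2.6',
    'C32 2 7 3 1.0',
    'H31 1 1 0 21.0 11.2 5.3 1.4',
    'L06 1 1 0',
);
my @ok = map { check_line($mask, $_) } @lines;
printf "%s %s\n", ($ok[$_] ? 'OK ' : 'ERR'), $lines[$_] for 0 .. $#lines;
```

Because the fields come from `split`, every field is both validated and available afterwards, at the cost of doing the work in a loop rather than in one regex.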

        Another thought - If it's possible to edit the file I would put the mask as the first line, no need to edit the script then. Failing that put something in the filename that chooses the correct mask for you. This would of course mean editing the file for each new mask.

        poj
Re: validate variable-length lines in one regex?
by stevieb (Canon) on Jul 06, 2015 at 20:15 UTC

    This is a huge stab in the dark here, but I thought I'd give it a try. On the last line of the regex, it uses a code evaluation expression (which I believe is still experimental) to set the @cts array using the last unused portion of the line in an internal, separate regex.

    #!/usr/bin/perl
    use warnings;
    use strict;

    my %bad;

    while (<DATA>){
        chomp;
        my @cts;
        my ($t, $p, $s, $d)
            = (/^([A-Z]\d\d)\s+      # $t
                (1?\d)\s+            # $p
                (\d?\d(?:\.1)?)\s+   # $s
                ([0-3])\s+           # $d
                (?{@cts = $' =~ m#(\d?\d\.\d)#g})  # @cts
              /x)
            or ($bad{$_}++) && next;

        print join(' ', @cts);
        print "\n";
    }

    print "\nPrinting bad lines:\n";
    while (my ($k, $v) = each %bad){
        print "$k: $v\n";
    }

    __DATA__
    C31 6 3 2.4 1.5 2.6
    C32 2 7 3 1.0
    H31 1 1 0 21.0 11.2 5.3 1.4
    T11 2 1 0 6.0 1.1 2.2
    L06 1 1 0 1.0 3.3
    L06 1 4 0 1.1 1.8
    __END__

    1.0
    21.0 11.2 5.3 1.4
    6.0 1.1 2.2
    1.0 3.3
    1.1 1.8

    Printing bad lines:
    C31 6 3 2.4 1.5 2.6 : 1

    On one long line, the regex would look like this (note I've replaced your literal spaces with \s):

    /^([A-Z]\d\d)\s+(1?\d)\s+(\d?\d(?:\.1)?)\s+([0-3])\s+(?{@cts = $' =~ m#(\d?\d\.\d)#g})/

    -stevieb

      Hm. Also cool, and formally satisfies my Big Question -- but on the other hand... you know... eval.... Seems sneaky. I'm just saying. Might not want my daughter to marry one.
Re: validate variable-length lines in one regex?
by BrowserUk (Patriarch) on Jul 06, 2015 at 18:23 UTC

    You could do something like this:

    $s = 'a1 1 2 1.0 1.1 1.2 1.3';;
    print grep defined, $s =~ m[([a-z]\d)\s+(\d)\s+(\d)(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?];;
    a1 1 2 1.0 1.1 1.2 1.3

    $s = 'a1 1 2 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7';;
    print grep defined, $s =~ m[([a-z]\d)\s+(\d)\s+(\d)(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?(?:\s+(\d\.\d))?];;
    a1 1 2 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7

    Not elegant, but it works.

    You'd probably want to generate the regex:

    $re = '([a-z]\d)\s+(\d)\s+(\d)' . join '', '(?:\s+(\d\.\d))?' x 10;;
    $re = qr"$re";;

    $s = 'a1 1 2 1.0 1.1 1.2 1.3';;
    print grep defined, $s =~ $re;;
    a1 1 2 1.0 1.1 1.2 1.3

    $s = 'a1 1 2 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7';;
    print grep defined, $s =~ $re;;
    a1 1 2 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7
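
The generated-regex idea can be adapted to the format in the original question roughly as follows. This is a sketch under stated assumptions, not BrowserUk's code: `build_re` and its parameters are hypothetical, the pattern is anchored at both ends so that a malformed field fails the whole line, and the maximum of nine tail fields is taken from the poster's remark about this file.

```perl
use strict;
use warnings;

# Hypothetical helper: $head holds the fixed captures, $tail is the pattern
# for one repeated field, $max is the most tail fields a line may carry.
sub build_re {
    my ($head, $tail, $max) = @_;
    # One mandatory tail capture, then $max - 1 optional ones.
    my $re = $head . "(?: +($tail))" . "(?: +($tail))?" x ($max - 1);
    return qr/^$re *\z/;
}

my $re = build_re('([A-Z]\d\d) +(1?\d) +(\d?\d(?:\.1)?) +([0-3])', '\d?\d\.\d', 9);

for my $line ('H31 1 1 0 21.0 11.2 5.3 1.4', 'C31 6 3 2.4 1.5 2.6') {
    # Optional groups that did not take part come back as undef; drop them.
    my @all = grep { defined } $line =~ $re;
    print(@all ? "ok:  @all\n" : "bad: $line\n");
}
```

Swapping formats then means swapping the three arguments to `build_re`, which seems close to the "list of patterns at the top of the script" the poster asks for.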

      Fun, but hairy. I think for this project I'd just get myself in trouble trying to follow that route.
        I'd just get myself in trouble trying to follow that route.

        Hm. As you will.

Re: validate variable-length lines in one regex?
by AnomalousMonk (Archbishop) on Jul 07, 2015 at 00:13 UTC

    Here's another approach. It doesn't extract all the fields in one swell foop (the patterns I'm calling  $real numbers have to be extracted in a separate step), but one can imagine some degree of customization is possible for different types of records:

    c:\@Work\Perl\monks>perl -wMstrict -le
    "my @lines = (
       'C31 6 3 2.4 1.5 2.6 ',
       'C32 2 7 3 1.0 ',
       'H31 1 1 0 21.0 11.2 5.3 1.4',
       'T11 2 1 0 6.0 1.1 2.2',
       'L06 1 1 0 1.0 3.3',
       'L99 1 1 0 1.1 2.2 3.3 4.4 5.5',
       );
     ;;
     my $int    = qr{ (?<! \d) \d+ (?! \d) }xms;
     my $real   = qr{ $int [.] $int }xms;
     my $header = qr{ [[:upper:]] \d\d }xms;
     ;;
     my $n = 4;
     my $extract = qr{ ($header) \s+ ($int) \s+ ($int) \s+ ($int) ((?: \s+ $real){1,$n}) \s* }xms;
     ;;
     for my $line (@lines) {
       printf qq{'$line' -> };
       my $got = my ($h, $d1, $d2, $d3, $r) = $line =~ m{ \A $extract \z }xms;
       ;;
       if ($got) {
         my @reals = $r =~ m{ $real }xmsg;
         print qq{'$h' '$d1' '$d2' '$d3' (@reals)};
         }
       else {
         print 'unknown';
         }
       }
    "
    'C31 6 3 2.4 1.5 2.6 ' -> unknown
    'C32 2 7 3 1.0 ' -> 'C32' '2' '7' '3' (1.0)
    'H31 1 1 0 21.0 11.2 5.3 1.4' -> 'H31' '1' '1' '0' (21.0 11.2 5.3 1.4)
    'T11 2 1 0 6.0 1.1 2.2' -> 'T11' '2' '1' '0' (6.0 1.1 2.2)
    'L06 1 1 0 1.0 3.3' -> 'L06' '1' '1' '0' (1.0 3.3)
    'L99 1 1 0 1.1 2.2 3.3 4.4 5.5' -> unknown

    Update: Tested under Perl versions 5.14.4 and 5.8.9.



Re: validate variable-length lines in one regex? ([OT] stackoverflow)
by toolic (Bishop) on Jul 06, 2015 at 18:42 UTC
    and deleted his own answer and all comments!
    Cross-post

    Fact: The answer was downvoted.

    Opinion: The answer was likely deleted to avoid further downvotes. My guess is that the person felt the time and effort expended was not appreciated. Maybe the person did not understand what was wrong with the answer, and therefore did not know how to improve it.

      Well, not downvoted by me anyway. I thought it was quite neat -- a good answer in search of a better problem.
Re: validate variable-length lines in one regex?
by Loops (Curate) on Jul 06, 2015 at 23:12 UTC
    This doesn't really answer your main question, which seems to only be possible through some gymnastics. Maybe just cry uncle and don't attempt to combine the operation of validation and extraction:
    /^[A-Z]\d\d +1?\d +\d?\d(?:\.1)? +[0-3](?: +\d?\d\.\d)+ *$/
        or ($bad{$line++} = $_) && next;
    my ($t, $p, $s, $d, @c) = split / +/;
    And with use v5.22 you can clean the regex up even more by making all groups non-capturing with the "n" modifier:
    /^[A-Z]\d\d +1?\d +\d?\d(\.1)? +[0-3]( +\d?\d\.\d)+ *$/n
    There is something to be said for keeping it simple and clear. So maybe even remove repeated space beforehand:
    use v5.10;
    use warnings;

    my $valid = qr/^[A-Z]\d\d 1?\d \d?\d(\.1)? [0-3]( \d?\d\.\d)+ ?$/;
    $, = ', ';

    while (<DATA>) {
        chomp;
        tr/ / /s;
        say /$valid/ ? split : "Error:$_";
    };

    __DATA__
    C31 6 3 2.4 1.5 2.6
    C32 2 7 3 1.0
    H31 1 1 0 21.0 11.2 5.3 1.4
    T11 2 1 0 6.0 1.1 2.2
    L06 1 1 0 1.0 3.3
    L06 1 4 0 1.1 1.8
Re: validate variable-length lines in one regex?
by Anonymous Monk on Jul 06, 2015 at 18:14 UTC

    Hi monks -- longtime listener; first time caller. I just tried this one over at StackOverflow but the one responder apparently got upset when I suggested his quite nice solution wasn't right for my situation -- and deleted his own answer and all comments! His ball and he's going home! Oh well.

    I wouldn't call that upset, just not feeding the trolls. SO is nice like that: if the OP is being a troll, you can delete your communications and forget about it.

      But my goodness -- I spent all the livelong morning being ever so nice! And it was really a very lovely solution he provided, and I said so. If I'm a troll I'm not very good at it.

        But my goodness -- I spent all the livelong morning being ever so nice! And it was really a very lovely solution he provided, and I said so. If I'm a troll I'm not very good at it.

        Why bring up experiences on another site (waah, they didn't help me)?

        If you're trying to get help from the perlmonks, why wouldn't you improve your question instead?

        The guy who was trying to help you didn't take his ball and go home -- he just stopped trying to help a vague troll

Re: validate variable-length lines in one regex?
by sundialsvc4 (Abbot) on Jul 06, 2015 at 18:21 UTC

    I don’t readily think so ... although, “this being Perl, I’m sure it can be done.”   Most of what would concern me, in addition to the potential complexity, is also the notion of coupling between the various alternatives.   You want to be able to change this thing easily ... to add new “stinkin’ input files” without stinking-up the old ones.   ;-)

    One idea that comes to mind is to build-up the process, maybe as a list of (anonymous?) subs, or maybe as very-repetitive subs which call other ones (e.g. “until one of them returns false.”)   Each of those subroutines would match against a particular regex.   (The reason why I didn’t suggest a list of qr// regexes is that you might well need to do “slightly more” in one-or-another of those subroutines . . .)
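
The list-of-subs idea might look something like the sketch below. The names `@checks`, `$tail_check` and `validate` are hypothetical, and the per-field patterns are lifted from the regex in the original question; nothing here is taken from code posted in the thread.

```perl
use strict;
use warnings;

# Hypothetical per-field checks for the sample format; each sub sees one field.
my @checks = (
    sub { $_[0] =~ /^[A-Z]\d\d$/ },
    sub { $_[0] =~ /^1?\d$/ },
    sub { $_[0] =~ /^\d?\d(?:\.1)?$/ },
    sub { $_[0] =~ /^[0-3]$/ },
);
# Applied to every field after the fixed ones.
my $tail_check = sub { $_[0] =~ /^\d?\d\.\d$/ };

# Returns the list of fields for a good line, or an empty list for a bad one.
sub validate {
    my @fields = split ' ', shift;
    return unless @fields > @checks;              # need at least one tail field
    for my $i (0 .. $#fields) {
        my $check = $i < @checks ? $checks[$i] : $tail_check;
        return unless $check->($fields[$i]);      # stop at the first false result
    }
    return @fields;
}

for my $line ('T11 2 1 0 6.0 1.1 2.2', 'C31 6 3 2.4 1.5 2.6') {
    my @f = validate($line);
    print(@f ? "good: @f\n" : "bad:  $line\n");
}
```

Each sub can do "slightly more" than a regex match when a field needs it, which is the advantage claimed for subs over a plain list of qr// patterns; the cost is a longer definition per file format.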

      Yah, this is basically what the StackOverflow fellow suggested, and it looked like something I'd really enjoy doing. Except in this situation, where I may have to do a hundred of these alternate patterns so it'd be a lot easier, albeit uglier, to keep just a list of patterns at the top of the script and toggle the one I want. They're only one line each, while the subs-pile could get huge and unreadable. We're not very concerned with best practices here since this is a one-time torture; no maintainer worries.

      Anyway, it may be as you say, and impractical for other reasons after all, but it's gotten me interested in this basic question of whether the trick is possible at all. If it's true that for variable-length lines a single comparison can catch all fields but not validate, or can validate but not catch all fields, that's an interesting fundamental point.

        I, too, am interested.   I agree, based on this further explanation, that my original suggestion of subs could become unmanageable in the use-case that you describe.   I graciously withdraw my suggestion . . .

        The notion of building a Regex, as BrowserUK originally suggests below and as poj took and ran-with a little later on, is probably the right balance between readability and flexibility.   (All now liberally sprinkled with up-vote pixie-dust...)