PerlMonks  

Regex to match 20 chars of some digits followed by some spaces

by leriksen (Curate)
on Dec 19, 2003 at 02:23 UTC ( [id://315724]=perlquestion )

leriksen has asked for the wisdom of the Perl Monks concerning the following question:

My spec states that a valid account number is 20 characters long and consists of at least one digit followed by spaces (and just spaces, not everything that matches \s). The account number is extracted from a longer string.

I am using Parse::RecDescent to validate the longer string as

    record  : account address info
    account : /trusty regex here/
    address : ...
    info    : ...

There is a lot more to the grammar (about another 200 lines), but this is the bit relevant to my problem.

Everything in record is fixed length.

But I am having trouble writing a one-shot regex to validate this.

I think I want to do this

account : /\d(\d){0,19} {19 - length \1}/
but that doesn't work. I definitely want the account rule to consume 20 characters, so that the address rule starts at the right point.

     01234567890123456789
    '123                 ' # OK
    '123    123          ' # NOK
    '                    ' # NOK
    '123c                ' # NOK

Am I being dense or is this tricky ?

+++++++++++++++++
#!/usr/bin/perl
use warnings;use strict;use brain;

Replies are listed 'Best First'.
Re: Regex to match 20 chars of some digits followed by some spaces
by tachyon (Chancellor) on Dec 19, 2003 at 03:07 UTC

    I can't see any good reason to use Parse::RecDescent to parse fixed-width records. This would seem to be using an A-bomb to crack a walnut. Surely you would be better off unpacking the data into a structure and validating from there?

    In addition to the examples above you can for fun autogenerate one that does the job you want - rather ugly but it does work.

        for ( reverse 1..20 ) {
            $re .= sprintf "\\d{%d} {%d}|", $_, 20-$_;
        }
        chop $re;
        $re = qr/^(?:$re)$/;
        print $re, $/;
        @tests = (
            '01234567890123456789', # OK
            '123                 ', # OK
            '123                ',  # NOK
            '123    123          ', # NOK
            '                    ', # NOK
            '123c                ', # NOK
        );
        for (@tests) {
            print m/$re/ ? "'$_' #OK\n" : "'$_' #NOK\n"
        }

        __DATA__
        (?-xism:^(?:\d{20} {0}|\d{19} {1}|\d{18} {2}|\d{17} {3}|\d{16} {4}|\d{15} {5}|\d{14} {6}|\d{13} {7}|\d{12} {8}|\d{11} {9}|\d{10} {10}|\d{9} {11}|\d{8} {12}|\d{7} {13}|\d{6} {14}|\d{5} {15}|\d{4} {16}|\d{3} {17}|\d{2} {18}|\d{1} {19})$)
        '01234567890123456789' #OK
        '123                 ' #OK
        '123                ' #NOK
        '123    123          ' #NOK
        '                    ' #NOK
        '123c                ' #NOK

    cheers

    tachyon

      Not all the text I am parsing is fixed width, just this bit.

      The rest is somewhat like this (deboned to protect client)

          document : checkpoint address report(s?) doctrailer
          report : report1 | report2 | report3 | ...
          report1 : lt[100] report1_cost_centre(s?) lt[200]
          report1_cost_centre : lt[300] report1_txn(s?) lt[400]
          report1_txn : lt[500] lt[600] page_break(?) lt[700]
          lt : "<LT$arg[0]>" lt_data
          lt_data : /[^\\]*/ lt_end {$return = $item{__PATTERN1__}
          ...

      but for 15 different reports, hundreds of lt records, lots of options, repeats and alternations.

      +++++++++++++++++
      #!/usr/bin/perl
      use warnings;use strict;use brain;

        Ah that makes more sense. If you have to deal with fixed width records you may find this sub handy:

            $str = 'first name       EOFlast name        EOFaddress field              EOF';
            my @rec_def = (
                [ 'first_name', 20 ],
                [ 'last_name',  20 ],
                [ 'address',    30 ],
            );

            sub parse_fixed_width {
                my ( $record, $rec_def ) = @_;
                my %struct;
                my $offset = 0;
                for my $rec (@$rec_def) {
                    $struct{$rec->[0]} = substr $record, $offset, $rec->[1];
                    $offset += $rec->[1];
                }
                return length($record) == $offset ? \%struct : '';
            }

            use Data::Dumper;
            print Dumper parse_fixed_width( $str, \@rec_def );

            __DATA__
            $VAR1 = {
                      'first_name' => 'first name       EOF',
                      'address' => 'address field              EOF',
                      'last_name' => 'last name        EOF'
                    };

        cheers

        tachyon

Re: Regex to match 20 chars of some digits followed by some spaces
by Zaxo (Archbishop) on Dec 19, 2003 at 03:32 UTC

    You can take advantage of the fixed-width property with unpack and validate the data afterwards.

        # Given $record
        my %record;
        @record{ qw/account address info/ } =
            unpack 'A20 A42 A255', $record; # adjust widths to suit
        # ($record{'account'}) = $record{'account'} =~ /^(\d[\d ]*)$/
        ($record{'account'}) = $record{'account'} =~ /^(\d+)$/
            or die 'Bad Account ID'; # detaints, too
        # verify the rest

    The unpack width enforces the field width you expect. If spaces can't occur between digits, it becomes even simpler. The matching regex would then be /^(\d+)$/. 'An' is the unpack template for a space-padded field of bytes and results in stripping the trailing spaces. In the regex, [\d ] is a character class of digits and spaces.

    Update: Simplified the code to agree with leriksen's spec.

    After Compline,
    Zaxo
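
A variation on the unpack approach, as a minimal sketch rather than anyone's posted solution. It assumes made-up field widths (20, 10 and 5) and a hypothetical helper named parse_record. It uses the 'a' template instead of 'A' so the account field keeps its trailing spaces and can be checked against the "spaces only" part of the spec.

```perl
use strict;
use warnings;

# Hypothetical field widths; adjust to the real record layout.
my $TEMPLATE = 'a20 a10 a5';

# Returns a hash ref of raw fields, or undef if the account field is invalid.
sub parse_record {
    my ($record) = @_;
    my %field;

    # 'a' keeps each field verbatim; 'A' would strip trailing whitespace.
    @field{qw(account address info)} = unpack $TEMPLATE, $record;

    my $account = $field{account};

    # unpack does not complain about short input, so check the width here.
    return unless defined $account && length($account) == 20;

    # One or more ASCII digits, then literal spaces only, up to the end.
    return unless $account =~ /\A[0-9]+ *\z/;

    return \%field;
}

my $good = parse_record( '123'  . ' ' x 17 . '1 Main St.' . 'info!' );
my $bad  = parse_record( '123c' . ' ' x 16 . '1 Main St.' . 'info!' );

print defined $good ? "good: account='$good->{account}'\n" : "good: rejected\n";
print defined $bad  ? "bad: accepted\n"                    : "bad: rejected\n";
```

The explicit length check matters because a record shorter than the template still unpacks without error.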

Re: Regex to match 20 chars of some digits followed by some spaces
by Roger (Parson) on Dec 19, 2003 at 05:16 UTC
    Hi leriksen, you were so close to getting it right. You only need to extend the regexp a little with the match-time interpolation technique.
        use strict;
        use warnings;

        while (<DATA>) {
            chomp;
            print m/(\d{1,20})(??{' ' x (20 - length($1))})/
                ? "match\n" : "not match\n";
        }

        __DATA__
        "     123451234512345"
        "  123451234512345   "
        "123451234512345     "
        "1234512345          "
        "123  451  2345      "
        "                    "

    And the output is exactly as expected -

        not match
        not match
        match
        match
        not match
        not match

Re: Regex to match 20 chars of some digits followed by some spaces
by leriksen (Curate) on Dec 19, 2003 at 03:02 UTC
    Some colleagues were first to the punch

    mildside has

    m/^\d((?<=\d)\d(?=([ \d]|$))|(?<=[\d ]) (?=( |$))){19}$/
    another is
    m/^(?=\d*(?:\d ) *(?!\d)$)[0-9 ]{20}$/

    +++++++++++++++++
    #!/usr/bin/perl
    use warnings;use strict;use brain;

Re: Regex to match 20 chars of some digits followed by some spaces
by blokhead (Monsignor) on Dec 19, 2003 at 03:05 UTC
    The contents of {..} in your regex aren't interpolated. You actually need them to be (re)interpolated at the time of a possible match. You can do this with (??{ code }), which is a bit of an ugly hack... I have no idea if Parse::RecDescent will like these, presumably it just evals the regex so it may work.
        my $regex = qr/\[(\d{1,20})(??{ " {" . (20 - length $1) . "}" })\]/;

        while (<DATA>) {
            print /$regex/ ? "yes\n" : "no\n";
        }

        __DATA__
        [12345678901234567890]
        [123                 ]
        [234223423           ]
        [23409234329c        ]

    I don't know of a good way to do this without extended regex features (or multiple regexes). If there were some way to do this in general, I'd have to get to work on some regex abuse a la Abigail. There have been a few times when something like this would have been handy!

    blokhead
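
On the question of whether a tool that builds its regexes from text will accept (??{ code }): perl refuses to compile a code block that arrives through run-time interpolation of a plain string unless `use re 'eval'` is in effect. The sketch below shows only that general behaviour. The string in $source stands in for any pattern that is supplied as text, and nothing here is taken from Parse::RecDescent itself.

```perl
use strict;
use warnings;

# The pattern arrives as a plain string (an assumption standing in for
# any tool that assembles its regexes from text at run time).
my $source = '\A(\d{1,20})(??{ " {" . (20 - length $1) . "}" })\z';

# Without the pragma, interpolating a string containing (??{ }) dies.
my $ok = eval { my $re = qr/$source/; 1 };
print $ok ? "compiled without re 'eval'\n" : "refused: $@";

{
    use re 'eval';    # lexically scoped opt-in for run-time code blocks
    my $re = qr/$source/;

    for my $input ( '123' . ' ' x 17, '123c' . ' ' x 16 ) {
        printf "'%s' %s\n", $input, $input =~ $re ? 'OK' : 'NOK';
    }
}
```

A pattern written literally in the source, or one already compiled with qr//, does not need the pragma, which is why the qr// version above runs as posted.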

Re: Regex to match 20 chars of some digits followed by some spaces
by sauoq (Abbot) on Dec 19, 2003 at 06:34 UTC
    perl -nle 'print "match" if /^\d(?:\d| (?![^ ])){19}$/'

    Matches 1 digit followed by 19 digits or spaces-not-followed-by-a-non-space.

    -sauoq
    "My two cents aren't worth a dime.";
    
Re: Regex to match 20 chars of some digits followed by some spaces
by ysth (Canon) on Dec 19, 2003 at 06:19 UTC
    /\d(?!.{0,18} \d)[\d ]{19}/
    Matches 20 digits and spaces, beginning with a digit, where there are no digits following spaces.

    Update: had 0,19, meant 19

      Excellent. Piqued my interest in this approach, which led to a few observations:
      • you probably want to anchor your match
      • you could have used dot-star instead of .{0,18}, or you could have specified the chars for the dot (TIMTOWTDI) -- probably the most efficient thing would be to specify (?!\d* +\d), which made me realize...
      • it can be done with a positive lookahead like so: /^\d(?=\d* *$)[\d ]{19}$/ ("leading digit, followed by digits, then spaces, then end of string, etc.")

      The PerlMonk tr/// Advocate
        I actually have never used Parse::RecDescent but was assuming it was working on an input buffer using the supplied regex as something like /\G$regex/gc so I didn't supply a beginning anchor and an ending anchor is not usable. I haven't got around to checking my assumption yet...and you know what they say when you assUme.
(YAWTDI) Regex to match 20 chars of some digits followed by some spaces
by Zaxo (Archbishop) on Dec 19, 2003 at 08:05 UTC

    No extraction of data this time, just a little check on $record:

        print substr( $record, 0, 20) =~ /^\d+ *$/ ? 'OK' : 'NOK';

    I like to keep the regexen as simple as possible.

    After Compline,
    Zaxo

Re: Regex to match 20 chars of some digits followed by some spaces
by BrowserUk (Patriarch) on Dec 19, 2003 at 06:53 UTC

    Here's my attempt which seems a little simpler than some of the others.

    m[^ \d (?: (?<! \x20 ) \d | \x20 ){19} $]x

    Which says that the entire string must consist of a digit followed by 19 ((digits not preceded by spaces) or spaces).

        print m[^ \d (?: (?<!\x20) \d | \x20 ){19} $]x ? 'Yes:' . $_ : ' No:' . $_ for @t;

         No:     123451234512345
         No:  123451234512345   
        Yes:123451234512345     
        Yes:1234512345          
         No:123  451  2345      
         No:                    

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
    Hooray!

Re: Regex to match 20 chars of some digits followed by some spaces
by Enlil (Parson) on Dec 19, 2003 at 07:24 UTC
    In the spirit of TMTOWTDI:
        #!/usr/bin/perl
        use strict;
        use warnings;

        while (<DATA>) {
            chomp;
            print m/\d(?:(?:\d| (?!\d))){19}/
                ? "$_ matches\n"
                : "$_ does not match\n";
        }

        __DATA__
        "01234567890123456789"
        "123"
        "123    123          "
        "123c                "
        "                    "
        "11232424525252423   "

    enlil

Re: Regex to match 20 chars of some digits followed by some spaces
by duff (Parson) on Dec 19, 2003 at 03:06 UTC

    Sounds like you just need to move the digit checking out of your rule and into your program logic. I.e., match 20 chars in the rule, and then check in your code that you got the requisite number of digits. Something like:

    account: /\d[ \d]{18} /

    I interpret your example to mean that you always want at least one digit and one space. Of course, that will let things like "12 56 90123456789 " through, but that's where you use one of those nifty code blocks after the rule :)
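
The "match 20 characters, then check in code" idea can be tried outside Parse::RecDescent. The following is a minimal sketch under assumptions: take_account is a hypothetical helper, the buffer contents are made up, and the \G anchor with /gc only imitates a parser working through an input buffer. It is not how Parse::RecDescent is implemented. The `//` operator needs perl 5.10 or later.

```perl
use strict;
use warnings;

# Hypothetical buffer: a 20-character account field, then the next field.
my $buffer = '4711' . ' ' x 16 . '42 Wallaby Way';

# Try to consume one account field at the current position of $$buf_ref.
# Returns the raw 20-character field, or undef with pos() restored.
sub take_account {
    my ($buf_ref) = @_;
    my $start = pos($$buf_ref) // 0;

    # Step 1: the pattern only claims 20 characters of digits or spaces.
    return unless $$buf_ref =~ /\G([0-9 ]{20})/gc;
    my $field = $1;

    # Step 2: "digits first, then only spaces" is checked in plain code.
    return $field if $field =~ /\A[0-9]+ *\z/;

    pos($$buf_ref) = $start;    # undo the consumption on failure
    return;
}

my $account = take_account( \$buffer );
if ( defined $account ) {
    printf "account ok, next field starts at offset %d\n", pos($buffer);
}
else {
    print "account rejected\n";
}
```

Because the pattern consumes exactly 20 characters, the next field starts at the right offset even when it begins with a digit or a space.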

Re: Regex to match 20 chars of some digits followed by some spaces
by Chmrr (Vicar) on Dec 19, 2003 at 14:21 UTC

    Yet another way to do it:

        /^\d+ *(?<=^.{20})$/

    That is, one or more digits, followed by zero or more spaces -- and only thereafter do we check that it summed to 20 characters total. Probably not as efficient (involves more backtracking) but easier for my eyes to understand.
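
Several of the anchored patterns in this thread can be cross-checked against the examples from the question with a small harness. This is only a sketch: the labels in %candidate are informal names for the posted patterns, and the padded test strings are built with the x operator on the assumption that every field is exactly 20 characters wide.

```perl
use strict;
use warnings;

# Anchored patterns copied from the replies above; the keys are just labels.
my %candidate = (
    'sauoq'     => qr/^\d(?:\d| (?![^ ])){19}$/,
    'lookahead' => qr/^\d(?=\d* *$)[\d ]{19}$/,
    'BrowserUk' => qr/^ \d (?: (?<! \x20 ) \d | \x20 ){19} $/x,
    'Chmrr'     => qr/^\d+ *(?<=^.{20})$/,
);

# [ input, expected verdict ] pairs taken from the original question.
my @case = (
    [ '01234567890123456789',  1 ],
    [ '123' . ' ' x 17,        1 ],
    [ '123    123' . ' ' x 10, 0 ],
    [ ' ' x 20,                0 ],
    [ '123c' . ' ' x 16,       0 ],
);

my $failures = 0;
for my $name ( sort keys %candidate ) {
    for my $case (@case) {
        my ( $input, $expected ) = @$case;
        my $got = $input =~ $candidate{$name} ? 1 : 0;
        next if $got == $expected;
        $failures++;
        print "$name: unexpected result for '$input'\n";
    }
}
print $failures ? "$failures mismatches\n" : "all candidates agree with the spec\n";
```

The harness covers only whole 20-character strings. It says nothing about how the unanchored variants behave when more data follows the account field.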

Node Type: perlquestion [id://315724]
Approved by PERLscienceman
Front-paged by broquaint