http://qs321.pair.com?node_id=1074981

demichi has asked for the wisdom of the Perl Monks concerning the following question:

Hi all,

I have been stuck on a regular expression for a while. I get output (from an application) like this:

PID POLS U(%) POOL_NAME Seq# Num LDEV# H(%) VCAP(%) TYPE PM
003 POLN 0 Bad name with spaces 13453 2 61443 80 - OPEN N
002 POLN 52 DemoSolutions 54068 7 61454 80 - OPEN N

In words: Line starts, 1-n characters, 1-n spaces as delimiter, 1-n characters with 0-n spaces, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters with 0-n spaces, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters, 1-n spaces as delimiter, 1-n characters, line ends.

What I would like to generate is a file like:

PID;POLS;U(%);POOL_NAME;Seq#;Num;LDEV#;H(%);VCAP(%);TYPE;PM;
003;POLN;0;Bad name with spaces;13453;2;61443;80;-;OPEN;N;
002;POLN;52;Demo;54068;7;61454;80;-;OPEN;N;

I can't manage to tell the spaces inside the names apart from the spaces used as delimiters. I tried something like this:

$line =~ /^(\w+)\s+(\w+)\s+([\w\(\)\%-]+)\s+([\s\w]*?\w+)\s+([#\w]+)\s+(\w+)\s+([#\w]+)\s+([\w\(\)\%-]+)\s+([\w\(\)\%-]+)\s+(\w+)\s+(\w+)\s+/;

and get something like this:
PID;POLS;U(%);POOL_NAME;Seq#;Num;LDEV#;H(%);VCAP(%);TYPE;PM;
003;POLN;0;Bad;name;with;spaces;13453;2;61443;80;
002;POLN;52;Demo;54068;7;61454;80;-;OPEN;N;
=> Without success.

I would be very happy if you can help me.

regards deMichi

Re: Regular Expression - delimiter/spaces problem
by toolic (Bishop) on Feb 14, 2014 at 17:43 UTC
    Are you sure the application doesn't output fixed-width data? If it did, then you could use unpack. Otherwise, maybe you can use \d instead of \w if some of your columns always output numbers.
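A minimal sketch of the unpack idea, assuming the output really is fixed-width. The column widths in `$template` and the helper name `parse_fixed` are hypothetical, chosen only for illustration; the real widths would have to be read off the application's actual output.

```perl
use strict;
use warnings;

# Hypothetical layout: these widths are assumptions, not taken from the
# real application. 'A' fields have their trailing spaces stripped.
my $template = 'A4 A5 A5 A22 A*';

sub parse_fixed {
    my ($line) = @_;
    return unpack $template, $line;
}

# Build a sample line that follows the assumed layout.
my $line = sprintf '%-4s%-5s%-5s%-22s%s',
    '003', 'POLN', '0', 'Bad name with spaces', '13453 2 61443 80 - OPEN N';

my @cols = parse_fixed($line);

# The last chunk holds the space-free columns, so a plain split is enough.
print join(';', @cols[0 .. 3], split(' ', $cols[4])), "\n";
```

With a fixed-width layout the spaces inside the pool name stop being a problem, because the field boundaries come from the column positions rather than from the whitespace.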

    UPDATE: This seems to work. Probably no uglier than a regex:

    use warnings;
    use strict;

    while (<DATA>) {
        chomp;
        my @cols = split;
        my @a1   = splice @cols, 0, 3;
        my @a2   = splice @cols, -7, 7;
        my $pool = join ' ', @cols;
        print join(';', @a1, $pool, @a2), "\n";
    }
    __DATA__
    PID POLS U(%) POOL_NAME Seq# Num LDEV# H(%) VCAP(%) TYPE PM
    003 POLN 0 Bad name with spaces 13453 2 61443 80 - OPEN N
    002 POLN 52 DemoSolutions 54068 7 61454 80 - OPEN N

      Your solution works with the examples, but the specification says of the second field: "1-n characters with 0-n spaces". Your solution will break when the second field includes a space.
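To make the failure mode concrete, here is a small sketch that wraps the split/splice approach in a hypothetical helper, `split_splice`, and feeds it a made-up record whose second field is "POL N". That value is an assumption for illustration; it does not appear in the sample data.

```perl
use strict;
use warnings;

# Same logic as the split/splice code: first 3 tokens, last 7 tokens,
# and whatever is left in the middle is treated as the pool name.
sub split_splice {
    my ($line) = @_;
    my @cols = split ' ', $line;
    my @head = splice @cols, 0, 3;
    my @tail = splice @cols, -7, 7;
    return join ';', @head, join(' ', @cols), @tail;
}

# Second field without a space: fields land where expected.
print split_splice('003 POLN 0 Bad name 13453 2 61443 80 - OPEN N'), "\n";

# Second field "POL N" (hypothetical): "N" is taken as the third field
# and the real third field "0" is pushed into the pool name.
print split_splice('003 POL N 0 Bad name 13453 2 61443 80 - OPEN N'), "\n";
```

The second call yields `003;POL;N;0 Bad name;...`, so every field after the first is shifted without any error being raised.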

      CountZero

      "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      My blog: Imperial Deltronics
Re: Regular Expression - delimiter/spaces problem
by Util (Priest) on Feb 14, 2014 at 21:50 UTC

    toolic++ ; By handling the first line separately, you can remove all the special cases that you had embedded in your regex, and thereby make it harder to mis-parse your data. The version below correctly handles whitespace in the second field.

    Working, tested code:

    #!/usr/bin/env perl
    use Modern::Perl;

    my $dash_or_digits     = qr{ (?: - | \d+ ) }msx;
    my $string_with_spaces = qr{ \w [\s\w]*?   }msx;

    my $re = qr{
        \A
        \s* ( \d+                 ) # PID
        \s+ ( $string_with_spaces ) # POLS
        \s+ ( \d+                 ) # U(%)
        \s+ ( $string_with_spaces ) # POOL_NAME
        \s+ ( \d+                 ) # Seq#
        \s+ ( \d+                 ) # Num
        \s+ ( \d+                 ) # LDEV#
        \s+ ( $dash_or_digits     ) # H(%)
        \s+ ( $dash_or_digits     ) # VCAP(%)
        \s+ ( \w+                 ) # TYPE
        \s+ ( \w+                 ) # PM
        \s*
        \z
    }msx;

    my $first_line = <DATA>;
    chomp $first_line;
    my @field_names = split ' ', $first_line;
    say join ';', @field_names;

    while ( <DATA> ) {
        chomp;
        my @fields = /$re/
            or die;
        warn if scalar(@field_names) != scalar(@fields);
        say join ';', @fields;
    }
    __DATA__
    PID POLS U(%) POOL_NAME Seq# Num LDEV# H(%) VCAP(%) TYPE PM
    003 POLN 0 Bad name with spaces 13453 2 61443 80 - OPEN N
    002 POLN 52 DemoSolutions 54068 7 61454 80 - OPEN N

    Output:

    PID;POLS;U(%);POOL_NAME;Seq#;Num;LDEV#;H(%);VCAP(%);TYPE;PM
    003;POLN;0;Bad name with spaces;13453;2;61443;80;-;OPEN;N
    002;POLN;52;DemoSolutions;54068;7;61454;80;-;OPEN;N

      Thanks a lot for your solution. I tried it and it worked fine. I have a question regarding the dash_or_digits regex.

      my $dash_or_digits     = qr{ (?: - | \d+ ) }msx

      Why are you using (?:)?

      I had never used this construct before. I checked http://perldoc.perl.org/perlreref.html, but it is still not clear to me why it is used. I tried without it and that also worked:

      my $dash_or_digits     = qr{ - | \d+  }msx

      Can you please let me know the reason? Thank you.

      regards deMichi
        Why are you using (?:)?

        The "?:" reduces the effect of the parens to grouping only. That means the content of the parens will not be captured to populate a $<digit> variable ($1, $2, $3, ...). It is good practice to state exactly what is meant; if you just want to group alternatives, then (?:) is the fitting expression.
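A small sketch of the difference, and of why the version without (?:) also worked. The helper names `capture_count` and `loose_match` are hypothetical. The key fact is that a qr// object is interpolated as its own group, so the alternation cannot leak into the surrounding pattern; a plain string gives no such protection.

```perl
use strict;
use warnings;

my $grouped   = qr{ (?: - | \d+ ) }x;   # groups only, captures nothing
my $capturing = qr{ (   - | \d+ ) }x;   # groups and captures
my $bare      = qr{     - | \d+   }x;   # no explicit group; qr// still wraps it

# Number of captured values when the fragment is embedded after one
# capturing field; 0 means the match failed.
sub capture_count {
    my ($re, $text) = @_;
    my @got = $text =~ /\A ( \w+ ) \s+ $re \z/x;
    return scalar @got;
}

# The same fragment as a plain string: here the "|" splits the WHOLE
# pattern into two alternatives, so a lone number matches.
sub loose_match {
    my ($text) = @_;
    my $as_string = ' - | \d+ ';
    return $text =~ /\A ( \w+ ) \s+ $as_string \z/x ? 1 : 0;
}

print capture_count($grouped,   'OPEN 80'), "\n";   # 1
print capture_count($capturing, 'OPEN 80'), "\n";   # 2
print capture_count($bare,      'OPEN 80'), "\n";   # 1
print capture_count($bare,      '80'),      "\n";   # 0
print loose_match('80'),                    "\n";   # 1
```

So with qr// the explicit (?:) is not strictly required for correctness here, but it documents the intent and keeps the fragment safe if it is ever reused as a plain string.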

        perl -le'print map{pack c,($-++?1:13)+ord}split//,ESEL'
Re: Regular Expression - delimiter/spaces problem
by CountZero (Bishop) on Feb 14, 2014 at 20:53 UTC
    If it has to be a regex, this works (at least with your example):
    use Modern::Perl;
    <DATA>;
    while (<DATA>) {
        chomp;
        my ( $PID, $POLS, $U, $POOL_NAME, $Seq, $Num, $LDEV, $H, $VCAP, $TYPE, $PM ) = /
            ^
            (\d+) \s+ (.+?) \s+ (\d+) \s+ (.+?) \s+
            (\d+) \s+ (\d+) \s+ (\d+) \s+ (\d+) \s+
            ([^ ]+) \s+ ([^ ]+) \s+ ([^ ]+)
            $
            /x;
        say join ';',
            ( $PID, $POLS, $U, $POOL_NAME, $Seq, $Num, $LDEV, $H, $VCAP, $TYPE, $PM );
    }
    __DATA__
    PID POLS U(%) POOL_NAME Seq# Num LDEV# H(%) VCAP(%) TYPE PM
    003 POLN 0 Bad name with spaces 13453 2 61443 80 - OPEN N
    002 POLN 52 DemoSolutions 54068 7 61454 80 - OPEN N
    Output:
    003;POLN;0;Bad name with spaces;13453;2;61443;80;-;OPEN;N
    002;POLN;52;DemoSolutions;54068;7;61454;80;-;OPEN;N

    CountZero

Re: Regular Expression - delimiter/spaces problem
by kcott (Archbishop) on Feb 15, 2014 at 05:10 UTC

    G'day demichi,

    Welcome to the monastery.

    First, some issues with your post:

    • Second field:
      • Sample data shows no spaces: POLS and POLN (twice)
      • Attempted regex shows no matching of spaces for this field: /^(\w+)\s+(\w+)\s+.../
      • Description says: "1-n characters with 0-n spaces"
      I've assumed no spaces.
    • Third record, fourth field:
      • Sample data shows: DemoSolutions
      • Actual and expected output show: Demo
      I've assumed a typo.
    • Terminal spaces in sample data records:
      • First record has none
      • Second and third records have one
      • Attempted regex (/...\s+/) seems to indicate one or more (although, there's no end-of-line assertion)
      I've allowed for zero or more.
    • Actual and expected output show a terminal semicolon for each record but this doesn't equate with a field separator.
      I've used ($) to generate a final, zero-length field; you may want to change this.

    Taking those assumptions (and other points) into account, this regex achieves what I think you want:

    #!/usr/bin/env perl -l

    use strict;
    use warnings;

    my $re = qr{
        ^                                                       # start of line
        (\S+)\s+(\S+)\s+(\S+)\s+                                # 3 fields, no spaces
        (.+?)\s+                                                # 1 field, +/- spaces
        (\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)   # 7 fields, no spaces
        \s*                                                     # possible spaces
        ($)                                                     # end of line
    }x;

    print join ';' => /$re/ while <DATA>;

    __DATA__
    PID POLS U(%) POOL_NAME Seq# Num LDEV# H(%) VCAP(%) TYPE PM
    003 POLN 0 Bad name with spaces 13453 2 61443 80 - OPEN N
    002 POLN 52 DemoSolutions 54068 7 61454 80 - OPEN N

    Output:

    PID;POLS;U(%);POOL_NAME;Seq#;Num;LDEV#;H(%);VCAP(%);TYPE;PM;
    003;POLN;0;Bad name with spaces;13453;2;61443;80;-;OPEN;N;
    002;POLN;52;DemoSolutions;54068;7;61454;80;-;OPEN;N;
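The trailing semicolon in that output comes from the `($)` capture: it matches at end of line and captures an empty string, which `join` then treats as one more (empty) field. A cut-down sketch with only two fields shows the effect in isolation; the two-field patterns and the helper names `with_tail` and `without_tail` are simplifications for illustration, not part of the regex above.

```perl
use strict;
use warnings;

# ($) captures a zero-length string at end of line, so join adds a
# final separator after the last real field.
sub with_tail {
    my ($line) = @_;
    return join ';', $line =~ /^(\S+)\s+(\S+)\s*($)/;
}

# Same pattern with a plain $ anchor: no extra field, no trailing ';'.
sub without_tail {
    my ($line) = @_;
    return join ';', $line =~ /^(\S+)\s+(\S+)\s*$/;
}

print with_tail('OPEN N '),    "\n";   # OPEN;N;
print without_tail('OPEN N '), "\n";   # OPEN;N
```

Dropping the parentheses around `$` in the full regex should therefore give output without the terminal semicolon, if that is preferred.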

    -- Ken

      Hi Ken,

      thanks for your posting. Your regexp works fine! Sorry for my mistakes in my initial post.

      regards deMichi