How to split line with varying number of tokens?

zBernie has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.

Re: How to split line with varying number of tokens?
by davido (Cardinal) on Apr 28, 2013 at 06:18 UTC

If the FROM field is the only real wild-card then be specific about what you do know, and relaxed about what you don't. By anchoring with specifics to the left and the right of the FROM field, you can relax your specification of that one field and still build a relatively robust regular expression:

while( my $line = <DATA> ) {

  print $line;

  chomp $line;

  my( $reqid, $dest, $from, $date, $time, $npages, $rcv )
    = $line =~ m[
      ^                       # Beginning of input line.
      (\d+)\s+                # REQID
      (\w+)\s+                # DEST
      (\S.*?\S)\s+            # FROM (Accept non-space, anything [non-
                              # greedily], non-space)
      (\d{1,2}/\d{1,2})\s+    # DATE
      (\d{1,2}:\d{1,2})\s+    # TIME
      (\d+)\s+                # nPages
      (\w+)\s*                # RCV
      $                       # End of input line.
    ]x;

  print "REQID: [$reqid]\tDEST: [$dest]\tFROM: [$from]\n";
  print "DATE: [$date]\tTIME: [$time]\n";
  print "nPages: [$npages]\tRCV: [$rcv]\n\n";
}

(I'm assuming that your columns not being vertically aligned is not a typo; i.e., that the fields aren't fixed length. If they are fixed length, this solution would be silly.)

Dave


Re: How to split line with varying number of tokens?
by kcott (Archbishop) on Apr 28, 2013 at 07:22 UTC

G'day zBernie,

Is the original data in a fixed format? If so, you can use unpack:

#!/usr/bin/env perl

use 5.010;
use strict;
use warnings;

while (<DATA>) {
    say '[', join(']~[' => map { s/\s*$//; $_ } unpack 'A8A14A22A6A9A3A*'), ']';
}

__DATA__
138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv     
138446  custsvc       973 618 0577          12/26 18:44    1  rcv     
138445  county2       spam                  12/26 18:41    3  rcv     
138444  custsvc       spam                  12/26 18:30    1  rcv
138439  county2       7182737253            12/26 17:54    2  rcv     
138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

Output:

$ pm_split_space_sep_log.pl
[138454]~[mail_room]~[Marco's Pizza]~[12/26]~[21:52]~[1]~[rcv]
[138446]~[custsvc]~[973 618 0577]~[12/26]~[18:44]~[1]~[rcv]
[138445]~[county2]~[spam]~[12/26]~[18:41]~[3]~[rcv]
[138444]~[custsvc]~[spam]~[12/26]~[18:30]~[1]~[rcv]
[138439]~[county2]~[7182737253]~[12/26]~[17:54]~[2]~[rcv]
[138438]~[county2]~[Acme Products, Inc.]~[12/26]~[17:52]~[1]~[rcv]

-- Ken


Re: How to split line with varying number of tokens?
by Athanasius (Archbishop) on Apr 28, 2013 at 04:38 UTC

If you can be sure that the FROM field never contains a string resembling the following DATE field, you can take this approach:

#! perl
use strict;
use warnings;

<DATA>;     # Discard header

while (<DATA>)
{
    chomp;
    my @tokens = split /\s+/;
    my @fields;

    for (my $i = 0; $i < @tokens; ++$i)
    {
        if ($i == 2)
        {
            my $from = $tokens[$i];

            until ($tokens[++$i] =~ m! ^ \d{1,2} / \d{1,2} $ !x)
            {
                $from .= ' ' . $tokens[$i];
            }

            push @fields, $from;
        }

        push @fields, $tokens[$i];
    }

    print join('|', @fields), "\n";    
}

__DATA__
REQID     DEST             FROM                     DATE   TIME    nPages  RCV
138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv     
138446  custsvc      973 618 0577          12/26 18:44    1  rcv     
138445  county2       spam                  12/26 18:41    3  rcv     
138444  custsvc      spam                  12/26 18:30    1  rcv
138439  county2       7182737253            12/26 17:54    2  rcv     
138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

Output:

14:30 >perl 614_SoPW.pl
138454|mail_room|Marco's Pizza|12/26|21:52|1|rcv
138446|custsvc|973 618 0577|12/26|18:44|1|rcv
138445|county2|spam|12/26|18:41|3|rcv
138444|custsvc|spam|12/26|18:30|1|rcv
138439|county2|7182737253|12/26|17:54|2|rcv
138438|county2|Acme Products, Inc.|12/26|17:52|1|rcv

14:30 >

Update: If you know that only the third field can contain spaces, a better approach may be as follows:

shift @tokens twice to get the first two fields
pop @tokens four times to get the last four fields
join(' ', @tokens) to get the third, remaining field.
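
The steps above can be sketched roughly as follows. This is only an illustration, assuming that FROM is the only field that can contain whitespace and that every record has all seven fields; parse_record is a hypothetical helper name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into its seven fields,
# assuming only FROM (the third field) may contain whitespace.
sub parse_record {
    my ($line) = @_;
    my @tokens = split ' ', $line;   # awk-style split: skips leading/trailing whitespace
    return if @tokens < 7;           # too few tokens to hold all seven fields

    my $reqid = shift @tokens;       # first two fields come off the front
    my $dest  = shift @tokens;
    my $rcv   = pop @tokens;         # last four fields come off the back
    my $pages = pop @tokens;
    my $time  = pop @tokens;
    my $date  = pop @tokens;
    my $from  = join ' ', @tokens;   # whatever remains is FROM

    return ($reqid, $dest, $from, $date, $time, $pages, $rcv);
}

print join('|', parse_record("138446  custsvc      973 618 0577          12/26 18:44    1  rcv")), "\n";
# prints: 138446|custsvc|973 618 0577|12/26|18:44|1|rcv
```

Note that runs of whitespace inside FROM are collapsed to single spaces by the join, and no check is made that DATE or TIME actually look like a date or a time.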

Update 29th April: Tidied the code.

Hope that helps,

Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,


Re: How to split line with varying number of tokens?
by jwkrahn (Abbot) on Apr 28, 2013 at 06:04 UTC

$ echo "REQID     DEST             FROM                     DATE   TIME    nPages  RCV
138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv     
138446  custsvc      973 618 0577          12/26 18:44    1  rcv     
138445  county2       spam                  12/26 18:41    3  rcv     
138444  custsvc      spam                  12/26 18:30    1  rcv
138439  county2       7182737253            12/26 17:54    2  rcv     
138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv " | perl -e'
while ( <> ) {
    # Reverse the line so that the four trailing fields come first.
    my $line = reverse;
    # Split off those four (limit 5 leaves the remainder intact), then un-reverse each piece.
    my ( $rcv, $pages, $time, $date, $rest ) = map scalar reverse, split " ", $line, 5;
    # The remainder holds REQID, DEST and FROM; limit 3 keeps the spaces inside FROM.
    my ( $reqid, $dest, $from ) = split " ", $rest, 3;
    print join( "   ", map qq/"$_"/, $reqid, $dest, $from, $date, $time, $pages, $rcv ), "\n";
    }
'
"REQID"   "DEST"   "FROM"   "DATE"   "TIME"   "nPages"   "RCV"
"138454"   "mail_room"   "Marco's Pizza"   "12/26"   "21:52"   "1"   "rcv"
"138446"   "custsvc"   "973 618 0577"   "12/26"   "18:44"   "1"   "rcv"
"138445"   "county2"   "spam"   "12/26"   "18:41"   "3"   "rcv"
"138444"   "custsvc"   "spam"   "12/26"   "18:30"   "1"   "rcv"
"138439"   "county2"   "7182737253"   "12/26"   "17:54"   "2"   "rcv"
"138438"   "county2"   "Acme Products, Inc."   "12/26"   "17:52"   "1"   "rcv"

Re: How to split line with varying number of tokens?
by hdb (Monsignor) on Apr 28, 2013 at 06:11 UTC

As you know what the first 2 fields are and what the last 4 fields are, everything in between would be the name. So you could re-join the fields in the middle, possibly distorting the white space.

use strict;
use warnings;
<DATA>;
while(<DATA>){
  chomp;
  my @line = split /\s+/;
  my $from = join( " ", splice( @line, 2, $#line-5) );
  my ($reqid, $dest, $date, $time, $pages, $rcv) = @line;
  print join "|", ($reqid, $dest, $from, $date, $time, $pages, $rcv);
  print "\n";
}
__DATA__
REQID     DEST             FROM                     DATE   TIME    nPages  RCV
138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv     
138446  custsvc      973 618 0577          12/26 18:44    1  rcv     
138445  county2       spam                  12/26 18:41    3  rcv     
138444  custsvc      spam                  12/26 18:30    1  rcv
138439  county2       7182737253            12/26 17:54    2  rcv     
138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

Re^2: How to split line with varying number of tokens?
by AnomalousMonk (Archbishop) on Apr 29, 2013 at 00:33 UTC

... re-join the fields in the middle, possibly distorting the white space.

I, too, wondered about the significance of embedded whitespace in the FROM field of the data and about the fixed-field nature of the data, concerning all of which zBernie is silent in the OP and, to this moment, elsewhere in this thread. If embedded whitespace in the FROM field matters, it's simple enough to deal with it using split if the sub-strings corresponding to the separators are also captured and everything is re-assembled with a minor modification to your existing split approach. (Even so, I think I prefer a regex-based extraction approach like that of davido, which lends itself better to data validation efforts.)

>perl -wMstrict -le
"my @data = (
   'REQID     DEST             FROM                     DATE   TIME    nPages  RCV',
   '138454  mail_room     Marco`s  Pizza        12/26 21:52    1  rcv',
   '138446  custsvc      973   618    0577     12/26 18:44    1  rcv',
   '138445  county2       spam                  12/26 18:41    3  rcv',
   '138444  custsvc      spam                  12/26 18:30    1  rcv',
   '138439  county2       7182737253            12/26 17:54    2  rcv',
   '138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv',
   );
 ;;
 for my $record (@data) {
   my @fields = split /(\s+)/, $record;
   my $from = join '', splice @fields, 4, $#fields - 11;
   my ($reqid, $dest, $date, $time, $pages, $rcv) =
     @fields[ 0, 2, map { $#fields - $_ } 6, 4, 2, 0 ];
   printf qq{'%s' \n},
     join '|', $reqid, $dest, $from, $date, $time, $pages, $rcv;
   }
"
'REQID|DEST|FROM|DATE|TIME|nPages|RCV'
'138454|mail_room|Marco`s  Pizza|12/26|21:52|1|rcv'
'138446|custsvc|973   618    0577|12/26|18:44|1|rcv'
'138445|county2|spam|12/26|18:41|3|rcv'
'138444|custsvc|spam|12/26|18:30|1|rcv'
'138439|county2|7182737253|12/26|17:54|2|rcv'
'138438|county2|Acme Products, Inc.|12/26|17:52|1|rcv'

Re: How to split line with varying number of tokens?
by hdb (Monsignor) on Apr 28, 2013 at 07:07 UTC

Another alternative relies on the fact that your data is nicely vertically aligned, even if not perfectly. So you could specify which columns of characters belong to which field. This is something that Excel would also offer when importing such data.

use strict;
use warnings;
my %format = (#field  from   to  
          reqid  => [  0,  7],
          dest   => [  8, 19],
          from   => [ 20, 41],
          date   => [ 42, 48],
          time   => [ 49, 55],
          npages => [ 56, 59],
          rcv    => [ 60, 70],
);

<DATA>;
while(<DATA>){
  chomp;
  my %line;
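  # Iterate in a fixed order; keys %format would return the fields in arbitrary order.
  for my $item (qw(reqid dest from date time npages rcv)) {
    $line{$item} = substr $_, $format{$item}->[0], $format{$item}->[1]-$format{$item}->[0]+1;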
    $line{$item} =~ s/^\s*//; # remove leading spaces
    $line{$item} =~ s/\s*$//; # remove trailing spaces
    print "$item=$line{$item}, ";
  }
  print "\n";
}
__DATA__
REQID     DEST             FROM                     DATE   TIME    nPages  RCV
138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv     
138446  custsvc      973 618 0577          12/26 18:44    1  rcv     
138445  county2       spam                  12/26 18:41    3  rcv     
138444  custsvc      spam                  12/26 18:30    1  rcv
138439  county2       7182737253            12/26 17:54    2  rcv     
138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

Re: How to split line with varying number of tokens?
by igelkott (Priest) on Apr 28, 2013 at 17:31 UTC

Considering that you may have altered your data a bit (redacted) for this post, it looks like you may really have tab-separated values. If so, change to split(/\t/, $_);
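
For illustration only, a minimal sketch of what that would look like if the data really were tab-separated. The sample record is made up from the thread's data with tabs substituted for the spacing, and split_tsv is a hypothetical helper name.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tab-separated record into its fields.
sub split_tsv {
    my ($line) = @_;
    chomp $line;
    return split /\t/, $line, -1;   # limit -1 keeps trailing empty fields
}

my @fields = split_tsv("138454\tmail_room\tMarco's Pizza\t12/26\t21:52\t1\trcv\n");
print join('|', @fields), "\n";
# prints: 138454|mail_room|Marco's Pizza|12/26|21:52|1|rcv
```

Because only tabs separate the fields, the spaces inside FROM need no special handling.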


Re^2: How to split line with varying number of tokens?
by zBernie (Novice) on Apr 28, 2013 at 18:56 UTC

I wish it were tab separated!


Re: How to split line with varying number of tokens?
by jakeease (Friar) on Apr 29, 2013 at 07:36 UTC


#!/usr/bin/perl
use strict;
use warnings;

<DATA>;     # Discard header

while (<DATA>)
{
    chomp;
    my ($reqid, $dest, $from, $datetime, $pages, $rcv) = split(/\s\s+/, $_);
    my ($date, $time) = split(/\s+/, $datetime);

    print join('|', ($reqid, $dest, $from, $date, $time, $pages, $rcv)), "\n";
}

__DATA__
REQID     DEST             FROM                     DATE   TIME    nPages  RCV
138454  mail_room     Marco's Pizza         12/26 21:52    1  rcv
138446  custsvc      973 618 0577          12/26 18:44    1  rcv
138445  county2       spam                  12/26 18:41    3  rcv
138444  custsvc      spam                  12/26 18:30    1  rcv
138439  county2       7182737253            12/26 17:54    2  rcv
138438  county2       Acme Products, Inc.   12/26 17:52    1  rcv

I.e., split on two or more whitespace characters instead of one or more, then split the combined date/time field on the single space inside it. Output:


138454|mail_room|Marco's Pizza|12/26|21:52|1|rcv
138446|custsvc|973 618 0577|12/26|18:44|1|rcv
138445|county2|spam|12/26|18:41|3|rcv
138444|custsvc|spam|12/26|18:30|1|rcv
138439|county2|7182737253|12/26|17:54|2|rcv
138438|county2|Acme Products, Inc.|12/26|17:52|1|rcv

