Beefy Boxes and Bandwidth Generously Provided by pair Networks
more useful options
 
PerlMonks  

Parsing a Variable Format String

by ozboomer (Friar)
on Jul 10, 2008 at 01:34 UTC ( [id://696590]=perlquestion: print w/replies, xml ) Need Help??

ozboomer has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to process the text output from a program where the output format keeps changing. I've tried using a format string and unpack() but this is getting unwieldy now that I have a few formats to deal with.

Not including the ">" and "<" characters, example data strings are:
>SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93< ...(i) >SS 21 PL 2#3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93< ...(ii +) >SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93< ...(ii +i)
For subsequent processing, I'd prefer to end-up with the format shown in (iii).

I'm thinking along the lines of using some sort of regex but I just can't get my head around the silly things. Maybe I can do the 'extraction' of the fields one by one or perhaps I can do it all in one hit, I don't know. Performance level isn't a huge priority, as most of the programs that need to do this run as overnight batch jobs.

I guess the sort of thing I'm looking for is something like:

(@items) = MAGIC($buf);
...after which I can glean the following info:
SS 21 => SS = 21 PL=2#3 => SP = 2 ...could be m.n, m.n#, m#n, m#n# SP# = yes LP = 3 LP# = no PVa51.3 => SP vote = a51 ...could be x.y LP vote = 3 CT^ 110 + 0 => CT = 110 CT# = yes Rot state = "+" Rot value = 0 RL126, => RL = 126 flag1 = true ...shown by , or ' flag2 = no ...shown by " SA 106 DS 93 => SA = 106 DS for SA = 93
Apologies for the cryptic/variable descriptions -- some of the ways the software formats its output is a little scatter-brained IMO(!)

I'd appreciate any suggestions on how I might attack the problem.

Thanks a heap.

Replies are listed 'Best First'.
Re: Parsing a Variable Format String
by pc88mxer (Vicar) on Jul 10, 2008 at 03:46 UTC
    Combine regular expression with a dispatch table. For each regular expression you define what action should take place. In this case, the actions will add values to a hash.

    Parsing is performed from left to right. As matches are found the matched text is removed from the input string. Parsing stops when no patterns match or the input string is reduced to the empty string.

    use Data::Dumper; my %patterns = ( qr/SS\s+(\d+)/ => sub { $_->{SS} = $1 }, qr/PL\s+([=#]?)(\d+)([=#])(\d+)/ => sub { $_->{SP} = $2; $_->{LP} = $4 }, qr/PV\s+([a-z]?\d+)(\.\d+)?/ => sub { }, # to be written... qr/RL\s+(\d+)([,'"])/ => sub { $_->{RL} = $1; }, qr/SA\s+(\d+)/ => sub { $_->{SA} = $1 }, qr/DS\s+(\d+)/ => sub { $_->{DS} = $1 }, qr/CT\s+([#^]?)\s*(\d+)\s+(\S+)\s+(\S+)/, => sub { }, qr/CL\s+([#^]?)(\d+)\s+(\S+)/, => sub { }, ); sub MAGIC { my $line = $_[0]; $_ = {}; $line =~ s/^\s*//; LOOP: while (length($line)) { for my $k (keys %patterns) { if ($line =~ s/^$k(?:\s+|\z)//) { $patterns{$k}->(); next LOOP; } } warn "got stuck on: $line\n"; $_->{leftover} = $line; last; } $_; } print Dumper( MAGIC("SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 1 +06 DS 93")); __END__ $VAR1 = { 'SA' => '106', 'DS' => '93', 'RL' => '126', 'SP' => '2', 'SS' => '21', 'LP' => '3' };
      OOooo, that's certainly elegant :)

      I've been trying to get things happening... with some success.

      My regex is still woeful... but I have a vague idea of what's going on here.. I still can't get the CT/CL items "usable" though, viz:
      use Data::Dumper; @opts = ( "SS 21M- PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93", "SS 21 + PL 2.3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93", "SS 21F PL2#3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" +, "SS 21#+ PL2.3# PV 51.3 CL 110 -13 RL 126' SA 106 DS 93 +", "SS 32 - PL2.3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" ); my ($ss, $mstate, $mvote); my %patterns = ( qr/SS\s+(\d+)([MF# ])([\+\- ])/ => sub { $ss = $1; $mstate = $2; $mvote = $3; }, # qr/PL\s+([=#]?)(\d+)([=#]?)(\d+)/ # qr/PL\s+(\d+)([\.\#])(\d+)([\.\#])/ qr/PL\s*(\d+)([\#\.])(\d+)([\#\.]?)/ => sub { $sp = $1; $spl = $2; $spl = "" if ($spl ne "#"); $lp = $3; $lpl = $4; $lpl = "" if ($lpl ne "#"); }, qr/PV\s*([a-z]?\d+)(\.)(\d+)/ => sub { $vsp = $1; $vlp = $3 }, qr/RL\s+(\d+)([,'"]?)/ => sub { $rl = $1; $rlflg = $2; }, qr/SA\s+(\d+)/ => sub { $csa = $1 }, qr/DS\s+(\d+)/ => sub { $cds = $1 }, qr/CT\s+([#^]?)\s*(\d+)\s+(\S+)\s+(\S+)/, => sub { }, qr/CL\s+([#^]?)(\d+)\s+(\S+)/, => sub { }, ); sub MAGIC { my $line = $_[0]; $_ = {}; $line =~ s/^\s*//; printf("Input: >%s<\n", $line); LOOP: while (length($line)) { for my $k (keys %patterns) { if ($line =~ s/^$k(?:\s+|\z)//) { $patterns{$k}->(); next LOOP; } } warn "got stuck on: $line\n"; $_->{leftover} = $line; last; } printf(" \$ss = $ss\n"); printf("\$mstate = >$mstate<\n"); printf(" \$mvote = >$mvote<\n"); printf(" \$sp = $sp\n"); printf(" \$spl = $spl\n"); printf(" \$lp = $lp\n"); printf(" \$lpl = $lpl\n"); printf(" \$ctl = ...\n"); printf(" \$ct = ...\n"); printf(" \$ctr = ...\n"); printf(" \$vsp = $vsp\n"); printf(" \$vlp = $vlp\n"); printf(" \$rl = $rl\n"); printf(" \$rlflg = $rlflg\n"); printf(" \$csa = $csa\n"); printf(" \$cds = $cds\n"); $_; } foreach $buf (@opts) { printf("START -----\n"); MAGIC($buf); # print Dumper( MAGIC($buf)); printf(" \n"); }
      For example, for the 4th item in @opts, I would like to capture the info that:

      Buf: "SS 21#+ PL2.3# PV 51.3 CL 110 +13 RL 126' SA 106 DS 93" Cycle Time = CT (or CL) = 110 Locks = (no #) = OFF Rotation = "that number" = 13
      Many thanks for the assistance; I appreciate it a lot.

        Something like:
        qr/(CT|CL)\s+([#]?)\s*(\d+)\s+(\S+)/, => sub { $_->{CTorCL} = $1; $_->{CycleTime} = $3; $_->{Locks} = ($2 eq "#" ? "ON" : "OFF"); $_->{Rotation} = $4; }
        Note I put the values in a hash. Otherwise you'll have to explicitly declare or reset your variables so one call won't interfere with subsequent calls.
Re: Parsing a Variable Format String
by toolic (Bishop) on Jul 10, 2008 at 01:52 UTC
    You could try to split the string on whitespace and count the number of whitespace-separated tokens on each line:
    my @items = split /\s+/, $buf; if (scalar(@items) == 13) { # process (ii) } else { # process (i) or (iii) }

    This assumes that the number of token in a line determines the line type.

    Update: try this:

    use strict; use warnings; while (<DATA>) { my @items = split /\s+/, $_; if (scalar(@items) == 13) { my (@new) = ($items[4] =~ /(PV)(.*)/); my (@new2) = ($items[8] =~ /(RL)(.*)/); splice @items, 4, 1, @new; splice @items, 9, 1, @new2; } print "@items\n"; # now @items always contains same number of to +kens # process items... } __DATA__ SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 SS 21 PL 2#3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93 SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93

    prints:

    SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 SS 21 PL 2#3 PV a51.3 CT^ 110 +0 RL 126, SA 106 DS 93 SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93
      Thanks for the suggestions... I'll put it in the pot(!)

      After some more walking and thinking, a woefully poor way of doing what I (ultimately) need might be:
      @opts = ( "SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93", "SS 21 PL 2.3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93", "SS 21 PL2#3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" +, "SS 21 PL2.3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" ); foreach $buf (@opts) { printf(" 1 2 3 4 5 + 6\n"); printf(" 0123456789012345678901234567890123456789012345678901 +234567890\n"); printf("\$buf: >%s<\n", $buf); printf("\n"); $ssos = index($buf, "SS", 0); $plos = index($buf, "PL", $ssos); $pvos = index($buf, "PV", $plos); $ctos = index($buf, "CT", $pvos); if ($ctos < 0) { $ctos = index($buf, "CL", $pvos); } $rlos = index($buf, "RL", $ctos); $saos = index($buf, "SA", $rlos); $dsos = index($buf, "DS", $saos); printf("\$ssos = $ssos\n"); printf("\$plos = $plos\n"); printf("\$pvos = $pvos\n"); printf("\$ctos = $ctos\n"); printf("\$rlos = $rlos\n"); printf("\$saos = $saos\n"); printf("\$dsos = $dsos\n"); printf("\n"); $ssstr = substr($buf, $ssos+2, $plos - $ssos - 2); $plstr = substr($buf, $plos+2, $pvos - $plos - 2); $pvstr = substr($buf, $pvos+2, $ctos - $pvos - 2); $ctstr = substr($buf, $ctos+2, $rlos - $ctos - 2); $rlstr = substr($buf, $rlos+2, $saos - $rlos - 2); $sastr = substr($buf, $saos+2, $dsos - $saos - 2); $dsstr = substr($buf, $dsos+2); $ssstr =~ s/\s+//g; $plstr =~ s/\s+//g; $pvstr =~ s/\s+//g; $ctstr =~ s/\s+//g; $rlstr =~ s/\s+//g; $sastr =~ s/\s+//g; $dsstr =~ s/\s+//g; printf("\$ssstr = >$ssstr<\n"); printf("\$plstr = >$plstr<\n"); printf("\$pvstr = >$pvstr<\n"); printf("\$ctstr = >$ctstr<\n"); printf("\$rlstr = >$rlstr<\n"); printf("\$sastr = >$sastr<\n"); printf("\$dsstr = >$dsstr<\n"); printf("\n"); } # another opt
      ...which gives a typical output:
      $ssstr = >21< $plstr = >2#3< $pvstr = >51.3< $ctstr = >#110+0< $rlstr = >126'< $sastr = >106< $dsstr = >93<
      ...but that's pretty dashed ugly, even if it does work.

      Now, if I could replicate all that index/substr garbage with something more elegant...

        The more elegant is, as you already guessed, a regex.

        When you are looking for 'CT', the regex is /CT/. when you are looking for the first number after 'CT', the regex becomes/CT .*? (\d*)/x. The x at the end of the regex allows me to insert spaces so that the regex is easier to read. They don't get matched. If you really need to match a space, you can put a slash before it or use \s which parses anything spacy, like tab characters too

        When this regex matches something, it returns true. In that case what was parsed between the first and only parens is now in $1. Further parens in the regex would be stored in $2,$3,$4 and so on

        The .*? matches anything, but tries to match as few characters as possible

        With slight variations of this regex you probably can substitute all your index thingies.

        You can even combine your regexes to one long regex by combining all of them with .*? inbetween.

Re: Parsing a Variable Format String
by GrandFather (Saint) on Jul 10, 2008 at 07:43 UTC

    The following code uses a regex to extract the various fields to an array:

    use strict; use warnings; while (<DATA>) { chomp; next unless length; my @items = / SS\s+ (\S+)\s+ # SS field PL\s+ (.*?) # PL field - may have trailing white space PV (.*?) # PV field - includes white space C[LT] (.*?) # CL field - includes white space RL (.*?) # RL field - includes white space SA\s+ (\S+)\s+ # SA field DS\s+ (\S+) # DS field /xi; # Allow comments and white space in regex. Ign +ore case if (@items != 7) { # Warn then skip badly formed lines warn "Unrecognised line format: $_\n"; next; } # Do stuff with the extracted fields print '>', join ('<, >', @items), "<\n"; } __DATA__ SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 SS 21 PL 2#3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93 SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93

    Prints:

    >21<, >2#3 <, > 51.3 <, > #110 +0 <, > 126' <, >106<, >93< >21<, >2#3 <, >a51.3 <, >^ 110 +0 <, >126, <, >106<, >93< >21<, >2#3 <, > 51.3 <, > #110 +0 <, > 126' <, >106<, >93<

    Perl is environmentally friendly - it saves trees
Re: Parsing a Variable Format String
by ozboomer (Friar) on Jul 10, 2008 at 12:38 UTC
    Heh... Couldn't resist and had to have a look from home :)

    Hmm... Thx for those latest suggestions; I always love this about Perl - TMTOWTDI :)

    Will try a couple of these techniques when I get back to work tomorrow. These examples would be useful placed in the Snippets Section, I think... as extracting delimited fields (and not simple CSV-type data) from a text string is something that I certainly do a lot.

    Thanks again for the most useful information.

Re: Parsing a Variable Format String
by ozboomer (Friar) on Jul 11, 2008 at 02:37 UTC
    Well, here we go with the final code.

    I needed to get the thing working today, so I opted for the following way of doing it - a combination of in/elegance :) Viz:

    sub Parse_SM_Sub_Header { # # [...private notes...] # # The array returned by this sub contains the following items: # # 0 Sub system # 1 Marriage state # 2 Marriage vote # 3 Split plan # 4 Split plan lock # 5 Link plan # 6 Link plan lock # 7 Split plan vote # 8 Link plan vote # 9 Cycle time lock # 10 Cycle time # 11 Rotation state # 12 Rotation amount # 13 Requested cycle time # 14 VF state # 15 Critical SA # 16 Critical SA DS # my ($line) = @_; my ($tmpbuf); my (@retarg); my %patterns = ( qr/SS\s*(\d+)([MF# ])([\+\- ])/ => sub { push(@retarg, $1); # Sub system push(@retarg, $2); # Marriage state push(@retarg, $3); # Marriage vote }, qr/PL\s*(\d+)([\#\.])(\d+)([\#\.]?)/ => sub { push(@retarg, $1); # Split plan $tmpbuf = $2; # Split plan lock $tmpbuf = "" if ($tmpbuf ne "#"); push(@retarg, $tmpbuf); push(@retarg, $3); # Link plan $tmpbuf = $4; # Link plan lock $tmpbuf = "" if ($tmpbuf ne "#"); push(@retarg, $tmpbuf); }, qr/PV\s*([a-z]?\d+)(\.)(\d+)/ => sub { push(@retarg, $1); # Split plan vote push(@retarg, $3); # Link plan vote }, qr/C[LT]\s*([\#\^]?)\s*(\d+)\s*([\+\-])(\d+)\s*/, => sub { push(@retarg, $1); # Cycle time lock push(@retarg, $2); # Cycle time push(@retarg, $3); # Rotation state push(@retarg, $4); # Rotation amount }, qr/RL\s*(\d+)([,'"]?)/ => sub { push(@retarg, $1); # Requested cycle time push(@retarg, $2); # VF state }, qr/SA\s+(\d+)/ => sub { push(@retarg, $1); }, # Critical SA qr/DS\s+(\d+)/ => sub { push(@retarg, $1); }, # Critical SA DS ); $_ = {}; @retarg = (); $line =~ s/^\s*//; LOOP: while (length($line)) { for my $k (keys %patterns) { if ($line =~ s/^$k(?:\s+|\z)//) { $patterns{$k}->(); next LOOP; } } warn "Parse_SM_Sub_Header: Error parsing input\n + \\$line\\\n"; $_->{leftover} = $line; last; } $_; return(@retarg); } # end Parse_SM_Sub_Header
    It works quite happily on all the versions of outputs that I have to deal with, so I think we'll be right for a while :)

    Again, many thanks for everyone's help.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://696590]
Approved by toolic
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others contemplating the Monastery: (7)
As of 2024-04-16 09:34 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found