Parsing a Variable Format String

ozboomer has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: Parsing a Variable Format String by pc88mxer (Vicar) on Jul 10, 2008 at 03:46 UTC
Combine regular expression with a dispatch table. For each regular expression you define what action should take place. In this case, the actions will add values to a hash. Parsing is performed from left to right. As matches are found the matched text is removed from the input string. Parsing stops when no patterns match or the input string is reduced to the empty string. use Data::Dumper; my %patterns = ( qr/SS\s+(\d+)/ => sub { $_->{SS} = $1 }, qr/PL\s+([=#]?)(\d+)([=#])(\d+)/ => sub { $_->{SP} = $2; $_->{LP} = $4 }, qr/PV\s+([a-z]?\d+)(\.\d+)?/ => sub { }, # to be written... qr/RL\s+(\d+)([,'"])/ => sub { $_->{RL} = $1; }, qr/SA\s+(\d+)/ => sub { $_->{SA} = $1 }, qr/DS\s+(\d+)/ => sub { $_->{DS} = $1 }, qr/CT\s+([#^]?)\s(\d+)\s+(\S+)\s+(\S+)/, => sub { }, qr/CL\s+([#^]?)(\d+)\s+(\S+)/, => sub { }, ); sub MAGIC { my $line = $_[0]; $_ = {}; $line =~ s/^\s//; LOOP: while (length($line)) { for my $k (keys %patterns) { if ($line =~ s/^$k(?:\s+\|\z)//) { $patterns{$k}->(); next LOOP; } } warn "got stuck on: $line\n"; $_->{leftover} = $line; last; } $_; } print Dumper( MAGIC("SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 1 +06 DS 93")); __END__ $VAR1 = { 'SA' => '106', 'DS' => '93', 'RL' => '126', 'SP' => '2', 'SS' => '21', 'LP' => '3' }; [download]	[reply] [d/l]
Re^2: Parsing a Variable Format String by ozboomer (Friar) on Jul 10, 2008 at 06:13 UTC
OOooo, that's certainly elegant :) I've been trying to get things happening... with some success. My regex is still woeful... but I have a vague idea of what's going on here.. I still can't get the CT/CL items "usable" though, viz: use Data::Dumper; @opts = ( "SS 21M- PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93", "SS 21 + PL 2.3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93", "SS 21F PL2#3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" +, "SS 21#+ PL2.3# PV 51.3 CL 110 -13 RL 126' SA 106 DS 93 +", "SS 32 - PL2.3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" ); my ($ss, $mstate, $mvote); my %patterns = ( qr/SS\s+(\d+)([MF# ])([\+\- ])/ => sub { $ss = $1; $mstate = $2; $mvote = $3; }, # qr/PL\s+([=#]?)(\d+)([=#]?)(\d+)/ # qr/PL\s+(\d+)([\.\#])(\d+)([\.\#])/ qr/PL\s(\d+)([\#\.])(\d+)([\#\.]?)/ => sub { $sp = $1; $spl = $2; $spl = "" if ($spl ne "#"); $lp = $3; $lpl = $4; $lpl = "" if ($lpl ne "#"); }, qr/PV\s([a-z]?\d+)(\.)(\d+)/ => sub { $vsp = $1; $vlp = $3 }, qr/RL\s+(\d+)([,'"]?)/ => sub { $rl = $1; $rlflg = $2; }, qr/SA\s+(\d+)/ => sub { $csa = $1 }, qr/DS\s+(\d+)/ => sub { $cds = $1 }, qr/CT\s+([#^]?)\s(\d+)\s+(\S+)\s+(\S+)/, => sub { }, qr/CL\s+([#^]?)(\d+)\s+(\S+)/, => sub { }, ); sub MAGIC { my $line = $_[0]; $_ = {}; $line =~ s/^\s//; printf("Input: >%s<\n", $line); LOOP: while (length($line)) { for my $k (keys %patterns) { if ($line =~ s/^$k(?:\s+\|\z)//) { $patterns{$k}->(); next LOOP; } } warn "got stuck on: $line\n"; $_->{leftover} = $line; last; } printf(" \$ss = $ss\n"); printf("\$mstate = >$mstate<\n"); printf(" \$mvote = >$mvote<\n"); printf(" \$sp = $sp\n"); printf(" \$spl = $spl\n"); printf(" \$lp = $lp\n"); printf(" \$lpl = $lpl\n"); printf(" \$ctl = ...\n"); printf(" \$ct = ...\n"); printf(" \$ctr = ...\n"); printf(" \$vsp = $vsp\n"); printf(" \$vlp = $vlp\n"); printf(" \$rl = $rl\n"); printf(" \$rlflg = $rlflg\n"); printf(" \$csa = $csa\n"); printf(" \$cds = $cds\n"); $_; } foreach $buf (@opts) { printf("START -----\n"); MAGIC($buf); # print Dumper( MAGIC($buf)); printf(" \n"); } [download] For example, for the 4th item in @opts, I would like to capture the info that: `Buf: "SS 21#+ PL2.3# PV 51.3 CL 110 +13 RL 126' SA 106 DS 93" Cycle Time = CT (or CL) = 110 Locks = (no #) = OFF Rotation = "that number" = 13` [download] Many thanks for the assistance; I appreciate it a lot.	[reply] [d/l] [select]
Re^3: My Favourite Regex Tools (Was: Parsing a Variable Format String) by planetscape (Chancellor) on Jul 10, 2008 at 13:27 UTC
My regex is still woeful... In addition to the Pattern Matching, Regular Expressions, and Parsing section of our very fine Tutorials, you may also find the following helpful: Mastering Regular Expressions perlrequick perlretut perlre YAPE::Regex::Explain GraphViz::Regex The Regex Coach (Win32 only, IIRC) Regex Arcana txt2re: headache relief for programmers :: regular expression generator Regex tester Perl 5.10 Advanced Regular Expressions Update: Online Regular Expression Analyzer, courtesy Ratazong (2010-08-11) HTH, planetscape	[reply]
Re^4: My Favourite Regex Tools (Was: Parsing a Variable Format String) by Anonymous Monk on Jun 02, 2014 at 07:38 UTC
Re^3: Parsing a Variable Format String by pc88mxer (Vicar) on Jul 10, 2008 at 07:58 UTC
Something like: `qr/(CT\|CL)\s+([#]?)\s*(\d+)\s+(\S+)/, => sub { $_->{CTorCL} = $1; $_->{CycleTime} = $3; $_->{Locks} = ($2 eq "#" ? "ON" : "OFF"); $_->{Rotation} = $4; }` [download] Note I put the values in a hash. Otherwise you'll have to explicitly declare or reset your variables so one call won't interfere with subsequent calls.	[reply] [d/l]
Re: Parsing a Variable Format String by toolic (Bishop) on Jul 10, 2008 at 01:52 UTC
You could try to split the string on whitespace and count the number of whitespace-separated tokens on each line: `my @items = split /\s+/, $buf; if (scalar(@items) == 13) { # process (ii) } else { # process (i) or (iii) }` [download] This assumes that the number of token in a line determines the line type. Update: try this: `use strict; use warnings; while (<DATA>) { my @items = split /\s+/, $_; if (scalar(@items) == 13) { my (@new) = ($items[4] =~ /(PV)(.)/); my (@new2) = ($items[8] =~ /(RL)(.)/); splice @items, 4, 1, @new; splice @items, 9, 1, @new2; } print "@items\n"; # now @items always contains same number of to +kens # process items... } __DATA__ SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 SS 21 PL 2#3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93 SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93` [download] prints: `SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 SS 21 PL 2#3 PV a51.3 CT^ 110 +0 RL 126, SA 106 DS 93 SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93` [download]	[reply] [d/l] [select]
Re^2: Parsing a Variable Format String by ozboomer (Friar) on Jul 10, 2008 at 02:40 UTC
Thanks for the suggestions... I'll put it in the pot(!) After some more walking and thinking, a woefully poor way of doing what I (ultimately) need might be: @opts = ( "SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93", "SS 21 PL 2.3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93", "SS 21 PL2#3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" +, "SS 21 PL2.3# PV 51.3 CL #110 +0 RL 126' SA 106 DS 93" ); foreach $buf (@opts) { printf(" 1 2 3 4 5 + 6\n"); printf(" 0123456789012345678901234567890123456789012345678901 +234567890\n"); printf("\$buf: >%s<\n", $buf); printf("\n"); $ssos = index($buf, "SS", 0); $plos = index($buf, "PL", $ssos); $pvos = index($buf, "PV", $plos); $ctos = index($buf, "CT", $pvos); if ($ctos < 0) { $ctos = index($buf, "CL", $pvos); } $rlos = index($buf, "RL", $ctos); $saos = index($buf, "SA", $rlos); $dsos = index($buf, "DS", $saos); printf("\$ssos = $ssos\n"); printf("\$plos = $plos\n"); printf("\$pvos = $pvos\n"); printf("\$ctos = $ctos\n"); printf("\$rlos = $rlos\n"); printf("\$saos = $saos\n"); printf("\$dsos = $dsos\n"); printf("\n"); $ssstr = substr($buf, $ssos+2, $plos - $ssos - 2); $plstr = substr($buf, $plos+2, $pvos - $plos - 2); $pvstr = substr($buf, $pvos+2, $ctos - $pvos - 2); $ctstr = substr($buf, $ctos+2, $rlos - $ctos - 2); $rlstr = substr($buf, $rlos+2, $saos - $rlos - 2); $sastr = substr($buf, $saos+2, $dsos - $saos - 2); $dsstr = substr($buf, $dsos+2); $ssstr =~ s/\s+//g; $plstr =~ s/\s+//g; $pvstr =~ s/\s+//g; $ctstr =~ s/\s+//g; $rlstr =~ s/\s+//g; $sastr =~ s/\s+//g; $dsstr =~ s/\s+//g; printf("\$ssstr = >$ssstr<\n"); printf("\$plstr = >$plstr<\n"); printf("\$pvstr = >$pvstr<\n"); printf("\$ctstr = >$ctstr<\n"); printf("\$rlstr = >$rlstr<\n"); printf("\$sastr = >$sastr<\n"); printf("\$dsstr = >$dsstr<\n"); printf("\n"); } # another opt [download] ...which gives a typical output: `$ssstr = >21< $plstr = >2#3< $pvstr = >51.3< $ctstr = >#110+0< $rlstr = >126'< $sastr = >106< $dsstr = >93<` [download] ...but that's pretty dashed ugly, even if it does work. Now, if I could replicate all that index/substr garbage with something more elegant...	[reply] [d/l] [select]
Re^3: Parsing a Variable Format String by jethro (Monsignor) on Jul 10, 2008 at 03:04 UTC
The more elegant is, as you already guessed, a regex. When you are looking for 'CT', the regex is `/CT/`. when you are looking for the first number after 'CT', the regex becomes`/CT .? (\d)/x`. The x at the end of the regex allows me to insert spaces so that the regex is easier to read. They don't get matched. If you really need to match a space, you can put a slash before it or use `\s` which parses anything spacy, like tab characters too When this regex matches something, it returns true. In that case what was parsed between the first and only parens is now in $1. Further parens in the regex would be stored in $2,$3,$4 and so on The `.?` matches anything, but tries to match as few characters as possible With slight variations of this regex you probably can substitute all your index thingies. You can even combine your regexes to one long regex by combining all of them with `.?` inbetween.	[reply] [d/l] [select]
Re^4: Parsing a Variable Format String by ysth (Canon) on Jul 10, 2008 at 03:45 UTC
Re: Parsing a Variable Format String by GrandFather (Saint) on Jul 10, 2008 at 07:43 UTC
The following code uses a regex to extract the various fields to an array: use strict; use warnings; while (<DATA>) { chomp; next unless length; my @items = / SS\s+ (\S+)\s+ # SS field PL\s+ (.?) # PL field - may have trailing white space PV (.?) # PV field - includes white space C[LT] (.?) # CL field - includes white space RL (.?) # RL field - includes white space SA\s+ (\S+)\s+ # SA field DS\s+ (\S+) # DS field /xi; # Allow comments and white space in regex. Ign +ore case if (@items != 7) { # Warn then skip badly formed lines warn "Unrecognised line format: $_\n"; next; } # Do stuff with the extracted fields print '>', join ('<, >', @items), "<\n"; } __DATA__ SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 SS 21 PL 2#3 PVa51.3 CT^ 110 +0 RL126, SA 106 DS 93 SS 21 PL 2#3 PV 51.3 CL #110 +0 RL 126' SA 106 DS 93 [download] Prints: `>21<, >2#3 <, > 51.3 <, > #110 +0 <, > 126' <, >106<, >93< >21<, >2#3 <, >a51.3 <, >^ 110 +0 <, >126, <, >106<, >93< >21<, >2#3 <, > 51.3 <, > #110 +0 <, > 126' <, >106<, >93<` [download] Perl is environmentally friendly - it saves trees	[reply] [d/l] [select]
Re: Parsing a Variable Format String by ozboomer (Friar) on Jul 10, 2008 at 12:38 UTC
Heh... Couldn't resist and had to have a look from home :) Hmm... Thx for those latest suggestions; I always love this about Perl - TMTOWTDI :) Will try a couple of these techniques when I get back to work tomorrow. These examples would be useful placed in the Snippets Section, I think... as extracting delimited fields (and not simple CSV-type data) from a text string is something that I certainly do a lot. Thanks again for the most useful information.	[reply]
Re: Parsing a Variable Format String by ozboomer (Friar) on Jul 11, 2008 at 02:37 UTC
Well, here we go with the final code. I needed to get the thing working today, so I opted for the following way of doing it - a combination of in/elegance :) Viz: sub Parse_SM_Sub_Header { # # [...private notes...] # # The array returned by this sub contains the following items: # # 0 Sub system # 1 Marriage state # 2 Marriage vote # 3 Split plan # 4 Split plan lock # 5 Link plan # 6 Link plan lock # 7 Split plan vote # 8 Link plan vote # 9 Cycle time lock # 10 Cycle time # 11 Rotation state # 12 Rotation amount # 13 Requested cycle time # 14 VF state # 15 Critical SA # 16 Critical SA DS # my ($line) = @_; my ($tmpbuf); my (@retarg); my %patterns = ( qr/SS\s(\d+)([MF# ])([\+\- ])/ => sub { push(@retarg, $1); # Sub system push(@retarg, $2); # Marriage state push(@retarg, $3); # Marriage vote }, qr/PL\s(\d+)([\#\.])(\d+)([\#\.]?)/ => sub { push(@retarg, $1); # Split plan $tmpbuf = $2; # Split plan lock $tmpbuf = "" if ($tmpbuf ne "#"); push(@retarg, $tmpbuf); push(@retarg, $3); # Link plan $tmpbuf = $4; # Link plan lock $tmpbuf = "" if ($tmpbuf ne "#"); push(@retarg, $tmpbuf); }, qr/PV\s([a-z]?\d+)(\.)(\d+)/ => sub { push(@retarg, $1); # Split plan vote push(@retarg, $3); # Link plan vote }, qr/C[LT]\s([\#\^]?)\s(\d+)\s([\+\-])(\d+)\s/, => sub { push(@retarg, $1); # Cycle time lock push(@retarg, $2); # Cycle time push(@retarg, $3); # Rotation state push(@retarg, $4); # Rotation amount }, qr/RL\s(\d+)([,'"]?)/ => sub { push(@retarg, $1); # Requested cycle time push(@retarg, $2); # VF state }, qr/SA\s+(\d+)/ => sub { push(@retarg, $1); }, # Critical SA qr/DS\s+(\d+)/ => sub { push(@retarg, $1); }, # Critical SA DS ); $_ = {}; @retarg = (); $line =~ s/^\s*//; LOOP: while (length($line)) { for my $k (keys %patterns) { if ($line =~ s/^$k(?:\s+\|\z)//) { $patterns{$k}->(); next LOOP; } } warn "Parse_SM_Sub_Header: Error parsing input\n + \\$line\\\n"; $_->{leftover} = $line; last; } $_; return(@retarg); } # end Parse_SM_Sub_Header [download] It works quite happily on all the versions of outputs that I have to deal with, so I think we'll be right for a while :) Again, many thanks for everyone's help.	[reply] [d/l]


more useful options
	PerlMonks