To parse or not to parse

by set_uk (Pilgrim)
on Nov 13, 2003 at 22:25 UTC ( [id://306952] )

set_uk has asked for the wisdom of the Perl Monks concerning the following question:

Eight months ago I wrote a Perl script to perform the following functions:-

Open a file, read the input, identify garbage lines via regexes, and remove them.

Read the file again (I know, I know: already inefficient), split it into records, split the records into columns, and then store specific columns in a hash keyed on record id.

Once we had all the records, the script would insert them into an Oracle database, which another part of the script would query, tag, and report on. It worked and I was happy.

In the interim I have learnt a lot about structure, packages, modules, and regexes (with a lot of help from the monks - thanks, Monks!).

I now feel the need for a rewrite and a new approach, as it needs to be quicker and more robust than it is. Plus, it will be an opportunity to learn something new. The garbage identification mechanism needs constant updating as the garbage varies, so I need to state what I will accept rather than what I won't accept. (It's output from Meridian Voice Switches, for those that are interested.) I am also extending the functionality so that it will be able to parse multiple output types.

I need to end up with a data structure containing data structures of valid records.

Reading around, I get the feeling I should be using IO::Filter, or implementing the Filtered IO idea from TheDamian's OOPerl, to remove the garbage from my file, and then perhaps use Parse::RecDescent to parse the file, validate the records, and create the data structure. At the same time, I am conscious that I don't want to make it more complex than it needs to be.

This seems like a very common task to want to perform. My questions are:-

What approach would you take to a problem of this kind? I think I am at that point, but what criteria do I use to determine whether I should stop using plain regexes and consider using Parse::RecDescent?

Typical output of a record is posted on set_uk's scratchpad, to show the complexity of the problem.

General rules are:-

The first word at the beginning of each column is its key. There are a lot of valid key types (1000+), and anything not starting with a key is garbage. Unless:-

If the first col is blank, then the data belongs to a key on the previous col.

If the first col is the same as the previous, then the data belongs to the previous col.
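
To make the "state what I will accept" idea concrete, here is a minimal sketch of a whitelist-driven single pass. Everything in it is illustrative: %valid_keys stands in for the real 1000+ key list, and the assumption that a TN line opens a new record is a guess at the record layout.

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical whitelist -- the real one would hold the 1000+ valid key types
my %valid_keys = map { $_ => 1 } qw(TN TYPE CDEN CUST DATE DES);

my %records;                 # record id => { key => [ data, ... ] }
my ($rec_id, $last_key);

while (my $line = <>) {
    $line =~ s/\r?\n\z//;                   # strip DOS or Unix line endings

    if ($line =~ /^(\S+)\s+(.*)/ and $valid_keys{$1}) {
        my ($key, $data) = ($1, $2);
        # Assumption for this sketch: a TN line opens a new record,
        # keyed on the rest of the line
        $rec_id   = $data if $key eq 'TN';
        $last_key = $key;
        push @{ $records{$rec_id}{$key} }, $data if defined $rec_id;
    }
    elsif ($line =~ /^\s+(\S.*)/ and defined $rec_id and defined $last_key) {
        # Blank first column: the data belongs to the previous line's key
        push @{ $records{$rec_id}{$last_key} }, $1;
    }
    # Everything else fails the whitelist and is dropped as garbage
}

The point is that each line either starts with a known key, continues the previous key, or is garbage; nothing has to be enumerated on the reject side.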

I'd be interested to hear what you think. No doubt there are shortcuts to this type of problem that I am not aware of.

Simon

Replies are listed 'Best First'.
Re: To parse or not to parse
by Abigail-II (Bishop) on Nov 13, 2003 at 23:03 UTC
    Why?

    You have a program that has been working fine for 8 months. It breaks the problem up into small pieces, and it uses a simple solution for each of the pieces. From the description you are giving, about the only thing against it is that it reads the file twice, and that some streamlining could have been programmed in. But overall, you have a pretty good program, and it works.

    But now you have an urge to rewrite it, and, from what I can tell, in a more complex way. I'd say: don't do it.

    I think I am at that point but what criteria do I use to determine whether I should stop using plain regexes and consider using Parse::RecDescent?
    Never. If you can do it with plain regexes, do it with plain regexes. Regexes can backtrack, but Parse::RecDescent will backtrack much more, unless you put a lot of work into it. Parse::RecDescent is slow. You'd only use Parse::RecDescent if your problem can't be solved with plain regexes, and even then you might want to consider regexes with advanced features over Parse::RecDescent. I can't prove it (yet), but I think that anything that can be parsed with a Parse::RecDescent parser can be parsed with a regex (except perhaps cases where you modify the parser mid-parse).

    Abigail

      That's the kind of steer that I need. Perhaps you can help with the problem of eliminating the garbage in a more efficient manner.

      I currently use the following to identify and remove garbage from a file. I compile the regexes because I am using them repeatedly on different lines in the file. I have a vague recollection that (?<![A-Z])(AUD.*), with its negative lookbehind, causes a large overhead. It runs really slowly, hogging all the memory, and takes 20 minutes on a file with 150,000 lines. No doubt this can be improved:-

      sub ValidateInputFile {
          my ($line, $llen, $LineNo, $ErrChar, $ErrCount, $mtch_junk,
              $tncount, @check, $SOF_fh);
          $tncount  = 0;
          $ErrCount = 0;

          $SOF_fh  = OpenFile($fqd_sof, "READ");
          $SOFV_fh = OpenFile($fqd_sof . 'validated', "WRITE");
          $SOFR_fh = OpenFile($fqd_sof . 'removed',   "WRITE");

          @HeaderGarbage = ('^LD 20', '^PT0000 ', '^REQ\: PRT', '^TYPE\: TNB',
                            '^TN $', '^CDEN ', '^CUST ', '^DATE ', '^PAGE ',
                            '^DES ');
          foreach $RegEx (@HeaderGarbage) {
              $CHRegex = qr/($RegEx\n)/;
              push @HeaderRegExes, $CHRegex;
          }

          @ErrCodes = ('(^\*.*\n)', '(ADMIN VSID.*)', '(BUG.*\n)',
                       '(?<![A-Z])(AUD.*)', '(SCH.*\n)', '(MAINT.*\n)',
                       '(OPRDATA.*\n)', '(TIM000.*\n)', '(DTC103.*)',
                       '(DTC001.*)', '(DSL.*TIME.*\n)', '(FIR110.*\n)',
                       '(FIR103.*\n)', '(^ERR.*\n)', '(DCH:.*)',
                       '(^VAS008.*\n)', '(^INDEX.*VTN.*\n)', '(^IND.*\n)',
                       '(^ NET 56 P.*\n)',
                       '(00A1491A 00000C16 00000000 000014F0 00000000 00000000 00000000 128)',
                       '(0000 00000000 0 00000000 00000000 00000000 00000000 00000000 00000000)');
          foreach $RegEx (@ErrCodes) {
              $CBRegex = qr/$RegEx/;
              push @BodyRegExes, $CBRegex;
          }

          while ($line = ($SOF_fh->getline)) {
              $LineNo++;
              $llen = length($line);

              # Remove the DOS line endings
              $line =~ s/\r\n/\n/;

              # Following code removes spurious switch codes that can't be suppressed
              # First let's strip the first 20 lines of the TNB file of garbage
              if ($LineNo < 30) {
                  #print $LineNo."\n";
                  #print $line;
                  foreach $ErrChar (@HeaderRegExes) {
                      undef $mtch_junk;
                      #print "Checking ".$ErrChar."\n";
                      if ($line =~ /$ErrChar/) {
                          $ErrCount++;
                          #print " Found ".$ErrChar."\n";
                          $line =~ s/$ErrChar//;
                          $mtch_junk = $1;
                          if ($LineNo != $LastInvLine) {
                              # Only print the line the first time we go over it,
                              # because $1 isn't reset if a subsequent attempt to match fails
                              if ($mtch_junk) {
                                  $LastInvLine = $LineNo;
                                  $SOFR_fh->write("$LineNo($ErrChar)->$mtch_junk\n");
                              }
                          }
                      }
                  }
              }

              # Now attempt to remove spurious codes
              foreach $ErrChar (@BodyRegExes) {
                  undef $mtch_junk;
                  # Need to ensure this doesn't match BAUD, just AUD
                  if ($line =~ /$ErrChar/) {
                      $ErrCount++;
                      $line =~ s/$ErrChar//;
                      $mtch_junk = $1;
                      if ($LineNo != $LastInvLine) {
                          # Only print the line the first time we go over it,
                          # because $1 isn't reset if a subsequent attempt to match fails
                          if ($mtch_junk) {
                              $LastInvLine = $LineNo;
                              $SOFR_fh->write("$LineNo($ErrChar)->$mtch_junk\n");
                          }
                      }
                  }
              }

              if ($line =~ /^(TN\s+\d+\s+\d+)/) {
                  $tncount++;
                  #$LOG_fh->write(msg("INFO","Found $1","FORMAT"),8);
                  #print $1."\n";
              }
              $SOFV_fh->write($line);
          }

          $LOG_fh->write(msg("INFO", "Removed $ErrCount occurrences of non TN data", "FORMAT"), 4);
          msg("INFO", "Removed $ErrCount occurrences of non TN data", "PRINT");
          $LOG_fh->write(msg("INFO", "Found $tncount occurrences of TN at beginning of line", "FORMAT"), 4);
          msg("INFO", "Found $tncount occurrences of TN at beginning of line", "PRINT");

          $SwitchCount{$switch_id}{'REMOVED'}  = $ErrCount;
          $SwitchCount{$switch_id}{'TN_COUNT'} = $tncount;

          $SOF_fh->close
              or throw TNBCriticalException("Cannot close switch output file");
          undef $SOF_fh;
          undef $SOFV_fh;
          undef $SOFR_fh;
          return $tncount;
      }

      Simon

        I agree with Abigail-II that the Parse::RecDescent module is not going to make your program run faster or better; it will only introduce yet more maintenance work.

        I think two passes over the data file are not necessary. You can do it all in a single pass: just drop the garbage lines as you loop through the input file.

        Also, with the precompiled regular expressions, you did -
        foreach $RegEx (@HeaderGarbage) {
            $CHRegex = qr/($RegEx\n)/;
            push @HeaderRegExes, $CHRegex;
        }
        ...
        foreach $ErrChar (@HeaderRegExes) {
            ...
        Why not just compile everything in your header garbage list into a single regular expression?
        @HeaderGarbage = ('LD 20', 'PT0000 ', ... 'CUST ', 'DATE ', 'PAGE ', 'DES ');
        my $reg = join '|', @HeaderGarbage;
        my $HeaderRegex = qr/^$reg/;
        ...
        while (my $line = <FILE>) {
            $line =~ s/$HeaderRegex//;    # replace matching garbage
            ...
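
        The same trick applies to the body garbage list. A sketch, reusing the patterns from your sub: \bAUD stands in for the (?<![A-Z]) lookbehind, since B and A are both word characters, so \b cannot match inside BAUD.

        my @ErrCodes = ('^\*.*', 'ADMIN VSID.*', 'BUG.*', '\bAUD.*', 'SCH.*',
                        'MAINT.*', 'OPRDATA.*', 'TIM000.*', 'DTC103.*',
                        'DTC001.*', 'DSL.*TIME.*', 'FIR110.*', 'FIR103.*',
                        '^ERR.*', 'DCH:.*', '^VAS008.*', '^INDEX.*VTN.*',
                        '^IND.*', '^ NET 56 P.*');
        my $body = join '|', @ErrCodes;
        my $BodyRegex = qr/$body/;
        ...
        while (my $line = <FILE>) {
            next if $line =~ $BodyRegex;    # one test drops a garbage line
            ...

        If a garbage code can share a line with good data, substitute $line =~ s/$BodyRegex// for the next.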
Re: To parse or not to parse
by duff (Parson) on Nov 13, 2003 at 23:59 UTC

    If you feel you need a new approach because of speed concerns, you almost certainly don't want to use Parse::RecDescent. It's nice as far as abstractions go, but it is quite slow in my experience.

    But, if you think the code is crufty, you can do better with your advanced knowledge of Perl, and you have the time/resources to rewrite the code, go for it. It sounds like you've got a handle on a potential problem with the original code anyway (it matches the variable error lines rather than the not-so-variable keep lines). If speed really is a concern, you should profile the existing code first. It may be that your current approach to matching error lines is plenty fast and that the bottleneck is really somewhere else.
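
    For instance, Devel::DProf ships with perl and needs no source changes (the script and input names here are made up):

        perl -d:DProf tnb_parse.pl switch.log    # writes tmon.out
        dprofpp                                  # report of time spent per sub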
