PerlMonks
To parse or not to parse
by set_uk (Pilgrim)
on Nov 13, 2003 at 22:25 UTC ( [id://306952]=perlquestion )
set_uk has asked for the wisdom of the Perl Monks concerning the following question:
8 months ago I wrote a Perl script to perform the following functions:

Open a file, read the input, identify garbage lines via regexes and remove them. Read the file again (I know, I know, already inefficient), split it into records, split the records into columns, and store specific columns in a hash keyed on record id. Once it had all the records, the script would insert them into an Oracle database, which another part of the script would query, tag and report on.

It worked and I was happy. In the interim I have learnt a lot about structure, packages, modules and regexes (with a lot of help from the monks - thanks Monks!). I now feel the need for a rewrite and a new approach, as it needs to be quicker and more robust than it is. Plus it will be an opportunity to learn something new.

The garbage identification mechanism needs constant updating because the garbage varies. I need to state what I will accept rather than what I won't accept. (It's output from Meridian Voice Switches, for those that are interested.) I am also extending the functionality so that it can parse multiple output types. I need to end up with a data structure containing data structures of valid records.

From what I have read, I get the feeling I should be using IO::Filter, or implementing the Filtered IO idea from TheDamian's OOPerl, to remove the garbage from my file, and then perhaps use Parse::RecDescent to parse the file, validate each record and create the data structure. At the same time I am conscious that I don't want to make it more complex than it needs to be. This seems like a very common task to want to perform.

My questions are:

What approach would you take to a problem of this kind?

I think I am at that point, but what criteria do I use to determine whether I should stop using plain regexes and consider using Parse::RecDescent?

Typical output of a record can be seen on set_uk's scratchpad, which shows the complexity of the problem.

General rules are:

The first word at the beginning of each column is its key.
There are a lot of valid key types (1000+); anything not starting with a key is garbage.
Unless:
If the first col is blank then the data belongs to a key on the previous col.
If the first col is the same as previous then data belongs to the previous col.

I'd be interested to hear what you think. No doubt there are shortcuts to this type of problem that I am not aware of.

Simon
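To make the general rules above concrete, here is a minimal single-pass sketch in plain Perl that states what is accepted instead of what is rejected. It rests on assumptions that are not confirmed anywhere in this post: each line starts with a key followed by whitespace-separated data, a line that begins with whitespace is the "blank first col" case, and the three keys in %valid_key are hypothetical stand-ins for the real list of 1000+ keys. The sample lines are made up for illustration and are not real switch output.

```perl
use strict;
use warnings;

# Hypothetical whitelist; the real list would hold the 1000+ valid keys.
my %valid_key = map { $_ => 1 } qw(TN DES KEY);

# Takes a list of lines and returns a hash ref that maps each
# accepted key to an array ref of the data strings seen for it.
sub parse_lines {
    my @lines = @_;
    my (%record, $current);
    for my $line (@lines) {
        next unless $line =~ /\S/;    # skip empty lines
        if ($line =~ /^\s+(\S.*?)\s*$/) {
            # Blank first column: data belongs to the previous key, if any.
            push @{ $record{$current} }, $1 if defined $current;
        }
        elsif ($line =~ /^(\S+)\s*(.*?)\s*$/ && $valid_key{$1}) {
            # Known key; a repeated key appends to the same entry.
            $current = $1;
            push @{ $record{$current} }, $2 if length $2;
        }
        else {
            # Not on the whitelist: garbage. Forget the current key so
            # that indented lines after garbage are dropped as well.
            undef $current;
        }
    }
    return \%record;
}

# Made-up sample input, for illustration only.
my $record = parse_lines(
    "OVL000 some noise",
    "TN 004 0 00 00",
    "KEY 00 SCR 2000",
    "    01 SCR 2001",
    "KEY 02 MCR 2002",
);
for my $key (sort keys %$record) {
    print "$key => [", join(" | ", @{ $record->{$key} }), "]\n";
}
```

Dropping indented lines that follow garbage is a design choice in this sketch, not something the rules above specify. If the accepted forms stay this regular, a whitelist hash plus two regexes may be enough; nested or context-dependent records are where a grammar module such as Parse::RecDescent starts to look more attractive.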