Re: How to process variables length fields in delimited file.
by liverpole (Monsignor)
on Oct 06, 2016 at 02:03 UTC
My first approach would be to define programmatically (i.e. with a data structure) what the input file contains on each line. Once that's in a script, you run it and prove to yourself that your data does in fact behave as expected.
Since each line is made up of space-delimited items, but some of them are count-prefixed, you could define your line format with an array containing one array reference per item. Each array reference would hold the LABEL of the item (e.g. 'ssn' for social security number, 'emp_num' for employee number, etc.) and a compiled regular expression (that's the qr/.../ syntax) used to parse the item.
In cases where the item is prefixed with a count, specifying the length of the item, you could use a string like 'COUNT' instead of a regex.
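As a rough sketch of that idea, the structure could look something like the following. The field names and patterns here are made up purely for illustration (they are not taken from the original question), so substitute whatever your real layout needs:

```perl
use strict;
use warnings;

# Hypothetical layout, for illustration only.
# Each entry is [ LABEL, PARSER ], where PARSER is either a compiled
# regex (qr/.../) or the string 'COUNT' for a length-prefixed item.
my @line_format = (
    [ emp_num => qr/\d+/               ],
    [ ssn     => qr/\d{3}-\d{2}-\d{4}/ ],
    [ name    => 'COUNT'               ],
    [ dept    => qr/[A-Z]+/            ],
);

# Print the layout so you can eyeball it before parsing anything.
for my $item (@line_format) {
    my ($label, $parser) = @$item;
    my $kind = ref($parser) eq 'Regexp' ? "regex $parser" : $parser;
    printf "%-8s %s\n", $label, $kind;
}
```

Keeping the format in one array means that adding or reordering a field is a one-line change, and the parsing code never has to know the field names.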
Here's an example for what you've defined:
Then you write a subroutine parse_line that you call for each line of your input file. (I would also pass in the line number, so that if a line doesn't match your format you can die with an error saying which line was invalid.)
For each array ref in @line_format you either parse the COUNT, and pull off that number of characters, or you apply the next regex. If the data validates, you assign it into a hash local to the subroutine, with the label as the key. When the subroutine completes successfully, you pass back a reference to that hash.
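To make that description concrete, here is a minimal sketch of such a subroutine. It is only one way to do it, and it rests on assumptions that are mine rather than anything from your data: the field layout is hypothetical, and a count-prefixed item is assumed to look like digits, one space, then exactly that many characters (which may themselves contain spaces).

```perl
use strict;
use warnings;

# Hypothetical layout, for illustration only.
my @line_format = (
    [ emp_num => qr/\d+/               ],
    [ ssn     => qr/\d{3}-\d{2}-\d{4}/ ],
    [ name    => 'COUNT'               ],
    [ dept    => qr/[A-Z]+/            ],
);

sub parse_line {
    my ($line, $line_num) = @_;
    my %record;
    my $rest = $line;    # the unparsed remainder of the line

    for my $item (@line_format) {
        my ($label, $parser) = @$item;
        $rest =~ s/\A\s+//;    # skip the delimiter before each item

        if (!ref($parser) && $parser eq 'COUNT') {
            # Assumed form: "<digits><space><that many characters>"
            $rest =~ s/\A(\d+) //
                or die "Line $line_num: missing count for '$label'\n";
            my $count = $1;
            length($rest) >= $count
                or die "Line $line_num: '$label' is shorter than $count characters\n";
            # 4-arg substr returns the item and removes it from $rest
            $record{$label} = substr($rest, 0, $count, '');
        }
        else {
            # The item must match the regex and end at whitespace or end of line
            $rest =~ s/\A($parser)(?=\s|\z)//
                or die "Line $line_num: bad value for '$label'\n";
            $record{$label} = $1;
        }
    }

    $rest =~ /\S/
        and die "Line $line_num: unexpected trailing data\n";
    return \%record;
}

# Usage with a made-up input line
my $rec = parse_line('1042 123-45-6789 10 Jane Smith SALES', 1);
print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

Dying with the line number on the first mismatch is deliberate: it forces you to find out early whether the file really follows the format you think it does.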
Here's how you might write the parse_line subroutine:
When I call that subroutine with the data you defined for a single line:
This simple program dumps as its result:
So I know I'm on the right track.
The next steps would be something like:
Does that help?
Edit: fixed whom I'm responding to (thanks choroba)