Hi dbach355,

My first approach would be to define programmatically (i.e. with a data structure) what the input file contains on each line. Once that's in a script, you can run it and prove to yourself that your data does in fact behave as expected.

Since each line is made up of space-delimited items, but some of them are count-prefixed, you could define your line format with an array containing an array reference for each item. Each array reference would hold the LABEL of the item (e.g. 'ssn' for Social Security number, 'emp_num' for employee number, etc.), and a compiled regular expression (that's the qr/.../ syntax) used to parse the item.

In cases where the item is prefixed with a count, specifying the length of the item, you could use a string like 'COUNT' instead of a regex.

Here's an example for what you've defined:

    my @line_format = (
        [ 'ssn',       qr/(\d{9})/    ],
        [ 'emp_num',   qr/(\d+)/      ],
        [ 'emp_name',  'COUNT'        ],
        [ 'hire_date', qr/(\d{8})/    ],
        # NOTE: same label as the 'city' entry further down, so this value
        # (the street address in the sample data) is overwritten in the hash
        [ 'city',      'COUNT'        ],
        [ 'state',     qr/([A-Z]{2})/ ],
        [ 'city',      'COUNT'        ],
        [ 'zip',       qr/(\d{5})/    ],
    );

Then you write a subroutine parse_line that you call for each line of your input file. (I would also pass in the line number, so that if a line doesn't match your format you can die with an error saying which line was invalid.)

For each array ref in @line_format you either parse the COUNT, and pull off that number of characters, or you apply the next regex. If the data validates, you assign it into a hash local to the subroutine, with the label as the key. When the subroutine completes successfully, you pass back a reference to that hash.
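
As a small illustration of the COUNT step on its own, here is a minimal sketch that pulls one count-prefixed item off the front of a string. The variable names ($rest, $item) are only for this example, and using four-argument substr instead of a second regex is just one way to do it:

```perl
use strict;
use warnings;

# Remainder of a line, starting at a count-prefixed item
# (taken from the sample data in this thread).
my $rest = "11 Steve Smith 11012015";

# Step 1: strip the leading count and the single space after it.
$rest =~ s/^\s*(\d+) // or die "No count found\n";
my $count = $1;

# Step 2: remove exactly $count characters from the front.
# Four-argument substr returns the removed text and replaces it with ''.
length($rest) >= $count or die "Item is shorter than its count\n";
my $item = substr($rest, 0, $count, '');

print "item: '$item'\n";
print "rest: '$rest'\n";
```

Whatever is left in $rest is then handed to the next entry in the format.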

Here's how you might write the parse_line subroutine:

    sub parse_line {
        my ($line, $linenum) = @_;
        my %parsed = ( );
        foreach my $format (@line_format) {
            my ($label, $expected) = @$format;
            if ($expected eq 'COUNT') {
                # Pull the COUNT off the beginning of the line and apply it
                if ($line !~ s/^\s*(\d+) //) {
                    die "Error #1 parsing item '$label' (line #$linenum)\n";
                }
                my $count = $1;
                # Remove exactly $count characters from the front of the line
                if ($line !~ s/^(.{$count})//) {
                    die "Error #2 parsing item '$label' (line #$linenum)\n";
                }
                $parsed{$label} = $1;
            }
            else {
                # Pull off the next non-space word, and test it with the regex
                if ($line !~ s/^\s*(\S+)//) {
                    die "Error #3 parsing item '$label' (line #$linenum)\n";
                }
                my $word = $1;
                # The whole word must match; the regex's capture lands in $1
                if ($word !~ /\A$expected\z/) {
                    die "Error #4 parsing item '$label' (line #$linenum)\n";
                }
                $parsed{$label} = $1;
            }
        }
        return \%parsed;
    }

When I call that subroutine with the data you defined for a single line:

    use Data::Dumper::Concise;

    my $line = "123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553";
    my $result = parse_line($line, 1);
    print Dumper $result;

This simple program dumps as its result:

    {
      city => "Atlanta",
      emp_name => "Steve Smith",
      emp_num => 45612,
      hire_date => 11012015,
      ssn => 123445678,
      state => "GA",
      zip => 30553
    }

So I know I'm on the right track.

The next steps would be something like:

  1. Read all the lines in the file
  2. Call the subroutine parse_line on each line (and line number), getting back a hash ref
  3. Add that hash ref to an array (or do whatever you want with it)
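
Those three steps could be sketched roughly as follows. This is only an outline under a few assumptions: the input is read from an in-memory string standing in for your real file, the second sample record is invented for illustration, and the first count-prefixed field after hire_date is labelled 'address' (my guess at its meaning) so that it isn't overwritten by the later 'city' entry. The parse_line here also checks each word against its regex.

```perl
use strict;
use warnings;

# Line format; the 'address' label is an assumption (see above).
my @line_format = (
    [ 'ssn',       qr/(\d{9})/    ],
    [ 'emp_num',   qr/(\d+)/      ],
    [ 'emp_name',  'COUNT'        ],
    [ 'hire_date', qr/(\d{8})/    ],
    [ 'address',   'COUNT'        ],
    [ 'state',     qr/([A-Z]{2})/ ],
    [ 'city',      'COUNT'        ],
    [ 'zip',       qr/(\d{5})/    ],
);

sub parse_line {
    my ($line, $linenum) = @_;
    my %parsed;
    foreach my $format (@line_format) {
        my ($label, $expected) = @$format;
        if ($expected eq 'COUNT') {
            # Read the count, then take exactly that many characters
            $line =~ s/^\s*(\d+) //
                or die "Error #1 parsing item '$label' (line #$linenum)\n";
            my $count = $1;
            $line =~ s/^(.{$count})//
                or die "Error #2 parsing item '$label' (line #$linenum)\n";
            $parsed{$label} = $1;
        }
        else {
            # Take the next word and validate it against the regex
            $line =~ s/^\s*(\S+)//
                or die "Error #3 parsing item '$label' (line #$linenum)\n";
            my $word = $1;
            $word =~ /\A$expected\z/
                or die "Error #4 parsing item '$label' (line #$linenum)\n";
            $parsed{$label} = $1;
        }
    }
    return \%parsed;
}

# Stand-in for the real input file; the second record is made up.
my $sample = <<'END';
123445678 45612 11 Steve Smith 11012015 16 1001 Main Street GA 7 Atlanta 30553
987654321 7 8 Jane Doe 03152014 13 22 Oak Avenue TX 6 Austin 73301
END

# 1. Read all the lines (swap \$sample for a filename to read a real file)
open my $fh, '<', \$sample or die "Cannot open sample data: $!\n";

my @records;
my $linenum = 0;
while (my $line = <$fh>) {
    $linenum++;
    chomp $line;
    next unless $line =~ /\S/;                    # skip blank lines
    # 2. Parse the line; 3. keep the resulting hash ref
    push @records, parse_line($line, $linenum);
}
close $fh;

printf "%s lives in %s, %s\n", @{$_}{qw(emp_name city state)} for @records;
```

Because parse_line dies on the first bad line, the whole run stops there; if you'd rather collect the errors and carry on, you could wrap the call in an eval block instead.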

Does that help?

Edit: fixed whom I'm responding to (thanks choroba)


In reply to Re: How to process variables length fields in delimited file. by liverpole
in thread How to process variable length fields in delimited file. by dbach355
