http://qs321.pair.com?node_id=628055

daseme has asked for the wisdom of the Perl Monks concerning the following question:

Wise Monks,
I have several hundred files each of which has fixed column positions, but none of the files contain the same positions for these columns.
My code appears to do fine for the first 3 cols of data, but the 4th and 5th cols are sometimes right-aligned which my code can not handle. I would appreciate any help with an approach to process the problem columns differently and/or enlighten me about other potential issues in my code. Using the post from tilly Locate char in a string I was able to get as far as i have below. I am but a Perl hobbyist.
#!/usr/bin/perl -w use strict; use diagnostics; my @pos; my $line; my @field; open FILE, 'TEST2.txt' or die "Can't open input file: $!\n"; my @data = <FILE>; close(FILE); &find_position; foreach $line (@data) { my @rec; my $prev = 0; foreach my $col (@pos) { push @rec, substr( $line, $prev, $col - $prev - 1 ); $prev = $col - 1; } print join( ':', @rec ); print "\n ----- \n"; @rec = undef; } sub find_position { foreach $line (@data) { # find first line to meet conditions and +capture position info if ( $line =~ /^(\w.*|\w+\S.*)(\s{2,}.*\S)(\s{2,})\d{9}/x && ! +$pos[0] ) { # match my delimiter while ( $line =~ /(\s{2,}|\t\s?)(\w|\d)/g ) { push @pos, pos($line); } } } }
Some sample data that illustrates the variation in a given file
The First One Here Is Longer. Collie SN      262287630	  77312	   93871  MVP		
A  Second (PART) here         First In 20 MT 169287655	  506666   61066  RTD		
3rd Person "Something"        X&Y No SH      564287705	  45423    52443  RTE	
The Fourth Person 20          MLP 4000       360505504	  3530     72201  VRE	
The Fifth Name OR Something   Twin 200 SH    469505179	  3530     72201  VRE
The Sixth Person OR Item      MLP            260505174	  3,530   72,201  VRE
70 The Seventh Record         MLP            764205122	  3530     72201  VRE
The Eighth Person MLP         MLP            160545154	  3530      7220  VRE 

Replies are listed 'Best First'.
Re: Fixed Position Column Records
by BrowserUk (Patriarch) on Jul 22, 2007 at 02:12 UTC

    I'm assuming that you don't know the lengths of each field in advance. Ie. That you are hoping to use the same code to process similar files containing fixed length records where the lengths of the fields can vary from one file to the next?

    To that end, I've come up with an (imperfect) mechanism to allow the program to determine the offsets by inspection. It requires two passes of the file.

    1. During the first pass, each record is string-OR'd with a mask string that starts out containing all spaces.

      By the end of that pass, the mask will only contain spaces in positions where every record also contained a space in that position.

      We then scan the mask and use the positions of the remaining spaces to build an unpack template.

    2. That template is used during the second pass of the file to break up the records into fields.
    #! perl -slw use strict; my @lines = <DATA>; ## Pass 1. OR the records with a mask of spaces my $mask = chr(32) x length $lines[ 0 ]; $mask |= $_ for @lines; ## Detect the spaces that remain and build the template my $templ = ''; $templ .= 'a' . length( $1 ) . 'x' . length( $2 ) . ' ' while $mask =~ m[(\S+)(\s+|$)]g; $templ =~ s[x\d+\s+$][]; ## Strip redundant last 'xN' print $templ; ## Split the records and output delimited by '|' print join '|', unpack $templ, $_ for @lines; __DATA__ The First One Here Is Longer. Collie SN 262287630 77312 93871 + MVP A Second (PART) here First In 20 MT 169287655 506666 61066 + RTD 3rd Person "Something" X&Y No SH 564287705 45423 52443 + RTE The Fourth Person 20 MLP 4000 360505504 3530 72201 + VRE The Fifth Name OR Something Twin 200 SH 469505179 3530 72201 + VRE The Sixth Person OR Item MLP 260505174 3,530 72,201 + VRE 70 The Seventh Record MLP 764205122 3530 72201 + VRE The Eighth Person MLP MLP 160545154 3530 7220 + VRE

    The output

    a29x1 a11x1 a2x1 a9x2 a6x2 a6x2 a4 The First One Here Is Longer.|Collie SN | |262287630|77312 | 93871|M +VP A Second (PART) here |First In 20|MT|169287655|506666| 61066|R +TD 3rd Person "Something" |X&Y No SH | |564287705|45423 | 52443|R +TE The Fourth Person 20 |MLP 4000 | |360505504|3530 | 72201|V +RE The Fifth Name OR Something |Twin 200 SH| |469505179|3530 | 72201|V +RE The Sixth Person OR Item |MLP | |260505174|3,530 |72,201|V +RE 70 The Seventh Record |MLP | |764205122|3530 | 72201|V +RE The Eighth Person MLP |MLP | |160545154|3530 | 7220|V +RE

    The above shows why it is imperfect. It 'found' an extra column at the end of the second column.

    However, the more lines in the file, statistically, the less likelyhood of word breaks 'lining up' throughout the file. It shouldn't happen too often on files of any great length. (Famous last words:)

    Whether that's a flaw you can live with is your decision. I tried to think of a heuristic to determine when a column should be combined with a neihbour, but it will depend entirely on the file and the data.

    I've used the 'a' template which pads fields with spaces because it makes for ease of alignment for printing, but use 'A' if you want the trailing spaces stripped.

    Update: I thought of a heuristic that would probably work, but it would require at least a third pass.

    Left or right justified, one end or the other of every field should contain a non-space char in every record.

    Another pass that inspected the first and last chars of each field could detect 'false' columns. You'd then need to decide whether that column should be combined with the preceding or the following field. Another heuristic is called for, but whatever you come up with, it is possible to dream up scenarios in which it would fail.

    In the case above, the fact that the field that follows the false field has a non-space in the first char in every record in the file is a strong indication that the false field should be combined with its precedant. But had the following field been a right-justified field, then things would be less clear cut.


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      I would have posted a similar approach to induce probable column boundaries and then build a template string for unpack based on file data, but I had forgotten about the "string-OR" trick, so my code would end up much longer (and obviously was taking much longer to write).

      Luckily for me, I rechecked the thread before getting to the tricky part in my approach (collapsing an array of character offsets into a pack template), and saw that there was a better way to do it (better in the sense of getting it done quicker with less code, as proven by the relative speed of your post).

      ++!

      You don't need the x in the template. As you note applying a heuristic requires another pass through the data. Here is a simple append to LHS column one.... As you note there are failure cases whatever you do. Just leaving it simple and doing the column merge in Excel probably makes a lot of sense.

      #! perl -w use strict; my (@templ, $templ); my $TEMPL = 'a'; my @lines = grep{! m/^\s*$/ }<DATA>; my $mask = ' ' x length $lines[0]; $mask |= $_ for @lines; push @templ, length($1) while $mask =~ m/(\S+(\s+|$))/g; $templ = $TEMPL. join $TEMPL, @templ; print "Naive $templ\n"; print join '|', unpack $templ, $_ for @lines; # heuristic to detect and remove column breaks giving null fields # this effectively assumes left justification and appends left # but you could make it trickier for my $line (@lines) { my @data = unpack $templ, $line; for my $i (1..$#data) { next unless $data[$i] =~ m/^\s*$/; $templ[$i-1] += $templ[$i]; # add to LHS column $templ[$i] = 0; # unset this column in template } } $templ = $TEMPL. join $TEMPL, grep{$_}@templ; # need grep to skip 0's print "\nMunged $templ\n"; print join '|', unpack $templ, $_ for @lines; __DATA__ The First One Here Is Longer. Collie SN 2 62287630 77312 9387 +1 MVP A A Second (PART) here First In 20 MT 69287655 506666 6106 +6 RTD 3rd Person "Something" X&Y No SH 64287705 45423 5244 +3 RTE The Fourth Person 20 MLP 4000 60505504 3530 7220 +1 VRE The Fifth Name OR Something Twin 200 SH 69505179 3530 7220 +1 VRE B The Sixth Person OR Item MLP 60505174 3,530 72,20 +1 VRE 70 The Seventh Record MLP 64205122 3530 7220 +1 VRE The Eighth Person MLP MLP 60545154 3530 722 +0 VRE

      Output

      Naive a30a12a3a2a10a8a8a4a2 The First One Here Is Longer. |Collie SN | |2 |62287630 |77312 +| 93871 |MVP |A A Second (PART) here |First In 20 |MT | |69287655 |506666 +| 61066 |RTD | 3rd Person "Something" |X&Y No SH | | |64287705 |45423 +| 52443 |RTE | The Fourth Person 20 |MLP 4000 | | |60505504 |3530 +| 72201 |VRE | The Fifth Name OR Something |Twin 200 SH | | |69505179 |3530 +| 72201 |VRE |B The Sixth Person OR Item |MLP | | |60505174 |3,530 +|72,201 |VRE | 70 The Seventh Record |MLP | | |64205122 |3530 +| 72201 |VRE | The Eighth Person MLP |MLP | | |60545154 |3530 +| 7220 |VRE | Munged a30a17a10a8a8a6 The First One Here Is Longer. |Collie SN 2 |62287630 |77312 | +93871 |MVP A A Second (PART) here |First In 20 MT |69287655 |506666 | +61066 |RTD 3rd Person "Something" |X&Y No SH |64287705 |45423 | +52443 |RTE The Fourth Person 20 |MLP 4000 |60505504 |3530 | +72201 |VRE The Fifth Name OR Something |Twin 200 SH |69505179 |3530 | +72201 |VRE B The Sixth Person OR Item |MLP |60505174 |3,530 |7 +2,201 |VRE 70 The Seventh Record |MLP |64205122 |3530 | +72201 |VRE The Eighth Person MLP |MLP |60545154 |3530 | + 7220 |VRE
    A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Fixed Position Column Records
by grep (Monsignor) on Jul 22, 2007 at 01:19 UTC
    You're making this way to complicated. Use unpack. Create a "picture line" for each different format.
    #untested my @col = unpack('a30a15a12a8a10a3',$line)
Re: Fixed Position Column Records
by daseme (Beadle) on Jul 22, 2007 at 03:23 UTC
    Thank you grep and BrowserUk.
    below is a replacement for my first foreach statement which works better than my original code thanks to the suggestion by grep.
    my $f2 = $pos[1] - $pos[0]; my $f3 = $pos[2] - $pos[1]; my $f4 = $pos[3] - $pos[2]; my $f5 = $pos[4] - $pos[3]; foreach $line (@data) { my $pictureline = "a$pos[0] a$f2 a$f3 a$f4 a$f5"; print unpack( $pictureline, $line ), "\n"; }
    And thank you again BrowserUk for a more elegant solution altogether.

      Posted here to ensure it comes to the OPs attention.

      Replace the regex used by the pattern building code with

      while $mask =~ m[([^ ]+)( +|$)]g;

      That is, replace \S and \s with literal spaces as the ORing process can generate chr(9) and other characters that would be treated as whitespace and so create false columns.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Fixed Position Column Records
by jhourcle (Prior) on Jul 22, 2007 at 20:02 UTC
    My code appears to do fine for the first 3 cols of data, but the 4th and 5th cols are sometimes right-aligned which my code can not handle

    Based on the sample data you've given, have you considered using multiple methods to process the line? If you can reliably determine where the 3rd column starts, then you can get columns 3 through 6 with a split on \s+.

    Of course, this won't work if you have empty cells in your data, or if any of the cells in columns 3-6 have spaces in them.

      Thank you for your reply. I am still very much a beginner at Perl and so unfortunately I am not clear on the approach you are suggesting. To give you an idea of where I am at skill-wise, I am still attempting to apply BrowserUK's suggested heuristic from above.
      I would certainly appreciate any pointers to help me understand your suggestion though.
Re: Fixed Position Column Records
by EvanCarroll (Chaplain) on May 18, 2008 at 21:55 UTC
    For people who stumble upon this thread. I've written a module to help with task. DataExtract::FixedWidth
    my $de = DataExtract::FixedWidth->new({ header_row => undef , heuristic => \@lines , cols => [qw/name address num1 num2 num3/] }) say $de->parse_hash( $_ )->{name} foreach @lines


    Evan Carroll
    I hack for the ladies.
    www.EvanCarroll.com