Beefy Boxes and Bandwidth Generously Provided by pair Networks
Your skill will accomplish
what the force of many cannot
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

I'm assuming that you don't know the lengths of each field in advance. Ie. That you are hoping to use the same code to process similar files containing fixed length records where the lengths of the fields can vary from one file to the next?

To that end, I've come up with an (imperfect) mechanism to allow the program to determine the offsets by inspection. It requires two passes of the file.

  1. During the first pass, each record is string-OR'd with a mask string that starts out containing all spaces.

    By the end of that pass, the mask will only contain spaces in positions where every record also contained a space in that position.

    We then scan the mask and use the positions of the remaining spaces to build an unpack template.

  2. That template is used during the second pass of the file to break up the records into fields.
#! perl -slw use strict; my @lines = <DATA>; ## Pass 1. OR the records with a mask of spaces my $mask = chr(32) x length $lines[ 0 ]; $mask |= $_ for @lines; ## Detect the spaces that remain and build the template my $templ = ''; $templ .= 'a' . length( $1 ) . 'x' . length( $2 ) . ' ' while $mask =~ m[(\S+)(\s+|$)]g; $templ =~ s[x\d+\s+$][]; ## Strip redundant last 'xN' print $templ; ## Split the records and output delimited by '|' print join '|', unpack $templ, $_ for @lines; __DATA__ The First One Here Is Longer. Collie SN 262287630 77312 93871 + MVP A Second (PART) here First In 20 MT 169287655 506666 61066 + RTD 3rd Person "Something" X&Y No SH 564287705 45423 52443 + RTE The Fourth Person 20 MLP 4000 360505504 3530 72201 + VRE The Fifth Name OR Something Twin 200 SH 469505179 3530 72201 + VRE The Sixth Person OR Item MLP 260505174 3,530 72,201 + VRE 70 The Seventh Record MLP 764205122 3530 72201 + VRE The Eighth Person MLP MLP 160545154 3530 7220 + VRE

The output

a29x1 a11x1 a2x1 a9x2 a6x2 a6x2 a4 The First One Here Is Longer.|Collie SN | |262287630|77312 | 93871|M +VP A Second (PART) here |First In 20|MT|169287655|506666| 61066|R +TD 3rd Person "Something" |X&Y No SH | |564287705|45423 | 52443|R +TE The Fourth Person 20 |MLP 4000 | |360505504|3530 | 72201|V +RE The Fifth Name OR Something |Twin 200 SH| |469505179|3530 | 72201|V +RE The Sixth Person OR Item |MLP | |260505174|3,530 |72,201|V +RE 70 The Seventh Record |MLP | |764205122|3530 | 72201|V +RE The Eighth Person MLP |MLP | |160545154|3530 | 7220|V +RE

The above shows why it is imperfect. It 'found' an extra column at the end of the second column.

However, the more lines in the file, statistically, the less likelyhood of word breaks 'lining up' throughout the file. It shouldn't happen too often on files of any great length. (Famous last words:)

Whether that's a flaw you can live with is your decision. I tried to think of a heuristic to determine when a column should be combined with a neihbour, but it will depend entirely on the file and the data.

I've used the 'a' template which pads fields with spaces because it makes for ease of alignment for printing, but use 'A' if you want the trailing spaces stripped.

Update: I thought of a heuristic that would probably work, but it would require at least a third pass.

Left or right justified, one end or the other of every field should contain a non-space char in every record.

Another pass that inspected the first and last chars of each field could detect 'false' columns. You'd then need to decide whether that column should be combined with the preceding or the following field. Another heuristic is called for, but whatever you come up with, it is possible to dream up scenarios in which it would fail.

In the case above, the fact that the field that follows the false field has a non-space in the first char in every record in the file is a strong indication that the false field should be combined with its precedant. But had the following field been a right-justified field, then things would be less clear cut.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

In reply to Re: Fixed Position Column Records by BrowserUk
in thread Fixed Position Column Records by daseme

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others avoiding work at the Monastery: (5)
As of 2024-03-29 07:14 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found