You don't need the x in the template. As you note, applying a heuristic requires another pass through the data. Here is a simple one that appends a column to its left-hand (LHS) neighbour whenever that column turns up empty. As you also note, there are failure cases whatever you do, so just leaving it simple and doing the column merge in Excel probably makes a lot of sense.
#! perl -w
use strict;
my (@templ, $templ);
my $TEMPL = 'a';
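# read the data, skipping blank lines
my @lines = grep{! m/^\s*$/ }<DATA>;
# OR all lines together: a position stays a space only if it is a space in every line
my $mask = ' ' x length $lines[0];
$mask |= $_ for @lines;
# each run of non-spaces plus its trailing whitespace in the mask is one column
push @templ, length($1) while $mask =~ m/(\S+(\s+|$))/g;
# build the unpack template, e.g. a30a12a3...
$templ = $TEMPL. join $TEMPL, @templ;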
print "Naive $templ\n";
print join '|', unpack $templ, $_ for @lines;
# heuristic to detect and remove column breaks giving null fields
# this effectively assumes left justification and appends left
# but you could make it trickier
for my $line (@lines) {
    my @data = unpack $templ, $line;
    for my $i (1..$#data) {
        next unless $data[$i] =~ m/^\s*$/;
        $templ[$i-1] += $templ[$i]; # add to LHS column
        $templ[$i] = 0; # unset this column in template
    }
}
$templ = $TEMPL. join $TEMPL, grep{$_}@templ; # need grep to skip 0's
print "\nMunged $templ\n";
print join '|', unpack $templ, $_ for @lines;
__DATA__
The First One Here Is Longer. Collie SN 2 62287630 77312 93871 MVP A
A Second (PART) here First In 20 MT 69287655 506666 61066 RTD
3rd Person "Something" X&Y No SH 64287705 45423 52443 RTE
The Fourth Person 20 MLP 4000 60505504 3530 72201 VRE
The Fifth Name OR Something Twin 200 SH 69505179 3530 72201 VRE B
The Sixth Person OR Item MLP 60505174 3,530 72,201 VRE
70 The Seventh Record MLP 64205122 3530 72201 VRE
The Eighth Person MLP MLP 60545154 3530 7220 VRE
Output
Naive a30a12a3a2a10a8a8a4a2
The First One Here Is Longer. |Collie SN | |2 |62287630 |77312 | 93871 |MVP |A
A Second (PART) here |First In 20 |MT | |69287655 |506666 | 61066 |RTD |
3rd Person "Something" |X&Y No SH | | |64287705 |45423 | 52443 |RTE |
The Fourth Person 20 |MLP 4000 | | |60505504 |3530 | 72201 |VRE |
The Fifth Name OR Something |Twin 200 SH | | |69505179 |3530 | 72201 |VRE |B
The Sixth Person OR Item |MLP | | |60505174 |3,530 |72,201 |VRE |
70 The Seventh Record |MLP | | |64205122 |3530 | 72201 |VRE |
The Eighth Person MLP |MLP | | |60545154 |3530 | 7220 |VRE |
Munged a30a17a10a8a8a6
The First One Here Is Longer. |Collie SN 2 |62287630 |77312 | 93871 |MVP A
A Second (PART) here |First In 20 MT |69287655 |506666 | 61066 |RTD
3rd Person "Something" |X&Y No SH |64287705 |45423 | 52443 |RTE
The Fourth Person 20 |MLP 4000 |60505504 |3530 | 72201 |VRE
The Fifth Name OR Something |Twin 200 SH |69505179 |3530 | 72201 |VRE B
The Sixth Person OR Item |MLP |60505174 |3,530 |72,201 |VRE
70 The Seventh Record |MLP |64205122 |3530 | 72201 |VRE
The Eighth Person MLP |MLP |60545154 |3530 | 7220 |VRE
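
If it helps to see the two passes in isolation, here is a minimal sketch on three made-up rows (the sample data and the names @rows, @padded, @blank and @merged are my own, not from the thread). It builds the mask the same way, but decides which columns to fold only after looking at every row, so the result should not depend on the order of the lines. Treat it as one possible variation under those assumptions, not a drop-in replacement.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(max);

# Hypothetical sample rows, invented for illustration (not from the thread).
# The gap inside "Red Fox" / "Big Cat" lines up, so it looks like a column.
my @rows = (
    'Red Fox   SN 12',
    'Big Cat   MT 7',
    'Elk       SH 130',
);

# Pad every row to a common width so the columns line up in each row.
my $width  = max map { length $_ } @rows;
my @padded = map { sprintf('%-*s', $width, $_) } @rows;

# String OR works byte by byte; a position stays a space (0x20) only
# if every row has a space there.
my $mask = ' ' x $width;
$mask |= $_ for @padded;

# One column = a run of non-spaces plus the spaces that follow it.
my @widths;
push @widths, length($1) while $mask =~ /(\S+\s*)/g;
my $naive = 'a' . join('a', @widths);

# Mark every column (except the first) that is blank in at least one row.
my @blank = (0) x @widths;
for my $row (@padded) {
    my @fields = unpack $naive, $row;
    for my $i (1 .. $#fields) {
        $blank[$i] = 1 if $fields[$i] !~ /\S/;
    }
}

# Fold each marked column into the column on its left.
my @merged;
for my $i (0 .. $#widths) {
    if ($blank[$i]) { $merged[-1] += $widths[$i] }
    else            { push @merged, $widths[$i] }
}
my $munged = 'a' . join('a', @merged);

print "Naive  $naive\n";
print join('|', unpack $naive, $_), "\n" for @padded;
print "Munged $munged\n";
print join('|', unpack $munged, $_), "\n" for @padded;
```

For these rows the naive template comes out as a4a6a3a3 and the merged one as a10a3a3. One caveat with the string OR: a tab (0x09) ORed with a space gives 0x29, which is not whitespace, so tab-separated input would need expanding to spaces first.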