Hi monks -- longtime listener; first time caller. I just tried this one over at StackOverflow but the one responder apparently got upset when I suggested his quite nice solution wasn't right for my situation -- and deleted his own answer and all comments! His ball and he's going home! Oh well.

So. I've got big files of hand-entered data in various formats that need cleaning and rearranging; the current one looks like this:

C31 6  3 2.4 1.5 2.6  
C32 2  7 3 1.0  
H31 1 1 0 21.0 11.2 5.3 1.4
T11 2  1 0 6.0 1.1 2.2
L06 1  1 0 1.0 3.3
L06 1  4 0 1.1 1.8

That first line is bad -- missing its fourth field, which should be [0-3]; all sorts of typo-like errors like that. Catch those, send them to the Bad file, cut up good lines into a hash for redistribution. I've got this one matched like so:

($t, $p, $s, $d) = (/^([A-Z]\d\d) +(1?\d) +(\d?\d(?:\.1)?) +([0-3]) /)
  or ($bad{$line++} = $_) && next;
@cts = ($' =~ /(\d?\d\.\d)/g);

That works but I'd really like to do it in a single pattern so I can simply swap patterns for the many differently similar files still to come. I couldn't figure out anything for this that would do both the careful pattern matching and the variable-length lines all in one go (It's easy to catch every field with just /(\S+)\s+/g but then I have to check each catch separately for its proper form, which makes it messy when I retool the script for the next stinking input file).

At this point I'm mainly interested in the theoretico-mechanical question of whether what I want is *possible*. Can you do a match like

@allFields = (/patt1 patt2 patt3 patt19+/);

where the first three patts each occur once and patt19 occurs {1,n} times, you validate all catches with picky matching or next, and however many patt19s there are in a given line everything winds up in @allFields? Everything I tried got the first three fields and either the first patt19 or the last but I could never get them all.
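
For what it's worth, the "first or last but never all" behaviour is what Perl does with a quantified capture group: each repetition overwrites the same capture, so only the final one survives. One possible workaround, offered as a rough sketch rather than a definitive answer, is to let a single anchored pattern validate the whole line (repeats included), capture the run of repeated fields as one chunk, and split that chunk afterwards. The names @head, $tail, $line_re and parse_line are made up for illustration, and this version assumes that nothing but trailing spaces may follow the last repeated field, which is stricter than the two-step code above.

```perl
use strict;
use warnings;

# Hypothetical per-file spec: swap these two for the next input format.
my @head = (qr/[A-Z]\d\d/, qr/1?\d/, qr/\d?\d(?:\.1)?/, qr/[0-3]/);  # fixed fields
my $tail = qr/\d?\d\.\d/;                                            # repeating field

# One anchored pattern checks every field, including each repeat of $tail.
my $head_re = join ' +', map { "($_)" } @head;
my $line_re = qr/^$head_re((?: +$tail)+) *$/;

# Returns all fields of a good line, or an empty list for a bad one.
sub parse_line {
    my ($text) = @_;
    my @fields = $text =~ $line_re or return;
    my $rest = pop @fields;              # the already-validated run of repeats
    return (@fields, split ' ', $rest);
}

my (%bad, $n);
for my $text ('C31 6  3 2.4 1.5 2.6  ', 'H31 1 1 0 21.0 11.2 5.3 1.4') {
    $n++;
    if (my @allFields = parse_line($text)) { print "@allFields\n" }
    else                                   { $bad{$n} = $text }
}
```

Strictly speaking this is still one pattern plus a split, but the split does no validating of its own, so retooling for another file should only mean changing @head and $tail.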

Thanks!

In reply to validate variable-length lines in one regex? by uhClem