can't use unpack or split??

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Replies are listed 'Best First'.
Re: can't use unpack or split?? by eclark (Scribe) on Jun 07, 2004 at 19:46 UTC
How about this?

    open FH, "<datafile" or die "Can't open datafile: $!";
    while (my $line = <FH>) {
        chomp $line;
        # either a quoted token ('...' followed by whitespace) or a run of non-spaces
        my @a = ($line =~ /('{1}.*?'{1}(?=\s+)|\S+)/g);
        print join('|', @a) . "\n";
    }
    close FH;
Re^2: can't use unpack or split?? by seaver (Pilgrim) on Jun 07, 2004 at 20:12 UTC
Thanks for your reply; it totally makes sense. However, I wasn't explicit enough in my question.

You've coded the search pattern specifically for the example I gave above, but truly, the 4 or 5 characters in the second column could be ANY character, and are not necessarily delimited by ' as they are in the example above.

After having thought about this, the three things I can guarantee are:

- The first three characters are FIXED.
- There is a space between every column.
- The second column is at least 1 character and at most 5 characters.

So my problem is truly that I have columns delimited by spaces (nothing I can do about that), and the second column may actually have a space within it...!!

Any more ideas?

Cheers
Sam
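To make the difficulty concrete, here is a minimal sketch (not part of the original post) of why a plain `split` miscounts, and of one way to use the guarantees above: take the first column by position, peel the trailing single-token columns off the right, and treat whatever is left as column two. The sample lines are assumptions reconstructed from output posted later in this thread, `parse_line` is a hypothetical helper name, and the count of six trailing columns is likewise taken from that later output rather than from seaver's list.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into 8 columns.
# Assumes column 1 is exactly 3 characters and the last 6 columns
# never contain whitespace.
sub parse_line {
    my ($line) = @_;
    chomp $line;
    return if length($line) < 4;               # too short to hold two columns
    my $first = substr $line, 0, 3;            # fixed-width first column
    my @rest  = split ' ', substr($line, 3);   # whitespace-separated tokens
    return if @rest < 7;                       # need column 2 plus 6 more
    my @tail   = splice @rest, -6;             # six trailing columns
    my $second = join ' ', @rest;              # leftover tokens form column 2
    return ($first, $second, @tail);           # note: inner spacing is normalized
}

# Assumed sample records (reconstructed, not the original data file).
my @samples = ("001 C12 C 0 ? ? ? 1", "001 C 14 C 0 ? ? ? 1");
for my $line (@samples) {
    my @naive = split ' ', $line;              # breaks on the embedded space
    my @cols  = parse_line($line);
    printf "%-22s naive=%d fields, parsed=%s\n",
        $line, scalar @naive, join('|', @cols);
}
```

On the second sample the naive `split` yields nine fields instead of eight, while `parse_line` keeps `C 14` together. The trade-off is that runs of spaces inside column two collapse to a single space, so a regex that captures the column verbatim may be preferable if exact spacing matters.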
Re: can't use unpack or split?? by Limbic~Region (Chancellor) on Jun 07, 2004 at 20:13 UTC
seaver,
Here is code that will work for the limited data sample you provided:

    while ( <DATA> ) {
        my @col = /^(\w{3}) (.{1,5})\s+(\w+)\s+([01?]) ([01?]) ([01?]) ([01?]) ([01?])$/;
        print join "\t" , @col;
        print "\n";
    }

Of course, depending on all the factors of your situation, there is likely a better solution, perhaps involving pre-processing. In any case, enjoy.

L~R

Update: Though it was perfectly fine, I changed .* to .{1,5} after reading about the constraint elsewhere in the thread.
Re^2: can't use unpack or split?? by seaver (Pilgrim) on Jun 07, 2004 at 20:22 UTC
Limbic~Region,

I'm afraid I don't understand how the '.*' will capture a space within the second column, seeing as it's followed by a '\s+'?

Another thing that has just occurred to me: the letter present in the third column is identical to one of the letters (first or second) in the second column. So if I checked that, I'd know if a space turned up in the wrong place, because the letters don't check out, right?

Sam
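The consistency check seaver describes could be sketched roughly as follows. This is an illustration only: `symbol_consistent` is a hypothetical name, and the test used here (column three occurs somewhere inside column two) is a deliberately loose reading of the rule, so the real data may call for something stricter, such as checking only the first or second position.

```perl
use strict;
use warnings;

# Hypothetical check: does the column-3 symbol occur inside column 2?
# Loose interpretation; tighten it if the real data allows.
sub symbol_consistent {
    my ($col2, $col3) = @_;
    return index($col2, $col3) >= 0 ? 1 : 0;
}

# Assumed examples: one correct split and one made at the wrong space.
my @candidates = (
    [ 'C 14', 'C'  ],   # column 2 kept whole
    [ 'C',    '14' ],   # split too early: "14" does not occur in "C"
);
for my $pair (@candidates) {
    my ($col2, $col3) = @$pair;
    printf "col2=%-6s col3=%-3s %s\n", "'$col2'", $col3,
        symbol_consistent($col2, $col3) ? 'ok' : 'suspect';
}
```

A check like this works best as a sanity test after parsing, flagging suspect records for review, rather than as the parser itself.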
Re^3: can't use unpack or split?? by Limbic~Region (Chancellor) on Jun 07, 2004 at 20:26 UTC
seaver,

"I'm afraid I don't understand how the '.*' will capture a space within the second column, seeing as it's followed by a '\s+'?"

So you ran the code, saw that it works, but didn't know why. I would suggest perldoc perlre or perhaps The Owl Book. It works because I have anchored the regex at both ends (^ and $) and forced the other spaces where appropriate, so the engine has to backtrack until the second column takes exactly what the rest of the pattern leaves over. Since general use of .* is frowned upon, I have modified it after reading your more constraining information here.

Cheers - L~R
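A small sketch may help show what the anchoring buys. The regex in `$anchored` is the one posted above; `$loose` is a cut-down variant written for this comparison, and the sample line is an assumption reconstructed from output posted elsewhere in this thread.

```perl
use strict;
use warnings;

# Assumed sample line (not taken from the original data file).
my $line = "BAZ ZN ZN 0 ? ? ? 0";

# Anchored at both ends, with every trailing column spelled out.
my $anchored = qr/^(\w{3}) (.{1,5})\s+(\w+)\s+([01?]) ([01?]) ([01?]) ([01?]) ([01?])$/;

# Same start, but nothing after column 3 constrains the match.
my $loose = qr/^(\w{3}) (.{1,5})\s+(\w+)/;

my @good = $line =~ $anchored;   # captures in list context
my @bad  = $line =~ $loose;

print "anchored: ", join('|', @good), "\n";
print "loose:    ", join('|', @bad),  "\n";
```

With the loose pattern, the greedy `.{1,5}` swallows `ZN ZN` and column three becomes `0`. With the anchored pattern that first attempt cannot reach `$`, so the engine gives characters back until column two is `ZN` and the remaining columns line up.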
Re: can't use unpack or split?? by BrowserUk (Patriarch) on Jun 07, 2004 at 20:47 UTC
If only the second column can contain spaces, then by anchoring the regex at both ends and using a non-greedy match for column two you should be able to handle all the possibilities.

This seems to. (I've added a couple of possible variations.)

(The code was collapsed in the original post and is not reproduced here.)

Results:

    P:\test>test1
    BAZ|'N3''|N|0|?|?|?|1
    BAZ|'N4''|N|0|?|?|?|1
    BAZ|'C8''|C|0|?|?|?|1
    BAZ|C9|C|0|?|?|?|1
    BAZ|ZN|ZN|0|?|?|?|0
    BAZ|HN1|H|0|?|?|?|1
    BAZ|1HN2|H|0|?|?|?|0
    BAZ|2HN2|H|0|?|?|?|0
    001|F11|F|0|?|?|?|1
    001|C11|C|0|?|?|?|1
    001|O 1 1|O|0|?|?|?|1
    001|N12|N|0|?|?|?|1
    001|C12|C|0|?|?|?|1
    001|C13|C|0|?|?|?|1
    001|C 14|C|0|?|?|?|1
    001|C15|C|0|?|?|?|1
    001|C16|C|0|?|?|?|1
    BCB|CBA|C|0|?|?|?|1
    BCB|C GA|C|0|?|?|?|1
    BCB|O1A|O|0|?|?|?|1
    BCB|O2A|O|0|?|?|?|1
    BCB|'N B'|N|0|?|?|?|1
    BCB|C1B|C|0|?|?|?|1
    BCB|C2B|C|0|?|?|?|1
    BCB|C3B|C|0|?|?|?|1
    BCB|C4B|C|0|?|?|?|1
    BCB|CMB|C|0|?|?|?|1
    27

Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
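Since BrowserUk's own code is not shown here, the following is only a sketch of the idea described in that post, not a reconstruction of it: a regex anchored at both ends with a non-greedy second column. The pattern, the name `to_pipes`, and the single-space sample lines are all assumptions; the 1 to 5 character limit comes from seaver's stated constraints.

```perl
use strict;
use warnings;

# Assumed pattern: anchored at both ends, column 2 matched lazily.
my $record_re = qr/
    ^ (\S{3}) \s        # column 1: exactly three characters
    (.{1,5}?) \s+       # column 2: 1 to 5 characters, as few as possible
    (\S+)               # column 3
    \s+ (\S) \s+ (\S) \s+ (\S) \s+ (\S) \s+ (\S)   # five one-character flags
    \s* $
/x;

# Hypothetical helper: return the fields joined with '|', or undef on no match.
sub to_pipes {
    my ($line) = @_;
    my @fields = $line =~ $record_re;
    return @fields ? join('|', @fields) : undef;
}

# Assumed sample records with spaces inside column 2.
for my $line ("001 O 1 1 O 0 ? ? ? 1", "BCB 'N B' N 0 ? ? ? 1", "BCB C GA C 0 ? ? ? 1") {
    my $out = to_pipes($line);
    print defined $out ? $out : "no match: $line", "\n";
}
```

The lazy quantifier starts with a single character and grows only when the rest of the pattern cannot reach the end of the line, so column two ends up exactly as wide as the six trailing columns allow.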
Re: can't use unpack or split?? by allolex (Curate) on Jun 07, 2004 at 20:53 UTC
I really like Limbic~Region's approach, but here is my idea for your algorithm. Mine is not dependent on fixed widths at all. It starts from the right and grabs the last six space-delimited strings. Then you can grab the first and second items. My Perl here is a bit sloppy, but this proof-of-concept works.

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        # strip the last six space-delimited strings off the right-hand end
        s/\s([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$//;
        my @items = ($1, $2, $3, $4, $5, $6);
        # what remains is the first item, then the second (which may contain spaces)
        m/([^\s]+)\s+(.+)$/;
        unshift @items, $2;
        unshift @items, $1;
        print "[$_] " foreach @items;
        print "\n";
    }

OUTPUT:

    [BAZ] ['N3''] [N] [0] [?] [?] [?] [1]
    [BAZ] ['N4''] [N] [0] [?] [?] [?] [1]
    [BAZ] ['C8''] [C] [0] [?] [?] [?] [1]
    [BAZ] [C9] [C] [0] [?] [?] [?] [1]
    [BAZ] [ZN] [ZN] [0] [?] [?] [?] [0]
    [BAZ] [HN1] [H] [0] [?] [?] [?] [1]
    [BAZ] [1HN2] [H] [0] [?] [?] [?] [0]
    [BAZ] [2HN2] [H] [0] [?] [?] [?] [0]
    [001] [F11] [F] [0] [?] [?] [?] [1]
    [001] [C11] [C] [0] [?] [?] [?] [1]
    [001] [O11] [O] [0] [?] [?] [?] [1]
    [001] [N12] [N] [0] [?] [?] [?] [1]
    [001] [C12] [C] [0] [?] [?] [?] [1]
    [001] [C13] [C] [0] [?] [?] [?] [1]
    [001] [C14] [C] [0] [?] [?] [?] [1]
    [001] [C15] [C] [0] [?] [?] [?] [1]
    [001] [C16] [C] [0] [?] [?] [?] [1]
    [BCB] [CBA] [C] [0] [?] [?] [?] [1]
    [BCB] [CGA] [C] [0] [?] [?] [?] [1]
    [BCB] [O1A] [O] [0] [?] [?] [?] [1]
    [BCB] [O2A] [O] [0] [?] [?] [?] [1]
    [BCB] ['N B'] [N] [0] [?] [?] [?] [1]
    [BCB] [C1B] [C] [0] [?] [?] [?] [1]
    [BCB] [C2B] [C] [0] [?] [?] [?] [1]
    [BCB] [C3B] [C] [0] [?] [?] [?] [1]
    [BCB] [C4B] [C] [0] [?] [?] [?] [1]
    [BCB] [CMB] [C] [0] [?] [?] [?] [1]

--
Damon Allen Davison
http://www.allolex.net
Re: can't use unpack or split?? by wufnik (Friar) on Jun 07, 2004 at 22:48 UTC
hmmm; what about this. it's simple:

    my @records = (<DATA>);

    # lookbehind to make sure those that start
    # with an apostrophe end with one...
    my $tok1 = qr/[\'][A-Za-z0-9\s\']+(?<=\')/;
    my $tok2 = qr/[A-Z\?0-9]+/;

    # for each record in the above sample...
    # note precedence in regex below...
    foreach my $record (@records){
        my @fields = ($record =~ /$tok1|$tok2/g);
        # do something with fields
    }
    __DATA__
    #...those irritating records in full...

works for the dataset provided, and i think for seaver's extension.

...wufnik

-- in the world of the mules there are no rules --
Re^2: can't use unpack or split?? by seaver (Pilgrim) on Jun 11, 2004 at 18:44 UTC
It is definitely a good one, and one I'll remember for the future, but unfortunately it relies on the second column being delimited by apostrophes, and this is not guaranteed: there are NO delimiters in the second column!

So a regex search that tries to match as much of the line as possible, as written by BrowserUk, is the straightforward answer that will cover all the possibilities.

Thanks to everyone for answering.

Cheers
Sam
Re: can't use unpack or split?? by Roy Johnson (Monsignor) on Jun 07, 2004 at 19:38 UTC
Try this Q&A.

Update: Had numbers transposed in the ID. Fixed. Not sure why this one is such a lightning rod for downvotes. I wish one of the voters was literate enough to explain it.

The PerlMonk `tr///` Advocate
Re^2: can't use unpack or split?? by seaver (Pilgrim) on Jun 07, 2004 at 20:13 UTC
I'm not sure you have the right node there...

