Re: can't use unpack or split??
by eclark (Scribe) on Jun 07, 2004 at 19:46 UTC
|
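open my $fh, '<', 'datafile' or die "Cannot open datafile: $!";
while (my $line = <$fh>) {
    chomp $line;
    # A quoted token followed by whitespace, or any run of non-whitespace.
    my @a = ($line =~ /('.*?'(?=\s+)|\S+)/g);
    print join('|', @a) . "\n";
}
close $fh;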
|
Thanks for your reply, it totally makes sense, however I wasn't explicit enough in my question.
You've coded the search pattern specifically for the example I gave above, but truly, the 4 or 5 characters in the second column could be ANY character, and not necessarily be delimited by ' as they are in the example above.
After having thought about this, the three things I can guarantee are:
- The first three characters are FIXED.
- There is a space between every column
- The second column is at least 1 character, and at the most 5 characters
So my problem is truly that I have columns delimited by spaces, (nothing I can do about that) and the second column may actually have within it, a space...!!
Any more ideas?
Cheers
Sam
Re: can't use unpack or split??
by Limbic~Region (Chancellor) on Jun 07, 2004 at 20:13 UTC
|
seaver,
Here is code that will work for the limited data sample you provided:
while ( <DATA> ) {
    my @col = /^(\w{3}) (.{1,5})\s+(\w+)\s+([01?]) ([01?]) ([01?]) ([01?]) ([01?])$/;
    print join "\t" , @col;
    print "\n";
}
Of course, depending on all the factors of your situation, there is likely a better solution, perhaps involving pre-processing. In any case, enjoy.
L~R
Update: Though it was perfectly fine, I changed .* to .{1,5} after reading about the constraint elsewhere in the thread.
|
Limbic~Region
I'm afraid I don't understand how the '.*' will capture a space within the second column, seeing as it's followed by a '\s+'?
Another thing that has just occurred to me:
The letter present in the third column is identical to one of the letters (first or second) in the second column.
So if I checked that, I'd know if a space turned up in the wrong place, because the letters don't check out right?
Sam
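The check Sam describes could be sketched roughly like this. It is only an illustration: element_matches is a hypothetical helper name, the sample pairs are made up in the style of the thread's data, and it assumes the third column may be one or two characters long (as in the ZN row).

```perl
use strict;
use warnings;

# Hypothetical helper: true when the element symbol from column three
# starts at the first or second character of column two.
sub element_matches {
    my ($name, $element) = @_;
    my $pos = index($name, $element);    # first occurrence, or -1 if absent
    return $pos == 0 || $pos == 1;
}

# Made-up (column two, column three) pairs; the last one is deliberately wrong.
for my $pair ( [ "'N3''", 'N' ], [ '1HN2', 'H' ], [ 'ZN', 'ZN' ], [ 'C 14', 'C' ], [ '14 C', 'C' ] ) {
    my ($name, $element) = @$pair;
    printf "%-6s %-2s %s\n", $name, $element,
        element_matches($name, $element) ? 'ok' : 'suspect';
}
```

A row flagged as suspect would suggest the split between columns two and three landed in the wrong place.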
|
seaver,
I'm afraid I don't understand how the '.*' will capture a space within the second column, seeing as it's followed by a '\s+'?
So you ran the code, saw that it works, but didn't know why.
I would suggest perldoc perlre or perhaps The Owl Book. It works because I have anchored it at both ends (^ and $) and forced the other spaces where appropriate, so the regex engine has to backtrack until the second column leaves exactly the remaining columns to match. Since general use of .* is frowned upon, I have modified it after reading your more constraining information here.
Cheers - L~R
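To see the effect of the anchoring, here is a small sketch that runs the pattern above against a cut-down version of itself. The sample line, the variable names and the second_column helper are assumptions for illustration, not part of the original data or code.

```perl
use strict;
use warnings;

# Assumed sample line, single-space separated (not the poster's actual data file).
my $line = 'BAZ C9 C 0 ? ? ? 1';

# Prefix only: nothing constrains the rest of the line.
my $loose = qr/^(\w{3}) (.{1,5})\s+(\w+)/;

# Full pattern: every remaining column is spelled out up to the end anchor.
my $strict = qr/^(\w{3}) (.{1,5})\s+(\w+)\s+([01?]) ([01?]) ([01?]) ([01?]) ([01?])$/;

# Hypothetical helper: return what the second capture group matched, or undef.
sub second_column {
    my ($re, $text) = @_;
    my @col = $text =~ $re;
    return @col ? $col[1] : undef;
}

print "loose:  [", second_column($loose,  $line), "]\n";
print "strict: [", second_column($strict, $line), "]\n";
```

On this line the prefix-only pattern greedily swallows "C9 C" as the second column, while the fully anchored pattern is forced back to "C9" because the five flag columns and the $ still have to match.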
Re: can't use unpack or split??
by BrowserUk (Patriarch) on Jun 07, 2004 at 20:47 UTC
|
If only the second column can contain spaces, then by anchoring the regex at both ends and using a non-greedy match for column two you should be able to handle all the possibilities. This seems to. (I've added a couple of possible variations).
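The code for this post does not appear above, so what follows is only a minimal sketch of the technique described (both ends anchored, non-greedy second column), not the poster's program. The column layout, the sample lines and the names $record and parse_record are assumptions.

```perl
use strict;
use warnings;

# Assumed layout: 3-char id, a 1-5 char name that may contain spaces,
# an element symbol, then five flags that are each 0, 1 or ?.
my $record = qr/^(.{3})\s(.{1,5}?)\s+(\S+)\s+([01?])\s+([01?])\s+([01?])\s+([01?])\s+([01?])\s*$/;

# Hypothetical helper: returns the eight fields, or an empty list
# when the line does not fit the assumed layout.
sub parse_record {
    my ($line) = @_;
    my @fields = $line =~ $record;
    return @fields;
}

# Made-up sample lines in the spirit of the thread's data.
for my $line ("BAZ 'N3'' N 0 ? ? ? 1", "001 C 14  C 0 ? ? ? 1", "BCB 'N B' N 0 ? ? ? 1") {
    my @fields = parse_record($line);
    print @fields ? join('|', @fields) : "no match: $line", "\n";
}
```

The non-greedy quantifier makes the engine try the shortest second column first and lengthen it only when the rest of the line cannot match. Rows whose second column itself looks like a flag could still be ambiguous.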
Results
P:\test>test1
BAZ|'N3''|N|0|?|?|?|1
BAZ|'N4''|N|0|?|?|?|1
BAZ|'C8''|C|0|?|?|?|1
BAZ|C9|C|0|?|?|?|1
BAZ|ZN|ZN|0|?|?|?|0
BAZ|HN1|H|0|?|?|?|1
BAZ|1HN2|H|0|?|?|?|0
BAZ|2HN2|H|0|?|?|?|0
001|F11|F|0|?|?|?|1
001|C11|C|0|?|?|?|1
001|O 1 1|O|0|?|?|?|1
001|N12|N|0|?|?|?|1
001|C12|C|0|?|?|?|1
001|C13|C|0|?|?|?|1
001|C 14|C|0|?|?|?|1
001|C15|C|0|?|?|?|1
001|C16|C|0|?|?|?|1
BCB|CBA|C|0|?|?|?|1
BCB|C GA|C|0|?|?|?|1
BCB|O1A|O|0|?|?|?|1
BCB|O2A|O|0|?|?|?|1
BCB|'N B'|N|0|?|?|?|1
BCB|C1B|C|0|?|?|?|1
BCB|C2B|C|0|?|?|?|1
BCB|C3B|C|0|?|?|?|1
BCB|C4B|C|0|?|?|?|1
BCB|CMB|C|0|?|?|?|1
27
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: can't use unpack or split??
by allolex (Curate) on Jun 07, 2004 at 20:53 UTC
|
I really like Limbic~Region's approach, but here is my idea for your algorithm. Mine is not dependent on fixed widths at all. It starts from the right and grabs the last six space-delimited strings. Then you can grab the first and second items. My Perl here is a bit sloppy, but this proof-of-concept works.
#!/usr/bin/perl
use strict;
use warnings;
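while (<DATA>) {
    # Strip the last six whitespace-delimited fields off the right-hand end.
    s/\s*(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)\s+(\S+)$// or next;
    my @items = ($1, $2, $3, $4, $5, $6);
    # What remains is column one, then column two (which may contain spaces).
    m/(\S+)\s+(.*)$/ or next;
    unshift @items, $1, $2;
    print "[$_] " foreach @items;
    print "\n";
}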
OUTPUT:
[BAZ] ['N3''] [N] [0] [?] [?] [?] [1]
[BAZ] ['N4''] [N] [0] [?] [?] [?] [1]
[BAZ] ['C8''] [C] [0] [?] [?] [?] [1]
[BAZ] [C9] [C] [0] [?] [?] [?] [1]
[BAZ] [ZN] [ZN] [0] [?] [?] [?] [0]
[BAZ] [HN1] [H] [0] [?] [?] [?] [1]
[BAZ] [1HN2] [H] [0] [?] [?] [?] [0]
[BAZ] [2HN2] [H] [0] [?] [?] [?] [0]
[001] [F11] [F] [0] [?] [?] [?] [1]
[001] [C11] [C] [0] [?] [?] [?] [1]
[001] [O11] [O] [0] [?] [?] [?] [1]
[001] [N12] [N] [0] [?] [?] [?] [1]
[001] [C12] [C] [0] [?] [?] [?] [1]
[001] [C13] [C] [0] [?] [?] [?] [1]
[001] [C14] [C] [0] [?] [?] [?] [1]
[001] [C15] [C] [0] [?] [?] [?] [1]
[001] [C16] [C] [0] [?] [?] [?] [1]
[BCB] [CBA] [C] [0] [?] [?] [?] [1]
[BCB] [CGA] [C] [0] [?] [?] [?] [1]
[BCB] [O1A] [O] [0] [?] [?] [?] [1]
[BCB] [O2A] [O] [0] [?] [?] [?] [1]
[BCB] ['N B'] [N] [0] [?] [?] [?] [1]
[BCB] [C1B] [C] [0] [?] [?] [?] [1]
[BCB] [C2B] [C] [0] [?] [?] [?] [1]
[BCB] [C3B] [C] [0] [?] [?] [?] [1]
[BCB] [C4B] [C] [0] [?] [?] [?] [1]
[BCB] [CMB] [C] [0] [?] [?] [?] [1]
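The same right-to-left idea can also be sketched without capture groups, by splitting on whitespace and peeling fields off both ends. This is an alternative illustration rather than the code above: parse_from_right is a hypothetical name, the sample lines are made up, and runs of spaces inside the second column are collapsed to single spaces.

```perl
use strict;
use warnings;

# Hypothetical helper: peel six fields off the right, then one off the left;
# whatever is left in the middle is the second column.
sub parse_from_right {
    my ($line) = @_;
    my @words = split ' ', $line;
    return if @words < 8;              # need id, name, and six trailing fields
    my @tail = splice @words, -6;      # element symbol plus five flags
    my $id   = shift @words;
    my $name = join ' ', @words;       # runs of spaces inside the name collapse to one
    return ($id, $name, @tail);
}

# Made-up sample lines in the spirit of the thread's data.
for my $line ("BAZ C9 C 0 ? ? ? 1", "BCB 'N B' N 0 ? ? ? 1") {
    print join(' ', map { "[$_]" } parse_from_right($line)), "\n";
}
```

If the exact spacing inside the second column matters, the regex version is the safer choice.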
Re: can't use unpack or split??
by wufnik (Friar) on Jun 07, 2004 at 22:48 UTC
|
hmmm; what about this. it's simple:
my @records = (<DATA>);
# lookbehind to make sure those that start
# with an apostrophe end with one...
my $tok1 = qr/'[A-Za-z0-9\s']+(?<=')/;
my $tok2 = qr/[A-Z?0-9]+/;
# for each record in the above sample...
# note precedence in regex below...
foreach my $record (@records) {
    my @fields = ($record =~ /$tok1|$tok2/g);
    # do something with fields
}
__DATA__
#...those irritating records in full...
works for the dataset provided, and i think for
seaver's extension.
...wufnik
-- in the world of the mules there are no rules --
|
It is definitely a good one, and one I'll remember for the future, but unfortunately it relies on the second column being delimited by apostrophes, and this is not guaranteed: there are NO delimiters in the second column!
So a regex search that tries to match as much of the line as possible, as written by BrowserUk, is the straightforward answer that will cover all the possibilities.
Thanks to everyone for answering
Cheers
Sam
Re: can't use unpack or split??
by Roy Johnson (Monsignor) on Jun 07, 2004 at 19:38 UTC
|
Try this Q&A.
Update: Had numbers transposed in the ID. Fixed.
Not sure why this one is such a lightning rod for downvotes. I wish one of the voters was literate enough to explain it.
The PerlMonk tr/// Advocate
|
i'm not sure you have the right node there...