PerlMonks  

can't use unpack or split??

by seaver (Pilgrim)
on Jun 07, 2004 at 19:24 UTC

seaver has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

I basically have to parse a file that looks like this:

BAZ 'N3''  N  0 ? ? ? 1
BAZ 'N4''  N  0 ? ? ? 1
BAZ 'C8''  C  0 ? ? ? 1
BAZ C9     C  0 ? ? ? 1
BAZ ZN     ZN 0 ? ? ? 0
BAZ HN1    H  0 ? ? ? 1
BAZ 1HN2   H  0 ? ? ? 0
BAZ 2HN2   H  0 ? ? ? 0
001 F11  F 0 ? ? ? 1
001 C11  C 0 ? ? ? 1
001 O11  O 0 ? ? ? 1
001 N12  N 0 ? ? ? 1
001 C12  C 0 ? ? ? 1
001 C13  C 0 ? ? ? 1
001 C14  C 0 ? ? ? 1
001 C15  C 0 ? ? ? 1
001 C16  C 0 ? ? ? 1
BCB CBA   C  0 ? ? ? 1
BCB CGA   C  0 ? ? ? 1
BCB O1A   O  0 ? ? ? 1
BCB O2A   O  0 ? ? ? 1
BCB 'N B' N  0 ? ? ? 1
BCB C1B   C  0 ? ? ? 1
BCB C2B   C  0 ? ? ? 1
BCB C3B   C  0 ? ? ? 1
BCB C4B   C  0 ? ? ? 1
BCB CMB   C  0 ? ? ? 1
The irregular columns mean I can't use unpack, BUT the space in the second column, where it says 'N B', HAS to be ignored when splitting!!

I'm hoping there's a third way to do this?

Cheers
Sam

Replies are listed 'Best First'.
Re: can't use unpack or split??
by eclark (Scribe) on Jun 07, 2004 at 19:46 UTC

    How about this?

    open FH, "<datafile";
    while (my $line = <FH>) {
        chomp $line;
        my @a = ($line =~ /('{1}.*?'{1}(?=\s+)|\S+)/g);
        print join('|', @a) . "\n";
    }
    close FH;

      Thanks for your reply, it totally makes sense, however I wasn't explicit enough in my question.

      You've coded the search pattern specifically for the example I gave above, but truly, the 4 or 5 characters in the second column could be ANY character, and not necessarily be delimited by ' as they are in the example above.

      After having thought about this, the three things I can guarantee are:

      1. The first three characters are FIXED.

      2. There is a space between every column

      3. The second column is at least 1 character, and at the most 5 characters

      So my problem is truly that I have columns delimited by spaces, (nothing I can do about that) and the second column may actually have within it, a space...!!

      Any more ideas?

      Cheers
      Sam

Re: can't use unpack or split??
by Limbic~Region (Chancellor) on Jun 07, 2004 at 20:13 UTC
    seaver,
    Here is code that will work for the limited data sample you provided:
    while ( <DATA> ) {
        my @col = /^(\w{3}) (.{1,5})\s+(\w+)\s+([01?]) ([01?]) ([01?]) ([01?]) ([01?])$/;
        print join "\t" , @col;
        print "\n";
    }
    Of course, depending on all the factors of your situation, there is likely a better solution - perhaps involving pre-processing. In any case - enjoy.

    L~R

    Update: Though it was perfectly fine, I changed .* to .{1,5} after reading about the constraint elsewhere in the thread
      Limbic~Region

      I'm afraid I don't understand how the '.*' will capture a space within the second column, seeing it's followed by a '\s+'?

      Another thing that has just occurred to me:

      The letter present in the third column is identical to one of the letters(first or second) in the second column.

      So if I checked that, I'd know if a space turned up in the wrong place, because the letters don't check out right?

      Sam
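      The sanity check suggested above could be sketched roughly as follows. This is only an illustration: `element_matches_atom` is a hypothetical helper name, and it uses a looser rule than "first or second letter" (the element symbol merely has to occur somewhere in the second column, ignoring case), which is an assumption rather than anything guaranteed about the data.

```perl
use strict;
use warnings;

# Hypothetical helper: true when the element symbol (column three)
# occurs somewhere in the atom name (column two), ignoring case.
# Assumption: a substring test is a good enough approximation of
# "the letter is the first or second letter of column two".
sub element_matches_atom {
    my ($atom, $element) = @_;
    return index(uc($atom), uc($element)) >= 0;
}

# The last pair is deliberately wrong to show a mismatch being flagged.
for my $pair ( [ q{'N B'}, 'N' ], [ '1HN2', 'H' ], [ 'ZN', 'ZN' ], [ 'C9', 'O' ] ) {
    my ($atom, $element) = @$pair;
    printf "%-6s %-2s %s\n", $atom, $element,
        element_matches_atom($atom, $element) ? 'ok' : 'MISMATCH';
}
```

      A mismatch would suggest that a space ended up in the wrong column during parsing, so the line could be logged and inspected by hand.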

        seaver,
        I'm afraid I don't understand how the '.*' will capture a space within the second column, seeing it's followed by a '\s+'?

        So you ran the code, saw that it works, but didn't know why.

        I would suggest perldoc perlre or perhaps The Owl Book. It works because I have anchored it at both ends ^ and $ and forced the other spaces where appropriate. Since general use of .* is frowned upon, I have modified it after reading your more constraining information here.

        Cheers - L~R

Re: can't use unpack or split??
by BrowserUk (Patriarch) on Jun 07, 2004 at 20:47 UTC

    If only the second column can contain spaces, then by anchoring the regex at both ends and using a non-greedy match for column two you should be able to handle all the possibilities. This seems to. (I've added a couple of possible variations).

    Results

    P:\test>test1
    BAZ|'N3''|N|0|?|?|?|1
    BAZ|'N4''|N|0|?|?|?|1
    BAZ|'C8''|C|0|?|?|?|1
    BAZ|C9|C|0|?|?|?|1
    BAZ|ZN|ZN|0|?|?|?|0
    BAZ|HN1|H|0|?|?|?|1
    BAZ|1HN2|H|0|?|?|?|0
    BAZ|2HN2|H|0|?|?|?|0
    001|F11|F|0|?|?|?|1
    001|C11|C|0|?|?|?|1
    001|O 1 1|O|0|?|?|?|1
    001|N12|N|0|?|?|?|1
    001|C12|C|0|?|?|?|1
    001|C13|C|0|?|?|?|1
    001|C 14|C|0|?|?|?|1
    001|C15|C|0|?|?|?|1
    001|C16|C|0|?|?|?|1
    BCB|CBA|C|0|?|?|?|1
    BCB|C GA|C|0|?|?|?|1
    BCB|O1A|O|0|?|?|?|1
    BCB|O2A|O|0|?|?|?|1
    BCB|'N B'|N|0|?|?|?|1
    BCB|C1B|C|0|?|?|?|1
    BCB|C2B|C|0|?|?|?|1
    BCB|C3B|C|0|?|?|?|1
    BCB|C4B|C|0|?|?|?|1
    BCB|CMB|C|0|?|?|?|1
    27
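    A minimal sketch of the approach described above (anchor at both ends, non-greedy match for column two) might look like the following. It is not BrowserUk's original code: `parse_record` is a hypothetical name, and the pattern assumes that only column two may contain spaces and that it is 1 to 5 characters long, as stated elsewhere in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: split one record into its eight columns.
# Returns the captured columns, or an empty list if the line does not fit.
sub parse_record {
    my ($line) = @_;
    return $line =~ m{
        ^   (\S{3})       # column 1: three fixed characters
        \s  (.{1,5}?)     # column 2: as few characters as possible, spaces allowed
        \s+ (\S+)         # column 3
        \s+ (\S+)         # column 4
        \s+ (\S+)         # column 5
        \s+ (\S+)         # column 6
        \s+ (\S+)         # column 7
        \s+ (\S+)         # column 8
        \s* $
    }x;
}

for my $line ( "BAZ 'N3''  N  0 ? ? ? 1", "BCB 'N B' N  0 ? ? ? 1", "001 O 1 1  O 0 ? ? ? 1" ) {
    my @cols = parse_record($line) or next;   # skip lines that do not match
    print join('|', @cols), "\n";
}
```

    Because the six trailing columns cannot contain whitespace and the pattern is anchored with `$`, the engine has to keep extending column two until exactly six fields remain to its right. That is why an embedded space ends up inside the second capture.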

    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "Think for yourself!" - Abigail
Re: can't use unpack or split??
by allolex (Curate) on Jun 07, 2004 at 20:53 UTC

    I really like Limbic~Region's approach, but here is my idea for your algorithm. Mine is not dependent on fixed widths at all. It starts from the right and grabs the last six space-delimited strings. Then you can grab the first and second items. My Perl here is a bit sloppy, but this proof-of-concept works.

    #!/usr/bin/perl
    use strict;
    use warnings;

    while (<DATA>) {
        s/\s*([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)\s+([^\s]+)$//;
        my @items = ($1, $2, $3, $4, $5, $6);
        m/([^\s]+)\s+(.*)$/;
        unshift @items, $2;
        unshift @items, $1;
        print "[$_] " foreach @items;
        print "\n";
    }
    OUTPUT:
    [BAZ] ['N3''] [N] [0] [?] [?] [?] [1]
    [BAZ] ['N4''] [N] [0] [?] [?] [?] [1]
    [BAZ] ['C8''] [C] [0] [?] [?] [?] [1]
    [BAZ] [C9] [C] [0] [?] [?] [?] [1]
    [BAZ] [ZN] [ZN] [0] [?] [?] [?] [0]
    [BAZ] [HN1] [H] [0] [?] [?] [?] [1]
    [BAZ] [1HN2] [H] [0] [?] [?] [?] [0]
    [BAZ] [2HN2] [H] [0] [?] [?] [?] [0]
    [001] [F11] [F] [0] [?] [?] [?] [1]
    [001] [C11] [C] [0] [?] [?] [?] [1]
    [001] [O11] [O] [0] [?] [?] [?] [1]
    [001] [N12] [N] [0] [?] [?] [?] [1]
    [001] [C12] [C] [0] [?] [?] [?] [1]
    [001] [C13] [C] [0] [?] [?] [?] [1]
    [001] [C14] [C] [0] [?] [?] [?] [1]
    [001] [C15] [C] [0] [?] [?] [?] [1]
    [001] [C16] [C] [0] [?] [?] [?] [1]
    [BCB] [CBA] [C] [0] [?] [?] [?] [1]
    [BCB] [CGA] [C] [0] [?] [?] [?] [1]
    [BCB] [O1A] [O] [0] [?] [?] [?] [1]
    [BCB] [O2A] [O] [0] [?] [?] [?] [1]
    [BCB] ['N B'] [N] [0] [?] [?] [?] [1]
    [BCB] [C1B] [C] [0] [?] [?] [?] [1]
    [BCB] [C2B] [C] [0] [?] [?] [?] [1]
    [BCB] [C3B] [C] [0] [?] [?] [?] [1]
    [BCB] [C4B] [C] [0] [?] [?] [?] [1]
    [BCB] [CMB] [C] [0] [?] [?] [?] [1]
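    The same right-to-left idea can also be sketched with split and splice instead of two regexes. This is an alternative illustration, not the code above: `parse_from_right` is a hypothetical name, and it assumes that columns one and three to eight never contain whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper: peel fields off both ends of the line and treat
# whatever is left in the middle as column two.
sub parse_from_right {
    my ($line) = @_;
    my @words = split ' ', $line;        # split on runs of whitespace
    return if @words < 8;                # too few fields to be a record
    my $first = shift @words;            # column 1
    my @tail  = splice @words, -6;       # columns 3 to 8
    return ($first, join(' ', @words), @tail);
}

for my $line ( "BCB 'N B' N  0 ? ? ? 1", "BAZ 1HN2   H  0 ? ? ? 0" ) {
    my @items = parse_from_right($line) or next;
    print join(' ', map { "[$_]" } @items), "\n";
}
```

    One trade-off of this version is that a run of several spaces inside column two is collapsed to a single space by the join, so it only suits data where that does not matter.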

    --
    Damon Allen Davison
    http://www.allolex.net

Re: can't use unpack or split??
by wufnik (Friar) on Jun 07, 2004 at 22:48 UTC
    hmmm; what about this. it's simple:
    my @records = (<DATA>);

    # lookbehind to make sure those that start
    # with an apostrophe end with one...
    my $tok1 = qr/[\'][A-Za-z0-9\s\']+(?<=\')/;
    my $tok2 = qr/[A-Z\?0-9]+/;

    # for each record in the above sample...
    # note precedence in regex below...
    foreach my $record (@records){
        my @fields = ($record =~ /$tok1|$tok2/g);
        # do something with fields
    }
    __DATA__
    #...those irritating records in full...
    works for the dataset provided, and i think for seaver's extension.
    ...wufnik

    -- in the world of the mules there are no rules --
      It is definitely a good one, and one I'll remember for the future, but unfortunately it relies on the second column being delimited by apostrophes, and this is not guaranteed: there are NO delimiters in the second column!

      So a regex search that tries to match as much of the line as possible, as written by BrowserUk, is the straightforward answer that will cover all the possibilities.

      Thanks to everyone for answering

      Cheers
      Sam

Re: can't use unpack or split??
by Roy Johnson (Monsignor) on Jun 07, 2004 at 19:38 UTC
    Try this Q&A.
    Update: Had numbers transposed in the ID. Fixed.

    Not sure why this one is such a lightning rod for downvotes. I wish one of the voters was literate enough to explain it.


    The PerlMonk tr/// Advocate
      i'm not sure you have the right node there...
