davorg's reply suggested that one way to approach the problem is to read the whole file into memory and then parse it with regular expressions. The following script shows one possible way of doing this in two stages: the first breaks the text into records and the second breaks each record into fields. Here it is:

use strict;
use warnings;

my $rxRecord = qr
    {(?xs)
        (ENTRY.*?\n)
        (?=ENTRY|\z)
    };
my $rxFieldHdrs = qr{(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)};
my $rxField = qr
    {(?xs)
        ($rxFieldHdrs.*?\n)
        (?=$rxFieldHdrs|\z)
    };

my $fileText;
{
    local $/;
    $fileText = <DATA>;
}

my @records = $fileText =~ m{$rxRecord}g;
foreach my $record (@records)
{
    print qq{$record}, q{+} x 50, qq{\n};
    my @fields = $record =~ m{$rxField}g;
    foreach my $field (@fields)
    {
        print qq{$field}, q{-} x 50, qq{\n};
    }
    print q{*} x 50, qq{\n};
}

__END__
ENTRY CCHU #type complete
TITLE cytochrome c [validated] - human
    Homo sapiens
ORGANISM #formal_name Homo sapiens #common_name man
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
ENTRY CCMQR #type complete
TITLE cytochrome c - rhesus macaque (tentative sequence)
    Macaca mulatta
ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque
ACCESSIONS A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
ENTRY CCMKP #type complete
TITLE cytochrome c - spider monkey
ORGANISM #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR

And here is the output, showing for each record the whole record followed by each individual field. As you can see, your two-line title is preserved.

ENTRY CCHU #type complete
TITLE cytochrome c [validated] - human
    Homo sapiens
ORGANISM #formal_name Homo sapiens #common_name man
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
++++++++++++++++++++++++++++++++++++++++++++++++++
ENTRY CCHU #type complete
--------------------------------------------------
TITLE cytochrome c [validated] - human
    Homo sapiens
--------------------------------------------------
ORGANISM #formal_name Homo sapiens #common_name man
--------------------------------------------------
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEKGGKHKTGPNLHGMIYARAJLFGRKTSEKGQAPGYSYTAANKNKGIIWGEDTLMEYLENPKKYIP
--------------------------------------------------
**************************************************
ENTRY CCCZ #type complete
TITLE cytochrome c - chimpanzee (tentative sequence)
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
++++++++++++++++++++++++++++++++++++++++++++++++++
ENTRY CCCZ #type complete
--------------------------------------------------
TITLE cytochrome c - chimpanzee (tentative sequence)
--------------------------------------------------
ORGANISM #formal_name Pan troglodytes #common_name chimpanzee
--------------------------------------------------
ACCESSIONS A00002
GDVEKGKKIFIMKCSQCHTSEKVEKGSSSKHKSSSTGPNLHGLMIYARAJFGRKTGSEKQAPGYSYTAANKNKGIIWGED
--------------------------------------------------
**************************************************
ENTRY CCMQR #type complete
TITLE cytochrome c - rhesus macaque (tentative sequence)
    Macaca mulatta
ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque
ACCESSIONS A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
++++++++++++++++++++++++++++++++++++++++++++++++++
ENTRY CCMQR #type complete
--------------------------------------------------
TITLE cytochrome c - rhesus macaque (tentative sequence)
    Macaca mulatta
--------------------------------------------------
ORGANISM #formal_name Macaca mulatta #common_name rhesus macaque
--------------------------------------------------
ACCESSIONS A00003
GDVEKGKKIFIMKCSQSEKCHTVEKGGSSSSKHKTGPNLHGSSEKEMIYARAJKSEKLFGAAAAAAAARKTGQAPGYSYTAANKSSSSNKGITWGEDTLMEYLENPKKYIPGTKMIFVGIKKKEE
--------------------------------------------------
**************************************************
ENTRY CCMKP #type complete
TITLE cytochrome c - spider monkey
ORGANISM #formal_name Ateles sp. #common_name spider monkey
ACCESSIONS A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR
++++++++++++++++++++++++++++++++++++++++++++++++++
ENTRY CCMKP #type complete
--------------------------------------------------
TITLE cytochrome c - spider monkey
--------------------------------------------------
ORGANISM #formal_name Ateles sp. #common_name spider monkey
--------------------------------------------------
ACCESSIONS A00004
GDVFKGKRIFIMKCSQCHTVESSSSKGGKHKTGPNLHGLMIYARAJSEKFGSSSSSSSSSSR
--------------------------------------------------
**************************************************
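
If you would rather keep the fields in a data structure than print them, the same field regex can feed a hash. What follows is only a minimal sketch under a few assumptions of mine: parse_record() is a hypothetical helper that is not part of the script above, the sample record is cut down with a shortened sequence, and I have anchored the headers to the start of a line with ^ and the m modifier, since the letters of ENTRY or TITLE could in principle turn up inside a protein sequence. As in the script above, the sequence simply ends up lumped in with the ACCESSIONS field.

```perl
use strict;
use warnings;

# Same header alternation as before; the ^ anchors and the m modifier
# are my addition so that a header only counts at the start of a line.
my $rxFieldHdrs = qr{(?:ENTRY|TITLE|ORGANISM|ACCESSIONS)};
my $rxField     = qr
    {(?xms)
        (^$rxFieldHdrs.*?\n)
        (?=^$rxFieldHdrs|\z)
    };

# Hypothetical helper: takes one record, returns a hash ref keyed
# by header name.
sub parse_record
{
    my ($record) = @_;
    my %fields;
    foreach my $field ($record =~ m{$rxField}g)
    {
        # Split the header keyword from the rest of the field.
        my ($name, $value) = $field =~ m{\A($rxFieldHdrs)\s*(.*)\z}s;
        $value =~ s{\s+}{ }g;    # fold continuation lines into one line
        $value =~ s{\s+\z}{};    # drop the trailing newline/space
        $fields{$name} = $value;
    }
    return \%fields;
}

# Cut-down sample record with a two-line title (sequence shortened).
my $record = <<'EOT';
ENTRY CCHU #type complete
TITLE cytochrome c [validated] - human
    Homo sapiens
ORGANISM #formal_name Homo sapiens #common_name man
ACCESSIONS A31764; A05676; I55192; A00001
MGDVEKGKKIFIMKCSQCHTVEMGDVEK
EOT

my $fields = parse_record($record);
print "$fields->{TITLE}\n";
```

With that, the two-line title comes back as a single string in $fields->{TITLE}, and pushing each hash ref onto an array gives you one element per record.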

I hope this is of use.

Cheers,

JohnGG


In reply to Re: doubt in storing a data of 2 lines in an array. by johngg
in thread doubt in storing a data of 2 lines in an array. by heidi
