PerlMonks  

Text manipulation on a file with multiple entries, obo format

by Sakti (Novice)
on Oct 01, 2015 at 17:19 UTC ( [id://1143591]=perlquestion )

Sakti has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
I have been working with a file with the .obo extension. In summary, this is the general format:

    ###############
    [Term]
    id: HP:0000007
    name: Autosomal recessive inheritance
    alt_id: HP:0001416
    alt_id: HP:0001526
    def: "A mode of inheritance that is observed for traits related to a gene encoded on one of the autosomes (i.e., the human chromosomes 1-22) in which a trait manifests in homozygotes. In the context of medical genetics, autosomal recessive disorders manifest in homozygotes (with two copies of the mutant allele) or compound heterozygotes (whereby each copy of a gene has a distinct mutant allele)." [HPO:curators]
    synonym: "Autosomal recessive" EXACT []
    synonym: "AUTOSOMAL RECESSIVE FORM" RELATED [HPO:skoehler]
    synonym: "Autosomal recessive predisposition" RELATED []
    is_a: HP:0000005 ! Mode of inheritance

    [Term]
    id: HP:0000008
    name: Abnormality of female internal genitalia
    def: "An abnormality of the female internal genitalia." [HPO:probinson]
    is_a: HP:0000812 ! Abnormal internal genitalia
    is_a: HP:0010460 ! Abnormality of the female genitalia
    property_value: HP:0040005 "An abnormality of the `female internal genitalia` (FMA:45654)." xsd:string {xref="HPO:probinson"}
    ##################

I've been struggling with generating a hash where, for each Term, the id becomes a key of the hash and the text after the ! in each is_a: field becomes an element of an array stored under that key. Something like:

    hash{HP:0000007}[0]="Mode of inheritance"
    hash{HP:0000008}[0]="Abnormal internal genitalia"
    hash{HP:0000008}[1]="Abnormality of the female genitalia"

Another alternative to the array is to generate a hash of hashes where I save all values associated with an id and can access the is_a: fields directly. Can one of our advanced brothers enlighten me?? Thanks a lot!!! Sakti
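
A rough sketch of the hash-of-hashes alternative, assuming each stanza starts with a bracketed line such as [Term] and every tag sits on its own line. The names parse_obo and $terms are made up for illustration and do not come from any module; only the name and is_a tags are kept here.

```perl
use strict;
use warnings;

# Hypothetical helper: read OBO text from a filehandle into a hash of hashes.
sub parse_obo {
    my ($fh) = @_;
    my (%terms, $id);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^\[/) {
            undef $id;                      # new stanza: forget the previous id
        }
        elsif ($line =~ /^id:\s*(\S+)/) {
            $id = $1;
            $terms{$id} = { name => undef, is_a => [] };
        }
        elsif (defined $id && $line =~ /^name:\s*(.+)/) {
            $terms{$id}{name} = $1;
        }
        elsif (defined $id && $line =~ /^is_a:\s*(\S+)\s*!\s*(.+)/) {
            # keep both the parent id and the text after '!'
            push @{ $terms{$id}{is_a} }, { id => $1, name => $2 };
        }
    }
    return \%terms;
}

my $sample = <<'OBO';
[Term]
id: HP:0000008
name: Abnormality of female internal genitalia
is_a: HP:0000812 ! Abnormal internal genitalia
is_a: HP:0010460 ! Abnormality of the female genitalia
OBO

# Read from an in-memory string so the sketch needs no file on disk.
open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $terms = parse_obo($fh);
close $fh;

print "$_->{id}: $_->{name}\n" for @{ $terms->{'HP:0000008'}{is_a} };
```

With this layout, the is_a entries for an id are reached as $terms->{$id}{is_a}, and other tags could be added as further keys of the inner hash.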

Replies are listed 'Best First'.
Re: Text manipulation on a file with multiple entries, obo format
by karlgoethebier (Abbot) on Oct 01, 2015 at 19:51 UTC

      Who "woulda thunk it!" The problem-domain breadth of the treasures to be found on CPAN continues to amaze me! Thanks, karlgoethebier, for bringing that module to light.

      My, I didn't know about this one, but I am so grateful for your reply, thanks!!
Re: Text manipulation on a file with multiple entries, obo format
by 2teez (Vicar) on Oct 01, 2015 at 21:24 UTC

    Hi,
    If I may add to what others have said here.
    Please, please read perldsc - Perl Data Structures Cookbook
    You will be glad you did!
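
For a flavour of what perldsc covers, here is a minimal hash-of-arrays sketch built by hand from the values in the question. The variable name %is_a is only an illustration.

```perl
use strict;
use warnings;

my %is_a;    # hash of arrays: term id => list of parent names

# Autovivification: each array is created by the first push onto it.
push @{ $is_a{'HP:0000007'} }, 'Mode of inheritance';
push @{ $is_a{'HP:0000008'} }, 'Abnormal internal genitalia';
push @{ $is_a{'HP:0000008'} }, 'Abnormality of the female genitalia';

# Access a single element, in the shape the question asks for.
print $is_a{'HP:0000008'}[1], "\n";

# Walk the whole structure.
for my $id (sort keys %is_a) {
    print "$id => ", join('; ', @{ $is_a{$id} }), "\n";
}
```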

    If you tell me, I'll forget.
    If you show me, I'll remember.
    If you involve me, I'll understand.
    --- Author unknown to me
      Thank you all for your replies, will check out that book and the CPAN module!
Re: Text manipulation on a file with multiple entries, obo format
by baxy77bax (Deacon) on Oct 01, 2015 at 20:01 UTC
    ok

    I don't quite get your problem, but as far as I understood you want something like this:

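        use strict;
        use warnings;
        use Data::Dumper;

        open(my $in, "<", "in.obo") || die "Cannot open in.obo: $!";
        my %hash;
        my $term;
        while (<$in>) {
            chomp;
            if (/^id:\s+(.*)/) {
                # remember the id of the current [Term] stanza
                $term = $1;
            }
            elsif (/^is_a:\s+.*?\s+!\s+(.*)/) {
                # keep only the text after '!' as an element of this id's array
                push(@{$hash{$term}}, $1);
            }
        }
        close $in;
        print Dumper(\%hash);
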
    Did I get it right?

    PS: code not tested
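
Since the post is marked as untested, one way to check its two regular expressions is to run them over a few lines taken from the question instead of a file. This is only a sketch: the @lines array is a hand-made, trimmed copy of the sample.

```perl
use strict;
use warnings;

# A trimmed copy of the sample from the question, one record line per element.
my @lines = (
    '[Term]',
    'id: HP:0000007',
    'alt_id: HP:0001416',
    'is_a: HP:0000005 ! Mode of inheritance',
    '',
    '[Term]',
    'id: HP:0000008',
    'is_a: HP:0000812 ! Abnormal internal genitalia',
    'is_a: HP:0010460 ! Abnormality of the female genitalia',
);

my (%hash, $term);
for (@lines) {
    if (/^id:\s+(.*)/) {                     # anchored, so alt_id: lines do not match
        $term = $1;
    }
    elsif (/^is_a:\s+.*?\s+!\s+(.*)/) {
        push @{ $hash{$term} }, $1;
    }
}

for my $id (sort keys %hash) {
    print "$id: $_\n" for @{ $hash{$id} };
}
```

On these lines the regexes produce the layout asked for in the question: one array per id holding the text after each "!".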

      Thank you, I will test this code as well as the CPAN module, which seems like the easiest way to do it. Best!
Re: Text manipulation on a file with multiple entries, obo format
by perlron (Pilgrim) on Oct 01, 2015 at 22:56 UTC
    Hi

    I am not familiar with the OBO data set, but I had a look at the sample and thought I could share some technical input. Your requirement is clear enough: you wish to build a hash of arrays, or some similar data structure, from which you can readily extract your data by the id key.
    However, as I mentioned, your data set has some interesting features. I hope the OBO parser solves your problem, because code written from scratch for such data sets is hardly extensible.
    In any case, I was able to find a way to structure your data set using the space delimiter and the '!' character delimiter. If you think about it, it might not be a great approach, but then again that is your data set :D
    To do this in practice you need a thorough understanding of Perl's data structures. Not to worry, you can read the docs and figure it out: perldata and perlreftut to begin with.
    The key feature I needed for the extraction was the use of anonymous arrays and references. Even though the code works, I suspect it is hardly extensible if your requirement changes, and if you were asked to analyse a data set of a million or so records, I think it is best to use the standard module (like OBO::Parser).
    Note - I created a file of your data in the OP and passed it as an argument to this script below

        #!/usr/bin/perl
        use strict;

        my (%hash,$hash_id);
        my $isa_array_ref;
        open(my $fh,"<",$ARGV[0]) || die "$0: can't open $ARGV[0] for reading: $!";
        LINE: while(<$fh>){
            chomp($_);
            next LINE if ($_ eq "Term");
            #split on first blank space
            my @TermRow = split(/ /,$_,2);
            if($TermRow[0] eq 'id:'){
                $hash_id = $TermRow[1];
                $isa_array_ref = undef;
            }
            elsif($TermRow[0] eq 'is_a:'){
                my @TermISAText = split(/!/,$TermRow[1]);
                #checking if anonymous array reference already exists
                if($isa_array_ref){
                    my @temp_array = @{$isa_array_ref};
                    push(@temp_array,$TermISAText[1]);
                    $isa_array_ref = \@temp_array;
                    $hash{$hash_id} = $isa_array_ref;
                }
                else{
                    #creating an anonymous array reference
                    $isa_array_ref= [$TermISAText[1]];
                    $hash{$hash_id} = $isa_array_ref;
                }
            }
        }
        close($fh);
        print "Result of Extraction:\n ";
        my @id_keys = keys %hash;
        foreach(@id_keys){
            print "key : $_";
            print "list of values \n";
            foreach(@{$hash{$_}}){
                print $_,"\n";
            }
            print "\n";
        }

    Output
        XXXXXX:progs$ perl term_reader.pl ./term.txt
        Result of Extraction:
         key : HP:0000008list of values
         Abnormal internal genitalia
         Abnormality of the female genitalia

        key : HP:0000007list of values
         Mode of inheritance
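
One detail of the two-delimiter idea: splitting on a bare '!' leaves the space that follows it attached to the parent name. A small sketch of the same two splits with the padding absorbed by the pattern, run on a single line from the question (the variable names are illustrative):

```perl
use strict;
use warnings;

my $line = 'is_a: HP:0000812 ! Abnormal internal genitalia';

# First split: tag versus value, on the first run of whitespace only.
my ($tag, $value) = split /\s+/, $line, 2;

# Second split: parent id versus parent name. The pattern \s*!\s*
# swallows the padding around '!' so the name has no leading space.
my ($parent_id, $parent_name) = split /\s*!\s*/, $value, 2;

print "tag=$tag id=$parent_id name=$parent_name\n";
```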

    The Great Programmer is one who inspires others to code, not just one who writes great code
Re: Text manipulation on a file with multiple entries, obo format
by GotToBTru (Prior) on Oct 01, 2015 at 20:40 UTC

    In situations like this, let the use of the data structure dictate its design. How will you want to access this data?
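
To make that concrete, here is a small sketch with two access patterns over the same data. The hash names are invented for the example and the ids are taken from the question's sample. If the usual query is "what are the parents of this term?", a hash keyed by term id is enough. If it is "which terms have this parent?", an inverted index built once serves better.

```perl
use strict;
use warnings;

# Forward lookup: term id => parent ids.
my %parents_of = (
    'HP:0000007' => ['HP:0000005'],
    'HP:0000008' => ['HP:0000812', 'HP:0010460'],
);

# Reverse lookup: invert the structure once instead of scanning
# every term on every query.
my %children_of;
for my $id (keys %parents_of) {
    push @{ $children_of{$_} }, $id for @{ $parents_of{$id} };
}

print join(', ', @{ $children_of{'HP:0000812'} }), "\n";
```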

    Dum Spiro Spero
