Text manipulation on a file with multiple entries, obo format

Sakti has asked for the wisdom of the Perl Monks concerning the following question:

Dear monks,
I have been working with a file with the .obo extension. In summary, this is the general format:

###############
[ Term ]
id: HP:0000007
name: Autosomal recessive inheritance
alt_id: HP:0001416
alt_id: HP:0001526
def: "A mode of inheritance that is observed for traits related to a gene encoded on one of the autosomes (i.e., the human chromosomes 1-22) in which a trait manifests in homozygotes. In the context of medical genetics, autosomal recessive disorders manifest in homozygotes (with two copies of the mutant allele) or compound heterozygotes (whereby each copy of a gene has a distinct mutant allele)." [ HPO:curators]
synonym: "Autosomal recessive" EXACT [ ]
synonym: "AUTOSOMAL RECESSIVE FORM" RELATED [ HPO:skoehler]
synonym: "Autosomal recessive predisposition" RELATED []
is_a: HP:0000005 ! Mode of inheritance

[ Term ]
id: HP:0000008
name: Abnormality of female internal genitalia
def: "An abnormality of the female internal genitalia." [HPO:probinson]
is_a: HP:0000812 ! Abnormal internal genitalia
is_a: HP:0010460 ! Abnormality of the female genitalia
property_value: HP:0040005 "An abnormality of the `female internal genitalia` (FMA:45654)." xsd:string {xref="HPO:probinson"}
##################

I've been struggling to build a hash in which the id of each Term becomes a key, and the text after the ! in each is_a: field becomes an element of an array stored under that key. Something like:

$hash{'HP:0000007'}[0] = "Mode of inheritance";
$hash{'HP:0000008'}[0] = "Abnormal internal genitalia";
$hash{'HP:0000008'}[1] = "Abnormality of the female genitalia";

An alternative to the array would be a hash of hashes, where I save all the values associated with an id and can access the is_a: fields directly. Can one of our advanced brothers enlighten me? Thanks a lot! Sakti


Replies are listed 'Best First'.
Re: Text manipulation on a file with multiple entries, obo format by karlgoethebier (Abbot) on Oct 01, 2015 at 19:51 UTC
OBO::Parser::OBOParser?

«The Crux of the Biscuit is the Apostrophe»
Re^2: Text manipulation on a file with multiple entries, obo format by u65 (Chaplain) on Oct 01, 2015 at 20:11 UTC
Who "woulda thunk it!" The problem-domain breadth of the treasures to be found on CPAN continues to amaze me! Thanks, karlgoethebier, for bringing that module to light.
Re^2: Text manipulation on a file with multiple entries, obo format by Sakti (Novice) on Oct 01, 2015 at 21:29 UTC
My, I didn't know about this one, but I am so grateful for your reply, thanks!!
Re: Text manipulation on a file with multiple entries, obo format by 2teez (Vicar) on Oct 01, 2015 at 21:24 UTC
Hi, if I may add to what others have said here: please, please read perldsc - Perl Data Structures Cookbook. You will be glad you did!

If you tell me, I'll forget. If you show me, I'll remember. If you involve me, I'll understand. --- Author unknown to me
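For readers who have not opened perldsc yet, here is a minimal sketch of the hash-of-arrays shape the original post asks for. The ids and names come from the sample data in the question; the variable name %parents and the id HP:9999999 are made up for illustration.

```perl
use strict;
use warnings;

# Hash of arrays: each term id maps to a reference to an array of parent names.
# The ids and names are the ones from the original post.
my %parents = (
    'HP:0000007' => [ 'Mode of inheritance' ],
    'HP:0000008' => [ 'Abnormal internal genitalia' ],
);

# push extends (or creates) the inner array through the reference.
push @{ $parents{'HP:0000008'} }, 'Abnormality of the female genitalia';

# A single element, the element count, and a loop over all elements.
my $second = $parents{'HP:0000008'}[1];
my $count  = scalar @{ $parents{'HP:0000008'} };
print "$second ($count parents)\n";
print "HP:0000008 is_a $_\n" for @{ $parents{'HP:0000008'} };

# exists checks for a key without creating it.
print "unknown id\n" unless exists $parents{'HP:9999999'};   # made-up id
```

The `@{ ... }` wrapper is what turns the stored reference back into an array, which is the part of perldsc that tends to trip people up first.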
Re^2: Text manipulation on a file with multiple entries, obo format by Sakti (Novice) on Oct 01, 2015 at 21:28 UTC
Thank you all for your replies, will check out that book and the CPAN module!
Re: Text manipulation on a file with multiple entries, obo format by baxy77bax (Deacon) on Oct 01, 2015 at 20:01 UTC
OK, I don't quite get your problem, but as far as I understood you want something like this:

    use strict;
    use Data::Dumper;

    open(my $in, "<", "in.obo") || die "Cannot open in.obo: $!";
    my %hash;
    my $term;
    while (<$in>) {
        chomp;
        if (/^id:\s+(.*)/) {
            $term = $1;                    # id of the current term
        }
        elsif (/^is_a:\s+.*?\s+!\s+(.*)/) {
            push(@{ $hash{$term} }, $1);   # text after the "!"
        }
    }
    close $in;
    print Dumper(\%hash);

Did I get it right? PS: code not tested
Re^2: Text manipulation on a file with multiple entries, obo format by Sakti (Novice) on Oct 01, 2015 at 21:39 UTC
Thank you, I will test this code as well as the CPAN module, which seems like the easiest way to do it. Best!
Re: Text manipulation on a file with multiple entries, obo format by perlron (Pilgrim) on Oct 01, 2015 at 22:56 UTC
Hi, I am not familiar with the OBO data set, but I was just looking at it and thought I could share some technical input. Your requirement was clear enough: you wish to build a data structure such as a hash of arrays, from which you can readily extract your data by the id key. However, as I mentioned, your data set has some interesting features. I hope the OBO parser solves your problem, because code written from scratch for such data sets is hardly extensible.

In any case, I was able to find a way to structure your data set using the space delimiter and the '!' character delimiter. If you think about it, it might not be a great approach, but again, that is your data set :D

To do this in practice you would need a thorough understanding of Perl's data structures. But not to worry, you can read the docs and figure it out: perldata and perlreftut to begin with. The key feature I needed for the extraction was the use of anonymous arrays and references.

Even though the code works, I can only surmise it is hardly extensible if your requirement changes, and if you were asked to analyse a data set of a million or so records, I think it is best to use the standard module (like OBO::Parser).

Note - I created a file from your data in the OP and passed it as an argument to the script below.

    #!/usr/bin/perl
    use strict;

    my (%hash, $hash_id);
    my $isa_array_ref;

    open(my $fh, "<", $ARGV[0]) || die "$0: can't open $ARGV[0] for reading: $!";
    LINE: while (<$fh>) {
        chomp($_);
        next LINE if ($_ eq "Term");
        # split on first blank space
        my @TermRow = split(/ /, $_, 2);
        if ($TermRow[0] eq 'id:') {
            $hash_id = $TermRow[1];
            $isa_array_ref = undef;
        }
        elsif ($TermRow[0] eq 'is_a:') {
            my @TermISAText = split(/!/, $TermRow[1]);
            # checking if anonymous array reference already exists
            if ($isa_array_ref) {
                my @temp_array = @{$isa_array_ref};
                push(@temp_array, $TermISAText[1]);
                $isa_array_ref = \@temp_array;
                $hash{$hash_id} = $isa_array_ref;
            }
            else {
                # creating an anonymous array reference
                $isa_array_ref = [$TermISAText[1]];
                $hash{$hash_id} = $isa_array_ref;
            }
        }
    }
    close($fh);

    print "Result of Extraction:\n ";
    my @id_keys = keys %hash;
    foreach (@id_keys) {
        print "key : $_";
        print "list of values \n";
        foreach (@{$hash{$_}}) {
            print $_, "\n";
        }
        print "\n";
    }

Output

    XXXXXX:progs$ perl term_reader.pl ./term.txt
    Result of Extraction:
    key : HP:0000008list of values
    Abnormal internal genitalia
    Abnormality of the female genitalia

    key : HP:0000007list of values
    Mode of inheritance

The Great Programmer is one who inspires others to code, not just one who writes great code
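The explicit check for an existing array reference can be left to Perl: pushing onto `@{ $hash{$id} }` creates the anonymous array on first use (autovivification). Below is a minimal sketch of the same extraction written that way. It is an alternative, not the code from the reply above. It reads trimmed sample stanzas from an in-memory string so it runs without a file, and the header pattern accepts both `[Term]` and the spaced `[ Term ]` form shown in the question. The names $obo, %hash and $id are illustrative.

```perl
use strict;
use warnings;

# Sample stanzas adapted from the original post (only the relevant tags).
my $obo = <<'END';
[Term]
id: HP:0000007
name: Autosomal recessive inheritance
is_a: HP:0000005 ! Mode of inheritance

[Term]
id: HP:0000008
name: Abnormality of female internal genitalia
is_a: HP:0000812 ! Abnormal internal genitalia
is_a: HP:0010460 ! Abnormality of the female genitalia
END

# Open the string as an in-memory file; swap in a real path for real data.
open my $fh, '<', \$obo or die "cannot open in-memory file: $!";

my (%hash, $id);
while ( my $line = <$fh> ) {
    chomp $line;
    if ( $line =~ /^\[\s*Term\s*\]/ ) {
        undef $id;                    # new stanza: forget the previous id
    }
    elsif ( $line =~ /^id:\s*(\S+)/ ) {
        $id = $1;
    }
    elsif ( defined $id and $line =~ /^is_a:\s*\S+\s*!\s*(.*?)\s*$/ ) {
        push @{ $hash{$id} }, $1;     # array is created on the first push
    }
}
close $fh;

for my $key ( sort keys %hash ) {
    print "$key\n";
    print "  $_\n" for @{ $hash{$key} };
}
```

The trailing `\s*$` in the is_a pattern also trims surrounding whitespace from the captured name, so the stored values carry no leading space.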
Re: Text manipulation on a file with multiple entries, obo format by GotToBTru (Prior) on Oct 01, 2015 at 20:40 UTC
In situations like this, let the use of the data structure dictate its design. How will you want to access this data?

Dum Spiro Spero
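To make that concrete, here is a hedged sketch of how two different access questions lead to two different indexes. It assumes the records have already been parsed; the @records list, the field names and the %term and %children hashes are illustrative, with values taken from the sample data in the question.

```perl
use strict;
use warnings;

# Records as they might come out of a parser (values from the original post).
my @records = (
    { id   => 'HP:0000007',
      name => 'Autosomal recessive inheritance',
      is_a => { 'HP:0000005' => 'Mode of inheritance' } },
    { id   => 'HP:0000008',
      name => 'Abnormality of female internal genitalia',
      is_a => { 'HP:0000812' => 'Abnormal internal genitalia',
                'HP:0010460' => 'Abnormality of the female genitalia' } },
);

my (%term, %children);
for my $rec (@records) {
    $term{ $rec->{id} } = $rec;              # hash of hashes: lookup by id
    push @{ $children{$_} }, $rec->{id}      # reverse index: parent id => child ids
        for keys %{ $rec->{is_a} };
}

# Question 1: what are the parents of HP:0000008?
my @parent_names = sort values %{ $term{'HP:0000008'}{is_a} };
print join( ', ', @parent_names ), "\n";

# Question 2: which terms have HP:0000812 as a parent?
my @kids = @{ $children{'HP:0000812'} || [] };
print join( ', ', @kids ), "\n";
```

If only the first question ever gets asked, the hash of arrays from the original post is enough; the reverse index earns its place only when lookups by parent are needed.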

