parse problem

by Anonymous Monk
on Apr 20, 2003 at 01:35 UTC ( [id://251753] : perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I want to get a number out of an HTML source file. I want to parse data like:
gi|12345678|ref|NP_001234.1|
and get 12345678. I did the following:
$data = 'gi|12345678|ref|NP_001234.1|';
@data = split ('gi|',$data);
@data1 = split ('|ref',$data[1]);
$number = $data1[0];

I got e, g, ..., some weird letters. When I changed the code to the following:
$data = 'gi|12345678|ref|NP_001234.1|';
@data = split ('gi',$data);
@data1 = split ('ref',$data[1]);
$number = $data1[0];
I got |12345678|. I tried to use a regular expression to remove the |:
$number =~ m/[0-9]*/;

I got the same thing, |12345678|. What can I do? Please help, and thanks in advance!

Replies are listed 'Best First'.
Re: parse problem
by DrManhattan (Chaplain) on Apr 20, 2003 at 02:05 UTC
    The first argument to split() needs to be a regular expression matching the string that delimits the fields in your data. In your case, the fields in your line are separated by a '|', so the code could look like this:
    #!/usr/bin/perl
    use strict;
    my $data = 'gi|12345678|ref|NP_001234.1|';
    my @data = split /\|/, $data;
    my $number = $data[1];
    Or more concisely:
    #!/usr/bin/perl
    use strict;
    my $data = 'gi|12345678|ref|NP_001234.1|';
    my $number = (split(/\|/, $data))[1];
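To see why the quoted strings in the question misbehave, here is a minimal sketch comparing an unescaped bar, an escaped bar, and `quotemeta()`. The variable names `@chars`, `@fields`, `@fields2` and `$delim` are made up for illustration and are not from the thread.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $data = 'gi|12345678|ref|NP_001234.1|';

# A quoted string is still treated as a pattern by split. An unescaped '|'
# is alternation between two empty patterns, so it matches between every
# pair of characters and the string falls apart into single characters.
my @chars = split('|', $data);

# Escaping the bar makes it a literal delimiter.
my @fields = split(/\|/, $data);

# quotemeta() escapes metacharacters when the delimiter lives in a variable.
my $delim   = '|';
my @fields2 = split(quotemeta($delim), $data);

print scalar(@chars), " pieces without escaping\n";
print "fields: @fields\n";
print "same with quotemeta: @fields2\n";
```

The same applies to 'gi|' and '|ref' in the question: each contains an empty alternative, so the split can match the empty string between characters.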


Re: parse problem
by dpuu (Chaplain) on Apr 20, 2003 at 01:46 UTC
    Your problem may be that the first arg to split is a regular expression, and the vertical bar is the alternation operator. In 'gi|' and '|ref' one side of the bar is an empty expression, which can always match. If you only want the one number you show, then you could use:
    $data =~ /gi\|(\d+)\|ref/ and $number = $1;
    Note that the vertical bar is escaped using the backslash. --Dave
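The last attempt in the question, `$number =~ m/[0-9]*/;`, left `$number` untouched because a bare match only tests the string and never modifies it. A small sketch of that behaviour, next to two ways of actually pulling out the digits; `$matched`, `$digits` and `$stripped` are illustrative names only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $number = '|12345678|';

# m// only reports whether the pattern matched; $number stays as it was.
# [0-9]* also succeeds on an empty match at the very start of the string.
my $matched = $number =~ m/[0-9]*/;

# Capturing with parentheses and \d+ (one or more digits) extracts the value.
my ($digits) = $number =~ /(\d+)/;

# Alternatively, copy the string and delete every non-digit with tr///.
(my $stripped = $number) =~ tr/0-9//cd;

print "unchanged: $number\n";
print "captured:  $digits\n";
print "stripped:  $stripped\n";
```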
Re: parse problem
by artist (Parson) on Apr 20, 2003 at 04:50 UTC
    You have already received good solutions.
    Your algorithm should be:
    A. split the data on the delimiter pattern (the pipe symbol in your case)
    B. get the second item from the result of the above split.
    Learn more about split.
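Steps A and B could be wrapped in a small helper, roughly as below. This is only a sketch: `second_field` is a hypothetical name, and returning undef for a line without a second field is an assumption about what the caller wants.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# second_field() is a hypothetical helper, not something from the thread.
sub second_field {
    my ($line) = @_;
    my @parts = split(/\|/, $line);   # step A: split on the literal pipe
    return $parts[1];                 # step B: second item (undef if absent)
}

my $number = second_field('gi|12345678|ref|NP_001234.1|');
print defined $number ? "$number\n" : "no second field\n";
```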