PerlMonks  

Accessing data between two tags

by ant (Scribe)
on Nov 09, 2006 at 11:13 UTC (#583082=perlquestion)

ant has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

A practical pattern-matching question is catching me out.

I have a large text file (over a gigabyte in size). Each line contains <CS_REFCLT>12526489</CS_REFCLT> somewhere in it. I would like to get at the number between the tags, but because the tag is not at a fixed position in the line I can't use substr to get at it. I have worked around this by using split, as below.
    while (my $line = <SESAME>) {
        my ($tempa, $tempb) = split(/<CS_REFCLT>/, $line);
        my ($value, $tempc) = split(/<\/CS_REFCLT>/, $tempb);
    }
However, I'd also like a pattern-match version so that I can compare the speed of the two approaches, as I think a regular expression will be quicker and would speed up the program.

A pattern-match snippet for this would therefore be much appreciated.

Thanks in Advance

Ant
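
Since the question is about comparing speeds, here is a minimal sketch of how the two approaches could be timed with the core Benchmark module. The sample line and the sub names by_split and by_regex are made up for illustration, and results on one short in-memory line may not reflect behaviour on the real gigabyte-sized file, so treat it as a starting point rather than a verdict.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical sample line; the real file's lines will differ.
my $line = "head text <CS_REFCLT>12526489</CS_REFCLT> tail text\n";

# The two-step split from the question.
sub by_split {
    my ($l) = @_;
    my (undef, $rest) = split /<CS_REFCLT>/, $l;
    my ($value) = split /<\/CS_REFCLT>/, $rest;
    return $value;
}

# A single capturing match.
sub by_regex {
    my ($l) = @_;
    my ($value) = $l =~ m{<CS_REFCLT>(\d+)</CS_REFCLT>};
    return $value;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    split => sub { by_split($line) },
    regex => sub { by_regex($line) },
});
```

Timing a run over a representative slice of the real file would give a more trustworthy answer, because I/O may dominate the cost of either approach.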

Replies are listed 'Best First'.
Re: Accessing data between two tags
by Skeeve (Parson) on Nov 09, 2006 at 13:23 UTC
    The others already gave regexes. But how about this split approach:
    while (my $line = <SESAME>) {
        my ($tempa, $value, $tempb) = split m#</?CS_REFCLT>#, $line, 3;
    }

    Update: Thanks to jdporter for telling me about my mistake, using 2 instead of 3 above. Fixed...

    Of course this will also find other constructs like </CS_REFCLT>dfasdf</CS_REFCLT>.
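
To make that caveat concrete, here is a small sketch comparing the lenient split with a stricter capturing match. The sub names lenient and strict_match and both sample lines are invented for the illustration, and the `//` operator assumes Perl 5.10 or later.

```perl
use strict;
use warnings;

# Lenient: the pattern matches either the opening or the closing tag.
sub lenient {
    my ($line) = @_;
    my (undef, $value) = split m#</?CS_REFCLT>#, $line, 3;
    return $value;
}

# Strict: requires an opening tag, digits, then the closing tag.
sub strict_match {
    my ($line) = @_;
    my ($value) = $line =~ m#<CS_REFCLT>(\d+)</CS_REFCLT>#;
    return $value;
}

my $good = 'x <CS_REFCLT>12526489</CS_REFCLT> y';
my $bad  = 'x </CS_REFCLT>dfasdf</CS_REFCLT> y';

for my $line ($good, $bad) {
    printf "lenient=%s strict=%s\n",
        lenient($line)      // 'undef',
        strict_match($line) // 'undef';
}
# prints:
# lenient=12526489 strict=12526489
# lenient=dfasdf strict=undef
```

If the input is known to be well formed, the lenient version is harmless. The strict version is the safer choice when malformed lines should be skipped.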

    But maybe, if the data is XML, a real XML parser like XML::Twig is something that would help you best here...
    #!/usr/bin/perl
    use strict;
    use warnings;
    use XML::Twig;

    my $twig = XML::Twig->new(
        twig_handlers => {
            CS_REFCLT => \&cs_refclt,
        },
    );

    my @numbers;
    $twig->parsefile( 'filename' );
    # here you will have all numbers in @numbers.

    sub cs_refclt {
        my ($t, $elt) = @_;
        push @numbers, $elt->text();
    }

    s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
    +.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e
Re: Accessing data between two tags
by prasadbabu (Prior) on Nov 09, 2006 at 11:21 UTC

    Hi ant,

    Are you looking for something like this?

    my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;

    or

    my (@value) = $line =~ m|<CS_REFCLT>((?:(?!</CS_REFCLT>).)*)</CS_REFCLT>|g;

    Also take a look at perlre.

    Prasad
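
For context, here is one way the first regex above might sit inside a read loop. This is only a sketch: the sub name extract_refclt is hypothetical, and an in-memory filehandle with made-up data stands in for the real file.

```perl
use strict;
use warnings;

# Read lines from a filehandle and return every CS_REFCLT number found.
sub extract_refclt {
    my ($fh) = @_;
    my @values;
    while (my $line = <$fh>) {
        # List context with /g: one captured number per tag on the line.
        push @values, $line =~ m{<CS_REFCLT>(\d+)</CS_REFCLT>}g;
    }
    return @values;
}

# In-memory filehandle standing in for the real file (sample data is made up).
my $sample = "a <CS_REFCLT>12526489</CS_REFCLT> b\n"
           . "no tag on this line\n"
           . "c <CS_REFCLT>42</CS_REFCLT> d\n";

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print join(',', extract_refclt($fh)), "\n";   # prints 12526489,42
close $fh;
```

For a file of over a gigabyte, it is probably better to process each value inside the loop than to collect everything in an array first.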

      I think you mean my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g; as you are trying to pull out a scalar, not an array.

      Cheers,

      JohnGG

        johngg,

        I am getting an array as output. With your solution we get only one value, even with the 'g' modifier.

        use strict;
        use warnings;

        my $line = 'some text <CS_REFCLT>12121</CS_REFCLT> then some text <CS_REFCLT>4654</CS_REFCLT> here';
        my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
        my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
        $" = "\t";
        print "Array: @value\n";
        print "Scalar: $value\n";

        prints:
        -------
        Array: 12121	4654
        Scalar: 12121

        Prasad
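
As a side note on the list-versus-scalar point, the same regex can also be used with /g in scalar context, where each evaluation picks up from where the previous match ended. The sketch below uses made-up sample data, and the variable names @seen and $count are arbitrary.

```perl
use strict;
use warnings;

my $line = 'some text <CS_REFCLT>12121</CS_REFCLT> then some text '
         . '<CS_REFCLT>4654</CS_REFCLT> here';

# Scalar context with /g: one match per loop iteration, capture in $1.
my @seen;
while ($line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g) {
    push @seen, $1;
}
print "@seen\n";    # prints 12121 4654

# Counting matches: assign the list of matches to an empty list,
# then evaluate that assignment in scalar context.
my $count = () = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
print "$count\n";   # prints 2
```

The loop form is handy when each value should be handled as it is found instead of being collected first.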

Re: Accessing data between two tags
by planetscape (Chancellor) on Nov 10, 2006 at 00:55 UTC

    In general, parsing tag-delimited data with a regex is fraught with peril, and can cause all manner of interesting failures, like segfaults. While I do not know with certainty the format you are parsing, I would strongly recommend you use a parser built and tested to work with the kind of data you are processing, one such as XML::Twig, XML::TreeBuilder, or HTML::TreeBuilder, for example.

    HTH,

    planetscape
Re: Accessing data between two tags
by Jenda (Abbot) on Dec 29, 2006 at 17:57 UTC
    use XML::Rules;

    my @numbers;
    my $parser = XML::Rules->new(
        rules => [
            '_default'  => '',    # not interested in most tags
            'CS_REFCLT' => sub { push @numbers, $_[1]->{_content}; return },
        ],
    );
    $parser->parse($filename);

    This way you don't have to worry whether there's just one <CS_REFCLT> on a line or whether there are more, etc.

Node Type: perlquestion [id://583082]
Approved by prasadbabu