Accessing data between two tags

ant has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

A practical pattern-matching question is catching me out.

I have a large text file (over a gigabyte), and each line contains something like <CS_REFCLT>12526489</CS_REFCLT> somewhere in it. I would like to get at the number between the tags, but since it is not at a fixed position in the line I can't use substr. I have worked around this with split, as below.

while (my $line = <SESAME>){

  my ($tempa, $tempb) = split (/<CS_REFCLT>/,$line);
  my ($value, $tempc) = split (/<\/CS_REFCLT>/,$tempb);

}

However, I'd also like to do this with a pattern match so that I can compare the speed of the two approaches, as I suspect a regular expression will be quicker.

A pattern-match snippet for this would be much appreciated.

Thanks in Advance

Ant


Replies are listed 'Best First'.

Re: Accessing data between two tags
by Skeeve (Parson) on Nov 09, 2006 at 13:23 UTC

while (my $line = <SESAME>){

  my ($tempa, $value, $tempb)= split m#</?CS_REFCLT>#, $line, 3;
}

Update: Thanks to jdporter for telling me about my mistake, using 2 instead of 3 above. Fixed...

Of course this will also find other constructs like </CS_REFCLT>dfasdf</CS_REFCLT>.
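A minimal sketch of that caveat, using made-up sample strings and a hypothetical helper named between_tags that wraps the split above. Because the pattern accepts either tag, split cannot tell an opening tag from a closing one:

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the split shown above.
sub between_tags {
    my ($line) = @_;
    # Limit 3: text before the first tag, text between the tags, the rest.
    my (undef, $value) = split m#</?CS_REFCLT>#, $line, 3;
    return $value;
}

# Well-formed input: the text between the two tags is returned.
print between_tags('abc <CS_REFCLT>12526489</CS_REFCLT> xyz'), "\n";

# Malformed input with two closing tags is accepted just the same.
print between_tags('abc </CS_REFCLT>dfasdf</CS_REFCLT> xyz'), "\n";
```

If the data is known to be clean this does not matter; otherwise a stricter pattern or a real parser is the safer choice.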

#!/usr/bin/perl

use strict;
use warnings;

use XML::Twig;

my $twig= XML::Twig->new(
    twig_handlers => {
        CS_REFCLT          => \&cs_refclt,
    },
);

my @numbers;

$twig->parsefile( 'filename' );
# here you will have all numbers in @numbers.


sub cs_refclt {
    my ($t, $elt)= @_;
    push @numbers, $elt->text();
    $t->purge;    # free what has been parsed so far; matters for a file over a gigabyte
}

s$$([},&%#}/&/]+}%&{})*;#$&&s&&$^X.($'^"%]=\&(|?*{%
+.+=%;.#_}\&"^"-+%*).}%:##%}={~=~:.")&e&&s""`$''`"e


Re: Accessing data between two tags
by prasadbabu (Prior) on Nov 09, 2006 at 11:21 UTC

Hi ant,

Are you looking for something like this?

my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;
or
my (@value) = $line =~ m|<CS_REFCLT>((?:(?!</CS_REFCLT>).)*)</CS_REFCLT>|g;

Also take a look at perlre.
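Since the original question was about speed, here is a rough sketch of how the two approaches could be compared with the core Benchmark module. The sample line and the helper names via_split and via_regex are made up for illustration; timings depend on your Perl version and on what the real lines look like, so treat any result as indicative only.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Made-up sample line; real lines from the file will differ.
my $line = 'junk junk <CS_REFCLT>12526489</CS_REFCLT> more junk';

# The two-step split from the original question.
sub via_split {
    my ($text) = @_;
    my (undef, $rest) = split /<CS_REFCLT>/, $text;
    my ($value) = split /<\/CS_REFCLT>/, $rest;
    return $value;
}

# A single capturing match.
sub via_regex {
    my ($text) = @_;
    my ($value) = $text =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|;
    return $value;
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese(-1, {
    'split' => sub { via_split($line) },
    'regex' => sub { via_regex($line) },
});
```

On a file this large, reading the data from disk may well dominate the run time, so it is worth timing the whole program as well as the extraction step.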

Prasad


Re^2: Accessing data between two tags

by johngg (Canon) on Nov 09, 2006 at 11:53 UTC

my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;

Cheers,

JohnGG


Re^3: Accessing data between two tags

by prasadbabu (Prior) on Nov 09, 2006 at 12:12 UTC

johngg,

I am getting an array as output. With your solution we get only the first value, even with the 'g' modifier, because the list on the left has room for just one element.

use strict;
use warnings;

my $line = 'some text <CS_REFCLT>12121</CS_REFCLT> then some text <CS_REFCLT>4654</CS_REFCLT> here';

my (@value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;

my ($value) = $line =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|g;

$" ="\t";

print "Array: @value\n";
print "Scalar: $value\n";

prints:
-------
Array: 12121    4654
Scalar: 12121

Prasad


Re^4: Accessing data between two tags

by johngg (Canon) on Nov 09, 2006 at 13:56 UTC

Re: Accessing data between two tags
by planetscape (Chancellor) on Nov 10, 2006 at 00:55 UTC

In general, parsing tag-delimited data with a regex is fraught with peril, and can cause all manner of interesting failures, like segfaults. While I do not know with certainty the format you are parsing, I would strongly recommend you use a parser built and tested to work with the kind of data you are processing, one such as XML::Twig, XML::TreeBuilder, or HTML::TreeBuilder, for example.
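As a small illustration of the quieter kind of failure, the sketch below runs the simple pattern from earlier in the thread over three made-up fragments. The helper name extract and the attribute are hypothetical. All three fragments are legal XML for the same element, since XML allows attributes on the start tag and whitespace before the closing '>' of an end tag.

```perl
use strict;
use warnings;

# Hypothetical helper: the simple pattern suggested earlier in the thread.
sub extract {
    my ($text) = @_;
    my ($value) = $text =~ m|<CS_REFCLT>(\d+)</CS_REFCLT>|;
    return $value;
}

my @samples = (
    '<CS_REFCLT>12526489</CS_REFCLT>',           # the form the regex expects
    '<CS_REFCLT type="x">12526489</CS_REFCLT>',  # attribute on the start tag
    "<CS_REFCLT>12526489</CS_REFCLT\n>",         # whitespace inside the end tag
);

for my $sample (@samples) {
    my $value = extract($sample);
    print defined $value ? "found $value\n" : "no match\n";
}
```

Only the first sample yields a value; the other two are skipped without any error, which is exactly the sort of problem a real parser avoids.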

HTH,

planetscape


Re: Accessing data between two tags
by Jenda (Abbot) on Dec 29, 2006 at 17:57 UTC

use XML::Rules;

my @numbers;
my $parser = XML::Rules->new(
 rules => [
  '_default' => '', # not interested in most tags
  'CS_REFCLT' => sub {push @numbers, $_[1]->{_content}; return},
 ],
);

$parser->parsefile($filename);  # parse() expects the XML text or a filehandle, not a file name

This way you don't have to worry whether there's just one <CS_REFCLT> on a line or whether there are more, etc.

Jenda
Support Denmark!
Defend the free world!


