Re^2: XML processing taking too much time

Re^3: XML processing taking too much time
by mirod (Canon) on Mar 26, 2009 at 09:47 UTC

I was just surprised that you could use XML::DOM at all on files of that size. And it looks like you can't, actually: a 1 GB XML file would take at least 8 GB of memory with XML::DOM. So it might be interesting to know how you did it. What I meant was that if you had been able to do it by throwing large amounts of memory at the problem, then XML::LibXML would have been an option.

With XML::Twig you can very easily extract the k/v pairs:

use XML::Twig;

my (@keys, @values);

my $t = XML::Twig->new(
    twig_roots => {
        SigData => sub {
            push @keys,   $_->field('Key');
            push @values, $_->field('Value');
            $_->purge;    # release what has been parsed so far
        },
    },
)->parsefile("my_big_fat_xml_file.xml");

Of course the @keys and @values arrays are going to be huge too, so you might still want to add a few GB of RAM to your machine, but at least the parsed XML itself never holds much more than the current SigData element in memory.

Other possible options are XML::Rules (I expect Jenda to show up and give you an example as soon as he wakes up), and maybe the new XML::Reader, which seems quite appropriate. XML::LibXML's pull mode might also be appropriate, but I have never used it so I can't comment on it.
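
For what it's worth, here is a rough sketch of what the XML::LibXML pull-mode approach might look like, using XML::LibXML::Reader. It is untested against the real data: the element names (KVPair, Key, Value) are assumed from the sample posted further down the thread, and the inline string stands in for the real file, which you would open with `location => $path` instead.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Inline sample standing in for the real file (assumed structure).
my $xml = <<'XML';
<root>
<SigData>
<KVPair>
<Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
<Value>c73df5228c35c419f884ba9571310cd7</Value>
</KVPair>
</SigData>
<SigData>
<KVPair>
<Key>EB08F9990AE6545F9EA625412C71F24F7BF007ED</Key>
<Value>C73DF5228C35C419F884BA9571310CD7</Value>
</KVPair>
</SigData>
</root>
XML

my $reader = XML::LibXML::Reader->new(string => $xml)
    or die "cannot create reader\n";

my (@keys, @values);

# nextElement() moves forward to the next <KVPair> in document order
# and returns 1 while one is found.
while ($reader->nextElement('KVPair') == 1) {
    # Deep-copy only this element into a small DOM fragment.
    my $pair = $reader->copyCurrentNode(1);
    push @keys,   $pair->findvalue('Key');
    push @values, $pair->findvalue('Value');
}

print "$keys[$_] => $values[$_]\n" for 0 .. $#keys;
```

Only one KVPair subtree is turned into a DOM at any time, so the memory used by the XML itself should stay small; the two arrays still grow with the input.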


Re^4: XML processing taking too much time

by Jenda (Abbot) on Mar 27, 2009 at 13:42 UTC

:-))

If you are sure each <KVPair> contains both <Key> and <Value> and is always in <SigData> you can use something as simple as this:

use XML::Rules;

my (@keys, @values);

my $parser = XML::Rules->new(
    rules => {
        _default => '',
        Key => sub {push @keys, $_[1]->{_content}},
        Value => sub {push @values, $_[1]->{_content}},
    },
);
$parser->parse(\*DATA);

use Data::Dumper;
print Dumper(\@keys);
print Dumper(\@values);

__DATA__
<root>
<SigData>
<KVPair>
<Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
<Value>c73df5228c35c419f884ba9571310cd7</Value>
</KVPair>
<bogus>sdf sdhf nsdfg sdfgh nserg sfgdfgh</bogus>
</SigData>
<SigData>
<KVPair>
<Key>EB08F9990AE6545F9EA625412C71F24F7BF007ED</Key>
<Value>C73DF5228C35C419F884BA9571310CD7</Value>
</KVPair>
</SigData>
</root>

If there is more in the XML, you may skip some tags and their children by adding:

  start_rules => {
    'the,list,of,such,tags' => 'skip'
  },
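
To show where that fragment goes, here is a minimal sketch of the first parser with a `start_rules` entry added. The choice of `bogus` as the tag to skip, and the `<Key>` nested inside it, are made up for illustration only.

```perl
use strict;
use warnings;
use XML::Rules;

my (@keys, @values);

my $parser = XML::Rules->new(
    # Skip <bogus> and everything inside it (illustrative tag choice).
    start_rules => {
        'bogus' => 'skip',
    },
    rules => {
        _default => '',
        Key   => sub { push @keys,   $_[1]->{_content} },
        Value => sub { push @values, $_[1]->{_content} },
    },
);

# The <Key> inside <bogus> is there only to show that it gets skipped.
$parser->parse(<<'XML');
<root>
<SigData>
<KVPair>
<Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
<Value>c73df5228c35c419f884ba9571310cd7</Value>
</KVPair>
<bogus><Key>should not be collected</Key></bogus>
</SigData>
</root>
XML

print "keys: @keys\n";
print "values: @values\n";
```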

If you do not want to use the globals, you may do something like:

my $parser = XML::Rules->new(
    stripspaces => 3,
    rules => {
        _default => '',
        Key => 'content',
        Value => 'content',
        KVPair => 'pass',
        SigData => sub {return '@keys' => $_[1]->{Key}, '@values' => $_[1]->{Value}},
        root => 'pass',
    },
);
my $data = $parser->parse(\*DATA);

use Data::Dumper;
print Dumper($data);

Actually, are you sure you want to build two interrelated arrays? Wouldn't it make more sense to create a single hash? Or maybe process each pair as soon as you read it instead of keeping them all in memory?

The first (a single hash) would be:

my $parser = XML::Rules->new(
    stripspaces => 3,
    rules => {
        _default => '',
        Key => 'content',
        Value => 'content',
        KVPair => sub {return $_[1]->{Key} => $_[1]->{Value}},
        SigData => 'pass',
        root => 'pass',
    },
);
my $data = $parser->parse(\*DATA);
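
The second option, processing each pair as soon as it is read, could look something like the sketch below. `handle_pair` is a hypothetical placeholder for whatever you actually want to do with a pair, and `$count` is only there to show that it was called. The KVPair rule returns an empty list, so nothing is handed up to the parent tags and nothing accumulates.

```perl
use strict;
use warnings;
use XML::Rules;

my $count = 0;

# Hypothetical per-pair handler; replace the body with the real work.
sub handle_pair {
    my ($key, $value) = @_;
    print "$key => $value\n";
    $count++;
}

my $parser = XML::Rules->new(
    stripspaces => 3,
    rules => {
        _default => '',
        Key      => 'content',
        Value    => 'content',
        KVPair   => sub {
            my ($tag, $attrs) = @_;
            handle_pair($attrs->{Key}, $attrs->{Value});
            return;    # empty list: nothing is passed up to <SigData>
        },
    },
);

$parser->parse(<<'XML');
<root>
<SigData>
<KVPair>
<Key>eb08f9990ae6545f9ea625412c71f24f7bf007ed</Key>
<Value>c73df5228c35c419f884ba9571310cd7</Value>
</KVPair>
</SigData>
<SigData>
<KVPair>
<Key>EB08F9990AE6545F9EA625412C71F24F7BF007ED</Key>
<Value>C73DF5228C35C419F884BA9571310CD7</Value>
</KVPair>
</SigData>
</root>
XML
```

For the real file you would pass an open filehandle to `parse` as in the examples above, rather than a string.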
