PerlMonks  

Creating a xml file in chunks

by vagabonding electron (Curate)
on Jul 12, 2016 at 15:07 UTC ( #1167626=perlquestion )

vagabonding electron has asked for the wisdom of the Perl Monks concerning the following question:

Dear All,

I am parsing a large number of data files and need to save the output in an XML file. A minimal running example is below (contrived and simplified, of course, since the real data runs to gigabytes). My problem is that if I assemble the complete XML document in memory, it grows too large, so calling $doc->setDocumentElement($root); print $doc->toString() is not an option. So I try to print the output in chunks, as I would normally do with CSV output. The code below does print chunks (single composed nodes; in real life it would be one composed node per data file). What I have not figured out is how to print the root element nicely. Currently I just hardcode the opening tag at the beginning and the closing tag at the end. Is there a nicer way to do this (or perhaps a nicer way to create XML in the first place)?

Thanks in advance!

#!/perl
use strict;
use warnings FATAL => qw(all);
use Text::CSV_XS;
use XML::LibXML;

my $csv_par = {
    binary           => 1,
    auto_diag        => 1,
    allow_whitespace => 1,
    sep_char         => ';',
    eol              => $/,
    quote_char       => undef,
};
my $csv = Text::CSV_XS->new($csv_par);
my @header = @{$csv->getline(*DATA)};
my %rec;
$csv->bind_columns(\@rec{@header});

my $doc = XML::LibXML::Document->new('1.0', 'utf-8');
my $root = $doc->createElement("ROOT");
print join("\n", '<?xml version="1.0" encoding="UTF-8"?>', '<ROOT>'),$/;

while ( $csv->getline(*DATA) ) {
    my $line_tag = $doc->createElement("alpha");
    $line_tag->setAttribute('name'=> $rec{"alpha"});
    # $root->appendChild($line_tag); # intentional.
    for my $other ( qw(beta gamma) ) {
        my $other_tag = $doc->createElement($other);
        $other_tag->setAttribute(name => $rec{$other});
        $line_tag->appendChild($other_tag);
    }
    print $line_tag->toString(1),$/;
}
print '</ROOT>',$/;

=output
<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
<alpha name="q">
  <beta name="2"/>
  <gamma name="3"/>
</alpha>
<alpha name="w">
  <beta name="9"/>
  <gamma name="8"/>
</alpha>
<alpha name="e">
  <beta name="1"/>
  <gamma name="2"/>
</alpha>
<alpha name="r">
  <beta name="6"/>
  <gamma name="7"/>
</alpha>
<alpha name="t">
  <beta name="5"/>
  <gamma name="9"/>
</alpha>
<alpha name="y">
  <beta name="3"/>
  <gamma name="1"/>
</alpha>
</ROOT>
=cut

__DATA__
alpha;beta;gamma
q;2;3
w;9;8
e;1;2
r;6;7
t;5;9
y;3;1

Replies are listed 'Best First'.
Re: Creating a xml file in chunks
by haukex (Archbishop) on Jul 12, 2016 at 16:29 UTC

    Hi vagabonding electron,

    A while back I tried out XML::Writer. I believe it writes its output continually and only keeps a stack of tags that need to be closed, so I think it's probably worth a try for your purposes.

    I did find its API a bit verbose and later stopped using it because of that. If your root tag is literally just "<ROOT>" with no attributes, just print it, as there's nothing to escape and therefore nothing that can go wrong. Of course if you've got complicated documents with attributes and namespaces etc. it's not that simple!

    Hope this helps,
    -- Hauke D
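
For the less simple case haukex mentions, where the hand-printed root tag does carry attributes, the values need escaping before they are printed. The following is only a minimal sketch using core Perl; the helpers escape_attr and open_tag and the attribute names are hypothetical and not from the thread, and the sketch does not validate tag names or handle namespaces.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): escape a string for use
# inside a double-quoted XML attribute value.
sub escape_attr {
    my ($value) = @_;
    my %entity = (
        '&' => '&amp;',
        '<' => '&lt;',
        '>' => '&gt;',
        '"' => '&quot;',
    );
    $value =~ s/([&<>"])/$entity{$1}/g;
    return $value;
}

# Hypothetical helper: build an opening tag from a tag name and a list
# of attribute name/value pairs, keeping the attributes in the order given.
sub open_tag {
    my ($name, @attrs) = @_;
    my $tag = "<$name";
    while (my ($key, $value) = splice(@attrs, 0, 2)) {
        $tag .= sprintf ' %s="%s"', $key, escape_attr($value);
    }
    return "$tag>";
}

print '<?xml version="1.0" encoding="UTF-8"?>', "\n";
print open_tag('ROOT', source => 'Q&A "export"', version => '1'), "\n";
# ... print one chunk per data file here ...
print "</ROOT>\n";
```

Once attributes, namespaces, or non-trivial text content are involved, a module that tracks open tags and escapes values itself is likely the safer choice.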

      Thank you very much haukex!

      The API of XML::Writer appeared to be a bit too complex for me at first. However, I tried it just now, and indeed it seems to write the output continually (which I checked with the commented "Hi" line below). The syntax was not very verbose either on the second look :-) A bit strange (to my taste) is the NEWLINES option, which adds a newline before the closing delimiter of each tag; however, it does make the output human readable.

      I will test the module with my real data. Many thanks again!

      #!/perl
      use strict;
      use warnings FATAL => qw(all);
      use Text::CSV_XS;
      use XML::Writer;

      my $csv_par = {
          binary           => 1,
          auto_diag        => 1,
          allow_whitespace => 1,
          sep_char         => ';',
          eol              => $/,
          quote_char       => undef,
      };
      my $csv = Text::CSV_XS->new($csv_par);
      my @header = @{$csv->getline(*DATA)};
      my %rec;
      $csv->bind_columns(\@rec{@header});

      my $writer = XML::Writer->new(NEWLINES => 1, ENCODING => 'UTF-8'); # stdout.
      $writer->xmlDecl(); # ("UTF-8") already mentioned above.
      $writer->startTag('ROOT');
      while ( $csv->getline(*DATA) ) {
          $writer->startTag('alpha', 'name' => $rec{"alpha"});
          for my $other( qw(beta gamma) ) {
              $writer->startTag($other, 'name' => $rec{$other});
              $writer->endTag($other);
          }
          # print "\t\tHi!\n";
          $writer->endTag('alpha');
      }
      $writer->endTag('ROOT');
      $writer->end();

      __DATA__
      alpha;beta;gamma
      q;2;3
      w;9;8
      e;1;2
      r;6;7
      t;5;9
      y;3;1

        Hi vagabonding electron,

        The syntax was not very verbose either on the second look :-)

        This is of course just my personal opinion and not a good reason to not use the module, but its method naming just bugged me... when I'm writing Java I don't mind reallyLongMethodNamesThatDocumentEverythingTheMethodDoes (that's what autocomplete is for), but when writing Perl I prefer Perlish APIs. Some examples: the method "characters" could be named "text" (and accept multiple arguments), "startTag" could be named "start", or s/dataElement/tag/. It might seem minor but when I tried it out I just found myself typing "startTag" too often and my code actually got longer using this "helper" module. But anyways, that's just my two cents, if it works for you then don't let me stop you :-)

        Regards,
        -- Hauke D
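
On the verbosity point, XML::Writer's emptyTag method can replace a startTag/endTag pair for elements without content, and DATA_MODE with DATA_INDENT is an alternative to NEWLINES for readable output. The following is a minimal sketch, assuming XML::Writer is installed; the helper write_rows and its row format are hypothetical and stand in for the CSV loop from the scripts above.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical helper (not from the thread): stream rows (hash
# references) to any output XML::Writer accepts, for example a
# filehandle or a reference to a string.
sub write_rows {
    my ($output, @rows) = @_;
    my $writer = XML::Writer->new(
        OUTPUT      => $output,
        ENCODING    => 'utf-8',
        DATA_MODE   => 1,    # put each element on its own line
        DATA_INDENT => 2,    # indent nested elements by two spaces
    );
    $writer->xmlDecl();
    $writer->startTag('ROOT');
    for my $row (@rows) {
        $writer->startTag('alpha', name => $row->{alpha});
        # emptyTag writes a self-closing element in a single call.
        $writer->emptyTag($_, name => $row->{$_}) for qw(beta gamma);
        $writer->endTag('alpha');
    }
    $writer->endTag('ROOT');
    $writer->end();
    return;
}

# Write two sample rows to standard output.
write_rows(
    \*STDOUT,
    { alpha => 'q', beta => 2, gamma => 3 },
    { alpha => 'w', beta => 9, gamma => 8 },
);
```

For gigabytes of input, the rows would come from the CSV reader one at a time rather than from a list held in memory; the list here only keeps the sketch short.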

Re: Creating a xml file in chunks
by choroba (Cardinal) on Jul 12, 2016 at 19:49 UTC
    XML::LibXML is great for handling existing XML documents; its XML::LibXML::Reader can even process large files that don't fit into memory. The distribution has no tool to produce such large files, though. I'd probably use XML::Writer or fill the data into a template. You can use XML::LibXML to create the template for you (but you'd still need to output the opening and closing root tags):
    #!/usr/bin/perl
    use warnings;
    use strict;

    use Text::CSV_XS;
    use XML::LibXML;
    use Template;

    print "<ROOT>\n";

    # Build one <alpha> element whose attribute values are Template
    # Toolkit placeholders, then serialize it to get the template text.
    my $doc = 'XML::LibXML::Document'->new('1.0', 'utf-8');
    my $line_tag = $doc->createElement('alpha');
    $line_tag->setAttribute(name => '[% alpha %]');
    for my $other (qw( beta gamma )) {
        my $other_tag = $doc->createElement($other);
        $other_tag->setAttribute(name => "[% $other %]");
        $line_tag->appendChild($other_tag);
    }
    my $template = $line_tag->toString(1) . "\n";

    my $csv_init = {
        binary           => 1,
        auto_diag        => 1,
        allow_whitespace => 1,
        sep_char         => ';',
        eol              => $/,
        quote_char       => undef,
    };
    my $csv = 'Text::CSV_XS'->new($csv_init);
    my @header = @{ $csv->getline(*DATA) };
    my %rec;
    $csv->bind_columns(\@rec{@header});

    # Compile the template once and reuse it for every row.
    my $tt = 'Template'->new;
    my $compiled_template = $tt->template(\$template);
    while ($csv->getline(*DATA)) {
        $tt->process($compiled_template, \%rec);
    }

    print "</ROOT>\n";

    # Sample data copied from the original question.
    __DATA__
    alpha;beta;gamma
    q;2;3
    w;9;8
    e;1;2
    r;6;7
    t;5;9
    y;3;1

    Update: Code modified to compile the template just once.
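
To illustrate the remark about XML::LibXML::Reader handling documents that don't fit into memory, here is a minimal sketch of reading the generated file back one <alpha> element at a time. It assumes XML::LibXML is installed; the helper each_alpha and its callback interface are hypothetical, and the inline sample document stands in for the real output file.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper (not from the thread): visit the <alpha> elements
# one at a time and pass each record to $callback as a hash reference
# that maps element names to their "name" attribute values.
sub each_alpha {
    my ($reader, $callback) = @_;
    while ($reader->nextElement('alpha')) {
        # Expand only the current <alpha> subtree into a DOM node.
        my $node = $reader->copyCurrentNode(1);
        my %rec = (alpha => $node->getAttribute('name'));
        for my $child (grep { $_->isa('XML::LibXML::Element') } $node->childNodes) {
            $rec{ $child->nodeName } = $child->getAttribute('name');
        }
        $callback->(\%rec);
    }
    return;
}

my $xml = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<ROOT>
  <alpha name="q"><beta name="2"/><gamma name="3"/></alpha>
  <alpha name="w"><beta name="9"/><gamma name="8"/></alpha>
</ROOT>
XML

# For a file on disk, XML::LibXML::Reader->new(location => $path)
# would be used instead of the string constructor.
my $reader = XML::LibXML::Reader->new(string => $xml);
each_alpha($reader, sub {
    my ($rec) = @_;
    print join(';', map { "$_=$rec->{$_}" } sort keys %$rec), "\n";
});
```

Only one <alpha> subtree is expanded at a time, so memory use should stay roughly constant regardless of how many records the file holds.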

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thank you very much choroba! I have never used Template before, but it looks very interesting and I will try it now. In the meantime I have run XML::Writer with the real data, as you and haukex proposed: it does write the output file continuously and keeps memory usage down. Now I have two good options. Thanks again!
