PerlMonks  

Parsing a highly nested XML file correctly and efficiently

by Ppeoc (Beadle)
on Jun 08, 2016 at 03:01 UTC ( [id://1165129]=perlquestion )

Ppeoc has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks, I am trying to parse this highly nested demo file. The problem is that I need data from several different nesting levels. My XML file and sample code are below. I apologize for the length of the XML file; I wanted to show its complexity. I am currently using XML::Twig to parse it, but I will be more than happy to try other options. Basically, I want my output to be of the form
d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text.
I am not able to retain the parent path information for only the data I need, and my code prints a whole lot of junk data as well. I would appreciate any help. Thanks, Monks!
Shorter XML data to make it easier:
<DatatoParse>
  <elt>
    <d1>TV show 1</d1>
    <d2>Heroes</d2>
    <d3>4</d3>
    <d4/>
    <Nest1>
      <elt>
        <Junk1>FULL</Junk1>
        <Junk2>Page 65</Junk2>
        <Nest2>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
        </Nest2>
      </elt>
    </Nest1>
    <notrequired1>
      <elt>
        <Junk1>FULL</Junk1>
        <Junk2>Page 65</Junk2>
        <Nest2>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
        </Nest2>
      </elt>
    </notrequired1>
  </elt>
  <elt>
    <d1>TV show 2</d1>
    <d2>Prison Break</d2>
    <d3>8</d3>
    <d4/>
    <Nest1>
      <elt>
        <Junk1>FULL</Junk1>
        <Junk2>Page 65</Junk2>
        <Nest2>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
        </Nest2>
      </elt>
    </Nest1>
    <notrequired1>
      <elt>
        <Junk1>FULL</Junk1>
        <Junk2>Page 65</Junk2>
        <Nest2>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
          <elt>
            <d5> <X>-2</X> <Y>-3</Y> </d5>
            <d6> <X>5</X> <Y>8</Y> </d6>
            <Nest3>
              <Nest4> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Nest4>
              <Junk> <d7> <d9> <d10> <d11>yipppeee</d11> </d10> </d9> </d7> </Junk>
            </Nest3>
          </elt>
        </Nest2>
      </elt>
    </notrequired1>
  </elt>
</DatatoParse>
And here is the code I have so far:
use warnings;
use XML::Twig;
use XML::Simple;

$localfile = "Test_1.xml";
my $field = "Nest1";
open my $fout1, '>', "testx.csv" or die "Could not open file!";

$twig = XML::Twig->new(
    twig_roots => {
        $field => 1,
        'd1'   => 1,
        'd2'   => 1,
    },
    twig_handlers => {
        'DatatoParse'    => \&node,
        'DatatoParse//*' => \&node1
    }
);
$twig->parsefile($localfile);

sub node {
    my($twig, $el) = @_;
    $twig->purge;
}

sub node1 {
    print $fout1 "\n", if ($_->tag eq "d1");
    print $fout1 $_->text, ",", unless ($_->has_children('#ELT'));
    print $fout1 "\n", if ($_->tag eq "elt");
}

Replies are listed 'Best First'.
Re: Parsing a highly nested XML file correctly and efficiently
by Discipulus (Canon) on Jun 08, 2016 at 07:10 UTC
    I am currently using XML twig to parse this. I will be more than happy to try other options.

    There are a (few) other options, but XML::Simple is not one of them. Avoid it. I suspect that your unhappiness is due not to XML::Twig but to the shaggy beast that XML itself is.

    Anyway, I do not understand your expected output format:

    d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text.

    What does the above mean?

    After modifying your program a little (and using strict too), I get output that makes some sense, with no garbage at all:

    #cat testx.csv
    TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    TV show 2,Prison Break,FULL,Page 65,-2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    TV show 4,Alias,FULL,Page 65,-2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,
    -2,-3,5,8,yipppeee,yipppeee,

    What is wrong with this? What output do you want?

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
      Thanks for your help. I hope I can make this a little clearer.
      1) I want my output to be like this:
      TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee (d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text)
      TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee (d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text)
      TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee (d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text)
      TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee (d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text)
      TV show 2,Prison Break,FULL,Page 65,-2,-3,5,8,yipppeee
      TV show 2,Prison Break,FULL,Page 65,-2,-3,5,8,yipppeee
      TV show 2,Prison Break,FULL,Page 65,-2,-3,5,8,yipppeee
      TV show 2,Prison Break,FULL,Page 65,-2,-3,5,8,yipppeee
      2) I do not want the data from parents labelled Junk or notrequired. My current code displays those as well.
        Mmh, for me that does not make it any clearer:
        TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee
        # does not match (at least for me..) with your description (if I understand it)
        (d1|d2|Nest1->Nest2->d5->X|Nest1->Nest2->d5->Y|Nest1->Nest2->d6->X|Nest1->Nest2->d6->Y|Nest1->Nest2->Nest3->Nest4->d7->d9->d10->text)

        L*

Re: Parsing a highly nested XML file correctly and efficiently
by Anonymous Monk on Jun 08, 2016 at 03:26 UTC
Re: Parsing a highly nested XML file correctly and efficiently
by choroba (Cardinal) on Jun 09, 2016 at 18:21 UTC
    I played with your XML in XML::XSH2. It has a shell/REPL, so it's easy to discover the correct XPath expressions there by trial and error. It's a wrapper around XML::LibXML, so if speed is a concern you can use the same XPath expressions in XML::LibXML directly.
    open file.xml ;
    for /DatatoParse/elt/Nest1/elt/Nest2/elt {
        my $top = ancestor::elt[last()] ;
        echo :s $top/d1 , $top/d2 , (d5/X) , (d5/Y) , (d6/X) , (d6/Y) , (Nest3/Nest4/d7/d9/d10/d11) ;
    }

    Output:

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,
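choroba's point that the same XPath expressions carry over to XML::LibXML can be sketched roughly as follows. This is only an illustration under stated assumptions: the trimmed-down sample document, the `$xml` and `@rows` variables, and the choice of columns are made up for the example and are not from the thread. It loops over the innermost `elt` nodes and climbs back to the outermost `elt` with the `ancestor` axis, as the XSH2 loop does.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumption: a trimmed-down sample in the same shape as the question's data.
my $xml = q{
<DatatoParse>
  <elt>
    <d1>TV show 1</d1> <d2>Heroes</d2>
    <Nest1><elt><Nest2>
      <elt>
        <d5><X>-2</X><Y>-3</Y></d5>
        <d6><X>5</X><Y>8</Y></d6>
        <Nest3><Nest4><d7><d9><d10><d11>yipppeee</d11></d10></d9></d7></Nest4></Nest3>
      </elt>
    </Nest2></elt></Nest1>
    <notrequired1><elt><Nest2>
      <elt><d5><X>0</X><Y>0</Y></d5></elt>
    </Nest2></elt></notrequired1>
  </elt>
</DatatoParse>
};

my $doc = XML::LibXML->load_xml(string => $xml);

my @rows;
# The absolute path goes through Nest1 only, so notrequired1 is never visited.
for my $inner ($doc->findnodes('/DatatoParse/elt/Nest1/elt/Nest2/elt')) {
    # ancestor:: is a reverse axis, so [last()] selects the outermost <elt>.
    my ($top) = $inner->findnodes('ancestor::elt[last()]');
    my @top_vals   = map { $top->findvalue($_) }   qw(d1 d2);
    my @inner_vals = map { $inner->findvalue($_) }
                     qw(d5/X d5/Y d6/X d6/Y Nest3/Nest4/d7/d9/d10/d11);
    push @rows, join ',', @top_vals, @inner_vals;
}
print "$_\n" for @rows;
```

Starting from the innermost node and looking upward keeps the loop flat: there is one iteration per output row, and the parent values are looked up on demand.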
      Thanks Choroba! I haven't used XSH2 or LibXML before, so this is completely new to me. But it looks like you have given me a good starting point. I will need to read up on this before I understand what you have done. Really appreciate the help!
Re: Parsing a highly nested XML file correctly and efficiently
by tangent (Parson) on Jun 09, 2016 at 19:08 UTC
    Here is a way to do it using XML::LibXML. Using XPath expressions you should be able to match your nesting exactly.
    use XML::LibXML;

    # $xml is expected to hold the XML document as a string
    my $doc = XML::LibXML->load_xml(string => $xml);
    my @nodes = $doc->findnodes('DatatoParse/elt');
    for my $node ( @nodes ) {
        my $d1 = $node->findvalue('d1');
        my $d2 = $node->findvalue('d2');
        # descend through Nest1 only, so notrequired1 is skipped
        my @xnodes = $node->findnodes( 'Nest1/elt/Nest2/elt' );
        for my $xnode ( @xnodes ) {
            my $d5x = $xnode->findvalue( 'd5/X' );
            my $d5y = $xnode->findvalue( 'd5/Y' );
            my $d6x = $xnode->findvalue( 'd6/X' );
            my $d6y = $xnode->findvalue( 'd6/Y' );
            my $d10 = $xnode->findvalue( 'Nest3/Nest4/d7/d9/d10' );
            # the text of d10 includes the whitespace around d11, so trim it
            $d10 =~ s/^\s+//;
            $d10 =~ s/\s+$//;
            print "$d1,$d2,$d5x,$d5y,$d6x,$d6y,$d10\n";
        }
    }
    Output:
    TV show 1,Heroes,-2,-3,5,8,yipppeee
    TV show 1,Heroes,-2,-3,5,8,yipppeee
    TV show 1,Heroes,-2,-3,5,8,yipppeee
    TV show 1,Heroes,-2,-3,5,8,yipppeee
    TV show 2,Prison Break,-2,-3,5,8,yipppeee
    TV show 2,Prison Break,-2,-3,5,8,yipppeee
    TV show 2,Prison Break,-2,-3,5,8,yipppeee
    TV show 2,Prison Break,-2,-3,5,8,yipppeee
    TV show 4,Alias,-2,-3,5,8,yipppeee
    TV show 4,Alias,-2,-3,5,8,yipppeee
    TV show 4,Alias,-2,-3,5,8,yipppeee
    TV show 4,Alias,-2,-3,5,8,yipppeee
      Thank you for the genius solution. This is so simple and yet efficient. Exactly what I was looking for
Re: Parsing a highly nested XML file correctly and efficiently -- XML::Twig
by Discipulus (Canon) on Jun 10, 2016 at 09:07 UTC
    Hello, so if I understood your desired output, you can simply take the first_child at each level in sequence, like:

    # inside the twig_handler sub for '/DatatoParse/elt'
    $_[1]->first_child('Nest1')->first_child('elt')->first_child('Junk1')->text;

    This soon becomes very verbose and repetitive. In fact, you only need an XPath, so for Junk1 you can also write:

    my @junk1 = $_[1]->get_xpath('./Nest1/elt/Junk1');
    print $junk1[0]->text;

    So when you have many XPath expressions to process in the same way, you can compact the code a lot, ending up with the following twig_handler:

    sub elt_map{
        my $elt = $_[1];
        print join ',', map {
            my @cur = $elt->get_xpath($_);
            $cur[0]->text;
        }(qw(
            d1
            d2
            ./Nest1/elt/Junk1
            ./Nest1/elt/Junk2
            ./Nest1/elt/Nest2/elt/d5/X
            ./Nest1/elt/Nest2/elt/d5/Y
            ./Nest1/elt/Nest2/elt/d6/X
            ./Nest1/elt/Nest2/elt/d6/Y
            ./Nest1/elt/Nest2/elt/Nest3/Nest4/d7/d9/d10/d11
        ));
        print "\n"
    }

    The whole code will be:

    use strict;
    use warnings;
    use XML::Twig;

    my $field = "Nest1";
    my $twig = XML::Twig->new(
        twig_handlers => {'/DatatoParse/elt' => \&elt_map,}
    );
    $/ = undef;    # slurp mode, so <DATA> returns the whole document as one string
    $twig->parse(<DATA>);

    sub elt_map{
        my $elt = $_[1];
        print join ',', map {
            my @cur = $elt->get_xpath($_);
            $cur[0]->text;
        }(qw(
            d1
            d2
            ./Nest1/elt/Junk1
            ./Nest1/elt/Junk2
            ./Nest1/elt/Nest2/elt/d5/X
            ./Nest1/elt/Nest2/elt/d5/Y
            ./Nest1/elt/Nest2/elt/d6/X
            ./Nest1/elt/Nest2/elt/d6/Y
            ./Nest1/elt/Nest2/elt/Nest3/Nest4/d7/d9/d10/d11
        ));
        print "\n"
    }

    __DATA__
    <DatatoParse>
    <elt>
    <d1>TV show 1</d1>
    ....

    with the following output

    TV show 1,Heroes,FULL,Page 65,-2,-3,5,8,yipppeee
    TV show 2,Prison Break,FULL,Page 65,-2,-3,5,8,yipppeee
    TV show 4,Alias,FULL,Page 65,-2,-3,5,8,yipppeee
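The handler above emits one row per top-level `elt`, because `$cur[0]` takes only the first match of each path. If one row per `Nest1/elt/Nest2/elt` is wanted instead, as the question's follow-up seems to ask, a possible variation is sketched below. The sample document, the `text_at` helper, the `elt_rows` handler and the `@rows` array are all hypothetical names introduced for this sketch; only `get_xpath`, `text`, `purge` and `twig_handlers` come from the code in this thread.

```perl
use strict;
use warnings;
use XML::Twig;

# Assumption: a small sample in the same shape as the question's data.
my $xml = q{
<DatatoParse>
  <elt>
    <d1>TV show 1</d1><d2>Heroes</d2>
    <Nest1><elt>
      <Junk1>FULL</Junk1><Junk2>Page 65</Junk2>
      <Nest2>
        <elt>
          <d5><X>-2</X><Y>-3</Y></d5><d6><X>5</X><Y>8</Y></d6>
          <Nest3>
            <Nest4><d7><d9><d10><d11>yipppeee</d11></d10></d9></d7></Nest4>
            <Junk><d7><d9><d10><d11>junk</d11></d10></d9></d7></Junk>
          </Nest3>
        </elt>
        <elt>
          <d5><X>1</X><Y>2</Y></d5><d6><X>3</X><Y>4</Y></d6>
          <Nest3><Nest4><d7><d9><d10><d11>hooray</d11></d10></d9></d7></Nest4></Nest3>
        </elt>
      </Nest2>
    </elt></Nest1>
    <notrequired1><elt><Nest2>
      <elt><d5><X>9</X><Y>9</Y></d5></elt>
    </Nest2></elt></notrequired1>
  </elt>
</DatatoParse>
};

my @rows;

# Hypothetical helper: text of the first match of $path, or '' if none.
sub text_at {
    my ($node, $path) = @_;
    my ($hit) = $node->get_xpath($path);
    return defined $hit ? $hit->text : '';
}

sub elt_rows {
    my ($twig, $elt) = @_;
    # Values shared by every row that comes from this top-level <elt>.
    my @shared = map { text_at($elt, $_) }
                 qw(d1 d2 ./Nest1/elt/Junk1 ./Nest1/elt/Junk2);
    # One row per inner <elt>; only the Nest1 branch is followed.
    for my $inner ($elt->get_xpath('./Nest1/elt/Nest2/elt')) {
        my @own = map { text_at($inner, $_) }
                  qw(./d5/X ./d5/Y ./d6/X ./d6/Y ./Nest3/Nest4/d7/d9/d10/d11);
        push @rows, join ',', @shared, @own;
    }
    $twig->purge;    # release the part of the tree that has been handled
}

XML::Twig->new(twig_handlers => { '/DatatoParse/elt' => \&elt_rows })
         ->parse($xml);
print "$_\n" for @rows;
```

Returning an empty string for a missing path, rather than calling `text` on an undefined value, keeps one incomplete record from aborting the whole run.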

    In addition, when you always need to write to a destination file, you can make use of select $filehandle; it is also very useful because, while debugging, you can comment it out to see the output on screen.
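A minimal sketch of the `select` idea follows. It uses an in-memory filehandle in place of a real output file, which is an assumption made only so that the example leaves nothing on disk; the variable names are illustrative.

```perl
use strict;
use warnings;

# Assumption: an in-memory handle stands in for the real destination file.
my $buffer = '';
open my $out, '>', \$buffer or die "Cannot open in-memory handle: $!";

my $previous = select $out;    # a bare print now writes to $out
print "TV show 1,Heroes\n";    # no filehandle needed here
select $previous;              # restore the earlier default handle
close $out;

print "captured: $buffer";
```

Commenting out the first `select` line (and the restoring one) sends the same `print` statements to the screen, which is the debugging convenience described above.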

    HtH

    L*


Node Type: perlquestion [id://1165129]
Approved by GrandFather