Parsing XML file and keeping the formatting tags

by corfuitl (Sexton)
on Mar 22, 2018 at 14:50 UTC (#1211528=perlquestion)

corfuitl has asked for the wisdom of the Perl Monks concerning the following question:

Hi PerlMonks

I have the following bilingual file and would like to extract the source and target nodes, preserving any XML elements they contain as well as the line breaks. Could you please help me with that? I have no experience with XML parsing in Perl.

Here is a sample of my file:

<trans-unit id="1" maxbytes="14">
<source xml:lang="en-US">Hello <x id=1/> world!
How are you?</source>
<target xml:lang="ja-JP">Ciao<x id=1/> mondo!
Come stai?</target>
</trans-unit>

The expected result should be:

Hello <x id=1/> world! <lb/> How are you? || Ciao<x id=1/> mondo! <lb/> Come stai?

Thank you for your time!

Replies are listed 'Best First'.
Re: Parsing XML file and keeping the formatting tags
by choroba (Archbishop) on Mar 22, 2018 at 16:49 UTC
    Your input is not well-formed XML: attribute values must be quoted, so id=1 should be id="1". After fixing that, the following seems to work for the input you showed:
    #!/usr/bin/perl
    use warnings;
    use strict;

    use XML::LibXML;

    my $dom = 'XML::LibXML'->load_xml(location => 'file.xml');
    for my $child ( @{ $dom->find('/trans-unit/*[self::source | self::target]') } ) {
        ( my $contents = join '', $child->childNodes ) =~ s,\n, <lb/> ,g;
        print $contents, $child->nodeName eq 'source' ? ' || ' : "\n";
    }
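For readers unfamiliar with the idiom in the loop, here is a minimal sketch of the copy-then-substitute step on plain strings, using core Perl only. The helper name mark_line_breaks and the two sample strings are illustrative assumptions, not part of the reply. In the reply itself the string appears to come from joining the child nodes, which relies on XML::LibXML nodes stringifying to their serialized markup.

```perl
use strict;
use warnings;

# Hypothetical helper for illustration; not part of the original reply.
sub mark_line_breaks {
    my ($text) = @_;
    # Assign to a new variable first, then run the substitution on that copy,
    # so the original string is left unchanged.
    # The commas are simply an alternative delimiter for s///.
    ( my $copy = $text ) =~ s,\n, <lb/> ,g;
    return $copy;
}

# Sample strings standing in for the serialized contents of <source> and <target>.
my $source = qq{Hello <x id="1"/> world!\nHow are you?};
my $target = qq{Ciao<x id="1"/> mondo!\nCome stai?};

# Join the two sides with ' || ', as the reply does with its ternary.
print mark_line_breaks($source), ' || ', mark_line_breaks($target), "\n";
```

The parentheses around the assignment matter: without them, the substitution would bind to the right-hand side rather than to the new variable.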
    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      Thank you so much

      I added the quotes and it works :)

      However, I would like to understand the code more deeply, as I am not familiar with it. I will study it and get back to you if any questions arise.


        This is not a code problem; it is a problem with incorrectly formatted XML data.

        As with any structured data format, you must stick to the standard and make sure your data conforms to it. Some programs may tolerate malformed XML, JSON, etc., but properly written applications and modules will not (and should not, IMHO).
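One way to catch this kind of problem early is to test the data for well-formedness before processing it. The following is a minimal sketch, assuming XML::LibXML is installed; the helper name is_well_formed is hypothetical. It relies on load_xml throwing an exception when the parser rejects its input.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: returns 1 if the string parses as XML, 0 otherwise.
sub is_well_formed {
    my ($xml) = @_;
    # load_xml dies on a parse error; eval traps that and yields undef.
    my $ok = eval { XML::LibXML->load_xml( string => $xml ); 1 };
    return $ok ? 1 : 0;
}

my %samples = (
    'quoted attribute'   => '<source>Hello <x id="1"/> world!</source>',
    'unquoted attribute' => '<source>Hello <x id=1/> world!</source>',
);

for my $name ( sort keys %samples ) {
    printf "%-20s %s\n", $name,
        is_well_formed( $samples{$name} ) ? 'well-formed' : 'rejected';
}
```

When the check fails, the parser's error message is left in $@ and usually points at the offending line and column.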

Re: Parsing XML file and keeping the formatting tags
by Your Mother (Archbishop) on Mar 22, 2018 at 15:55 UTC

    Since your output still seems to be some flavor of SGML/XML, only modified from the original, I think what you might really be after is XSLT rather than XML handling. This may be a bit out of reach for a beginner unless you have strong technical aptitude or programming experience. :( Sidenote: you should probably fire your Japanese translator. Seems she's been hitting the Campari a little hard.

    The packages I think you want are XML::LibXML and XML::LibXSLT. If you break it into small pieces and show what you've tried, you'll be able to get a ton of help here. There is also if you just want to hire someone to write the code for you.

Re: Parsing XML file and keeping the formatting tags
by Lotus1 (Vicar) on Mar 22, 2018 at 15:21 UTC


      Thank you for your reply.

      Yes, I tried a few weeks ago; however, it was not possible to preserve the inline tags and the line breaks.
