PerlMonks  


It depends whether you're giving or taking.

If you are taking the output from a single program emitting its own brew of XML, you will usually find that it is always emitted in exactly the same way: often pretty-printed with indented nested elements, or hard-wrapped against column zero all the way down.

It is extremely rare (in my experience) to encounter XML emitted by a program that is neatly word-wrapped at or before column 72. After all, that takes a lot more work, and most sane programmers have better things to do with their time. Once you figure out empirically how a given program emits its XML, you can count on it being invariant.

So, as much as it may shock the purists, you can quite easily get away with picking out what you want from a big XML file with a regexp or two, especially if you don't have to worry about context. By context I mean, for example, having to extract the contents of element <HG> only if its parent is <BAR>, except when the grandparent is <ZONK>.
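To make that concrete, here is a minimal sketch of the regexp approach, written in Python for illustration (the same pattern carries over to a Perl regexp). The sample document and the name extract_hg are made up for this example, and the pattern assumes the emitting program always writes <HG>...</HG> on a single line with no attributes.

```python
import re

# Hypothetical sample: one element per line, the way a single program
# might always emit it. Not taken from any real tool.
SAMPLE = """\
<report>
  <item>
    <name>alpha</name>
    <HG>42</HG>
  </item>
  <item>
    <name>beta</name>
    <HG>17</HG>
  </item>
</report>
"""

# Assumption: <HG> never carries attributes and never spans lines.
HG_RE = re.compile(r"^[ \t]*<HG>([^<]*)</HG>[ \t]*$", re.MULTILINE)


def extract_hg(text):
    """Return the text of every <HG> element, ignoring its context."""
    return HG_RE.findall(text)


if __name__ == "__main__":
    print(extract_hg(SAMPLE))
```

Note that this pays no attention to where <HG> sits in the tree, which is exactly why it only suits the no-context case.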

You just need a good test-suite to cover your a.. code, to ensure that things don't break when the source program is upgraded.
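One way such a test suite might look, again as a Python sketch under assumptions: the helper format_still_holds and the sample strings are hypothetical, and the check is deliberately strict, so that any line mentioning HG that no longer has the expected shape makes the tests fail instead of letting the regexp silently miss data.

```python
import re
import unittest

# The exact line shape the extraction regexp relies on (an assumption
# about the emitting program, established empirically).
HG_LINE = re.compile(r"^[ \t]*<HG>[^<]*</HG>[ \t]*$")


def format_still_holds(text):
    """True if every line mentioning HG still has the expected shape."""
    return all(
        HG_LINE.match(line) for line in text.splitlines() if "HG" in line
    )


class FormatAssumptions(unittest.TestCase):
    def test_known_layout_is_accepted(self):
        self.assertTrue(format_still_holds("<r>\n  <HG>42</HG>\n</r>\n"))

    def test_new_attribute_is_detected(self):
        changed = '<r>\n  <HG unit="mm">42</HG>\n</r>\n'
        self.assertFalse(format_still_holds(changed))

    def test_rewrapped_element_is_detected(self):
        changed = "<r>\n  <HG>\n    42\n  </HG>\n</r>\n"
        self.assertFalse(format_still_holds(changed))


if __name__ == "__main__":
    unittest.main()
```

Run it against a fresh sample of output after every upgrade of the source program.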

You cannot adopt this approach when it is you who has written the XML specification and you're dealing with how people give you their information according to your spec. Everyone will do it differently and you will indeed have to parse it. Update: nor can you adopt it when you're taking the information from a web service, and thus have no control over, or forewarning of, when the originating program may be upgraded.
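For that case, and for the context-sensitive case mentioned earlier (<HG> under <BAR>, unless the grandparent is <ZONK>), a real parser does the work. A minimal sketch using Python's standard xml.etree.ElementTree follows; the function name and the sample document are invented for the example, and the parent map is built by hand because ElementTree elements do not keep a pointer to their parent.

```python
import xml.etree.ElementTree as ET


def hg_under_bar_not_zonk(xml_text):
    """Text of each <HG> whose parent is <BAR>, unless the grandparent is <ZONK>."""
    root = ET.fromstring(xml_text)
    # Map every element to its parent so we can walk upwards.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    found = []
    for hg in root.iter("HG"):
        parent = parent_of.get(hg)
        if parent is None or parent.tag != "BAR":
            continue
        grandparent = parent_of.get(parent)
        if grandparent is not None and grandparent.tag == "ZONK":
            continue
        found.append((hg.text or "").strip())
    return found


# Hypothetical input, laid out awkwardly on purpose: the second <HG>
# has a line break inside its start tag, which is still legal XML.
DOC = (
    "<root><BAR><HG>keep</HG></BAR>"
    "<ZONK><BAR>\n<HG\n>skip</HG></BAR></ZONK>"
    "<FOO><HG>other</HG></FOO></root>"
)

if __name__ == "__main__":
    print(hg_under_bar_not_zonk(DOC))
```

The layout of the input no longer matters here, only its structure, which is the whole point of parsing.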

That's been my rule in dealing with SGML and XML for over 15 years, and it has served me well so far.

• another intruder with the mooring in the heart of the Perl


In reply to Re: XML parsing vs Regular expressions by grinder
in thread XML parsing vs Regular expressions by karpatov
