Anonymous Monk:

No one else seems to have mentioned the perils of parsing XML with regular expressions, so I guess I will. It all works fine so long as the XML keeps arriving formatted like your example, or if you control both ends of the data feed.

However, with third-party data feeds, sooner or later they'll change the formatting and give you a headache. For example, suppose the data comes in like this:

<breakfast_menu>
<food><name>Belgian Waffles</name><price>$5.95</price>
<description>Two of our famous Belgian Waffles with plenty of real maple syrup</description>
<calories>650</calories>
</food>
<food><name>Strawberry Belgian Waffles</name><price>$7.95
</price><description>Light Belgian waffles covered with strawberries and whipped cream
</description><calories>900</calories>
</food>
<food><name>Berry-Berry Belgian Waffles
</name>
<price>$8.95</price>
<description>Light Belgian waffles covered with an assortment of fresh berries and whipped cream</description><calories>900</calories>
</food>
<food>
<name>French Toast</name>
<price>$4.50</price>
<description>Thick slices made from our homemade sourdough bread</description>
<calories>600</calories>
</food>
<food>
<name> Homestyle Breakfast</name>
<price>$6</price>
<description>Two eggs, bacon or sausage, toast, and our ever-popular hash browns</description>
<calories>950</calories>
</food>
<food><name>Robot Cogs</name><price>$123.456</price></food>
<food><name>Berries &amp; More Berries Waffles</name><price>11.5</price></food>
</breakfast_menu>

Here, you'll find several things that can cause you some trouble:

  • Some of the values you're interested in have extra whitespace
  • The prices are formatted differently
  • Tags may not appear on the same line
  • Special characters (such as &) will show up as entity text

So you'll get awful results with your code:

$ perl pm1208325_proc_xml.pl ugly.xml
Homestyle Breakfast 4.50
Berries &amp; More Berries Waffles 123.456
French Toast 8.95
Strawberry Belgian Waffles 5.95

Notice that due to the ugliness I added to the XML file, the output is not only ugly, but wrong!

Not only are some items missing from the output, but since you're using separate arrays to hold your values, a parsing error on any one value knocks the arrays out of sync, so the wrong prices appear on some items.
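To make the out-of-sync failure concrete, here is a minimal sketch. It is not the original poster's script (which isn't reproduced here); `extract_parallel` and the sample lines are made up for illustration, assuming a line-at-a-time regex approach that pushes names and prices onto two separate arrays.

```perl
use strict;
use warnings;

# Hypothetical line-oriented extraction, NOT the original poster's script.
# Names and prices go into two separate arrays, matched one line at a time.
sub extract_parallel {
    my (@lines) = @_;
    my (@names, @prices);
    for my $line (@lines) {
        push @names,  $1 if $line =~ m{<name>(.*?)</name>};
        push @prices, $1 if $line =~ m{<price>\$?(.*?)</price>};
    }
    return (\@names, \@prices);
}

# Sample input: the second price is split across two lines.
my @lines = (
    '<food><name>Belgian Waffles</name><price>$5.95</price>',
    '</food>',
    '<food><name>Strawberry Belgian Waffles</name><price>$7.95',
    '</price></food>',
    '<food>',
    '<name>French Toast</name>',
    '<price>$4.50</price>',
    '</food>',
);

my ($names, $prices) = extract_parallel(@lines);

# The split price never matches, so @prices ends up one element short
# and every later name is paired with the wrong price (or none at all).
for my $i (0 .. $#$names) {
    printf "%-28s %s\n", $names->[$i], $prices->[$i] // '(missing)';
}
```

With this input, three names are found but only two prices, so Strawberry Belgian Waffles is shown with French Toast's 4.50 and French Toast gets no price at all.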

There are other headaches you can get into when dealing with XML files, too. So you may want to learn one of the XML handling libraries. It's a little bit of a pain at first, but once you're used to it, these sorts of issues just magically go away. Then you can use the time you're not wrestling XML data to handle the other issues, like formatting values!

I whipped something up with XML::Twig, and it displays:

$ perl ex_Xml_Twig_pm1208325.pl ugly.xml
Belgian Waffles $5.95
Berries & More Berries Waffles $11.50
Berry-Berry Belgian Waffles $8.95
French Toast $4.50
Homestyle Breakfast $6.00
Robot Cogs $123.46
Strawberry Belgian Waffles $7.95
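The script behind that output isn't shown here, so the following is only a rough sketch of how something similar might look, not the actual code. It assumes XML::Twig is installed from CPAN; `parse_menu` and `trim` are hypothetical helper names, and the column width in the `printf` is arbitrary. The point is that each `<food>` element is handled as a unit, so a name can never be paired with another item's price.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical helper: strip leading and trailing whitespace.
sub trim {
    my ($s) = @_;
    $s =~ s/^\s+|\s+$//g;
    return $s;
}

# Hypothetical helper: return a hashref of name => price,
# built one <food> element at a time.
sub parse_menu {
    my ($xml) = @_;
    my %price_for;
    my $twig = XML::Twig->new(
        twig_handlers => {
            food => sub {
                my ($t, $food) = @_;
                # The parser has already turned &amp; back into "&".
                my $name  = trim($food->first_child_text('name'));
                my $price = trim($food->first_child_text('price'));
                $price =~ s/^\$//;    # the "$" is optional in the feed
                $price_for{$name} = $price;
            },
        },
    );
    $twig->parse($xml);
    return \%price_for;
}

# A few of the awkward records from the sample feed.
my $xml = <<'XML';
<breakfast_menu>
<food>
<name> Homestyle Breakfast</name>
<price>$6</price>
</food>
<food><name>Robot Cogs</name><price>$123.456</price></food>
<food><name>Berries &amp; More Berries Waffles</name><price>11.5
</price></food>
</breakfast_menu>
XML

my $menu = parse_menu($xml);
for my $name (sort keys %$menu) {
    printf "%-32s \$%.2f\n", $name, $menu->{$name};
}
```

Whitespace, line breaks inside elements, and entities are handled by the parser and the small `trim` step, which leaves `printf` to take care of alignment and price formatting.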

...roboticus

When your only tool is a regular expression, all XML problems look insurmountable.


In reply to Re: aligning text by roboticus
in thread aligning text by Anonymous Monk
