PerlMonks  

XML / regex - cleaning up attributes

by ethrbunny (Monk)
on Sep 30, 2010 at 21:50 UTC ( [id://862866]=perlquestion )

ethrbunny has asked for the wisdom of the Perl Monks concerning the following question:

I have the following sorts of text coming in waves of XML files (hundreds / day):
<app text='C:\WINDOWS\SYSTEM32\SHELL32.DLL' date='2008-07-29' time='14:39:02' msec='032' sid='442'><![CDATA[Program Manager]]></app>
<app text='C:\DOCUMENTS AND SETTINGS\POCL\DESKTOP\STORE'N'GO (E)\THE FUNCTIONS BEFORE EDITING.EXE' date='2008-07-29' time='14:49:00' msec='622' sid='442'><![CDATA[Macromedia Flash Player 6]]></app>
The first line will parse just fine. The second will not as the text tag contains embedded ' marks.
Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?

Replies are listed 'Best First'.
Re: XML / regex - cleaning up attributes
by ikegami (Patriarch) on Oct 01, 2010 at 03:25 UTC

    If it was just single quotes, one could come up with a generic solution that works well in most circumstances.

    s/(?<!=)'(?![ >])/&apos;/g

    However, & is allowed in Windows file names, and that's much trickier to handle generally. Since only one field is likely to hold incorrect data, this problem can be handled easily.

    use HTML::Entities qw( encode_entities_numeric );
    s/(?<=<app text=')(.*?)(?=' date)/encode_entities_numeric("$1")/eg;
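
The two substitutions above can be tried against the sample line from the question. This is only a sketch: the sub names are made up for illustration, and `numeric_escape` is a hypothetical, dependency-free stand-in for `encode_entities_numeric` that covers just the five XML special characters.

```perl
use strict;
use warnings;

# First approach: escape any ' that is not preceded by = (an opening
# delimiter) and not followed by a space or > (a closing delimiter).
sub escape_stray_quotes {
    my ($xml) = @_;
    $xml =~ s/(?<!=)'(?![ >])/&apos;/g;
    return $xml;
}

# Hypothetical stand-in for encode_entities_numeric: turn the five XML
# special characters into hexadecimal character references.
sub numeric_escape {
    my ($text) = @_;
    $text =~ s/([<>&'"])/sprintf '&#x%X;', ord $1/ge;
    return $text;
}

# Second approach: escape only what lies between "<app text='" and "' date".
sub escape_text_attr {
    my ($xml) = @_;
    $xml =~ s/(?<=<app text=')(.*?)(?=' date)/numeric_escape($1)/eg;
    return $xml;
}

my $line = <<'XML';
<app text='C:\DOCUMENTS AND SETTINGS\POCL\DESKTOP\STORE'N'GO (E)\THE FUNCTIONS BEFORE EDITING.EXE' date='2008-07-29' time='14:49:00' msec='622' sid='442'><![CDATA[Macromedia Flash Player 6]]></app>
XML

print escape_stray_quotes($line);
print escape_text_attr($line);
```

The second form relies on `text` always being followed directly by `date`, which holds for the sample data but is an assumption about the feed as a whole.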
      These both look v compelling. I'm definitely going to have to spend some time decrypting them.
      Barbie says "regular expressions are hard."
Re: XML / regex - cleaning up attributes
by Anonymous Monk on Sep 30, 2010 at 22:37 UTC
    The second will not as the text tag contains embedded ' marks. Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?

    Complain upstream to whomever thinks they're supplying XML, because they are not, seriously :/

Re: XML / regex - cleaning up attributes
by graff (Chancellor) on Oct 01, 2010 at 01:14 UTC
    Total agreement with both of the AnonyMonk replies above. The data supplier needs to be made aware that their output is faulty and needs to be fixed; also, you really don't want (or need) to delete anything -- just fix the basic mistake.
Re: XML / regex - cleaning up attributes
by rowdog (Curate) on Oct 01, 2010 at 06:58 UTC

    As other monks have pointed out, that's far from valid XML but to answer your specific question...

    Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?

    Sure, you can hack your way around broken XML with something like s/STORE'N'GO/STORENGO/, but don't; it's much better to consume actual XML.

Re: XML / regex - cleaning up attributes
by Anonymous Monk on Sep 30, 2010 at 22:42 UTC
    s{<app text='}{<app text="}g;
    s{' date='}{" date="}g;
    s{' time='}{" time="}g;
    s{' msec='}{" msec="}g;
    s{' sid='}{" sid="}g;
    s{'><!\[CDATA}{"><![CDATA}g;
Re: XML / regex - cleaning up attributes
by JavaFan (Canon) on Oct 01, 2010 at 08:58 UTC
    Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?
    The problem is that for every example you give to "clean" up, it's easy to come up with a regex to do just that. Unfortunately, for any given regexp, there'll be an example where it makes it worse.
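
To make that concrete, here is a small sketch using the single-quote substitution suggested earlier in the thread. Both input lines are hypothetical, invented only to show the failure modes: one broken line the pattern fails to repair, and one well-formed line it damages.

```perl
use strict;
use warnings;

# The generic single-quote fix from earlier in the thread.
sub fix_quotes {
    my ($xml) = @_;
    $xml =~ s/(?<!=)'(?![ >])/&apos;/g;
    return $xml;
}

# Hypothetical broken input: the stray ' is followed by a space, so the
# pattern mistakes it for a closing delimiter and leaves it alone.
my $missed = q{<app text='C:\BOB' S\RUN.EXE' sid='1'>};
print fix_quotes($missed), "\n";

# Hypothetical well-formed input: the closing ' is followed by /, so the
# pattern escapes a real delimiter and the element is no longer well-formed.
my $valid = q{<app sid='442'/>};
print fix_quotes($valid), "\n";
```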
Re: XML / regex - cleaning up attributes
by ethrbunny (Monk) on Oct 01, 2010 at 16:34 UTC
    The XML in question comes from my code so complaints are typically ignored. The attribute in question comes from paths to Windows apps so just about anything can (and does) appear. I run a series of regex commands on the XML before I pass it to the parser. This particular situation just popped up recently though.
      >>The XML in question comes from my code so complaints are typically ignored.

      :LOL: Okay, so you are saying you produced the XML in the first place? Why don't you clean the path first, then? That is not only easier but also more efficient than getting some module to parse the bad XML afterward.

      If that field can really contain anything you can't just swap ' for " as a delimiter, and CDATA has the same delimiter issue. That is the crux of the issue: you need a delimiter, either ' or " or CDATA. Choose one (IMO: stick with ') and replace that delimiter in the data before you create the xml.

      s/'/&#39;/g

        You also need to replace & with &amp;.
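
Put together, escaping at generation time might look like the sketch below. `escape_attr` is a hypothetical helper name, and the `R&D.EXE` path is invented to exercise the `&` case. The `<` replacement is an addition not mentioned in the thread, included because XML does not allow a literal `<` in attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a raw string for use inside a
# single-quoted XML attribute value.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;    # first, so the entities added below are not re-escaped
    $value =~ s/</&lt;/g;     # not raised in the thread; < is illegal in attribute values
    $value =~ s/'/&#39;/g;    # the delimiter used by the generated XML
    return $value;
}

my $path = q{C:\DOCUMENTS AND SETTINGS\POCL\DESKTOP\STORE'N'GO (E)\R&D.EXE};
printf "<app text='%s' sid='442'/>\n", escape_attr($path);
```

The order matters: running the `&` substitution last would turn the freshly written `&#39;` into `&amp;#39;`.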
Re: XML / regex - cleaning up attributes
by TomDLux (Vicar) on Oct 04, 2010 at 03:51 UTC

    Your XML module probably contains a routine for reformatting strings to handle special characters. Run your path through that ... in fact, run all your string values through that, since you have no idea when someone will use "can't" or some other 'forbidden' character.
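
As one possible illustration of letting the library do the escaping, the sketch below uses XML::LibXML. That choice is an assumption: the thread never says which module produces the XML, the module has to be installed from CPAN, and `app_xml` and the `R&D.EXE` path are hypothetical.

```perl
use strict;
use warnings;
use XML::LibXML;

# Build one <app> element; the library escapes the attribute value
# when the element is serialized.
sub app_xml {
    my ($path, $title) = @_;
    my $doc = XML::LibXML::Document->new('1.0', 'UTF-8');
    my $app = $doc->createElement('app');
    $app->setAttribute(text => $path);
    $app->appendChild($doc->createCDATASection($title));
    return $app->toString;
}

my $path = q{C:\DOCUMENTS AND SETTINGS\POCL\DESKTOP\STORE'N'GO (E)\R&D.EXE};
my $xml  = app_xml($path, 'Macromedia Flash Player 6');
print "$xml\n";

# Round trip: parsing the output gives back the original path.
my $parsed = XML::LibXML->load_xml(string => $xml);
print $parsed->documentElement->getAttribute('text'), "\n";
```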

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Node Type: perlquestion [id://862866]
Approved by Corion
Front-paged by Corion