Re: XML / regex - cleaning up attributes
by ikegami (Patriarch) on Oct 01, 2010 at 03:25 UTC
|
If it was just single quotes, one could come up with a generic solution that works well in most circumstances.
s/(?<!=)'(?![ >])/&apos;/g
However, & is allowed in Windows file names, and that's much trickier to handle generally. Since only one field is likely to hold incorrect data, this problem can be handled easily.
use HTML::Entities qw( encode_entities_numeric );
s/(?<=<app text=')(.*?)(?=' date)/encode_entities_numeric("$1")/eg;
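To see what the first, quote-only substitution does and where it gives up, here is a minimal sketch run on two sample lines. The attribute names follow the thread; the paths, the date value and the helper name fix_quotes are invented for illustration, and the replacement is assumed to be the &apos; entity.

```perl
use strict;
use warnings;

# Hypothetical sample lines: attribute names as in the thread,
# paths and values made up for this example.
my $ok  = q{<app text='C:\STORE'N'GO\run.exe' date='2010-09-30'>};
my $bad = q{<app text='C:\Dogs' Home\run.exe' date='2010-09-30'>};

# Hypothetical wrapper around the quote-only substitution.
sub fix_quotes {
    my ($line) = @_;
    # Escape any ' that does not follow "=" and is not followed by
    # a space or ">", i.e. one that does not look like a delimiter.
    $line =~ s/(?<!=)'(?![ >])/&apos;/g;
    return $line;
}

print fix_quotes($ok),  "\n";
print fix_quotes($bad), "\n";
```

The first line comes out with both embedded quotes escaped. The second is left unchanged, because an apostrophe followed by a space is indistinguishable from a closing delimiter; that is the "most circumstances" caveat.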
|
These both look very compelling. I'm definitely going to have to spend some time deciphering them. Barbie says "regular expressions are hard."
Re: XML / regex - cleaning up attributes
by Anonymous Monk on Sep 30, 2010 at 22:37 UTC
|
The second will not as the text tag contains embedded ' marks.
Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?
Complain upstream to whoever thinks they're supplying XML, because they are not, seriously :/
Re: XML / regex - cleaning up attributes
by graff (Chancellor) on Oct 01, 2010 at 01:14 UTC
|
Total agreement with both of the AnonyMonk replies above. The data supplier needs to be made aware that their output is faulty and needs to be fixed; also, you really don't want (or need) to delete anything -- just fix the basic mistake.
Re: XML / regex - cleaning up attributes
by rowdog (Curate) on Oct 01, 2010 at 06:58 UTC
|
As other monks have pointed out, that's far from valid XML but to answer your specific question...
Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?
Sure, you can hack your way around broken XML with something like
s/STORE'N'GO/STORENGO/ but don't; it's much better to consume actual XML.
Re: XML / regex - cleaning up attributes
by Anonymous Monk on Sep 30, 2010 at 22:42 UTC
|
s{<app text='}{<app text="}g;
s{' date='}{" date="}g;
s{' time='}{" time="}g;
s{' msec='}{" msec="}g;
s{' sid='}{" sid="}g;
s{'><![CDATA}{"><![CDATA}g;
Re: XML / regex - cleaning up attributes
by JavaFan (Canon) on Oct 01, 2010 at 08:58 UTC
|
Is there a regex that will remove this extra punctuation but leave the rest of the attribute (and tag) intact?
The problem is that for every example you give to "clean" up, it's easy to come up with a regex to do just that. Unfortunately, for any given regex, there'll be an example where it makes things worse.
Re: XML / regex - cleaning up attributes
by ethrbunny (Monk) on Oct 01, 2010 at 16:34 UTC
|
The XML in question comes from my code so complaints are typically ignored. The attribute in question comes from paths to Windows apps, so just about anything can (and does) appear. I run a series of regex substitutions on the XML before I pass it to the parser. This particular situation just popped up recently though.
|
>>The XML in question comes from my code so complaints are typically ignored.
:LOL: Okay, so you are saying you produced the XML in the first place? Why don't you clean the path first, then? That is not only easier but also more efficient than getting some module to parse the bad XML afterward.
If that field can really contain anything you can't just swap ' for " as a delimiter, and CDATA has the same delimiter issue. That is the crux of the issue: you need a delimiter, either ' or " or CDATA. Choose one (IMO: stick with ') and replace that delimiter in the data before you create the XML.
s/'/&apos;/g
|
You also need to replace & with &amp;.
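Putting the two replacements together at generation time might look like the sketch below. The helper name escape_attr and the sample path are hypothetical; the extra < replacement is not from the thread, but XML does not allow a literal < in an attribute value either.

```perl
use strict;
use warnings;

# Hypothetical helper: escape a value for use inside a
# single-quoted XML attribute.
sub escape_attr {
    my ($value) = @_;
    $value =~ s/&/&amp;/g;     # must run first, or it would re-escape the entities added below
    $value =~ s/</&lt;/g;      # not mentioned in the thread, but required by XML
    $value =~ s/'/&apos;/g;    # the chosen delimiter
    return $value;
}

my $path = q{C:\STORE'N'GO\R&D\run.exe};    # invented example path
printf "<app text='%s'/>\n", escape_attr($path);
```

The order matters: escaping ' before & would turn the freshly written &apos; into &amp;apos;.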
Re: XML / regex - cleaning up attributes
by TomDLux (Vicar) on Oct 04, 2010 at 03:51 UTC
|
Your XML module probably contains a routine for reformatting strings to handle special characters. Run your path through that ... in fact, run all your string values through that, since you have no idea when someone will use "can't" or some other 'forbidden' character.
As Occam said: Entia non sunt multiplicanda praeter necessitatem.