PerlMonks  

Re: pattern match screwed up!!

by kennethk (Abbot)
on Jan 22, 2015 at 00:51 UTC ( [id://1114087]=note )


in reply to pattern match screwed up!!

So, I will open by saying Anonymous Monk is right and you probably shouldn't be rolling your own here. You are highly unlikely to win the cost-benefit analysis with a homegrown solution. I do think there is educational value in understanding how to do it, but this is like crafting your own object system: go ahead and roll your own to understand the principles, and then use a well-tested one in production to CYA.

Let's presume you have a well-formed document, and ignore the question as to whether it's valid for a particular XSD.

The first mistake you are making is thinking about an XML document's line structure as significant. While newlines and indentation are considered good form in an XML document, the standard is whitespace agnostic. Thus, you should be doing a slurp into a single variable. Something like:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $sandboxxml = do {
        open(my $fh, '<', $ARGV[0])
            || die("sandbox xml file cannot be loaded; check for file name or existence");
        local $/;   # Slurp
        <$fh>;
    };

Note that by having an indirect file handle in the do where I localize $/, the file is automatically closed once I'm done with it.
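
To see what the localized $/ buys you, here is a minimal sketch. It assumes a made-up three-line document written to a temporary file (the contents and variable names are mine, not from your script), slurps it with the same do-block idiom, and checks that $/ is back to its default once the block ends.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample document, deliberately spread over several lines.
my ($tmp_fh, $tmp_name) = tempfile(UNLINK => 1);
print {$tmp_fh} "<root>\n  <item>one</item>\n</root>\n";
close($tmp_fh) or die "close failed: $!";

my $slurped = do {
    open(my $fh, '<', $tmp_name) or die "cannot open $tmp_name: $!";
    local $/;    # undef inside this block only, so <$fh> returns everything
    <$fh>;
};

# Outside the block, $/ is the default newline again.
my $separator_restored = (defined $/ && $/ eq "\n") ? 1 : 0;

# Count the newlines that ended up inside the single slurped string.
my $line_count = () = $slurped =~ /\n/g;
print "slurped one string containing $line_count newlines\n";
```

The whole document lands in one scalar, newlines included, so later regexes can match across what used to be line boundaries.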

Second, comments can contain all sorts of text that might interfere with a parse. As well, an XML document may contain a CDATA block, which can contain very nearly arbitrary text. I'm assuming that you don't have them in your trial document since you never handle them, but they are possible and must be removed before you can handle anything else. This also introduces the need to tokenize: you must extract something from your document, but keep a placeholder in there so you know where your content came from. Since who knows what's in the document, we'll need to pick something that can't possibly be legal XML, but that we can work around in our regular expression. How about <<#>>, where # is the index in our token array? Note that since comment delimiters are not special within a CDATA block and vice versa, we must strip them simultaneously. So:

    my @tokens;
    while ($sandboxxml =~ /<!\[(CDATA)\[|<!--/) {
        if ($1) {   # We're in a CDATA block
            $sandboxxml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '>>'/es;
            push @tokens, $1;
        } else {    # Comment
            $sandboxxml =~ s/<!--.*?-->//s;
        }
    }

Note three things: we're just dropping comments; if the file isn't well-formed, we've just created an infinite loop; and there's lots of lovely escaping, since [ and ] have special meaning in regular expressions.
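
If the infinite loop bothers you, one way around it is to check whether each substitution actually changed anything. This is only a sketch: the helper name strip_cdata_and_comments and the sample document are made up for illustration, and it dies on an unterminated block rather than trying to recover.

```perl
use strict;
use warnings;

# Hypothetical helper: replaces CDATA blocks with <<N>> placeholders and drops
# comments, dying instead of looping when a block is never closed.
sub strip_cdata_and_comments {
    my ($xml) = @_;
    my @tokens;
    while ($xml =~ /<!\[(CDATA)\[|<!--/) {
        my $changed;
        if ($1) {
            # A CDATA block opens first: stash its content, leave a placeholder.
            $changed = $xml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '>>'/es;
            push @tokens, $1 if $changed;
        }
        else {
            # A comment opens first: drop it entirely.
            $changed = $xml =~ s/<!--.*?-->//s;
        }
        die "unterminated CDATA block or comment\n" unless $changed;
    }
    return ($xml, \@tokens);
}

# A comment containing a CDATA opener, then a CDATA block containing a comment.
my ($stripped, $tokens) = strip_cdata_and_comments(
    '<a><!-- <![CDATA[ not data --><b><![CDATA[x <!-- y --> z]]></b></a>'
);
print "$stripped\n";              # <a><b><<0>></b></a>
print "token 0: $tokens->[0]\n";  # token 0: x <!-- y --> z
```

The sample shows why the two have to be stripped in one pass: whichever delimiter opens first wins, and the other kind of delimiter inside it is just text.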

Okay, now we can start actually dealing with tags. Because of how XML is structured, we need to work from the inside out; otherwise it is very hard in a general regex to know if you've actually matched start and end tags. We also now need to keep track of a tree structure in some way, but fortunately we can do that in a soft way using the tokens array we've already started.

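    # (?<!/) stops a self-closing tag such as <b/> from being read as an
    # opening tag and matched together with the closing tag that follows it.
    while ($sandboxxml =~ s#(<[^<>]*(?:/|(?<!/)>(?:[^<>]|<<\d*>>)*</[^<>]*)>)#'<<' . (0+@tokens) . '>>'#es) {
        push @tokens, $1;
    }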

Of course, that's a giant mess. We also haven't built our tree up yet and failed to handle the leading <?xml...> tag. And a hundred other things. And if our expressions are that complex, debugging them is going to be a pain.
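
For what it's worth, here is a rough sketch of how the tokens array can stand in for a tree. The reduction loop is written here with a (?<!/) lookbehind so that a self-closing tag directly before a closing tag isn't swallowed along with it; the sample document and the outline helper are made up for illustration. It only handles a document with no <?xml...> declaration, CDATA, or comments.

```perl
use strict;
use warnings;

my $xml = '<root><a>hi</a><b/></root>';
my @tokens;

# Inside-out reduction: innermost elements are replaced by <<N>> first.
while ($xml =~ s#(<[^<>]*(?:/|(?<!/)>(?:[^<>]|<<\d*>>)*</[^<>]*)>)#'<<' . (0+@tokens) . '>>'#es) {
    push @tokens, $1;
}

# Hypothetical helper: returns token $index followed, one level deeper,
# by every token its placeholders refer to.
sub outline {
    my ($index, $depth) = @_;
    my $text = $tokens[$index];
    my $out  = ('  ' x $depth) . "$index: $text\n";
    while ($text =~ /<<(\d+)>>/g) {
        my $child = $1;    # copy before recursing, since $1 gets overwritten
        $out .= outline($child, $depth + 1);
    }
    return $out;
}

# After the reduction the whole document is a single placeholder.
my ($top) = $xml =~ /<<(\d+)>>/;
my $tree = outline($top, 0);
print $tree;
# 2: <root><<0>><<1>></root>
#   0: <a>hi</a>
#   1: <b/>
```

Each token only refers to tokens with smaller indexes, so following the placeholders from the last token down gives you the nesting without ever building an explicit tree.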


#11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.
