http://qs321.pair.com?node_id=1114076

mdfaizy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks...Since morning I wrote a complete code to parse an XML file for just few information. However my code breaks in the first section itself when I do a substitute function to remove leading space in a line of the XML. Moreover none of the patter match is working. I am going crazy trying to find out why....any help is higly appreciated. I have tried this code on an Activestate perl installation at office (latest version)and also on my ubuntu perl installation (v5.18.2) and it did not work on either of them. Here is my code:
#!/usr/local/bin/perl # header modules ##################################################### +##################################################################### +########## ###################################################################### +##################################################################### +########## #read the input xml files for the TestSuite tags and its content open(SandBoxXML,$ARGV[0]) || die("sandbox xml file cannot be loaded;ch +eck for file name or existance"); my @sandboxxml = <SandBoxXML>; close(SandBoxXML); #chomp(@sandboxxml); for($i=0;$i<@sandboxxml;$i++) { $sandboxxml[$i] =~ s/^\s+//; #remove leading white spaces and tabs + from each line print $sandboxxml[$i]; #for testing } ###################################################################### +##################################################################### +########## #collecting the required data from the XML dump for($i=0;$i<@sandboxxml;$i++) { #print $sandboxxml[$i]; #for testing if($sandboxxml[$i] =~ /\<TestSuite\>/) { #$i++; #print $sandboxxml[$i]; #for testing while($sandboxxml[$i] !~ /\<\/TestSuite\>/) { if($sandboxxml[$i] =~ /\<ElementName\>/) { my $tsnumber=&readtagdata; #push(@data,$data.","); } if($sandboxxml[$i] =~ /\<Name\>/) { my $tsname=&readtagdata; #push(@data,$data.","); } if($sandboxxml[$i] =~ /\<ATC\>/) { #$i++; while($sandboxxml[$i] !~ /\<\/ATC\>/) { if($sandboxxml[$i] =~ /\<ElementName\>/) { my $atcnumber=&readtagdata; #push(@data,$data.","); } if($sandboxxml[$i] =~ /\<Name\>/) { my $atcname=&readtagdata; #push(@data,$data.","); } if($sandboxxml[$i] =~ /\<Purpose /) { my $atcpurpose=&readtagdata; #push(@data,$data.","); } if($sandboxxml[$i] =~ /\<Requirement\>/) { #$i++; while($sandboxxml[$i] !~ /\<\/Requirement\>/) { if($sandboxxml[$i] =~ /\<ElementName\>/) { my $reqnumber=&readtagdata; #push(@data,$data.","); } if($sandboxxml[$i] =~ /\<Name\>/) { my $reqname=&readtagdata; #push(@data,$data.","); } push(@data, $tsnumber.",".$tsname.",".$atc +number.",".$atcname.",".$atcpurpose.",".$reqnumber.",".$reqname."\n") +; $i++; } } $i++; } } $i++; } } } ###################################################################### +##################################################################### +########## # making the output file open(OUTPUT, ">TestSuite.csv") || die("Cannot make the outpur file...G +OD knows for what reason"); #for testing print OUTPUT @data; #for testing close(OUTPUT); ###################################################################### +##################################################################### +########## #sub functions sub removespace { foreach(@sandboxxml) { $_ =~ s/^[ \t]+//; #remove leading white spaces and tabs from +each line chomp($_); #remove newline character from each line } } sub readtagdata { my @tmp0 = split(/\>/,$sandboxxml[$i]); my @tmp1 = split(/\</,$tmp0[1]); return $tmp1[0]; }

Replies are listed 'Best First'.
Re: pattern match screwed up!!
by kennethk (Abbot) on Jan 22, 2015 at 00:51 UTC
    So, I will open by saying Anonymous Monk is right and you probably shouldn't be rolling your own here. You are highly unlikely to win the cost-benefit analysis with a home grown solution. I do think there is educational value in understanding how to do it, but this like crafting your own object system: go ahead and roll your own to understand the principles, and then use a well-tested one in production to CYA.

    Let's presume you have a well-formed_document, and ignore the question as to whether it's valid for a particular XSD.

    The first mistake you are making is thinking about an XML document's line structure as significant. While newlines and indentation are considered good form in an XML document, the standard is whitespace agnostic. Thus, you should be doing a slurp into a single variable. Something like:

    #!/usr/local/bin/perl use strict; use warnings; my $sandboxxml = do { open(my $fh, '<', $ARGV[0]) || die("sandbox xml file cannot be loa +ded;check for file name or existance"); local $/; # Slurp <$fh>; };
    Note that by having an indirect file handle in the do where I localize $/, the file is automatically closed once I'm done with it.

    Second, comments can contain all sorts of text that might interfere with a parse. As well, an XML document may contain a CDATA block, which can contain very nearly arbitrary text. I'm assuming that you don't have them in your trial document since you never handle them, but they are possible and must be removed before you can handle anything else. This also introduces the need to tokenize, as you must extract something from your document, but keep a placeholder in there so you know where your content came from. As who knows what's in the document, we'll need to pick something that can't possibly be legal XML, but that we can work around in our regular expression. How about <<#>>, where # is the index in our token array. Note that since comment delimiters are not special within a CDATA block and vice versa, we must strip them simultaneously. So:

    my @tokens; while ($sandboxxml =~ /<!\[(CDATA)\[|<!--/) { if ($1) { # We're in a CDATA block $sandboxxml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '> +>'/es; push @tokens, $1; } else { # Comment $sandboxxml =~ s/<!--.*?-->//s; } }
    Note we're just dropping comments, that if the file isn't well-formed, we just created an infinite loop, and lots of lovely escaping since [ and ] have special meaning in regular expressions.

    Okay, now we can start actually dealing with tags. Because of how XML is structured, we need to work from the inside out; otherwise is very hard in a general regex to know if you've actually matched start and end tags. We also now need to keep track of a tree structure in some way, but fortunately we can do that in a soft way using the tokens array we've already started.

    while ($sandboxxml =~ s#(<[^<>]*(?:/|>(?:[^<>]|<<\d*>>)*</[^<>]*)>)#'< +<' . (0+@tokens) . '>>'#es) { push @tokens, $1; }

    Of course, that's a giant mess. We also haven't built our tree up yet and failed to handle the leading <?xml...> tag. And hundred other things. And if our expressions are that complex, debugging them is going to be a pain.


    #11929 First ask yourself `How would I do this without a computer?' Then have the computer do it the same way.

Re: pattern match screwed up!!
by Anonymous Monk on Jan 21, 2015 at 23:14 UTC
Re: pattern match screwed up!! (junit xml)
by Anonymous Monk on Jan 22, 2015 at 00:22 UTC
      thanks for the hint. I did look into LibXML module but in the activestae installation at office this module is not installed. And apparently I am not allowed to alter this installation. The only module available to me are Simple and Expat.

        Talk with your manager at work. Explain what you're trying to do and why. Once there is a legitimate business reason, most companies will allow the installation.

        Jason L. Froebe

        Blog, Tech Blog

Re: pattern match screwed up!!
by roboticus (Chancellor) on Jan 22, 2015 at 11:54 UTC

    mdfaizy:

    In addition to what kennethk so eloquently said, I'd like to offer one little thing: A solution based on regexes can be fragile. You may spend a good bit of effort to make something that "works", and you'll be fine ... ... ... for a while. Every once in a while, the document will have something "interesting" in it, and your regex solution will break. Then you get to fix it. Unfortunately, it'll likely keep happening and irregular intervals.

    Even worse, it may appear to be working, but you may miss important things. Since a regex solution doesn't understand the structure of the XML document, you won't know when your regexes aren't working unless they fail in an obvious fashion. The worst failures are when it fails in a non-obvious fashion. As an example, suppose you don't handle attributes on tags because there aren't any currently. Then someone makes a change, and you get a document like this:

    ... <orders> <order> <orderID>1234</orderID> .. other order details .. </order> <order priority="SUPER IMPORTANT"> <orderID>1235</orderID> .. multimillion dollar order .. </order> </orders> ...

    Your boss, expecting a big order sometime soon asks "Hey, did we get any important orders yet?" You look at your log and say, "No, we just got one order today, it doesn't look special." That super important order will likely cause many people headaches and phone calls. But since the attribute existed in the order tag, it got missed.

    </endOfContrivedExample>

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.