pattern match screwed up!!

mdfaizy has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks... Since this morning I have been writing code to parse an XML file for just a few pieces of information. However, my code breaks in the very first section, where I use a substitution to remove the leading whitespace from each line of the XML. Moreover, none of the pattern matches is working. I am going crazy trying to find out why... any help is highly appreciated. I have tried this code on an ActiveState Perl installation at the office (latest version) and also on my Ubuntu Perl installation (v5.18.2), and it did not work on either of them. Here is my code:

#!/usr/local/bin/perl

# header modules #####################################################
+#####################################################################
+##########

######################################################################
+#####################################################################
+##########

#read the input xml files for the TestSuite tags and its content
open(SandBoxXML,$ARGV[0]) || die("sandbox xml file cannot be loaded;check for file name or existance");
my @sandboxxml = <SandBoxXML>;
close(SandBoxXML);
#chomp(@sandboxxml);
for($i=0;$i<@sandboxxml;$i++)
{
    $sandboxxml[$i] =~ s/^\s+//; #remove leading white spaces and tabs from each line
    print $sandboxxml[$i]; #for testing
}
#####################################################################################################################################################

#collecting the required data from the XML dump
for($i=0;$i<@sandboxxml;$i++)
{
    #print $sandboxxml[$i]; #for testing
    if($sandboxxml[$i] =~ /\<TestSuite\>/)
    {
        #$i++;
        #print $sandboxxml[$i]; #for testing
        while($sandboxxml[$i] !~ /\<\/TestSuite\>/)
        {
            if($sandboxxml[$i] =~ /\<ElementName\>/)
            {
                my $tsnumber=&readtagdata;
                #push(@data,$data.",");
            }
            if($sandboxxml[$i] =~ /\<Name\>/)
            {
                my $tsname=&readtagdata;
                #push(@data,$data.",");
            }
            if($sandboxxml[$i] =~ /\<ATC\>/)
            {
                #$i++;
                while($sandboxxml[$i] !~ /\<\/ATC\>/)
                {
                    if($sandboxxml[$i] =~ /\<ElementName\>/)
                    {
                        my $atcnumber=&readtagdata;
                        #push(@data,$data.",");
                    }
                    if($sandboxxml[$i] =~ /\<Name\>/)
                    {
                        my $atcname=&readtagdata;
                        #push(@data,$data.",");
                    }
                    if($sandboxxml[$i] =~ /\<Purpose /)
                    {
                        my $atcpurpose=&readtagdata;
                        #push(@data,$data.",");
                    }
                    if($sandboxxml[$i] =~ /\<Requirement\>/)
                    {
                        #$i++;
                        while($sandboxxml[$i] !~ /\<\/Requirement\>/)
                        {
                            if($sandboxxml[$i] =~ /\<ElementName\>/)
                            {
                                my $reqnumber=&readtagdata;
                                #push(@data,$data.",");
                            }
                            if($sandboxxml[$i] =~ /\<Name\>/)
                            {
                                my $reqname=&readtagdata;
                                #push(@data,$data.",");
                            }
                            push(@data, $tsnumber.",".$tsname.",".$atcnumber.",".$atcname.",".$atcpurpose.",".$reqnumber.",".$reqname."\n");
                            $i++;
                        }
                        
                    }
                    $i++;
                }
            }
            $i++;
        }
        
    }
}
#####################################################################################################################################################

# making the output file
open(OUTPUT, ">TestSuite.csv") || die("Cannot make the outpur file...GOD knows for what reason"); #for testing
print OUTPUT @data; #for testing
close(OUTPUT);
#####################################################################################################################################################

#sub functions
sub removespace
{
    foreach(@sandboxxml)
    {
        $_ =~ s/^[ \t]+//; #remove leading white spaces and tabs from each line
        chomp($_); #remove newline character from each line
    }
}
sub readtagdata
{
    my @tmp0 = split(/\>/,$sandboxxml[$i]);
    my @tmp1 = split(/\</,$tmp0[1]);
    return $tmp1[0];
}

Replies are listed 'Best First'.
Re: pattern match screwed up!! by kennethk (Abbot) on Jan 22, 2015 at 00:51 UTC
So, I will open by saying Anonymous Monk is right and you probably shouldn't be rolling your own here. You are highly unlikely to win the cost-benefit analysis with a home-grown solution. I do think there is educational value in understanding how to do it, but this is like crafting your own object system: go ahead and roll your own to understand the principles, and then use a well-tested one in production to CYA.

Let's presume you have a well-formed document, and ignore the question as to whether it's valid for a particular XSD. The first mistake you are making is thinking about an XML document's line structure as significant. While newlines and indentation are considered good form in an XML document, the standard is whitespace agnostic. Thus, you should be doing a slurp into a single variable. Something like:

    #!/usr/local/bin/perl
    use strict;
    use warnings;

    my $sandboxxml = do {
        open(my $fh, '<', $ARGV[0]) || die("sandbox xml file cannot be loaded;check for file name or existance");
        local $/; # Slurp
        <$fh>;
    };

Note that by having an indirect file handle in the do where I localize `$/`, the file is automatically closed once I'm done with it.

Second, comments can contain all sorts of text that might interfere with a parse. As well, an XML document may contain a CDATA block, which can contain very nearly arbitrary text. I'm assuming that you don't have them in your trial document since you never handle them, but they are possible and must be removed before you can handle anything else. This also introduces the need to tokenize, as you must extract something from your document, but keep a placeholder in there so you know where your content came from. As who knows what's in the document, we'll need to pick something that can't possibly be legal XML, but that we can work around in our regular expression. How about `<<#>>`, where # is the index in our token array.

Note that since comment delimiters are not special within a CDATA block and vice versa, we must strip them simultaneously. So:

    my @tokens;
    while ($sandboxxml =~ /<!\[(CDATA)\[|<!--/) {
        if ($1) { # We're in a CDATA block
            $sandboxxml =~ s/<!\[CDATA\[(.*?)\]\]>/'<<' . (0+@tokens) . '>>'/es;
            push @tokens, $1;
        } else { # Comment
            $sandboxxml =~ s/<!--.*?-->//s;
        }
    }

Note that we're just dropping comments, that if the file isn't well-formed we've just created an infinite loop, and that there is lots of lovely escaping, since `[` and `]` have special meaning in regular expressions.

Okay, now we can start actually dealing with tags. Because of how XML is structured, we need to work from the inside out; otherwise it is very hard in a general regex to know if you've actually matched start and end tags. We also now need to keep track of a tree structure in some way, but fortunately we can do that in a soft way using the tokens array we've already started.

    while ($sandboxxml =~ s#(<[^<>]*(?:/|>(?:[^<>]|<<\d+>>)*</[^<>]*)>)#'<<' . (0+@tokens) . '>>'#es) {
        push @tokens, $1;
    }

Of course, that's a giant mess. We also haven't built our tree up yet and failed to handle the leading `<?xml...>` tag. And a hundred other things. And if our expressions are that complex, debugging them is going to be a pain.

First ask yourself "How would I do this without a computer?" Then have the computer do it the same way.
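To make the placeholder idea concrete, here is a small self-contained sketch of the comment/CDATA step run against an inline sample. It is a simplified variant, not the code from the reply: the sample document, the `$xml` and `$index` names, and the `die` guards against unterminated constructs are all assumptions added for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; a real script would slurp a file instead.
my $xml = <<'END_XML';
<root>
  <!-- a <Name>fake</Name> inside a comment -->
  <Name>real</Name>
  <note><![CDATA[literal <Name> and <!-- text]]></note>
</root>
END_XML

my @tokens;    # saved CDATA payloads, indexed by placeholder number

# Always deal with whichever construct opens first, so that a comment
# opener inside a CDATA section (or the reverse) is never seen as markup.
while ($xml =~ /(<!\[CDATA\[|<!--)/) {
    if ($1 eq '<!--') {
        # Drop the comment entirely.
        $xml =~ s/<!--.*?-->//s
            or die "unterminated comment\n";
    }
    else {
        # Swap the CDATA section for a <<N>> placeholder and keep its text.
        my $index = @tokens;
        $xml =~ s/<!\[CDATA\[(.*?)\]\]>/<<$index>>/s
            or die "unterminated CDATA section\n";
        push @tokens, $1;
    }
}

print $xml;
print "token $_: $tokens[$_]\n" for 0 .. $#tokens;
```

After the loop, the fake `<Name>` that lived inside the comment is gone, and the CDATA text sits in `@tokens` where later tag-matching patterns cannot trip over it. The `die` guards turn the infinite loop on malformed input into an immediate error.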
Re: pattern match screwed up!! by Anonymous Monk on Jan 21, 2015 at 23:14 UTC
This may not be what you want to hear, but:

How do I match XML, HTML, or other nasty, ugly things with a regex?

Do not use regexes. Use a module and forget about the regular expressions. ... XML::LibXML ...

Other XML modules often recommended are XML::Twig, XML::Rules, or XML::Compile. Personally I prefer XML::LibXML or XML::Twig. (You may run into XML::Simple too, but that is only appropriate for really simple cases, of which this does not appear to be one, so I'd avoid it.)

If you could show some representative sample input and the corresponding expected output, someone might be able to get you started with one of those modules.
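For a sense of what the module route looks like, here is a minimal XML::LibXML sketch. The original poster never showed the input, so the sample document and its nesting (a TestSuite holding ATC elements, which hold Requirement elements, each with ElementName and Name children) are guesses based on the tag names in the posted code, not the real file format.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical input: the real file layout was not posted.
my $xml = <<'END_XML';
<Report>
  <TestSuite>
    <ElementName>TS-1</ElementName>
    <Name>Login tests</Name>
    <ATC>
      <ElementName>ATC-7</ElementName>
      <Name>Valid password</Name>
      <Purpose type="text">Check the happy path</Purpose>
      <Requirement>
        <ElementName>REQ-42</ElementName>
        <Name>Users can log in</Name>
      </Requirement>
    </ATC>
  </TestSuite>
</Report>
END_XML

my $doc = XML::LibXML->load_xml(string => $xml);

# One CSV-style row per Requirement, carrying its suite and ATC fields.
my @rows;
for my $suite ($doc->findnodes('//TestSuite')) {
    for my $atc ($suite->findnodes('./ATC')) {
        for my $req ($atc->findnodes('./Requirement')) {
            push @rows, join ',',
                $suite->findvalue('./ElementName'),
                $suite->findvalue('./Name'),
                $atc->findvalue('./ElementName'),
                $atc->findvalue('./Name'),
                $atc->findvalue('./Purpose'),
                $req->findvalue('./ElementName'),
                $req->findvalue('./Name');
        }
    }
}

print "$_\n" for @rows;
```

The parser takes care of whitespace, attributes, comments, and CDATA, so the extraction code only has to describe where the data sits. The plain `join` is only safe while no value contains a comma or a quote; real CSV output would want a module such as Text::CSV.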
Re: pattern match screwed up!! (junit xml) by Anonymous Monk on Jan 22, 2015 at 00:22 UTC
"pattern match screwed up!! ... Moreover none of the patter match is working."

It's easy for you to figure it out if you use the Basic debugging checklist and brian's Guide to Solving Any Perl Problem.

Also, it's even easier to use XML::LibXML with tools like xpather.pl/htmltreexpather.pl, which can give you paths to start with, and all the links in Re: Retrieve select information from HTML; they're examples (for tree-xpath and others), walkthroughs, and tutorials ...

~~You know what's even easier? `[metacpan://junit] [cpan://junit]` -> junit junit -> bupkis~~ No dedicated junit parsers on CPAN :)
Re^2: pattern match screwed up!! (junit xml) by mdfaizy (Initiate) on Jan 22, 2015 at 08:38 UTC
Thanks for the hint. I did look into the LibXML module, but it is not installed in the ActiveState installation at the office, and apparently I am not allowed to alter this installation. The only modules available to me are Simple and Expat.
Re^3: pattern match screwed up!! (junit xml) by jfroebe (Parson) on Jan 22, 2015 at 14:03 UTC
Talk with your manager at work. Explain what you're trying to do and why. Once there is a legitimate business reason, most companies will allow the installation.

Jason L. Froebe
Re: pattern match screwed up!! by roboticus (Chancellor) on Jan 22, 2015 at 11:54 UTC
mdfaizy:

In addition to what kennethk so eloquently said, I'd like to offer one little thing: a solution based on regexes can be fragile. You may spend a good bit of effort to make something that "works", and you'll be fine ... for a while. Every once in a while, the document will have something "interesting" in it, and your regex solution will break. Then you get to fix it. Unfortunately, it'll likely keep happening at irregular intervals.

Even worse, it may appear to be working, but you may miss important things. Since a regex solution doesn't understand the structure of the XML document, you won't know when your regexes aren't working unless they fail in an obvious fashion. The worst failures are when it fails in a non-obvious fashion.

As an example, suppose you don't handle attributes on tags because there aren't any currently. Then someone makes a change, and you get a document like this:

    ...
    <orders>
      <order>
        <orderID>1234</orderID>
        .. other order details ..
      </order>
      <order priority="SUPER IMPORTANT">
        <orderID>1235</orderID>
        .. multimillion dollar order ..
      </order>
    </orders>
    ...

Your boss, expecting a big order sometime soon, asks "Hey, did we get any important orders yet?" You look at your log and say, "No, we just got one order today, it doesn't look special." That super important order will likely cause many people headaches and phone calls. But since the attribute existed in the order tag, it got missed.

</endOfContrivedExample>

...roboticus

When your only tool is a hammer, all problems look like your thumb.
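The silent failure in the contrived example is easy to reproduce. The following sketch uses a trimmed copy of that document; the two patterns and the `@naive` / `@tolerant` names are illustrative assumptions, not code from the thread.

```perl
use strict;
use warnings;

# Trimmed version of the contrived orders document.
my $xml = <<'END_XML';
<orders>
  <order>
    <orderID>1234</orderID>
  </order>
  <order priority="SUPER IMPORTANT">
    <orderID>1235</orderID>
  </order>
</orders>
END_XML

# Naive pattern: written when <order> never carried attributes.
my @naive = $xml =~ m{<order>\s*<orderID>(\d+)</orderID>}g;

# Patched pattern: tolerates attributes, but still knows nothing about
# comments, CDATA sections, or nesting.
my @tolerant = $xml =~ m{<order\b[^>]*>\s*<orderID>(\d+)</orderID>}g;

print "naive:    @naive\n";       # prints "naive:    1234"
print "tolerant: @tolerant\n";    # prints "tolerant: 1234 1235"
```

The naive pattern raises no error and no warning; it simply returns one order instead of two. Each patch of this kind fixes the last surprise and leaves the next one waiting, which is the argument for a real parser.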
