Beefy Boxes and Bandwidth Generously Provided by pair Networks
P is for Practical
 
PerlMonks  

A Question on a homebrew XML parser

by cyocum (Curate)
on Jan 23, 2002 at 05:59 UTC ( [id://140781]=perlquestion: print w/replies, xml ) Need Help??

cyocum has asked for the wisdom of the Perl Monks concerning the following question:

I humbly ask the Perl Monks to look at my code.

First, a little background. I am unemployed at the moment so to keep my skills up I have started to write a home brew XML parser. Now there are many features still missing but I have enough to begin testing against a real XML file.

The problem that I am having is that when I parse this file. The while loop exits on <fifths>0</fifths>. I have been racking my feeble mind to untangle what is happening. I hope that the great Perl Monks may be able to help me.

Here is the code that runs this. Please be lienient since I am right now just trying to the process down before I OO it.

use warnings; use strict; use utf8; use FileHandle; use Data::Dumper; my $document = &parseXML("mut.xml"); my $log = new FileHandle "dump.txt", "w+"; print $log Dumper($document); sub parseXML { #args my $file = shift; #pre-declared vars my %doc; my $preSymbol; my $currSymbol; my $currElement; my $depth = 0; my $fh = new FileHandle $file, "r"; my $log = new FileHandle "log.txt", "w+"; CHAR: while($currSymbol = getc($fh)) { print $log "entering while. current: $currSymbol\n"; if($currSymbol eq "<") { my $nextSymbol = getc($fh); print $log "current: $currSymbol next: $nextSymbol\n"; my %element; my $elementName = ""; my $attributes = ""; #if this is a letter then this is a start tag #then read until you reach a space. #attribs should be following or the end of the tag if($nextSymbol =~ m/\p{IsAlpha}/) { $preSymbol = $currSymbol; $currSymbol = $nextSymbol; $depth++; print $log "next symbol is a word char. current: $currSymbol\n +"; until($currSymbol eq " " or $currSymbol eq ">") { print $log "Getting element name\n"; $elementName .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } print $log "Element name: $elementName\n"; if($currSymbol eq " ") { until($currSymbol eq ">") { print $log "getting attributes\n"; $attributes .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } } print $log "Attribs are: $attributes\n"; if($currSymbol eq ">") { print $log "done getting element info. Depth: $depth\n"; $element{name} = $elementName; $element{attributes} = $attributes; $element{children} = []; $element{depth} = $depth; if($depth == 1) { $doc{root} = \%element; $currElement = $doc{root}; }else{ #this is checking to see if this element is #an empty element. The : are the boundries #for the regular expressions. if($element{name} =~ m:/:) { print $log "this is an empty element.\n"; $element{name} =~ s:/::; $depth--; my $children = $currElement->{children}; push @{$children}, \%element; print $log "adding to parent node and going to next ch +ar.\n"; next CHAR; }else { my $children = $currElement->{children}; push @{$children}, \%element; $currElement = \%element; print $log "going to next char in the while loop.\n"; next CHAR; } } }else { print $log "Symbol: $currSymbol is not recognized\n"; } } #need to determine if this is an end element or #an empty element #if this is a / then this is an end tag #then read until the end > elsif($nextSymbol eq "/") { print $log "next symbol is a / ending element.\n"; $currElement->{status} = "closed"; $depth--; $preSymbol = $currSymbol; $currSymbol = $nextSymbol; until($currSymbol eq ">") { $preSymbol = $currSymbol; $currSymbol = getc($fh); } #went forward one too may symbols seek($fh, -1, 1); print $log "element finished\n"; } #if this is a ? then this is a processing instuction #read until the space for the application name #then compare that to make sure that it is not the xml decl #that happens at the beginning of a document #then read until the next ? elsif($nextSymbol eq "?") { print $log "Found processing instuction.\n"; my $appName = ""; my $appInfo = ""; $preSymbol = $currSymbol; $currSymbol = getc($fh); print $log "Reading app name for PI.\n"; until($currSymbol eq " ") { $appName .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } print $log "appname is $appName.\n"; if($appName eq "xml") { print $log "This is the standard decl for xml.\n"; until($currSymbol eq "?") { $appInfo .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } $doc{xmldecl} = $appInfo; next CHAR; }else { print $log "This is a true PI.\n"; until($currSymbol eq "?") { $appInfo .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } $currElement->{PI}->{$appName} = $appInfo; } } }elsif($currSymbol =~ m/\p{IsAlnum}|\p{IsSpace}/) { my $content = ""; print $log "adding content to current element.\n"; until($currSymbol eq "<") { $content .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } $currElement->{content} .= $content; #went forward one too many symbols seek($fh, -1, 1); next CHAR; } } print $log "Current symbol: $currSymbol. Presymbol: $preSymbol\nfi +nished with document.\n"; return \%doc; }

Edit: chipmunk 2002-01-22

Replies are listed 'Best First'.
Re: A Question on a homebrew XML parser
by mirod (Canon) on Jan 23, 2002 at 12:32 UTC

    I did not debug completely your code, although getting a depth of 135 for an element leads me to believe there is a problem somewhere, and the fact that it does not stop at the end of the file is a problem too, but here are a couple of comments:

    • you really need to have a look at the XML spec, and especially at the production rules (if you want to extract the rules Ways to Rome has plenty of code examples for this)
    • spaces are allowed nearly everywhere in markup: between the opening < and the tag name for example,
    • > is a perfectly valid character in an attribute value, so you cannot use it to spot the end of a tag,
    • some of the features that you are not supporting: CDATA sections, entities, DTD...

    My advice would be to do this "a-la-Xtreme Programming": write code that logs the parsing for one (or several) of the XML parsers (XML::Parser, XML::SAX::PurePerl, XML::Parser::Lite, XML::libXML) and then compare the output to what your code does, for small files, one file at a time. This will give you a reference and a goal for your parser.

    When you feel pretty confident that your code works fine, then you can try it on the XML::Parser test suite and decide that after all you don't really need to write no stinkin' parser, as I did ;--(

Re: A Question on a homebrew XML parser
by wog (Curate) on Jan 23, 2002 at 06:22 UTC

    The loop is exiting because of your test:

    CHAR: while($currSymbol = getc($fh)) {
    The character 0 is false, and thus fails this test. You probably should use something like:
    CHAR: while( defined($currSymbol = getc($fh)) ) {

    (If you were using <$fh> to read from the file this would not be a problem even if you read 0 only because perl special cases that to have an implict defined() around it.)

Re: A Question on a homebrew XML parser
by Matts (Deacon) on Jan 23, 2002 at 12:26 UTC
    Apart from the other comments (it's zero-is-false syndrome), I'll add that you should really be using IO::File, rather than FileHandle, as it's the "newer" version of OO files.

    Also, and perhaps more significantly (depending on what your aim is - learning or doing it for serious work), you have a long way to go before this is "right". Currently the only parser in Perl getting anywhere near compliance to the XML spec is my XML::SAX::PurePerl, and I'm so not finished it's not even funny. At the moment PurePerl has about 3000 lines of code (and very few docs or comments), and I've got shit loads of stuff left to implement (at the moment it's careful enough to skip over stuff I haven't implemented yet).

    Don't let me discourage you though - it can be fun writing hand crafted parsers, but with XML you really do have to know the spec inside out.

Re: A Question on a homebrew XML parser
by Anonymous Monk on Jan 23, 2002 at 06:29 UTC
    Just a minor nitpick here, but in several places you use double quotes("") around pure text (like ">").

    Now since "" interpolate variables (ex: "$_>>>>"), and '' do not, and you are not interpolating any vars within the string, there is no need for the double quotes.

    It probably doesn't matter much, but like I said, it's a nitpick (you wouldn't use splice(@a,0,1), where you meant shift(@a), now would you?)

      Actually no.

      Double quotes are optimised away by the compiler. In the case of no variables actually being interpolated, then double quotes work exactly the same way single quotes do. Effectively all the compiler does when it sees a variable in double quotes is turn something like:

      "Error $@ in $file"

      into:

      "Error " . $@ . " in " . $file;

      I know Doug MacEachern was also talking about optimising things so that in some cases you might optimise it into a join, or simply a comma separated list of strings and variables. But I don't think he's submitted any patches to do that just yet.

Re: A Question on a homebrew XML parser
by cyocum (Curate) on Feb 25, 2002 at 11:11 UTC

    Well, I just wanted to let everyone know that I have made some major improvements to my code. I have added many features including CDATA sections (please look at the HACK section. It it massively sub optimal), Safe perl execution of PI code, well formedness testing (although, I am not quite happy with the way I did it), DTD recognition and storage (I don't really parse it up to try to validate yet), an OO interface.

    I am still having problems with end of file testing. I am at a loss to figure it out right now so I did other stuff pending. Also, I thought about the ">" ending tag recognition. If I remember correctly, attribute values have to be bounded by quote so that should not be a problem. If a ">" can be part of an attribute name then that causes problems. I probably should look into it.

    If you have any ideas, it would be cool to comment on it. Anyway, here is the code. Have fun and thanks for all the comments from everyone before.



    package CyParser; require 5.6.1; use warnings; use strict; use utf8; use FileHandle; use IO::String; use Safe; our $VERSION = "1.00"; sub new { my $class = shift; my $input = shift; my $self = {}; bless $self, $class; if(defined $input) { $self->setInput($input); } return $self; } sub setInput { my $self = shift; my $input = shift; if($input->isa("IO::Handle")) { $self->{input} = $input; }else { if($input =~ m/\.xml/) { $self->{input} = new FileHandle "$input", "r"; }else { $self->{input} = new IO::String $input; } } if(not exists $self->{input} and not defined $self->{input}) { die "Invalid input type\n"; } } sub getDocument { my $self = shift; my $doc; if(exists $self->{doc}) { $doc = $self->{doc}; }else{ die "A document has not been parsed yet.\n"; } if(not defined $doc) { die "There is no document\n"; } return $doc; } sub reset { my $self = shift; delete $self->{doc}; delete $self->{input}; } sub getVerson { return $VERSION; } sub parse { #args my $self = shift; my $fh = $self->{input}; #pre-declared vars my %doc; my $preSymbol; my $currSymbol; my $currElement; my $depth = 0; CHAR: while(defined($currSymbol = getc($fh))) { if($currSymbol eq "<") { my $nextSymbol = getc($fh); my %element; my $elementName = ""; my $attributes = ""; #if this is a letter then this is a start tag #then read until you reach a space. #attribs should be following or the end of the tag if($nextSymbol =~ m/\p{IsAlpha}/) { $preSymbol = $currSymbol; $currSymbol = $nextSymbol; $depth++; until($currSymbol eq " " or $currSymbol eq ">") { $elementName .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } if($currSymbol eq " ") { until($currSymbol eq ">") { if($currSymbol eq "/") { $depth--; next CHAR; } $attributes .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } } if($currSymbol eq ">") { $element{name} = $elementName; $element{children} = []; $element{depth} = $depth; my %attribHash; #unrolling attributes into a hash #the attribute name is the hash key and the #attribute value is the value in the hash my @attribs = split /\p{IsSpace}/, $attributes; foreach my $attrib (@attribs) { my ($key, $val) = split /\p{IsSpace}*=\p{IsSpace}*/, $att +rib; if(not defined $key) { next; } $val =~ s/\"//g; $attribHash{$key} = $val; $element{attributes} = \%attribHash; } if($depth == 1) { $doc{root} = \%element; push @{$self->{starttags}}, $element{name} . "\t" . $eleme +nt{depth}; $currElement = $doc{root}; }else{ #this is checking to see if this element is #an empty element. The : are the boundries #for the regular expressions. if($element{name} =~ m:/:) { $element{name} =~ s:/::; push @{$currElement->{children}}, \%element; push @{$self->{endtags}}, $element{name} . "\t$depth"; push @{$self->{starttags}}, $element{name} . "\t$depth +"; $depth--; next CHAR; }else { push @{$currElement->{children}}, \%element; push @{$self->{starttags}}, $element{name} . "\t" . $e +lement{depth}; $currElement = \%element; next CHAR; } } }else { die "Symbol: $currSymbol is not recognized\n"; } } #if this is a / then this is an end tag #then read until the end > elsif($nextSymbol eq "/") { my $name = ""; $preSymbol = $currSymbol; $currSymbol = $nextSymbol; until($currSymbol eq ">") { $name .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } $name =~ s:/::g; #save for well formedness testing push @{$self->{endtags}}, $name . "\t$depth"; #set the depth correctly $depth--; #went forward one too may symbols seek($fh, -1, 1); } #if this is a ? then this is a processing instuction #read until the space for the application name #then compare that to make sure that it is not the xml decl #that happens at the beginning of a document #then read until the next ? elsif($nextSymbol eq "?") { my $appName = ""; my $appInfo = ""; $preSymbol = $currSymbol; $currSymbol = getc($fh); until($currSymbol eq " ") { $appName .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } if($appName eq "xml" or $appName eq "XML") { until($currSymbol eq "?") { $appInfo .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } #break up pseudo-attribs for easier access #and analysis my %xmlAttribs; my @attribs = split /\p{IsSpace}/, $appInfo; foreach my $attrib (@attribs) { my ($key, $val) = split /\p{IsSpace}*=\p{IsSpace}*/, $att +rib; if(not defined $key) { next; } $val =~ s/\"//g; $xmlAttribs{$key} = $val; } $doc{xmldecl} = \%xmlAttribs; next CHAR; }else { until($currSymbol eq "?") { $appInfo .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } $currElement->{PI}->{$appName} = $appInfo; if(exists $currElement->{PI}->{CyParser}) { my $cpt = new Safe; $cpt->share_from("XML::CyParser", ['%doc', '$currSymbol', +'$preSymbol', '$nextSymbol']); $cpt->reval($currElement->{PI}->{CyParser}); if($@) { warn $@; } } } }elsif($nextSymbol eq "!") { $preSymbol = $currSymbol; $currSymbol = getc($fh); if($currSymbol eq "[") { my $cdata = ""; my $ending = ""; $preSymbol = $currSymbol; $currSymbol = getc($fh); until($currSymbol eq "[") { $cdata .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } if($cdata ne "CDATA") { die "This is not a CDATA section!\n"; } $preSymbol = $currSymbol; $currSymbol = getc($fh); until($currSymbol eq "]") { $currElement->{content} .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } #this steps through the ending for the #CDATA section $ending .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); $ending .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); $ending .= $currSymbol; my ($first, $second, $thrid) = split //, $ending; #Hack: this should be a regex like #m/\[\[>/ but perl would not accept it if($first ne "]" and $second ne "]" and $thrid ne ">") { die "CDATA section was not ended correctly\n"; } }elsif($currSymbol eq "-") { $preSymbol = $currSymbol; $currSymbol = getc($fh); die "This is not a comment\n" unless $currSymbol eq "-"; $preSymbol = $currSymbol; $currSymbol = getc($fh); until($currSymbol eq "-") { $preSymbol = $currSymbol; $currSymbol = getc($fh); } }elsif($currSymbol =~ m/(\p{IsAlpha})/ and $depth == 0) { my $dtddecl; $dtddecl .= $preSymbol; $dtddecl .= $nextSymbol; $dtddecl .= $1; $preSymbol = $currSymbol; $currSymbol = getc($fh); until($currSymbol eq "]") { $dtddecl .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } $doc{dtd} .= $dtddecl . $currSymbol . ">"; } } }elsif($currSymbol =~ m/\p{IsAlnum}|\p{IsSpace}/) { my $content = ""; my $entity = ""; #parsing and replacing general entities until($currSymbol eq "<") { if($currSymbol eq "&") { $entity .= $currSymbol; until($currSymbol eq ";") { $preSymbol = $currSymbol; $currSymbol = getc($fh); $entity .= $currSymbol; } #these need to deal with double entities like #&amp; and &lt; if($entity eq "&amp;") { $content .= "&"; }elsif($entity eq "&#38") { until($currSymbol eq ";") { $preSymbol = $currSymbol; $currSymbol = getc($fh); $entity .= $currSymbol; } if($entity eq "&#38;&#38;") { $content .= "&"; }elsif($entity eq "&#38;&#60;") { $content .= "<"; }else{ die "Entity &#38; not fully entered: $entity\n"; } }elsif($entity eq "&gt;" or $entity eq "&#62;") { $content .= ">"; }elsif($entity eq "&lt;") { $content .= "<"; }elsif($entity eq "&apos;" or $entity eq "&#39;") { $content .= "\'"; }elsif($entity eq "&quot;" or $entity eq "&#34;") { $content .= "\""; } #go to the next char to get checked $preSymbol = $currSymbol; $currSymbol = getc($fh); }else { $content .= $currSymbol; $preSymbol = $currSymbol; $currSymbol = getc($fh); } } $currElement->{content} .= $content; #went forward one too many symbols seek($fh, -1, 1); next CHAR; } } $self->{doc} = \%doc; $self->checkWF(); } sub checkWF { my $self = shift; my ($package, $filename) = caller; #check to make sure that this is called ONLY from this parser unless($package eq __PACKAGE__ and $filename eq __FILE__) { die "CheckWF is an internal function and can only be called by the + XML::CyParser\n"; } eval { my @{$self->{endtags}} = sort @{$self->{endtags}}; my @{$self->{starttags}} = sort @{$self->{starttags}}; }; if($@) { die "This document is not well formed at the root element.\n"; } my %starttags = map {$_ => 1} @{$self->{starttags}}; my @diffs = grep {not $starttags{$_}} @{$self->{endtags}}; if(scalar @diffs != 0) { my $message = "Document not well formed at:\n"; foreach my $diff (@diffs) { $message .= "$diff\n"; } delete $self->{endtags}; delete $self->{starttags}; die $message; }else { delete $self->{endtags}; delete $self->{starttags}; } }

      Please, please, please, read the spec! And use some existing validation tests, such as those which come with XML::Parser. What you are parsing is not XML, but what you think is XML.

      2 exemples:

      • production 10 from the spec:
          AttValue  ::= '"' ([^<&"] | Reference)* '"' |  "'" ([^<&'] | Reference)* "'"
        which means that att='value' or even att='"' are perfectly well-formed attributes,
      • production 23:
          XMLDecl  ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
        so <?XML... does not start a valid XML document</code>

      I haven't really analyzed your code in detail, but other things strike me as odd: why is &#68;&#68; a special case? I could probably go on.

      I have a more general comment, please don't take it too personnaly, I really think you might want to think about it: you started the thread by saying that you wanted to keep your skills up. I believe you are not. Instead of hacking a clever, but useless and above all incorrect toy parser, I think your skills would benefit much more of doing a project properly. You can certainly try writing an XML parser, even though I think this might be a little too ambitious, but really, try doing it the right way: get the spec, write test cases (or even better, find existing ones), design your parser, and then write it, test it and start bragging about it ;--) Above all choose a softwrae development method and stick to it.

      You seem to have a good command of Perl, now try to improve your general software engineering skills. Believe me this will be way more valuable for you than what you are doing here.

        I really appreciate your comments and I really appreciate your time and effort to comment on my program. I have been using the spec. Especally since things have come to a point of more difficult problems when dealing with XML. Now looking at the spec, I could have sworn that it said both xml and XML were valid xml prolog markers. I was wrong. I cannot find it. Also, the &#68;&#68;. I could not find in my code at all. Now &#38;&#38; is the & sign here in the spec. Also, I found a problem with the ">" if it is used either in the attribute content or the name of the attribute. You were correct. It will cause problems.

        As for software development processes, I have used eXtreme programming in a production environment before (with Java rather than Perl). I liked the theory. I would like to try it again sometime but I will only try it with people that I already trust. I had personal problems with the the people whom I tried this theory. It ended in complete disaster. Now I know that personal problems were not really dealt with in the theory. I have a copy of eXtreme programming explained. Also, coming from a QA background the test before you code idea is a godsend. In any case, I do not take your comments personally. I am glad that you wrote them. I will take them under advisement. Like I said before this is a learning experience and I have learned much about the XML spec and how it is put together. This experience will now be wrapped up into a possibly new project that will be better than this one. Maybe one of these days I will have something to brag about ;). Again, I thank you very much for your comments.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://140781]
Approved by root
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others about the Monastery: (2)
As of 2024-04-19 18:44 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found