Re: A Question on a homebrew XML parser

Well, I just wanted to let everyone know that I have made some major improvements to my code. I have added many features including CDATA sections (please look at the HACK section. It it massively sub optimal), Safe perl execution of PI code, well formedness testing (although, I am not quite happy with the way I did it), DTD recognition and storage (I don't really parse it up to try to validate yet), an OO interface.

I am still having problems with end of file testing. I am at a loss to figure it out right now so I did other stuff pending. Also, I thought about the ">" ending tag recognition. If I remember correctly, attribute values have to be bounded by quote so that should not be a problem. If a ">" can be part of an attribute name then that causes problems. I probably should look into it.

If you have any ideas, it would be cool to comment on it. Anyway, here is the code. Have fun and thanks for all the comments from everyone before.

package CyParser;

require 5.6.1;

use warnings;
use strict;
use utf8;
use FileHandle;
use IO::String;
use Safe;

our $VERSION = "1.00";

sub new {
    my $class = shift;
    my $input = shift;
    my $self = {};
    
    bless $self, $class;

    if(defined $input) {
    $self->setInput($input);
    }

    return $self;
}

sub setInput {
    my $self = shift;
    my $input = shift;

    if($input->isa("IO::Handle")) {
    $self->{input} = $input;
    }else {
    if($input =~ m/\.xml/) {
        $self->{input} = new FileHandle "$input", "r";
    }else {
        $self->{input} = new IO::String $input;
    }
    }

    if(not exists $self->{input} and not defined $self->{input}) {
    die "Invalid input type\n";
    }
}

sub getDocument {
    my $self = shift;
    my $doc;

    if(exists $self->{doc}) {
    $doc = $self->{doc};
    }else{
    die "A document has not been parsed yet.\n";
    }

    if(not defined $doc) {
    die "There is no document\n";
    }

    return $doc;
}

sub reset {
    my $self = shift;

    delete $self->{doc};
    delete $self->{input};
}

sub getVerson {
    return $VERSION;
}

sub parse {
    #args
    my $self = shift;
    my $fh = $self->{input};

    #pre-declared vars
    my %doc;
    my $preSymbol;
    my $currSymbol;
    my $currElement;
    my $depth = 0;

    CHAR: while(defined($currSymbol = getc($fh))) {
    if($currSymbol eq "<") {
        my $nextSymbol = getc($fh);

        my %element;
        my $elementName = "";
        my $attributes = "";

        #if this is a letter then this is a start tag
        #then read until you reach a space.
        #attribs should be following or the end of the tag
        if($nextSymbol =~ m/\p{IsAlpha}/) {
        $preSymbol = $currSymbol;
        $currSymbol = $nextSymbol;
        $depth++;

        until($currSymbol eq " " or $currSymbol eq ">") {
            $elementName .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
        }
        
        if($currSymbol eq " ") {
            until($currSymbol eq ">") {
            if($currSymbol eq "/") {
                $depth--;
                next CHAR;
            }

            $attributes .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }
        }
        
        if($currSymbol eq ">") {
            $element{name} = $elementName;
            $element{children} = [];
            $element{depth} = $depth;

            my %attribHash;

            #unrolling attributes into a hash
            #the attribute name is the hash key and the 
            #attribute value is the value in the hash
            my @attribs = split /\p{IsSpace}/, $attributes;

            foreach my $attrib (@attribs) {
            my ($key, $val)  = split /\p{IsSpace}*=\p{IsSpace}*/, $att
+rib;
            
            if(not defined $key) {
                next;
            }

            $val =~ s/\"//g;

            $attribHash{$key} = $val;
            
            $element{attributes} = \%attribHash;
            }
            
            if($depth == 1) {
            $doc{root} = \%element;
            push @{$self->{starttags}}, $element{name} . "\t" . $eleme
+nt{depth};

            $currElement = $doc{root};
            }else{
            #this is checking to see if this element is
            #an empty element.  The : are the boundries
            #for the regular expressions.
            if($element{name} =~ m:/:) {
                $element{name} =~ s:/::;

                push @{$currElement->{children}}, \%element;
                push @{$self->{endtags}}, $element{name} . "\t$depth";
                push @{$self->{starttags}}, $element{name} . "\t$depth
+";

                $depth--;
                next CHAR;
            }else {
                push @{$currElement->{children}}, \%element;
                push @{$self->{starttags}}, $element{name} . "\t" . $e
+lement{depth};

                $currElement = \%element;
                next CHAR;
            }
            }
        }else {
            die "Symbol: $currSymbol is not recognized\n";
        }
        }
        #if this is a / then this is an end tag
        #then read until the end >        
        elsif($nextSymbol eq "/") {
        my $name = "";
        $preSymbol = $currSymbol;
        $currSymbol = $nextSymbol;
        
        until($currSymbol eq ">") {
            $name .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
        }

        $name =~ s:/::g;

        #save for well formedness testing
        push @{$self->{endtags}}, $name . "\t$depth";

        #set the depth correctly
        $depth--;

        #went forward one too may symbols
        seek($fh, -1, 1);
        }
        #if this is a ? then this is a processing instuction
        #read until the space for the application name
        #then compare that to make sure that it is not the xml decl
        #that happens at the beginning of a document
        #then read until the next ?
        elsif($nextSymbol eq "?") {
        my $appName = "";
        my $appInfo = "";

        $preSymbol = $currSymbol;
        $currSymbol = getc($fh);

        until($currSymbol eq " ") {
            $appName .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
        }

        if($appName eq "xml" or $appName eq "XML") {

            until($currSymbol eq "?") {
            $appInfo .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }

            #break up pseudo-attribs for easier access
            #and analysis

            my %xmlAttribs;

            my @attribs = split /\p{IsSpace}/, $appInfo;

            foreach my $attrib (@attribs) {
            my ($key, $val)  = split /\p{IsSpace}*=\p{IsSpace}*/, $att
+rib;
            
            if(not defined $key) {
                next;
            }

            $val =~ s/\"//g;

            $xmlAttribs{$key} = $val;
            }

            $doc{xmldecl} = \%xmlAttribs;
            next CHAR;
        }else {
            until($currSymbol eq "?") {
            $appInfo .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }
            
            $currElement->{PI}->{$appName} = $appInfo;

            if(exists $currElement->{PI}->{CyParser}) {
            my $cpt = new Safe;
            $cpt->share_from("XML::CyParser", ['%doc', '$currSymbol', 
+'$preSymbol', '$nextSymbol']);
            $cpt->reval($currElement->{PI}->{CyParser});

            if($@) {
                warn $@;
            }
            }
        }
        }elsif($nextSymbol eq "!") {
        $preSymbol = $currSymbol;
        $currSymbol = getc($fh);

        if($currSymbol eq "[") {

            my $cdata = "";
            my $ending = "";

            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);

            until($currSymbol eq "[") {
            $cdata .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }
            
            if($cdata ne "CDATA") {
            die "This is not a CDATA section!\n";
            }

            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            
            until($currSymbol eq "]") {
            $currElement->{content} .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }
            
            #this steps through the ending for the
            #CDATA section

            $ending .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            $ending .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            $ending .= $currSymbol;

            my ($first, $second, $thrid) = split //, $ending;

            #Hack: this should be a regex like
            #m/\[\[>/ but perl would not accept it
            if($first ne "]" and $second ne "]" and $thrid ne ">") {
            die "CDATA section was not ended correctly\n";
            }
        }elsif($currSymbol eq "-") {
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);

            die "This is not a comment\n" unless $currSymbol eq "-";

            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);

            until($currSymbol eq "-") {
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }
        }elsif($currSymbol =~ m/(\p{IsAlpha})/ and $depth == 0) {
            my $dtddecl;

            $dtddecl .= $preSymbol;
            $dtddecl .= $nextSymbol;
            $dtddecl .= $1;

            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);

            until($currSymbol eq "]") {
            $dtddecl .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            }

            $doc{dtd} .= $dtddecl . $currSymbol . ">";
        }
        }
    }elsif($currSymbol =~ m/\p{IsAlnum}|\p{IsSpace}/) {
        my $content = "";
        my $entity = "";

        #parsing and replacing general entities
        until($currSymbol eq "<") {
        if($currSymbol eq "&") {
            $entity .= $currSymbol;

            until($currSymbol eq ";") {
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
            $entity .= $currSymbol;
            }

            #these need to deal with double entities like
            #&amp; and &lt;
            if($entity eq "&amp;") {
            $content .= "&";
            }elsif($entity eq "&#38") {
            until($currSymbol eq ";") {
                $preSymbol = $currSymbol;
                $currSymbol = getc($fh);
                $entity .= $currSymbol;
            }

            if($entity eq "&#38;&#38;") {
                $content .= "&";
            }elsif($entity eq "&#38;&#60;") {
                $content .= "<";
            }else{
                die "Entity &#38; not fully entered: $entity\n";
            }
            }elsif($entity eq "&gt;" or $entity eq "&#62;") {
            $content .= ">";
            }elsif($entity eq "&lt;") {
            $content .= "<";
            }elsif($entity eq "&apos;" or $entity eq "&#39;") {
            $content .= "\'";
            }elsif($entity eq "&quot;" or $entity eq "&#34;") {
            $content .= "\"";
            }

            #go to the next char to get checked
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
        }else {
            $content .= $currSymbol;
            $preSymbol = $currSymbol;
            $currSymbol = getc($fh);
        }
        }
        
        $currElement->{content} .= $content;

        #went forward one too many symbols
        seek($fh, -1, 1);
        next CHAR;
    }
    }
    
    $self->{doc} = \%doc;
    
    $self->checkWF();
}

sub checkWF {
    my $self = shift;

    my ($package, $filename) = caller;

    #check to make sure that this is called ONLY from this parser
    unless($package eq __PACKAGE__ and $filename eq __FILE__) {
    die "CheckWF is an internal function and can only be called by the
+ XML::CyParser\n";
    }

    eval {
    my @{$self->{endtags}} = sort @{$self->{endtags}};
    my @{$self->{starttags}} = sort @{$self->{starttags}};
    };

    if($@) {
    die "This document is not well formed at the root element.\n";
    }

    my %starttags = map {$_ => 1} @{$self->{starttags}};

    my @diffs = grep {not $starttags{$_}} @{$self->{endtags}};

    if(scalar @diffs != 0) {
    my $message = "Document not well formed at:\n";

    foreach my $diff (@diffs) {
        $message .= "$diff\n";
    }

    delete $self->{endtags};
    delete $self->{starttags};

    die $message;
    }else {
    delete $self->{endtags};
    delete $self->{starttags};
    }
}
[download]

Comment on Re: A Question on a homebrew XML parser Download Code

Replies are listed 'Best First'.
Re: Re: A Question on a homebrew XML parser by mirod (Canon) on Feb 25, 2002 at 12:21 UTC
Please, please, please, read the spec! And use some existing validation tests, such as those which come with XML::Parser. What you are parsing is not XML, but what you think is XML. 2 exemples: production 10 from the spec: `AttValue ::= '"' ([^<&"] \| Reference)* '"' \| "'" ([^<&'] \| Reference)* "'"` which means that `att='value'` or `even att='"'` are perfectly well-formed attributes, production 23: `XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'` so <?XML... does not start a valid XML document</code> I haven't really analyzed your code in detail, but other things strike me as odd: why is DD a special case? I could probably go on. I have a more general comment, please don't take it too personnaly, I really think you might want to think about it: you started the thread by saying that you wanted to keep your skills up. I believe you are not. Instead of hacking a clever, but useless and above all incorrect toy parser, I think your skills would benefit much more of doing a project properly. You can certainly try writing an XML parser, even though I think this might be a little too ambitious, but really, try doing it the right way: get the spec, write test cases (or even better, find existing ones), design your parser, and then write it, test it and start bragging about it ;--) Above all choose a softwrae development method and stick to it. You seem to have a good command of Perl, now try to improve your general software engineering skills. Believe me this will be way more valuable for you than what you are doing here.	[reply] [d/l] [select]
Re: Re: Re: A Question on a homebrew XML parser by cyocum (Curate) on Feb 25, 2002 at 22:49 UTC
I really appreciate your comments and I really appreciate your time and effort to comment on my program. I have been using the spec. Especally since things have come to a point of more difficult problems when dealing with XML. Now looking at the spec, I could have sworn that it said both xml and XML were valid xml prolog markers. I was wrong. I cannot find it. Also, the `DD`. I could not find in my code at all. Now `&&` is the & sign here in the spec. Also, I found a problem with the ">" if it is used either in the attribute content or the name of the attribute. You were correct. It will cause problems. As for software development processes, I have used eXtreme programming in a production environment before (with Java rather than Perl). I liked the theory. I would like to try it again sometime but I will only try it with people that I already trust. I had personal problems with the the people whom I tried this theory. It ended in complete disaster. Now I know that personal problems were not really dealt with in the theory. I have a copy of eXtreme programming explained. Also, coming from a QA background the test before you code idea is a godsend. In any case, I do not take your comments personally. I am glad that you wrote them. I will take them under advisement. Like I said before this is a learning experience and I have learned much about the XML spec and how it is put together. This experience will now be wrapped up into a possibly new project that will be better than this one. Maybe one of these days I will have something to brag about ;). Again, I thank you very much for your comments.	[reply] [d/l] [select]