raindog308 has asked for the wisdom of the Perl Monks concerning the following question:

I've been tasked with parsing some (simplified) XML. And unfortunately, I've been told I can't use any external modules - only what comes with perl 5 - and it has to be "pure perl" (not using the expat C library). The XML is pretty simple, in a format like this:
<some_tag>
  <tag>value</tag>
  <tag2>value</tag2>
</some_tag>
Just elements and text values - no attributes. I figured I'd write an event-based parser, invoking handlers for each tag start and tag end. But what's the best way to write the parser? A state machine? If so, should it go character-by-character (collapse the input down to one line and then process by character)? Or is there a regex-based approach? Yes, I realize this is The Dumb Way to do XML, but that's what's been dropped in my lap.

Re: Parsing XML...by hand (ugh)
by Anonymous Monk on Dec 04, 2010 at 02:33 UTC
    And unfortunately, I've been told I can't use any external modules - only what comes with perl 5 - and it has to be "pure perl" (not using the expat C library).
    Why? The answer to that question will determine where replies to you will begin and end.
Re: Parsing XML...by hand (ugh)
by grantm (Parson) on Dec 05, 2010 at 03:08 UTC
    I've been told I can't use any external modules - only what comes with perl 5

    If you're running on Windows, there's a good chance that your Perl installation shipped with at least one XML parser module. Both Activestate Perl and Strawberry Perl ship with XML::Parser and the underlying libexpat library. Strawberry Perl also includes XML::LibXML which would be a better choice than XML::Parser.

    If you're not running on Windows then the XML parser libraries are probably readily available from your operating system's software package repositories.

    While writing your own XML parsing code is a valuable learning experience, chances are an existing module would be a much better choice for deploying in production.

Re: Parsing XML...by hand (ugh)
by Khen1950fx (Canon) on Dec 04, 2010 at 10:28 UTC
Re: Parsing XML...by hand (ugh)
by markhh (Novice) on Dec 04, 2010 at 21:52 UTC
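    I think a big while loop with three regexes will get you most of the way there (this assumes the whole document is in $_).
        while (1) {
            # Try the closing tag first, so "</tag>" is not taken as a start tag.
            m{\G</([^>]+)>}gc and do { end_tag($1);   next; };
            m{\G<([^>]+)>}gc  and do { start_tag($1); next; };
            # + rather than *, so the text pattern never matches the empty string.
            m{\G([^<]+)}gc    and do { body($1);      next; };
            last;
        }
    These shouldn't need an /s modifier, since negated character classes such as [^<] already match newlines.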
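To make the loop in markhh's reply concrete, here is a minimal runnable sketch that uses only core Perl. The handler names start_tag, end_tag and body come from the reply; their bodies, the @events log, the @open stack and the sample input are assumptions added for illustration. It handles only the attribute-free format from the question (no comments, CDATA, entities or self-closing tags), so treat it as a starting point rather than an XML parser.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input, modelled on the format in the question.
my $xml = "<some_tag>\n  <tag>value</tag>\n  <tag2>value</tag2>\n</some_tag>\n";

my @events;   # log of handler calls, in document order
my @open;     # stack of currently open tag names

# Handler names follow the reply above; the bodies are illustrative only.
sub start_tag {
    my ($name) = @_;
    push @open,   $name;
    push @events, "start:$name";
}

sub end_tag {
    my ($name) = @_;
    my $top = pop @open;
    die "mismatched </$name>\n" unless defined $top && $top eq $name;
    push @events, "end:$name";
}

sub body {
    my ($text) = @_;
    return unless $text =~ /\S/;    # ignore whitespace between tags
    push @events, "text:$text";
}

sub parse {
    my ($input) = @_;
    for ($input) {                  # alias $_ so \G and pos work on the input
        while (1) {
            m{\G</([^<>]+)>}gc and do { end_tag($1);   next; };
            m{\G<([^<>/]+)>}gc and do { start_tag($1); next; };
            m{\G([^<]+)}gc     and do { body($1);      next; };
            last;                   # nothing matched: end of input or bad markup
        }
        my $stopped = pos($_);
        $stopped = 0 unless defined $stopped;
        die "parse error at offset $stopped\n" if $stopped < length($_);
    }
    die "unclosed <$open[-1]>\n" if @open;
}

parse($xml);
print "$_\n" for @events;
```

The /c flag keeps pos where it was when a pattern fails, which is what lets the three patterns take turns at the same position. Checking pos against the input length after the loop turns anything the patterns cannot consume into an error instead of silently stopping.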
Re: Parsing XML...by hand (ugh)
by Generoso (Prior) on Dec 06, 2010 at 03:37 UTC

    This is by no means the finished product, but it may help you get started.

    #!/usr/bin/perl -w
    use strict;

    # Reads the DATA section line by line, so each <tag>value</tag> pair
    # is expected to sit on a single line.
    my ($xmlvar,$i,$k,$v);
    my %hash = ();
    while (<DATA>) {
        print;
        chomp($_);
        $i = 0;
        while ($i++ < length($_)) {
            # 1: find the next '<'
            print "1: '";
            print $1 if /(<)/gc;
            print "', pos=", pos, "\n";
            # 2: the tag name directly after it
            $k = '';
            print "2: '";
            $k = $1 if /\G([A-Za-z0-9]+)/gc;
            print "$k', pos=", pos, "\n";
            my $f = '</'.$k.'>';    # the closing tag to look for
            # 3: the '>' that ends the opening tag
            print "3: '";
            print $1 if /(>)/gc;
            print "', pos=", pos, "\n";
            # 4: everything up to the closing tag on the same line
            $v = '';
            print "4: '";
            $v = $1 if /\G(.*)\Q$f\E/gc;
            print "$v', pos=", pos, "\n";
            $i = pos;
            print $k,' << ',$v,"\n";
            $hash{$k} .= $v.',' if defined $k;
        }
        print "Final: '$1', pos=",pos,"\n" if /\G(.)/;
    }
    # Strip the trailing comma from each collected list.
    s/,\z// for values %hash;
    while ( my ($key, $value) = each(%hash) ) {
        print "$key => $value\n";
    }
    __DATA__
    <dataset>
    <row>
    <name>dog</name>
    <category>home pet</category>
    </row>
    <row>
    <name>cat</name><category>home pet</category>
    </row>
    <row>
    <name>penguin</name>
    <category>fish</category>
    </row>
    <row>
    <name>lax</name>
    <category>wile</category>
    </row>
    <row>
    <name>whale</name>
    <category>fish</category>
    </row>
    <row>
    <name>ostrich</name>
    <category>bird</category>
    </row>
    <row>
    <name>catfish</name>
    <category>fish</category>
    </row>
    </dataset>
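If all you need is what the script above ends up printing, the values collected per tag name, a single regex with a backreference may be enough. This is a different, simpler technique than the position-tracking loop in the reply, sketched here under the assumption that only leaf elements (tags whose content has no nested tags) matter. The %values hash and the shortened sample input are illustrative, not from the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input, a shortened version of the DATA section above.
my $xml = <<'XML';
<dataset>
<row> <name>dog</name> <category>home pet</category> </row>
<row> <name>cat</name><category>home pet</category> </row>
<row> <name>whale</name> <category>fish</category> </row>
</dataset>
XML

my %values;   # tag name => array ref of text values, in document order

# Match leaf elements only: an opening tag, text containing no '<',
# and the matching closing tag (\1 refers back to the captured name).
while ($xml =~ m{<([A-Za-z0-9_]+)>([^<]*)</\1>}g) {
    push @{ $values{$1} }, $2;
}

for my $tag (sort keys %values) {
    print "$tag => ", join(',', @{ $values{$tag} }), "\n";
}
```

Because [^<]* cannot cross another tag, container elements such as dataset and row never match and do not show up in the hash. The price is that the document's nesting is ignored entirely, so this only suits flat record-style data like the sample.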