Parsing XML...by hand (ugh)

raindog308 has asked for the wisdom of the Perl Monks concerning the following question:

I've been tasked with parsing some (simplified) XML. And unfortunately, I've been told I can't use any external modules - only what comes with perl 5 - and it has to be "pure perl" (not using the expat C library). The XML is pretty simple, in a format like this:

<some_tag>
  <tag>value</tag>
  <tag2>value</tag2>
</some_tag>

Just tags and elements - no attributes. I figured I'd write an event-based parser, invoking handlers for each tag start and tag end. But what's the best way to write the parser? A state machine? If so, should it go character-by-character (collapse the input down to one line and then process it one character at a time)? Or is there a regex-based approach? Yes, I realize this is The Dumb Way to do XML, but that's what's been dropped in my lap.


Replies are listed 'Best First'.
Re: Parsing XML...by hand (ugh) by Anonymous Monk on Dec 04, 2010 at 02:33 UTC
"And unfortunately, I've been told I can't use any external modules - only what comes with perl 5 - and it has to be 'pure perl' (not using the expat C library)."
Why? The answer to that question will determine where replies to you will begin and end.
Re: Parsing XML...by hand (ugh) by grantm (Parson) on Dec 05, 2010 at 03:08 UTC
"I've been told I can't use any external modules - only what comes with perl 5"
If you're running on Windows, there's a good chance that your Perl installation shipped with at least one XML parser module. Both ActiveState Perl and Strawberry Perl ship with XML::Parser and the underlying libexpat library. Strawberry Perl also includes XML::LibXML, which would be a better choice than XML::Parser. If you're not running on Windows, the XML parser libraries are probably readily available from your operating system's software package repositories. While writing your own XML parsing code is a valuable learning experience, chances are an existing module would be a much better choice for deploying in production.
Re: Parsing XML...by hand (ugh) by Khen1950fx (Canon) on Dec 04, 2010 at 10:28 UTC
Maybe this will give you some ideas on how to proceed: Perl XML-like parser.
Re: Parsing XML...by hand (ugh) by markhh (Novice) on Dec 04, 2010 at 21:52 UTC
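I think a big while loop with three regexps will get you most of the way there.

```perl
# Assumes the XML is in $_; start_tag(), end_tag() and body() are your handlers.
while (1) {
    # Test for an end tag first: <([^>]+)> would also match </tag>.
    m{\G</([^>]+)>}gc and do { end_tag($1);   next };
    m{\G<([^>]+)>}gc  and do { start_tag($1); next };
    # Use + rather than * so body() is never called with an empty string.
    m{\G([^<]+)}gc    and do { body($1);      next };
    last;
}
```

No /s modifier is needed on those regexps: negated character classes such as [^<] already match newlines.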
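To show how that loop could be wired to real handlers, here is a minimal self-contained sketch. The handler names start_tag, end_tag and body come from the loop above; parse_xml, @events and @stack are names made up for illustration, and the handlers only record events where real ones would do the actual work. It assumes the simplified format from the question (no attributes, comments, CDATA or entities) and tries the end-tag pattern first so that </tag> is never taken for a start tag.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @events;    # hypothetical event log filled in by the handlers
my @stack;     # currently open tags, used to catch mismatched end tags

sub start_tag {
    my ($name) = @_;
    push @stack,  $name;
    push @events, "start:$name";
}

sub end_tag {
    my ($name) = @_;
    my $open = pop @stack;
    die "unexpected </$name>\n" unless defined $open && $open eq $name;
    push @events, "end:$name";
}

sub body {
    my ($text) = @_;
    # Ignore the whitespace-only runs between tags.
    push @events, "text:$text" if $text =~ /\S/;
}

# Tokenize $xml with \G-anchored matches and call a handler per token.
sub parse_xml {
    my ($xml) = @_;
    @events = ();
    @stack  = ();
    while (1) {
        if ($xml =~ m{\G</([^<>]+)>}gc)      { end_tag($1);   next }
        if ($xml =~ m{\G<([^<>/][^<>]*)>}gc) { start_tag($1); next }
        if ($xml =~ m{\G([^<]+)}gc)          { body($1);      next }
        last;
    }
    # If no pattern matched before the end of the input, it is malformed.
    my $pos = pos($xml);
    $pos = 0 unless defined $pos;
    die "parse error at offset $pos\n" if $pos < length($xml);
    die "unclosed <$stack[-1]>\n" if @stack;
    return @events;
}

my $xml = <<'XML';
<some_tag>
  <tag>value</tag>
  <tag2>value</tag2>
</some_tag>
XML

print "$_\n" for parse_xml($xml);
```

The /c modifier keeps pos() where it was when a pattern fails, which is what lets the three patterns take turns at the same position. The stack is optional, but it makes mismatched or unclosed tags fail loudly instead of silently producing odd events.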
Re: Parsing XML...by hand (ugh) by Generoso (Prior) on Dec 06, 2010 at 03:37 UTC
This is by no means the finished product, but it may help you get started.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %hash;    # tag name => comma-separated list of the values seen for it

while (<DATA>) {
    chomp;
    while (1) {
        # 1: find the next '<'
        /</gc or last;
        # 2: read the tag name; this fails on a closing tag such as </row>
        /\G([A-Za-z0-9]+)/gc or next;
        my $k = $1;
        # 3: the '>' must follow immediately
        /\G>/gc or next;
        # 4: take the text up to the matching closing tag on the same line;
        #    container tags such as <row> fail here and are skipped
        /\G([^<]*)<\/\Q$k\E>/gc or next;
        my $v = $1;
        print "$k << $v\n";
        $hash{$k} .= "$v,";
    }
}

s/,\z// for values %hash;    # drop the trailing comma from each list

for my $key (sort keys %hash) {
    print "$key => $hash{$key}\n";
}
__DATA__
<dataset>
<row>
<name>dog</name>
<category>home pet</category>
</row>
<row>
<name>cat</name><category>home pet</category>
</row>
<row>
<name>penguin</name>
<category>fish</category>
</row>
<row>
<name>lax</name>
<category>wile</category>
</row>
<row>
<name>whale</name>
<category>fish</category>
</row>
<row>
<name>ostrich</name>
<category>bird</category>
</row>
<row>
<name>catfish</name>
<category>fish</category>
</row>
</dataset>
```