slloyd has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to use a regular expression to parse an XML string. I cannot seem to figure out how to take into account attributes that may be in the initial tag. Can someone please help me?

use strict;

my $xml=<<ENDOFXML;
<shirt size="14"> ... stuff... </tag>
<hat name="red" size="7"> ...more stuff... </hat>
<shoe> ...shoe stuff... </shoe>
ENDOFXML

while($xml=~m/\<(.+?)\>(.+?)\<\/\1\>/sig){
    my $tag=lc(strip($1));
    my $attribs="?";
    my $value=strip($2);
    print "Tag: $tag\n";
    print "\tattributes: $attribs\n";
    print "\tvalue: $value\n";
}
print "Done\n";
exit;
###############
sub strip{
    #usage: $str=strip($str);
    #info: strips off beginning and endings returns, newlines, tabs, and spaces
    my $str=shift;
    if(length($str)==0){return;}
    $str=~s/^[\r\n\s\t]+//s;
    $str=~s/[\r\n\s\t]+$//s;
    return $str;
}

Replies are listed 'Best First'.
Re: Regular expression - parsing and xml string
by toolic (Bishop) on Feb 22, 2008 at 16:01 UTC
    In general, parsing XML is difficult, and most problems have already been solved by others. You should consider using one of the many CPAN modules, such as XML::Simple or XML::Parser.
Re: Regular expression - parsing and xml string
by pc88mxer (Vicar) on Feb 22, 2008 at 16:10 UTC
    Try changing your while regex to something like:

    while($xml=~m/\<(.+?)(\s.*?)?\>(.+?)\<\/\1\>/sig){

    and then $2 holds your attributes (including the leading whitespace) and $3 is your element body.
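
To make the capture groups concrete, here is a minimal sketch of one way to tighten the pattern and split the attribute text into name/value pairs. It is not the regex from the post: the tag name is restricted to non-space, non-`>` characters, and the helper name `parse_elements` and the sample data are assumptions made for illustration. It only handles flat, double-quoted input like the question's; nested elements of the same name, self-closing tags, and `>` inside attribute values will all break it.

```perl
use strict;
use warnings;

# Sample input modeled on the question's data (illustrative only).
my $xml = <<'ENDOFXML';
<hat name="red" size="7">
...more stuff...
</hat>
<shoe>
...shoe stuff...
</shoe>
ENDOFXML

# parse_elements is a hypothetical helper name, not from the thread.
# It returns a list of hashrefs: { tag, attrs, body }.
sub parse_elements {
    my ($text) = @_;
    my @found;

    # $1 = tag name, $2 = raw attribute text, $3 = element body
    while ($text =~ m{<([^\s>/]+)([^>]*)>(.*?)</\1>}sg) {
        my ($tag, $attr_text, $body) = ($1, $2, $3);

        # Pull name="value" pairs out of the raw attribute text.
        my %attrs;
        while ($attr_text =~ m{([\w:.-]+)\s*=\s*"([^"]*)"}g) {
            $attrs{$1} = $2;
        }

        # Trim leading and trailing whitespace from the body.
        $body =~ s/^\s+|\s+$//g;

        push @found, { tag => lc $tag, attrs => \%attrs, body => $body };
    }
    return @found;
}

for my $el (parse_elements($xml)) {
    my $attribs = join ', ',
        map { "$_=$el->{attrs}{$_}" } sort keys %{ $el->{attrs} };
    print "Tag: $el->{tag}\n";
    print "\tattributes: $attribs\n";
    print "\tvalue: $el->{body}\n";
}
```

Copying `$1`, `$2`, and `$3` into lexicals before running the inner match matters, because the inner regex overwrites the capture variables.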

    You also might consider one of the various XML parsing modules available from CPAN like XML::Simple for instance. Robust XML parsing is not simple - there are just so many details that can trip you up.
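
For comparison, here is a rough sketch of the module route using XML::Parser's handler interface. It assumes XML::Parser is installed from CPAN. Because the question's fragments have no single root element, the sample wraps them in a hypothetical `<items>` root so the input is well-formed; that wrapper and the variable names are illustrative, not from the thread.

```perl
use strict;
use warnings;
use XML::Parser;    # CPAN module, assumed to be installed

# The <items> root is an assumption added to make the input well-formed.
my $xml = <<'ENDOFXML';
<items>
<hat name="red" size="7"> ...more stuff... </hat>
<shoe> ...shoe stuff... </shoe>
</items>
ENDOFXML

my @elements;    # finished elements, in document order
my $current;     # element currently being collected, if any

my $parser = XML::Parser->new(
    Handlers => {
        # Called at each start tag with the tag name and its attributes.
        Start => sub {
            my ($expat, $name, %attrs) = @_;
            return if $name eq 'items';    # skip the wrapper root
            $current = { tag => $name, attrs => {%attrs}, body => '' };
        },
        # Called for character data; may fire several times per element.
        Char => sub {
            my ($expat, $text) = @_;
            $current->{body} .= $text if $current;
        },
        # Called at each end tag; store the finished element.
        End => sub {
            my ($expat, $name) = @_;
            return unless $current && $name eq $current->{tag};
            $current->{body} =~ s/^\s+|\s+$//g;
            push @elements, $current;
            undef $current;
        },
    },
);

$parser->parse($xml);

for my $el (@elements) {
    my $attribs = join ', ',
        map { "$_=$el->{attrs}{$_}" } sort keys %{ $el->{attrs} };
    print "Tag: $el->{tag}\n";
    print "\tattributes: $attribs\n";
    print "\tvalue: $el->{body}\n";
}
```

The parser takes care of quoting, entities, and malformed input (it dies on a mismatched close tag), which are the details a hand-written regex tends to get wrong. This sketch only tracks one level of elements under the root; nested content would need a stack instead of a single `$current`.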