tshabet has asked for the wisdom of the Perl Monks concerning the following question:

Back again with another question :-) I have XML of this form
<heading level=2> Introduction to Arguments</heading>
and
<heading> <index primary-key="procedures" secondary-key="definition, rest arguments in"/> <index primary-key="rest arguments" secondary-key="specifying, in procedure definition"/> level=3, Specifying Rest Arguments in a Procedure Definition</heading>
And I would like to integrate the "level" attribute into the <heading> tag, so that the first would become
<heading2> Introduction to Arguments</heading2>
and the second would become
<heading3> <index primary-key="procedures" secondary-key="definition, rest arguments in"/> <index primary-key="rest arguments" secondary-key="specifying, in procedure definition"/> Specifying Rest Arguments in a Procedure Definition</heading3>
To do this I've written a regex which, unfortunately, fails to work. Here's my regex:
$text =~ s/<heading([^(<\/heading>)]*?)level=(.{1})([^(<\/heading>)]*?)<\/heading>/<heading$2>$1$3<\/heading$2>/gsix;
Any hints on where my logic goes astray? Thanks for any and all info, sorry to bother everyone with such a simple question.

Replies are listed 'Best First'.
(jeffa) Re: Regex for XML attributes...
by jeffa (Bishop) on Aug 23, 2001 at 23:39 UTC
    First off, your XML is not well-formed - don't forget to place quotes around the level values as well.

    Second off, use XML::Simple instead . . .

    use strict;
    use XML::Simple;

    local $/;    # just so i can slurp __DATA__
    my $old_xml = XMLin(<DATA>, forcearray => 1);
    my $new_xml;

    foreach my $heading (@{ $old_xml->{'heading'} }) {
        # remove the level attribute and use it in the new key name
        my $level = delete $heading->{'level'};
        $new_xml->{"heading$level"} = $heading;
    }

    print XMLout($new_xml);

    __DATA__
    <xml>
      <heading level="2">Introduction to Arguments</heading>
      <heading level="3">
        <index primary-key="procedures" secondary-key="definition, rest arguments in"/>
        <index primary-key="rest arguments" secondary-key="specifying, in procedure definition"/>
        Specifying Rest Arguments in a Procedure Definition
      </heading>
    </xml>

    The idea is to turn the XML into an anonymous data structure and mangle that instead - all you have to do is iterate thru the 'heading' hashes and remove the level attribute/key - then you create a new data structure whose keys are the concatenation of 'heading' with the level value.

    This works, but i only tested it on the XML that i provided, your mileage may vary. ;)

    output:

    <opt>
      <heading2>Introduction to Arguments</heading2>
      <heading3>
        Specifying Rest Arguments in a Procedure Definition
        <index primary-key="procedures" secondary-key="definition, rest arguments in" />
        <index primary-key="rest arguments" secondary-key="specifying, in procedure definition" />
      </heading3>
    </opt>

    jeffa

        A flute with no holes is not a flute . . .
    a doughnut with no holes is a danish.
                                    - Basho,
                                      famous philosopher
    
      When you say that it isn't well formed XML, are you referring only to the unquoted 2 in the tag, or something else? Anyway, this is a fantastic solution, and I will implement it right away...talk about service ;-P
      OK, so certainly you've illuminated the best way to do it, but....just for my own learning....What's failing with the regex? Anybody? Anybody? Bueller?
      Thanks jeffa, you're a life saver!
        Yes, the unquoted 2 and 3 level attributes need to be quoted, else XML::Simple complains:
        not well-formed (invalid token) at line . . .
        Also, i really don't think this will do what you think:
        <heading> level="3", blah blah blah </heading>
        You really should change that to:
        <heading level="3">blah blah blah</heading>
        You are welcome for the solution! . . . i would like to see a regex that solved the problem as well, but i imagine it would be an unruly beast . . . anybody? bueller? japhy?

        Hmmmm, on second thought - this is something probably best not done!
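
        On the question of what fails in the original regex: [^(<\/heading>)] is a character class, so it excludes the individual characters ( < / h e a d i n g > ) rather than the string </heading>. After level=2 the next character is >, which the class rejects, so the match can never reach the closing tag. What follows is only a minimal sketch of a regex-only version, assuming exactly the two shapes from the question (a single-digit level, no nested headings). The name merge_level is hypothetical, and a real parser remains the safer route.

```perl
use strict;
use warnings;

# The bracketed part of the original pattern is a character class:
# it rejects single characters, not the string "</heading>".
my $class    = qr/[^(<\/heading>)]/i;
my @rejected = grep { $_ !~ $class } split //, 'level=2> Intro';

# Hypothetical helper: fold level=N into the tag name for the two
# shapes shown in the question. Not a general XML transformation.
sub merge_level {
    my ($text) = @_;

    # Shape 1: level is an attribute of the start tag (quoted or not).
    $text =~ s{<heading\s+level="?(\d)"?\s*>(.*?)</heading>}
              {<heading$1>$2</heading$1>}gis;

    # Shape 2: "level=N," sits in the element content. The negative
    # lookahead (?!</heading>) expresses "anything except the closing
    # tag", which is what the character class was meant to do.
    $text =~ s{<heading>((?:(?!</heading>).)*?)level=(\d),\s*((?:(?!</heading>).)*?)</heading>}
              {<heading$2>$1$3</heading$2>}gis;

    return $text;
}

print "characters the class rejects: @rejected\n";
print merge_level('<heading level=2> Introduction to Arguments</heading>'), "\n";
print merge_level('<heading> <index primary-key="rest arguments"/> level=3, Specifying Rest Arguments</heading>'), "\n";
```

        The lookahead is checked before every character, so each substitution stays inside a single heading element. Anything beyond these two shapes (comments, CDATA, multi-digit levels, attributes in another order) is outside what this sketch handles.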

Re: Regex for XML attributes...
by mirod (Canon) on Aug 23, 2001 at 23:57 UTC

    Your logic goes astray because you try to parse XML with regexps.

    You shouldn't.

    Please read On XML parsing for an (incomplete) list of reasons not to.

    Actually the problem seems even worse than that: it looks like you are calling XML something that is not XML. level=2 is not a valid attribute in XML; you need to quote the value.

    I can't really test much right now due to Antonio keeping me busy, but something like this using XML::Twig would certainly work:

    #!/usr/bin/perl -w
    use strict;
    use XML::Twig;

    my $t = XML::Twig->new(
        twig_roots               => { heading => \&heading },
        twig_print_outside_roots => 1,
        pretty_print             => 'indented',
    );
    $t->parse( \*DATA );

    sub heading {
        my ( $t, $heading ) = @_;

        if ( defined $heading->att('level') ) {
            # level given as an attribute: move it into the tag name
            $heading->set_gi( 'heading' . $heading->att('level') );
            $heading->del_att('level');
        }
        elsif ( my $pcdata = $heading->first_child('#PCDATA') ) {
            # level given as "level=N," at the start of the text
            my $text = $pcdata->pcdata;
            if ( $text =~ s/^\s*level\s*=\s*(\d+),//s ) {
                $pcdata->set_pcdata($text);
                $heading->set_gi( 'heading' . $1 );
            }
        }
        $heading->print;
    }

    __DATA__
    <doc>
      <heading> <index/> <index/> level=3, Specifying Rest Arguments in a Procedure Definition</heading>
      <heading level="2"> Introduction to Arguments</heading>
    </doc>

    And finally, if you really want to use regexps, you can have a look at XML::Regexp or at XML::Parser::Lite.