mscharrer has asked for the wisdom of the Perl Monks concerning the following question:

Dear fellow Monks, I have an XHTML file which holds several div tags as shown below. The first group of divs belongs to class A and the following group to class B; each div is marked with a class attribute. I would like to parse this part of the file with XML::Simple so that I get a hash with two array references, A and B, which hold the references to the div content as shown at the end of this post.

I went through the XML::Simple documentation but I didn't find anything suitable. The 'KeyAttr' option does create the wanted hash structure, but only for one class A div and one class B div, because the class isn't a unique key here.

Could anyone with more XP with XML::Simple point out the right option(s)? Or maybe another XML module must be used for that.

Thanks in advance, and sorry if it's really simple and I just overlooked it.

My simple test script so far is:

#!/usr/bin/perl
use strict;
use warnings;
use XML::Simple;

my $filename = shift or die "Usage: decode <filename>\n";
my $xmlin;
open ($xmlin, '<', $filename) or die "Error: Can't open input file!\n";

my $ref = XMLin(
    $xmlin,
    # TODO: Add correct option
);

use Data::Dumper;
print Dumper $ref;
__END__

The XHTML content:

<div id="main">
  <div class="A"> ... </div>
  <div class="A"> ... </div>
  <div class="A"> ... </div>
  <div class="B"> ... </div>
  <div class="B"> ... </div>
  <div class="B"> ... </div>
</div>

The hash structure I want:

$VAR1 = {
    'div' => {
        'A' => [
            { .. },
            { .. },
            { .. },
        ],
        'B' => [
            { .. },
            { .. },
            { .. },
        ]
    }
};

Re: XML::Simple - Make arrays out of class attribute
by ikegami (Patriarch) on Sep 14, 2008 at 18:14 UTC

    XML::Simple is a parser. It returns what is there. If you want to perform transformations, you'll have to do them yourself.

    use Data::Dumper;

    my $parent = {
        div => [
            { class => 'A', content => '1' },
            { class => 'A', content => '2' },
            { class => 'A', content => '3' },
            { class => 'B', content => '4' },
            { class => 'B', content => '5' },
            { class => 'B', content => '6' },
        ],
    };

    my $grouped = {};
    for ( @{ $parent->{div} } ) {
        my $class = delete $_->{class};
        push @{ $grouped->{$class} }, $_;
    }
    $parent->{div} = $grouped;

    print(Dumper($parent));

    By the way, the need to perform such transformations could indicate that your XML's schema incorrectly represents your data.

      OOPS

      After posting this I realized it clobbers the duplicates, which I didn't see at first.

      This is possible with the KeyAttr option:

      #!/usr/bin/perl
      use Data::Dumper;
      use XML::Simple;
      use strict;
      use warnings;

      my $filename = shift or die "Usage: decode <filename>\n";

      # dies on its own on error
      my $xmldata = XMLin($filename, KeyAttr => { 'div' => 'class' });

      print Dumper $xmldata;
      __END__
        Your code doesn't work. It deletes 4 of the 6 elements.
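To see why elements go missing: folding on class produces a hash keyed by the class value, and a hash key can hold only one entry, so each later div with the same class replaces the earlier one. The following is a minimal plain-Perl sketch of that effect, not XML::Simple's actual implementation; the sample data and the names @divs and %folded are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample data: six divs, three per class.
my @divs = (
    { class => 'A', content => '1' },
    { class => 'A', content => '2' },
    { class => 'A', content => '3' },
    { class => 'B', content => '4' },
    { class => 'B', content => '5' },
    { class => 'B', content => '6' },
);

# Fold the list into a hash keyed on the class value.
# Assigning to a key that already exists replaces the previous entry.
my %folded;
$folded{ $_->{class} } = $_ for @divs;

printf "%d of %d elements survive\n", scalar(keys %folded), scalar(@divs);
```

With two distinct class values, only two entries remain, which matches the four lost elements reported above.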
      Thanks ikegami for your answer.
      I was trying to avoid doing the transformation myself. There are already a lot of XML::Simple options which change the structure of the returned hash, i.e. perform some form of transformation, so I thought there might be one for this case as well. For example, if the class were unique for every div, the 'KeyAttr' option would do exactly what I want.

      You are right that the XML code doesn't represent the data correctly. That's because it's the XHTML of a webpage over which I have no influence. It was written for display by a web browser, not as an XML database.

      Thanks for your code. I'm starting to think that I will need it.

Re: XML::Simple - Make arrays out of class attribute
by Jenda (Abbot) on Sep 16, 2008 at 10:18 UTC

    If you want to have more control over the generated data structure, try XML::Rules. In this case something like:

    use strict;
    use XML::Rules;

    my $parser = XML::Rules->new(
        normalisespaces => 1,
        rules => {
            div => sub {
                if (exists $_[1]->{class}) {
                    my $class = delete $_[1]->{class};
                    return '@'.$class => $_[1];
                } else {
                    return $_[0] => $_[1];
                }
            }
        },
    );

    my $data = $parser->parse(\*DATA);

    use Data::Dumper;
    print Dumper($data);
    __DATA__
    <div id="main">
      <div class="A"> ... </div>
      <div class="A"> ... </div>
      <div class="A"> ... </div>
      <div class="B"> ... </div>
      <div class="B"> ... </div>
      <div class="B"> ... </div>
    </div>
Re: XML::Simple - Make arrays out of class attribute
by mtths (Initiate) on Sep 15, 2008 at 13:03 UTC
    You can get somewhat near the hash structure you want with
    my $xmldata = XMLin($filename, KeyAttr => [ ]);
    That produces something like:
    $VAR1 = {
        'div' => [
            { 'content' => 'data1', 'class' => 'A' },
            { 'content' => 'data2', 'class' => 'A' },
            { 'content' => 'data1', 'class' => 'B' },
            { 'content' => 'data2', 'class' => 'B' }
        ]
    };
    which you can easily transform into the data structure of your preference.
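Putting the pieces of this thread together, the parse step and the grouping step can be combined into one script. This is only a sketch under assumptions: the inline sample XML and the names $xml and %grouped are invented for illustration, ForceArray => ['div'] is added so that a lone child div still comes back as an array, and divs without a class attribute are simply skipped.

```perl
use strict;
use warnings;
use XML::Simple;
use Data::Dumper;

# Hypothetical sample input, modelled on the XHTML in the question.
my $xml = <<'END_XML';
<div id="main">
  <div class="A">one</div>
  <div class="A">two</div>
  <div class="A">three</div>
  <div class="B">four</div>
  <div class="B">five</div>
  <div class="B">six</div>
</div>
END_XML

# KeyAttr => []         : no folding on any attribute, child divs stay a list
# ForceArray => ['div'] : a single child div is still returned as an array
my $data = XMLin($xml, KeyAttr => [], ForceArray => ['div']);

# Group the child divs by class. Work on a shallow copy of each div so the
# class key can be removed; divs without a class are skipped (assumption).
my %grouped;
for my $div (@{ $data->{div} || [] }) {
    next unless ref $div eq 'HASH' && defined $div->{class};
    my %copy  = %$div;
    my $class = delete $copy{class};
    push @{ $grouped{$class} }, \%copy;
}
$data->{div} = \%grouped;

$Data::Dumper::Sortkeys = 1;
print Dumper($data);
```

The grouping loop is the same idea as ikegami's transformation above, applied directly to what XMLin returns.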
      Thanks mtths,
      that's a good start. It might be the farthest I can get with XML::Simple options. I think this makes the needed transformation easier.
        Easier how? KeyAttr => [] is just like the default in this case. This is exactly the input on which my transformation code works.