Parsing XML that contains HTML

ruhk has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks! In an earlier thread a monk recommended that I try out XML::Twig for parsing an Atom feed, and for the most part, it works out great. The code I have so far:

use strict;
use warnings;
use LWP::Simple;    # get() is assumed to come from LWP::Simple
use XML::Twig;

my ( @divs, @headlines, @links );

my $content = get("foo/bar.xml");
die "Couldn't get it!" unless defined $content;

my $t = XML::Twig->new(
    # the twig will include just the root and the selected elements
    twig_roots => {
        'summary/div' => \&grabDiv,
        'title'       => \&grabTitle,
        'entry/link'  => \&grabLinks,
    },
);
$t->parse($content);

sub grabDiv {
    my ( $t, $elt ) = @_;
    push @divs, $elt->text;    # collect the text (including sub-element texts)
    $t->purge;                 # frees the memory
}

sub grabTitle {
    my ( $t, $elt ) = @_;
    push @headlines, $elt->text;    # collect the text (including sub-element texts)
    $t->purge;                      # frees the memory
}

sub grabLinks {
    my ( $t, $elt ) = @_;
    my $thingey = $elt->{'href'};
    push @links, $elt->text;    # collect the text (including sub-element texts)
    $t->purge;                  # frees the memory
}

This works great, except for the fact that the links in an atom feed are stored like this:

<link href="foo/bar.html" rel="alternate" title="Foo" type="text/html"/>

I've sat here furiously staring at the screen for some time now, but I can't seem to find the answer. Is there a way XML::Twig can do this, or will I have to come up with some kind of dirty regex?


Replies are listed 'Best First'.
Re: Parsing XML that contains HTML by iburrell (Chaplain) on Jul 21, 2004 at 17:13 UTC
What do you want to do? From the sample you showed, the XML doesn't contain HTML. It contains a <link> element with an href attribute that points to a URL that is HTML. I don't know the Atom spec well enough to know if the link element can have content. What do you want to do with the links? Do you want the URL from the href, the link text, or the HTML file pointed to? To get the URL from the href attribute, I think it is: `my $url = $elt->att('href');` `$elt->text` is the right way to get the text for the link, which is empty in this case. For the HTML page pointed to, you will need to fetch it with LWP.
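As a rough illustration of the att('href') suggestion, here is a minimal sketch. The inline feed, the handler name grab_link, and the second entry are made up for the example; a real Atom feed would be fetched first and would carry more elements.

```perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical sample data standing in for a fetched Atom feed.
my $xml = <<'XML';
<feed>
  <entry>
    <title>Foo</title>
    <link href="foo/bar.html" rel="alternate" title="Foo" type="text/html"/>
  </entry>
  <entry>
    <title>Baz</title>
    <link href="foo/baz.html" rel="alternate" title="Baz" type="text/html"/>
  </entry>
</feed>
XML

my @links;

my $twig = XML::Twig->new(
    # only build twigs for <link> elements directly under <entry>
    twig_roots => { 'entry/link' => \&grab_link },
);
$twig->parse($xml);

sub grab_link {
    my ( $t, $elt ) = @_;
    my $url = $elt->att('href');    # the attribute value, not the element text
    push @links, $url if defined $url;
    $t->purge;                      # frees the memory
}

print "$_\n" for @links;
```

The only change from the handler in the question is reading the attribute with att() instead of pushing $elt->text, which is empty for a self-closing <link/>.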
Re^2: Parsing XML that contains HTML by ruhk (Scribe) on Jul 21, 2004 at 17:23 UTC
Ah yes, that was exactly what I was looking for.
Re: Parsing XML that contains HTML by The Mad Hatter (Priest) on Jul 21, 2004 at 17:21 UTC
Is there a reason you want to parse it yourself rather than use the XML::Atom distribution (which can fetch, parse, and generate feeds)?
Re^2: Parsing XML that contains HTML by ruhk (Scribe) on Jul 21, 2004 at 18:08 UTC
I wanted to use that; however, I can't seem to get it installed. Basically, all I am doing is parsing the feed, ripping out the headlines, links, and titles, and building a JavaScript file with that information once every 30 minutes or so.
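One possible way to do the "build a JavaScript file" step, sketched under several assumptions: the arrays below are hypothetical stand-ins for what the twig handlers collected, the variable name feedItems and the output path feed.js are placeholders, and JSON::PP is a choice made for this example rather than anything the thread specifies.

```perl
use strict;
use warnings;
use JSON::PP;

# Hypothetical data standing in for what the twig handlers collected.
my @headlines = ( 'Foo', 'Baz' );
my @links     = ( 'foo/bar.html', 'foo/baz.html' );

# Pair each headline with its link; +{ } forces an anonymous hash.
my @items = map { +{ title => $headlines[$_], href => $links[$_] } }
            0 .. $#headlines;

# Let the JSON encoder handle quoting and escaping of the strings.
my $json = JSON::PP->new->canonical->encode( \@items );
my $js   = "var feedItems = $json;\n";

# 'feed.js' is a placeholder output path.
open my $fh, '>', 'feed.js' or die "Can't write feed.js: $!";
print {$fh} $js;
close $fh or die "Can't close feed.js: $!";
```

Encoding through a JSON module avoids hand-escaping quotes in headlines. Running the script every 30 minutes would be left to a scheduler such as cron.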