adamobrien has asked for the wisdom of the Perl Monks concerning the following question:
I want to render some elements of an HTML page into wiki format. I'm aware that there is an HTML-to-wiki module, but it doesn't work for the complex pages I will be using.
Initially, for debugging, I am just marking the text with the relevant labels (p, Link etc).
My problem is that HTML elements that are 'embedded' in other elements don't render in the 'correct' order. Let me give an example. Consider the following HTML file, test.htm:
<html>
<body>
<p>This is an introduction.</p>
<a href="http://www.google.com">Google</a>
<p>The <a href="http://www.bbc.co.uk">BBC</a>, says,...</p>
</body>
</html>
My script, below, renders the first 2 (substantive) lines correctly. But it doesn't work for the 3rd (substantive) line, where an <a> tag is embedded in a <p> tag. Instead of rendering:
"p The Link: http://www.bbc.co.uk Link Text: BBC p says,...".
It renders
"p The BBC says,...
Link: http://www.bbc.co.uk Link Text: BBC"
How can I get the link and the text to render in my desired order?
I'm afraid I'm quite committed to using HTML::Tree, so I would prefer solutions that use this module.
Here's the script. Thanks in advance for any help.

#!/usr/bin/perl -w
use HTML::Tree;
use LWP::Simple;
use strict;

my $tree = HTML::TreeBuilder->new();
$tree->parse_file("test.htm");

#Main processing.
sub wiki_render {
    my $element = $_[0];
    if ($element->tag eq 'p') {
        print $element->as_text;
        print "\n";
    }
    elsif ($element->tag eq 'a') {
        print "Link: ";
        print $element->attr('href');
        print " Link Text: ";
        print $element->as_text;
        print "\n";
    }
    # This recursion drives the 'loop' that ensures all elements of the HTML Tree are processed.
    foreach my $child ($element->content_list) {
        next unless (ref $child);
        wiki_render($child);
    }
}

wiki_render($tree);
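One likely cause of the ordering is that as_text on the <p> element flattens the text of all its descendants, including the nested <a>, before the recursion ever reaches that link. A minimal sketch of one possible approach, still using HTML::Tree, is to walk content_list in document order and emit text nodes (which are plain strings, not references) as they are encountered. The inline test document, the returned-string design, and the exact spacing of the labels are assumptions made for illustration, so the output only roughly follows the desired format.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Inline copy of test.htm so the sketch is self-contained (an assumption;
# the original script reads the file with parse_file).
my $html = <<'HTML';
<html> <body>
<p>This is an introduction.</p>
<a href="http://www.google.com">Google</a>
<p>The <a href="http://www.bbc.co.uk">BBC</a>, says,...</p>
</body> </html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Returns the rendered text for a node, visiting children in document order.
sub wiki_render {
    my ($node) = @_;

    # Text nodes in content_list are plain strings, not objects.
    return $node unless ref $node;

    my $tag = $node->tag;
    if ($tag eq 'a') {
        # Emit the link where it occurs; do not descend further,
        # because as_text already covers the link's own text.
        my $href = $node->attr('href') // '';
        return "Link: $href Link Text: " . $node->as_text . ' ';
    }

    # For every other element, render the children in order.
    my $inner = join '', map { wiki_render($_) } $node->content_list;
    return $tag eq 'p' ? "p $inner\n" : $inner;
}

print wiki_render($tree);
```

With this structure the link inside the second paragraph is rendered between "The" and ", says,..." instead of after the whole paragraph, because the <p> branch no longer calls as_text on itself.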
Re: HTML::Tree - I can't extract element in the 'correct' order
by choroba (Cardinal) on Oct 13, 2012 at 13:13 UTC