
Dear Monks,
I am parsing a number of HTML pages (more than 400) that all share the structure shown in the __DATA__ section of the script below.
The relevant part of each page begins with <div id="bodyContent">, so I included only that part in the script.
What I need is the text between consecutive <h2> tags.
I used HTML::TreeBuilder::XPath, but I could not work out how to express an intersection there (e.g. the nodes that follow <h2>[1] and precede <h2>[2] at the same time).
As a workaround, I take the preceding siblings of each <h2>[i] in turn, stringify the result, and use substr to cut off the text already collected for the earlier sections.
This works (after some cleanup), but the code is no fun to look at.
Could you give me a hint on how to improve it?
Thank you!
VE

#!/usr/bin/perl
use strict;
use warnings;

use LWP::Simple;
use HTML::TreeBuilder::XPath;

my $page;
$page .= $_ while <DATA>;
my $p = HTML::TreeBuilder::XPath->new_from_content( $page );

my @page_content =$p->findnodes( '//div[@id="bodyContent"]' );

for my $content ( @page_content )
{
    my @preface = $content->findvalues( './h2[1]/preceding-sibling::*' );
        my $preface_text;
        my ( $keyword, $actualised );
        for my $pref ( @preface )
        {
            # $pref =~ s/^\s*(\S+)/$1/;
            $preface_text .= $pref;
            
            # print $preface_text, "--\n";
            
            ( undef, $keyword ) = split /:\s*?/, $pref, 2 if $pref =~ /^\s*?Key words/;
            ( undef, $actualised ) = split /:\s*?/, $pref, 2 if $pref =~ /^Actualised/;
        }
        print $keyword, "\n";
        print $actualised, "\n";
        
        my @problems = $content->findvalues( './h2[2]/preceding-sibling::*' );
        my $probl;
        $probl .= $_ for @problems;
        $probl = substr( $probl, length( $preface_text) );
        
        print $probl, "\n";
        
        my @solution_1 = $content->findvalues( './h2[3]/preceding-sibling::*' );
        my $sol;
        $sol .= $_ for @solution_1;
        $sol = substr( $sol, length( $preface_text ) + length( $probl ) );
        print $sol, "\n";
        
        my @solution_2 = $content->findvalues( './h2[4]/preceding-sibling::*' );
        my $sol_2;
        $sol_2 .= $_ for @solution_2;
        $sol_2 = substr( $sol_2, length( $preface_text ) + length( $probl ) + length( $sol ) );
        print $sol_2 , "\n";
        
}

__DATA__
<head>
</head>
<body>
<div id="bodyContent">
        <!-- start content -->
<p>Key words: Some words. 
</p><p>Date:  2012-01-16
</p><p>Actualised: 2008-01-08 
</p><p>Commented: 05.06.2007
</p><p>Encoded: Some code.  
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Solution2"> Solution 2 </span></h2>
<p>Solution description.
</p>
<h2> <span class="mw-headline" id="Comment"> Comment. </span></h2>
<p>Text of the comment.
</p><p><br />
</p>
</div>
<hr />
</body>
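One way the intersection asked about above might be expressed in XPath 1.0 is to count preceding <h2> siblings instead of combining two axes: every non-<h2> child of the div that has exactly N preceding <h2> siblings belongs to section N (N = 0 being the preface). The following is only a minimal sketch under that assumption. The helper name section_text and the shortened sample HTML are made up for illustration, and it assumes that the XPath engine behind HTML::TreeBuilder::XPath supports count(), not() and the self axis as XPath 1.0 specifies.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder::XPath;

# Hypothetical helper (not part of any module): return the concatenated text
# of all non-<h2> children of $content that have exactly $n preceding <h2>
# siblings. $n = 0 selects the preface, $n = 1 the first section, and so on.
sub section_text {
    my ( $content, $n ) = @_;
    my $xpath = sprintf
        './*[not(self::h2)][count(preceding-sibling::h2) = %d]', $n;
    return join '', $content->findvalues($xpath);
}

# Shortened sample modelled on the __DATA__ section of the script above.
my $html = <<'HTML';
<body>
<div id="bodyContent">
<p>Key words: Some words.
</p><p>Actualised: 2008-01-08
</p>
<h2> <span class="mw-headline" id="Problem"> Problem </span></h2>
<p>Problem description.
</p><p>Another description.
</p>
<h2> <span class="mw-headline" id="Solution1"> Solution 1 </span></h2>
<p>Solution description.
</p>
</div>
</body>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my ($content) = $tree->findnodes('//div[@id="bodyContent"]');

# One entry per <h2>, in document order.
my @headings = $content->findvalues('./h2');

print "Preface:\n", section_text( $content, 0 ), "\n\n";
for my $i ( 1 .. @headings ) {
    ( my $title = $headings[ $i - 1 ] ) =~ s/^\s+|\s+$//g;
    print "$title:\n", section_text( $content, $i ), "\n\n";
}
```

If this works as assumed, the substr bookkeeping is no longer needed, because each section is selected directly instead of being derived from the lengths of the sections before it.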

In reply to Extracting HTML content between the h tags by vagabonding electron
