Cursed Chico has asked for the wisdom of the Perl Monks concerning the following question:

Hello. You can see the HTML page here (it was pasted there): http://forum.vingrad.ru/act-Print/client/printer/f-5/t-326992.html It starts with <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">. The other text on the page is in Russian, but the HTML text is in English. I need to extract the text after the line

Paper ID - Title (# Reviewers)

until

Select Reviewer(s):

I did this:

    use strict;
    use warnings;
    use English qw(-no_match_vars);
    use Carp;
    use HTML::TreeBuilder;

    my $f = 'index.html';
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($f);
    my @options = $tree->find('OPTION');
    foreach (@options) {
        print $_->as_text, "\n";
    }
    $tree->delete;    # clear memory
    sleep(131);

but it only parses all of the text. I want only specific text. Please help me.

Replies are listed 'Best First'.
Re: How to extract text between two tags?
by Anonymous Monk on May 28, 2015 at 22:42 UTC
      Can you please give the exact code? I did this, it did not work.
          my $document = do {
              local $/ = undef;
              open my $all, "<", $file or die "could not open $file: $!";
              <$all>;
          };
          my $p0 = index($all, "Paper ID Title");
          if ($p0 > -1) {
              my $p1 = index($all, "\>", $p0);
              if ($p1 > -1) {
                  my $p2 = index($all, "Select Reviewer(s):", $p1);
                  if ($p2 > -1) {
                      my $target = substr($all, $p1, ($p2 - $p1));
                  }
              }
          }
          print "$all";
          sleep(22);

        Can you please give the exact code? I did this, it did not work.

        I already did ... but ok

            use HTML::TreeBuilder::XPath;

            my $tree = HTML::TreeBuilder::XPath->new;
            $tree->parse_file( 'foohtml' );
            my($form)  = $tree->findnodes( '//form[1]' );
            my($dlddp) = $form->findnodes( './dl[1]/dd[1]/p[1]' );
            print $dlddp->as_text, "\n\n";
            __END__
            [ Paper ID - Title (# Reviewers) ]
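For comparison, the index/substr attempt quoted earlier in this subthread can be made to work without a parser. That attempt slurps the file into $document but then searches $all (the filehandle variable scoped to the do block), and its start marker "Paper ID Title" omits the " - " that appears in the actual line. Below is a minimal sketch with those two points addressed. The sample string and the helper name between() are made up for illustration; the real page is much larger and may be laid out differently.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the slurped page; not the real content.
my $document = <<'HTML';
<p>[ Paper ID - Title (# Reviewers) ]</p>
<select name="papers">
<option>12 - First paper (2)</option>
<option>15 - Second paper (0)</option>
</select>
<p>Select Reviewer(s):</p>
HTML

# Return the text between two literal markers,
# or undef if either marker is missing.
sub between {
    my ($text, $start, $end) = @_;
    my $p0 = index($text, $start);
    return unless $p0 > -1;
    $p0 += length $start;               # skip past the start marker itself
    my $p1 = index($text, $end, $p0);   # look for the end marker after it
    return unless $p1 > -1;
    return substr($text, $p0, $p1 - $p0);
}

my $target = between($document,
    'Paper ID - Title (# Reviewers)', 'Select Reviewer(s):');
print defined $target ? $target : "markers not found\n";
```

The extracted span still contains whatever markup sits between the two markers, so this only narrows the input; a parser is still the more robust way to pull out the individual entries.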
Re: How to extract text between two tags?
by GotToBTru (Prior) on May 28, 2015 at 20:08 UTC

    I would use the debugger to inspect the value of $tree and make sure it's what you expect.

    Dum Spiro Spero
Re: How to extract text between two tags?
by FreeBeerReekingMonk (Deacon) on May 28, 2015 at 22:14 UTC

    You should use an XML parser; however, after seeing that the page does not even have a </body> tag, here is a one-liner I use often:
     cat foo.html |perl -ne 'print if /Paper ID/ .. /\/SELECT/'


    You also need to unescape &lt; back to text; see "Unescape characters from XML::Twig".

    curl -s http://forum.vingrad.ru/act-Print/client/printer/f-5/t-326992.html |perl -pe 's{<br />}{\n}g' |perl -ne 'print if /Paper ID/ .. /\/SELECT/'
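The one-liners above rely on Perl's range operator acting as a flip-flop in scalar context: it becomes true on the line matching the left pattern and stays true through the line matching the right pattern. Here is the same idea as a small script, as a sketch only. The sample lines are invented, and the three-entity unescape is a deliberate simplification rather than a full entity decoder.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the real page differs.
my @lines = (
    "<p>header</p>\n",
    "[ Paper ID - Title (# Reviewers) ]<br />\n",
    "&lt;OPTION value=1&gt;12 - First paper (2)\n",
    "&lt;/SELECT&gt;\n",
    "Select Reviewer(s):\n",
);

my @kept;
for (@lines) {
    # Flip-flop: true from the "Paper ID" line through the "/SELECT" line.
    next unless /Paper ID/ .. m{/SELECT};

    my $out = $_;                # work on a copy, leave @lines untouched
    $out =~ s{<br />}{\n}g;      # turn <br /> into a newline, as in the pipeline

    # Minimal unescape; &amp; goes last so "&amp;lt;" ends up as "&lt;".
    $out =~ s/&lt;/</g;
    $out =~ s/&gt;/>/g;
    $out =~ s/&amp;/&/g;

    push @kept, $out;
}
print @kept;
```

For real input, a dedicated entity decoder would cover far more than these three entities.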

    I am surprised the browser can handle and display that webpage...

      Many will complain that you should use an xml parser, however

      You don't need an XML parser to parse HTML; HTML::TreeBuilder will do just fine.

        Not only will HTML::TreeBuilder do fine, but if it's an HTML file, an XML parser is likely to die quickly on it. XML parsers are required to fail on invalid XML, while HTML parsers are allowed to be more forgiving (e.g. HTML::TreeBuilder defaults to inserting implicit end tags where an XML parser would quit).