How to extract text between two tags?

Cursed Chico has asked for the wisdom of the Perl Monks concerning the following question:

Hello. You can see the HTML page here, where it was pasted: http://forum.vingrad.ru/act-Print/client/printer/f-5/t-326992.html It starts with `<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">`. The other text on that page is in Russian, but the HTML text is in English. I need to extract the text after the line

Paper ID - Title (# Reviewers)

until

Select Reviewer(s):

I did this:

    use strict;
    use warnings;
    use English qw(-no_match_vars);
    use Carp;
    use HTML::TreeBuilder;

    my $f = 'index.html';
    my $tree = HTML::TreeBuilder->new;
    $tree->parse_file($f);
    my @options = $tree->find('OPTION');
    foreach (@options) {
        print $_->as_text, "\n";
    }
    $tree->delete;    # clear memory
    sleep(131);

but it extracts all of the text. I want only the specific text. Please help me.


Replies are listed 'Best First'.
Re: How to extract text between two tags? by Anonymous Monk on May 28, 2015 at 22:42 UTC
Use htmltreexpather.pl or xpather.pl (see also the examples for tree-xpath and others, walkthroughs, tutorials ...). They'll give you paths you can use to reach `[ Paper ID - Title (# Reviewers) ]`. Paths like these:

    /html/body/div[4]/form/dl/dd/p
    /html[1]/body[1]/div[4]/form[1]/dl[1]/dd[1]/p[1]
    //*[ name() = "form" and position() = 1 and @action = "/openconf/chair/assign_reviews.php" and @method = "post" ]/dl[1]/dd[1]/p[1]

These paths are easy to use with HTML::TreeBuilder::XPath or XML::LibXML. They can help you visualize the HTML even if you choose to stick with TreeBuilder's look_down.
Re^2: How to extract text between two tags? by Cursed Chico (Initiate) on May 30, 2015 at 16:23 UTC
Can you please give the exact code? I did this, and it did not work:

    my $document = do {
        local $/ = undef;
        open my $all, "<", $file or die "could not open $file: $!";
        <$all>;
    };
    my $p0 = index($all, "Paper ID Title");
    if ($p0 > -1) {
        my $p1 = index($all, "\>", $p0);
        if ($p1 > -1) {
            my $p2 = index($all, "Select Reviewer(s):", $p1);
            if ($p2 > -1) {
                my $target = substr($all, $p1, ($p2 - $p1));
            }
        }
    }
    print "$all";
    sleep(22);
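A few things in the attempt above probably explain why it did not work: the searches run against `$all`, which is the filehandle and is not in scope outside the `do` block, rather than against `$document`, which holds the file contents; the search string `"Paper ID Title"` lacks the dash that appears in the page text; and `$target` is declared inside the innermost block and never printed. Below is a minimal sketch of the same index/substr idea with those points addressed. The helper name `text_between` and the sample markup are hypothetical stand-ins, not taken from the real page, and the result still contains whatever tags sit between the two markers.

```perl
use strict;
use warnings;

# Return the text between the first occurrence of $start and the next
# occurrence of $end, or undef (in scalar context) if either is missing.
sub text_between {
    my ($document, $start, $end) = @_;

    my $p0 = index($document, $start);
    return if $p0 < 0;

    my $from = $p0 + length $start;              # skip past the start marker
    my $p1   = index($document, $end, $from);    # search only after it
    return if $p1 < 0;

    return substr($document, $from, $p1 - $from);
}

# Hypothetical stand-in for the saved page; the real markup may differ.
my $document = <<'HTML';
<p>[ Paper ID - Title (# Reviewers) ]</p>
<select><option>1 - First paper (2)</option>
<option>2 - Second paper (0)</option></select>
<p>Select Reviewer(s):</p>
HTML

my $target = text_between(
    $document,
    'Paper ID - Title (# Reviewers)',
    'Select Reviewer(s):',
);
print defined $target ? "$target\n" : "markers not found\n";
```

For a real page, `$document` would be filled by slurping the file as in the original attempt. String searching like this is fragile against markup changes, which is why the replies in this thread lean toward an HTML parser.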
Re^3: How to extract text between two tags? by Anonymous Monk on May 30, 2015 at 17:10 UTC
"Can you please give the exact code? i did this, it did not work"

I already did ... but ok:

    use HTML::TreeBuilder::XPath;
    my $tree = HTML::TreeBuilder::XPath->new;
    $tree->parse_file( 'foohtml' );
    my($form) = $tree->findnodes( '//form[1]' );
    my($dlddp)= $form->findnodes( './dl[1]/dd[1]/p[1]' );
    print $dlddp->as_text, "\n\n";
    __END__
    [ Paper ID - Title (# Reviewers) ]
Re: How to extract text between two tags? by GotToBTru (Prior) on May 28, 2015 at 20:08 UTC
I would use the debugger to inspect the value of `$tree` and make sure it's what you expect.

Dum Spiro Spero
Re: How to extract text between two tags? by FreeBeerReekingMonk (Deacon) on May 28, 2015 at 22:14 UTC
You should use an XML parser; however, after seeing that the page does not even have a </body> tag, here is a one-liner I use often:

    cat foo.html | perl -ne 'print if /Paper ID/ .. /\/SELECT/'

You also need to unescape `&lt;` back to text; see "Unescape characters from XML::Twig".

    curl -s http://forum.vingrad.ru/act-Print/client/printer/f-5/t-326992.html | perl -pe 's{<br />}{\n}g' | perl -ne 'print if /Paper ID/ .. /\/SELECT/'

I am surprised the browser can handle and display that webpage...
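For readers unfamiliar with the `..` range operator in that one-liner: in scalar context it turns true on the first line matching the left pattern and stays true through the next line matching the right pattern, inclusive. The sketch below spells out that behaviour with an explicit flag (so no state carries over between calls) and adds a hand-rolled unescape step for a handful of entities. The names `lines_between` and `unescape_basic`, the entity table, and the sample lines are all hypothetical; for real pages, `decode_entities` from HTML::Entities is the more complete choice.

```perl
use strict;
use warnings;

# Keep lines from the first match of $start through the next match of $end,
# inclusive, mirroring `print if /start/ .. /end/`.
sub lines_between {
    my ($start, $end, @lines) = @_;
    my $inside = 0;
    my @kept;
    for my $line (@lines) {
        $inside = 1 if !$inside && $line =~ $start;
        next unless $inside;
        push @kept, $line;
        $inside = 0 if $line =~ $end;    # the closing line is still kept
    }
    return @kept;
}

# Unescape a few common entities in a single pass, so that "&amp;lt;"
# becomes "&lt;" rather than "<".
my %ENTITY = (lt => '<', gt => '>', amp => '&', quot => '"', '#39' => "'");
sub unescape_basic {
    my ($text) = @_;
    $text =~ s/&(lt|gt|amp|quot|#39);/$ENTITY{$1}/g;
    return $text;
}

# Hypothetical stand-in for the page, one line per element.
my @page = (
    "<p>intro</p>\n",
    "<p>[ Paper ID - Title (# Reviewers) ]</p>\n",
    "<SELECT><OPTION>1 - A &lt;draft&gt; paper (2)</OPTION>\n",
    "</SELECT>\n",
    "<p>Select Reviewer(s):</p>\n",
);

print unescape_basic($_) for lines_between(qr/Paper ID/, qr{/SELECT}, @page);
```

Like the one-liner, this works line by line, so it depends on the markers falling on separate lines of the saved file.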
Re^2: How to extract text between two tags? by Anonymous Monk on May 28, 2015 at 22:31 UTC
"Many will complain that you should use an xml parser, however"

You don't need an XML parser to parse HTML; HTML::TreeBuilder will do just fine.
Re^3: How to extract text between two tags? by bitingduck (Deacon) on May 28, 2015 at 22:46 UTC
Not only will HTML::TreeBuilder do fine, but if it's an HTML file, an XML parser is likely to die quickly on it. XML parsers are required to fail on input that is not well-formed XML, while HTML parsers are allowed to be more forgiving (e.g. HTML::TreeBuilder defaults to inserting the implicit end tags whose absence would cause an XML parser to quit).
Re^4: How to extract text between two tags? by Anonymous Monk on May 28, 2015 at 23:19 UTC