Now suppose I need to write generic code that just searches for the keywords "publications" and "interests" within a given page. How do I rework the code?
That's no moon, that's a space station -- Obiwan Kenobi.
Extracting text based on a known pattern is easy if you know what the start and finish of a section look like in general. However, if you are looking for a generic algorithm for logical text extraction, you need to build a text-classification/pattern-recognition engine, and that's going to be very, very difficult. Difficult, but not impossible. But that's way beyond me; besides, I don't want to lose too many brain cells over this. ;-)
I will only cover the easy way, i.e., (deterministic) text extraction based on a set of known patterns...
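use strict;
use warnings;

# Build a hash of known patterns for each known web site.
# <!--KEY--> is a placeholder that gets replaced with the section name.
my %patterns = (
    'www.foo.com' => {
        start  => "<h3><font[^>]*><b><!--KEY--></b>",
        # a <br> that is not immediately preceded by "</font>\n",
        # i.e. not the <br> directly under the heading line
        finish => "(?<!</font>\n)<br>",
    },
    'www.bar.com' => {
        start  => "...",
        finish => "...",
    },
);

my $html = do { local $/; <DATA> };    # slurp the whole page

print ExtractSection($html, 'www.foo.com', 'Section 2'), "\n\n";
print ExtractSection($html, 'www.foo.com', 'Section 1'), "\n\n";
print ExtractSection($html, 'www.foo.com', 'Section 3'), "\n\n";

# -----------------------------------------------------
sub ExtractSection
{
    my ($html, $site, $section) = @_;
    my $site_patterns = $patterns{$site}
        or die "No patterns known for site '$site'\n";
    my $ps = $site_patterns->{start};
    my $pf = $site_patterns->{finish};
    # \Q...\E escapes regex metacharacters in the section name
    $ps =~ s/<!--KEY-->/\Q$section\E/;
    $pf =~ s/<!--KEY-->/\Q$section\E/;
    # non-greedy match from the start pattern to the nearest finish pattern
    my ($text) = $html =~ /($ps.*?$pf)/s;
    return $text;
}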
__DATA__
<HTML>
<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<li>Item 3
<br>
<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 4
<li>Item 5
<li>Item 6
<br>
<h3><font size=+1><b>Section 3</b></font>
<br>
<li>Item 7
<li>Item 8
<li>Item 9
<br>
</HTML>
And the output -
<h3><font size=+1><b>Section 2</b></font>
<br>
<li>Item 4
<li>Item 5
<li>Item 6
<br>
<h3><font size=+1><b>Section 1</b></font>
<br>
<li>Item 1
<li>Item 2
<li>Item 3
<br>
<h3><font size=+1><b>Section 3</b></font>
<br>
<li>Item 7
<li>Item 8
<li>Item 9
<br>
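
For the keyword case in the question ("publications", "interests"), here is a rough sketch of a looser variant. It is not the per-site pattern table above but a simple heuristic I am swapping in: find a heading line that mentions the keyword (case-insensitively) and take everything up to the next heading. The name extract_by_keyword and the sample page are made up for illustration, and it assumes the keyword sits on the same line as an <h1>..<h6> tag and that a section ends at the next heading or at </body>. Real pages will break these assumptions sooner or later.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the script above: return the block that
# starts at a heading line mentioning $keyword and ends just before the next
# heading (or </body>, or end of input). Returns undef if nothing matches.
sub extract_by_keyword {
    my ($html, $keyword) = @_;
    my $heading = qr/<h[1-6]\b[^>]*>/i;    # any <h1>..<h6> opening tag
    my ($text) = $html =~ /
        ($heading[^\n]*?\Q$keyword\E.*?)   # heading line with keyword, then body
        (?=$heading|<\/body>|\z)           # stop before the next heading or the end
    /isx;
    return $text;
}

# Made-up sample page, in the same old-style markup as the example above.
my $page = <<'HTML';
<html><body>
<h3><font size=+1><b>Research Interests</b></font>
<br>
<li>Parsing
<li>Text mining
<br>
<h3><font size=+1><b>Selected Publications</b></font>
<br>
<li>Paper A
<li>Paper B
<br>
</body></html>
HTML

for my $keyword (qw(publications interests)) {
    my $section = extract_by_keyword($page, $keyword);
    print defined($section) ? $section : "[$keyword: not found]\n";
    print "\n";
}
```

The keyword goes through \Q...\E so it is matched as literal text, and the function returns undef when no heading mentions it, so the caller can tell "not found" from an empty section. This is still pattern matching, not understanding: a page that puts "Publications" in a table cell or an image will defeat it.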