Tails has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone! I need some help with a program I am writing. It filters an HTML file using regexes, reports how many sentences are in the main text (including the title), and searches for sentences that match a pattern given on the command line. For example:

    perl web-scan.pl "fall|election|2009" WebPage011.htm

The program should then print output like this:

    Input file "WebPage011.htm" contains 55 sentences composed of 905 distinct words.
    3 sentences match the pattern "fall|election|2009". The sentences are:
    4: "We hate elections."
    16: "The dog was injured in a fall from the balcony."
    24: "There will be no 2009 fall election."

I have more or less filtered my web page file using regexes, and I have counted all the lines/sentences. What I don't know how to do is take the argument (in this case "fall|election|2009") and search the document for sentences containing these words while reporting the sentence number. This is my whole code so far:
#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;

print "$ARGV[0]\n";
my $first = $ARGV[0];
print "$first This is my First Argument!\n";

my $string = do { local $/; <> };
$string=~ s/[\n\r]//g;
$string=~ s/.*(<title>.*?<\/title>).*?(<body.*?<\/body>).*/$1,$2/gsi;
$string=~ s/<title>(.*?)<\/title>/$1/gsi;
$string=~ s/<body.*?>(.*?)<\/body>/$1/gsi;
$string=~ s/24&#176;//gsi;
$string=~ s/<!--.*?-->//gsi;
$string=~ s/<a.*?<\/a>//sgi;
$string=~ s/<form.*?<\/form>//sgi;
$string=~ s/<iframe.*?<\/iframe//sgi;
$string=~ s/<noscript.*?<\/noscript>//sgi;
$string=~ s/<script.*?<\/script>//sgi;
$string=~ s/<select .*?<\/select>//sgi;
$string=~ s/<textarea.*?<\/textarea>//sgi;
$string=~ s/<li.*?<\/li>//sgi;
$string=~ s/<IMG.*?>//gsi;
$string=~ s/<div.*?>//gsi;
$string=~ s/<\/div.*?>//gsi;
$string=~ s/<b.*?>|<\/b>//gsi;
$string=~ s/<h1.*?>|<\/h1>//gsi;
$string=~ s/<h2.*?>|<\/h2>//gsi;
$string=~ s/<h3.*?>|<\/h3>//gsi;
$string=~ s/<h4.*?>|<\/h4>//gsi;
$string=~ s/<h5.*?>|<\/h5>//gsi;
$string=~ s/<h6.*?>|<\/h6>//gsi;
$string=~ s/<head.*?>|<\/head>//gsi;
$string=~ s/<html.*?>|<\/html>//gsi;
$string=~ s/<li.*?>|<\/li>//gsi;
$string=~ s/<option.*?>|<\/option>//gsi;
$string=~ s/<script.*?>|<\/script>//gsi;
$string=~ s/<p.*?>|<\/p>//gsi;
$string=~ s/<span.*?>//gsi;
$string=~ s/<\/span.*?>//gsi;
$string=~ s/<\/ul.*?>//gsi;
$string=~ s/<ul.*?>//gsi;
$string=~ s/<hr.*//gsi;
$string=~ s/<input.*?>//gsi;
$string=~ s/[^\x{00}-\x{7E}]//gsi;
$string=~ s/&nbsp|&#160;/ /gsi;
$string=~ s/&#39;/'/gsi;
$string=~ s/&gt;/>/;
$string=~ s/&amp;/&/gsi;
$string=~ s/&lt;/</gsi;
$string=~ s/CClear//gsi;

my @list = split(/\s+/, $string);
my $word_count = $#list;
my @sentence = split (/\.|\?|\!/, $string);
print "@list\n";
print "There are $#sentence sentences in the list\n";
print "There are $#list words.\n";
I know it's pretty messy; I was going to clean up the regexes after I figured out how to search through the web pages.
my $count;
foreach (@sentence){
    $count++;
    if (@sentence=~ m/$first/gsi){
        print "Matched! at line $count\n";
        print "@sentence[10]\n";
    }
}
I was thinking of using something like this to count the lines and find out where the word is located, but to no avail. I also don't know how to write the match inside an if statement. Any and all information or direction would be highly appreciated. I've hit a cap for today's work with Perl lol. Thanks!

Replies are listed 'Best First'.
Re: Searching through a document and reporting results.
by GrandFather (Saint) on Jan 30, 2011 at 01:22 UTC

    You should investigate some of the CPAN modules for manipulating HTML such as HTML::TreeBuilder or HTML::Parser to make parsing your file much easier.

    If you are working with web pages then LWP and WWW::Mechanize are your friends.

    Note that in scalar context an array returns its number of elements so you can simply write my $count = @sentences;.
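
    As a minimal sketch of the difference (the sample sentences are made up for illustration): $#array is the index of the last element, so the $#sentence and $#list values printed in the question are one less than the real counts.

```perl
use strict;
use warnings;

# Hypothetical sample data, not taken from the original page.
my @sentences = ('We hate elections', 'The dog fell', 'No vote today');

my $count      = @sentences;     # scalar context: number of elements
my $last_index = $#sentences;    # index of the last element (count - 1)

print "There are $count sentences; the last index is $last_index\n";
```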

    True laziness is hard work
Re: Searching through a document and reporting results.
by mvaline (Friar) on Jan 30, 2011 at 05:40 UTC
    The approach that occurs to me would be to put your keyword tests into a subroutine and use the each function to test the sentences in a way that makes the index / sentence number easily accessible.
    # each on an array needs Perl 5.12 or later
    while (my ($key, $value) = each @sentence) {
        if (has_one_or_more_keywords($value)) {
            print "$key: $value\n";
        }
    }
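
    One possible shape for that keyword test, offered as a sketch rather than a finished solution. The helper name matching_sentences and the sample sentences are hypothetical. It uses a plain index loop instead of each, so it should also run on Perls older than 5.12. Note that the loop in the question matches against @sentence (the whole array) instead of the current element, which is why it never reports the right sentence.

```perl
use strict;
use warnings;

# Hypothetical helper: returns [number, text] pairs (1-based numbers) for
# every sentence that matches the user-supplied pattern.
sub matching_sentences {
    my ($pattern, @sentences) = @_;
    my $re = qr/$pattern/i;    # compile once; "fall|election|2009" is an alternation
    my @hits;
    for my $i (0 .. $#sentences) {
        # Test the individual sentence, not the array as a whole.
        push @hits, [ $i + 1, $sentences[$i] ] if $sentences[$i] =~ $re;
    }
    return @hits;
}

# Made-up sentences standing in for the output of the split in the question.
my @sentences = (
    'We hate elections',
    'The cat slept all day',
    'The dog was injured in a fall from the balcony',
);
my @hits = matching_sentences('fall|election|2009', @sentences);

printf "%d sentences match the pattern.\n", scalar @hits;
print "$_->[0]: \"$_->[1]\"\n" for @hits;
```

    The pattern matches substrings, so "fall" would also match "waterfall"; wrapping the pattern as \b(?:$pattern)\b is one way to restrict it to whole words if that matters.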
    I second the suggestion to consider an HTML parsing module. You may also want to consider replacing your sentence and word splits with a more sophisticated grammar for parsing sentences and words using a module like Parse::RecDescent. For example, the period character is not a sentence terminator when used in an ellipsis, as a decimal point in a number, etc.
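
    Short of a full grammar, a lookbehind split already avoids the decimal-point case. This is a regex stand-in, not Parse::RecDescent, and the input string is made up for illustration: it splits only where a terminator is followed by whitespace.

```perl
use strict;
use warnings;

# Hypothetical input, chosen to include a decimal point.
my $text = 'Prices rose 3.5 percent. Was that expected? Nobody knows!';

# Split only where a terminator is followed by whitespace; the lookbehind
# keeps the terminator attached to its sentence.
my @sentences = split /(?<=[.?!])\s+/, $text;

print scalar(@sentences), " sentences\n";
print "$_\n" for @sentences;
```

    Abbreviations such as "Dr. Smith" and an ellipsis followed by a space would still be split, which is where a real grammar starts to pay off.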