
Hello everyone! I need some help with a program I am writing. It filters an HTML file using regular expressions and should report how many sentences are in the main text (including the title), then list the sentences that match a pattern taken from the command line. For example:

perl web-scan.pl "fall|election|2009" WebPage011.htm

The program will then print output like the following:

Input file "WebPage011.htm" contains 55 sentences composed of 905 distinct words.
3 sentences match the pattern "fall|election|2009". The sentences are:
4: "We hate elections."
16: "The dog was injured in a fall from the balcony."
24: "There will be no 2009 fall election."

So far I have more or less filtered the web page using regexes and counted all the sentences. What I don't know how to do is take the argument (in this case "fall|election|2009"), search the document for sentences that match it, and return each matching sentence's number. This is my whole code so far:

#!/usr/bin/perl -w
use strict;
use warnings;
use diagnostics;

print "$ARGV[0]\n";


my $first = shift @ARGV;    # remove the pattern so <> below reads only the HTML file
print "$first This is my First Argument!\n";


my $string = do { local $/; <> };

$string=~ s/[\n\r]//g;



$string=~ s/.*(<title>.*?<\/title>).*?(<body.*?<\/body>).*/$1,$2/gsi;
$string=~ s/<title>(.*?)<\/title>/$1/gsi;
$string=~ s/<body.*?>(.*?)<\/body>/$1/gsi;
$string=~ s/24&#176;//gsi;



$string=~ s/<!--.*?-->//gsi;
$string=~ s/<a.*?<\/a>//sgi;
$string=~ s/<form.*?<\/form>//sgi;
$string=~ s/<iframe.*?<\/iframe>//sgi;
$string=~ s/<noscript.*?<\/noscript>//sgi;
$string=~ s/<script.*?<\/script>//sgi;
$string=~ s/<select .*?<\/select>//sgi;
$string=~ s/<textarea.*?<\/textarea>//sgi;
$string=~ s/<li.*?<\/li>//sgi;
$string=~ s/<IMG.*?>//gsi;
$string=~ s/<div.*?>//gsi;
$string=~ s/<\/div.*?>//gsi;
$string=~ s/<b.*?>|<\/b>//gsi;
$string=~ s/<h1.*?>|<\/h1>//gsi;
$string=~ s/<h2.*?>|<\/h2>//gsi;
$string=~ s/<h3.*?>|<\/h3>//gsi;
$string=~ s/<h4.*?>|<\/h4>//gsi;
$string=~ s/<h5.*?>|<\/h5>//gsi;
$string=~ s/<h6.*?>|<\/h6>//gsi;
$string=~ s/<head.*?>|<\/head>//gsi;
$string=~ s/<html.*?>|<\/html>//gsi;
$string=~ s/<li.*?>|<\/li>//gsi;
$string=~ s/<option.*?>|<\/option>//gsi;
$string=~ s/<script.*?>|<\/script>//gsi;
$string=~ s/<p.*?>|<\/p>//gsi;
$string=~ s/<span.*?>//gsi;
$string=~ s/<\/span.*?>//gsi;
$string=~ s/<\/ul.*?>//gsi;
$string=~ s/<ul.*?>//gsi;
$string=~ s/<hr.*//gsi;
$string=~ s/<input.*?>//gsi;


$string=~ s/[^\x{00}-\x{7E}]//gsi;
$string=~ s/&nbsp;|&#160;/ /gsi;
$string=~ s/&#39;/'/gsi;
$string=~ s/&gt;/>/gsi;
$string=~ s/&lt;/</gsi;
$string=~ s/&amp;/&/gsi;    # decode &amp; last so "&amp;lt;" is not decoded twice
$string=~ s/CClear//gsi;

my @list = split(' ', $string);    # split(' ') also skips leading whitespace
my $word_count = scalar @list;     # number of words, not the last index
my @sentence = split(/[.?!]/, $string);

print "@list\n";

print "There are ", scalar(@sentence), " sentences in the list\n";

print "There are $word_count words.\n";

I know it's pretty messy; I was going to clean up the regexes after I figured out how to search through the web pages.

my $count;

foreach (@sentence){
    $count++;
    if (@sentence=~ m/$first/gsi){
        print "Matched! at line $count\n";
        print "@sentence[10]\n";
    }
    
}

I was thinking of using something like this to count the sentences and find out where the word is located, but to no avail. I also don't know how to do a match inside an if statement. Any and all information or direction would be highly appreciated. I've hit my limit for today's work with Perl, lol. Thanks!
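
For reference, here is a minimal sketch of one way the matching step could be approached. It is not a definitive solution. The helper names split_sentences, matching_sentences, and count_distinct_words are hypothetical (they do not appear in the code above), and the sample text stands in for the filtered page contents. The key points are that the match is applied to each sentence in turn rather than to the whole array, and that the sentence number is simply the array index plus one. Text after the last ".", "?" or "!" is ignored by this naive splitter, and a "word" here is just a run of \w characters.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: split cleaned-up text into sentences, keeping the
# terminating punctuation attached to each sentence.
sub split_sentences {
    my ($text) = @_;
    my @sentences = $text =~ /\s*([^.?!]+[.?!]+)/g;
    return @sentences;
}

# Hypothetical helper: return [number, sentence] pairs for every sentence
# that matches the pattern. Numbering starts at 1.
sub matching_sentences {
    my ($pattern, @sentences) = @_;
    my $re = qr/$pattern/i;    # compile the command-line pattern once
    my @hits;
    for my $i (0 .. $#sentences) {
        push @hits, [ $i + 1, $sentences[$i] ] if $sentences[$i] =~ $re;
    }
    return @hits;
}

# Hypothetical helper: count distinct words, ignoring case.
sub count_distinct_words {
    my ($text) = @_;
    my %seen;
    $seen{ lc $_ }++ for $text =~ /\w+/g;
    return scalar keys %seen;
}

# Sample text standing in for the filtered page (an assumption for the demo).
my $text = 'We hate elections. The cat slept. The dog was injured in a fall '
         . 'from the balcony. There will be no 2009 fall election.';
my $pattern = 'fall|election|2009';

my @sentences = split_sentences($text);
my @hits      = matching_sentences($pattern, @sentences);

printf "Input contains %d sentences composed of %d distinct words.\n",
    scalar @sentences, count_distinct_words($text);
printf "%d sentences match the pattern \"%s\". The sentences are:\n",
    scalar @hits, $pattern;
printf "%d: \"%s\"\n", @$_ for @hits;
```

In the real script, $pattern would come from the command line and $text from the filtered $string. Note that a pattern like "fall" also matches inside longer words such as "fallen"; wrapping the alternatives in \b(?:...)\b is one option if whole-word matching is wanted.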

In reply to Searching through a document and reporting results. by Tails
