Help with Web Scraping Script

EagerforPerl has asked for the wisdom of the Perl Monks concerning the following question:

use strict;
use warnings;
use LWP::Simple;
use File::Compare;
use File::Copy;
$| = 1;

sub main {
    #Create a file with current content, compare with all present files in directory if same, delete, if not, keep.
    unless(-e('filesaves') or mkdir('filesaves')) {
        die("Directory Couldn't Be Created.\n");
    }
        #create directory if it does not already exist
    my $fileName;
    print("Enter Site Directory: "); #Test input: http://caveofprogramming.com
        #Gather site URL with directory
    my $siteDirectory = <STDIN>;
    print("Number of Times to Run: "); #Test input: 10
    my $runAmount = <STDIN>;
        #Gather the number of times to check the web address
    unless(opendir(DIR, 'C:\\Program Files\\OSNE')) {
        die("Unable to open directory 'C:\\Program Files\\OSNE'\n");
    }
    for(my $i = 0; $i <= $runAmount; $i++) {
        my $file = readdir(DIR);
        closedir(DIR);
        $file = grep(/\.txt$/i, $file); #Filter as to only look for .txt files
        my $searchTable = get($siteDirectory); #Get HTML code from website
        if(defined($searchTable)) {
            $fileName = localtime() . '.txt'; #Set file name to the time it will be created
            $fileName =~ s/:/-/g; #remove the disallowed characters and replace them so that it can be the file name
            open(my $outputFile, '>', $fileName) or die("Couldn't Create File.\n");
            while($searchTable =~ m|<\s*a\s+[^>]*href\s*=\s*['"]([^>"']+)['"][^>]*>\s*([^<>]*)</|sig) {
                                #HTML code title filter regex
                print $outputFile ("$2: $1\n"); #print the titles to the text file
            }
            if(compare($fileName, $file) == 0) {
                close($outputFile); #close output
                unlink($fileName); #delete file
            }
            else {
                close($outputFile);
                move("C:\\Program Files\\OSNE\\'$file'","C:\\Program Files\\OSNE\\filesaves\\'$file'");
                                #Move the old file to filesave folder and keep the new file in the same directory as the script
                print("Change Detected.\n");
            }
            
        }
        else {
            print("URL Unaccessible: $siteDirectory\n");
        }
    }
}
main();

I'm new to Perl, and I am trying to make a program that reads a site's HTML (specifically the titles) repeatedly, as many times as the user has specified, and compares each scan with the previous scan of the website by comparing files. If the new file is the same as the other, delete the newer file. If the file is different, move the old file into the filesaves folder and keep the newer file in the same directory as the script.

The program runs, but it doesn't create the number of files specified by the for loop, doesn't move them to the correct folder, and doesn't delete them. For example, if you specify the number of times to run as 10, you will only have 7 text files.

Console Log:
readdir() attempted on invalid dirhandle DIR at C:\Program Files\OSNE\OSNE.pl line 23, <STDIN> line 2.
closedir() attempted on invalid dirhandle DIR at C:\Program Files\OSNE\CPMonitor.pl line 23, <STDIN> line 2.
Use of uninitialized value $_ in pattern match (m//) at C:\Program Files\OSNE\CPMonitor.pl line 24, <STDIN> line 2.
Change Detected.
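The keep-or-discard step described above can be sketched in isolation. This is a minimal sketch, not the poster's script: the helpers `keep_if_changed` and `write_file`, the file names, and the temporary directory are all hypothetical. It assumes each snapshot is fully written and closed before `compare` is called, since a file still open for writing may not have its buffered data on disk yet.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Compare qw(compare);
use File::Copy qw(move);
use File::Spec;
use File::Temp qw(tempdir);

# Hypothetical helper: keep $new only if it differs from $old.
# Returns 'unchanged' or 'changed'; dies if the comparison itself fails.
sub keep_if_changed {
    my ($old, $new, $archive_dir) = @_;

    my $result = compare($old, $new);    # 0 = same, 1 = different, -1 = error
    die "Could not compare '$old' and '$new'\n" if $result < 0;

    if ($result == 0) {
        unlink $new or die "Cannot delete '$new': $!\n";
        return 'unchanged';
    }

    # Different content: archive the old snapshot, keep the new one in place.
    my $target = File::Spec->catfile($archive_dir, basename($old));
    move($old, $target) or die "Cannot move '$old': $!\n";
    return 'changed';
}

# Hypothetical helper: write text to a file and close it right away.
sub write_file {
    my ($path, $text) = @_;
    open my $fh, '>', $path or die "Cannot write '$path': $!\n";
    print {$fh} $text;
    close $fh or die "Cannot close '$path': $!\n";    # flush before comparing
}

# Small demonstration in a throwaway directory.
my $dir     = tempdir(CLEANUP => 1);
my $archive = File::Spec->catdir($dir, 'filesaves');
mkdir $archive or die "Cannot create '$archive': $!\n";

my $old = File::Spec->catfile($dir, 'old.txt');
my $new = File::Spec->catfile($dir, 'new.txt');
write_file($old, "Home: /\n");
write_file($new, "Home: /\nNews: /news\n");

my $status = keep_if_changed($old, $new, $archive);
print "$status\n";
```

Building the paths with `File::Spec` avoids hand-written separators and stray quote characters inside file names.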


Replies are listed 'Best First'.
Re: Help with Web Scraping Script by 1nickt (Canon) on Oct 19, 2017 at 11:07 UTC
Hello EagerforPerl,

Your program does not compile:

`Global symbol "$directory" requires explicit package name (did you forget to declare "my $directory"?) at 1201637.pl line 25.`
`Global symbol "$directory" requires explicit package name (did you forget to declare "my $directory"?) at 1201637.pl line 26.`
`1201637.pl had compilation errors.`

This is because the `unless` block creates a scope around its contents, and lexically declared variables are not accessible outside that scope:

`$ perl -Mstrict -wlE 'unless (0) { my $foo = 42 }; say $foo'`
`Global symbol "$foo" requires explicit package name (did you forget to declare "my $foo"?) at -e line 1.`
`Execution of -e aborted due to compilation errors.`

If I change the code for opening the directory to:

`opendir my $directory, 'C:\\Program Files\\OSNE' or die("Unable to open directory 'C:\\Program Files\\OSNE'\n");`

... then I get a warning:

`Name "main::OUTPUT" used only once: possible typo at 1201637.pl line 34.`
`1201637.pl syntax OK`

This is because you open your filehandle as `$output` but try to use it as `OUTPUT` ...

Also please consider that most websites don't appreciate repeated or frequent polling; I would recommend no more than a daily check if you simply want to see whether a site has new pages.

The way forward always starts with a minimal test.
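The scoping point can be illustrated with a lexical directory handle declared in the enclosing scope. This is a sketch under assumptions: the temporary directory and its file names are made up for the demonstration, and collecting the `.txt` names with a list-context `readdir` plus `grep` is one common idiom, not something the reply itself prescribes.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Throwaway directory standing in for the real one (hypothetical contents).
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(a.txt b.TXT notes.log)) {
    my $path = File::Spec->catfile($dir, $name);
    open my $fh, '>', $path or die "Cannot create '$path': $!\n";
    close $fh or die "Cannot close '$path': $!\n";
}

# $dh is declared in this statement, at file scope, rather than inside an
# unless (...) { ... } construct, so it stays visible to the code below.
opendir my $dh, $dir or die "Unable to open directory '$dir': $!\n";

# readdir in list context returns every entry; grep keeps the .txt names.
my @txt_files = sort grep { /\.txt$/i } readdir $dh;
closedir $dh;

print "$_\n" for @txt_files;
```

Reading the whole listing once and then closing the handle also means nothing tries to read from a handle that has already been closed on a later loop iteration.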
Re^2: Help with Web Scraping Script by EagerforPerl (Novice) on Oct 19, 2017 at 19:17 UTC
I appreciate your response. The problem with the handle arose when I was trying to implement what the first reply suggested, and I missed one of the instances of OUTPUT, which I have fixed. It still doesn't work, though. About your concern regarding frequent polling: this is going to no more than 5 people including myself, and we don't plan on using it for very long. This is really just something I'm trying to do for practice. I of course don't want to unknowingly cause a denial of service on somebody's website.
Re: Help with Web Scraping Script by stevieb (Canon) on Oct 19, 2017 at 01:18 UTC
Welcome to the Monastery, EagerforPerl!

You've provided code, that's awesome (so is formatting it well!). You're also using (at a quick glance) the majority of proper techniques (`strict`, `warnings`, 3-arg `open` etc ++).

What I'd ask you to do, so the Monks may be better able to help, is tell us what the code currently does and how it deviates from what you're expecting. It would also be beneficial if you could provide the data that you're sending in as standard input so the Monks can test for themselves. If the URLs/input are off-limits somehow, that's understandable too... you'll just have to provide more detail on the expected/problematic situations.

ps. You do not need `sub main {...}` in Perl. If your file does not contain only a package (class), the code will run just fine without a `main()` function. You can just put your code left-justified (unlike eg: C).

pps. I would recommend, despite what I said above, one change to the 3-arg open you use. Bareword file handles (ie., things like `OUTPUT`) are global in scope. It is best common practice to use lexical (ie. scoped) handles instead. To do this, simply assign a scalar variable to hold the handle as opposed to the bareword: `open my $fh, '...', '...' or die ...`
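To make the bareword-versus-lexical point concrete, here is a small sketch. The file name and the `write_titles` helper are hypothetical; the point is only that a handle held in a `my` variable is confined to the scope that declares it, so two parts of a program cannot collide on a shared global name such as `OUTPUT`.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $path = File::Spec->catfile($dir, 'titles.txt');    # hypothetical file name

# Hypothetical helper: the handle $out exists only inside this sub.
sub write_titles {
    my ($file, @lines) = @_;
    open my $out, '>', $file or die "Couldn't create '$file': $!\n";
    print {$out} "$_\n" for @lines;
    close $out or die "Couldn't close '$file': $!\n";
    return scalar @lines;
}

my $written = write_titles($path, 'Home: /', 'About: /about');

# Read the file back through a second, independent lexical handle.
open my $in, '<', $path or die "Couldn't read '$path': $!\n";
chomp(my @lines = <$in>);
close $in;

print "wrote $written lines, read back ", scalar(@lines), "\n";
```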
Re^2: Help with Web Scraping Script by EagerforPerl (Novice) on Oct 19, 2017 at 01:56 UTC
I'm aware I can run it without a main function/subroutine in Perl, it's just a convention that I've decided to willingly borrow from my programming instructor. The bareword file handles are perhaps another convention really, but I will consider what you have advised. Thank you for your response.
Re: Help with Web Scraping Script by marto (Cardinal) on Oct 19, 2017 at 10:24 UTC
~~This isn't a SSCCE (e.g. `get` is not shown), however,~~ a couple of points. You read from STDIN but don't chomp. You don't print the reason for failure when you call `die`, e.g. `....or die "Can't open directory: $!\n"`. Hopefully you have rate limited requests so you're not hammering the sites in question. ~~Also How do I post a question effectively?~~ Update: Strikeout, see below.
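The points about `chomp`, `$!`, and rate limiting can be sketched as follows. Assumptions are labeled in the comments: the input is simulated with an in-memory handle instead of real STDIN so the sketch runs unattended, the directory name is a made-up one that is expected not to exist, and the pause length is an arbitrary illustrative value.

```perl
use strict;
use warnings;

# Simulated keyboard input (assumption: stands in for the two STDIN prompts).
my $typed = "http://caveofprogramming.com\n10\n";
open my $in, '<', \$typed or die "Cannot open in-memory input: $!\n";

my $url  = <$in>;
my $runs = <$in>;
chomp($url, $runs);    # strip the trailing newline from both values
close $in;

print "URL: [$url], runs: [$runs]\n";

# Including $! in the message reports why the system call failed.
my $missing = 'no-such-directory-for-demo';    # hypothetical path
my $error   = '';
opendir(my $dh, $missing)
    or $error = "Can't open directory '$missing': $!";
print "$error\n";

# A pause between requests keeps polling gentle on the remote site.
my $pause_seconds = 1;    # illustrative value only
for my $attempt (1 .. 2) {
    # ... the fetch and compare for this attempt would go here ...
    sleep $pause_seconds if $attempt < 2;
}
```

Without `chomp`, the URL and the run count both still carry the newline typed at the prompt, which then ends up in the request and in any messages that print them.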
Re^2: Help with Web Scraping Script by hippo (Archbishop) on Oct 19, 2017 at 10:33 UTC
"This isn't a SSCCE (e.g. get is not shown)"

The OP is using LWP::Simple, which exports `get`, so it's probably safe to assume that it is that one.
Re^3: Help with Web Scraping Script by marto (Cardinal) on Oct 19, 2017 at 10:40 UTC
(Picard) Face palm. I sit corrected. Too early for me.