mcoblentz has asked for the wisdom of the Perl Monks concerning the following question:

As part of a project to scrape a web page and obtain the data inside a table for further evaluation and analysis, I wrote the code below (borrowing from public examples that I don't always understand - yes, I'm new to Perl). However, on compiling/invocation, I get the following error response, which I don't quite understand: Can't locate object method "open" via package "GLOB" at C:/Perl/lib/IO/File.pm line 163. I tried to follow this via the debugger, but it closed down at the point of the error. Since my program doesn't have 163 lines, the error must be in the IO::File module, which I thought was standard. I'm running ActiveState Perl 5.10 on XP SP2. I installed nmake1.5 from MSFT in order to install the HTML::TableExtract module from CPAN. The web page is cited in the code if you are that curious.

#!/usr/bin/perl -w
# based on: extract-table.pl,v 24.1 2006/10/21 01:19:37 from Raman @ Koders.com
# Accepts a URI and table spec; returns a csv file

use strict;
use FileHandle;
use LWP::UserAgent;
use HTML::TableExtract;
#use IO::File;
use Getopt::Long;
use WWW::Mechanize;
use vars qw (%options);

my ($url, $file, $task, $depth, $count, $cols);
my %options = (task    => \$task,
               url     => \$url,
               file    => \$file,
               depth   => \$depth,
               count   => \$count,
               headers => \$cols);
GetOptions (\%options, 'file=s', 'url=s', 'task=s', 'depth=i', 'count=i', 'headers=s');

# get the data from the web. Typically this is http://www.sailwx.info/shiptrack/cruiseships.phtml
# either pass this in as --url <page_url> when invoking or just set it.
$cols = "Ship,'last reported (UTC)',position,Callsign";
$url = "http://www.sailwx.info/shiptrack/cruiseships.phtml";

my $input;
my $output = new OUTFILE ('>C:\Program Files\cron\Cruise Ships\ship_data.csv');
open (OUTFILE, '>C:\Program Files\cron\Cruise Ships\ship_data.csv');
my $m = WWW::Mechanize->new();
$m->get($url);
$input = $m->content;
print (OUTFILE $input);

my $te;
if ( defined ($cols)) {
    my @headers = split(',', $cols);
    $te = new HTML::TableExtract(headers=>\@headers);
} else {
    $te = new HTML::TableExtract( depth => $depth, count=>$count);
}
$te->parse_file($input);

my ($ts,$row);
foreach $ts ($te->table_states) {
    foreach $row ($ts->rows) {
        $output->print ( join(',', @$row), "\n");
    }
}
close (OUTFILE);
if (defined ($url)) {
    unlink ($input);
}

The code above is a work in progress, of course; I'm just trying to scrape the page, find the table data and place it in a CSV file so I can query it via DBD::CSV to create a text file which helps me track airplanes in flight or cruise ships (this particular project) on the screen background on the laptop.

Re: File open problem with "GLOB"
by ikegami (Patriarch) on Mar 15, 2008 at 22:34 UTC
    The problem is with
    my $output = new OUTFILE ('>C:\Program Files\cron\Cruise Ships\ship_data.csv');

    Not exactly sure* why you get the specific error you are getting, but that statement is both wrong and the cause. That statement is supposed to mean

    my $output = OUTFILE->new('>C:\Program Files\cron\Cruise Ships\ship_data.csv');

    which isn't what you want at all. You want

    open(my $output, '>', 'C:\\Program Files\\cron\\Cruise Ships\\ship_data.csv');

    But now you have two file handles to the same file (OUTFILE and $output). Use

    use IO::Handle qw( );
    ...
    my $out_fn = 'C:\\Program Files\\cron\\Cruise Ships\\ship_data.csv';
    open(OUTFILE, '>', $out_fn)
        or die("Unable to create output file \"$out_fn\": $!\n");
    ...
    print (OUTFILE $input);
    ...
    OUTFILE->print ( join(',', @$row), "\n");

    Or better yet, don't use global variables:

    my $out_fn = 'C:\\Program Files\\cron\\Cruise Ships\\ship_data.csv';
    open(my $out_fh, '>', $out_fn)
        or die("Unable to create output file \"$out_fn\": $!\n");
    ...
    print ($out_fh $input);
    ...
    $out_fh->print ( join(',', @$row), "\n");

    Of course, using two different ways of calling print is confusing. You should stick to the one you like.

    Note I added error checking to open. If anything's going to fail when you run the program, that's going to be it.

    * It's a mixture of three things: indirect object syntax requires guesswork on Perl's part, barewords often represent file handles, and file handles are blessed as IO::Handle objects by default.
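
The no-globals approach above can be sketched end to end as follows. This is only a minimal illustration, not the poster's final script: the temporary directory, the file name and the two sample rows are hypothetical stand-ins for the fixed Windows path and the scraped table, chosen so the snippet runs anywhere with core Perl.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Spec;

# Hypothetical output location; the thread writes to a fixed Windows path instead.
my $dir    = tempdir( CLEANUP => 1 );
my $out_fn = File::Spec->catfile( $dir, 'ship_data.csv' );

# Hypothetical rows standing in for the scraped table.
my @rows = (
    [ 'Ship A', '2008-03-15 22:00', 'N 25 W 80', 'CALL1' ],
    [ 'Ship B', '2008-03-15 21:30', 'N 18 W 66', 'CALL2' ],
);

# Three-argument open on a lexical handle, with error checking.
open( my $out_fh, '>', $out_fn )
    or die("Unable to create output file \"$out_fn\": $!\n");

# One style of print used throughout.
for my $row (@rows) {
    print {$out_fh} join( ',', @$row ), "\n";
}

close($out_fh)
    or die("Unable to finish writing \"$out_fn\": $!\n");
```

Because the handle lives in a lexical variable, there is no bareword such as OUTFILE for Perl to misread as a class name.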

      ikegami, thanks for the help. I tried all your suggestions and have decided to keep the no-globals approach you drafted. Much appreciated!
Re: File open problem with "GLOB"
by Narveson (Chaplain) on Mar 15, 2008 at 22:27 UTC

    Instead of

    my $output = new OUTFILE ('>C:\Program Files\cron\Cruise Ships\ship_data.csv');
    open (OUTFILE, '>C:\Program Files\cron\Cruise Ships\ship_data.csv');
    # and later
    print (OUTFILE $input);

    say

    open my $output, '>', q{C:\Program Files\cron\Cruise Ships\ship_data.csv};
    # and later
    print {$output} $input;

      No need for the curlies around $output in the print statement.

        They disambiguate the one use of the horrid indirect object syntax that you can't easily avoid in Perl 5. (I know about IO::Handle, but it's even further away from the principle of least surprise.)

        theDamian, Perl Best Practices, Chapter 10: IO

        Printing to Filehandles

        Always put filehandles in braces within any print statement.
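
A small sketch of where the braces stop being a matter of taste. The in-memory handles and the %handle_for lookup table are hypothetical, introduced only so the example runs without touching the disk; with a plain scalar the braces are optional, but once the handle comes from a hash element or any other expression, print needs them.

```perl
use strict;
use warnings;

# In-memory filehandles (core Perl) stand in for real files.
my ( $report, $errors ) = ( '', '' );
open( my $out_fh, '>', \$report ) or die("Cannot open in-memory handle: $!\n");
open( my $err_fh, '>', \$errors ) or die("Cannot open in-memory handle: $!\n");

# Hypothetical lookup table mapping a status to a handle.
my %handle_for = ( ok => $out_fh, bad => $err_fh );

# Simple scalar: braces are optional, but they make the handle stand out.
print {$out_fh} "Ship A,on course\n";

# Hash element (or any expression): braces are required.
print { $handle_for{bad} } "Ship B,no position\n";

close($out_fh);
close($err_fh);
```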

        Dear brethren, I had the good fortune yesterday to spot an unanswered Seeker query to which I knew the answer. As I wrote my reply, I strove to be helpful, correct, but above all quick.

        There are details I might have done differently. The querent said

        '>C:\Program Files\cron\Cruise Ships\ship_data.csv'
        and when I converted to three-arg open I could have left the filename at
        'C:\Program Files\cron\Cruise Ships\ship_data.csv'
        or converted to
        'C:/Program Files/cron/Cruise Ships/ship_data.csv'
        because (as all Windows Perl programmers should be told) forward slashes work just fine for telling Windows where to go, and can never be mistakenly parsed as escapes, and I'd rather not spend any time at all thinking about when a backslash is an escape and when it isn't.
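
For anyone who does want to think about it, here is a rough sketch of when a backslash is an escape. The path C:\temp\new.csv is a made-up example, picked because \t and \n are both escapes in double quotes.

```perl
use strict;
use warnings;

# Single quotes: a backslash is literal unless it precedes \ or '.
my $single  = 'C:\temp\new.csv';
my $doubled = 'C:\\temp\\new.csv';

# Double quotes: \t and \n become a tab and a newline, silently corrupting the path.
my $interpolated = "C:\temp\new.csv";

# Forward slashes have no special meaning in either quoting style.
my $forward = "C:/temp/new.csv";

print "single-quoted forms are identical\n"   if $single eq $doubled;
print "double-quoted path contains a tab\n"   if $interpolated =~ /\t/;
print "forward-slash path is unchanged\n"     if $forward eq 'C:/temp/new.csv';
```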

        I could also have changed $output to $output_fh and $input to $input_data, but chose instead to retain the querent's own variable names.

        Others would certainly have made different choices. I'm still surprised that the particular nit that was picked was the pair of braces around the filehandle in

        print {$output} $input;
Re: File open problem with "GLOB"
by Cody Pendant (Prior) on Mar 16, 2008 at 04:32 UTC
    ...and once you've got all your filehandle problems sorted out, you'll notice that it still doesn't work, because you've got
    $te->parse_file($input);
    which is for files/filehandles. You just want
    $te->parse($input);
    because you're working on a string.


    Nobody says perl looks like line-noise any more
    kids today don't know what line-noise IS ...
      Ah, Thanks Cody. I would have struggled with that one for a while. I was indeed getting a parse error there. I presume that's what
      Unsuccessful open on filename containing newline at C:/Perl/lib/HTML/Parser.pm line 95.
       at C:/Perl/lib/HTML/Parser.pm line 95
              HTML::Parser::parse_file('HTML::TableExtract=HASH(0x26dff64)', '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"...') called at cruise_ships2.pl line 59
      main::(cruise_ships2.pl:62):    my ($ts,$row);
      was trying to tell me...
Re: File open problem with "GLOB"
by ikegami (Patriarch) on Mar 16, 2008 at 04:38 UTC
    unlink ($input);

    What's that supposed to do? It doesn't look like $input holds a file name. Perhaps you meant

    undef ($input);

    but that's totally useless since the variable is destroyed when the scope ends at the end of the script, the next line.

    Another consideration is that split(',', ...) is misleading, since the first argument is treated as a regexp even when it is written as a string (a single space is the one special case). I prefer always specifying a regexp: split(/,/, ...).
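
A short sketch of why the string form can mislead. The header list is a shortened, hypothetical version of the one in the original script, and 'a.b.c' is just a sample value. With a comma both forms behave the same, but as soon as the separator is a regexp metacharacter the string form does something quite different.

```perl
use strict;
use warnings;

my $cols = "Ship,position,Callsign";    # shortened, hypothetical header list

# With a comma the string form and the regexp form agree.
my @from_string = split( ',', $cols );
my @from_regexp = split( /,/, $cols );

# With a regexp metacharacter they do not: '.' means "any character",
# so every character is a separator and no fields are left.
my @dotted_wrong = split( '.',  'a.b.c' );
my @dotted_right = split( /\./, 'a.b.c' );

printf "%d fields vs %d fields\n", scalar(@dotted_wrong), scalar(@dotted_right);
```

Writing the pattern as /\./ makes it obvious that escaping is needed; the quoted '.' hides that.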

    Finally,
    my $ts;
    foreach $ts (...) { ... }
    can be written as simply
    foreach my $ts (...) { ... }

      ikegami, thanks for the replies and suggestions. They worked very well. I changed the split(',', ...) statement per your suggestion. I also deleted the unlink/undef lines, because they were an artifact of the example I borrowed. I think they were originally intended to close down an input file handle if the user supplied one in the options - I don't need that.

        For future reference, closing file handles is done with close, not unlink. Using undef or letting the variable go out of scope both do the trick as well.

        I use the last option. I don't remember ever having to explicitly close a file handle in Perl.
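
A minimal sketch of the scope-based approach, assuming a lexical file handle (a bareword handle such as OUTFILE is global and is not closed this way). The temporary directory and file name are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );
use File::Spec;

# Hypothetical scratch location so the example runs anywhere.
my $dir    = tempdir( CLEANUP => 1 );
my $out_fn = File::Spec->catfile( $dir, 'scope_demo.csv' );

{
    open( my $out_fh, '>', $out_fn )
        or die("Unable to create output file \"$out_fn\": $!\n");
    print {$out_fh} "Ship A,CALL1\n";
    # No close here: the handle is flushed and closed when $out_fh goes out of scope.
}

# Read the file back to confirm the data reached the disk.
open( my $in_fh, '<', $out_fn ) or die("Unable to read \"$out_fn\": $!\n");
my $contents = do { local $/; <$in_fh> };
close($in_fh) or die("Unable to close \"$out_fn\": $!\n");

print $contents;
```

One trade-off worth keeping in mind: an implicit close cannot report a failed final write, so an explicit close with an "or die" may still be the safer choice for output files that matter.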