sdslrn123 has asked for the wisdom of the Perl Monks concerning the following question:

This is more a thought exercise than anything, as I am trying to understand how Perl interacts with the web. Suppose I have a website with a list of film titles. Each film title, when clicked, opens a new page; each page contains info about the film as well as a link to a text document with the script of the film (which can be opened or saved). I want to grab all the text from one specific actor in a specific film. Usually I would just ask the user to save the files into the same directory as the program. But is there a way where, if the user just inputs a film title at the command line, I can automatically check whether such a FILM_NAME exists by having the program check: www.**************.com/FILM_NAME? If it does exist, can the program then automatically open the text file at that webpage, search through it, and pull out the necessary data for the specific actor? I know I should use HTML::TokeParser::Simple and LWP::Simple. Someone gave me the following example for taking quotes from a quotations website:
use strict;
use HTML::TokeParser::Simple;

my @letters  = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open( OUT, ">>$savePath" );

foreach my $letter (@letters) {
    my $baseUrl       = "http://www.quotationspage.com/quotes/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl = "http://www.quotationspage.com"
                . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser = HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                elsif ( $child_token->is_end_tag('dt') ) {
                    $child_pr = 0;
                    print "$quote|| $author\n\n";
                    print OUT "$quote|| $author\n";
                    $quote = undef;
                    next;
                }
            }
        }
        elsif ( $parent_token->is_end_tag('div') ) {
            $parent_pr = 0;
        }
    }
}
But I just need one push. Say I wanted the contents of Wikipedia (I don't really, but I am just trying to work out how this works with other websites!). I have tried the following, but it does not work.
use strict;
use HTML::TokeParser::Simple;

my @letters  = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open( OUT, ">>$savePath" );

foreach my $letter (@letters) {
    my $baseUrl       = "http://en.wikipedia.org/wiki/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl = "http://en.wikipedia.org/wiki"
                . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser = HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                elsif ( $child_token->is_end_tag('dt') ) {
                    $child_pr = 0;
                    print "$quote|| $author\n\n";
                    print OUT "$quote|| $author\n";
                    $quote = undef;
                    next;
                }
            }
        }
        elsif ( $parent_token->is_end_tag('div') ) {
            $parent_pr = 0;
        }
    }
}
What am I doing wrong? Thanks again!

2006-06-14 Retitled by GrandFather, as per Monastery guidelines
Original title: 'Opening a File on a Website in Text Format?'

Replies are listed 'Best First'.
Re: Extracting content from html
by eibwen (Friar) on Jun 13, 2006 at 04:05 UTC

    First, let's clarify your objective. From what I understand of both your introduction and subsequent code:

    1. Enter query term
    2. Attempt to download input file from known server/directory
    3. If not 404, parse html for desired link
    4. Attempt to download parsed link
    5. If not 404, parse html for desired content
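
    The first three steps above can be sketched roughly as follows. This is only a sketch: the www.example.com/FILM_NAME scheme and the film_url helper are assumptions standing in for whatever site you actually have in mind.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a user-typed title into the URL scheme
# described in the question (spaces to underscores; scheme assumed).
sub film_url {
    my ($title) = @_;
    $title =~ s/\s+/_/g;    # "The Third Man" -> "The_Third_Man"
    return "http://www.example.com/$title";
}

# Step 2: ask the server whether the page exists via a HEAD request.
# LWP::Simple::head() returns header info on success, undef on a 404.
sub page_exists {
    my ($url) = @_;
    require LWP::Simple;    # needs libwww-perl installed
    return LWP::Simple::head($url) ? 1 : 0;
}

if ( my $film = shift @ARGV ) {
    my $url = film_url($film);
    if ( page_exists($url) ) {
        print "Found $url - ready to parse it for the script link\n";
    }
    else {
        print "No such film page: $url\n";
    }
}
```

    A HEAD request is enough for the existence check (step 2); you would only fetch the full page body (step 3) once you know it is there.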

    You seem to be able to acquire the HTML page, although you presume it's not a 404; however, in order to properly comment on why your HTML::TokeParser::Simple code doesn't work, you'll have to elaborate on both the content you're trying to access ("the necessary data from specific actor" is particularly vague) and the bounding HTML content.

    I'd further assert that while descending the HTML structure will work, it may break if the site should be redesigned. Depending on what you're trying to accomplish, it may be easier to use a regex.

    From what I can make of your example code, you're working with an HTML file of the form:

    <html><head></head><body>
      <div class=wrapper>
        <a href="http://sub.domain.tld/folder/page.html">link text</a>
      </div>
    </body></html>
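
    With markup of that form, the link can be pulled with a single regex rather than a token walk. This is only a sketch: the pattern is tied to exactly this structure (class name, unquoted attribute, link position) and would need adjusting for the real site.

```perl
use strict;
use warnings;

# The wrapper page sketched above (assumed form, unquoted class attribute).
my $html = '<html><head></head><body> <div class=wrapper> '
         . '<a href="http://sub.domain.tld/folder/page.html">link text</a> '
         . '</div> </body></html>';

# Grab the href of the first link inside the wrapper div.
my ($link) = $html =~ m{<div class=["']?wrapper["']?>\s*<a href="([^"]+)"}s;

print "$link\n";    # http://sub.domain.tld/folder/page.html
```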

    which links to a page of the form:

    <html><head></head><body>
      <table>
        <tr><td class=quote>To be or not to be...</td></tr>
      </table>
    </body></html>

    The apparent 'dt' vs 'td' typo aside, I'd still need to know what criteria you're trying to employ to select which actor, which quote, et al.
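
    Given the child-page form above, and matching td rather than dt, a regex pull of the quote text might look like this. Again only a sketch tied to exactly that sample markup:

```perl
use strict;
use warnings;

# The child page sketched above (assumed form, unquoted class attribute).
my $html = '<html><head></head><body> <table> '
         . '<tr><td class=quote>To be or not to be...</td></tr> '
         . '</table> </body></html>';

# Note: td, not dt - and allow for an unquoted class attribute.
my @quotes = $html =~ m{<td class=["']?quote["']?>(.*?)</td>}g;

print "$_\n" for @quotes;    # To be or not to be...
```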

Re: Extracting content from html
by chris_nava (Acolyte) on Jun 13, 2006 at 03:57 UTC
    Built from the examples in the HTTP::Request and HTTP::Response docs. ****** UNTESTED ******
    use LWP::UserAgent;
    use HTTP::Request;

    my $request  = HTTP::Request->new( GET => "http://www.example.com/" . $file_name );
    my $ua       = LWP::UserAgent->new;
    my $response = $ua->request($request);
    if ( $response->is_success ) {
        print "File exists on web server\n";
    }
    else {
        print "File does not exist on web server.\n";
    }
Re: Extracting content from html
by Withigo (Friar) on Jun 15, 2006 at 00:28 UTC
    Another option is a kind of Kobayashi Maru scenario, where you change the constraints of the underlying problem in order to derive a simpler solution.

    Instead of messing with LWP, HTTP, and parsing HTML to get at the data, just get the data itself. I know you said you want to do it that way as an exercise, which is cool, so feel free to ignore the rest of what I'm about to say.

    IMDb.com makes all of their data available, along with alternative interfaces to it. One of those interfaces is a suite of Unix command-line utilities for querying the data, which sounds like exactly what you need. IMDb probably also appreciates the alternatives being used instead of screen scrapers.