This is more a thought exercise than anything, as I am trying to understand how Perl interacts with the web.

Suppose I have a website with a list of film titles. Clicking a title opens a new page with information about the film and a link to a text document containing the film's script (which can be opened or saved). I want to grab all the text spoken by one specific actor in a specific film.

Usually, I would just ask the user to save the files to the same directory as the program. But is there a way for the user to simply enter a film title at the command line, so that the program automatically checks whether FILM_NAME exists by requesting www.**************.com/FILM_NAME? If it does exist, the program would then open the text file linked from that page, search through it, and extract the lines for the specific actor.

I know I should use HTML::TokeParser::Simple and LWP::Simple. Someone gave me the following example for taking quotes from a quotations website:

use strict;
use HTML::TokeParser::Simple;

my @letters = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath");

foreach my $letter (@letters) {
    my $baseUrl = "http://www.quotationspage.com/quotes/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl =
              "http://www.quotationspage.com" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser =
              HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }

        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}

But I just need one push. Suppose I wanted the contents of Wikipedia instead (I don't really; I am just trying to work out how this works with other websites). I have tried the following, but it does not work:

use strict;
use HTML::TokeParser::Simple;

my @letters = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath");

foreach my $letter (@letters) {
    my $baseUrl = "http://en.wikipedia.org/wiki/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if (   $parent_token->is_tag('div')
            && $parent_token->get_attr('class') eq 'authorrow' )
        {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl =
              "http://en.wikipedia.org/wiki" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser =
              HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if (   $child_token->is_tag('dt')
                    && $child_token->get_attr('class') eq 'quote' )
                {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }

        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}

What am I doing wrong? Thanks again!
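
To make the film-script idea at the top of this post concrete, here is a minimal sketch of what such a program might look like. It rests on several assumptions that are not established anywhere above: the base URL `http://www.example.com` is only a placeholder for the real site, the script is assumed to sit directly at `FILM_NAME.txt` rather than behind a link on the film's page, and the script text is assumed to put each speaker's name alone on a line, followed by that speaker's lines and then a blank line. The helper name `lines_for_character` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the speeches of one character.
# ASSUMPTION: a speaker's name stands alone on a line, the speech follows
# on the next lines, and a blank line ends the speech. Real scripts may be
# laid out differently and would need a different pattern.
sub lines_for_character {
    my ( $script, $character ) = @_;
    my ( @speeches, @current );
    my $speaking = 0;

    for my $line ( split /\r?\n/, $script ) {
        if ( $line =~ /^\s*\Q$character\E\s*$/i ) {
            $speaking = 1;                 # cue line for our character
        }
        elsif ( $line =~ /^\s*$/ ) {       # a blank line ends any speech
            push @speeches, join ' ', @current if @current;
            @current  = ();
            $speaking = 0;
        }
        elsif ($speaking) {
            $line =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
            push @current, $line;
        }
    }
    push @speeches, join ' ', @current if @current;   # speech at end of file
    return @speeches;
}

# Only touch the network when both arguments are supplied.
if ( @ARGV == 2 ) {
    require LWP::Simple;
    require URI::Escape;

    my ( $film, $character ) = @ARGV;

    # ASSUMPTION: placeholder host and URL layout, not the real site.
    my $url = 'http://www.example.com/'
            . URI::Escape::uri_escape($film) . '.txt';

    # LWP::Simple::get returns undef when the page cannot be fetched,
    # which doubles as the "does this film exist?" check.
    my $script = LWP::Simple::get($url);
    die "Could not fetch $url (no such film?)\n" unless defined $script;

    print "$_\n\n" for lines_for_character( $script, $character );
}
```

If saved as, say, `actor_lines.pl` (a made-up name), it would be run as `perl actor_lines.pl FILM_NAME CHARACTER`. Keeping the extraction in its own subroutine means it can be tried on a locally saved script before any web fetching is involved.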

2006-06-14 Retitled by GrandFather, as per Monastery guidelines
Original title: 'Opening a File on a Website in Text Format?'

In reply to Extracting content from html by sdslrn123
