This is more of a thought exercise than anything, as I am trying to understand how Perl interacts with the web.

Say I have a website with a list of film titles. Each title, when clicked, opens a new page that contains information about the film as well as a link to a text document with the film's script (which can be opened or saved). I want to grab all the text spoken by one specific actor in a specific film.

Usually, I would just ask the user to save the files to the same folder as the program. But is there a way for the user to just input a film title at the command line, so that the program automatically checks whether such a FILM_NAME exists by requesting www.**************.com/FILM_NAME? If it does exist, the program would then open the text file linked from that page, search through it, and pull out the lines for the specific actor.

I know I should use HTML::TokeParser::Simple and LWP::Simple. Someone gave me the following example for taking quotes from a quotations website:
use strict;
use HTML::TokeParser::Simple;

my @letters = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath");

foreach my $letter (@letters) {
    my $baseUrl = "http://www.quotationspage.com/quotes/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if ( $parent_token->is_tag('div') && $parent_token->get_attr('class') eq 'authorrow' ) {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl = "http://www.quotationspage.com" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser = HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if ( $child_token->is_tag('dt') && $child_token->get_attr('class') eq 'quote' ) {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }
        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}
But I just need one push: what if I wanted the contents of Wikipedia instead? (I don't really, but I am just trying to work out how this works with other websites!) I have tried the following, but it does not work:
use strict;
use HTML::TokeParser::Simple;

my @letters = qw(A B C D E F G H I J K L M N O P Q R S T U V W X Y Z);
my $savePath = "C:/temp/quotes.txt";
open (OUT, ">>$savePath");

foreach my $letter (@letters) {
    my $baseUrl = "http://en.wikipedia.org/wiki/$letter.html";
    my $parent_parser = HTML::TokeParser::Simple->new( url => $baseUrl );
    my $parent_pr;
    while ( my $parent_token = $parent_parser->get_token ) {
        if ( $parent_token->is_tag('div') && $parent_token->get_attr('class') eq 'authorrow' ) {
            $parent_pr = 1;
            next;
        }
        if ( $parent_pr && $parent_token->is_tag('a') ) {
            my $authorUrl = "http://en.wikipedia.org/wiki" . $parent_token->get_attr('href');
            my $author = $parent_token->get_attr('href');
            $author =~ /\/quotes\/(.*?)\//;
            $author = $1;
            $author =~ s/_/ /g;
            my $child_parser = HTML::TokeParser::Simple->new( url => $authorUrl );
            my $child_pr;
            my $quote;
            while ( my $child_token = $child_parser->get_token ) {
                if ( $child_token->is_tag('dt') && $child_token->get_attr('class') eq 'quote' ) {
                    $child_pr = 1;
                    next;
                }
                if ( $child_pr && $child_token->is_text ) {
                    $quote .= $child_token->as_is;
                    next;
                }
                else {
                    if ( $child_token->is_end_tag('dt') ) {
                        $child_pr = 0;
                        print "$quote|| $author\n\n";
                        print OUT "$quote|| $author\n";
                        $quote = undef;
                        next;
                    }
                }
            }
        }
        else {
            if ( $parent_token->is_end_tag('div') ) {
                $parent_pr = 0;
            }
        }
    }
}
What am I doing wrong? Thanks again!
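
For the film case described at the top, the flow might look roughly like the sketch below. It is only an illustration under several assumptions that are not from the real site: www.example.com stands in for the hidden address, the script link is assumed to be the first link on the film page ending in ".txt", and each speech is assumed to sit on one line in the form "NAME: text". The sub names lines_for_actor and script_url are made up for this example.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the real site address.
my $BASE_URL = 'http://www.example.com';

# Return every speech by $actor, assuming one speech per line in the
# form "NAME: text" (an assumed layout; real scripts will differ).
sub lines_for_actor {
    my ($script, $actor) = @_;
    my @lines;
    for my $line (split /\r?\n/, $script) {
        push @lines, $1 if $line =~ /^\s*\Q$actor\E\s*:\s*(.*\S)/i;
    }
    return @lines;
}

# Return the absolute URL of the first link ending in ".txt" on the
# film page, or nothing if the page has no such link.
sub script_url {
    my ($html, $page_url) = @_;
    require HTML::TokeParser::Simple;
    require URI;
    my $parser = HTML::TokeParser::Simple->new( string => $html );
    while ( my $token = $parser->get_token ) {
        next unless $token->is_start_tag('a');
        my $href = $token->get_attr('href');
        next unless defined $href && $href =~ /\.txt$/i;
        return URI->new_abs( $href, $page_url )->as_string;
    }
    return;
}

# The network part runs only when a film name and an actor are given.
if ( @ARGV == 2 ) {
    my ($film, $actor) = @ARGV;
    require LWP::Simple;

    # get() returns undef on failure, which doubles as the
    # "does FILM_NAME exist?" check.
    my $page_url = "$BASE_URL/$film";
    my $html = LWP::Simple::get($page_url);
    die "No page found for '$film'\n" unless defined $html;

    my $txt_url = script_url($html, $page_url)
        or die "No script link found on $page_url\n";
    my $script = LWP::Simple::get($txt_url);
    die "Could not download $txt_url\n" unless defined $script;

    print "$_\n" for lines_for_actor($script, $actor);
}
else {
    print STDERR "Usage: $0 FILM_NAME ACTOR\n";
}
```

Saved under a hypothetical name such as film_lines.pl, it would be run as perl film_lines.pl FILM_NAME ACTOR. Keeping the text matching in its own sub means it can be tried on a saved script file before any web access is involved.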

2006-06-14 Retitled by GrandFather, as per Monastery guidelines
Original title: 'Opening a File on a Website in Text Format?'


In reply to Extracting content from html by sdslrn123
