Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear all,

Basically, we are trying to install the Perl script shown at http://guardianunlimited.co.uk/distribution/harvesters/perlscript/0,7192,,00.html

What this does is pull/harvest newsfeeds from the Guardian Unlimited's site. We would like to integrate the following literature news into our site: http://www.guardianunlimited.co.uk/Distribution/Artifact_Trail_Block/0,5184,102937-0-,00.html

However, there are no instructions provided for the script and our knowledge of Perl is not great. It has taken up a lot of our time and we have no other source for help. Can anybody possibly help us? Please explain the bits of the script we need to customize/edit on this messageboard or by writing to email@litmania.com. If need be, we would be pleased to remunerate anybody who can help us for their time. I reckon it wouldn't take anybody more than half an hour to install/amend the script.

Incidentally, does anybody know of any other company which provides specific literature/book newsfeeds (other than isyndicate)? We'd like one that shows not just the headlines but also the first paragraph or so of the actual news on our site.

Thank you VERY much.

Yours sincerely,
N. Azam
Litmania.com Inc.
http://www.litmania.com

Re: Help needed with a Perl script
by turnstep (Parson) on Aug 08, 2000 at 00:09 UTC
    It looks like the only things you have to change are the sections marked "..need to be edited..", specifically these variables:

    • $doc_root
    • $content_directory
    • %content_to_retrieve
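
    For the literature feed mentioned in the question, the edited section might look roughly like the configuration fragment below. This is only a sketch: '/www/htdocs' is the script's own default and must be replaced with your real document root, and the local filename 'guardian_books.html' is a made-up placeholder. Only the remote URL is taken from the original post.

```perl
# Sketch of the three settings to edit; it replaces the matching
# declarations in the harvester script and is not a complete program.

# ASSUMPTION: placeholder path, replace with your server's document root.
my $doc_root = '/www/htdocs';

# The retrieved pages are written here; the directory must already exist
# and be writable by the user that runs the script.
my $content_directory = $doc_root . '/content';

# 'local filename' => 'remote url'
# ASSUMPTION: 'guardian_books.html' is a hypothetical name, choose your own.
my %content_to_retrieve = (
    'guardian_books.html' =>
        'http://www.guardianunlimited.co.uk/Distribution/Artifact_Trail_Block/0,5184,102937-0-,00.html',
);
```

    With these values the script would write the harvested page to /www/htdocs/content/guardian_books.html each time it runs.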

    For those interested, here is the script itself: (at least it starts with -Tw yay!)

    #!/usr/local/bin/perl -Tw
    ####################################
    # Content Harvester
    #
    # v1.1 - Jamie Unwin, Kieran Topping
    # Guardian Unlimited, Guardian Newspapers Limited 2000
    #
    # Automatically harvest distributed Guardian Unlimited content
    #
    # ===================================================
    # IMPORTANT NOTES - PLEASE READ
    # ===================================================
    # This script is provided "as is" and as an EXAMPLE only.
    #
    # This script will need to be modified in order to fit into your particular
    # environment, and to add an appropriate level of error checking.
    #
    # Guardian Unlimited cannot offer technical support for implementing this script.
    #
    # Modification and execution should only be attempted by the Webmaster or
    # Sys-Admin of your site, and then only if they have experience and
    # responsibility in the following fields:
    #
    # * Perl
    # * LWP module
    # * Webserver & (your particular) operating system.
    #
    # No responsibility can be accepted by Guardian Unlimited for any damage caused
    # to your website or computer systems arising from use of this script.
    #
    # If in doubt, DO NOT EXECUTE THIS SCRIPT.
    #
    # See http://www.guardianunlimited.co.uk/distribution
    # for further conditions of use
    #
    ################################################

    # load required modules
    use strict;          # this turns on strict error checking
    use LWP::UserAgent;  # this loads the LWP module (used to retrieve a web page)

    ################################################
    # Global variables
    ################################################

    # --This scalar will need to be edited to suit your particular environment--
    # Path to the document root of your web space
    # (on your local file system)
    #
    my $doc_root = '/www/htdocs';

    # --This scalar will need to be edited to suit your particular environment--
    # Path to local directory relative to your document root
    # (this is where the retrieved pages will be stored)
    #
    my $content_directory = $doc_root . '/content';

    # --This hash will need to be edited to suit your particular environment--
    # URLs of the content you wish to retrieve.
    #
    # The format is - 'local filename' => 'remote url'
    #
    # The 'local filename' is the name that the file will have on your webspace.
    # This is chosen by you, and is specified relative to the content directory.
    # e.g. 'guardian_news.html'
    #
    # The 'remote url' is the URL of the content you wish to retrieve.
    # You can obtain these URLs by following the instructions at
    # http://www.guardianunlimited.co.uk/distribution
    # These will look like
    # http://www.guardianunlimited.co.uk/Distribution/[...].html
    #
    my %content_to_retrieve = (
        'guardian_news.html'     => 'http://www.guardianunlimited.co.uk/Distribution/[...].html',
        'guardian_tv_radio.html' => 'http://www.guardianunlimited.co.uk/Distribution/[...].html'
    );

    ##############################################
    # Main
    #############################################

    # create a user agent (this is like a browser)
    my $ua = new LWP::UserAgent;
    $ua->agent('ContentHarvester/1.1 (GU)');

    # loop through each piece of content to be harvested
    foreach my $local_filename (keys %content_to_retrieve) {
        my $remote_url = $content_to_retrieve{$local_filename};

        # get the page (retrieve content)
        my $request  = new HTTP::Request('GET', $remote_url);
        my $response = $ua->request($request);
        my $content  = $response->content;

        # check we got the page
        # (a method call is not interpolated inside a string, so concatenate it)
        unless ($response->is_success) {
            die "$remote_url, " . $response->error_as_HTML . "\n";
        }

        # save the file to the local file system
        open (CONTENT, ">$content_directory/$local_filename")
            or die "Can't store the retrieved file locally, $content_directory/$local_filename, $!\n";
        print CONTENT $content . "\n";
        close CONTENT;
    }