We've moved past this with your latest post, but I'd like to step back and look at what I did, for learning's sake.

When I copied this it just hung and only loaded one page, I think; it never gave any error messages or exited. I adapted it a bit by using some of your new bits while basically keeping my own loop. This seemed to work pretty well: it would churn through all pids in the list, fetching pages for each until it hit the last one, and then move on (after a bug fix).

#! /usr/bin/perl

use strict ;
use warnings ;
use LWP::Simple qw(get) ;
use File::Slurp ;

my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp' ;
my $pidfile = 'c:/scr/pidlist.txt' ;
my $lproc = '536192' ;

# Open list of pid's
open my $pidlist, "<", $pidfile or die "Could not open $pidfile: $!\n" ;

# Loop over list of pids one at a time
while (my $pid = <$pidlist>){
    chomp $pid ;
    print "Current persona processed: $pid\n" ;
    my $pcnt = 1 ;

    while (1) {
        my $url = "$pbase?page=$pcnt&pid=$pid";
        my $content = get($url);
        die "\nGet failed for $url: $!\n" unless $content;

#        my $page = get "$pbase?page=$pcnt&pid=$pid" ;

        # Exit loop if page is empty
        last if $content =~/No sorties/ ;

        # Store grabbed webpage into the file        
        append_file( "c:/scr/$pid.txt", $content ) ;

        # Exit loop if page contained last processed.
        last if $content =~/"sid=$lproc"/ ;
        
        # Update page number and run loop again.
        print "Page $pcnt\n" ;
        $pcnt++ ; 
    } ;
} ; 

# Close files
close $pidlist or die $! ;
print "\nDone!\n" ;

The serious bug it had previously was that the page counter was defined early in the script, outside the loop. That meant pages 1 to x were processed for the first pid, and then the count carried on from x for each successive user. NOT good! Moving the variable definition inside the loop fixed it.
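
To illustrate that fix in isolation, here is a minimal sketch that needs no network access. The pids and page counts in %pages_for are made up for the example and stand in for the real site; the point is only that $pcnt is declared inside the outer loop, so it restarts at 1 for each pid.

```perl
use strict;
use warnings;

# Hypothetical stand-in data: pid => number of pages that pid has.
my %pages_for = ( 101 => 3, 102 => 2 );

my @visited;
for my $pid ( sort keys %pages_for ) {
    # Declared inside the loop, so every pid starts again at page 1.
    my $pcnt = 1;
    while ( $pcnt <= $pages_for{$pid} ) {
        push @visited, "$pid:$pcnt";
        $pcnt++;
    }
}

print join( ' ', @visited ), "\n";
```

With the declaration moved above the outer loop instead, the second pid would start at page 4 and never visit its own pages 1 and 2.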

When the "No sorties" string was encountered it exited the loop properly, but the second condition, the one that looks for $lproc, doesn't work: it never triggers, even if I set a number that I know appears a few pages down for one pid.
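
One possible cause, which I have not confirmed against the real pages: the double quotes inside the pattern are matched literally, so /"sid=$lproc"/ only matches when the page has a quote directly before sid= and directly after the number. The sketch below uses a made-up HTML fragment (the real markup may well differ) to compare the pattern with and without the quotes.

```perl
use strict;
use warnings;

my $lproc = '536192';

# Hypothetical fragment of a sorties page; the real markup is an assumption.
my $content = '<a href="sortie.jsp?sid=536192&pid=1">details</a>';

# Pattern as written in the script: the quotes are part of the match.
my $with_quotes = $content =~ /"sid=$lproc"/ ? 1 : 0;

# Same test without the quotes; \b stops 536192 from matching inside 5361920.
my $without_quotes = $content =~ /sid=\Q$lproc\E\b/ ? 1 : 0;

print "with quotes: $with_quotes, without quotes: $without_quotes\n";
```

If the real pages look anything like this fragment, dropping the quotes from the pattern would let the second exit condition trigger.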

Is there a particular reason you split the content retrieval from one line

my $page = get "$pbase?page=$pcnt&pid=$pid" ;

into two?

my $url = "$pbase?page=$pcnt&pid=$pid";
my $content = get($url);

From what I can tell they do exactly the same thing, just using an extra variable.

Thanks again!

In reply to Re^4: collect data from web pages and insert into mysql by SteinerKD
in thread collect data from web pages and insert into mysql by SteinerKD
