We've moved past this with your latest post, but I'd like to step back and look at what I did, for learning's sake.
When I copied your version it just hung and only loaded one page, I think; it never gave any error messages or exited. I adapted it a bit, using some of your new bits but basically keeping my own loop. This seemed to work pretty well: it churns through all the pids in the list, fetching pages for each one until it hits the last page, and then moves on to the next pid (after a bug fix).
    #! /usr/bin/perl
    use strict ;
    use warnings ;
    use LWP::Simple qw(get) ;
    use File::Slurp ;

    my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp' ;
    my $pidfile = 'c:/scr/pidlist.txt' ;
    my $lproc   = '536192' ;

    # Open list of pid's
    open my $pidlist, "<", $pidfile or die "Could not open $pidfile: $!\n" ;

    # Loop over list of pids one at a time
    while (my $pid = <$pidlist>){
        chomp $pid ;
        print "Current persona processed: $pid\n" ;
        my $pcnt = 1 ;
        while (1) {
            my $url = "$pbase?page=$pcnt&pid=$pid";
            my $content = get($url);
            die "\nGet failed for $url: $!\n" unless $content;
            # my $page = get "$pbase?page=$pcnt&pid=$pid" ;

            # Exit loop if page is empty
            last if $content =~/No sorties/ ;

            # Store grabbed webpage into the file
            append_file( "c:/scr/$pid.txt", $content ) ;

            # Exit loop if page contained last processed.
            last if $content =~/"sid=$lproc"/ ;

            # Update page number and run loop again.
            print "Page $pcnt\n" ;
            $pcnt++ ;
        } ;
    } ;

    # Close files
    close $pidlist or die $! ;
    print "\nDone!\n" ;
The serious bug it had previously was that the page counter was defined early in the script, outside the pid loop. That meant pages 1 to x were processed for the first pid, and then each following pid started at page x instead of page 1. NOT good! Moving the variable definition inside the pid loop fixed it.
When the "No sorties" string is encountered the loop exits properly, but the second condition, the one that looks for $lproc, doesn't work. It never triggers, even if I set a number that I know appears a few pages down for one pid.
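To convince myself what the old behaviour was, here is a minimal offline sketch of the counter problem. Nothing here touches the real site: the %pages_for table and the fetched_pages sub are made up for illustration, and the "page is empty" test stands in for the "No sorties" check.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the site: number of non-empty pages per pid.
my %pages_for = ( 101 => 2, 102 => 3 );

# Returns the "pid:page" pairs that would be fetched.
# $reset_per_pid true  => counter restarts at 1 for every pid (fixed version)
# $reset_per_pid false => one counter shared by all pids (the old bug)
sub fetched_pages {
    my ( $reset_per_pid, @pids ) = @_;
    my @fetched;
    my $pcnt = 1;
    for my $pid (@pids) {
        $pcnt = 1 if $reset_per_pid;
        while (1) {
            last if $pcnt > $pages_for{$pid};    # stands in for "No sorties"
            push @fetched, "$pid:$pcnt";
            $pcnt++;
        }
    }
    return @fetched;
}

print "shared: @{[ fetched_pages( 0, 101, 102 ) ]}\n";
print "reset:  @{[ fetched_pages( 1, 101, 102 ) ]}\n";
```

With the shared counter the second pid starts at page 3 and its first two pages are never fetched; with the reset, every pid starts at page 1.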
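One thing I can test offline is whether the double quotes inside the pattern matter, since inside a regex they are literal characters that must appear in the page. This is only a sketch: the $sample line is markup I made up, and the real page may look different, so it shows how the pattern behaves rather than proving this is the cause.

```perl
use strict;
use warnings;

my $lproc = '536192';

# Hypothetical markup; the real page may look different.
my $sample = '<a href="sortie.jsp?sid=536192">details</a>';

# Pattern as in the script: needs a literal " right before sid and after the id.
my $with_quotes = ( $sample =~ /"sid=$lproc"/ ) ? 1 : 0;

# Same pattern without the quotes; \b keeps 536192 from matching 5361920.
my $without_quotes = ( $sample =~ /sid=$lproc\b/ ) ? 1 : 0;

print "with quotes: $with_quotes, without quotes: $without_quotes\n";
```

On this sample the quoted pattern does not match and the unquoted one does, so comparing the pattern against the actual page source seems like the next thing to check.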
Is there a particular reason you split the content retrieval into two lines, going from

    my $page = get "$pbase?page=$pcnt&pid=$pid" ;

to

    my $url = "$pbase?page=$pcnt&pid=$pid";
    my $content = get($url);

From what I can tell they do exactly the same thing; the second just uses an extra variable.
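The only practical difference I can see is that the URL gets a name, so it can be reused in the error message. A small offline sketch of that guess, where fake_get is a made-up stand-in for LWP::Simple's get (which returns undef on failure) and the example.invalid URL is a placeholder:

```perl
use strict;
use warnings;

# Hypothetical stand-in for get() so the sketch runs without a network:
# it always "fails" by returning undef in scalar context.
sub fake_get { return; }

my ( $pbase, $pcnt, $pid ) = ( 'http://example.invalid/sorties.jsp', 1, 42 );

my $url     = "$pbase?page=$pcnt&pid=$pid";
my $content = fake_get($url);

# With the URL in its own variable, the message can name the exact request.
my $message = defined $content ? 'ok' : "Get failed for $url";
print "$message\n";
```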
Thanks again!