This is what I have so far. It kind of works, but it lacks finesse and is seriously flawed. But hey, it's only one day into my Perl adventure, so I think I've done OK so far.

use strict;
use warnings;
use LWP::Simple qw(get);
use File::Slurp;

# pid = Persona ID, one of a player's 3 identities.
# sid = Sortie ID, identifier for a mission taken by the persona.
# We want to crawl all sortie list pages, collect new sid's we
# haven't seen before, and then move on to the next persona.

my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my $pcnt    = 1;
my $pidfile = 'c:/scr/pidlist.txt';

# Open the list of pid's and set the first one as the current pid.
open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
my $pid = <PIDLIST>;
chomp $pid;
print $pid;

# Grab and store sortie list pages for the persona.
while (1) {
    my $page = get "$pbase?page=$pcnt&pid=$pid";

    # Store the grabbed web page in the file.
    append_file( "c:/scr/$pid.txt", $page );

    # Update the page number and grab the next one.
    $pcnt += 1;
}

# Close files.
close PIDLIST or die $!;
print "\nDone!\n";

The flaw here is that the server will quite happily keep serving empty sortie list pages, so just incrementing the page count and hoping for a failed fetch to exit the loop doesn't work (the result is a huge file).
I want the loop to exit under either of two conditions: the string "No more sorties" is found on the page (end of list), OR a sid equal to the stored variable for the last one processed is reached. (sids are six-digit strings that I need to collect from the fetched pages.)
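Those two exit conditions could be pulled into one helper so the loop body stays simple. A sketch, with `done_crawling` and `$last_sid` as made-up names, assuming the last processed sid appears verbatim somewhere in the page HTML:

```perl
use strict;
use warnings;

# Hypothetical helper: decide whether to stop fetching pages.
# $page is the fetched HTML; $last_sid is the sid we stopped at last
# run (may be undef on a first-ever crawl of this persona).
sub done_crawling {
    my ( $page, $last_sid ) = @_;
    return 1 if !defined $page;                          # fetch failed
    return 1 if $page =~ /No more sorties/;              # end of list
    return 1 if defined $last_sid
              && $page =~ /\b\Q$last_sid\E\b/;           # caught up to last run
    return 0;
}

# The fetch loop would then become:
#   while (1) {
#       my $page = get "$pbase?page=$pcnt&pid=$pid";
#       last if done_crawling( $page, $last_sid );
#       append_file( "c:/scr/$pid.txt", $page );
#       $pcnt += 1;
#   }
```

With `last if done_crawling(...)` before the `append_file` call, the empty pages the server keeps serving never make it into the file.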

This code uses LWP, but the suggestion was WWW::Mechanize, so I need to rewrite it to use that instead.
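A minimal WWW::Mechanize version of the fetch might look like this (a sketch; the pid value is made up, and `autocheck => 0` is passed so a failed fetch doesn't die outright):

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my $mech  = WWW::Mechanize->new( autocheck => 0 );

my ( $pid, $pcnt ) = ( 12345, 1 );    # hypothetical pid, for illustration only
$mech->get("$pbase?page=$pcnt&pid=$pid");
if ( $mech->success ) {
    my $page = $mech->content;        # same HTML string LWP::Simple's get() returned
    # ... check for "No more sorties", append to file, etc. ...
}
```

`$mech->content` gives back the page as a string, so everything downstream of the fetch (the stop check, the file append) carries over unchanged from the LWP::Simple version.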
I also need to redo the pid-loading part so it actually works its way through the whole list of pids; eventually it will also have to fetch two variables in pairs (the last processed sid in addition to the pid).
I tried using File::Slurp to open and read the pidlist file, but that didn't work out as planned. For some reason $pid isn't printed as expected any more.
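Working through the whole list could be a plain line-by-line loop rather than a slurp. A sketch, assuming a one-pair-per-line, whitespace-separated file format and a made-up `parse_pid_line` helper:

```perl
use strict;
use warnings;

# Hypothetical format: one "pid sid" pair per line, e.g. "12345 678901";
# the sid may be missing on a first-ever run for that persona.
sub parse_pid_line {
    my ($line) = @_;
    chomp $line;
    return if $line !~ /\S/;                       # skip blank lines
    my ( $pid, $last_sid ) = split ' ', $line;
    return ( $pid, $last_sid );
}

# The outer loop then becomes:
#   my $pidfile = 'c:/scr/pidlist.txt';
#   open my $fh, '<', $pidfile or die "Could not open $pidfile: $!";
#   while ( my $line = <$fh> ) {
#       my ( $pid, $last_sid ) = parse_pid_line($line) or next;
#       # ... crawl this persona's sortie pages here ...
#   }
#   close $fh or die $!;
```

Note the lexical filehandle (`my $fh`) with the three-argument `open`; that is generally considered better practice than the bareword `PIDLIST` style.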

Once that's working comes the tricky part: collecting the actual sortie pages and extracting the data I need from them.
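Collecting the sids themselves could start as simply as a global regex match over each stored page. A sketch, with `extract_sids` as a made-up helper, assuming the sids appear as standalone six-digit runs in the HTML:

```perl
use strict;
use warnings;

# Pull unique six-digit sids out of a page, in order of first appearance.
sub extract_sids {
    my ($html) = @_;
    my %seen;
    # \b on both sides matches exactly six digits,
    # not a six-digit slice of a longer number.
    return grep { !$seen{$_}++ } $html =~ /\b(\d{6})\b/g;
}
```

If the real pages wrap sids in known markup, tightening the pattern around that markup (or using an HTML parser) would cut down false matches from other six-digit numbers on the page.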

Any suggestions on good coding practices and habits to pick up are appreciated; might as well learn to do it right from the start.


In reply to Re^2: collect data from web pages and insert into mysql by SteinerKD
in thread collect data from web pages and insert into mysql by SteinerKD
