in reply to collect data from web pages and insert into mysql

Sounds like an interesting project and, from what you've told us, it would be entirely suitable for Perl. The other monks have given you some good pointers.

If you could show us some example input (short but typical), what the resulting URL should look like, and the code you have so far, we may be able to give you some pointers. If you arrange your example code so that monks can download it and run it, you improve your chances of getting help.

The usual health warnings apply: use strict and warnings, make sure it compiles, and try to use short but meaningful variable names.

Good luck!


Re^2: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Jul 31, 2010 at 11:31 UTC

    This is what I have so far; it kind of works but lacks finesse and is pretty seriously flawed. But hey, it's only one day into my Perl adventure, so I think I've done OK so far.

    use strict;
    use warnings;
    use LWP::Simple qw(get);
    use File::Slurp;

    # pid = Persona ID, one of a player's 3 identities.
    # sid = Sortie ID, identifier for a mission taken by the persona.
    # We want to crawl all sortie list pages and collect new sid's we
    # haven't seen before and then move on to next persona.
    my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
    my $pcnt    = 1;
    my $pidfile = 'c:/scr/pidlist.txt';

    # Open list of pid's and set first one as current pid.
    open PIDLIST, "<", $pidfile or die "Could not open $pidfile: $!";
    my $pid = <PIDLIST>;
    chomp $pid;
    print $pid;

    # Grab and store sortie list pages for persona.
    while (1) {
        my $page = get "$pbase?page=$pcnt&pid=$pid";

        # Store grabbed webpage into the file
        append_file( "c:/scr/$pid.txt", $page );

        # Update page number and grab next.
        $pcnt += 1;
    }

    # Close files
    close PIDLIST or die $!;
    print "\nDone!\n";

    The flaw in this is that the server will quite happily keep giving you empty sortie list pages, so just updating the page count and hoping for a failure to exit doesn't work (resulting in a huge file).
    I want the loop to exit under either of two conditions: either the string "No more sorties" is found on the page (end of list), OR a sid string equal to the stored variable for the last one processed is reached. (sids are six-digit strings that I need to collect from the collected pages.)

    This code uses LWP, but the suggestion was Mechanize, so I need to rewrite it to use that instead.
    I also need to redo the load-pid bit so it actually works its way through the list of pids; it will also have to fetch two variables in pairs eventually (in addition to the pid, the last processed sid).
    I tried using Slurp to open and read the pidlist file, but that didn't work out as planned.
    For some reason $pid isn't printed out as expected any more.
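A rough sketch of what the Mechanize rewrite could look like. This is untested against the real site; the base URL and the "No sorties" marker come from the code in this thread, while the example pid and the `sortie_url` helper are invented for illustration:

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $pbase = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
my $mech  = WWW::Mechanize->new( autocheck => 1 );   # dies on failed requests

# Build the sortie-list URL for a given persona and page number.
# (Helper name is made up for this sketch.)
sub sortie_url {
    my ( $pid, $page ) = @_;
    return "$pbase?page=$page&pid=$pid";
}

my $pid  = 100001;    # example pid, not a real one
my $page = 1;
while (1) {
    $mech->get( sortie_url( $pid, $page ) );
    my $content = $mech->content;
    last if $content =~ /No sorties/;    # end of the list
    # ... extract sids from $content here ...
    $page++;
}
```

With autocheck on, Mechanize replaces the manual "die unless $content" test, since any failed GET throws an error for you.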

    When that's achieved comes the tricky part of collecting the actual sortie pages and extracting the data I need from them.

    Any suggestions on good coding practices and habits to pick up are appreciated; might as well learn to do it right from the start.

      That looks as though you are off to a good start.

      I have a couple of observations and some more questions.

      my $pid = <PIDLIST>;
      will read the first line from the file. To look at each line in turn, loop over the file handle with a while loop.
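A minimal sketch of that pattern. To keep it self-contained it reads from an in-memory string via a scalar filehandle; a handle opened on your pidlist.txt behaves exactly the same way:

```perl
use strict;
use warnings;

# For illustration only: open a filehandle on an in-memory string
# instead of the real pidlist file. The pids are made up.
my $data = "100001\n100002\n100003\n";
open my $pidlist, '<', \$data or die "Could not open: $!";

# The while loop reads one line per iteration until end of file.
while ( my $pid = <$pidlist> ) {
    chomp $pid;             # strip the trailing newline
    print "pid: $pid\n";    # process one pid here
}
close $pidlist;
```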

      The docs for LWP::Simple show how to check if the get is successful. Always best to do that.

      First question, what does a line in the pidlist.txt file actually look like? Is there any processing to be done to extract any data you need?

      Secondly, could you show a (cut down) sample of what the downloaded content looks like. It may be that you can extract the data you need on the fly and write the whole lot out in one go rather than save each page to disk and parse it later.

      Lastly, it looks as though the site needs a login and uses cookies. I'm curious to know how you managed to download any pages.

      Looking good though!

      This is some code illustrating the points above.

      #! /usr/bin/perl
      use strict;
      use warnings;
      use LWP::Simple qw(get);
      use File::Slurp;

      # pid = Persona ID, one of a player's 3 identities.
      # sid = Sortie ID, identifier for a mission taken by the persona.
      # We want to crawl all sortie list pages and collect new sid's we
      # haven't seen before and then move on to next persona.
      my $pbase   = q{http://csr.wwiionline.com/scripts/services/persona/sorties.jsp};
      my $pcnt    = 1;
      my $pidfile = q{c:/scr/pidlist.txt};

      # Open list of pid's
      open my $pidlist, q{<}, $pidfile
          or die qq{Could not open $pidfile: $!};

      # loop over list of pids one at a time
      while (my $pid = <$pidlist>){
          print $pid;
          chomp $pid;
          # what does a line from the pidlist look like?
          # do we need to do any work on it?
          while (1) {
              my $url     = qq{$pbase?page=$pcnt&pid=$pid};
              my $content = get($url);
              die qq{get failed for $url: $!} unless $content;

              # parse $content
              # extract data we need
              # test if there is more work to do

              # Update page number and grab next.
              $pcnt += 1;
          }
      }

      # Close files
      close $pidlist or die $!;
      print qq{\nDone!\n};

        No log/pass needed; you can log on, but it isn't needed just for watching people's stats.

        Currently the pidlist is just one pid per line followed by a linefeed, but as stated in the plan I will also need to import the lastproc sid, so I guess it will soon change to something like the following (or however I can get the SQL query to save it).

        pidnumber, sidnumber linefeed
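If the file does end up as one "pid, sid" pair per line, splitting on the comma is enough. A sketch with made-up numbers:

```perl
use strict;
use warnings;

# Example line in the proposed "pidnumber, sidnumber" format
# (the numbers here are invented).
my $line = "100001, 536192\n";
chomp $line;

# Split on the comma, allowing optional whitespace around it.
my ( $pid, $lproc ) = split /\s*,\s*/, $line;
print "pid=$pid lproc=$lproc\n";    # pid=100001 lproc=536192
```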

        A raw sortie list page (I need to grab the sids):

        What we need here is the number after each "sid=", which will create a new list of pages to be processed.
        The last <TR></TR> block seen here and the ones that follow each contain one sid (well, actually the entire link for it). The layout might vary a bit, though, as we might be looking at several padded pages.
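Pulling the number after each "sid=" out of the page can be done with a single global match. A sketch against a made-up HTML fragment standing in for the real sortie list page:

```perl
use strict;
use warnings;

# Invented fragment standing in for the downloaded sortie list page;
# the real <TR> blocks each carry one link containing a sid.
my $content = <<'HTML';
<TR><TD><a href="sortie.jsp?sid=536192&pid=100001">view</a></TD></TR>
<TR><TD><a href="sortie.jsp?sid=536201&pid=100001">view</a></TD></TR>
HTML

# In list context, /g collects every six-digit capture after "sid=".
my @sids = $content =~ /sid=(\d{6})/g;
print "@sids\n";    # 536192 536201
```

Because the match is anchored on the `sid=` text rather than the surrounding markup, it should cope with the layout varying a bit between padded pages.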

        The actual data, though, is contained in the sortie details pages, where it's spread all over and which also contain random-length lists (see picture in another answer).

        print qq{\nDone!\n};

        What's with all the q's in the code? I copied this and it doesn't seem to run.

        We've passed this now with your latest post, but I'd like to step back to see what I did, for learning's sake.

        When I copied this it just hung and only loaded one page, I think; it never gave any error messages or exited. I adapted it a bit by using some of your new bits but basically keeping my own loop. This seemed to work pretty well, as it would churn through all pids in the list and get pages until it hit the last one and then move on (after a bug fix).

        #! /usr/bin/perl
        use strict;
        use warnings;
        use LWP::Simple qw(get);
        use File::Slurp;

        my $pbase   = 'http://csr.wwiionline.com/scripts/services/persona/sorties.jsp';
        my $pidfile = 'c:/scr/pidlist.txt';
        my $lproc   = '536192';

        # Open list of pid's
        open my $pidlist, "<", $pidfile or die "Could not open $pidfile: $!\n";

        # Loop over list of pids one at a time
        while (my $pid = <$pidlist>){
            chomp $pid;
            print "Current persona processed: $pid\n";
            my $pcnt = 1;
            while (1) {
                my $url     = "$pbase?page=$pcnt&pid=$pid";
                my $content = get($url);
                die "\nGet failed for $url: $!\n" unless $content;

                # Exit loop if page is empty
                last if $content =~ /No sorties/;

                # Store grabbed webpage into the file
                append_file( "c:/scr/$pid.txt", $content );

                # Exit loop if page contained last processed.
                last if $content =~ /"sid=$lproc"/;

                # Update page number and run loop again.
                print "Page $pcnt\n";
                $pcnt++;
            }
        }

        # Close files
        close $pidlist or die $!;
        print "\nDone!\n";

        The serious bug it had previously was that the page count was defined early in the script, outside the loop, which meant that pages 1 through x were processed for pid 1, then pages x onward for successive pids; NOT good! Moving the variable definition inside the loop fixed it.

        When the "No sorties" string was encountered it exited the loop properly, but the second condition, finding $lproc, doesn't work; it never triggers, even if I set a number I know it will find a few pages down in one pid.

        Is there a particular reason you split the content retrieval into two lines from

        my $page = get "$pbase?page=$pcnt&pid=$pid" ;

        to

        my $url     = "$pbase?page=$pcnt&pid=$pid";
        my $content = get($url);

        From what I can tell they do exactly the same thing, just using an extra variable.

        Thanks again!

      Wohoo, think I sorted the loop!

      while (1) {
          my $page = get "$pbase?page=$pcnt&pid=$pid";
          last if $page =~ /No sorties/;

          # Store grabbed webpage into the file
          append_file( "c:/scr/$pid.txt", $page );

          last if $page =~ /"sid=$lproc"/;

          # Update page number and grab next.
          $pcnt++;
      }

      I'm sure the whole thing can be made prettier and more efficient, but now it seems to work as intended, grabbing the first set of data I need.

      Now I need to sort out the bit where it does the same for all pids in the pidlist, preferably at the same time extending it so it extracts both the pid and the lproc data from the settings file.

Re^2: collect data from web pages and insert into mysql
by SteinerKD (Acolyte) on Jul 31, 2010 at 15:04 UTC

    Wfsp asked for more info, so I grabbed some screenies to better explain what I need to do.

    First I need to export from a db a list of persona IDs (pid) to process, each paired with the id of the last processed sortie (sid).

    The script will now read the first pid and sid, load the first sortie list page for the persona, and store the content, and keep doing so until it finds an empty page or one containing the last processed sortie.
    Image of sortie page.

    Now I need to process the stored code and extract all sids found that are new and make a list of that for processing.

    Next step, grab all the data I need from each sortie detail page in the list (URL constructed from the grabbed sids).
    Image of sortie details page with the data I need. (Link fixed)
    Store the data in a way that can be imported into a DB. When the last sid is done we're also done with the persona, so load the next pid/lproc and start over (saving the highest-numbered sid to update the lastproc stat).
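For that final step, DBI with placeholders is the usual way to get rows into MySQL. A sketch only: the database, table, and column names below are invented, and the connection details are placeholders:

```perl
use strict;
use warnings;
use DBI;

# All names here (database, table, columns, credentials) are
# invented placeholders; substitute your real schema.
my $dbh = DBI->connect(
    'DBI:mysql:database=csr;host=localhost',
    'user', 'password',
    { RaiseError => 1 },
);

# Placeholders (?) let the driver quote the values safely.
my $sth = $dbh->prepare(
    'INSERT INTO sorties (pid, sid, kills, deaths) VALUES (?, ?, ?, ?)'
);

# One execute per sortie row extracted from the details pages.
for my $row ( [ 100001, 536192, 3, 1 ] ) {
    $sth->execute(@$row);
}
$dbh->disconnect;
```

Preparing the statement once and executing it per row also keeps the per-persona import fast.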