langsor has asked for the wisdom of the Perl Monks concerning the following question:

Hello all-knowing Perl Monks,

This is my first offering but not my first visit past these gates.
I apologize in advance if this question is too large (too much text)!

I am trying to automate an LWP lookup at a doctor-search web site by importing a list of doctor’s names from a text file and printing the results to an html file (as well as displaying them in the browser window). I have rewritten this chunk as many times and ways as my skill and knowledge provides – and now I turn to you!

This doctor-search site will either return
  * doctorname “not found”
  * the doctor’s info
  * an options-list of found doctors by doctor-license number
When the site returns an options list, I pull out the option-values and re-search for each option.

What is/is NOT happening:
When my imported list has only one name, the script works great.
When the list has multiple names (as intended) I get “not found” results (thus a response from their site) from all the searches except the last name from my list, which then gives me the intended doctor’s info.

What I’ve tried:
I tried slowing down the script thinking maybe I was overwriting existing UserAgent requests… but decided this wasn’t the case since I CAN receive all the responses for the options-list to one doctor’s name (if it’s the last on the list), which entails multiple requests.

I broke down each code-chunk and tested it as a unique entity. They all work.


This is the core code chunk:
use LWP::UserAgent; use HTTP::Request::Common qw(POST); $ua = LWP::UserAgent->new; $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)"); &ParseForm; ### this sub-routine not included here ### $input = $key{'input'}; $output = $key{'output'}; if( -r $input ){ open(IN, "<$input") || print "$!"; @list = <IN>; close(IN); }else{ print "<h3>Could not read from INPUT file</h3>"; exit; } &ParseList; exit; sub ParseList { foreach $line (@list){ my($first, $last) = split(/\|/, $line); my $content = AimAgent($first,$last); if( $content =~ /<option/gi ){ $_ = $content; while( /<option.*?value=\"(.*?)\">/gi ){ my $profile = AimAgent($first,$last,$1); PublishProfile($first,$last,$profile); ### not included ### } }else{ PublishProfile($first,$last,$content); ### not included ### } } } sub AimAgent { my($first, $last, $license) = @_; my $request = HTTP::Request->new(POST=>"http://cgi.docboard.org/cgi- +shl/nhayer.exe"); $request->content_type("application/x-www-form-urlencoded"); if( $license ){ $request->content("form_id=medname&state=na&medlname=$last&medfnam +e=$first&mednumb=$license"); }else{ $request->content("form_id=medname&state=na&medlname=$last&medfnam +e=$first"); } my $response = $ua->request($request); if( $response->is_success ){ return $response->content; }else{ return $response->status_line; } }

Replies are listed 'Best First'.
Re: Batch process LWP search
by langsor (Novice) on Nov 05, 2005 at 01:04 UTC
    I just want to thank all of you for your responses!

    I have spent some time trying to clean up my code and I got rid of that old home-rolled 'ParseForm' code –- and am now using 'CGI.pm' for the job.

    I’m not great at purification but it’s never too late to learn to use ‘strict’ and ‘warnings’ which my code now passes, but I still get a CGI header warning when I apply the taint –T flag.

    Special thanks to BUU, as soon as I read what you wrote about chomp I knew you were right –- and you were! Such a simple thing hanging days worth of effort and rewrites. That seem to be how it goes sometimes.

    Thank you all again,

    langsor
    ~still so much to learn

    If any are interested -- here is how the entire code stands now. Maybe not great, but working.
    #!perl BEGIN { $|=1; print "Content-type: text/html\n\n"; use CGI::Carp('fatalsToBrowser'); } use strict; use warnings; use CGI qw(:standard escapeHTML); use LWP::UserAgent; my $ua = LWP::UserAgent->new; $ua->agent("Mozilla/4.0 (compatible; MSIE 5.0; Windows 95)"); &ParseList; exit; sub ParseList { my(@list, $line, $first, $last, $content, $profile); if( param('input') =~ /^(\w.*)$/ && -r $1 ){ open(IN, "<$1") || print "$!"; @list = <IN>; close(IN); }else{ print "<h3>Could not read from INPUT file</h3>"; exit; } foreach $line (@list){ chomp($line); ($first, $last) = split(/\|/, $line); $content = AimAgent($first,$last); if( $content =~ /<option/gi ){ while( $content =~ /<option.*?value=\"(.*?)\">/gi ){ $profile = AimAgent($first,$last,$1); PublishProfile($first, $last, $profile); } }else{ PublishProfile($first, $last, $content); } } } sub AimAgent { my($first, $last, $license, $request, $response); ($first, $last, $license) = @_; $request = HTTP::Request->new(POST=>"http://cgi.docboard.org/cgi-shl +/nhayer.exe"); $request->content_type("application/x-www-form-urlencoded"); if( $license ){ $request->content("form_id=medname&state=na&medlname=$last&medfnam +e=$first&mednumb=$license"); }else{ $request->content("form_id=medname&state=na&medlname=$last&medfnam +e=$first"); } $response = $ua->request($request); if( $response->is_success ){ return $response->content; }else{ return $response->status_line; } } sub PublishProfile { my($first, $last, @people, $person, $relevant, $profile, $html, $out +put); ($first, $last, @people) = @_; foreach $person (@people){ $person =~ /CENTER><B><B><B>(.*?)Please read the AIM/gi; $relevant = $1; $relevant =~ s/<img.*?>//gi; $relevant =~ s/<a href.*?>//gi; $relevant =~ s/<font.*?>//gi; $relevant =~ s/<b>//gi; $relevant =~ s/<\/font>|<\/a>|<\/b>//gi; $profile = <<endline; <table class="profile" width=350 align="center" border=1> <tr> <td>$relevant</td> </tr> </table> endline } $html = <<endline; <html> <head> <title>AIM Search</title> <meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1 +"> <link rel="stylesheet" type="text/css" href="aim.css"> </head> <body> <div align="center"> <table class="head" width=500 border=1 align="center"> <tr> <td><span class="title">$first $last</span></td> $profile </td> </tr> </table> <span class="label">Please Read Output File For Details</span><br><b +r> <a href="index.html"><span class="label">CLICK HERE TO RETURN TO SEA +RCH PAGE</span></a><br></div> </div> </body> </html> endline if( param('output') =~ /^(\w.*)$/ && -w $1 && -T $1){ $output = $1; open(OUT, ">>$output") || print "$!"; }else{ open(OUT, ">>results.html") || print "$!"; } print OUT $html; print $html; close(OUT) || print "$!"; }
Re: Batch process LWP search
by BUU (Prior) on Nov 04, 2005 at 23:42 UTC
    First off, you're using something called: &ParseForm which sounds like it comes from "cgi-lib.pl", which has been deprecated for security and design reasons for something like 5 years now.

    Secondly, you use globals wrongly in several places, for example, sub ParseList should take @list as an argument, and you should declare all the variables you use inside there as lexicals local to the subroutine.

    Thirdly I notice you don't "chomp" or otherwise remove the line endings from the lines you read from the filehandle. From this description you've given it sounds like this is the cause of your main problem.

    When you do <IN> you get a line terminated by the line ending, which you then try to feed to the website, which probably can't find the number plus the line ending. This would also account for the last line in a multiple line file working fine, as you don't add a line ending to the last line.
Re: Batch process LWP search
by diotalevi (Canon) on Nov 04, 2005 at 21:35 UTC

    You didn't post the part of the code that has the bug. Try applying strict and warnings to your script. When you've fixed it so it passes both, you may have found your problem. You'll certainly have better code. As-is, you're using lots of globals and declaring little and that's often a bad sign.

Re: Batch process LWP search
by InfiniteSilence (Curate) on Nov 05, 2005 at 00:15 UTC
    Real bummer, I got the following message when I tried to visit this site:

    # ip address of your computer: # ip space of your lan [if applicable] # ip address of firewall/proxy server Note: The above information is p +rivate and will not be shared with ANYONE. It is needed however by te +chsupport to resolve the problem. HTTP 403.6 - Forbidden: IP address rejected Technical Information (for support personnel) * Background: This error is caused when the server has a list of IP addresses +that are not allowed to access the site, and the IP address you are u +sing is in this list.

    Perhaps you are running into a server security issue?

    Celebrate Intellectual Diversity