Is this a memory rewriting or else....!

Kanishka has asked for the wisdom of the Perl Monks concerning the following question:

I have a very curious problem. I'm sorry that i have to pulish the long code. The script gives the desired result ONLY once. If you run the script u'll know what i mean. I run this on W2000 OS.

#!/usr/bin/perl -w
use strict;
use LWP::UserAgent;
use HTTP::Cookies;

system 'del BechtleResult.txt';

initHttp();

login();
exit 0;

sub login{
    local $1;
    local $2;
    my $content;
    
    $content = sendRequest('GET', 'http://www.bechtle.be/');

    $content =~ m@\<frame frameborder="0" name="[^"]*" noresize scroll
+ing="auto"[ \n\r\t]*?src="([^"]*)"@s; # [ \t\n\r]*?src="([^"]*)"
    my $url = $1;
    
    $content = sendRequest('GET',$url);
    my $count = 0;
    while ($content=~ m@javascript:self.parent.sendRequest\('([^']*)'\
+).*?>([^<]+?)</a>@g ) {
        $url = $1;
        my $desc = $2;
        if ($count > 0) { #we ignore the first node since its the titl
+e
        print "> $desc\n";
        
        my $rs = sendRequest('GET',$url);
        parseMainNodes($desc, $rs);
        }
        $count ++;
    }
}

sub parseMainNodes {
    local $1;
    local $2;
    #my ($category, $content2) = @_;
    my $category = shift;
    my $content2 = shift;
    
    while ($content2=~ m@javascript:self.parent.sendRequest\('([^']*ex
+pand=true[^']*)'\).*?>([^<]+?)</a>@g ) {
      my $url = $1;
       my $desc = $2;
    
    #print "$url\n";
    
       print "> $category -> $desc\n";
       my $rs = sendRequest('GET',$url);
    #print $url."\n";
       parseSubNodes($category, $desc, $rs);
    #print $url."\n";
    }
}

sub parseSubNodes {
    local $1;
    local $2;
    #my ($category, $section, $content1) = @_;
    my $category = shift;
    my $section = shift;
    my $content1 = shift;
    
    $content1 =~ s@<!-- Produktvergleich -->.*@@s; #remove the junk af
+ter the tree section
    
    while ($content1=~ m@javascript:self.parent.sendRequest\('([^']*no
+dechip_id=[^']*)'\).*?>([^<]+?)</a>@g ) {
    htmDump($content1);
    my $url = $1;
    my $desc = $2;
       print "> $category -> $section -> $desc\n";

    getLstPage($url);
    }
}

sub getLstPage{
    local $1;
    local $2;
    my $prdLstUrl = shift;
    my $rs1 = sendRequest('GET',$prdLstUrl);
    #lstPgDump($rs);
    extractPrd($rs1);
}

######################

sub extractPrd{
    local $1;
    local $2;
    my $lstPgDump = shift;
    print "fetching product page \n";
    $lstPgDump =~ s!.*?Bestellen!!s;
    $lstPgDump =~ s!(.*Op fabrikant).*!$1!sg;
    
    my @prdLine = split (/<td style="text-align: center;">/, $lstPgDum
+p);
    shift @prdLine;
    
    foreach my $prditem(@prdLine){
        #print $prditem."\n";
        processitem($prditem);
    }
    
    #print $lstPgDump;
    #htmDump($lstPgDump);
}

sub processitem{
    my $item = shift;
    my @resLine;
    
    #desc
    if ($item =~ /title="([^"]*)"/){
        my $dsc = $1;
        push (@resLine, $dsc);
    }
    
    #pnb
    if ($item =~ m!<span class="small">.*?<.*?>(.*?)<!s){
        my $pnb = $1;
        push (@resLine, $pnb);
    }
    
    #price
    if ($item =~ m!<span class="price_brutto">[ \s\t]+(\&euro\;\&nbsp\
+;)*(.*?)<!s){
        my $price = $2;
        $price =~ s/[\s]*$//;
        push (@resLine, $price);
    }
    
    my $result = (join('","',@resLine ));
    print "\"$result\"\n";
    
    resDump($result);
}

sub shtmLoad {
    open OUT, '<lstPgDump.htm';
    local $/ = undef;
    my $html = <OUT>;
    close OUT;
    
    return $html;
}

sub resDump{
    my $reultLine = shift;
    open RESOUT, '>>BechtleResult.txt';
    print RESOUT "\"$reultLine\"\n";
    close RESOUT;
}
################################3



sub htmDump{
    my $html = shift;
    open OUT, '>htmlDump.htm';
    print OUT $html;
    close OUT;
}

sub lstPgDump{
    my $lstPg = shift;
    open OUT, '>lstPgDump.htm';
    print OUT $lstPg;
    close OUT;
}


#############################################

my $lasturl; #last url fetched
my $ua;      #user agent for http requests
my $cookies; #http cookies

#############################################
####
#inits the useragent and inits the cookies file
sub initHttp {
  $ua = LWP::UserAgent->new();
  $ua->agent('Mozilla/5.0');
  #$ua->proxy(http  => 'http://mahasen:8080');
  my $cookies = new HTTP::Cookies();
  $ua->cookie_jar($cookies);
}

sub sendRequest {
    #print "Sleeping...\n";
    #sleep(2);
    #print "Continuning...\n";
  my ($method, $url, $content, $referer) = @_;
  #$method = shift;
  #$url = shift;
  #$content = shift;
  #$referer = shift;
  
  $referer=$lasturl unless defined($referer);
  
  my ($request, $response);
  $request = HTTP::Request->new($method, $url);
  if (defined($content)) {
    $request->content($content);
    $request->content_type("application/x-www-form-urlencoded");
    $request->content_length(length($content));
  }
  if (defined($referer)) {
    $request->referer($referer);
  }
  $request->header(Connection => "close"); #force no keep alive
  $response = $ua->request($request);
  
  #if the response has a refresh 0 then handle that as well
  my $refresh = $response->header("Refresh");
  if (defined($refresh)) {
    $refresh =~ s!0; *URL=!!i;
    $refresh = makeurl($response->request->url, $refresh);
    $request = HTTP::Request->new("GET", $refresh);
    $response = $ua->request($request);
  }

  #if the resoponse has a Location then handle that as well
  $refresh = $response->header("Location");
  if (defined($refresh)) {
    $refresh = makeurl($response->request->url, $refresh);
    $request = HTTP::Request->new("GET", $refresh);
    $response = $ua->request($request);
  }

  #print $response->as_string;
#  if ($response->code==207 || $response->code==200 || $response->code
+==301) {
  if ($response->code==207 || $response->code==200) {
    $lasturl = $response->request->url;
#    return $response->as_string;
    return $response->content;
  } else {
    print "ERROR : Request \'$url\' Failed (".$response->status_line."
+)!\n";
    print "Last URL : ".$response->request->url."\n";
    return undef;
  }
}

sub makeurl {
  my $currenturl = shift;
  my $newurl = shift;
  
  if ($newurl =~ /^http[s]?:/) { #make sure that the newurl is not a a
+bsolute one
    return $newurl;
  } elsif ($newurl =~ /^\//) { #if the new url starts with a / then we
+ ignore the current path and take the hostname only
    $currenturl =~ m!([^:]+://[^/]+)!;
    $currenturl = $1.$newurl;
    return $currenturl;
  } else { #if the new url doesnt start with a / then we discard the l
+ast page of current url and append the new url
    $currenturl =~ m!([^:]+://.+/)!;
    $currenturl = $1.$newurl;
    return $currenturl;
  }
}
[download]

Edited by planetscape - added readmore tags

Comment on Is this a memory rewriting or else....! Download Code

Replies are listed 'Best First'.
Re: Is this a memory rewriting or else....! by Tanktalus (Canon) on Jan 24, 2006 at 05:07 UTC
"If you run the script u'll know what i mean" And for those of us who don't want to run this, perhaps you can give us any error messages or warnings that you get from the code when it doesn't work. Or you can tell us whatever you get from the code vs what you expect. The thing is, maybe with that information, someone who doesn't have the time or inclination to run your code (I fit one of those categories - I'll let you imagine which one) may be able to rattle off a quick answer that will help based on having seen the same symptoms in the past. Also, seeing as you are referencing external URLs, perhaps someone will later happen upon your node, but not see how the solution you get from someone would also be their solution. And, of course, because you don't have the error text in your node, it will make it harder for others to find your node when it is the same problem because Super Search won't know that your question is actually about their problem. If you follow the conventions of the monastary, you'll not only get more help, you'll actually provide more help as well - and PM is more about providing help than getting help.	[reply]
Re: Is this a memory rewriting or else....! by ikegami (Patriarch) on Jan 24, 2006 at 05:49 UTC
I didn't think Perl's regexp engine could process two regexps at the the same time — but I don't know for sure — yet I see `... while ($content=~ m@...@g ) { ... parseMainNodes($desc, $rs); ... sub parseMainNodes { ... while ($content2=~ m@...@g ) { ...` [download]	[reply] [d/l]
Re^2: Is this a memory rewriting or else....! by davido (Cardinal) on Jan 24, 2006 at 06:33 UTC
Actually, in that respect, Perl's regexp engine does handle mutiple "`m//g`" instances properly. Witness the following code: `use strict; use warnings; use diagnostics; my( $first, $second ) = ( "a1b2c3d4e5f6", "AaBbCc" ); while( $first =~ m/(\d)/g ) { print "\$first matched on $1\n"; while( $second =~ m/([a-z])/g ) { print "\t\$second matched on $1\n"; } }` [download] You might be thinking of this quote, from perltodo: A re-entrant regexp engine This will allow the use of a regex from inside `(?{ })`, `(??{ })` and `(?(?{ })\|)` constructs. Dave	[reply] [d/l] [select]