Using GET in a loop

New Novice has asked for the wisdom of the Perl Monks concerning the following question:

I want to retrieve information from a number of pages that all have the same structure, so I would like to retrieve them via GET and process their information in a loop. However, GET seems to store the information in a hidden variable and append to it. Instead of replacing the content of my "temporary" variable, the content of the new website is appended to the content of all previously retrieved ones. Thus, my information processing (extracting various pieces of information from the "temporary" content variable and assigning them to other variables) always yields the same results: those from the first website retrieved. Because I rely on the structure of the website to identify the necessary bits and pieces of information, I need the content variable to be replaced every time.

I already tried global and local variables. Furthermore, I "empty" the content variable every time prior to retrieving a new site. None of this works.

Here is the (abbreviated) code:

use LWP::Simple;
foreach $ID (@ID) {
 $content=" ";
 $url="http:://...docid=$ID";
 $content=get($url);
 # subsequently I extract information from the string  
 # $content, which only serves as temporary storage
 Open FILE, ">> C:/perl/output.txt";
 Print FILE $content; 
 # actually it's all the other variables whose values I 
 # want to save
}


Replies are listed 'Best First'.
Re: Using GET in a loop by EdwardG (Vicar) on Sep 17, 2004 at 09:09 UTC
From perldoc -f open: `If MODE is '>>', the file is opened for appending, again being created if necessary.` Try this: `Open FILE, "> C:/perl/output.txt"; # <-- notice the single '>'`
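To make the difference between the two modes concrete, here is a minimal sketch. It uses a scratch file from File::Temp instead of the poster's output path, and the helper names `write_line` and `slurp` are made up for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Scratch file standing in for the real output path (an assumption for the demo).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one line to $path using the given open mode ('>' or '>>').
sub write_line {
    my ($mode, $line) = @_;
    open my $fh, $mode, $path or die "Cannot open $path: $!";
    print {$fh} "$line\n";
    close $fh or die "Cannot close $path: $!";
}

# Return the file's current contents as a single string.
sub slurp {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    local $/;                      # read the whole file at once
    my $text = <$fh>;
    close $fh;
    return defined $text ? $text : '';
}

write_line('>>', 'first');
write_line('>>', 'second');        # lands after 'first'
my $appended = slurp();

write_line('>', 'third');          # file is truncated before writing
my $truncated = slurp();

print "after two appends:\n$appended";
print "after one truncating write:\n$truncated";
```

With '>>' the file keeps growing across runs and iterations; with '>' only the most recent write survives.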
Re^2: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 09:23 UTC
I want to append to the file, but with new information. $content gets appended to; that is the problem. Every time the loop is executed, $content gets bigger and bigger as the newly retrieved website is added to the previously retrieved websites. It still does so when I "empty" $content ($content= " ";) every time before retrieving a new website and use a local $content variable. Thus, my problem is not with the output of the data into the file but with the input. Thank you for your rapid reaction!
Re^3: Using GET in a loop by EdwardG (Vicar) on Sep 17, 2004 at 09:30 UTC
As Zaxo points out, your code sample doesn't run. Can I be so bold as to suggest that your (abbreviated) code is not useful for troubleshooting and that you should cut and paste the actual code, the code that contains the bug?
Re^4: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 09:56 UTC
Re^5: Using GET in a loop by EdwardG (Vicar) on Sep 17, 2004 at 10:17 UTC
Re: Using GET in a loop by Zaxo (Archbishop) on Sep 17, 2004 at 09:19 UTC
Isn't your program failing from undefined 'Open' and 'Print'? The perl builtins are all lower-case. Anyhow, what else is failing? You explicitly place the results all in the same file by trying to open to append. You might as well open the output file before the loop and close afterwards. Also, your url looks fishy. Is that just to hide the address you're scraping? Lexical $content should get rid of previous $content just fine. After Compline, Zaxo
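The suggestions above (lower-case builtins, one open before the loop, one close after it, a lexical $content) might look roughly like this. It is only a sketch: `fetch_page` is a made-up stand-in for LWP::Simple's get() so that it runs offline, and the output goes to a scratch file rather than the real path.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for LWP::Simple's get(), so the sketch runs offline.
sub fetch_page {
    my ($id) = @_;
    return "page for document $id";
}

my @ids = (161060, 160920);

# Scratch file in place of the real output path (an assumption for the demo).
my ($tmp, $path) = tempfile(UNLINK => 1);
close $tmp;

# Open once, before the loop; note the lower-case builtins.
open my $out, '>>', $path or die "Cannot open $path: $!";

for my $id (@ids) {
    my $content = fetch_page($id);   # lexical: a fresh variable on every pass
    print {$out} "$content\n";
}

# Close once, after the loop.
close $out or die "Cannot close $path: $!";
```

Because $content is declared with my inside the loop body, nothing from the previous iteration can leak into it.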
Re^2: Using GET in a loop by Happy-the-monk (Canon) on Sep 17, 2004 at 09:24 UTC
Looks like you should be giving `use strict; # no more loose variables, among other things` and `use warnings; # no more undefined functions, among other things` a try. Cheers, Sören
Re^3: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 09:33 UTC
I am using strict and warnings. Makes no difference. Cheers!
Re: Using GET in a loop by davidj (Priest) on Sep 17, 2004 at 09:47 UTC
Obviously, since it doesn't even compile, the code you have posted is not the code you are using, but a stripped-down version for this question. I would suggest that you post the code you are using. I guess you could dummy the urls you are fetching, but the rest of the code should be posted as is. davidj
Re^2: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 09:59 UTC
Here is the compilable code. I thought it would be easier to focus on the problem directly. Sorry about any inconvenience caused.

#! C:/programme/perl
use LWP::Simple;
use LWP::UserAgent;
use HTML::Stripper;
use warnings;
use strict;
our $stripper = HTML::Stripper->new( skip_cdata => 1, strip_ws => 1 );
our $ID;
our @ID=(161060, 160920, 160999, 160899);
our $count=1;
foreach $ID (@ID) {
 my $content;
 my $content_full;
 my $url="http://europa.eu.int/prelex/detail_dossier_real.cfm?CL=en&DosId="."$ID";
 $content_full=" ";
 $content_full=get($url);
 $content=$stripper->strip_html($content_full);
 our $i_type=index($content, " COM ");
 our $d_type=substr($content, $i_type+1,3);
 our $d_year=substr($content, $i_type+6,4);
 our $d_number=substr($content, $i_type+12,3);
 our $proposal="$d_type "."$$d_year$"." $d_number";
 print "Proposal\: $proposal \n";
 open DB, ">> C:/programme/perl/test/prelex.dta" or die "Problem: $!";
 flock (DB, 2);
 print DB "$proposal\n";
 close DB;
}
Re^3: Using GET in a loop by EdwardG (Vicar) on Sep 17, 2004 at 10:22 UTC
By "focusing on the problem", you managed to focus away the part of the code with the bug. Now, with your assumptions tempered, it is obvious that the problem lies in the re-use of the $stripper object. Easiest solution: make a new $stripper in each iteration. `foreach $ID (@ID) { $stripper = HTML::Stripper->new( ... ); ... }` And in case it isn't clear, this has nothing to do with get().
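For readers who want to see why a reused object can behave this way, here is a self-contained sketch. `Demo::StatefulStripper` is a made-up class, not the real HTML::Stripper; it simply assumes an object that keeps an internal buffer between calls, which would match the symptom described in this thread.

```perl
use strict;
use warnings;

# Hypothetical class, NOT the real HTML::Stripper: it keeps an internal
# buffer between calls, which is one way a reused object could produce
# ever-growing output.
package Demo::StatefulStripper;

sub new { return bless { buffer => '' }, shift }

sub strip_html {
    my ($self, $html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # crude tag removal, demo only
    $self->{buffer} .= $text;             # state survives across calls
    return $self->{buffer};
}

package main;

my @pages = ('<p>COM one</p>', '<p>COM two</p>');

# Reused object: the second result still contains the first page.
my $shared = Demo::StatefulStripper->new;
my @reused = map { $shared->strip_html($_) } @pages;

# Fresh object per iteration: each result contains only its own page.
my @fresh = map { Demo::StatefulStripper->new->strip_html($_) } @pages;

print "reused: $_\n" for @reused;
print "fresh:  $_\n" for @fresh;
```

If the real module behaves like this stand-in, index() on the stripped text would keep finding the first " COM " match, which would explain why every iteration reports the first page's values.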
Re^4: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 10:32 UTC
Re^4: Using GET in a loop by itub (Priest) on Sep 17, 2004 at 15:17 UTC
Re^3: Using GET in a loop by davidj (Priest) on Sep 17, 2004 at 10:22 UTC
Well, now the next question: is it `$content_full` or `$content` that is getting appended to instead of replaced? That is, is it the result of `LWP::Simple`'s `get` function or `HTML::Stripper`'s `strip_html` function that is not working correctly? davidj
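One way to answer that question is to log the length of both strings on every pass. The sketch below is only an illustration: `fetch_raw` and `strip_tags` are made-up stand-ins so it runs offline. In the real script they would be get($url) and $stripper->strip_html($content_full).

```perl
use strict;
use warnings;

# Hypothetical stand-ins so the sketch runs offline.
sub fetch_raw {
    my ($id) = @_;
    return "<html><p>doc $id</p></html>";
}

sub strip_tags {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # crude tag removal, demo only
    return $text;
}

my @report;
for my $id (161060, 160920, 160999) {
    my $content_full = fetch_raw($id);
    my $content      = strip_tags($content_full);

    # Record both lengths so growth in either string is easy to spot.
    push @report, sprintf("%d raw=%d stripped=%d",
        $id, length($content_full), length($content));
}

print "$_\n" for @report;
```

If the raw length stays roughly constant from page to page while the stripped length keeps climbing, the stripping step is the suspect rather than get().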
Re^4: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 10:28 UTC
Re: Using GET in a loop by ccn (Vicar) on Sep 17, 2004 at 09:09 UTC
use LWP::Simple;
foreach $ID (@ID) {
 $content=" ";
 $url="http:://...docid=$ID";
 # this is completely NEW value
 $content=get($url);
 # here you APPEND new value to the previous values
 open FILE, ">> C:/perl/output.txt";
 print FILE $content;
}
Re^2: Using GET in a loop by New Novice (Sexton) on Sep 17, 2004 at 09:30 UTC
This is exactly what I expected to happen, but it doesn't. Each time the (global or local) variable $content gets a new value assigned (via GET), the content of the variable is not replaced by the new value (content of the newly retrieved website) but it is appended. This is the case even when I explicitly "empty" it ($content= " "). Apparently, GET uses an internal variable to store the information retrieved, which is appended. Thus, each time the loop is executed $content simply gets bigger and bigger as this internal variable adds the content of newly retrieved webpages to all previously retrieved ones. Thank you for your response!