New Novice has asked for the wisdom of the Perl Monks concerning the following question:

I want to retrieve information from a number of pages that all have the same structure, so I would like to retrieve them via GET and process them in a loop. However, GET seems to store the information in a hidden variable and append to it: instead of replacing the content of my "temporary" variable, the content of each new website is appended to the content of all previously retrieved ones. As a result, my processing (extracting various pieces of information from the "temporary" content variable and assigning them to other variables) always yields the same results, those from the first website retrieved. Because I rely on the structure of the website to identify the necessary pieces of information, I need the content variable to be replaced every time.

I already tried global and local variables. Furthermore, I "empty" the content variable every time prior to retrieving a new site. None of this works.

Here is the (abbreviated) code:

use LWP::Simple;
foreach $ID (@ID) {
    $content=" ";
    $url="http:://...docid=$ID";
    $content=get($url);

    # subsequently I extract information from the string
    # $content, which only serves as a contemporary storage

    Open FILE, ">> C:/perl/output.txt";
    Print FILE $content;
    # actually it's all the other variables whose values I
    # want to save
}

Replies are listed 'Best First'.
Re: Using GET in a loop
by EdwardG (Vicar) on Sep 17, 2004 at 09:09 UTC

    perldoc -f open

    If MODE is '>>', the file is opened for appending, again being created if necessary.

    Try this -

    open FILE, "> C:/perl/output.txt"; # <-- notice the single '>'
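
The difference between the two modes quoted from perldoc can be checked with a minimal sketch. The scratch file and the helper names `write_line` and `read_lines` are assumptions made for illustration; they do not come from the original post.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical scratch file so the sketch does not touch real data.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh;

# Write one line to $path using the given mode ('>' or '>>').
sub write_line {
    my ($mode, $text) = @_;
    open my $fh, $mode, $path or die "Cannot open $path: $!";
    print {$fh} "$text\n";
    close $fh or die "Cannot close $path: $!";
}

# Return the file's lines as a list, without trailing newlines.
sub read_lines {
    open my $fh, '<', $path or die "Cannot open $path: $!";
    chomp(my @lines = <$fh>);
    close $fh;
    return @lines;
}

write_line('>>', 'first');
write_line('>>', 'second');
my @appended = read_lines();     # '>>' keeps what was already there

write_line('>', 'third');
my @truncated = read_lines();    # '>' empties the file before writing

print "after '>>': @appended\n";
print "after '>':  @truncated\n";
```

With '>>' the file grows on every write, which is what the original poster wants for the output file; '>' only helps when each run should start from an empty file.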

     

      I want to append to the file, but with new information.

      $content gets appended, that is the problem. Every time the loop is executed, $content gets bigger and bigger as the newly retrieved website is added to the previously retrieved websites. It still does so when I "empty" $content ($content= " ";) every time before retrieving a new website and use a local $content variable.

      Thus, my problem is not with the output of the data into the file but with the input.

      Thank you for your rapid reaction!

        As Zaxo points out, your code sample doesn't run. Can I be so bold as to suggest that your (abbreviated) code is not useful for troubleshooting, and that you should cut and paste the actual code, the code that contains the bug?

         

Re: Using GET in a loop
by Zaxo (Archbishop) on Sep 17, 2004 at 09:19 UTC

    Isn't your program failing from undefined 'Open' and 'Print'? The perl builtins are all lower-case.

    Anyhow, what else is failing? You explicitly place the results all in the same file by trying to open to append. You might as well open the output file before the loop and close afterwards. Also, your url looks fishy. Is that just to hide the address you're scraping? Lexical $content should get rid of previous $content just fine.

    After Compline,
    Zaxo
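
A minimal sketch of the two suggestions above: open the output file once before the loop, and declare a lexical `$content` inside it. The `fetch_page` sub is a hypothetical offline stand-in for LWP::Simple's `get()`, and the document IDs and the temporary output file are assumptions made for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for LWP::Simple's get(), so the sketch runs offline.
sub fetch_page {
    my ($id) = @_;
    return "<html>document $id</html>";
}

my @ids = (101, 102, 103);                  # hypothetical document IDs

# Hypothetical output file; a real script would name its own path.
my ($tmp_fh, $outfile) = tempfile(UNLINK => 1);
close $tmp_fh;

# Open once, in append mode, before the loop starts.
open my $out, '>>', $outfile or die "Cannot open $outfile: $!";

for my $id (@ids) {
    my $content = fetch_page($id);          # fresh lexical on every pass
    print {$out} "$id: ", length($content), " bytes\n";
}

# Close once, after the loop has finished.
close $out or die "Cannot close $outfile: $!";
```

Because `$content` is declared with `my` inside the loop body, nothing from the previous iteration can survive in it, so any accumulation has to come from somewhere else.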

      looks like you should be giving

      use strict; # no more loose variables among other things
      use warnings; # no more undefined functions among other things

      a try.

      Cheers, Sören

        I am using strict and warnings. Makes no difference. Cheers!
Re: Using GET in a loop
by davidj (Priest) on Sep 17, 2004 at 09:47 UTC
    Since it doesn't even compile, the code you have posted is obviously not the code you are using, but a stripped-down version for this question. I would suggest that you post the code you are using. I guess you could dummy the urls you are fetching, but the rest of the code should be posted as is.

    davidj

      Here is the compilable code. I thought it would be easier to focus on the problem directly. Sorry about any inconvenience caused.

      #! C:/programme/perl
      use LWP::Simple;
      use LWP::UserAgent;
      use HTML::Stripper;
      use warnings;
      use strict;

      our $stripper = HTML::Stripper->new(
          skip_cdata => 1,
          strip_ws   => 1
      );
      our $ID;
      our @ID=(161060, 160920, 160999, 160899);
      our $count=1;

      foreach $ID (@ID) {
          my $content;
          my $content_full;
          my $url="http://europa.eu.int/prelex/detail_dossier_real.cfm?CL=en&DosId="."$ID";
          $content_full=" ";
          $content_full=get($url);
          $content=$stripper->strip_html($content_full);

          our $i_type=index($content, " COM ");
          our $d_type=substr($content, $i_type+1,3);
          our $d_year=substr($content, $i_type+6,4);
          our $d_number=substr($content, $i_type+12,3);
          our $proposal="$d_type "."\($d_year\)"." $d_number";
          print "Proposal\: $proposal \n";

          open DB, ">> C:/programme/perl/test/prelex.dta" or die "Problem: $!";
          flock (DB, 2);
          print DB "$proposal\n";
          close DB;
      }

        By "focusing on the problem", you managed to focus away the part of the code with the bug.

        Now, with your assumptions tempered, it is obvious that the problem lies in the re-use of the $stripper object.

        Easiest solution; make a new $stripper in each iteration.

        foreach $ID (@ID) {
            $stripper = HTML::Stripper->new( ... );
            ...
        }

        And in case it isn't clear, this has nothing to do with get().
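
Why re-using an object can look like "appending" is easier to see with a toy example. The `ToyStripper` class below is hypothetical and is not the real HTML::Stripper; it simply assumes, as the reply suggests, that the object keeps an internal buffer between calls.

```perl
use strict;
use warnings;

# Hypothetical toy class, NOT the real HTML::Stripper: it keeps everything
# it has ever been given in an internal buffer, which is the kind of state
# that would produce the behaviour described in this thread.
package ToyStripper;

sub new {
    my ($class) = @_;
    return bless { buffer => '' }, $class;
}

sub strip_html {
    my ($self, $html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # naive tag removal
    $self->{buffer} .= $text;             # state survives between calls
    return $self->{buffer};
}

package main;

my @pages = ('<p>COM (2000) 1</p>', '<p>COM (2001) 2</p>');

# Reused object: the second result still starts with the first page.
my $shared = ToyStripper->new;
my @reused = map { $shared->strip_html($_) } @pages;

# Fresh object per page: each result holds only its own page.
my @fresh = map { ToyStripper->new->strip_html($_) } @pages;

print "reused: $_\n" for @reused;
print "fresh:  $_\n" for @fresh;
```

With the shared object, a search for the first "COM" always lands in the text of the first page, which would match the symptom of getting the same result on every iteration.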

         

        Well, now the next question: is it $content_full or $content that is getting appended to instead of replaced? That is, is it the result of LWP::Simple's get function or HTML::Stripper's strip_html function that is not working correctly?

        davidj
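
One way to answer that question is to print the length of both variables on every pass through the loop. In this sketch `fetch` and `strip` are hypothetical offline stand-ins for `get()` and `strip_html()`, and `strip` deliberately simulates a leaky buffer so the diagnostic has something to show.

```perl
use strict;
use warnings;

# Hypothetical stand-ins so the sketch runs offline; in the real script
# these would be LWP::Simple's get() and the stripper's strip_html().
my %fake_site = (1 => '<b>one</b>', 2 => '<b>two!</b>');
sub fetch {
    my ($id) = @_;
    return $fake_site{$id};
}

my $leaky_buffer = '';
sub strip {
    my ($html) = @_;
    (my $text = $html) =~ s/<[^>]*>//g;   # naive tag removal
    $leaky_buffer .= $text;               # deliberately simulates the bug
    return $leaky_buffer;
}

# Record the size of the raw and the stripped text for every document.
my @report;
for my $id (sort keys %fake_site) {
    my $raw      = fetch($id);
    my $stripped = strip($raw);
    push @report, sprintf('id=%s raw=%d stripped=%d',
                          $id, length($raw), length($stripped));
}
print "$_\n" for @report;
```

If the raw length follows the size of each page while the stripped length only ever grows, the stripping step is the one that accumulates, not the fetch.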

Re: Using GET in a loop
by ccn (Vicar) on Sep 17, 2004 at 09:09 UTC

    use LWP::Simple;
    foreach $ID (@ID) {
        $content=" ";
        $url="http:://...docid=$ID";

        # this is completely NEW value
        $content=get($url);

        # here you APPEND new value to the previous values
        open FILE, ">> C:/perl/output.txt";
        print FILE $content;
    }
      This is exactly what I expected to happen, but it doesn't.

      Each time the (global or local) variable $content gets a new value assigned (via GET), the content of the variable is not replaced by the new value (content of the newly retrieved website) but it is appended.

      This is the case even when I explicitly "empty" it ($content= " "). Apparently, GET uses an internal variable to store the information retrieved, which is appended. Thus, each time the loop is executed $content simply gets bigger and bigger as this internal variable adds the content of newly retrieved webpages to all previously retrieved ones.

      Thank you for your response!