New Novice has asked for the wisdom of the Perl Monks concerning the following question:

Hi,

I am writing a program that extracts information from websites. As the websites all use the same format, I use a loop to extract the information I am interested in. For some documents, I get the error message

substr outside of string at extract.pl line 187
use of uninitialized value in string eq at extract.pl line 182
use of uninitialized value in concatenation (.) at extract.pl line 185
use of uninitialized value in concatenation (.) at extract.pl line 185

This message is then repeated endlessly (presumably because of the loop?). What causes the "substr outside of string" warning?

Here is the code the messages refer to; the first line is number 178:

if ($content_sub2=~'Rapporteur') {
    $ch=substr($', $ch_count, 1);
    my $ch_count2=0;
    while ($ch_count2<2) {
        if ($ch eq" ") {
            $ch_count2++;
        }
        $d_ep_rap=$d_ep_rap.$ch;
        $ch_count++;
        $ch=substr($', $ch_count, 1);
    }
}

Thank you

Replies are listed 'Best First'.
Re: Substr warning
by davido (Cardinal) on Sep 24, 2004 at 14:58 UTC

    After your code acts upon the first HTML file, do you ever reset $ch_count to zero, or does it keep getting incremented with each subsequent file? The code snippet you've provided doesn't tell us that. Also, does $ch_count start out as undef?

    In particular, the first scenario could get you into trouble. If you're acting upon multiple files, and $ch_count has grown to some value that exceeds the length of the postmatch string, you've got a problem.
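
    The same problem can arise within a single file: if fewer than two spaces remain after the starting offset, substr eventually runs off the end and returns undef, $ch never equals " " again, and the while loop never finishes, which would explain the endlessly repeated warnings. Below is a minimal sketch of a bounded version of that loop. The helper name take_two_words and the sample strings are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: copy characters from $text until two spaces have
# been seen, but stop at the end of the string instead of reading past it.
sub take_two_words {
    my ($text) = @_;
    my $out    = '';
    my $spaces = 0;
    my $pos    = 0;    # local to each call, so it can never carry over

    # The length check is the guard the original loop lacks.
    while ($spaces < 2 && $pos < length $text) {
        my $ch = substr($text, $pos, 1);
        $spaces++ if $ch eq ' ';
        $out .= $ch;
        $pos++;
    }
    return $out;
}

print take_two_words('Jean Dupont and more text'), "\n";
print take_two_words('Jean'), "\n";    # fewer than two spaces: still terminates
```

    Because the position counter is declared inside the sub, it starts at zero on every call, which also sidesteps the carry-over scenario described above.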


    Dave

      Hi,

      I always reset ch_count. I use this code structure in a number of places, and each time I reset ch_count.

Re: Substr warning
by JediWizard (Deacon) on Sep 24, 2004 at 14:36 UTC

    This is a little off topic, but I think it is of note anyway. You might want to consider changing your code so as not to use $'. The variables $', $` and $& are not set by perl unless you use them in your code. If you use any of these variables for one regex, perl must supply them for all regexes (performance hit, memory hit). I don't see any reason that would stop you from modifying your regex so that the desired additional characters are returned in $1. This might help avoid the errors you are seeing, as well as boost the performance of your script.
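
    A minimal sketch of that idea, assuming the marker is a literal string: a trailing capture group picks up the text that $' would otherwise have held. The helper name text_after and the sample content are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return everything that follows $marker in $text,
# or undef when the marker does not occur. \Q...\E quotes the marker so
# it is matched literally; /s lets . match newlines as well.
sub text_after {
    my ($text, $marker) = @_;
    return $text =~ /\Q$marker\E(.*)/s ? $1 : undef;
}

my $content = 'Committee: ENVI Rapporteur: Jean Dupont Date: 2004';    # made-up sample
my $rest    = text_after($content, 'Rapporteur');
print "$rest\n" if defined $rest;
```

    The captured text lives in an ordinary lexical variable, so a later regex match cannot silently change it the way it can change $'.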

    May the Force be with you
Re: Substr warning
by Happy-the-monk (Canon) on Sep 24, 2004 at 14:09 UTC

    The "use of uninitialized value" messages mean that your variable $ch is actually undefined at that point in the loop.
    "substr outside of string"   means you are trying to get at characters where there is no string left to get them from.
    Imagine you're trying to get the second character of a string that's only one character long.

    Cheerio, Sören

      Hi,

      thanks for this. There is plenty of string left, so maybe there is a problem with special characters. The problem occurs at least once while trying to read in a French name (with an accent grave).

      Is there a way to tell perl to ignore these special characters for the moment, so that I can transform them later?

      Cheers!

        No, you're barking up the wrong tree. That warning can only mean that your starting index is bigger than the length of the string or, if negative, reaches back past the start of the string (with a negative index, Perl counts backwards from the end of the string). Special characters have nothing at all to do with it.
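
        A small sketch of the positive-index boundary cases, using a made-up three-character string. The warning handler is only there so the script can count the warnings instead of printing them.

```perl
use strict;
use warnings;

# Collect warnings in an array rather than letting them go to STDERR.
my @warnings;
$SIG{__WARN__} = sub { push @warnings, $_[0] };

my $s = 'abc';                        # length 3

my $in_range = substr($s, 2, 1);      # last character
my $at_end   = substr($s, 3, 1);      # offset == length: empty string, no warning
my $past_end = substr($s, 4, 1);      # offset > length: undef, with a warning

printf "in range: '%s'\n", $in_range;
printf "at end:   '%s'\n", $at_end;
printf "past end: %s\n",   defined $past_end ? "'$past_end'" : 'undef';
print  "warnings: ", scalar(@warnings), "\n";
```

        Only the last call warns, and the undef it returns is what later triggers the "uninitialized value" messages when it is compared or concatenated.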

        As an aside: I think your approach is wrong. You're tackling this as if you're programming in C. Perl has better ways to parse strings than this extremely low level stuff: you really should take a step back, decide what you're actually trying to achieve, and use a regex or maybe two, to do the same job.

Re: Substr warning
by TrekNoid (Pilgrim) on Sep 24, 2004 at 15:21 UTC
    There's not a lot to go on here, but my gut feeling is that $ch_count must be getting larger than the string you're looking at.

    It looks to me like you're trying to grab the first two words out of each line... maybe a different approach would work better?

    my $line = 'This is a test';
    my (@words) = split( /\s+/, $line);
    my $twowords = $words[0] . ' ' . $words[1];
    print "$twowords\n";
    Trek
Re: Substr warning
by Jenda (Abbot) on Sep 24, 2004 at 14:55 UTC

    Could you explain what the code is supposed to do? It seems to me it could be written in one or two lines, but I just can't get it straight. Maybe showing and explaining a bigger chunk would help. All in all, it looks to me like you are making things much more complex than they have to be.

    Jenda
    Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.
       -- Rick Osborne

      Hi,

      I am extracting information from a html-site, which I have stripped of tags, etc. beforehand. In this instance, I am interested in the two words following the string "Rapporteur: ".

      There is probably a better way of doing it, but at the moment I am using $' to get the text after the string I searched for. I then read in one character at a time and add it to my variable until two of the characters have been equal to " " (thus, I have two words).

      I am confused about the substr warning, as I am sure that there are plenty of characters left in the substring.

      Hope this makes my dilemma clearer.

        Yes it does. You can (and should) do this with a single regex. Like this:

        if ($html =~ /Rapporteur: (\S+ \S+)/) { $reporters_name = $1; }

        The regexp searches for "Rapporteur:" followed by a single space, then some non-space characters, a single space and again some non-space characters. It captures the two groups of non-space characters, together with the space between them, in $1.
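
        If the stripped text might contain more than one space or a line break between the words, a slightly more tolerant variant could look like the sketch below. The sub name rapporteur_name and the sample strings are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the two words after "Rapporteur:" joined by
# a single space, or undef when the pattern does not match.
sub rapporteur_name {
    my ($text) = @_;
    # \s+ accepts any run of whitespace; \S+ accepts any non-whitespace
    # characters, so an accented letter such as \x{e9} is matched too.
    return $text =~ /Rapporteur:\s+(\S+)\s+(\S+)/ ? "$1 $2" : undef;
}

my $name = rapporteur_name("Committee: ENVI Rapporteur:  Ren\x{e9}   Dupont (PSE)");
print defined $name ? "found a name\n" : "no match\n";
```

        Returning undef on failure lets the caller tell "no rapporteur listed" apart from an empty name, and nothing in the match depends on a character counter.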

        Jenda