Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
My script aims to read in a dummy HTML file from my hard disk (E drive, as the disk is partitioned) and call a sub which runs a regex looking for image tags and replaces them with nothing. The interpreter is telling me that I have a global variable which needs an explicit package name, and I've tried to remedy this but it still won't work! I'm new to programming, so please bear with me. Here's the code:
    #!/usr/bin/perl
    #htmltest2.plx
    # Program will read in an html file, remove the img tag and print out
    # no need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");

    use warnings;
    use diagnostics;
    use strict;

    my @htmlLines;

    open INFILE, "E:\\Documents and Settings\\Richard Lamb\\My Documents\\HTMLworkspace\\HTML practice\\My First Page!\\firsttest\.html" or die ("Sod! Can't open this file.\n");
    @htmlLines = <INFILE>;

    scrapTag(); # calls method to remove image tags

    sub scrapTag # removes image tags from HTML document
    {
        while($htmlLines[$i] =~ m/<IMG\s+([^>]+)>/ig) # finds each instance of image tag in the input file
        {
            s/<IMG\s+([^>]+)>/ig//ig # replaces each instance of image tag with nothing!
        }
        return @htmlLines;
    }

    for my $i (0..@htmlLines-1)
    {
        print $htmlLines[$i];
    }
    print "\n\n";
    sleep 2;
    print "Success?!\n"

Any suggestions/hints as to where I'm going wrong?

Cheers,

Richard

update (broquaint): added formatting

Replies are listed 'Best First'.
Re: HTML image tag stripping
by Aristotle (Chancellor) on Aug 05, 2003 at 16:07 UTC

    You say $htmlLines[$i] but that subroutine has no $i, and in the main program it gets declared further down.

    When you mean "last index of the array", it is better to write $#htmlLines rather than @htmlLines-1.
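
    A minimal sketch of the two spellings side by side. The three-element array is made-up sample data, not the contents of the original file:

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from the file.
my @htmlLines = ("<html>\n", "<body>\n", "</html>\n");

# $#htmlLines is the index of the last element.
my $last_index = $#htmlLines;

# @htmlLines in numeric context is the element count, so subtracting 1
# yields the same number, just less directly.
my $count_minus_one = @htmlLines - 1;

print "last index: $last_index\n";
print "count - 1:  $count_minus_one\n";
```

    Both give the same value; $#htmlLines simply states the intent more directly.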

    Also, avoid using indices when you can do without them:

    for my $line (@htmlLines) { print $line; }
    Lastly, your scrapTag procedure is pretty defective with regard to valid HTML. You should parse the HTML, not just look for character sequences. Use one of the many excellent modules for that purpose - I recommend HTML::TokeParser::Simple:
        use warnings;
        use strict;
        use HTML::TokeParser::Simple;

        my $file = "E:\\Documents and Settings\\Richard Lamb\\My Documents\\HTMLworkspace\\HTML practice\\My First Page!\\firsttest\.html";

        my $parser = HTML::TokeParser::Simple->new($file)
            or die "Can't open $file: $!\n";

        # Print every token unchanged, except img tags
        while ( my $token = $parser->get_token ) {
            next if $token->is_tag('img');
            print $token->as_is;
        }

    Makeshifts last the longest.

Re: HTML image tag stripping
by Lachesis (Friar) on Aug 05, 2003 at 16:11 UTC
    Your error is because you haven't declared $i. Your substitution expression will also cause problems - you only need to put the modifiers at the very end so use s/<IMG\s+([^>]+)>//ig rather than s/<IMG\s+([^>]+)>/ig//ig
    To achieve what you actually want, you don't need to worry about storing an array.
        open FH,'filename' or die "Failed to open filename - $!";
        while (<FH>) {
            s/<IMG\s+([^>]+)>//ig;
            print;
        }
    That will run through each line of the file, strip any img tags and then print the line out again.
    This won't work with an image tag split across multiple lines. In that case you will be better off using something like HTML::Parser.
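
    To illustrate the multi-line limitation, here is a small sketch that uses made-up input (the @lines data and the variable names are assumptions for illustration). It runs the same substitution once line by line and once on the whole document joined into a single string:

```perl
use strict;
use warnings;

# Hypothetical input: the second <img> tag is split across two lines.
my @lines = (
    "<p>one <img src=\"a.png\"> two</p>\n",
    "<p>three <img\n",
    "   src=\"b.png\"> four</p>\n",
);

# Line-by-line substitution, as in the while loop above.
my @per_line = @lines;
s/<IMG\s+([^>]+)>//ig for @per_line;
my $line_result = join '', @per_line;

# The same substitution applied to the whole document as one string.
my $whole_result = join '', @lines;
$whole_result =~ s/<IMG\s+([^>]+)>//ig;

print "line by line:\n$line_result\n";
print "whole string:\n$whole_result\n";
```

    The line-by-line pass leaves the split tag in place, because no single line contains the complete tag. Working on one string handles this particular case, but it is still a regex rather than a parser, so comments, scripts and attribute values containing > can still trip it up.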
Re: HTML image tag stripping
by bm (Hermit) on Aug 05, 2003 at 15:50 UTC
    The interpreter is telling me that I have a global variable which needs an explicit package name

    You will probably get better advice if you include the actual output from Perl (not to mention <code> tags)...

    See How (Not) To Ask A Question

    Please help us help you.
    --
    bm

      On the other hand, that particular error often occurs when you 'use strict', but forget to declare a variable with 'my'. Strict then thinks it's a package global and asks you to say which package it's in, but you should really declare it instead. (Alternately, if you have declared it, move the declaration to a wider scope to encompass the place you use it.)
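
      A minimal sketch of the fix described above, with made-up data and a hypothetical sub name (first_line). Under 'use strict', using $i without declaring it fails at compile time with a message along the lines of: Global symbol "$i" requires explicit package name. Declaring it with 'my' in a scope that encloses every use makes the error go away:

```perl
use strict;
use warnings;

# Hypothetical sample data for illustration.
my @htmlLines = ("a\n", "b\n");

# Declaring $i with 'my' before the code that uses it satisfies strict.
# Without this line, 'use strict' rejects $i at compile time.
my $i = 0;

sub first_line {
    # $i is visible here because it was declared at file scope,
    # above this sub definition.
    return $htmlLines[$i];
}

print first_line();
```

      If the 'my $i' were instead placed after the sub, or inside some other block, the sub would not see it and strict would complain again.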

Re: HTML image tag stripping
by bm (Hermit) on Aug 05, 2003 at 16:19 UTC
    while($htmlLines[$i] =~ m/<IMG\s+([^>]+)>/ig)

    $i has not been declared within that block. This is probably where your error is coming from. It is certainly not doing what you want. How about:

        foreach ( @htmlLines ) {
            # do stuff with $_
        }

    But, as I am sure others will point out, parsing HTML with your own regexes is dangerous and will break sooner or later.
    Use one of the HTML::* modules on CPAN instead.
    Also, note that:
    for my $i (0..@htmlLines-1) { print $htmlLines[$i]; }
    may be re-written as  print for @htmlLines
    Hope this helps
    --
    bm
Re: HTML image tag stripping
by CombatSquirrel (Hermit) on Aug 05, 2003 at 16:22 UTC
    You have a problem with the undeclared $i in the while($htmlLines[$i] line. Also, you do not use the index of the array in the loop, so consider using a for(each) loop.
    Then, your substitution is wrong: it should be ...//ig instead of /ig//ig. And you don't have to look for a match first, just do a global replace; that is going to save you a little time on large files.
    And one more thing: You seem to be using a Windows environment (the "Documents and Settings" line). In this case you can make the first line just #!perl, because /usr/bin won't exist anyways.

    Hope I helped.

    P.S.: Here is your (hopefully) fixed code:
        #!/usr/bin/perl
        #htmltest2.plx
        # Program will read in an html file, remove the img tag and print out
        # no need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");

        use warnings;
        use diagnostics;
        use strict;

        my @htmlLines;

        open INFILE, "t.htm" or die ("Sod! Can't open this file.\n");
        @htmlLines = <INFILE>;

        @htmlLines = scrapTag(@htmlLines); # calls method to remove image tags

        sub scrapTag # removes image tags from HTML document
        {
            my @htmlLines = @_;
            for (@htmlLines)
            {
                $_ =~ s/<IMG\s+([^>]+)>//ig # replaces each instance of image tag with nothing!
            }
            return @htmlLines;
        }

        for (@htmlLines)
        {
            print;
        }
        print "\n\n";
        sleep 2;
        print "Success?!\n"