HTML image tag stripping

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello Monks,
My script aims to read in a dummy HTML file from my hard disk (E drive, as the disk is partitioned) and call a sub which runs a regex looking for image tags and replaces them with nothing. The interpreter is telling me that I have a global variable which needs an explicit package name, and I've tried to remedy this but it still won't work! I'm new to programming, so please bear with me. Here's the code:

#!/usr/bin/perl
#htmltest2.plx
# Program will read in an html file, remove the img tag and print out
# no need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
use warnings;
use diagnostics;
use strict;

my @htmlLines;
open INFILE, "E:\\Documents and Settings\\Richard Lamb\\My Documents\\HTMLworkspace\\HTML practice\\My First Page!\\firsttest\.html" or die ("Sod! Can't open this file.\n");
@htmlLines = <INFILE>;

scrapTag(); # calls method to remove image tags

sub scrapTag  # removes image tags from HTML document
{
  while($htmlLines[$i] =~ m/<IMG\s+([^>]+)>/ig) # finds each instance of image tag in the input file
  {

    s/<IMG\s+([^>]+)>/ig//ig # replaces each instance of image tag with nothing!
  }
  return @htmlLines;
}

for my $i (0..@htmlLines-1)
{
  print $htmlLines[$i];
}

print "\n\n";
sleep 2;
print "Success?!\n"

Any suggestions/hints as to where I'm going wrong?

Cheers,

Richard

update (broquaint): added formatting


Replies are listed 'Best First'.
Re: HTML image tag stripping by Aristotle (Chancellor) on Aug 05, 2003 at 16:07 UTC
You say `$htmlLines[$i]`, but that subroutine has no `$i`, and in the main program it gets declared further down. When you mean "index of the last element of the array", it is better to write `$#htmlLines` rather than `@htmlLines-1`. Also, avoid using indices when you can do without them:

for my $line (@htmlLines) { print $line; }

Lastly, your `scrapTag` procedure is pretty defective with regard to valid HTML. You should parse the HTML, not just look for character sequences. Use one of the many excellent modules for that purpose. I recommend HTML::TokeParser::Simple:

use warnings;
use strict;
use HTML::TokeParser::Simple;

my $file = "E:\\Documents and Settings\\Richard Lamb\\My Documents\\HTMLworkspace\\HTML practice\\My First Page!\\firsttest\.html";
my $parser = HTML::TokeParser::Simple->new($file)
    or die "Can't open $file: $!\n";

while ( my $token = $parser->get_token ) {
    next if $token->is_tag('img');
    print $token->as_is;
}

Makeshifts last the longest.
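To make the point about `$#htmlLines` and index-free loops concrete, here is a minimal sketch. The sample lines are made up for illustration and stand in for whatever was read from the file; `$last_index`, `$count`, and `$output` are hypothetical names, not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from the file.
my @htmlLines = ("<p>one</p>\n", "<p>two</p>\n", "<p>three</p>\n");

my $last_index = $#htmlLines;        # index of the last element
my $count      = scalar @htmlLines;  # number of elements

# Iterating without indices: $line refers to each element in turn.
my $output = '';
for my $line (@htmlLines) {
    $output .= $line;
}
print $output;
```

Note that `$#htmlLines` is always one less than the element count, which is why `0..$#htmlLines` and `0..@htmlLines-1` cover the same range.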
Re: HTML image tag stripping by Lachesis (Friar) on Aug 05, 2003 at 16:11 UTC
Your error is because you haven't declared `$i`. Your substitution expression will also cause problems: the modifiers only go at the very end, so use `s/<IMG\s+([^>]+)>//ig` rather than `s/<IMG\s+([^>]+)>/ig//ig`. To achieve what you actually want, you don't need to worry about storing an array:

open FH, 'filename' or die "Failed to open filename - $!";
while (<FH>) {
    s/<IMG\s+([^>]+)>//ig;
    print;
}

That will run through each line of the file, strip any img tags and then print the line out again. This won't work with an image tag split across multiple lines. In that case you will be better off using something like HTML::Parser.
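If you want to stay with a regex for a quick experiment, one way around the split-tag problem is to read the whole file into a single string first. The following is only a sketch under stated assumptions: `strip_img_tags` is a hypothetical helper, the sample HTML is invented, and an in-memory filehandle stands in for the real file. It is still a regex, so it will trip over things like a `>` inside an attribute value; a real parser remains the safer route.

```perl
use strict;
use warnings;

# Hypothetical helper: removes <img ...> tags from a string holding the whole document.
sub strip_img_tags {
    my ($html) = @_;
    # [^>] also matches newlines, so a tag split across lines is removed too.
    $html =~ s/<img\b[^>]*>//gi;
    return $html;
}

# Invented sample with an img tag spread over three lines.
my $sample = "<p>Before <img\n  src=\"a.png\"\n  alt=\"A\"> after</p>\n";

# Read from an in-memory filehandle so the sketch needs no file on disk.
open my $fh, '<', \$sample or die "Can't open in-memory file: $!";
my $html = do { local $/; <$fh> };   # undefined $/ means "read everything at once"
close $fh;

print strip_img_tags($html);
```

With a real file, only the `open` line would change; the `local $/` trick is what turns the line-by-line read into a single read.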
Re: HTML image tag stripping by bm (Hermit) on Aug 05, 2003 at 15:50 UTC
"The interpreter is telling me that I have a global variable which needs an explicit package name"

You will probably get better advice if you include the actual output from Perl (not to mention <code> tags)... See "How (Not) To Ask A Question". Please help us help you.

-- bm
Re: Re: HTML image tag stripping by Anonymous Monk on Aug 05, 2003 at 15:53 UTC
On the other hand, that particular error often occurs when you 'use strict', but forget to declare a variable with 'my'. Strict then thinks it's a package global and asks you to say which package it's in, but you should really declare it instead. (Alternatively, if you have declared it, move the declaration to a wider scope to encompass the place you use it.)
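A small sketch of what "declare it in a scope that covers its use" looks like in practice. The sample data and the sub name `count_img_lines` are made up for illustration; the point is only that `$i` is declared with `my` in the loop header, so every use of it sits inside that scope and strict has nothing to complain about.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the file contents.
my @htmlLines = ("<IMG src=\"a.gif\">text\n", "plain\n");

sub count_img_lines {
    my $matches = 0;
    # $i is declared here, so it is visible throughout the loop body
    # and nowhere else.
    for my $i (0 .. $#htmlLines) {
        $matches++ if $htmlLines[$i] =~ /<IMG\s+[^>]+>/i;
    }
    return $matches;
}

print count_img_lines(), "\n";
```

Using `$i` outside that loop (or in a sub that never declares it, as in the original script) is what brings the "requires explicit package name" message back.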
Re: HTML image tag stripping by bm (Hermit) on Aug 05, 2003 at 16:19 UTC
`while($htmlLines[$i] =~ m/<IMG\s+([^>]+)>/ig)`

`$i` has not been declared within that block. This is probably where your error is coming from. It is certainly not doing what you want. How about:

foreach ( @htmlLines ) {
    # do stuff with $_
}

But, as I am sure others will point out, parsing HTML with your own regexes is dangerous and will break sooner or later. Use one of the `HTML::*` CPAN modules instead. Also, note that:

for my $i (0..@htmlLines-1) {
    print $htmlLines[$i];
}

may be rewritten as `print for @htmlLines`.

Hope this helps

-- bm
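For what "do stuff with $_" might look like here, a minimal sketch follows, with invented sample lines. Inside a `foreach` loop `$_` is an alias for the current element, so a substitution on `$_` changes the array itself and no index is needed. The usual caveat about regexes and HTML still applies.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for the file contents.
my @htmlLines = ("<p><img src=\"x.png\">Hi</p>\n", "<p>Bye</p>\n");

# $_ aliases each element, so s/// edits @htmlLines in place.
foreach (@htmlLines) {
    s/<IMG\s+[^>]+>//ig;
}

print for @htmlLines;
```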
Re: HTML image tag stripping by CombatSquirrel (Hermit) on Aug 05, 2003 at 16:22 UTC
You have a problem with the undeclared `$i` in the `while($htmlLines[$i]` line. Also, you do not use the index of the array in the loop, so consider using a `for(each)` loop. Then, your substitution is wrong: it should be `...//ig` instead of `/ig//ig`. And you don't have to look for a match, just do a global replace; that is going to save you a little time on large files.

And one more thing: you seem to be using a Windows environment (the "Documents and Settings" line). In this case you can make the first line just `#!perl`, because `/usr/bin` won't exist anyway.

Hope I helped.

P.S.: Here is your (hopefully) fixed code:

#!/usr/bin/perl
#htmltest2.plx
# Program will read in an html file, remove the img tag and print out
# no need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
use warnings;
use diagnostics;
use strict;

my @htmlLines;
open INFILE, "t.htm" or die ("Sod! Can't open this file.\n");
@htmlLines = <INFILE>;

@htmlLines = scrapTag(@htmlLines); # calls method to remove image tags

sub scrapTag # removes image tags from HTML document
{
  my @htmlLines = @_;
  for (@htmlLines)
  {
    $_ =~ s/<IMG\s+([^>]+)>//ig # replaces each instance of image tag with nothing!
  }
  return @htmlLines;
}

for (@htmlLines)
{
  print;
}

print "\n\n";
sleep 2;
print "Success?!\n"