in reply to Re: Re: Re: Intercharacter spacing
in thread Intercharacter spacing

Graff, Cheers for the pointers. I've opened my HTML test file, written regexps to remove image and anchor tags, and printed the results out. Need to write these mods to the original file, then refresh the HTML page - Happy Days! My supervisor mentioned that regexps may have limitations, so I'm beginning to look into the HTML parse-tree approach (is that the same approach you recommended?). Here's the source code I've put together so far! Rich
#!/usr/bin/perl
# remove img & anchor tags.plx
# Program will read in an html file, remove the img tag and print out entire doc.
# 1. No need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
# 2. Alternative: m/<A\s+HREF=[^>]+>(.*?)<\/A>/ - Will not remove closing tag though - why?
# 3. Why is interpreter flipping-out over an 'undefined variable', when
#    original regexp, m/<A\s+HREF=[^>]+>(.*?)<\/A>/, is known to work. What am I missing?

use warnings;
use diagnostics;
use strict;
use HTML::Parser; # Include this module for future reference - may need to abandon
                  # regexps in favour of parse-trees.

# Declare and initialise variables.
my $pattern1 = '<IMG\s+(.*)>';
my $pattern2 = '<A\s+HREF\s*=[^>]+>';
my $pattern3 = '</A>';
my @htmlLines;

# Open HTML test file and read into array.
open INFILE, "E:\\Documents and Settings\\Richard Lamb\\My Documents\\HTMLworkspace\\HTML practice\\My First Page!\\firsttest.html"
    or die "Sod! Can't open this file.\n";
@htmlLines = <INFILE>;
close (INFILE);

# Test for presence of patterns in HTML file
if($pattern1) {
    scrapImageTag(); # calls to remove image tags
} else {
    print "No tags matching this pattern within the HTML document.\n";
}

if($pattern2 && $pattern3) {
    scrapAnchorTag();
} else {
    print "No tags matching this pattern within the HTML document.\n";
}

# Removes image tag elements in array
sub scrapImageTag {
    foreach my $line (@htmlLines) {
        # replace <IMG ...> with nothing.
        $line =~ s/$pattern1//ig; # case insensitivity and global search for pattern
    }
}

# Removes anchor tag elements in array
sub scrapAnchorTag {
    foreach my $line (@htmlLines) {
        # replace <A HREF ...> with nothing.
        $line =~ s/$pattern2//ig; # case insensitivity and global search for pattern
        $line =~ s/$pattern3//ig; # case insensitivity and global search for pattern
    }
}

printHTML(); # prints the reformatted HTML doc

sub printHTML {
    for my $i (0..@htmlLines-1) {
        print $htmlLines[$i];
    }
}

print "\n\n";
sleep 2;
print "Success?!\n";

Re^5: Intercharacter spacing
by graff (Chancellor) on Aug 12, 2003 at 01:31 UTC
    Okay -- that is very likely what you intend most of the time, in terms of getting rid of unwanted tags. But you should note that some of the conditionals are not doing what the comments and messages say they are doing:
    # Test for presence of patterns in HTML file
    if($pattern1) {
        scrapImageTag(); # calls to remove image tags
    } else {
        print "No tags matching this pattern within the HTML document.\n";
    }
    Well, the condition "if($pattern1)" does NOT test for the presence of image tags in the html data. It merely tests that some (non-empty, non-zero) value has been assigned to the scalar $pattern1, and since you have done so a few lines above this, the test will always be true -- it would be true even if no data were read in from the html file.

    To test for the presence of image tags in the html data, the condition would have to be:

    if ( grep /$pattern1/i, @htmlLines )
    but there's really no reason to do the test -- just go ahead and call the "scrap" functions. If those regex substitutions apply, fine. If not, no harm done (and not that much cpu work either).
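    To make the difference concrete, here is a minimal sketch. The contents of @htmlLines below are a hypothetical stand-in for the lines read from the real file, and the names $pattern_is_set and $matching_lines are made up for illustration. The first test only looks at the pattern string; the second uses grep in scalar context to count the lines that actually match.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data standing in for the lines read from the HTML file.
# Note that none of these lines contains an image tag.
my @htmlLines = (
    "<html><body>\n",
    "<p>Some text with no tags of interest.</p>\n",
    "</body></html>\n",
);

my $pattern1 = '<IMG\s+(.*)>';

# This only checks that $pattern1 holds a true value; the data is never examined.
my $pattern_is_set = $pattern1 ? 1 : 0;

# grep in scalar context returns the number of lines that match the pattern.
my $matching_lines = grep { /$pattern1/i } @htmlLines;

print "pattern is set: $pattern_is_set\n";
print "lines containing an image tag: $matching_lines\n";
```

    With this sample data the first value is true even though no line matches, which is why the original "else" branch could never run.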
      I see your point regarding the test conditions, and have slimmer code as a result. The DzSoft Perl Editor has an 'In Browser' facility where I can view the fruits of my code. It displays the HTML exactly as it would if I'd been able to write the altered code back to the source file on my hard drive, i.e. images and anchors have been removed. Which is where I'm having (more) problems. This hacking business is certainly hard work, though fun (when I can get code to run)! I'm trying to write the changed code back to the file on the hard drive, by writing to a filehandle, so I can re-open the HTML document. Do I use the print operator? I have to come clean and say that the file writing's confusing the Hell out of me. Are file tests the answer, or should I assign to a new list variable? Time to try. Here's the code I've written so far - the file won't open for writing (yet). Confusion!!! Rich
      #!/usr/bin/perl
      # write mods to HTML file.plx
      # Program will read in an html file, remove the img tag and rewrite HTML on E-drive.
      # 1. No need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
      # 2. Alternative: m/<A\s+HREF=[^>]+>(.*?)<\/A>/ - Will not remove closing tag though - why?
      # 3. Why is interpreter flipping-out over an 'undefined variable', when
      #    original regexp, m/<A\s+HREF=[^>]+>(.*?)<\/A>/, is known to work. What am I missing?

      use warnings;
      use diagnostics;
      use strict;

      # Declare and initialise variables.
      my $pattern1 = '<IMG\s+(.*)>';
      my $pattern2 = '<A\s+HREF\s*=[^>]+>';
      my $pattern3 = '</A>';
      my @htmlLines;
      my @htmlFile;

      # Open HTML test file and read into array.
      open INFILE, "E:/Documents and Settings/Richard Lamb/My Documents/HTML/test1InDocCSS.html"
          or die "Sod! Can't open this file.\n";
      @htmlLines = <INFILE>;
      close (INFILE);

      scrapImageTag();
      scrapAnchorTag();

      # Removes image tag elements in array
      sub scrapImageTag {
          foreach my $line (@htmlLines) {
              # replace <IMG ...> with nothing.
              $line =~ s/$pattern1//ig; # case insensitivity and global search for pattern
          }
      }

      # Removes anchor tag elements in array
      sub scrapAnchorTag {
          foreach my $line (@htmlLines) {
              # replace <A HREF ...> with nothing.
              $line =~ s/$pattern2//ig; # case insensitivity and global search for pattern
              $line =~ s/$pattern3//ig; # case insensitivity and global search for pattern
          }
      }

      # Am I deleting the contents of the list with this? Not sure...
      open (OUTFILE, ">@htmlLines") or die("Can't rewrite the HTML file.\n");
      print OUTFILE "@htmlLines\n";
      close (OUTFILE);
        Certainly, you don't really want to do this:
        open (OUTFILE, ">@htmlLines") or die("Can't rewrite the HTML file.\n");
        You're using all the contents of the array -- which would appear to be all the text contents of a file -- as the file name. (Try including the same "string" in the error message that reports failure to open the file, so you'll know when the failure is due to a bad file name.)

        You seem to have a usable, sensible file name for opening "INFILE", and you should do the similar thing when opening "OUTFILE" -- maybe change the name a bit, so you don't obliterate the original data file, or else rename the original file to something else, first (before you use that file name again for output), to preserve the input data -- this is important when debugging this sort of script.
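        Something along these lines might do as a starting point. It is only a sketch under some assumptions: the sub name writeLines, the output file name, and the sample lines are all made up for illustration, and a temporary directory from File::Temp stands in for the real folder so that nothing on disk is overwritten. It uses the three-argument form of open with a lexical filehandle, which is one common way to do this, and it puts both the file name and $! (the system's error text) into the failure message.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a list of lines to the named file. If the open or close fails,
# the message reports the file name and the system error text in $!.
sub writeLines {
    my ($fileName, @lines) = @_;
    open(my $out, '>', $fileName)
        or die "Can't open '$fileName' for writing: $!\n";
    # No quotes around @lines: "@lines" would join the elements with a
    # space, adding a stray leading space to every line after the first.
    print {$out} @lines;
    close($out)
        or die "Can't close '$fileName': $!\n";
    return;
}

# Hypothetical stand-ins for the real folder, file name, and data.
my $dir       = tempdir(CLEANUP => 1);
my $outFile   = "$dir/test1InDocCSS.stripped.html";
my @htmlLines = ("<html>\n", "<body>text</body>\n", "</html>\n");

writeLines($outFile, @htmlLines);
print "Wrote ", scalar(@htmlLines), " lines to $outFile\n";
```

        Because the output goes to a differently named file, the original input stays intact and the script can be rerun on the same data while debugging.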