Graff, Cheers for the pointers. I've opened my HTML test file, written regexps to remove image and anchor tags, and printed the result out. I still need to write these mods back to the original file, then refresh the HTML page - Happy Days! My supervisor mentioned that regexps may have limitations, so I'm beginning to look into the HTML parse-tree approach (is that the same approach you recommended?). Here's the source code I've put together so far! Rich

#!/usr/bin/perl
# remove img & anchor tags.plx
# Program will read in an html file, remove the img and anchor tags and print out entire doc.
# 1. No need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
# 2. Alternative: m/<A\s+HREF=[^>]+>(.*?)<\/A>/  - Will not remove closing tag though - why?
# 3. Why is interpreter flipping-out over an 'undefined variable', when
#    original regexp, m/<A\s+HREF=[^>]+>(.*?)<\/A>/, is known to work. What am I missing?

use warnings;
use diagnostics;
use strict;
use HTML::Parser;    # Include this module for future reference - may need to abandon
                     # regexps in favour of parse-trees.

# Declare and initialise variables.
my $pattern1 = '<IMG\b[^>]*>';   # [^>]* stops at the first '>'; (.*) would run on to the last '>' on the line
my $pattern2 = '<A\s+HREF\s*=[^>]+>';
my $pattern3 = '</A>';
my @htmlLines;

# Open HTML test file and read into array.
open(INFILE, "<", "E:\\Documents and Settings\\Richard Lamb\\My Documents\\HTMLworkspace\\HTML practice\\My First Page!\\firsttest.html")
  or die "Sod! Can't open this file: $!\n";   # $! reports why the open failed
@htmlLines = <INFILE>;
close (INFILE);

# Test for presence of patterns in HTML file.
# A non-empty pattern string is always true, so match it against the lines instead.
if (grep { /$pattern1/i } @htmlLines)
{
  scrapImageTag(); # calls to remove image tags
}
else
{
  print "No tags matching this pattern within the HTML document.\n";
}

if (grep { /$pattern2/i || /$pattern3/i } @htmlLines)
{
  scrapAnchorTag();
}
else
{
  print "No tags matching this pattern within the HTML document.\n";
}

# Removes image tag elements in array
sub scrapImageTag
{
  foreach my $line (@htmlLines)
  {
    # replace <IMG ...> with nothing.
    $line =~ s/$pattern1//ig;  # case insensitivity and global search for pattern
  }
}

# Removes anchor tag elements in array
sub scrapAnchorTag
{
  foreach my $line (@htmlLines)
  {
    # replace <A HREF ...> with nothing.
    $line =~ s/$pattern2//ig;  # case insensitivity and global search for pattern
    $line =~ s/$pattern3//ig;  # case insensitivity and global search for pattern
  }
}

printHTML();

# prints the reformatted HTML doc
sub printHTML
{
  print @htmlLines;  # prints every line in order; no index loop needed
}

print "\n\n";
sleep 2;
print "Success?!\n";
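
A minimal sketch of the single-substitution form hinted at in note 2 of the script header, assuming the whole document is held in one string rather than an array of lines. The sample HTML and the names `$html` and `$stripped` are made up for illustration. Capturing the link text and substituting `$1` removes the opening and closing tags in one pass. Note that `$1` is undefined until a match has succeeded, which may or may not be related to the 'undefined variable' warning in note 3.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input; the real script reads firsttest.html instead.
my $html = <<'HTML';
<p>See <A HREF="page.html">this
page</A> and <IMG SRC="pic.gif" ALT="x"> here.</p>
HTML

# Replace each whole anchor element with its captured link text.
# /s lets . match newlines, so an anchor that spans two lines is still
# handled when the document is held in a single string.
(my $stripped = $html) =~ s{<A\s+HREF\s*=[^>]+>(.*?)</A>}{$1}gis;

# [^>]* stops at the first '>', so only the tag itself is removed.
$stripped =~ s{<IMG\b[^>]*>}{}gi;

print $stripped;
```

Processing the file line by line, as the script above does, cannot match an anchor whose opening and closing tags sit on different lines with a single pattern, which is one reason to slurp the file into one string first.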
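
For comparison, here is a rough sketch of the parser-based route, assuming HTML::Parser (already loaded by the script) with its version 3 handler API. HTML::Parser is event-driven rather than tree-building; HTML::TreeBuilder is the usual module when an actual parse tree is wanted. The helper name `strip_tags`, the `%drop` table and the sample input are hypothetical, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser ();

# Tags to drop; HTML::Parser reports tag names in lower case.
my %drop = map { $_ => 1 } qw(img a);

# Returns a copy of $html with the start and end tags listed in %drop removed.
sub strip_tags {
    my ($html) = @_;
    my $out = '';

    # Copy a tag through to the output unless its name is in %drop.
    my $tag_handler = sub {
        my ($tagname, $text) = @_;
        $out .= $text unless $drop{$tagname};
    };

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h     => [ $tag_handler, 'tagname, text' ],
        end_h       => [ $tag_handler, 'tagname, text' ],
        # Text, comments, declarations and so on pass through unchanged.
        default_h   => [ sub { $out .= shift }, 'text' ],
    );
    $parser->parse($html);
    $parser->eof;    # flush any buffered text
    return $out;
}

# Hypothetical sample input standing in for firsttest.html.
my $sample = qq{<p>See <a\n href="page.html">this page</a> <IMG SRC="pic.gif"> here.</p>\n};
print strip_tags($sample);
```

Because the parser decides where each tag starts and ends, tags split across lines, mixed case and attributes in any order need no special handling in the patterns.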

In reply to Re: Re: Re: Re: Intercharacter spacing by Tricky
in thread Intercharacter spacing by Tricky
