http://qs1969.pair.com?node_id=287071

Tricky has asked for the wisdom of the Perl Monks concerning the following question:

Greetings Monks,

I have some simple code to read an HTML file into an array, remove the image and anchor tags, and write the changes back to the source file on my hard drive. So far, so good.

1. Is there a better way to initialise the variable containing the pattern? Should it be a string literal of the tag I want to remove or change? The code is below for your perusal.

Once I've read the file in, I'd like to check for the presence of the tags and, if they are present, call the subs which remove the tags/attributes. One of the brothers pointed out a little while ago that if I tested the pattern variables as they stand, the test would always return 'true', because the variables hold non-empty strings. Am I doing this right?

2. How may I go about testing for the presence of tags/attributes without falling into this pitfall?

I'm also looking into how to alter the font size values of in-line styles via the same approach.

Surely, there are better solutions...

Trix

#!/usr/bin/perl
# write mods to HTML file.plx
# Program will read in an html file, remove the img tag and rewrite HTML on E-drive.
# 1. No need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
# 2. Alternative: m/<A\s+HREF=[^>]+>(.*?)<\/A>/ - Will not remove closing tag though - why?
# 3. Why is interpreter flipping-out over an 'undefined variable', when
#    original regexp, m/<A\s+HREF=[^>]+>(.*?)<\/A>/, is known to work. What am I missing?

use warnings;
use diagnostics;
use strict;

# Declare and initialise variables.
my $pattern1 = '<IMG\s+(.*)>';
my $pattern2 = '<A\s+HREF\s*=[^>]+>';
my $pattern3 = '</A>';
my @htmlLines;

# Open HTML test file and read into array.
open INFILE, "E:/Documents and Settings/Richard Lamb/My Documents/HTML/dummy1.html" or die "Sod! Can't open this file.\n";
@htmlLines = <INFILE>;

# Call tag-scrapping subs
scrapImageTag();
scrapAnchorTag();

# Removes image tag elements in array
sub scrapImageTag {
    # iterates through each element (i.e. HTML line) in array
    foreach my $line (@htmlLines) {
        # replace <IMG ...> with nothing.
        $line =~ s/$pattern1//ig; # case insensitivity and global search for pattern
    }
}

# Removes anchor tag elements in array
sub scrapAnchorTag {
    # iterates through each element (i.e. HTML line) in array
    foreach my $line (@htmlLines) {
        # replace <A HREF ...> with nothing.
        $line =~ s/$pattern2//ig; # case insensitivity and global search for pattern
        $line =~ s/$pattern3//ig; # case insensitivity and global search for pattern
    }
}

# Replacing original file with reformatted file!
open (OUTFILE, ">E:/Documents and Settings/Richard Lamb/My Documents/HTML/dummy1.html") or die("Can't rewrite the HTML file.\n");
print (OUTFILE @htmlLines);
close (INFILE);
close (OUTFILE);

Cheers,

T

update (broquaint): shifted <code> tags, added formatting and <readmore> tag
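
On question 2, here is a minimal sketch of the pitfall the question describes. The sample lines and the variable names `$img_re`, `$string_is_true` and `$has_img` are made up for illustration. Testing the pattern variable on its own only asks whether the string is non-empty. To test for a tag, the pattern has to be bound to the data with `=~`. Compiling the pattern with `qr//` is one common way to initialise it, which is an assumption about style and not something from the original code.

```perl
use strict;
use warnings;

# Pattern string as in the question, plus a compiled alternative (assumption).
my $pattern1 = '<IMG\s+(.*)>';
my $img_re   = qr/<IMG\s+[^>]*>/i;    # /i is stored inside the compiled regex

# Hypothetical sample data: one line without a tag, one with.
my @lines = ("<p>no tags here</p>\n", "<p><IMG SRC=\"a.png\"></p>\n");

# Pitfall: this only checks that the string is non-empty, so it is always true.
my $string_is_true = $pattern1 ? 1 : 0;

# Real test: match the pattern against each line; grep in scalar
# context returns the number of lines that matched.
my $has_img = grep { $_ =~ $img_re } @lines;

print "pattern string is true: $string_is_true\n";
print "lines containing an IMG tag: $has_img\n";
```

Note that `s///` already returns the number of substitutions it made, so a separate "is it there?" test before substituting is often unnecessary. This still inherits all the regex-versus-HTML problems raised in the replies.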

Replies are listed 'Best First'.
Re: Regexps to change HTML tags/attributes
by Ovid (Cardinal) on Aug 27, 2003 at 16:00 UTC

    As a general rule, don't use regular expressions to parse HTML. You typically want a parser. Here's a short example that will remove all anchor tags (beginning and ending) and also change font sizes (though you should really use CSS) and delete the "alt" attribute of images (which you also shouldn't do, but it's here as an example):

    use HTML::TokeParser::Simple 2.1;

    # $html_file and $new_html_doc are assumed to be set elsewhere
    my $parser = HTML::TokeParser::Simple->new($html_file);
    my $html   = '';

    while (defined(my $token = $parser->get_token)) {
        next if $token->is_tag('a'); # strip anchor tags
        if ($token->is_start_tag('font')) {
            $token->set_attr('size', 7);
        }
        if ($token->is_tag('img')) {
            $token->delete_attr('alt');
        }
        $html .= $token->as_is;
    }

    open HTML, ">", $new_html_doc or die "Cannot open ($new_html_doc) for writing: $!";
    print HTML $html;
    close HTML;

    As a side note, if you want your HTML "cleaned up" a little bit, prior to the $html .= $token->as_is; line, add:

    $token->rewrite_tag;

    That will preserve and double-quote the values, automatically lowercase the tag name and attribute names (as they properly should be) and preserve an ending forward slash if it's used in a self closing tag:

    # before
    <img SRC=foo.jpg height='13' width=14 ALT="SOME alt Value" />
    # after
    <img src="foo.jpg" height="13" width="14" alt="SOME alt Value" />

    This method is automatically called on tags that have attributes added, changed, or deleted.

    In other words, this is a very common task, and HTML::TokeParser::Simple version 2.1 does all of that for you and then some.

    Cheers,
    Ovid


Re: Regexps to change HTML tags/attributes
by Abigail-II (Bishop) on Aug 27, 2003 at 16:04 UTC
    There's much wrong with your program. First, if you are going to modify the file line-by-line, it's a total waste to first read all the lines into an array. However, when dealing with HTML, it's wrong to look at individual lines. HTML does not have a concept of lines, and tags can have newlines inside them.
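
To illustrate the point about lines, here is a small sketch with a made-up two-line input in which a single IMG tag contains a newline. The pattern is a simplified stand-in, not the one from the original post, and the variable names are hypothetical. Applied line by line, the substitution finds nothing. Applied to the joined text, it finds the tag.

```perl
use strict;
use warnings;

# Hypothetical input: one IMG tag split across two lines.
my @lines = ("<p><IMG\n", "  SRC=\"a.png\"> text</p>\n");

# Line by line: neither line holds a complete tag, so nothing is removed.
my @by_line   = @lines;
my $line_hits = 0;
for my $l (@by_line) {
    $line_hits++ if $l =~ s/<IMG\s+[^>]*>//ig;
}

# Whole document as one string: \s+ can now cross the newline.
my $doc      = join '', @lines;
my $doc_hits = ($doc =~ s/<IMG\s+[^>]*>//ig) ? 1 : 0;

print "line-by-line substitutions: $line_hits\n";
print "whole-document substitutions: $doc_hits\n";
print $doc;
```

Slurping the file only fixes the newline problem. The pattern is still fragile in the other ways described in this thread.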

    As for the regexes, the first pattern will not do the right thing if there's another tag on the same line. The second pattern will fail to do the right thing if the anchor has another attribute before "HREF", or if it has an attribute value containing a ">".
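
A short sketch of those three failure cases, using the two patterns from the original post against made-up input strings (the strings and the variable names `$greedy`, `$other_attr`, `$matched` and `$gt_in_value` are hypothetical):

```perl
use strict;
use warnings;

my $pattern1 = '<IMG\s+(.*)>';           # from the original post
my $pattern2 = '<A\s+HREF\s*=[^>]+>';    # from the original post

# 1. Another tag on the same line: greedy .* runs to the LAST '>',
#    so the <b> element is deleted along with the image.
my $greedy = '<IMG SRC="a.png"><b>keep me</b>';
$greedy =~ s/$pattern1//ig;

# 2. An attribute before HREF: the pattern does not match at all.
my $other_attr = '<A CLASS="nav" HREF="x.html">link</A>';
my $matched    = ($other_attr =~ /$pattern2/i) ? 1 : 0;

# 3. A '>' inside an attribute value: [^>]+ stops at the first '>',
#    leaving the tail of the tag behind.
my $gt_in_value = '<A HREF="x.html?a=>b">link</A>';
$gt_in_value =~ s/$pattern2//ig;

print "case 1 leaves: '$greedy'\n";
print "case 2 matched: $matched\n";
print "case 3 leaves: '$gt_in_value'\n";
```

Making `.*` non-greedy would help with case 1 only. Cases 2 and 3 need the pattern to understand attributes and quoting, which is what a parser does.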

    You would be far better off using one of the many HTML parsing modules found on CPAN.

    Abigail

Re: Regexps to change HTML tags/attributes
by Aristotle (Chancellor) on Aug 28, 2003 at 10:59 UTC
    How would you build a regex to change the src here?
    <img alt="></(/>" src="/img_handler?alpha=>0.9;name=fish.png" />

    Even if you managed to get this right, the likelihood is very high that your pattern could be broken easily. Don't just blindly look for strings in your HTML.
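
As a rough illustration, here are two naive attempts run against that tag. Both regexes are made up for this sketch, as is the replacement name `new.png`. They assume a '>' can only ever end a tag, which this input contradicts.

```perl
use strict;
use warnings;

my $html = '<img alt="></(/>" src="/img_handler?alpha=>0.9;name=fish.png" />';

# Attempt 1: grab "the whole tag" as everything up to the first '>'.
# It stops inside the alt value.
my ($tag) = $html =~ /(<img[^>]*>)/i;

# Attempt 2: rewrite src, assuming no '>' appears before it.
# [^>]* cannot get past the '>' in alt, so nothing is changed.
my $copy    = $html;
my $changed = ($copy =~ s/(<img[^>]*src=")[^"]*(")/${1}new.png$2/i) ? 1 : 0;

print "attempt 1 captured: $tag\n";
print "attempt 2 changed src: $changed\n";
```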

    Write an actual parser, if you have a lot of time to spare. Otherwise, use one of the existing ones (see Ovid's reply) and get on with your life.

    Makeshifts last the longest.
