Greetings Monks,

I have some simple code to read an HTML file into an array, remove the image and anchor tags, and write these changes back to the source file on my hard drive. So far, so good.

1. Is there a better way to initialise the variable containing the pattern? Should it be a string literal of the tag I want to remove or change? The code's below, for your perusal.
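
For question 1, one option is to precompile each pattern with qr// instead of keeping it in a plain string. What follows is only a minimal sketch: the names $img_tag, $line and $cleaned and the sample line are made up for illustration, and [^>]* is swapped in for the original (.*) so that the match stops at the first closing bracket rather than the last one on the line.

```perl
use strict;
use warnings;

# Hypothetical name; compiled once, with case-insensitivity built in.
# [^>]* (an assumption, not the original pattern) stops at the first '>'.
my $img_tag = qr/<IMG\s+[^>]*>/i;

# Made-up sample line standing in for one element of @htmlLines.
my $line = 'before <img src="a.gif"> after';

# Copy the line, then strip every image tag from the copy.
(my $cleaned = $line) =~ s/$img_tag//g;

print "$cleaned\n";
```

Because the /i flag travels with the compiled pattern, the substitution itself only needs /g.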

Once I've read the file in, I'd like to check for the presence of the tags and, if they are present, call the subs which remove the tags/attributes. One of the brothers, a little while ago, thought that if I tested the pattern variables as they are at the moment, the test would always return true, because the variables hold non-empty strings. Am I doing this right?

2. How may I go about testing for the presence of tags/attributes without falling into this pitfall?
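
For question 2, the pitfall is testing the pattern variable itself rather than matching it against the data. A small sketch of the difference, assuming made-up sample lines in place of the real @htmlLines and a hypothetical variable name $anchor_open:

```perl
use strict;
use warnings;

# Hypothetical name; same idea as $pattern2 in the posted code.
my $anchor_open = qr/<A\s+HREF\s*=[^>]+>/i;

# Made-up sample lines standing in for @htmlLines.
my @lines = (
    "<p>plain text</p>\n",
    "<a href=\"x.html\">link</a>\n",
);

# "if ($anchor_open)" would always succeed: the variable is a non-empty value.
# Binding the pattern to each line with =~ is what tests for the tag.
my $found = grep { $_ =~ $anchor_open } @lines;

print $found ? "anchor tags present\n" : "no anchor tags\n";
```

In scalar context grep returns the number of matching lines, so $found doubles as a count that could decide whether the tag-removing sub needs to be called at all.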

I'm also looking into how to alter the font size values of in-line styles via the same approach.
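
The font-size idea could follow the same substitution approach. This is a rough sketch only: the sample line, the names $new_size and $changed, and the assumption that the value ends at a semicolon or quote are all mine, and the pattern would also touch any "font-size:" text that happens to sit outside a style attribute.

```perl
use strict;
use warnings;

# Made-up sample line and replacement size, for illustration only.
my $line     = '<p style="font-size: 10pt; color: red">text</p>';
my $new_size = '12pt';

# Capture the property name and colon, then replace only the value,
# which is assumed to run up to the next semicolon or quote.
(my $changed = $line) =~ s/(font-size\s*:\s*)[^;"']+/$1$new_size/gi;

print "$changed\n";
```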

Surely, there are better solutions...

Trix

#!/usr/bin/perl
# write mods to HTML file.plx
# Program will read in an html file, remove the img tag and rewrite HTML on E-drive.
# 1. No need for file variable yet: open (INFILE, "<".$htmlFile) or die("Can't read source file!\n");
# 2. Alternative: m/<A\s+HREF=[^>]+>(.*?)<\/A>/ - Will not remove closing tag though - why?
# 3. Why is interpreter flipping-out over an 'undefined variable', when
#    original regexp, m/<A\s+HREF=[^>]+>(.*?)<\/A>/, is known to work. What am I missing?

use warnings;
use diagnostics;
use strict;

# Declare and initialise variables.
my $pattern1 = '<IMG\s+(.*)>';
my $pattern2 = '<A\s+HREF\s*=[^>]+>';
my $pattern3 = '</A>';
my @htmlLines;

# Open HTML test file and read into array.
open INFILE, "E:/Documents and Settings/Richard Lamb/My Documents/HTML/dummy1.html"
    or die "Sod! Can't open this file.\n";
@htmlLines = <INFILE>;

# Call tag-scrapping subs
scrapImageTag();
scrapAnchorTag();

# Removes image tag elements in array
sub scrapImageTag {
    # iterates through each element (i.e. HTML line) in array
    foreach my $line (@htmlLines) {
        # replace <IMG ...> with nothing.
        $line =~ s/$pattern1//ig;    # case insensitivity and global search for pattern
    }
}

# Removes anchor tag elements in array
sub scrapAnchorTag {
    # iterates through each element (i.e. HTML line) in array
    foreach my $line (@htmlLines) {
        # replace <A HREF ...> with nothing.
        $line =~ s/$pattern2//ig;    # case insensitivity and global search for pattern
        $line =~ s/$pattern3//ig;    # case insensitivity and global search for pattern
    }
}

# Replacing original file with reformatted file!
open (OUTFILE, ">E:/Documents and Settings/Richard Lamb/My Documents/HTML/dummy1.html")
    or die("Can't rewrite the HTML file.\n");
print (OUTFILE @htmlLines);
close (INFILE);
close (OUTFILE);

Cheers,

T

update (broquaint): shifted <code> tags, added formatting and <readmore> tag


In reply to Regexps to change HTML tags/attributes by Tricky
