Use a Config module! I really like Config::General myself. At this point, the easiest values to move out of your code are PATH and FILE. Create a config file like so:
    PATH = "E:/path/to/html/file"
    FILE = "test1InLineCSS.html"
And here is code to read it and open the file:
    use Config::General;

    my $conf   = Config::General->new("foo.conf");
    my %config = $conf->getall;

    # build the full path from the two config values
    my $filename = join('/', $config{PATH}, $config{FILE});

    # three-argument open, read-only
    open INFILE, '<', $filename or die "Can't open $filename: $!";
You could also devise a scheme to load your regexes from the config file ... but hold the presses right there. I for one am really getting tired of shouting "Please use an HTML parser for this!"

Please use an HTML parser for this!

You state that "I know that the HTML parser modules are better..." No. You don't know that the HTML parser modules are better, you haven't used one yet! We keep telling you that they are better but you keep on trucking with your array of HTML lines. (And what happens when tags are split across lines? Your array solution falls apart!)
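
To see the split-tag problem for yourself, here is a minimal sketch using core Perl only. The two-line input and the pattern are made up for illustration and are not taken from your script. The same regex is run once per line and once against the joined text:

```perl
use strict;
use warnings;

# Hypothetical input: the <p ...> start tag is split across two lines,
# which is perfectly legal HTML.
my @lines = (
    qq{<p\n},
    qq{   style="color: red">Hello</p>\n},
);

# An illustrative pattern for a <p> start tag.
my $pattern = qr/<p\b[^>]*>/i;

# Line-by-line matching, as with an array of HTML lines.
my $per_line_hits = grep { /$pattern/ } @lines;

# The same pattern against the whole document as one string.
my $whole      = join '', @lines;
my $whole_hits = () = $whole =~ /$pattern/g;

print "line by line: $per_line_hits\n";    # prints "line by line: 0"
print "whole string: $whole_hits\n";       # prints "whole string: 1"
```

Joining the lines rescues this one case, but it does nothing for comments, scripts, or attribute values that contain a ">". A parser handles all of those for you.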

You also stated that you are "...exploring this approach for my MSc." What?!? There is nothing "Masters" about "parsing" HTML contained in an array with regular expressions. (UPDATE: i should have said "directly parsing with regexes" - subtle difference) No, that is very UNDERgraduate, my friend. Still don't believe me? Read on.

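Your current method loops over the whole array of HTML lines once for every substitution. That is extremely inefficient. Now compare this with how an HTML parser works: it hands you the file one token at a time, and you deal with each token as it goes by. In other words, you loop across the HTML file ONCE! Not 12 times like you have in your code.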

Now. Because i am really crazy, here is a rewrite of your code. Maybe this will finally convince you to get on the right track. Maybe. ;)

Note that i do not write back to the original file.
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;

    my $parser = HTML::TokeParser::Simple->new('tricky.html');

    # these are the tags we just want to skip
    my %skip = (
        u => 1, b => 1, i => 1, em => 1, big => 1, img => 1, strong => 1,
    );

    # these are the styles we are going to add to h, p, and li tags
    my %modify = (
        h  => ';text-indent: 10px; word-spacing: 30px; letter-spacing: 3px; color: black',
        p  => ';text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black',
        li => ';text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black',
    );

    while (my $token = $parser->get_token) {

        # replace body bgcolor
        if ($token->is_start_tag('body')) {
            $token->set_attr(style => 'background-color: white');
        }

        # find and skip our "skip" tags
        next if $token->is_tag and $skip{$token->return_tag};

        # find and modify attributes for our "modify" tags
        if ($token->is_start_tag) {
            my $candidate = $token->return_tag;
            $candidate =~ s/h[1-6]/h/i;    # hack to handle all h tags

            # here we get the original style attr and add the new CSS
            if (my $add_attr = $modify{$candidate}) {
                my $orig_style = $token->return_attr->{style};
                $orig_style = '' unless defined $orig_style;

                # set only the style attribute, one name/value pair
                $token->set_attr(style => $orig_style . $add_attr);
            }
        }

        # just print to STDOUT ... change to fit your needs
        print $token->as_is;
    }
Simply amazing, no? :) Yes. But even more amazing would be to simply OVERRIDE THE CSS! Maybe something like:
    body { color: black; }
    u,b,i,em,img,big,strong { text-decoration: none; }
    h1,h2,h3,h4,h5,h6 { text-indent: 10px; word-spacing: 30px; letter-spacing: 3px; color: black; }
    p,li { text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black; }
No Perl needed at all. Not sure if this will work, but had your web page used proper CSS in the first place (that is, CSS defined in a separate file, not inlined into the HTML), this would have made your task next to trivial.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

In reply to Re: Using a config file in my regexp script. by jeffa
in thread Using a config file in my regexp script. by Tricky
