in reply to Using a config file in my regexp script.

Use a Config module! I really like Config::General myself. At this point, the easiest pieces of data assignment you could abstract out of your code are PATH and FILE - create a config file like so:
PATH = "E:/path/to/html/file"
FILE = "test1InLineCSS.html"
And here is code to read it and open the file:
use Config::General;

my $conf   = Config::General->new("foo.conf");
my %config = $conf->getall;

# build the full path from the two config values and open it for reading
my $filename = join('/', $config{PATH}, $config{FILE});
open INFILE, '<', $filename or die "Can't open $filename: $!";
You could also devise a scheme to load your regexes with the config file ... but hold the press right there. I for one am really getting tired of shouting "Please use an HTML parser for this!"

Please use an HTML parser for this!

You state that "I know that the HTML parser modules are better..." No. You don't know that the HTML parser modules are better, you haven't used one yet! We keep telling you that they are better but you keep on trucking with your array of HTML lines. (And what happens when tags are split across lines? Your array solution falls apart!)
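
To make the split-tag problem concrete, here is a minimal sketch using only core Perl. The two-line input and the pattern are made up for illustration; they are not taken from your script. It only shows why matching one array element at a time misses a tag that spans two lines. It is not a substitute for a real parser.

```perl
use strict;
use warnings;

# Hypothetical input: the <p> start tag is split across two lines.
my @lines = (
    qq{<p\n},
    qq{   style="color: red">Hello</p>\n},
);

# Line-at-a-time matching, as with an array of HTML lines:
# neither element contains a complete <p ...> start tag.
my $per_line_hits = grep { /<p\b[^>]*>/i } @lines;

# Matching against the joined document finds the tag,
# because [^>]* is allowed to cross the newline.
my $whole      = join '', @lines;
my $whole_hits = () = $whole =~ /<p\b[^>]*>/gi;

print "per line: $per_line_hits, whole document: $whole_hits\n";
```

This prints "per line: 0, whole document: 1". A parser sidesteps the issue entirely because it hands you complete tags no matter how the source is wrapped.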

You also stated that you are "...exploring this approach for my MSc." What?!? There is nothing "Masters" about "parsing" HTML contained in an array with regular expressions. (UPDATE: i should have said "directly parsing with regexes" - subtle difference) No, that is very UNDERgraduate, my friend. Still don't believe me? Read on.

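Your current method makes a separate pass over the array of lines for every substitution. That is extremely inefficient. Now. Compare this with how an HTML parser works: it reads the document once and hands you one token at a time. In other words, you loop across the HTML file ONCE! Not 12 times like you have in your code.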
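
The difference in passes can be sketched with core Perl alone. Everything here is a stand-in: the 100 lines and the 12 patterns are hypothetical, and the loops only count how often a line is visited rather than doing real substitutions.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the document and for the 12 substitutions.
my @lines    = map { "line $_\n" } 1 .. 100;
my @patterns = map { qr/pattern$_/ } 1 .. 12;

# Array-of-lines approach: one full pass over the file per pattern.
my $multi_pass_visits = 0;
for my $re (@patterns) {
    for my $line (@lines) {
        $multi_pass_visits++;    # a substitution using $re would go here
    }
}

# Parser-style approach: one pass, each item inspected once.
my $single_pass_visits = 0;
for my $line (@lines) {
    $single_pass_visits++;       # decide what to do with this item here
}

print "multi-pass: $multi_pass_visits visits, single pass: $single_pass_visits visits\n";
```

With these made-up numbers the first approach touches lines 1200 times and the second 100 times. The single loop still has to decide what to do with each item, but it reads the input only once.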

Now. Because i am really crazy, here is a rewrite of your code. Maybe this will finally convince you to get on the right track. Maybe. ;)

Note that i do not write back to the original file.
use strict;
use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('tricky.html');

# these are the tags we just want to skip
my %skip = (
    u      => 1,
    b      => 1,
    i      => 1,
    em     => 1,
    big    => 1,
    img    => 1,
    strong => 1,
);

# these are the styles we are going to add to h, p, and li tags
my %modify = (
    h  => ';text-indent: 10px; word-spacing: 30px; letter-spacing: 3px; color: black',
    p  => ';text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black',
    li => ';text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black',
);

while (my $token = $parser->get_token) {

    # replace body bgcolor
    if ($token->is_start_tag('body')) {
        $token->set_attr(style => 'background-color: white');
    }

    # find and skip our "skip" tags
    next if $token->is_tag and $skip{$token->return_tag};

    # find and modify attributes for our "modify" tags
    if ($token->is_start_tag) {
        my $candidate = $token->return_tag;
        $candidate =~ s/^h[1-6]$/h/i;    # hack to handle all h tags

        # here we get the original style attr and add the new CSS
        if (my $add_attr = $modify{$candidate}) {
            my $orig_attr = $token->return_attr;
            $orig_attr->{style} .= $add_attr;

            # set only the style attribute; the other attributes stay as they are
            $token->set_attr(style => $orig_attr->{style});
        }
    }

    # just print to STDOUT ... change to fit your needs
    print $token->as_is;
}
Simply amazing, no? :) Yes. But even more amazing would be to simply OVERRIDE THE CSS! maybe something like:
body { color: black; }
u,b,i,em,img,big,strong { text-decoration: none; }
h1,h2,h3,h4,h5,h6 { text-indent: 10px; word-spacing: 30px; letter-spacing: 3px; color: black; }
p,li { text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black; }
No Perl needed at all. Not sure if this will work, but had your web page used proper CSS in the first place (that is, CSS defined in a separate file, not inlined into the HTML), this would have made your task next to trivial.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Extract inline styles to an external style sheet ( was Re: Re: Using a config file in my regexp script.)
by clscott (Friar) on Sep 17, 2003 at 19:13 UTC

    Regarding jeffa's final suggestion, here is a utility that strips out your inline style attributes, puts them in an external stylesheet, and outputs a new HTML file with the style attributes removed and the new CSS file linked in.

    All with HTML::TokeParser::Simple

    Some work to beautify the output or make it operate over multiple files is left as an exercise for the reader.

    #!/opt/perl/bin/perl -w
    use strict;
    use warnings;
    use HTML::TokeParser::Simple;
    use IO::File;

    my ( $htmlInfile, $htmlOutfile, $cssOutfile ) = @ARGV;
    $htmlOutfile ||= 'out.html';
    $cssOutfile  ||= 'out.css';

    my $htmlFile = IO::File->new("> $htmlOutfile")
        or die "Can't open $htmlOutfile for writing: $!\n";
    my $cssFile = IO::File->new("> $cssOutfile")
        or die "Can't open $cssOutfile for writing: $!\n";

    my $parser = HTML::TokeParser::Simple->new($htmlInfile);
    my %styles;

    while ( my $token = $parser->get_token ) {

        # link in our new css file
        if ( $token->is_end_tag('/head') ) {
            $htmlFile->print(
                "<link rel='stylesheet' type='text/css' href='$cssOutfile'>\n"
            );
        }

        # find and remove inline style definitions
        if ( $token->is_start_tag ) {
            my $tag  = $token->return_tag;
            my $attr = $token->return_attr;

            # If we have an inline style attribute get the value
            # then delete it from the tag
            if ( defined $attr->{style} ) {
                $styles{ $attr->{style} }{$tag} += 1;
                $token->delete_attr('style');
            }
        }
        $htmlFile->print( $token->as_is );
    }

    # print our collected styles into the css file
    while ( my ( $style, $tags ) = each %styles ) {
        #li, p { font-family: Times; font-size: 10pt }
        $cssFile->print( join( ', ', sort keys(%$tags) ) . " { $style }\n" );
    }
    --
    Clayton
      This does not handle fonts?