Re: Using a config file in my regexp script.

Use a Config module! I really like Config::General myself. At this point, the easiest pieces of data assignment you could abstract out of your code are PATH and FILE - create a config file like so:

PATH = "E:/path/to/html/file"
FILE = "test1InLineCSS.html"

And here is code to read it and open the file:

use Config::General;

my $conf   = Config::General->new("foo.conf");
my %config = $conf->getall;
my $filename = join('/',$config{PATH}, $config{FILE});

open INFILE, '<', $filename or die "Can't open $filename: $!";

You could also devise a scheme to load your regexes with the config file ... but hold the press right there. I for one am really getting tired of shouting "Please use an HTML parser for this!"

Please use an HTML parser for this!

You state that "I know that the HTML parser modules are better..." No. You don't know that the HTML parser modules are better, you haven't used one yet! We keep telling you that they are better but you keep on trucking with your array of HTML lines. (And what happens when tags are split across lines? Your array solution falls apart!)
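To make the split-tag problem concrete, here is a minimal sketch using only core Perl. The HTML snippet and the variable names are made up for illustration; they are not taken from your script. It only shows how a per-line match can miss a tag that wraps across lines, and it is not a recommendation to match the whole document with a regex instead.

```perl
use strict;
use warnings;

# Hypothetical input: a <p> start tag whose attributes wrap onto a second line.
my $html = qq{<p class="intro"\n   style="color: red">Hello</p>\n};

# Line-by-line approach: split into lines and test each line on its own.
my @lines     = split /^/, $html;
my $line_hits = grep { /<p\b[^>]*>/ } @lines;

# Whole-document approach: [^>] also matches newlines, so the tag is seen.
my $doc_hits = () = $html =~ /<p\b[^>]*>/g;

print "line-by-line matches:   $line_hits\n";
print "whole-document matches: $doc_hits\n";
```

Neither half of the wrapped tag looks like a complete start tag when seen alone, so the per-line loop never gets a chance to modify it. A real parser tokenizes across line breaks and does not have this blind spot.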

You also stated that you are "...exploring this approach for my MSc." What?!? There is nothing "Masters" about "parsing" HTML contained in an array with regular expressions. (UPDATE: i should have said "directly parsing with regexes" - subtle difference) No, that is very UNDERgraduate, my friend. Still don't believe me? Read on.

Your current method does this:

slurp the HTML file into an array
loop over that entire array to change X
loop again over that entire array to change Y
loop yet again over that entire array to change Z
loop again and again and again ...

That is extremely inefficient. Now. Compare this with how an HTML parser works:

get a file handle on the HTML file
parse tags ... one at a time
for each tag parsed, pass the tag contents off to a handler (a subroutine) if one exists
for each handler called, modify this tag and/or its text/attributes and return the modified result

In other words, you loop across the HTML file ONCE! Not 12 times like you have in your code.
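The handler idea can be sketched in a few lines of core Perl. This is only an illustration under stated assumptions: the @tokens list is a hypothetical stand-in for what a real parser would hand you one token at a time, and the handlers in %handler are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical, already-parsed tokens standing in for a real parser's output.
my @tokens = (
    { tag => 'h1', text => 'Title' },
    { tag => 'p',  text => 'Body'  },
    { tag => 'em', text => 'note'  },
);

# One handler per tag name; tags without a handler pass through untouched.
my %handler = (
    h1 => sub { uc $_[0] },
    p  => sub { "$_[0]." },
);

my $visits = 0;
my @out;
for my $token (@tokens) {    # a single pass over the document
    $visits++;
    my $h = $handler{ $token->{tag} };
    push @out, $h ? $h->( $token->{text} ) : $token->{text};
}

print join( ' | ', @out ), "\n";
print "tokens visited: $visits\n";
```

Adding another kind of change means adding another entry to %handler, not another loop over the whole file.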

Now. Because i am really crazy, here is a rewrite of your code. Maybe this will finally convince you to get on the right track. Maybe. ;)

Note that i do not write back to the original file.

use strict;
use warnings;
use HTML::TokeParser::Simple;

my $parser = HTML::TokeParser::Simple->new('tricky.html');

# these are the tags we just want to skip
my %skip = (
   u      => 1,
   b      => 1,
   i      => 1,
   em     => 1,
   big    => 1,
   img    => 1,
   strong => 1,
);

# these are the styles we are going to add to h, p, and li tags
my %modify = (
   h  => ';text-indent: 10px; word-spacing: 30px; letter-spacing: 3px; color: black',
   p  => ';text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black',
   li => ';text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black',
);

while (my $token = $parser->get_token) {

   # replace body bgcolor
   if ($token->is_start_tag('body')) {
      $token->set_attr(style => 'background-color: white');
   }

   # find and skip our "skip" tags
   next if $token->is_tag and $skip{$token->return_tag};

   # find and modify attributes for our "modify" tags
   if ($token->is_start_tag) {
      my $candidate = $token->return_tag;
      $candidate =~ s/h[1-6]/h/i;  #hack to handle all h tags

      # here we get the original style attr and add the new CSS
      if (my $add_attr = $modify{$candidate}) {
         my $orig_attr = $token->return_attr;
         $orig_attr->{style} .= $add_attr;
         $token->set_attr(%$orig_attr);
      }
   }

   # just print to STDOUT ... change to fit your needs
   print $token->as_is; 
}

Simply amazing, no? :) Yes. But even more amazing would be to simply OVERRIDE THE CSS! maybe something like:

body { color: black; }
u,b,i,em,img,big,strong {text-decoration: none;}

h1,h2,h3,h4,h5,h6 {text-indent: 10px; word-spacing: 30px; letter-spacing: 3px; color: black;}

p,li {text-indent: 10px; word-spacing: 10px; letter-spacing: 2px; color: black; }

No Perl needed at all. Not sure if this will work, but had your web page used proper CSS in the first place (that is, CSS defined in a separate file, not inlined into the HTML), this would have made your task next to trivial.

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)

Extract inline styles to an external style sheet ( was Re: Re: Using a config file in my regexp script.) by clscott (Friar) on Sep 17, 2003 at 19:13 UTC
In regard to jeffa's final suggestion, here is a utility that will strip out your inline style attributes, put them in an external stylesheet, and output a new HTML file with the style attributes removed and the new CSS file linked in. All with HTML::TokeParser::Simple.

Some work to beautify the output or make it operate over multiple files is left as an exercise for the reader.

#!/opt/perl/bin/perl -w
use strict;
use warnings;

use HTML::TokeParser::Simple;
use IO::File;

my ( $htmlInfile, $htmlOutfile, $cssOutfile ) = @ARGV;

$htmlOutfile ||= 'out.html';
$cssOutfile  ||= 'out.css';

my $htmlFile = new IO::File "> $htmlOutfile"
    or die "Can't open $htmlOutfile for writing: $!\n";
my $cssFile  = new IO::File "> $cssOutfile"
    or die "Can't open $cssOutfile for writing: $!\n";

my $parser = HTML::TokeParser::Simple->new($htmlInfile);

my %styles;
while (my $token = $parser->get_token) {

    # link in our new css file
    if ($token->is_end_tag('/head') ){
        $htmlFile->print( "<link rel='stylesheet' type='text/css' href='$cssOutfile'>\n" );
    }

    # find and remove inline style definitions
    if ($token->is_start_tag) {
        my $tag  = $token->return_tag;
        my $attr = $token->return_attr;

        # If we have an inline style attribute get the value
        # then delete it from the tag
        if ( defined $attr->{style} ){
            $styles{ $attr->{style} }{$tag} += 1;
            $token->delete_attr('style');
        }
    }
    $htmlFile->print( $token->as_is );
}

# print our collected styles into the css file
while (my ( $style, $tags ) = each %styles ){
    #li, p { font-family: Times; font-size: 10pt }
    $cssFile->print( join( ', ', sort keys( %$tags ) ) . " { $style }\n");
}

-- Clayton
Re: Extract inline styles to an external style sheet ( was Re: Re: Using a config file in my regexp script.) by Anonymous Monk on Jul 08, 2015 at 14:36 UTC
This does not handle fonts?