How to decide if an HTML is expendable?

monsieur_champs has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,
I'm working with HTML files to generate TIFF images, as part of a fax-to-email gateway. I currently have a funny problem: the gateway generates blank pages (really no content at all), depending on whether the body of the email is HTML-encoded.

I decided to work around this problem by checking whether the HTML file presented for TIFF generation is capable of producing any content. To that end, I ended up writing this little module (for reusability) and the complementary test files below:

# File IsHTMLEmpty.pm:
package IsHTMLEmpty;

use strict;
use warnings;
use Carp qw/croak/;

use base qw/ Exporter /;
use vars qw/ @EXPORT /;
@EXPORT = qw/ &isHTMLEmpty /;

sub isHTMLEmpty( $ ){
  my $filename = shift;
  return undef
    unless -r $filename;

  open IN, $filename
    or croak $!;
  local $/ = undef;
  my $html = <IN>;
  close IN
    or croak $!;

  return $html =~ m{<body[^>]*>\s*</body>}mo;

}
1;
__END__

This module just wraps a single function, isHTMLEmpty(), that decides whether the file presented is capable of generating viewable content. To use it, you could write something like this example script:

#!/usr/bin/perl
# File test:
use warnings;
use strict;
use lib '/path/to/my/lib_dir/';

use Carp qw/ croak confess /;
use IsHTMLEmpty;

confess "isHTMLEmpty isn't defined. I'm sorry.\n"
  unless defined &isHTMLEmpty;

confess "Sorry, this HTML can generate content.\n"
  unless isHTMLEmpty './test.html';

print "Ok.\n"
  if isHTMLEmpty './test.html';
__END__

And finally, when presented with the file below, I get the right answer, that is: this file is expendable, so you can safely discard it and generate one less TIFF image to send via fax:

<html>
<head> <title>Titles aren't considered content.</title>
</head><body>

</body>
</html>

But when I add just a single <br> tag, the file is still expendable and should be discarded, as it should be if there is just an empty <p> inside it. The problem is that my code can't detect this (yet) and tells me that the file is necessary because it can (?) generate viewable content.
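
To make the failure concrete, here is a minimal sketch that runs the module's pattern against three sample bodies. The sample strings are hypothetical, not taken from real emails, and the /m and /o flags are left off because they have no effect on this pattern.

```perl
use strict;
use warnings;

# Same pattern as in isHTMLEmpty(), without the /m and /o flags.
my $empty_body = qr{<body[^>]*>\s*</body>};

# Hypothetical sample documents, one per case discussed above.
my %sample = (
    'whitespace only' => "<html><body>\n\n</body></html>",
    'single <br>'     => "<html><body>\n<br>\n</body></html>",
    'empty <p>'       => "<html><body><p></p></body></html>",
);

for my $name (sort keys %sample) {
    my $verdict = $sample{$name} =~ $empty_body ? 'expendable' : 'kept';
    print "$name: $verdict\n";
}
```

Only the whitespace-only body is reported as expendable; the other two are kept, because the pattern allows nothing but whitespace between the body tags.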

Ok, enough talk. The question is: should I implement a big regular expression to deal with all (or most) of the cases and forget about it, or Is There A Perlish Way To Do It^(tm)?

What I expect as an answer: suggestions, snippets or pointers to modules capable of implementing this as quickly as possible. I need this ready as soon as possible. And yes, you can golf down my problem if you're capable.

Thank you all for your attention, and may the gods bless you all.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
monsieur_champs


Replies are listed 'Best First'.
Re: How to decide if an HTML is expendable? by fglock (Vicar) on Jul 14, 2003 at 21:36 UTC
You can use HTML::Strip and check if it returns only spaces.

HTML::Strip - Perl extension for stripping HTML markup from text.

SYNOPSIS
  use HTML::Strip;
  my $hs = HTML::Strip->new();
  my $clean_text = $hs->parse( $raw_html );
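
A minimal sketch of that check, assuming HTML::Strip is installed from CPAN. The function name is_blank_after_strip is hypothetical, and how the module treats entities or particular tags depends on its options, so take this as a starting point and not a finished test.

```perl
use strict;
use warnings;
use HTML::Strip;

# Hypothetical helper: true when stripping the markup leaves only whitespace.
sub is_blank_after_strip {
    my ($raw_html) = @_;
    my $hs   = HTML::Strip->new();
    my $text = $hs->parse($raw_html);
    $hs->eof;                             # reset parser state before any reuse
    $text = '' unless defined $text;      # guard against an undefined result
    return $text !~ /\S/ ? 1 : 0;
}

print is_blank_after_strip("<html><body>\n<br>\n<p></p>\n</body></html>")
    ? "expendable\n"
    : "has text\n";
```

Entities such as &nbsp; may need extra handling, depending on whether the module decodes them on your system.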
Re: How to decide if an HTML is expendable? by diotalevi (Canon) on Jul 14, 2003 at 21:26 UTC
Another useless use of /o. (Added) See also "Far More Than Everything You've Ever Wanted to Know about Prototypes in Perl" and "When to use Prototypes?"; in this case you shouldn't.

Now with that out of the way, the general problem is to decide whether you have non-markup text. You could further generalize and include images in your "non-blank" idea, but that just makes life harder. I'd probably want you to use something like HTML::TokeParser and make your markup/non-markup decisions that way. For the moment though, this will work, sort of. You should decide whether you want to use my cheap-n-dirty test or do it properly.

sub isHTMLEmpty {
    my $filename = shift;
    local *HTML;
    local $/;                 # slurp the whole file at once
    open HTML, "<", $filename
        or die "Couldn't open $filename for reading: $!";
    my $html = <HTML>;
    close HTML;
    $html =~ s/<[^>]*>//g;    # drop anything that looks like a tag
    $html =~ s/\s+//g;        # drop all whitespace
    return $html eq '';       # true when only markup and whitespace were present
}
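
For the "do it properly" route, here is a rough sketch using HTML::TokeParser, assuming the module is installed from CPAN. The function name has_no_visible_text and the list of ignored elements are assumptions made for illustration. Images are deliberately not counted as content, which is a design choice you may want to reverse.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Elements whose text never appears on the rendered page (an assumed list).
my %invisible = map { $_ => 1 } qw(title script style);

# Hypothetical helper: true when no visible text exists outside those elements.
sub has_no_visible_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Couldn't set up the parser: $!";
    my $hidden = 0;    # nesting depth inside invisible elements
    while (my $token = $parser->get_token) {
        my ($type, $value) = @$token;
        if ($type eq 'S') {
            $hidden++ if $invisible{$value};
        }
        elsif ($type eq 'E') {
            $hidden-- if $invisible{$value} && $hidden;
        }
        elsif ($type eq 'T' && !$hidden) {
            # get_token returns undecoded text, so drop non-breaking spaces by hand.
            (my $text = $value) =~ s/&nbsp;|&#160;//gi;
            return 0 if $text =~ /\S/;
        }
    }
    return 1;
}
```

Because only text tokens are inspected, a body holding nothing but <br>, an empty <p>, or a title in the head counts as expendable.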
Re: How to decide if an HTML is expendable? by TVSET (Chaplain) on Jul 14, 2003 at 21:28 UTC
Take a look at html2ps. It is written in Perl and does a pretty good job of converting HTML to PostScript. You can then either use the PostScript with your fax program, or convert it to TIFF with ImageMagick's `convert`.

Leonid Mamtchenkov aka TVSET
Re: Re: How to decide if an HTML is expendable? by Willard B. Trophy (Hermit) on Jul 15, 2003 at 21:11 UTC
html2ps is good, if rather slow. As a quick initial "emptiness test", I suggest piping the HTML through the w3m browser, and seeing what you get. And isn't this really an email-to-fax gateway?

-- bowling trophy thieves, die!