Fellow monks,
I'm working with HTML files to generate TIFF images, as part of a fax-to-email gateway. I currently have a funny problem: my fax-to-email gateway generates blank pages (really no content at all) depending on whether the body of the email is HTML-encoded or not.

I decided to work around this problem by checking whether the HTML file used to generate the TIFF image is capable of producing any content. To that end, I wrote this little module (for reusability) and complementary test files (below):

    # File IsHTMLEmpty.pm:
    package IsHTMLEmpty;

    use strict;
    use warnings;
    use Carp qw/croak/;
    use base qw/ Exporter /;
    use vars qw/ @EXPORT /;
    @EXPORT = qw/ &isHTMLEmpty /;

    sub isHTMLEmpty( $ ){
        my $filename = shift;
        return undef unless -r $filename;
        open IN, $filename or croak $!;
        local $/ = undef;
        my $html = <IN>;
        close IN or croak $!;
        return $html =~ m{<body[^>]*>\s*</body>}mo;
    }

    1;
    __END__

This module just wraps a single function, isHTMLEmpty(), that decides whether the given file is capable of generating viewable content or not. To use it, you could use something like this example script:

    #!/usr/bin/perl
    # File test:
    use warnings;
    use strict;
    use lib '/path/to/my/lib_dir/';
    use Carp qw/ croak confess /;
    use IsHTMLEmpty;

    confess "isHTMLEmpty isn't defined. I'm sorry.\n" unless defined &isHTMLEmpty;
    confess "Sorry, this HTML can generate content.\n" unless isHTMLEmpty './test.html';
    print "Ok.\n" if isHTMLEmpty './test.html';
    __END__

And finally, when presented with the file below, I get the right answer, that is: this file is expendable; you can safely discard it and generate one less TIFF image to send via fax:

    <html>
    <head>
    <title>Titles aren't considered content.</title>
    </head><body>
    </body>
    </html>

But when I add just a single <br> tag, the file is still expendable and should be discarded, as it should be if there is just an empty <p> inside it. The problem is that my code can't detect this (yet) and tells me that the file is necessary because it can (?) generate viewable content.
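To make the failing cases concrete, here is a minimal sketch of the kind of check I have in mind, using only core Perl. The function name html_has_no_visible_content is hypothetical (it is not part of IsHTMLEmpty.pm), and the list of tags treated as "always draws something" is an assumption. Regex-based tag stripping is fragile on real-world HTML, so take this as an illustration of the idea rather than a finished solution:

```perl
use strict;
use warnings;

# Hypothetical helper: returns true when the HTML string holds nothing
# that a renderer would be expected to draw on the page.
sub html_has_no_visible_content {
    my ($html) = @_;
    return 1 unless defined $html;

    # Drop comments and elements whose contents are never rendered.
    $html =~ s{<!--.*?-->}{}gs;
    $html =~ s{<(head|script|style)\b[^>]*>.*?</\1\s*>}{}gis;

    # Assumption: these tags produce visible output even without text.
    return 0 if $html =~ m{<(?:img|hr|input|object|embed)\b}i;

    # Strip the remaining tags, then treat non-breaking spaces as blanks.
    $html =~ s{<[^>]*>}{}g;
    $html =~ s{&(?:nbsp|#160|#xA0);}{ }gi;

    # Expendable only if nothing but whitespace is left.
    return $html !~ /\S/;
}
```

With this, a body holding only <br>, an empty <p>, or &nbsp; counts as expendable, while any leftover text or an <img> tag does not.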

OK, enough talk. The question is: should I implement one big regular expression to deal with all (or at least most) of the cases and forget about it, or is there A Perlish Way To Do It(tm)?

What I expect as an answer: suggestions, snippets, or pointers to modules that would let me implement this as quickly as possible. I need this ready as soon as possible. And yes, you can golf down my problem if you're able.

Thank you all for your attention, and may the gods bless you all.
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
monsieur_champs


In reply to How to decide if an HTML is expendable? by monsieur_champs
