Re: Re: Parsing emails with attachments

I've said it before and I'll say it again: this is one damn interesting regex for stripping HTML tags, and I didn't write it ;)

use Benchmark 'cmpthese';
my $data = join'',<DATA>;

print untag($data),
      "\n\n\n",
      'X' x 79,
      "\n\n\n",
      untagg($data),
      "\n\n\n",
      'X' x 79,
      "\n\n\n",;

warn "benchmarking the dumb way";
cmpthese(-3,
{
    regex => sub { untag($data);},
    parse => sub { untagg($data); },
});

warn "benchmarking the smart way";
warn "benchmarking the smart way";
    use HTML::Parser;
    my $p = HTML::Parser->new( api_version => 3);
    my $ret ="";
    $p->handler(default =>
                sub { $ret .= $_[0] if $_[1] eq 'text'},'text,event');

cmpthese(-3,
{
    regex => sub { untag($data);},
    parse => sub { $p->parse($data); },
});


sub untagg {
    local $_ = $_[0] || $_;
    require HTML::Parser;
    my $p = HTML::Parser->new( api_version => 3);
    my $ret ="";
    $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'}
                ,'text,event');
    $p->parse($_);
    return($ret);
}

sub untag {
  local $_ = $_[0] || $_;
# ALGORITHM:
#   find < ,
#       then a comment <!-- ... -->,
#       or a comment <? ... ?> ,
#       or one of the start tags that requires a corresponding
#           end tag, plus everything up to that end tag,
#       or any other tag: at each \s or = followed by a quote,
#           skip to the matching quote,
#           else take [^>]
#   >
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of the start tags
          SCRIPT |  #     whose content
          APPLET |  #     must be skipped
          OBJECT |  #     entirely, up to
          STYLE     #     the corresponding
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  return defined $_ ? $_ : "";   # keep a literal "0" instead of turning it into ""
}



__DATA__
u h a h
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>

> < >
< > <

! ] [ ] [ ] [ ] [ - <!-- --> 2 3 4 5 5

<<a href<<a>>

<!-- foo bar -->
<SCRIPT language="javascript">
// this is valid html
// whether you like it or not
// same goes for older browsers
</SCRIPT>
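
For anyone puzzling over the (?(N)...) bits: that is Perl's conditional subpattern, which applies a pattern only if capture group N took part in the match. Here is a stripped-down sketch of just the comment case. strip_simple is my own hypothetical helper, not part of the regex above, and it handles far fewer cases (no quoted attributes, no SCRIPT/STYLE containers):

```perl
use strict;
use warnings;

# Hypothetical helper for illustration only: strips comments and simple
# tags to show the (?(N)...) conditional in isolation.
sub strip_simple {
    my ($text) = @_;
    $text =~ s{
        <                 # opening angle bracket
        (?: (!--)         #   group 1 is set only for a comment opener
          | [/A-Za-z]     #   otherwise a tag name or a closing slash
        )
        .*?               # shortest run of characters
        (?(1)             # if group 1 took part in the match
          (?<=--)         #   require "--" just before the close
        )
        >                 # closing angle bracket
    }{}gsx;
    return $text;
}

print strip_simple('keep<!-- x > y --> <i>this</i>'), "\n";   # prints "keep this"
```

The lookbehind is only enforced when group 1 matched, so a plain tag closes at the first > while a comment has to end in -->, even if it contains a > of its own.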

And the results are ;)


E:\TEH-\F$\dev>perl regexstriphtml.pl
u h a h





This is some text containing the term 'perl'.

    Unix
    Perl
    Linux

Notice how the term perl in the following link doesn't change, but the text does.
Perlmonks.org



> < >
< > <

! ] [ ] [ ] [ ] [ -  2 3 4 5 5

<>






XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


u h a h


This title contains Perl but does not get changed.


This is some text containing the term 'perl'.

    Unix
    Perl
    Linux

Notice how the term perl in the following link doesn't change, but the text does.
Perlmonks.org



> < >
< > <

! ] [ ] [ ] [ ] [ -  2 3 4 5 5

<>



// this is valid html
// whether you like it or not
// same goes for older browsers



XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


benchmarking the dumb way at regexstriphtml.pl line 13, <DATA> line 30.
Benchmark: running parse, regex, each for at least 3 CPU seconds...
     parse:  4 wallclock secs ( 3.30 usr +  0.00 sys =  3.30 CPU) @ 5741.58/s (n=18930)
     regex:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 11885.77/s (n=38082)
         Rate parse regex
parse  5742/s    --  -52%
regex 11886/s  107%    --
benchmarking the smart way at regexstriphtml.pl line 20, <DATA> line 30.
benchmarking the smart way at regexstriphtml.pl line 21, <DATA> line 30.
Benchmark: running parse, regex, each for at least 3 CPU seconds...
     parse:  3 wallclock secs ( 3.30 usr +  0.00 sys =  3.30 CPU) @ 6661.51/s (n=21963)
     regex:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 11816.48/s (n=37860)
         Rate parse regex
parse  6662/s    --  -44%
regex 11816/s   77%    --

E:\TEH-\F$\dev>
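
One caveat on the "smart way" numbers: the shared parser there is fed the same document over and over without an eof call, and $ret keeps growing between iterations. Below is a minimal sketch of a reusable parser that resets itself on each call, assuming HTML::Parser 3.x is installed. $buffer, $parser and strip_with_parser are my own names, not from the code above:

```perl
use strict;
use warnings;
use HTML::Parser;

# One parser object shared by every call; text events append to $buffer.
my $buffer = '';
my $parser = HTML::Parser->new(
    api_version => 3,
    text_h      => [ sub { $buffer .= $_[0] }, 'text' ],
);

# Hypothetical wrapper: clears the buffer, parses one complete document,
# then calls eof so pending text is flushed and the parser state is reset.
sub strip_with_parser {
    my ($html) = @_;
    $buffer = '';
    $parser->parse($html);
    $parser->eof;
    return $buffer;
}

print strip_with_parser('<p>Hi <b>there</b></p>'), "\n";
```

Swapping this in for the parse entry of the second cmpthese call would time one complete parse per iteration.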

____________________________________________________
** The Third rule of perl club is a statement of fact: pod is sexy.
