in reply to Re: Parsing emails with attachments
in thread Parsing emails with attachments

I've said it before, and I'll say it again: this is one damn interesting regex (it strips HTML tags), and I didn't write it ;)
use Benchmark 'cmpthese';

my $data = join'',<DATA>;

print untag($data),
      "\n\n\n", 'X' x 79, "\n\n\n",
      untagg($data),
      "\n\n\n", 'X' x 79, "\n\n\n",;

warn "benchmarking the dumb way";
cmpthese(-3, {
    regex => sub { untag($data);},
    parse => sub { untagg($data); },
});

warn "benchmarking the smart way";
warn "benchmarking the smart way";
use HTML::Parser;
my $p = HTML::Parser->new( api_version => 3);
my $ret ="";
$p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'},'text,event');

cmpthese(-3, {
    regex => sub { untag($data);},
    parse => sub { $p->parse($data); },
});

sub untagg {
    local $_ = $_[0] || $_;
    require HTML::Parser;
    my $p = HTML::Parser->new( api_version => 3);
    my $ret ="";
    $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'} ,'text,event');
    $p->parse($_);
    return($ret);
}

sub untag {
  local $_ = $_[0] || $_;
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of start tags
          SCRIPT |  #     for which
          APPLET |  #     must be skipped
          OBJECT |  #     all content
          STYLE     #     to correspond
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
  }{}gsx;           # STRIP THIS TAG
  return $_ ? $_ : "";
}

__DATA__
u h a h
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does.
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>

> < >
< > <

! ] [ ] [ ] [ ] [ - <!-- --> 2 3 4 5 5

<<a href<<a>>

<!-- foo bar -->
<SCRIPT language="javascript">
// this is valid html
// whether you like it or not
// same goes for older browsers
</SCRIPT>
And the results are ;)

E:\TEH-\F$\dev>perl regexstriphtml.pl
u h a h





This is some text containing the term 'perl'.

    Unix
    Perl
    Linux

Notice how the term perl in the following link doesn't change, but the text does.
Perlmonks.org



> < >
< > <

! ] [ ] [ ] [ ] [ -  2 3 4 5 5

<>






XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


u h a h


This title contains Perl but does not get changed.


This is some text containing the term 'perl'.

    Unix
    Perl
    Linux

Notice how the term perl in the following link doesn't change, but the text does.
Perlmonks.org



> < >
< > <

! ] [ ] [ ] [ ] [ -  2 3 4 5 5

<>



// this is valid html
// whether you like it or not
// same goes for older browsers



XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX


benchmarking the dumb way at regexstriphtml.pl line 13, <DATA> line 30.
Benchmark: running parse, regex, each for at least 3 CPU seconds...
     parse:  4 wallclock secs ( 3.30 usr +  0.00 sys =  3.30 CPU) @ 5741.58/s (n=18930)
     regex:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 11885.77/s (n=38082)
         Rate parse regex
parse  5742/s    --  -52%
regex 11886/s  107%    --
benchmarking the smart way at regexstriphtml.pl line 20, <DATA> line 30.
benchmarking the smart way at regexstriphtml.pl line 21, <DATA> line 30.
Benchmark: running parse, regex, each for at least 3 CPU seconds...
     parse:  3 wallclock secs ( 3.30 usr +  0.00 sys =  3.30 CPU) @ 6661.51/s (n=21963)
     regex:  3 wallclock secs ( 3.20 usr +  0.00 sys =  3.20 CPU) @ 11816.48/s (n=37860)
         Rate parse regex
parse  6662/s    --  -44%
regex 11816/s   77%    --

E:\TEH-\F$\dev>
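
The regex above leans heavily on conditional subpatterns such as (?(3) ... ), which only demand the closing tag when the container-tag group actually matched. Here is that construct in isolation, as a minimal sketch; the helper name balanced_number is made up for illustration and is not part of the script above.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the post): accepts a number that is either
# bare or wrapped in a matching pair of parentheses.
sub balanced_number {
    my ($text) = @_;
    return $text =~ m{
        ^
        ( \( )?      # group 1: optional opening parenthesis
        \d+          # the number itself
        (?(1) \) )   # only if group 1 matched, require the closing parenthesis
        $
    }x ? 1 : 0;
}

# Prints 1 for '42' and '(42)', 0 for the unbalanced '(42' and '42)'.
print "$_ => ", balanced_number($_), "\n" for '42', '(42)', '(42', '42)';
```

untag uses the same trick: the "wait for -->", "wait for ?>" and "wait for </TAG>" parts are only enforced when groups (1), (2) or (3) took part in the match.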

____________________________________________________
** The Third rule of perl club is a statement of fact: pod is sexy.

Re^3: Parsing emails with attachments
by Anonymous Monk on Oct 08, 2007 at 19:34 UTC
    I am using this script, but I want the attachments to go to the same directory, so I added this to the script:
        $parser->output_dir("opt/htdocs/webcache/attachments/");
    Then I take the attachment names and hyperlink them in the body of the message, which goes to a load file for a database:
        attachment: file.doc
        body: https://webpage/webcache/attachments/file.doc
    The issue is that when MIME::Entity puts the file there, it does collision resolution. I need to know how to get the filename it is actually assigning to the file, not the "recommended" filename from the message. Can you help with that?
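
One possible approach, sketched under the assumption that the script uses MIME::Parser with its default filer: after parsing, each part's body handle knows the path it was really written to, so the link can be built from that path instead of from the recommended filename. The helper name saved_attachments is hypothetical, and the sketch skips any part that has no recommended filename.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use MIME::Parser;

# Hypothetical helper (not part of the original script): parse a raw message
# into $dir and return [recommended name, name actually written] pairs.
sub saved_attachments {
    my ($raw_message, $dir) = @_;

    my $parser = MIME::Parser->new;
    $parser->output_dir($dir);

    my $entity = $parser->parse_data($raw_message);

    my @saved;
    for my $part ($entity->parts_DFS) {
        my $body = $part->bodyhandle or next;   # multipart containers have no body
        my $path = $body->path       or next;   # in-memory bodies have no path
        my $recommended = $part->head->recommended_filename;
        next unless defined $recommended;       # only parts that carry a filename
        push @saved, [ $recommended, basename($path) ];
    }
    return @saved;
}
```

The second element of each pair is the name on disk after any collision resolution, so that is the one to put in the hyperlink.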