in reply to Parsing emails with attachments

After a quick chat in the CB, it was noted that this line:
$body =~ s/<.+\n*.*?>//g; #remove all <> html tags
isn't reliable.
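To make the problem concrete, here is a minimal sketch of one way that substitution goes wrong. The sub name `strip_naive` and the sample string are made up for illustration; only the regex comes from the post. Because `.+` is greedy, a line holding several tags is matched from the first `<` to the last `>`, so the text between the tags is removed as well.

```perl
use strict;
use warnings;

# The substitution quoted above, wrapped in a sub for the demo.
# (strip_naive is a hypothetical name, not part of the original script.)
sub strip_naive {
    my ($body) = @_;
    $body =~ s/<.+\n*.*?>//g;
    return $body;
}

# Several tags on one line: the greedy .+ runs from the first "<"
# to the last ">" on that line and takes the text with it.
my $html = "<b>bold</b> text <i>it</i>\n";
my $out  = strip_naive($html);

print "result is just a newline\n" if $out eq "\n";
```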

Excerpt from CB:
<ferrency> LTjake: given sufficiently complicated html, no regex can remove the tags and only the tags
<Petruchio> Hehe... I'm not so sure, given a sufficiently complicated regex. :-)

Maybe Ovid's code could help? (thread parent)

Re: Re: Parsing emails with attachments
by PodMaster (Abbot) on Sep 06, 2002 at 12:32 UTC
    I've said it before, and I'll say it again, this is one damn interesting regex, and I didn't write it ;) ( strip HTML tags )
    use Benchmark 'cmpthese';
    my $data = join'',<DATA>;
    print untag($data), "\n\n\n", 'X' x 79, "\n\n\n",
          untagg($data), "\n\n\n", 'X' x 79, "\n\n\n",;

    warn "benchmarking the dumb way";
    cmpthese(-3, {
        regex => sub { untag($data);},
        parse => sub { untagg($data); },
    });

    warn "benchmarking the smart way";
    use HTML::Parser;
    my $p = HTML::Parser->new( api_version => 3);
    my $ret ="";
    $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'},'text,event');
    cmpthese(-3, {
        regex => sub { untag($data);},
        parse => sub { $p->parse($data); },
    });

    sub untagg {
        local $_ = $_[0] || $_;
        require HTML::Parser;
        my $p = HTML::Parser->new( api_version => 3);
        my $ret ="";
        $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'} ,'text,event');
        $p->parse($_);
        return($ret);
    }

    sub untag {
        local $_ = $_[0] || $_;
    # ALGORITHM:
    #   find < ,
    #       comment <!-- ... -->,
    #       or comment <? ... ?> ,
    #       or one of the start tags which require correspond
    #           end tag plus all to end tag
    #       or if \s or ="
    #           then skip to next "
    #           else [^>]
    #   >
        s{
            <               # open tag
            (?:             # open group (A)
              (!--)         |   # comment (1) or
              (\?)          |   # another comment (2) or
              (?i:          # open group (B) for /i
                ( TITLE  |  # one of start tags
                  SCRIPT |  # for which
                  APPLET |  # must be skipped
                  OBJECT |  # all content
                  STYLE     # to correspond
                )           # end tag (3)
              )             |   # close group (B), or
              ([!/A-Za-z])  # one of these chars, remember in (4)
            )               # close group (A)
            (?(4)           # if previous case is (4)
              (?:           # open group (C)
                (?!         # and next is not : (D)
                  [\s=]     # \s or "="
                  ["`']     # with open quotes
                )           # close (D)
                [^>] |      # and not close tag or
                [\s=]       # \s or "=" with
                `[^`]*` |   # something in quotes ` or
                [\s=]       # \s or "=" with
                '[^']*' |   # something in quotes ' or
                [\s=]       # \s or "=" with
                "[^"]*"     # something in quotes "
              )*            # repeat (C) 0 or more times
            |               # else (if previous case is not (4))
              .*?           # minimum of any chars
            )               # end if previous char is (4)
            (?(1)           # if comment (1)
              (?<=--)       # wait for "--"
            )               # end if comment (1)
            (?(2)           # if another comment (2)
              (?<=\?)       # wait for "?"
            )               # end if another comment (2)
            (?(3)           # if one of tags-containers (3)
              </            # wait for end
              (?i:\3)       # of this tag
              (?:\s[^>]*)?  # skip junk to ">"
            )               # end if (3)
            >               # tag closed
        }{}gsx;             # STRIP THIS TAG
        return $_ ? $_ : "";
    }
    __DATA__
    u h a h
    <html>
    <head>
    <title>This title contains Perl but does not get changed.</title>
    </head>
    <body>
    <p>This is some text containing the term 'perl'.</p>
    <ol>
      <li>Unix</li>
      <li>Perl</li>
      <li>Linux</li>
    </ol>
    <p>Notice how the term perl in the following link doesn't change, but the text does.
    <a href="http://www.perlmonks.org">Perlmonks.org</a></p>
    </body>
    </html>
    > < > < > < !
    ] [ ] [ ] [ ] [ -
    <!-- -->
    2 3 4 5 5
    <<a href<<a>>
    <!-- foo bar -->
    <SCRIPT language="javascript">
    // this is valid html
    // whether you like it or not
    // same goes for older browsers
    </SCRIPT>
    And the results are ;)
      I am using this script, but I want the attachments to go to the same directory, so I added this line to the script:
      $parser->output_dir("opt/htdocs/webcache/attachments/");
      Then I take the attachment names and hyperlink them in the body of the message, which goes to a loadfile for a database:
      attachment: file.doc
      body: https://webpage/webcache/attachments/file.doc
      The issue is that when MIME::Entity puts the file there it does collision resolution. I need to know how to get the filename it actually assigns to the file, not the "recommended" filename from the message. Can you help with that?
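One possible approach, sketched under assumptions: after parsing, each leaf entity's body object knows where it was written, so `$entity->bodyhandle->path` should give the name actually used on disk, while `$entity->head->recommended_filename` gives the name from the message. The helper `saved_paths`, the test message and the temporary output directory below are invented for the demo and are not taken from the script in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use MIME::Entity;
use MIME::Parser;

# Hypothetical helper: return [recommended name, path on disk]
# for every leaf part of a parsed entity.
sub saved_paths {
    my ($entity) = @_;
    my @parts = $entity->parts;
    return map { saved_paths($_) } @parts if @parts;
    my $body = $entity->bodyhandle or return;
    return [ $entity->head->recommended_filename, $body->path ];
}

# Made-up test message with one attachment named file.doc.
my $msg = MIME::Entity->build(Type => 'multipart/mixed');
$msg->attach(Type => 'text/plain', Data => "see attachment\n");
$msg->attach(
    Type     => 'application/octet-stream',
    Encoding => 'base64',
    Filename => 'file.doc',
    Data     => "dummy contents\n",
);

my $parser = MIME::Parser->new;
$parser->output_dir(tempdir(CLEANUP => 1));   # temp dir, for the demo only

# Parse the same message twice so the second run meets an existing file.
for my $run (1, 2) {
    my $entity = $parser->parse_data($msg->stringify);
    for my $pair (saved_paths($entity)) {
        my ($recommended, $path) = @$pair;
        printf "run %d: recommended=%s saved=%s\n",
            $run, defined $recommended ? $recommended : '(none)', $path;
    }
}
```

The hyperlink in the message body could then be built from the basename of the saved path instead of the recommended name.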
Re: Re: Parsing emails with attachments
by injunjoel (Priest) on Sep 23, 2003 at 23:25 UTC
    Try removing all <> tags with:
    $body =~ s/<[^>]*>//sg;
    The s modifier (at the end) makes . match newlines, so $body is treated as a single line. This pattern has no ., though: [^>] already matches \n, so tags that span several lines are removed with or without it.
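A quick check of that substitution, as a sketch with made-up input strings (the sub name `strip_simple` is hypothetical too). It handles a tag split across lines, but it still assumes that `>` never appears inside a tag, which is not true for quoted attribute values.

```perl
use strict;
use warnings;

# The substitution from the post, wrapped in a sub for the demo.
sub strip_simple {
    my ($body) = @_;
    $body =~ s/<[^>]*>//sg;
    return $body;
}

# A tag split across two lines is removed cleanly.
my $multiline = strip_simple("<p\nclass='x'>hi</p>");
print "multiline tag ok\n" if $multiline eq "hi";

# A ">" inside an attribute value ends the match too early,
# so part of the tag is left behind in the text.
my $quoted = strip_simple(qq{<a title="a > b">link</a>});
print "leftover: $quoted\n";
```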