Re: Parsing emails with attachments

Replies are listed 'Best First'.
Re: Re: Parsing emails with attachments by PodMaster (Abbot) on Sep 06, 2002 at 12:32 UTC
I've said it before, and I'll say it again, this is one damn interesting regex, and I didn't write it ;) ( strip HTML tags )

And the results are ;)
Re^3: Parsing emails with attachments by Anonymous Monk on Oct 08, 2007 at 19:34 UTC
I am using this script, but I want the attachments to go to the same directory. I set this in the script with $parser->output_dir("opt/htdocs/webcache/attachments/"); Then I take the attachment names and hyperlink them in the body of the message, which goes to a load file for a database:

attachment: file.doc
body: https://webpage/webcache/attachments/file.doc

The issue is that when MIME::Entity puts the file there it does collision resolution. I need to know how to get the filename it actually assigns to the file, not the "recommended" filename from the message. Can you help with that?
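One possible way to approach the question above, sketched under assumptions: with MIME-tools, each part that was written to disk has a body handle, and its path method should report the name actually used after collision resolution, while the head's recommended_filename is only the name suggested by the message. The sample message, the temporary directory, and the helper names sample_message and saved_files are hypothetical and exist only for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use MIME::Entity;
use MIME::Parser;

# Build a tiny message with one attachment so the sketch needs no input file.
sub sample_message {
    my $msg = MIME::Entity->build(
        From    => 'sender@example.com',
        To      => 'rcpt@example.com',
        Subject => 'demo',
        Type    => 'multipart/mixed',
    );
    $msg->attach(Type => 'text/plain', Data => "see attachment\n");
    $msg->attach(
        Type        => 'application/msword',
        Encoding    => 'base64',
        Disposition => 'attachment',
        Filename    => 'file.doc',
        Data        => 'placeholder bytes',
    );
    return $msg->stringify;
}

# Return [recommended name, path on disk] for every part written to disk.
sub saved_files {
    my ($entity) = @_;
    my @files;
    for my $part ($entity->parts_DFS) {
        my $body = $part->bodyhandle or next;   # multipart containers have none
        my $path = $body->path;
        next unless defined $path;              # in-memory bodies have no path
        push @files, [ $part->head->recommended_filename, $path ];
    }
    return @files;
}

my $parser = MIME::Parser->new;
$parser->output_dir(tempdir(CLEANUP => 1));     # stand-in for the real directory

# Parse the same message twice so the second file.doc collides with the first.
for my $pass (1, 2) {
    my $entity = $parser->parse_data(sample_message());
    for my $file (saved_files($entity)) {
        my ($recommended, $path) = @$file;
        printf "pass %d: recommended=%s saved=%s\n",
            $pass, defined $recommended ? $recommended : '(none)', $path;
    }
}
```

The saved path, not the recommended name, is what would go into the hyperlink; File::Basename's basename can reduce it to the bare file name.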
Re: Re: Parsing emails with attachments by injunjoel (Priest) on Sep 23, 2003 at 23:25 UTC
Try to remove all <> tags with: `$body =~ s/<[^>]*>//sg;` The s modifier (at the end) tells the search to treat the $body variable as a single line, which lets . match \n. This pattern has no ., and [^>] already matches newlines, so tags that span several lines are removed either way.
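A minimal sketch of the one-line substitution above, using only core Perl. The helper name strip_tags_simple is made up for illustration. It shows that a tag split across lines is removed, and also where the simple pattern goes wrong: a ">" inside a quoted attribute ends the match early.

```perl
use strict;
use warnings;

# Remove anything that looks like <...>; the character class [^>]
# matches newlines on its own, so no /s modifier is needed.
sub strip_tags_simple {
    my ($body) = @_;
    $body =~ s/<[^>]*>//g;
    return $body;
}

# A tag that spans two lines is still stripped.
print strip_tags_simple("<p\nclass=\"x\">Hello</p>"), "\n";    # prints: Hello

# A ">" inside an attribute value cuts the tag short and leaves debris.
print strip_tags_simple('<a title="a > b">link</a>'), "\n";    # prints:  b">link
```

The second case is one reason the longer regex in this thread tracks quoted attribute values, and why a real parser such as HTML::Parser is usually the safer choice.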

use Benchmark 'cmpthese';
my $data = join'',<DATA>;

print untag($data),
      "\n\n\n",
      'X' x 79,
      "\n\n\n",
      untagg($data),
      "\n\n\n",
      'X' x 79,
      "\n\n\n";

warn "benchmarking the dumb way";
cmpthese(-3,
{
    regex => sub { untag($data);},
    parse => sub { untagg($data); },
});

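warn "benchmarking the smart way";
use HTML::Parser;
# Build the parser once and reuse it for every iteration.
my $p = HTML::Parser->new( api_version => 3);
my $ret ="";
$p->handler(default =>
            sub { $ret .= $_[0] if $_[1] eq 'text'},'text,event');

cmpthese(-3,
{
    regex => sub { untag($data);},
    # Reset the buffer and signal end of input so each run parses one
    # complete document instead of appending to the previous run.
    parse => sub { $ret = ""; $p->parse($data); $p->eof; },
});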


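sub untagg {
    local $_ = $_[0] || $_;
    require HTML::Parser;
    my $p = HTML::Parser->new( api_version => 3);
    my $ret ="";
    # Collect only text events; every tag, comment and declaration is dropped.
    $p->handler(default => sub { $ret .= $_[0] if $_[1] eq 'text'}
                ,'text,event');
    $p->parse($_);
    $p->eof;    # flush any text the parser is still buffering
    return($ret);
}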

sub untag {
  local $_ = $_[0] || $_;
# ALGORITHM:
#   find < ,
#       comment <!-- ... -->,
#       or comment <? ... ?> ,
#       or one of the start tags which require correspond
#           end tag plus all to end tag
#       or if \s or ="
#           then skip to next "
#           else [^>]
#   >
  s{
    <               # open tag
    (?:             # open group (A)
      (!--) |       #   comment (1) or
      (\?) |        #   another comment (2) or
      (?i:          #   open group (B) for /i
        ( TITLE  |  #     one of start tags
          SCRIPT |  #     for which
          APPLET |  #     must be skipped
          OBJECT |  #     all content
          STYLE     #     to correspond
        )           #     end tag (3)
      ) |           #   close group (B), or
      ([!/A-Za-z])  #   one of these chars, remember in (4)
    )               # close group (A)
    (?(4)           # if previous case is (4)
      (?:           #   open group (C)
        (?!         #     and next is not : (D)
          [\s=]     #       \s or "="
          ["`']     #       with open quotes
        )           #     close (D)
        [^>] |      #     and not close tag or
        [\s=]       #     \s or "=" with
        `[^`]*` |   #     something in quotes ` or
        [\s=]       #     \s or "=" with
        '[^']*' |   #     something in quotes ' or
        [\s=]       #     \s or "=" with
        "[^"]*"     #     something in quotes "
      )*            #   repeat (C) 0 or more times
    |               # else (if previous case is not (4))
      .*?           #   minimum of any chars
    )               # end if previous char is (4)
    (?(1)           # if comment (1)
      (?<=--)       #   wait for "--"
    )               # end if comment (1)
    (?(2)           # if another comment (2)
      (?<=\?)       #   wait for "?"
    )               # end if another comment (2)
    (?(3)           # if one of tags-containers (3)
      </            #   wait for end
      (?i:\3)       #   of this tag
      (?:\s[^>]*)?  #   skip junk to ">"
    )               # end if (3)
    >               # tag closed
   }{}gsx;          # STRIP THIS TAG
  return $_ ? $_ : "";
}



__DATA__
u h a h
<html>
<head>
<title>This title contains Perl but does not get changed.</title>
</head>
<body>
<p>This is some text containing the term 'perl'.</p>
<ol>
    <li>Unix</li>
    <li>Perl</li>
    <li>Linux</li>
</ol>
<p>Notice how the term perl in the following link doesn't change, but the text does. 
<a href="http://www.perlmonks.org">Perlmonks.org</a></p>
</body>
</html>

> < >
< > <

! ] [ ] [ ] [ ] [ - <!-- --> 2 3 4 5 5

<<a href<<a>>

<!-- foo bar -->
<SCRIPT language="javascript">
// this is valid html
// whether you like it or not
// same goes for older browsers
</SCRIPT>
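The untag regex above relies on Perl's conditional subpattern (?(N)...), which applies a sub-pattern only when capture group N took part in the match. A much smaller sketch of that one idea, not a replacement for the full regex: group 1 is set only for a comment opener, and the conditional then insists that the text just before the closing ">" is "--". The name strip_sketch is hypothetical.

```perl
use strict;
use warnings;

# Match either a comment or an ordinary tag.
my $tag_or_comment = qr{
    <
    (?: (!--) | [/A-Za-z] )   # group 1 is set only for a comment opener
    .*?                       # as few characters as possible
    (?(1) (?<=--) )           # if group 1 matched, require -- just before here
    >
}sx;

sub strip_sketch {
    my ($text) = @_;
    $text =~ s/$tag_or_comment//g;
    return $text;
}

# The ">" inside the comment does not end it, because the conditional
# lookbehind fails until the real "-->" is reached.
print strip_sketch('a<!-- x > y -->b'), "\n";    # prints: ab
print strip_sketch('<b>bold</b>'), "\n";         # prints: bold
```

The full regex uses the same mechanism three more times: group 2 for <? ... ?>, group 3 for container tags such as SCRIPT whose whole content is skipped, and group 4 for ordinary tags whose quoted attribute values are stepped over.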