in reply to Reliable email parsing

The word "reliable" in my question means:
* defect free
* complies with all email-format-related RFCs

To me, reliable means that it deals well with emails that don't follow the RFCs. I have found Mail::Box to be the best at dealing with whatever the internet throws at it. Yes, it's a complex OO beast, but it's been around for years and it's been battle tested.

"Defect free" is a meaningless phrase to me and it's unrealistic in any complex piece of software -- which I would consider any email parser to be. How would you confirm it anyway? Create a test set of emails and see if each parser can deal with it.
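One way to act on the "test set of emails" idea is a small harness that feeds every sample to every parser and records which ones blow up. This is only a sketch: `run_corpus`, the `strict` and `lenient` stand-in parsers, and the two sample messages are hypothetical names invented for illustration. Real parsers would be wrapped in coderefs in the same way.

```perl
use strict;
use warnings;

# Run each parser (name => coderef) over each sample (name => raw message
# text) and collect the names of the samples that make a parser die.
sub run_corpus {
    my ($parsers, $samples) = @_;
    my %failures;
    for my $pname (sort keys %$parsers) {
        for my $sname (sort keys %$samples) {
            # eval traps the exception; the trailing 1 marks success
            my $ok = eval { $parsers->{$pname}->($samples->{$sname}); 1 };
            push @{ $failures{$pname} }, $sname unless $ok;
        }
    }
    return \%failures;
}

# Hypothetical stand-in parsers, not real modules.
my %parsers = (
    # insists on a blank line between headers and body
    strict  => sub { $_[0] =~ /\n\r?\n/ or die "no header/body separator\n" },
    # accepts anything
    lenient => sub { 1 },
);

# Hypothetical two-message corpus.
my %samples = (
    well_formed  => "Subject: hi\n\nbody\n",
    headers_only => "Subject: hi\n",
);

my $failures = run_corpus(\%parsers, \%samples);
print "$_ failed on: @{ $failures->{$_} }\n" for sort keys %$failures;
```

A harness like this only shows which inputs make a parser die. Checking that the parsed result is actually right still needs expected values for each sample.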

If you really want to do a code review, Mail::Box::Parser, Mail::Box::Parser::Perl and Mail::Box::Parser::C are all fairly self-evident -- you don't really need to understand all the OO code to examine the parsers.

-xdg

Code written by xdg and posted on PerlMonks is public domain. It is provided as is with no warranties, express or implied, of any kind. Posted code may not have been tested. Use of posted code is at your own risk.

Re^2: Reliable email parsing
by powerman (Friar) on Sep 13, 2006 at 12:35 UTC
    Dealing with emails that don't follow the RFCs is also required, of course. Here is some good information about such emails: http://cr.yp.to/immhf.html.

    But when I say "RFC compliant" I mean support for all possible email address formats, comments in email headers, and everything that is locale-, language-, or encoding-specific, for example:

    Content-Disposition: attachment; filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4 filename*1*=%20from filename*2=" russia.txt"
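The header above uses RFC 2231 parameter continuations: `filename*0*`, `filename*1*` and `filename*2` are numbered pieces of one value, a trailing `*` marks a percent-encoded piece, and the first piece carries `charset'language'`. A minimal sketch of reassembling such a value follows. It assumes the parameters have already been split into `name=value` strings, and `decode_rfc2231_param` is a hypothetical helper name, not part of any CPAN module. It also assumes an ASCII-compatible charset that the core Encode module knows (Encode dies on an unknown charset name).

```perl
use strict;
use warnings;
use Encode qw(decode);

# Reassemble one RFC 2231 continued parameter from pre-split "name=value"
# strings and return it as a Perl character string (undef if not present).
sub decode_rfc2231_param {
    my ($name, @params) = @_;
    my (%part, $charset);
    for my $p (@params) {
        # name*N*=value is percent-encoded, name*N=value is literal
        next unless $p =~ /\A\Q$name\E\*(\d+)(\*?)=(.*)\z/s;
        my ($n, $encoded, $value) = ($1, $2, $3);
        if ($value =~ /\A"(.*)"\z/s) {
            # quoted-string: drop the quotes and undo backslash escapes
            $value = $1;
            $value =~ s/\\(.)/$1/gs;
        }
        elsif ($encoded) {
            # section 0 starts with charset'language'
            if ($n == 0 && $value =~ s/\A([^']*)'[^']*'//) {
                $charset = $1;
            }
            $value =~ s/%([0-9A-Fa-f]{2})/chr hex $1/ge;
        }
        $part{$n} = $value;
    }
    return unless %part;
    # join the sections in numeric order, then decode the bytes
    my $bytes = join '', map { $part{$_} } sort { $a <=> $b } keys %part;
    return decode($charset || 'us-ascii', $bytes);
}

my @params = (
    "filename*0*=koi8-r'ru'%F0%D2%C9%D7%C5%D4",
    'filename*1*=%20from',
    'filename*2=" russia.txt"',
);
my $filename = decode_rfc2231_param('filename', @params);
binmode STDOUT, ':encoding(UTF-8)';
print "$filename\n";
```

Splitting the raw header into parameters is left out on purpose: handling comments, quoted semicolons and missing delimiters is where most of the real work is.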

    As for "defect free": there is enough information on this topic now. Good examples are DJB's software and most of the basic, simple *NIX utilities. I'm convinced that more code means more bugs, so I always prefer smaller, simpler solutions.

    I'm sure a solution compliant with all these RFCs can be coded in under ~1000-1500 lines of code (I've already partially done it).

      > I'm sure solution compliant to all these RFC can be coded under ~1000-1500 lines of code (I've partially done it already).

      Not to be rude, but often a large portion of code complexity comes from the last bits and pieces, which get left until the end because they are hard to implement.

      So to sum up the apparent answers.

      No, there isn't anything yet that does what you want. Email:: should be close. I'm sure they'd appreciate your code to help finish off the set of Email:: modules.
      > Good example of it is DJB software
      OK, now I know you're trolling.
Re^2: Reliable email parsing
by powerman (Friar) on Sep 14, 2006 at 20:00 UTC
    OK, I'm playing with Mail::Box now. Before trying to parse emails I need to parse my mailbox to fetch the individual emails. So I wrote a simple one-liner that counts the messages in my mbox file 'Mail/-default'.

    $ time perl -MMail::Box::Manager -le ' $m=Mail::Box::Manager->new; $f=$m->open(folder=>"Mail/-default"); print scalar $f->messages '
    Unexpected end of header (C parser): charset="iso-8859-1"
    3554

    real 0m8.222s
    user 0m7.928s
    sys 0m0.200s

    Oops! Mutt says there are 3549 messages in this file, not 3554... So I wrote my own reader for the mbox format:

    $ time ./mbox_scan Mail/-default
    3549 Mail/-default

    real 0m0.492s
    user 0m0.408s
    sys 0m0.080s

    Hmm. 16 times faster?! Wow. And correct: it found 3549 messages, just like mutt. So maybe I misunderstand something about this world, but WHY is my pure-Perl parser so much faster than the C parser in a mature CPAN module? OK, maybe Mail::Box does a lot of additional parsing that I don't do, maybe... but why does it produce incorrect results?

    Here is the code of my parser, if you're interested (sorry, the lines run up to 80 columns):

      Well, your test mailbox looks to have invalid data (or at least invalid as far as Mail::Box is concerned; looks to be a bad MIME header). Figure out what the offending message is and let the maintainers know.

      As for speed, your code is doing nothing more than reading the message body in; Mail::Box is building up objects to represent each of the messages. It isn't really surprising that code which (basically) throws away the data it reads, rather than doing anything useful with it, runs faster.
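To make the comparison concrete, here is roughly how little a count-only mbox scanner has to do. This is a sketch under stated assumptions, not powerman's `mbox_scan`: `count_mbox_messages` is a hypothetical name, and it assumes the common convention that a message starts at a line beginning with "From " at the start of the file or after a blank line.

```perl
use strict;
use warnings;

# Count messages on an open mbox filehandle by counting "From " separator
# lines that begin the file or directly follow a blank line.
sub count_mbox_messages {
    my ($fh) = @_;
    my ($count, $prev_blank) = (0, 1);   # start of file counts as "after blank"
    while (my $line = <$fh>) {
        $count++ if $prev_blank && $line =~ /\AFrom /;
        # remember whether this line was empty (LF or CRLF line endings)
        $prev_blank = ($line =~ /\A\r?\n?\z/);
    }
    return $count;
}
```

It would be called with something like `open my $fh, '<', 'Mail/-default' or die $!; print count_mbox_messages($fh), "\n";`. No headers are parsed and no objects are built, so it has much less work to do per message. A scanner like this also miscounts whenever a body contains an unescaped "From " line after a blank line, so two mbox readers can legitimately report different totals for the same file.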
