Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi
I have mbx files (each one is a collection of mails). If an mbx file is opened in a text editor, it looks like an HTML file. I want to extract some text from all of the mails.

I think I can handle it just like a text file. But how can I do it efficiently? Is there a module or something?

Please give your suggestions.

Thanks for your time.

Replies are listed 'Best First'.
Re: extract text from mbx
by holli (Abbot) on Feb 23, 2005 at 05:47 UTC
    If it's HTML, use HTML::Parser or one of its children. If it's well-formed, use one of the many XML parsers, like XML::Parser.
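
    A minimal sketch of the HTML::Parser route, assuming the HTML part of one mail has already been read into a string. The helper name html_to_text and the sample markup are made up for illustration, and skipping script/style content is an assumption about what counts as "text" here:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Return the visible text of an HTML fragment as one whitespace-normalised string.
# html_to_text is a hypothetical helper name, not part of HTML::Parser.
sub html_to_text {
    my ($html) = @_;
    my @chunks;
    my $parser = HTML::Parser->new(
        api_version => 3,
        # 'dtext' hands the handler text with HTML entities already decoded
        text_h => [ sub { push @chunks, $_[0] }, 'dtext' ],
    );
    # Assumption: script and style content is not wanted in the output
    $parser->ignore_elements(qw(script style));
    $parser->parse($html);
    $parser->eof;

    my $text = join '', @chunks;
    $text =~ s/\s+/ /g;      # collapse runs of whitespace
    $text =~ s/^ | $//g;     # trim both ends
    return $text;
}

# Sample markup invented for this example
print html_to_text('<html><body><p>Hello &amp; <b>welcome</b></p></body></html>'), "\n";
```

    For a whole mbx file, the same helper could be called once per mail, after the file has been split into individual messages.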


    holli, /regexed monk/
      Thanks, holli.

      I have some more information to add. The file is not entirely HTML. Each mail starts with junk characters (which I think represent the header of that mail), followed by the data. The data is enclosed in HTML tags.

      A single mbx file is a collection of many such mails.

        I don't think you are giving us enough information here. What software is creating these 'mbx' files? Could you give us an example that might help us determine the format? If it is a standard mail storage format, then Email::Folder or one of its friends might be able to deal with it. You might even be able to identify the type of the mailbox with the Email::FolderType module.
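
        Something along these lines might be a starting point, assuming the file turns out to be in a format Email::Folder can read (a proprietary 'mbx' format may well not be). The file name mail.mbx and the helper list_messages are made up for illustration, and the type reported by folder_type is only a heuristic guess:

```perl
use strict;
use warnings;
use Email::Folder;
use Email::FolderType qw(folder_type);

# Return [subject, body] pairs for every message in a mail folder.
# list_messages is a hypothetical helper name.
sub list_messages {
    my ($path) = @_;
    my $folder = Email::Folder->new($path);
    my @found;
    # Each message comes back as an Email::Simple object
    while ( my $msg = $folder->next_message ) {
        push @found, [ scalar $msg->header('Subject'), $msg->body ];
    }
    return @found;
}

# 'mail.mbx' is a placeholder; pass the real path as the first argument
my $path = shift(@ARGV) // 'mail.mbx';
if ( -e $path ) {
    printf "Guessed folder type: %s\n", folder_type($path) // 'unknown';
    printf "%s\n", $_->[0] // '(no subject)' for list_messages($path);
}
```

        Each body could then be handed to an HTML parser to pull out the text you are after.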

        /J\