PerlMonks  

RegExp to delete mail attachments

by Anonymous Monk
on Sep 12, 2000 at 00:11 UTC

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I'm having trouble writing a script to delete every image attachment in a mailbox file (created by Netscape 4). I think I got the regular expression string correct:

s/--.*Content-Type: image.*From /From /gs

but I'm now getting "Out of Memory" when I run the script on the file. Is my regexp faulty, or is it just too optimistic to try to do this on 20MB in one go?

Replies are listed 'Best First'.
Re: RegExp to delete mail attachments
by swiftone (Curate) on Sep 12, 2000 at 00:18 UTC
    Read Death to Dot Star!. Basically, the star operator is greedy and will eat AS MUCH as possible before matching the rest of the expression. So your first dot star will eat most of the 20MB file (i.e. everything up to the last occurrence of the rest of the pattern), which probably causes the memory problems.

    The solution is to either make the operator non-greedy (put a ? at the end: .*?) or to restate the regexp to get rid of the dot so it doesn't match as much.
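To make the greedy versus non-greedy difference concrete, here is a minimal sketch on a toy string. The string and the two helper subs (`strip_greedy`, `strip_lazy`) are made up for illustration; they are not a real mailbox or anyone's actual script.

```perl
use strict;
use warnings;

# Toy stand-in for a mailbox: two "attachments", each followed by "From ".
my $text = "--A image one From --B image two From ";

# Greedy: each .* grabs everything, then backtracks to the LAST possible match.
sub strip_greedy {
    my ($s) = @_;
    $s =~ s/--.*image.*From /From /gs;
    return $s;
}

# Non-greedy: each .*? stops at the FIRST place the rest of the pattern fits.
sub strip_lazy {
    my ($s) = @_;
    $s =~ s/--.*?image.*?From /From /gs;
    return $s;
}

print strip_greedy($text), "\n";   # "From "       (one match swallowed both)
print strip_lazy($text),   "\n";   # "From From "  (two separate matches)
```

On a 20MB file the greedy version has to scan to the end of the file and work backwards, which is where the time and memory go.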

    In your case, though, you'd probably be better off searching CPAN for some modules to parse the mail for you. Parsers are tricky beasts.

RE: RegExp to delete mail attachments
by BlaisePascal (Monk) on Sep 12, 2000 at 01:08 UTC
    Looking on CPAN is your best bet. If you wanted to do it yourself, I'd suggest really reading the MIME RFCs, because I can see some problems with what you have there immediately.

    I'm gonna assume that the .* issues others have mentioned have been resolved, and you are using .*? for a non-greedy match. That'll help immediately.

    But the end-terminator for an attachment isn't "From", it's a second line matching the first "--.*" line. Worse, the proper value of the ".*" is specified on a different line, which you don't want to chop out.

    Ignoring that last problem, you could probably use something like:

    s{
        (\n--).*?\s*\n          # Match boundary line
        Content-Type:\ image    # Find image part
        .*?                     # Match part non-greedily
        \1                      # Match next boundary line
    }{$1}gsx                    # Replace with boundary line (/x allows the comments)
    This should match the proper beginnings and endings better. I'd love to get rid of the .*? parts, but I'm not sure if it can be done. I thought of using \S* for the first, but the boundary line can contain spaces, so that won't work.
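One possible way to drop the first .*? is `[^\n]*`, which allows spaces but cannot run past the end of the boundary line. The sketch below is a variant of the substitution above, not the original; the sub name `strip_image_parts` and the sample message are hypothetical. It assumes Content-Type is the first header of each part, and it still ignores the problem of reading the real boundary string from the message headers.

```perl
use strict;
use warnings;

# Remove image parts from one MIME message held in a string.
# Assumption: each part starts on a "--boundary" line and its
# Content-Type header comes immediately after that line.
sub strip_image_parts {
    my ($msg) = @_;
    $msg =~ s{
        \n--[^\n]*\n            # boundary line, confined to a single line
        Content-Type:\ image    # first header marks an image part
        .*?                     # part headers and body, non-greedily
        (?=\n--)                # stop just before the next boundary line
    }{}gsx;
    return $msg;
}

# Made-up sample message with one text part and one image part.
my $sample = join "\n",
    'From sender',
    'Content-Type: multipart/mixed; boundary="XYZ"',
    '',
    '--XYZ',
    'Content-Type: text/plain',
    '',
    'Hello',
    '--XYZ',
    'Content-Type: image/gif',
    'Content-Transfer-Encoding: base64',
    '',
    'R0lGODlh',
    '--XYZ--',
    '';

print strip_image_parts($sample);
```

The lookahead leaves the next boundary line in place, so two image parts in a row are both removed, and the text part survives because the boundary match can no longer stretch across it.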
Re: RegExp to delete mail attachments
by adamsj (Hermit) on Sep 12, 2000 at 00:16 UTC
    That's a very greedy regexp, with that .* in there and the /s modifier at the end. I bet that's your problem. Try creating a little mailbag and trying it on that--I bet it'll run, and you'll see how your regex fails.

    (If you can't get it to work, tell us a little more about how those mail files are structured--I don't think it would be hard to make it work reliably, but I don't know a darn thing about Netscape mail files.)
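On the "20MB in one go" part of the question: a rough sketch of walking the mailbox one message at a time, so that no substitution ever sees the whole file. It assumes messages are separated by lines beginning with "From " (mbox style) and that body lines starting with "From " have been escaped; whether Netscape 4 files follow this exactly is an assumption. The sub name `each_message` is hypothetical.

```perl
use strict;
use warnings;

# Call $callback once per message, keeping only one message in memory.
# Assumption: a new message starts at every line beginning with "From ".
sub each_message {
    my ($fh, $callback) = @_;
    local $/ = "\nFrom ";                     # read up to the start of the next message
    my $count = 0;
    while (defined(my $chunk = <$fh>)) {
        my $removed = chomp $chunk;           # strips a trailing "\nFrom " if present
        $chunk .= "\n" if $removed;           # give back the newline chomp took
        $chunk = "From $chunk" if $count++;   # restore the prefix eaten by the previous read
        $callback->($chunk);
    }
    return $count;
}

# Usage with an in-memory "mailbox"; a real script would open the file instead.
my $mbox = "From a\@example.com\nbody one\nFrom b\@example.com\nbody two\n";
open(my $fh, '<', \$mbox) or die "cannot open in-memory handle: $!";
my @messages;
each_message($fh, sub { push @messages, $_[0] });
close $fh;
print scalar(@messages), " messages\n";
```

The callback is where a per-message substitution would go, with the result printed to a new file. This also gives you the small test cases suggested above for free.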
