
Remove attachment data from the read file

by chanakya (Friar)
on Jan 10, 2009 at 09:43 UTC ( [id://735373]=perlquestion )

chanakya has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks, I'm working on a task to load millions of hash records, which are generated from the data in text files on the system.

To accomplish this I wrote a script which reads every text file on the system, takes the data from the "Subject" line through eof(), and generates an MD5 hash.

After loading some of the data I found an issue: certain text files (those with XUZ in the "Subject") contain an attachment, as shown below.

I'd like to know the best way to remove the attachment headers and the attachment data (starting from "Content-Type:") from the text read from these files.

Below is the format of the actual text file:

    Message-Id: <200707020704.l6274QG9029301@smtp2.corp.abb.com>
    Subject: System Alert from XUZ of sts WARNING
    Mime-Version: 1.0
    dhcp 0
    ip 172.19.22.255
    netmask 255.255.255.255
    gateway 192.168.1.1
    HOSTNAME=ABB
    dns.enable on
    ab_to_abb[0]
    ab_to_local[0]
    plex_to_abb[1]
    plex_to_local[1]
    ab_to_abb[1]
    Content-Type: application/x-gzip
    Content-Disposition: attachment; filename="sys_logs.gz"
    Content-Transfer-Encoding: base64

    H4sIAHujiEYCA+1dW2/bOhJ+D5DwIN9aAukqqirrcUukJM2pznIpUjScxZYLApZlhJtZMm +rS9Ls27znSElx4ouiWXasRMVhW1J5MfhcDgcfhlJO3979I8cR1fkL5QkqR2nJPUnLrnIQ +vJ7
    FhBCiaJYqm5p8EOWzZ2dpsppnIWOnbpjYqel6qqlGJZm8urk1vMDN4amko/JbSArH289lV +zb
    CUmus5SMo7tQkqTd3R34f3B0eEGCyLEDkrjxLdSKwqI+/5KJzyumfng1VnO9tkZSodQ1Yn +C
    cUImUew+H7dGork+UQP7pFPy7fMl+eesLxZUl5I0mk7dsXVnxyHI8C8rx7TIu/lO03d/JX +lR
    0dDK6qDVdmitAj299T3JDoJjP7xJPoMurYPzo8ujg/1jaGIO2fdkahEoSAIsydS+u/N1// +jSIk6QJSmMzRSsM4RvHJvUvnFDEuHgvfcmH3AEq+KYTBy0u90dsttsum44rlo9M1t90Gr +1StOU

    CREATED_ON=Mon Jul 2 00:04:28 2007
After removal of the Content-Type headers and the attachment, the text file data should look as below:

    Message-Id: <200707020704.l6274QG9029301@smtp2.corp.abb.com>
    Subject: System Alert from XUZ of sts WARNING
    Mime-Version: 1.0
    dhcp 0
    ip 172.19.22.255
    netmask 255.255.255.255
    gateway 192.168.1.1
    HOSTNAME=ABB
    dns.enable on
    ab_to_abb[0]
    ab_to_local[0]
    plex_to_abb[1]
    plex_to_local[1]
    ab_to_abb[1]
    CREATED_ON=Mon Jul 2 00:04:28 2007
Below is the script, which is used only to read the text files and generate the hash. Please let me know how to remove the unwanted attachment in the most efficient manner.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Digest::MD5 qw(md5_hex);    # provides md5_hex()

    my @dates    = qw(20070202 20070703 20070704);
    my $datapath = "/p/data/";

    foreach my $date (@dates) {
        if (-e "$datapath/$date") {
            opendir(THISDIR, "$datapath/$date")
                || die("Cannot read dir $datapath/$date");
            my @files_list = grep(/^ABB.*/, readdir(THISDIR));
            closedir THISDIR;
            print "$datapath/$date COUNT:" . scalar(@files_list) . "\n";

            foreach my $file (@files_list) {
                if (-e "$datapath/$date/$file/HDR_FILE") {
                    open(DATA, "$datapath/$date/$file/HDR_FILE");
                    my $content = "";
                    while (<DATA>) {
                        # keep everything from the "Subject" line to end of file
                        if (/Subject/ .. eof(DATA)) {
                            $content .= $_;
                        }
                    }
                    close DATA;
                    my $hex = md5_hex($content);
                    if (!$hex || !$content) {
                        print "Could not generate MD5 : $file \n";
                        next;
                    }
                }
                else {
                    print "No HDR_FILE Found, skipp \n";
                }
            }
        }
    }
Thanks for your time.

Replies are listed 'Best First'.
Re: Remove attachment data from the read file
by chromatic (Archbishop) on Jan 10, 2009 at 09:58 UTC

    Instead of parsing the text myself, I'd use Email::Simple and Email::MIME to extract the subject and body, while discarding any attachments. You may have to experiment with messages to figure out how they appear to Email::MIME in the case where they have attachments and where they don't, but that's likely to be far simpler than writing your own parsing code.

Re: Remove attachment data from the read file
by Perlbotics (Archbishop) on Jan 10, 2009 at 12:59 UTC

    Some observations and assumptions:

    • Lines in the sample file end in CR (0x0D). Maybe you need to take care of that?
    • For larger files, computing the MD5 incrementally might yield better performance than appending the file line by line to $content... That's what I thought until I did a short speed comparison. There was no significant difference for files up to 150MB here (after caching). Anyway, I would recommend using the incremental MD5 computation for memory efficiency when treating large files.
    • Since you have your own idea of how a digest is constructed, I would recommend putting this procedure into a separate module and harnessing it with a lot of tests. Reproducible digest computation offers a lot of pitfalls - keep them in known territory.
    FWIW, here's something that works for your sample:

        use strict;
        use warnings;
        use MD5;

        my $in_attachment = 0; # 0: mail; 1: attachm. header 2: attachm. content
        my $subject_seen  = 0;
        my $md5 = new MD5;

        # read until (Subject: ...)
        while ( <DATA> ) {
            $subject_seen = 1 if /^Subject:\s/;
            next if not $subject_seen;

            ## When in doubt:
            # die "didn't expect that [$.]: $_"
            #   if /^Content-Type:\s/i and not /application\/x-gzip/i

            $in_attachment = 1, next  if not $in_attachment and /^Content-Type: /;
            $in_attachment = ++$in_attachment % 3  if $in_attachment and /^\s*$/;
            next if $in_attachment; # skip MD5 computation while in attachment

            $md5->add($_); # incremental calculation
        }

        # MD5 of "Subject: .... " .. "CREATED_ON=..."
        # MD5: a138724a0766a9b685ccc60ce9c85de3
        print "MD5: ", $md5->hexdigest(),"\n";

        __DATA__
        Message-Id: <200707020704.l6274QG9029301@smtp2.corp.abb.com>
        Subject: System Alert from XUZ of sts WARNING
        Mime-Version: 1.0
        dhcp 0
        ip 172.19.22.255
        netmask 255.255.255.255
        gateway 192.168.1.1
        HOSTNAME=ABB
        dns.enable on
        ab_to_abb[0]
        ab_to_local[0]
        plex_to_abb[1]
        plex_to_local[1]
        ab_to_abb[1]
        Content-Type: application/x-gzip
        Content-Disposition: attachment; filename="sys_logs.gz"
        Content-Transfer-Encoding: base64

        H4sIAHujiEYCA+1dW2/bOhJ+D5DwIN9aAukqqirrcUukJM2pznIpUjScxZYLApZlhJtZMm +rS9Ls27znSElx4ouiWXasRMVhW1J5MfhcDgcfhlJO3979I8cR1fkL5QkqR2nJPUnLrnIQ +vJ7
        FhBCiaJYqm5p8EOWzZ2dpsppnIWOnbpjYqel6qqlGJZm8urk1vMDN4amko/JbSArH289lV +zb
        CUmus5SMo7tQkqTd3R34f3B0eEGCyLEDkrjxLdSKwqI+/5KJzyumfng1VnO9tkZSodQ1Yn +C
        cUImUew+H7dGork+UQP7pFPy7fMl+eesLxZUl5I0mk7dsXVnxyHI8C8rx7TIu/lO03d/JX +lR
        0dDK6qDVdmitAj299T3JDoJjP7xJPoMurYPzo8ujg/1jaGIO2fdkahEoSAIsydS+u/N1// +jSIk6QJSmMzRSsM4RvHJvUvnFDEuHgvfcmH3AEq+KYTBy0u90dsttsum44rlo9M1t90Gr +1StOU

        CREATED_ON=Mon Jul 2 00:04:28 2007
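
The observations in this reply (line endings, incremental digesting, keeping the digest procedure in one tested place) could be combined roughly as follows. This is only a sketch under stated assumptions: the helper name digest_without_attachment and the shortened sample are made up for illustration, the files are assumed to use CRLF line endings, and a blank line is assumed to end both the attachment headers and the base64 data. It uses the core Digest::MD5 module rather than the older MD5 wrapper. Note that normalising line endings changes the resulting digest, so that decision has to be made once and kept.

```perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);

# Hypothetical helper (not from the thread): digest everything from the
# "Subject:" line to end of input, skipping one or more attachment blocks.
# Assumption: an attachment starts at a "Content-Type:" line, and a blank
# line ends both its headers and its base64 data.
sub digest_without_attachment {
    my ($fh) = @_;
    my $md5          = Digest::MD5->new;
    my $state        = 0;   # 0: mail text, 1: attachment headers, 2: attachment data
    my $seen_subject = 0;

    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z/\n/;                     # normalise CRLF to LF (assumption)
        $seen_subject ||= $line =~ /^Subject:\s/;
        next unless $seen_subject;

        if ( $state == 0 and $line =~ /^Content-Type:\s/i ) {
            $state = 1;                             # entering attachment headers
            next;
        }
        $state = ( $state + 1 ) % 3                 # blank line: 1 -> 2 -> 0
            if $state and $line =~ /^\s*$/;
        next if $state;                             # still inside the attachment

        $md5->add($line);                           # incremental: no big buffer needed
    }
    return $md5->hexdigest;
}

# Shortened, made-up sample with CRLF line endings.
my $sample = join "\r\n",
    'Message-Id: <id@example.invalid>',
    'Subject: System Alert',
    'dhcp 0',
    'Content-Type: application/x-gzip',
    'Content-Transfer-Encoding: base64',
    '',
    'H4sIAHujiEYCA+1d',
    '',
    'CREATED_ON=Mon Jul 2 00:04:28 2007',
    '';

open my $fh, '<', \$sample or die "in-memory open failed: $!";
my $got = digest_without_attachment($fh);
close $fh;

# The same text digested in one go, for comparison with the incremental result.
my $expected = md5_hex(
    "Subject: System Alert\ndhcp 0\n\nCREATED_ON=Mon Jul 2 00:04:28 2007\n");

print "incremental: $got\n";
print "one-shot:    $expected\n";
```

Keeping the procedure in one sub like this makes it straightforward to move into a module later and to pin its behaviour with tests that compare against md5_hex of hand-written expected text.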

Re: Remove attachment data from the read file
by BrowserUk (Patriarch) on Jan 10, 2009 at 11:42 UTC

    If the attachments always start with a content-type header, try changing your condition to:

        if( my $seq = /Subject/ .. ( /^content-type/i ) ) {
            $content .= $_ if $seq !~ /E0$/;
            last if $seq =~ /E0$/; ## no point in parsing further
        }
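
To see why the /E0$/ test identifies the Content-Type line, it may help to look at what the range operator returns in scalar context. The following is a small illustration with made-up lines, not the poster's real data; the array @seq_seen exists only to record the values for display.

```perl
use strict;
use warnings;

# Hypothetical sample lines for illustration only.
my @lines = (
    "Message-Id: <id\@example.invalid>\n",
    "Subject: System Alert\n",
    "dhcp 0\n",
    "Content-Type: application/x-gzip\n",
    "H4sIAHujiEYCA+1d\n",
);

my $content = '';
my @seq_seen;    # records each value the flip-flop returned

for (@lines) {
    # In scalar context ".." is a flip-flop: false until the left pattern
    # matches, then a sequence number 1, 2, 3, ... and the final number has
    # "E0" appended on the line where the right pattern matches.
    if ( my $seq = /^Subject/ .. /^content-type/i ) {
        push @seq_seen, $seq;
        last if $seq =~ /E0$/;    # this is the Content-Type line: stop reading
        $content .= $_;
    }
}

print "sequence: @seq_seen\n";
print $content;
```

Run as is, this prints "sequence: 1 2 3E0" followed by the Subject and dhcp lines. Anything after the attachment is never read, so this approach fits only if nothing following the attachment needs to go into the digest.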

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Remove attachment data from the read file
by chanakya (Friar) on Jan 11, 2009 at 12:30 UTC
    All,
    Thanks for your comments and sample code.
    Perlbotics, I liked the idea of incremental MD5 computation. I will do some testing with it.

    Cheers
