amearse has asked for the wisdom of the Perl Monks concerning the following question:

Happy New Year and many more! Here's my story. I'm working on a script that will process huge text files (forwarded spam) and pull certain elements that I need to work with in a database.
#!/usr/bin/perl -w
### sting ###

### Table Setup ###
# CREATE TABLE subject_info
# (id int not null auto_increment, plaintif varchar(100) not null, subject longtext not null,
# date varchar(50) not null, primary key (id));

use warnings;
use strict;
use DBI;

#file variables and flags#
my $flag = '0';
my $spamcop_email = shift || 'c:\frodo\mail\spamcop_email.txt';
my $text_output = shift || 'c:\frodo\output\text_output.txt';

#regex parameters and variables#
my $from = 'From: ';
my $full_subject = 'Subject: ';
my $date = 'Date: ';
my @buffer = ('','');

#initialize arrays#
open (SCEMAIL, "$spamcop_email") || die "Can't open $spamcop_email";
my @spamcop_email_array=<SCEMAIL>;
close (SCEMAIL);
open (TEXTOUT, ">$text_output") || die "Can't open $text_output";

#loop and fill @buffer#
foreach (@spamcop_email_array){
    if(s/.*$from//){
        $buffer[0] = $_;
        $flag = 0;
    }
    if(s/.*$full_subject//){
        $buffer[1] = $_;
        $flag = 0;
    }
    if(s/.*$date//){
        $buffer[2] = $_;
        $flag = 0;
        print TEXTOUT @buffer;
    }
}
close TEXTOUT;

#DBI Connect and Insertion#
my $dbh = DBI->connect("DBI:mysql:database=SpamCopBot; host=localhost", "amearse", "tttttt", {'RaiseError' => 1});
my $sth =$dbh->prepare("INSERT INTO subject_info (plaintif, subject, date) VALUES ('$buffer[0]', '$buffer[1]', '$buffer[2]')");
$sth->execute();
$sth->finish();
$dbh->disconnect();
As you can see, the three elements are printed to both text file and database. I have put the flags there to reject dupes, but I know that they currently do nothing. The first problem occurs in the text output. Here is a snippet to show you what I'm talking about.
52387348@reports.spamcop.net
Wednesday, January 02, 2002 10:52 PM
52432604@reports.spamcop.net
[SpamCop (http://web1.customoffers.com/unsubscribe.asp?emid=1008&email=x) id:52387348] 4HourWireless Special of the Month - Signal Booster
Wednesday, January 02, 2002 11:28 PM
52496384@reports.spamcop.net
[SpamCop (http://web1.customoffers.com/unsubscribe.asp?emid=1008&email=x) id:52432604] 4HourWireless Special of the Month - Signal Booster
Thursday, January 03, 2002 12:20 AM
52553913@reports.spamcop.net
[SpamCop (http://web1.customoffers.com/unsubscribe.asp?emid=1009&email=x) id:52496384] AWARD CONFIRMATION
Thursday, January 03, 2002 01:04 AM
Notice how the first result is missing the subject line? Actually, it has been bumped down into the next result, which is a real problem for output validity. What is the cause of this? I have checked my data sources and they are fine; all the necessary info is there. It is strange to me: I am testing this on 20 emails, so I should see 60 lines of text output, but I only get 59, with the last subject line dropped and replaced by the one above it. That said, the next problem is in the database entry. When I check the results, the database has acquired only one entry, though I had expected 20.
| id | plaintif                     | subject | date |
| 6  | 52531044@reports.spamcop.net | [SpamCop (http://web1.customoffers.com/unsubscribe.asp?emid=1032&email=x) id:52531044] Membership Confirmation for G G | Thursday, January 03, 2002 12:48 AM |
This entry is the correct three elements from the last email of the twenty. I'm a bit lost, could you please sound off on any possible solutions to clean up my output and database entries? Bests, amearse

Replies are listed 'Best First'.
Re: The contents of a misguided array.
by Albannach (Monsignor) on Jan 05, 2002 at 09:09 UTC
    Two thoughts for you:
    • Can you be certain that your data are always ordered as from-subject-date? You are using Date to trigger your printing, so if your first message's headers happen to be in the order from-date-subject, it will print the From and Date, then collect the Subject from the first message, then the From and Date from the second message, which then triggers the printing of From2-Subject1-Date2, as you observed. It is just a guess as I don't have your data, but from casual observation I know that you can't rely on the order of these header lines, so using a specific line essentially as an end-of-header marker is dangerous.
    • Since you only store the most recent from-subject-date lines in your @buffer (you keep overwriting the array), that is all that is available to be written to your database, and hence you will only store your final three such lines. Actually you are only trying to store the first three values from your array - how could you expect more than one e-mail to be recorded in the database? I'm guessing you intend this database section as a plug-in replacement for your print TEXTOUT line where at least it would save a record for each e-mail, but still suffer from the data order problem I described above.
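
Both points can be sketched together. This is only a rough illustration, not the original script: it assumes every message carries exactly one From:, Subject: and Date: line at the start of a line, and the helper names parse_records and store_records are made up for the example. Fields are stored by name rather than by position, a record is emitted as soon as all three have been seen (whichever arrives last), and every record is inserted through one prepared statement with placeholders.

```perl
use strict;
use warnings;

# Collect one record per message, regardless of header order.
# Assumption: each message has exactly one From:, Subject: and Date: line.
sub parse_records {
    my @lines = @_;
    my (@records, %current);
    for my $line (@lines) {
        next unless $line =~ /^\s*(From|Subject|Date):\s*(.*?)\s*$/;
        $current{$1} = $2;
        # Flush once all three fields are present, whichever arrived last.
        if (keys(%current) == 3) {
            push @records, { %current };
            %current = ();
        }
    }
    return @records;
}

# Insert every record, not just the last one.
# $dbh is a connected DBI handle supplied by the caller.
sub store_records {
    my ($dbh, @records) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO subject_info (plaintif, subject, date) VALUES (?, ?, ?)'
    );
    # Placeholders let the driver quote values, e.g. subjects with apostrophes.
    $sth->execute(@{$_}{qw(From Subject Date)}) for @records;
    return scalar @records;
}

# Possible usage, reusing the connection details from the original post:
# my $dbh = DBI->connect("DBI:mysql:database=SpamCopBot; host=localhost",
#                        "amearse", "tttttt", { RaiseError => 1 });
# store_records($dbh, parse_records(@spamcop_email_array));
```

If a message lacks one of the three headers, its fields would still leak into the next record, so a real message boundary (if the data has one) is a safer trigger than a field count.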

    I hope this helps!

    --
    I'd like to be able to assign to an luser

Re: The contents of a misguided array.
by jlongino (Parson) on Jan 05, 2002 at 11:13 UTC
    Processing order seems to be the problem as Albannach points out. I haven't read his post closely, but here is a modified version of your program that seems to run correctly with a data set that I copied/modified from a sendmail inbox:
    use strict;

    my $flag = '0';

    ### I had to take a few of the extra blanks out of the next three search variables
    my $from = 'From: ';
    my $full_subject = 'Subject: ';
    my $date = 'Date: ';
    my @buffer = ();

    my @spamcop_email_array = <DATA>;
    open (TEXTOUT, ">junk.out") || die "Can't open 'junk.out'";

    #loop and fill @buffer#
    foreach (@spamcop_email_array){
        if(s/.*$from//){
            ### changed [0] to [1]
            $buffer[1] = $_;
            $flag = 0;
        }
        if(s/.*$full_subject//){
            ### changed [1] to [2]
            $buffer[2] = $_;
            $flag = 0;
            ### moved print from date if block to here
            print TEXTOUT @buffer;
        }
        if(s/.*$date//){
            ### changed [2] to [0]
            $buffer[0] = $_;
            $flag = 0;
        }
    }
    close TEXTOUT;
    __DATA__
    From MAILER-DAEMON Fri Jan 4 18:04:20 2002
    Date: 04 Jan 2002 18:04:20 -0600
    From: Mail System Internal Data <MAILER-DAEMON@loopey.faraway.net>
    Subject: DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
    Message-ID: <1010189060@loopey.faraway.net>
    X-IMAP: 0910805650 0000003175
    Status: RO

    This text is part of the internal format of your mail folder, and is not
    a real message. It is created automatically by the mail system software.
    If deleted, important folder data will be lost, and it will be re-created
    with the data reset to initial values.

    From root@loopey.faraway.net Sun Jun 20 10:37:57 1999
    Received: from localhost (root@localhost)
        by loopey.faraway.net (8.8.8+Sun/8.8.8) with ESMTP id KAA05403;
        Sun, 20 Jun 1999 10:37:57 -0500 (CDT)
    Date: Sun, 20 Jun 1999 10:37:56 -0500 (CDT)
    From: Super-User <root@loopey.faraway.net>
    To: <light@loopey.faraway.net>, xxxx xxxxx <joey@loopey.faraway.net>
    Subject: Re: accounts for DISL (fwd)
    Message-ID: <Pine.GSO.4.10.9906201036380.5380-100000@loopey.faraway.net>
    MIME-Version: 1.0
    Content-Type: TEXT/PLAIN; charset=US-ASCII
    Content-Length: 1064
    Status: RO
    X-Status:
    X-Keywords:
    X-UID: 263

    joey,

    Looks like you may have sent Randy a note from root - here's her reply.
    I'll address the SAS issues.

    - Andy

    ---------- Forwarded message ----------
    Date: Fri, 18 Jun 1999 09:01:04 -0500
    From: rsch <rsch@xxxx.xxx>
    To: Super-User <root@loopey.faraway.net>
    Subject: Re: accounts for DISL

    joey,

    If possible I would like to keep my account (rsch) to run SAS jobs for
    the faculty from time to time.

    Thanks,
    Randy

    From light@loopey.faraway.net Wed Mar 28 12:05:57 2001
    Date: Wed, 28 Mar 2001 12:05:48 -0600 (Central Standard Time)
    From: xxxx xxxx <light@loopey.faraway.net>
    To: xxx xxx xxx <joey@loopey.faraway.net>
    Subject: xxxxx account
    Sender: light@loopey.faraway.net
    MIME-Version: 1.0
    Content-Type: TEXT/PLAIN; charset=US-ASCII
    Content-Length: 364
    Status: RO
    X-Status:
    X-Keywords:
    X-UID: 1565

    joey,

    when you get a chance, could you please do whatever is necessary to
    remove xxxx xxxxx xxxxx account? We will want to leave a forwarding
    record in place in the aliases file.
    The output file:
    04 Jan 2002 18:04:20 -0600
    Mail System Internal Data <MAILER-DAEMON@loopey.faraway.net>
    DON'T DELETE THIS MESSAGE -- FOLDER INTERNAL DATA
    Sun, 20 Jun 1999 10:37:56 -0500 (CDT)
    Super-User <root@loopey.faraway.net>
    Re: accounts for DISL (fwd)
    Fri, 18 Jun 1999 09:01:04 -0500
    rsch <rsch@xxxx.xxx>
    Re: accounts for DISL
    Wed, 28 Mar 2001 12:05:48 -0600 (Central Standard Time)
    xxxx xxxx <light@loopey.faraway.net>
    xxxxx account
    While this code works in most cases, you should really beef it up with some error checking. I seem to remember a recent post that described a CPAN module that parses email. If so, you might consider switching, since it is probably vetted already and would ensure more accurate results. BTW, I guess that $flag will serve a future purpose. It doesn't seem to do anything now.
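
One way the error checking might look is sketched below. It is a hand-rolled illustration, not a vetted parser, and it assumes mbox-style input like the sample above, where each message starts with a "From " envelope line and the headers end at the first blank line. The function name parse_mbox_headers is invented for the example.

```perl
use strict;
use warnings;

# Split mbox-style text into messages and read only each header block.
# Assumption: messages start with a "From " envelope line, as in the sample data.
sub parse_mbox_headers {
    my ($text) = @_;
    my @records;
    # Split before every line that begins with "From " (envelope line, no colon).
    for my $message (split /^(?=From )/m, $text) {
        next unless $message =~ /\S/;
        # Headers end at the first blank line; the body is ignored.
        my ($head) = split /\n\s*\n/, $message, 2;
        my %field;
        for my $name (qw(From Subject Date)) {
            if ($head =~ /^\Q$name\E:[ \t]*(.*)$/m) {
                $field{$name} = $1;
            } else {
                warn "message is missing a $name: header\n";
            }
        }
        # Keep only complete records; incomplete ones were reported above.
        push @records, \%field if keys(%field) == 3;
    }
    return @records;
}
```

Because only the header block is read, header order no longer matters, and header-like lines inside a body (such as the forwarded message in the sample) are skipped rather than reported as a separate record. Folded (multi-line) headers are not handled here, which is one more reason a CPAN mail parser may be the better choice.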

    HTH,

    --Jim

Probable misuse of "||" instead of "or".
by metadoktor (Hermit) on Jan 05, 2002 at 07:39 UTC
    I haven't bothered to read your code in depth but you are probably misusing the "||" operator. You should be using "or" and your open statements should look like this:

    open (SCEMAIL, "<$spamcop_email") or die "Can't open $spamcop_email";
    
    and not this:
    open (SCEMAIL, "$spamcop_email") || die "Can't open $spamcop_email";
    

    Note: I only threw in the "<" char in the open statement because I like completeness although it is not necessary.

    Your problems with the output may or may not be due to this problem.

    metadoktor

    "The doktor is in."

      As long as open has its parameters in parentheses, || is OK. However, I prefer the or version, since it lets me use open without parentheses. That is:
      open SCEMAIL, "< $spamcop_email" or die "Can't open $spamcop_email: $!";
      Note: It is a good idea to add $! to the string printed by die, because it helps diagnose what caused open to fail.
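
The precedence difference can be seen in a small experiment. This is just a sketch: the path is a made-up one that is assumed not to exist, and lexical filehandles are used only to keep the example free of unrelated warnings.

```perl
use strict;
use warnings;

my $missing = 'no_such_dir/no_such_file.txt';   # hypothetical path, assumed absent

# Without parentheses, || binds to the filename, not to open's result:
# this parses as open($fh1, ($missing || die ...)), so die is never reached.
my $with_pipes = eval { open my $fh1, $missing || die "never reached"; 1 };

# "or" binds more loosely than the comma, so it tests open's return value.
my $with_or = eval { open my $fh2, "< $missing" or die "Can't open $missing: $!\n"; 1 };
my $error = $@;

print "|| form died? ", ($with_pipes ? "no" : "yes"), "\n";
print "or form died? ", ($with_or    ? "no" : "yes"), "\n";
print "message: $error";
```

The || form fails silently, while the or form dies with a message that includes the reason from $!. With parentheses around open's arguments, as in the original script, both operators behave the same.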

      --
      Ilya Martynov (http://martynov.org/)