missingthepoint has asked for the wisdom of the Perl Monks concerning the following question:

Greetings, monks.

Recently while reinstalling XP on a home machine, I forgot to back up the Thunderbird address book - oops. But since Thunderbird uses mbox format mailboxes, I suspected Perl could assist. The following code is the result of that suspicion. I grab From and To headers and parse them for email addresses (and associated names), then generate a CSV file that Thunderbird can import as an address book.

#!perl -w
#
# Parse an mbox file (Thunderbird et al) and extract email addresses (and names, if any)
# Generate a CSV file suitable for import into Thunderbird's Address Book
# Written because /somebody/ deleted the address book :D
# bjp 2008-12-21
#
use strict;

my %owner;

# below line not split in actual code
my $col_line = <<'EOL';
First Name,Last Name,Display Name,Nickname,Primary Email,Secondary Email,
Work Phone,Home Phone,Fax Number,Pager Number,Mobile Number,Home Address,
Home Address 2,Home City,Home State,Home ZipCode,Home Country,Work Address,Work Address 2,
Work City,Work State,Work ZipCode,Work Country,Job Title,Department,Organization,Web Page 1,
Web Page 2,Birth Year,Birth Month,Birth Day,Custom 1,Custom 2,Custom 3,Custom 4,Notes,
EOL

my $input_mbox = $ARGV[0] || die "Usage: $0 <mbox file>\n";
my $to_buf     = '';
my $within_to  = 0;

open( my $fh, '<', $input_mbox ) or die "open: $!\n";
while (<$fh>) {
    chomp;
    if (/^From:/) {
        s/^From://;
        s/^\s+//;
        parse_save_addrs($_);
    }
    if ( /^[-a-z]+:/i && $within_to ) {
        $within_to = 0;
    }
    if ( /^To:/ || $within_to ) {
        $within_to = 1;
        s/^To://;
        s/^\s+//;
        $to_buf .= $_;
    }
    if ( !$within_to && length $to_buf ) {
        $to_buf =~ s/\r\n|\n//g;
        #print "to_buf: $to_buf\n\n";
        parse_save_addrs($to_buf);
        $to_buf = '';
    }
}

print $col_line;
for ( sort keys %owner ) {
    print ",,$owner{$_},,$_,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,\n";
}

sub strip_lt_white {
    my ($str) = @_;
    $str =~ s/^\s+//;
    $str =~ s/\s+$//;
    $str;
}

sub parse_save_addrs {
    my ($addrs) = @_;
    my @list = split /,|;/, $addrs;
    for my $entry (@list) {
        $entry =~ s/^\s+//;
        $entry =~ s/\s+$//;
        if ($entry =~ /
            (?:
                (?:['"]*)     # optional quotes
                ([^<'"]*)     # anything up to end quotes or email delim
                (?:['"]*)     # optional quotes
                \s*           # optional whitespace
                <(.+)>        # email in lt-gt delimiters
            )
            |
            ([._a-z0-9]+\@[.-a-z0-9]+)
            /xi )
        {
            #no warnings;
            #print "1=[$1] 2=[$2] 3=[$3]\n";
            my $name  = $1 || '';
            my $email = $2 || $3;

            # skip junk
            next if $email =~ /<|>|=/ || $email =~ /mailto:/i;

            $name  = strip_lt_white($name);
            $email = strip_lt_white($email);
            $owner{$email} ||= $name;
            #print "name=[$name], email=[$email]\n";
        }
        else {
            print STDERR "Entry parse failed: $entry\n";
        }
        # print "$entry\n";
    }
}

This was enough to 'get the job done'. But how could I improve that code? Is there any advantage to using a CSV module (e.g. Text::CSV)? Also, using that regex to parse the names and addresses strikes me as fragile. How would you monks attack this problem? Any criticism of the above code is welcome too.
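On the Text::CSV question, one likely advantage is quoting: the hand-built print statement writes the display name verbatim, so a name containing a comma or a double quote would shift the columns. The following is a minimal sketch of the output step using Text::CSV (a CPAN module, not in core). The sample %owner data and the csv_row helper are made up for illustration, and the column count of 36 is taken from the named columns in the header line of the post.

```perl
use strict;
use warnings;
use Text::CSV;    # CPAN module, not part of core Perl

# Hypothetical sample data in the same shape as %owner (email => display name).
my %owner = (
    'jane@example.com' => 'Doe, Jane',
    'bob@example.org'  => 'Bob "The Builder" Smith',
);

my $n_columns = 36;    # named columns in the address book header

my $csv = Text::CSV->new( { binary => 1 } )
    or die 'Cannot use Text::CSV: ' . Text::CSV->error_diag;

# Build one address book row with only Display Name (index 2)
# and Primary Email (index 4) filled in. csv_row is a hypothetical helper.
sub csv_row {
    my ( $email, $name ) = @_;
    my @fields = ('') x $n_columns;
    @fields[ 2, 4 ] = ( $name, $email );
    $csv->combine(@fields) or die 'combine failed: ' . $csv->error_diag;
    return $csv->string;
}

print csv_row( $_, $owner{$_} ), "\n" for sort keys %owner;
```

The module wraps fields that need it in double quotes and doubles any embedded quotes, so the importing program still sees one field per column.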


Life is denied by lack of attention,
whether it be to cleaning windows
or trying to write a masterpiece...
-- Nadia Boulanger

Re: Reconstructing a Thunderbird address book from an mbox file
by tirwhan (Abbot) on Dec 21, 2008 at 11:38 UTC

    Since this seems to be a one-off project whose success can more or less be checked by hand, if this code works for you then fine. In general, though, I'd say you should use one of the many mbox-handling modules on CPAN (Mail::MboxParser, Mail::Mbox::MessageParser or the 800-pound gorilla of all mailbox-handling modules, Mail::Box). These modules help you avoid corner cases that your code does not cover. For example, your code above adds addresses found in the From: lines of forwarded messages to your own address book, which may not be what you want. The same goes for the email addresses: use Email::Address and you won't have to worry about things like commas in the quoted name preceding an email address. And if you want to see what a complete email-parsing regular expression looks like, run

    perl -MEmail::Address -e 'print $Email::Address::mailbox'
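As a rough illustration of the Email::Address suggestion, the sketch below replaces the split-on-comma step and the hand-written regex with Email::Address->parse. The header string is invented for the example, and %owner mirrors the hash in the original script. The module is on CPAN rather than in core, so this assumes it is installed.

```perl
use strict;
use warnings;
use Email::Address;    # CPAN module, not part of core Perl

# Hypothetical header value; note the comma inside the quoted name,
# which would be cut in two by split /,|;/.
my $header = '"Doe, Jane" <jane@example.com>, bob@example.org';

my %owner;
for my $addr ( Email::Address->parse($header) ) {
    # phrase() is the display-name part; it may be undef for a bare address
    my $name = defined $addr->phrase ? $addr->phrase : '';
    $owner{ $addr->address } ||= $name;
}

printf "%s => [%s]\n", $_, $owner{$_} for sort keys %owner;
```

Each parsed object carries the address and the display name separately, so no further cleanup of angle brackets or quotes should be needed before storing them.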

    All dogma is stupid.

      Many thanks++

      I should have spent more time searching CPAN before writing that... I came across Mail::Box but thought 'too intimidating and probably overkill' and moved on. I might try a rewrite with Mail::MboxParser - it looks like what I wanted to use.
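Short of a full rewrite with a module, the forwarded-message problem can be reduced by reading only each message's header block. This is a core-Perl sketch, not how any of the CPAN modules do it. It assumes that every message starts with a "From " separator line, that body lines beginning with "From " are escaped (for example as ">From "), and that headers end at the first blank line. The header_values function and the sample mbox text are hypothetical.

```perl
use strict;
use warnings;

# Return the unfolded From:/To: header values of every message in an mbox
# stream, ignoring anything that appears in message bodies.
sub header_values {
    my ($fh) = @_;
    my ( @values, $current );
    my $in_header = 0;
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;
        if ( $line =~ /^From / ) {    # mbox separator: a new message starts
            $in_header = 1;
            undef $current;
            next;
        }
        next unless $in_header;       # skip body lines entirely
        if ( $line eq '' ) {          # blank line ends the header block
            $in_header = 0;
            undef $current;
        }
        elsif ( $line =~ /^\s/ ) {    # continuation line: unfold it
            $line =~ s/^\s+/ /;
            $values[$current] .= $line if defined $current;
        }
        elsif ( $line =~ /^(?:From|To):\s*(.*)/i ) {
            push @values, $1;
            $current = $#values;
        }
        else {
            undef $current;           # some other header
        }
    }
    return @values;
}

# Hypothetical two-recipient message with a forwarded header block in its body.
my $mbox = <<'MBOX';
From - Sun Dec 21 10:00:00 2008
From: "Doe, Jane" <jane@example.com>
To: bob@example.org,
 carol@example.net
Subject: fwd

-------- Original Message --------
From: stranger@example.com
To: jane@example.com
MBOX

open my $fh, '<', \$mbox or die "open: $!";
print "$_\n" for header_values($fh);
```

The values returned here could then be handed to whatever address parser is in use; the From: and To: lines inside the forwarded body are never seen.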

      You have a valid point regarding the one-off nature of this, but 'in general' one-offs tend not to be - hence my question :)

      perl -MEmail::Address -e 'print $Email::Address::mailbox'

      Bloody hell.

      :P

