neilwatson has asked for the wisdom of the Perl Monks concerning the following question:

This is my first experiment with using fork. This program is a valid email checker much like this one. However, I am trying to speed it up by having the program check several addresses at once. Consider this code:

#!/usr/bin/perl -w
#checks for valid email address
#usage validemail <file containing email addresses>

use warnings;
use strict;
use Email::Valid::Loose;
use Net::DNS;
use Parallel::ForkManager;

my $pm=new Parallel::ForkManager(20);
my $resolver=Net::DNS::Resolver->new();
my $addrfile = $ARGV[0];
my ($is_valid, $host, $ip, @goodaddr, @badaddr, $x, $record, @mx, $add, @adds);

#custom words that make emails invalid to you
my @custom = qw( postmaster webmaster );

open (EMAILS, "$addrfile");
while (<EMAILS>){
    $_ =~ s/\015//;
    chomp $_;
    push @adds, $_;
}
close (EMAILS);

OUTER: foreach $add (@adds){
    $pm->start and next;

    foreach $x (@custom){
        if ($add =~ m/$x/){
            push (@badaddr, $add);
            next OUTER;
        }
    }

    #if email is invalid move on
    if (!defined(Email::Valid::Loose->address($add))){
        push (@badaddr, $add);
        next OUTER;
    }

    #if email is valid get hostname
    $is_valid = Email::Valid::Loose->address($add);
    if ($is_valid =~ m/\@(.*)$/) {
        $host = $1;
    }
    $is_valid="";

    # perform dsn lookup to check domain
    @mx=mx($resolver, $host);
    if (@mx) {
        push (@goodaddr, $add); #address is good
    }else{
        push (@badaddr, $add); #address is bad
    }
    $pm->finish;
}

#warning! I will delete existing files as I open them!
open (BADADDR, ">badmails") || die;
foreach $x (@badaddr){
    print BADADDR "$x\n";
}
close (BADADDR);

open (GOODADDR, ">goodmails") || die;
foreach $x (@goodaddr){
    print GOODADDR "$x\n";
}
close (GOODADDR);

The program seems to run ok but at the end it doesn't write anything to the goodmails or badmails files. The files are created but they're empty.

Can anyone help me?

Neil Watson
watson-wilson.ca

Re: Fork Me I need help
by Abigail-II (Bishop) on May 30, 2002 at 16:13 UTC
    Well, I don't know how Parallel::ForkManager works, but from the looks of your code it seems that $pm -> start; does a fork() (with the parent getting a true value, and the child a false one) and $pm -> finish; does an exit.

    If that is true, all the checking is done by the children, who each record which addresses are bad and good. However, the writing is done by the parent process, which didn't do any checking, and hence, has nothing to record. Note also that the children are in certain cases doing a 'next OUTER', meaning they will fork() too (and meaning some addresses will get checked many times).

    Let the children write to the file. But be *VERY* careful how you open these files, you don't want one child erasing the work of another child.

    If $pm -> start () does something totally different than a fork(), or $pm -> finish () something else than an exit, then this node was never written.

    Abigail

Re: Fork Me I need help
by Molt (Chaplain) on May 30, 2002 at 16:14 UTC

    When a process forks, the child gets a copy of the parent's environment, but it's not the same set, so adding to the arrays in the child won't affect those in the parent. This is more a task for threading than for forking; alternatively, write your output to a file (with suitable file locking) and have the parent read it back in when the children are done.
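The copy semantics described here can be seen in a few lines. This is only a minimal sketch using plain fork rather than Parallel::ForkManager; the helper name parent_view_after_fork is made up for illustration and is not part of the original program.

```perl
use strict;
use warnings;

# Hypothetical helper: fork once, let the child push onto the array,
# and return what the parent's copy looks like afterwards.
sub parent_view_after_fork {
    my @results = ('seeded by parent');

    my $pid = fork;
    die "Failed to fork: $!\n" unless defined $pid;

    if ($pid == 0) {
        # Child: this push only changes the child's private copy.
        push @results, 'added by child';
        exit 0;
    }

    # Parent: wait for the child, then look at its own, untouched copy.
    waitpid $pid, 0;
    return @results;
}

my @seen = parent_view_after_fork();
print scalar(@seen), " element(s) in the parent: @seen\n";
```

The parent still holds only its own element, which is the same reason @goodaddr and @badaddr stay empty in the parent of the original script.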

    Forks are for where you want the child to spin off with minimal contact, threads are where you want shared data and so forth. Personally I tend to stick to fork, this way I don't have to deal with threadsafe code, and use the file technique above.

    Another small point, you call the 'next' from the child. This is, I believe, very bad since you don't end up calling $pm->finish which is what I'd replace your next OUTER calls with.

    As an added freebie, here's the routine I'm currently using to do the 'adding to a file' bit. I appreciate this isn't particularly wondrous code, but it does seem to work.

    # Nice little code fragment to append to the end of a file.
    # Takes the name of the file to append to, followed by
    # all of the lines of text to append.
    use Fcntl qw(:flock);   # supplies LOCK_EX

    sub SafeAppend {
        my ($file, @content) = @_;
        s/\n*$/\n/ foreach @content;   # exactly one trailing newline per line
        local *FILE;
        open FILE, ">>$file" or die "ERROR: Cannot write $file: $!\n";
        flock (FILE, LOCK_EX) or die "ERROR: Cannot lock $file: $!\n";
        seek (FILE,0,2) or die "ERROR: Cannot seek to end of $file: $!\n";
        print FILE foreach @content;
        close FILE or die "ERROR: Cannot close $file: $!\n";   # also releases the lock
    }

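To show how the "children append, parent reads back" approach fits together, here is a rough sketch. It assumes plain fork instead of Parallel::ForkManager, restates the append routine with a lexical filehandle, and uses a made-up driver (collect_from_children) and a scratch file name that are not from the thread.

```perl
use strict;
use warnings;
use Fcntl qw(:flock :seek);
use File::Spec;

# Same idea as SafeAppend above, written with a lexical filehandle.
sub safe_append {
    my ($file, @lines) = @_;
    open my $fh, '>>', $file or die "Cannot open $file: $!\n";
    flock $fh, LOCK_EX       or die "Cannot lock $file: $!\n";
    seek $fh, 0, SEEK_END    or die "Cannot seek in $file: $!\n";
    print {$fh} map { "$_\n" } @lines;
    close $fh                or die "Cannot close $file: $!\n";  # releases the lock
}

# Hypothetical driver: one child per item, each appending its own line;
# the parent reaps every child and only then reads the file back.
sub collect_from_children {
    my ($file, @items) = @_;
    open my $init, '>', $file or die "Cannot create $file: $!\n";  # start empty
    close $init;

    for my $item (@items) {
        my $pid = fork;
        die "Failed to fork: $!\n" unless defined $pid;
        next if $pid;                        # parent keeps looping
        safe_append($file, "checked $item");
        exit 0;                              # child must not re-enter the loop
    }
    1 until wait() == -1;                    # wait for all children

    open my $in, '<', $file or die "Cannot read $file: $!\n";
    chomp(my @lines = <$in>);
    close $in;
    return sort @lines;
}

my $tmp = File::Spec->catfile(File::Spec->tmpdir, "safe_append_demo.$$.txt");
print "$_\n" for collect_from_children($tmp, qw(a@example.com b@example.org));
unlink $tmp;
```

Because each child opens the file itself, every child holds its own lock, and the explicit exit keeps a child from carrying on with the parent's loop.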
Re: Fork Me I need help
by neilwatson (Priest) on May 30, 2002 at 17:27 UTC
    Thanks for your input. I've done some rewriting:

    #!/usr/bin/perl -w
    #checks for valid email address
    #usage validemail <file containing email addresses>

    use warnings;
    use strict;
    use Email::Valid::Loose;
    use Net::DNS;
    use Parallel::ForkManager;

    my $pm=new Parallel::ForkManager(20);
    my $resolver=Net::DNS::Resolver->new();
    my $addrfile = $ARGV[0];
    my ($is_valid, $host, $ip, @goodaddr, @badaddr, $x, $record, @mx, $add, @adds);

    #custom words that make emails invalid to you
    my @custom = qw( postmaster webmaster );

    open (EMAILS, "$addrfile");
    while (<EMAILS>){
        $_ =~ s/\015//;
        chomp $_;
        push @adds, $_;
    }
    close (EMAILS);

    open (BADADDR, ">badmails") || die;
    open (GOODADDR, ">goodmails") || die;

    OUTER: foreach $add (@adds){
        $pm->start and next;

        foreach $x (@custom){
            if ($add =~ m/$x/){
                print BADADDR "$add\n"; #address is bad
                $pm->finish;
            }
        }

        #if email is invalid move on
        if (!defined(Email::Valid::Loose->address($add))){
            print BADADDR "$add\n"; #address is bad
            $pm->finish;
        }

        #if email is valid get hostname
        $is_valid = Email::Valid::Loose->address($add);
        if ($is_valid =~ m/\@(.*)$/) {
            $host = $1;
        }
        $is_valid="";

        # perform dsn lookup to check domain
        @mx=mx($resolver, $host);
        if (@mx) {
            print GOODADDR "$add\n"; #address is good
        }else{
            print BADADDR "$add\n"; #address is bad
        }
        $pm->finish;
    }

    close (GOODADDR);
    close (BADADDR);

    This code works. The only thing I noticed is that the parent will finish before some of the children (is that terminology correct?). Someone impatient would think the program is done when it may actually need another minute to finish. Can this be prevented?

    Neil Watson
    watson-wilson.ca

      You only think your code works. But what's going to happen if two children write at once?

      I would use something like:

      use warnings 'all';
      use strict;
      use Email::Valid::Loose;
      use Net::DNS;
      use Fcntl qw /:flock :seek/;

      ... setup stuff ...

      sub check_addresses;

      open my $bad  => "> badmails"  or die "Failed to open badmails: $!\n";
      open my $good => "> goodmails" or die "Failed to open goodmails: $!\n";

      my $chunk_size = int (@adds / 20) + 1;

      while (@adds) {
          my @chunk = splice @adds, 0, $chunk_size;
          my $pid = fork;
          die "Failed to fork: $!\n" unless defined $pid;
          unless ($pid) {
              check_addresses @chunk;
              exit;
          }
      }

      1 until wait () == -1;   # Wait till all children have died.

      exit;

      sub check_addr {
          my $address = shift;
          ... do some test, return 1 if good, 0 if bad ...
      }

      sub check_addresses {
          foreach my $address (@_) {
              my $handle = check_addr ($address) ? $good : $bad;
              flock $handle, LOCK_EX    or die "Flock failed: $!\n";
              seek  $handle, 0, SEEK_END or die "Seek failed: $!\n";
              print $handle "$address\n";
              flock $handle, LOCK_UN    or die "Flock failed: $!\n";
          }
      }
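The chunking step is what caps the number of processes: instead of one child per address, the list is cut into at most 20 pieces. A small sketch of just that arithmetic, with a hypothetical helper name (chunk_list) and made-up example addresses:

```perl
use strict;
use warnings;

# Hypothetical helper (not in the post): split @items into at most
# $workers chunks using the same chunk-size formula as the code above.
sub chunk_list {
    my ($workers, @items) = @_;
    my $size = int(@items / $workers) + 1;
    my @chunks;
    push @chunks, [ splice @items, 0, $size ] while @items;
    return @chunks;
}

my @adds   = map { "user$_\@example.com" } 1 .. 45;
my @chunks = chunk_list(20, @adds);
printf "%d addresses -> %d chunks of up to %d\n",
    scalar @adds, scalar @chunks, scalar @{ $chunks[0] };
# prints "45 addresses -> 15 chunks of up to 3"
```

Since the chunk size is always strictly larger than the address count divided by 20, the loop can never produce more than 20 chunks, and so never forks more than 20 children.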

      Abigail

        I believe you. Clinging to the hope that some of my code is good I tried this using some of your suggestions:

        #!/usr/bin/perl -w
        #checks for valid email address
        #usage validemail <file containing email addresses>

        use warnings;
        use strict;
        use Email::Valid::Loose;
        use Net::DNS;
        use Parallel::ForkManager;
        use Fcntl qw/:flock :seek/;

        my $pm=new Parallel::ForkManager(20);
        my $resolver=Net::DNS::Resolver->new();
        my $addrfile = $ARGV[0];
        my ($is_valid, $host, $x, @mx, $add, @adds, $handle);

        #custom words that make emails invalid to you
        my @custom = qw( postmaster webmaster );

        open (EMAILS, "$addrfile");
        while (<EMAILS>){
            $_ =~ s/\015//;
            chomp $_;
            push @adds, $_;
        }
        close (EMAILS);

        #warning, I will delete existing files
        open (BADADDR, ">badmails") || die;
        open (GOODADDR, ">goodmails") || die;

        foreach $add (@adds){
            $pm->start and next;

            foreach $x (@custom){
                if ($add =~ m/$x/){
                    writebad(); #address is bad
                    $pm->finish;
                }
            }

            #if email is invalid move on
            if (!defined(Email::Valid::Loose->address($add))){
                writebad(); #address is bad
                $pm->finish;
            }

            #if email is valid get domain name
            $is_valid = Email::Valid::Loose->address($add);
            if ($is_valid =~ m/\@(.*)$/) {
                $host = $1;
            }
            $is_valid="";

            # perform dsn lookup to check domain
            @mx=mx($resolver, $host);
            if (@mx) {
                writegood(); #address is good
            }else{
                writebad(); #address is bad
            }
            $pm->finish;
        }

        close (GOODADDR);
        close (BADADDR);

        1 until wait () == -1; # Wait till all children have died.
        exit;

        sub writegood{
            flock GOODADDR, LOCK_EX or die "Flock failed: $!\n";
            seek GOODADDR, SEEK_END, 0 or die "Seek failed: $!\n";
            print GOODADDR "$add\n";
            flock GOODADDR, LOCK_UN or die "unFlock failed: $!\n";
        }

        sub writebad{
            flock BADADDR, LOCK_EX or die "Flock failed: $!\n";
            seek BADADDR, SEEK_END, 0 or die "Seek failed: $!\n";
            print BADADDR "$add\n";
            flock BADADDR, LOCK_UN or die "unFlock failed: $!\n";
        }

        Alas, now only a couple of blank lines are written to goodmails and badmails. Fork Me!

        Neil Watson
        watson-wilson.ca