Kumaravel has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

Following is code that runs fine on any machine. But if I submit the job to a cluster to run in batch mode, the script hangs as if asleep.
## Runs fine devarajk>/u/devarajk/downloads/check_cvs.pl
## This goes to sleep mode devarajk>sqrsh -grid desa /u/devarajk/downloads/check_cvs.pl
sqrsh is a utility we use to submit jobs to clusters.
#!/usr/local/bin/perl -w
use Data::Dumper;
use Cvs;

my $repository = "prod.nyc.com:/proj/ops/repository";
my $dir        = "/tmp/";
my $cvs_obj    = new Cvs(
    $dir,
    cvsroot        => $repository,
    "remote-shell" => 'rsh',
    "debug"        => 1,
) or die $Cvs::ERROR;

my $res = $cvs_obj->checkout(".");
print Dumper($res);
print "Successfully initialized cvs object for repository $repository and checked out in $dir\n";
I traced the code to where it sleeps. It's in IPC/Run.pm:
$nfound = select(
    $self->{ROUT} = $self->{RIN},
    $self->{WOUT} = $self->{WIN},
    $self->{EOUT} = $self->{EIN},
    $timeout
);
IPC/Run.pm is called by Cvs/Command/Base.pm
The code actually checks out the repository, but then it hangs.
Any help would be appreciated. Is it related to STDOUT/STDIN buffering problems?
Thanks,
Kumaravel

Replies are listed 'Best First'.
Re: Cvs::checkout not working when run on cluster
by zentara (Cardinal) on Dec 26, 2005 at 11:58 UTC
    Just wild-a**-brainstorming..... that $timeout in the select statement sure stands out. If you read perldoc -f select, that timeout sets the blocking timeout for the filehandles. Maybe you could make a special Run.pm for debugging, print the $timeout value, and see what is happening. If $timeout is undef, then select will block (seemingly sleep) until something appears on the filehandles. Maybe hardcode $timeout to 0.01 and see what happens? Print what is on all the filehandles. It might not solve your problem, but it will help get the ball rolling. :-)
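    A minimal sketch of the kind of tracing suggested above: log $timeout (and the result) around a 4-arg select. The handles and values here are stand-ins for illustration, not the real IPC::Run internals.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # Build a read bit-vector watching STDIN, the way IPC::Run builds $self->{RIN}.
    my ($rin, $win, $ein) = ('', '', '');
    vec($rin, fileno(STDIN), 1) = 1;

    # An undef $timeout makes select() block forever when nothing arrives;
    # a small finite value returns promptly with $nfound == 0.
    my $timeout = 0.01;    # hardcoded for debugging, instead of possibly undef

    warn "about to select, timeout=", defined $timeout ? $timeout : 'undef', "\n";
    my $nfound = select(my $rout = $rin, my $wout = $win, my $eout = $ein, $timeout);
    warn "select returned, nfound=$nfound\n";
    ```

    If the hang disappears with a hardcoded timeout, the blocking call is the symptom and the question becomes why nothing ever shows up on those handles.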

    I'm not really a human, but I play one on earth. flash japh
Re: Cvs::checkout not working when run on cluster
by Kumaravel (Novice) on Dec 26, 2005 at 14:01 UTC
Re: Cvs::checkout not working when run on cluster
by zentara (Cardinal) on Dec 26, 2005 at 16:44 UTC
    If all the variables are undef, then it's not a "sleep" problem; you have to ask, "What happened to the filehandles?" It sounds like the program IPC::Run is trying to run has failed to open properly. Is there some way you can get more debug output?
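    One quick way to answer "what happened to the filehandles?" is to check, just before the select, whether each handle still has a valid file descriptor. The handles below are stand-ins for whatever IPC::Run is actually watching.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;

    # fileno() returns undef for a closed or never-opened handle,
    # which would explain select() having nothing real to watch.
    for my $pair ([STDIN => \*STDIN], [STDOUT => \*STDOUT], [STDERR => \*STDERR]) {
        my ($name, $fh) = @$pair;
        my $fd = fileno($fh);
        warn defined $fd
            ? "$name is open on fd $fd\n"
            : "$name has no file descriptor (closed?)\n";
    }
    ```

    Under a batch scheduler it is common for one or more of the standard handles to be closed or redirected to something that never becomes ready.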

    I'm not really a human, but I play one on earth. flash japh
Re: Cvs::checkout not working when run on cluster
by ph713 (Pilgrim) on Dec 27, 2005 at 14:18 UTC
    Since the problem only occurs when running under your "sqrsh" utility, which submits remote commands, I would assume it is triggered by how sqrsh deals with STDIN/STDOUT/STDERR.

    In my own parallel execution tools (which are probably similar to sqrsh), I ended up emulating a TTY for the remote command to execute under - it was the only way to solve all of the esoteric issues one hits with various remote commands.

    If you'd like to try to upgrade sqrsh to work around the whole general case (instead of hacking up check_cvs.pl to work around sqrsh), see the forkptycmd() function I've documented in a PerlMonks comment here. It was developed with the help of the monks here, and seems to solve the problem for me. my $fh = forkptycmd('rsh machine123 somecommand'); returns a filehandle you can use to read from and write to the executing rsh process, and makes that process believe it is writing to a real terminal.

    Also since it forks off subprocesses for commands and returns filehandles for them, you can, for instance, do things like:

    foreach my $host (@hosts) {
        $cmdfh->{$host} = forkptycmd("rsh $host $command");
    }

    And then use select(), poll(), or other types of nonblocking methods to watch all the filehandles in parallel. This way you don't end up waiting on one to finish before you issue the next - true parallel remote execution.
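    A self-contained sketch of that multiplexing pattern, using IO::Select. Plain pipe-opens stand in here for forkptycmd() (which would hand back pty-backed handles instead), and the hosts and commands are made up for illustration.

    ```perl
    #!/usr/bin/perl
    use strict;
    use warnings;
    use IO::Select;

    my @hosts = qw(hostA hostB);
    my %host_for_fd;
    my $sel = IO::Select->new;

    for my $host (@hosts) {
        # In real use: my $fh = forkptycmd("rsh $host $command");
        open my $fh, '-|', "echo output-from-$host"
            or die "can't spawn command for $host: $!";
        $host_for_fd{ fileno $fh } = $host;
        $sel->add($fh);
    }

    # Drain all handles as output arrives, in whatever order it arrives.
    while ($sel->count) {
        for my $fh ($sel->can_read) {
            if (defined(my $line = <$fh>)) {
                print "[$host_for_fd{ fileno $fh }] $line";
            }
            else {
                $sel->remove($fh);    # EOF: this command finished
                close $fh;
            }
        }
    }
    ```

    The point is that no single slow host holds up the others; each handle is read only when it is ready.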

    Beware that at larger numbers of hosts there may be other issues to solve as well (for example, if you're using rsh as root, you can only run a small number (~120-ish) in parallel before you run out of privileged ports to issue the rshs from).