in reply to batch processing via single open

Am I looking for something like PHP's require_once?
You are looking for require. If a file is successfully imported via require, it won't be imported again during the program's lifetime. On subsequent calls, require'ing the same file just returns 1 (or some other bizarre but "true" value).

From your post the calling semantics are not clear to me. "My perl file is executed for every document in a large collection" - how is this perl file invoked? From a shell? From perl itself? If you are invoking perl anew for each file in the set (and exiting perl when done), you won't gain much with require.

If you showed some code it would be easier to help you. See How (Not) To Ask A Question.

--shmem

_($_=" "x(1<<5)."?\n".q·/)Oo.  G°\        /
                              /\_¯/(q    /
----------------------------  \__(m.====·.(_("always off the crowd"))."·
");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}

Re^2: batch processing via single open
by karden (Novice) on Jul 16, 2007 at 17:41 UTC
    Thank you shmem.

    I am to run this perl file from shell for every file in a large corpus (I have to, because this is only a small part of a large project. Execution continues with various other languages and scripts and so on afterwards), not from another perl file. So daxim's highly sophisticated code is unfortunately useless to me.

    As you said, "require" would not bring me much advantage, because upon finishing execution the connection will be auto-closed even if I do not close() it. And while processing the next file, the same perl code will be executed, and the require'd file will be executed again.

    Roughly, I want to have the following:
    db.pl

    use IPC::Open2;
    open2(\*INP, \*OUTP, 'path_to_external_program');
    print "included";
    1;

    maincode.pl
    require "db.pl";
    $file = $ARGV[0];                   # the next file to process
    print OUTP some_command_using_$file; # exec some cmds on external prog
    # more lines follow afterwards

    I want my maincode.pl to function for every file, as well as db.pl to print "included" only once. So that for the whole collection of thousands of documents, I will have only single connection throughout the batch execution. Possible in a way?
      Possible in a way?

      Of course! That's why I asked about calling semantics. Which part is gathering the collection of files to iterate over? The shell script? Do you want perl to gather the files? What system are you on?

      <handwaving style="amount: lots"> If you are gathering the files via a shell script, you could do something along these lines

      ( while read command param; do   # whatever method
            files=`command param`      # to gather the files
            for file in $files; do     # process files
                echo $file
            done
        done ) | maincode.pl

      and in maincode.pl

      #!/usr/bin/perl
      use IPC::Open2;

      my $cmd = 'whatever'; # really, I have no idea what you are doing
      $pid = open2(\*CHLD_OUT, \*CHLD_IN, $cmd) or die "oops: $!\n";
      while (<STDIN>) {
          chomp;
          my $file = $_;
          # now do whatever with $file.
          open(I, '<', $file) or die "can't open $file: $!\n";
          while (my $line = <I>) {
              ...  # do whatever with each line in the file
          }
          close I;
      }

      but since I still don't know what you're up to, I can't give proper advice. See I know what I mean. Why don't you?

      Maybe you really want some client/server stuff, or the perl code to act as a stream filter and dispatch its output somewhere else. Each usage has different semantics; how can I know which is required without a bit more explanation from you?

      --shmem

        No, no... we are talking past each other at the moment. This has nothing to do with reading from or writing to files. Forget about that.

        I am not looking for a way to handle files. My shell script gathers the file names and passes them to my Perl script; I do not want Perl to take over that part, I am okay with it. Then my Perl file connects to "another program", executes some commands there (depending on the argument passed in from the shell) and exits. But this happens hundreds of thousands of times. Why not open the connection to "that program" only ONCE and use the same connection hundreds of thousands of times?

        I certainly understand you asking for detailed information, and I am quite aware of the posting guidelines, but this is part of an experiment for a TREC proposal, so it is impossible for me to describe such a long thesis here. If I could not clarify what I am looking for so far, then that's the limitation of my English and there is nothing more to be done.

        One final try from me: though it is not, let us assume the "external program" we are dealing with is MySQL. That is: my Perl file opens an IPC connection to MySQL, selects a DB to work on and creates a table named $ARGV[0]. That's the way it is, we cannot change it. Also assume that we want to create 100000 different tables. Now, why open a connection to MySQL 100000 times? Let us open the connection and select the DB once at the beginning and then just create the 100000 tables. It would save us 99999 * (the time to open a new connection and select the database), wouldn't it?

        Still does not make sense? :(