alanraetz has asked for the wisdom of the Perl Monks concerning the following question:

I need to process a very large set of data. Currently it gets dumped from a database, is processed by a single-threaded perl script (the script does data validation), and is loaded back into the database. This currently takes about 3 hours. I was thinking of writing something that splits the file up and divides the task, but I was struck by this line (in http://perldoc.perl.org/perlfork.html): "Any filehandles open at the time of the fork() will be dup()-ed. Thus, the files can be closed independently in the parent and child, but beware that the dup()-ed handles will still share the same seek pointer. Changing the seek position in the parent will change it in the child and vice-versa."

So, does this mean I can open this huge file, fork() a bunch of processes that each process individual lines, and thus create a very terse multi-process file processor? I'm thinking there could be race conditions on the output, but if each fork writes its own output file and these are aggregated after all the children are done (and duplicate processing may be irrelevant)... why wouldn't that work?
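
Roughly the pattern I have in mind (an untested sketch; the file name and child count are just placeholders):

    use strict;
    use warnings;

    my $children = 4;
    open my $in, '<', 'huge_input.txt' or die "open: $!";

    for my $n (1 .. $children) {
        my $pid = fork();
        defined $pid or die "fork: $!";
        next if $pid;                     # parent keeps forking
        # child: read lines off the shared (dup()-ed) handle,
        # write to its own output file to avoid races on output
        open my $out, '>', "part.$n.out" or die "open: $!";
        while (my $line = <$in>) {
            # ... validate $line here ...
            print {$out} $line;
        }
        exit 0;
    }
    wait() for 1 .. $children;            # parent reaps the children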

Re: Will a shared filehandle processed by multiple threads/processes cause problems?
by Athanasius (Archbishop) on Jul 01, 2014 at 06:42 UTC

    Maybe I’m misunderstanding, but I don’t see why you need to share the filehandle among the child processes at all. Have the parent process read in the file, split the data into appropriately sized chunks (or lines, as you say), and feed each chunk to a different child process. This avoids all the problems arising from shared filehandles, including those detailed by wrog below, without losing any of the potential benefits of utilising multiple cores. (Whether those benefits outweigh the additional overhead of creating and managing the child processes is another question, one you will need to answer by profiling.)
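
    A rough sketch of that shape, assuming Parallel::ForkManager from CPAN (the file name and chunk size are placeholders; it reads the file a chunk at a time rather than all at once, which matters for a file this size):

        use strict;
        use warnings;
        use Parallel::ForkManager;

        my $max_procs  = 4;                # roughly one per core
        my $chunk_size = 100_000;          # lines per child; tune to taste

        open my $in, '<', 'huge_input.txt' or die "open: $!";
        my $pm = Parallel::ForkManager->new($max_procs);

        my $n = 0;
        until (eof $in) {
            my @chunk;
            while (defined(my $line = <$in>)) {
                push @chunk, $line;
                last if @chunk >= $chunk_size;
            }
            $n++;
            $pm->start and next;           # parent: go read the next chunk
            # child: process its chunk and write its own output file
            open my $out, '>', "part.$n.out" or die "open: $!";
            for my $line (@chunk) {
                # ... validate/transform $line here ...
                print {$out} $line;
            }
            $pm->finish;                   # child exits here
        }
        $pm->wait_all_children;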

    Hope that helps,

    Athanasius <°(((>< contra mundum Iustus alius egestas vitae, eros Piratica,

      Yes, this was my original thought: to split up the file first and then spawn a process for each chunk... thanks for the response. Actually, my thought now is to point the script at a test input file that takes about 10 seconds to process and either do some speed profiling or just attempt direct code optimization and see if I can speed it up. There are numerous inefficiencies in the script, so I have some ideas.
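
      (For the profiling step, one option is Devel::NYTProf from CPAN; the script and file names here are just placeholders:)

          perl -d:NYTProf process_data.pl test_input.txt
          nytprofhtml      # writes an HTML report under ./nytprof/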
Re: Will a shared filehandle processed by multiple threads/processes cause problems?
by wrog (Friar) on Jul 01, 2014 at 06:26 UTC
    Two problems:
    1. buffering,
      i.e., you may think you're only reading one line, but under the hood you may be slurping up a lot more, and there's no real way to control how much or to guarantee that what you get will end on a line boundary. The only way out of that box is to use the low-level operations (sysread, sysseek), at which point you give up the speed advantage of buffering and probably make many more system calls (sysseek) just to leave the file pointer exactly at the end of a line.
    2. synchronization
      if two processes attempt to read or seek the same handle at the same time, all bets are off as to what will actually be read by either, or where the seek pointer will end up. You will have to do some kind of locking to ensure that only one process is reading at a time (a rough sketch follows this list).
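
      A minimal sketch of that locking piece (it locks a separate lock file, because flock on the dup()-ed data handle itself would not keep forked siblings apart; they share one open file description):

          use strict;
          use warnings;
          use Fcntl qw(:flock);

          # Hypothetical helper: run some reads while holding an exclusive lock,
          # so only one worker touches the shared handle at a time, e.g.
          #   my ($buf) = with_read_lock(sub { sysread($fh, my $b, 65536); $b });
          sub with_read_lock {
              my ($code) = @_;
              open my $lock, '>>', 'reader.lock' or die "open lock: $!";
              flock($lock, LOCK_EX) or die "flock: $!";
              my @result = $code->();
              close $lock;                 # closing the handle releases the lock
              return @result;
          }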

    Much depends on how big the individual lines are vs. how much computation needs to be done on each one. If the latter is what's killing you and you have a true multicore processor (i.e., where multiple processes really can run simultaneously) then this approach could indeed win. If, on the other hand, you're mostly I/O-bound, i.e., reading+writing are what's taking up your time, then probably not.

    You may want to consider having one process do all of the reading, giving it outgoing pipes to the other processes, which all pull lines from the reader on demand (this involves more games with signals or semaphores); that gets you the benefits of both buffering and concurrency. It will be worth it if the lines are generally much smaller than the buffer size.
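
    A rough, untested sketch of that shape (it round-robins lines to the workers instead of having them pull on demand, which skips the signal/semaphore games; the names are made up):

        use strict;
        use warnings;

        my $workers = 4;
        my @to_worker;

        for my $n (0 .. $workers - 1) {
            my $pid = open(my $wfh, '|-');    # fork; the child's STDIN is the pipe
            defined $pid or die "fork: $!";
            if ($pid == 0) {                  # child
                close $_ for @to_worker;      # drop write ends inherited so far
                open my $out, '>', "part.$n.out" or die "open: $!";
                while (my $line = <STDIN>) {
                    # ... validate/transform $line here ...
                    print {$out} $line;
                }
                exit 0;
            }
            push @to_worker, $wfh;            # parent keeps the write end
        }

        open my $in, '<', 'huge_input.txt' or die "open: $!";
        my $i = 0;
        while (my $line = <$in>) {
            print { $to_worker[ $i++ % $workers ] } $line;
        }
        close $_ for @to_worker;              # EOF tells the workers to finish
        wait() for 1 .. $workers;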

      Thanks for the response, makes sense. I think any perl code that depends on assumptions about what happens under the hood is bad practice, so I will avoid this. But I wanted to throw it out there to get feedback. Thanks.
Re: Will a shared filehandle processed by multiple threads/processes cause problems?
by CountZero (Bishop) on Jul 01, 2014 at 09:57 UTC
    The data comes out of a database: can't you have your processing and validation scripts access the database directly?

    With a little thinking, perhaps you can re-write the SQL (*) in such a way that the task is already split up at the database level; then you can more easily have multiple scripts, perhaps even multiple PCs, work on it concurrently.

    (*): Say all the data records have a reference number, or a timestamp, or ... in one of the fields. Then you can have one script handle all the even-numbered references, timestamps, ... and a second script handle all the odd-numbered records.
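
    A rough DBI sketch of that idea (the DSN, table and column names are invented; MOD() is standard SQL, so most databases should accept it):

        use strict;
        use warnings;
        use DBI;

        # Run as:  perl validate_part.pl 0    (first worker)
        #          perl validate_part.pl 1    (second worker)
        my $worker  = shift(@ARGV) // 0;
        my $workers = 2;

        my $dbh = DBI->connect('dbi:Pg:dbname=mydb', 'user', 'password',
                               { RaiseError => 1 });
        my $sth = $dbh->prepare(
            'SELECT id, payload FROM my_table WHERE MOD(id, ?) = ?'
        );
        $sth->execute($workers, $worker);
        while (my ($id, $payload) = $sth->fetchrow_array) {
            # ... validate $payload, queue corrections, etc. ...
        }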

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics
Re: Will a shared filehandle processed by multiple threads/processes cause problems?
by roboticus (Chancellor) on Jul 01, 2014 at 12:54 UTC

    alanraetz:

    Many people will grab the tool they know best for the job, whether or not it's the right one. I don't know what your expertise is, nor do I know enough about your problem to know if that applies here. But are you sure that perl is the best tool for the task at hand? If the data validations are simple, you may be better served doing the validations in SQL and letting the database do all the heavy lifting. It may be a good deal faster and (ultimately) simpler.

    I only mention it because at work it's all too common to see someone do the equivalent of something like:

    $SQL = 'update tbl set val=? where id=?';
    my $ST = $DB->prepare($SQL);
    $SQL = 'select id, val from tbl';
    $ar = $DB->selectall_arrayref($SQL);
    for my $r (@$ar) {
        if    ($r->[1] > 100) { $ST->execute(100, $r->[0]) }
        elsif ($r->[1] < 0)   { $ST->execute(0,   $r->[0]) }
    }

    This code has two serious problems. First, it extracts *all* rows from the database when it should extract only the rows it needs; every row fetched consumes time and bandwidth, so you can save a lot of resources by *not* pulling out stuff you don't need. Second, the validations are too simple to justify doing the work in perl; the database can do the work itself, something like:

    $DB->do("update tbl set val=100 where val>100"); $DB->do("update tbl set val=0 where val<0");

    Any time you need to iterate over a large quantity of data in a database, you should consider whether it's a task that should be done entirely in the database.

    Even if the validation is complex enough to dictate some perl code involvement, you may be able to preprocess some of it in SQL to reduce the amount of data you need to read from and write back to the database.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

      or have a stored procedure written in Perl

        wrog:

        That would be pretty neat. I know PostgreSQL will do that, but I haven't actually tried it yet. I don't know of any other databases that will let you do that, though.
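
        Something along these lines, if I have the PL/Perl syntax right (untested; it assumes the plperl extension is installed and reuses the toy tbl/val example from above):

            CREATE OR REPLACE FUNCTION clamp_val(v integer) RETURNS integer AS $$
                my ($v) = @_;
                return 100 if $v > 100;
                return 0   if $v < 0;
                return $v;
            $$ LANGUAGE plperl;

            -- and the validation becomes:
            -- UPDATE tbl SET val = clamp_val(val);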

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.