Re: Multithreading Parsers
by kvale (Monsignor) on Mar 22, 2005 at 00:53 UTC
Sounds like you are doing bioinformatics work. Usually the long computation time in this sort of problem comes from searching large numbers of strings.
In that case, the problem is trivially parallelizable: partition the strings into N sets and create N processes on N processors, each searching one set. If you don't run into I/O bottlenecks at that point, you might even consider more than one process per CPU.
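A minimal sketch of that partitioning with plain fork(), where load_strings() and search_strings() are hypothetical stand-ins for your own loading and searching code:

    use strict;
    use warnings;

    my $N       = 28;                   # one worker per CPU
    my @strings = load_strings();       # hypothetical: however you load your data
    my @kids;

    for my $i ( 0 .. $N - 1 ) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ( $pid == 0 ) {              # child: search only every Nth string
            my @share = @strings[ grep { $_ % $N == $i } 0 .. $#strings ];
            search_strings(@share);     # hypothetical: your existing search
            exit 0;
        }
        push @kids, $pid;               # parent: remember the child, keep forking
    }
    waitpid $_, 0 for @kids;            # wait for all the workers to finish

On unix the fork is copy-on-write, so the children share the read-only string data for free.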
Re: Multithreading Parsers
by Tanktalus (Canon) on Mar 22, 2005 at 01:01 UTC
Personally, I would look at splitting up the work a bit differently. Look at what you're doing as processing, and do that in subprocesses. On Windows, I've avoided forking, largely because I thought it didn't work (I've since been corrected by some other monks), but also because forking on Windows is done via emulation, which is quite a bit slower than on unix/linux. Since you're running single-threaded, you must not be using fork either.
However, you're now on unix, and looking to abuse the CPUs being thrown at you. So, see if you can find 29+ items that you can process without their knowing about each other. For example, if you have 100 files, and you don't need to know about stuff from file 1 when reading file 40, just process each file in a separate process (probably using Parallel::ForkManager), wait for them all to finish, and then pull together the results. You may generate some intermediary files that get pulled together by your parent process, but that should still be mounds faster.
You can also take advantage of faster formats than XML for these intermediary files (say, using Storable). I avoid this module, too, but that's because it isn't compatible from version to version. In your case, it could be perfect, in that you're using the same perl virtual machine, and thus physically the same Storable module, for both writing and reading. Much faster than generating XML and reinterpreting it.
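A rough sketch of that pattern, where process_file() is a hypothetical stand-in for whatever your per-file work actually is (assumed here to return a reference, since that is what Storable wants), and the final merge depends entirely on your data:

    use strict;
    use warnings;
    use Parallel::ForkManager;
    use Storable qw(store retrieve);

    my $pm    = Parallel::ForkManager->new(28);   # cap at one child per CPU
    my @files = glob 'data/*.xml';

    for my $file (@files) {
        $pm->start and next;       # parent continues the loop; child falls through
        my $result = process_file($file);  # hypothetical: returns a reference
        store $result, "$file.partial";    # intermediary file via Storable, not XML
        $pm->finish;
    }
    $pm->wait_all_children;

    for my $file (@files) {        # parent pulls the pieces back together
        my $part = retrieve "$file.partial";
        # ... merge $part into your final result however your data demands
    }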
Once you go down this road, you can investigate actual threads - threads would largely reduce the need for intermediary files, but this is not likely a huge sticking point quite yet for your performance.
I'm not sure I'm even on the right track for how you can speed this up. But I'm not sure you're fully aware of where you can split up tasks, and which tasks take a long time. Likely the actual parsing of the XML file(s) isn't what takes time; it's working with the data in memory that takes a long time. Thus, you could even parse the XML file in your parent process, then fork out 15-30 subprocesses to work on different trees in your XML data. Again, with fork, you may put the results in intermediary files, or you may investigate the many IPC modules to communicate back with the parent process, perhaps via shared memory; or you might bypass all of that by simply having each child print to a pipe, and having the parent read from all the children's pipes (using IO::Select and IPC::Pipe).
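The pipe idea might look roughly like this; @subtrees and process_subtree() are placeholders for however you carve up the parsed document:

    use strict;
    use warnings;
    use IO::Select;

    my $sel = IO::Select->new;
    my %output;                         # collected results, keyed by read handle

    for my $subtree (@subtrees) {       # hypothetical: your per-child work units
        pipe( my $read, my $write ) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ( $pid == 0 ) {              # child: write results down the pipe
            close $read;
            print {$write} process_subtree($subtree);  # hypothetical worker
            close $write;
            exit 0;
        }
        close $write;                   # parent keeps only the read ends
        $sel->add($read);
    }

    while ( $sel->count ) {             # drain whichever children are ready
        for my $fh ( $sel->can_read ) {
            if ( sysread $fh, my $buf, 8192 ) {
                $output{$fh} .= $buf;
            }
            else {                      # EOF: that child is done
                $sel->remove($fh);
                close $fh;
            }
        }
    }
    1 while wait() != -1;               # reap the children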
Let us know if there are more details you can divulge to help with. ;-)
Re: Multithreading Parsers
by BrowserUk (Patriarch) on Mar 22, 2005 at 01:16 UTC
Is it possible or practical to multi-thread these?
Possible: yes.
Practical: maybe.
Easy: no.
I'm assuming that the reason for the time taken to parse is the volume of data being processed.
- If the bioperl data is something like FASTA format--ie. a flat format with single records (even though each record is spread across multiple lines)--then the problem is reasonable to tackle. A single thread of execution (TOE) reads the file and feeds a bunch of your object-stream parsers running in separate TOEs. The results are forwarded to a final TOE that gathers them, optionally reordering & collating them, and writes them to the output file(s). (See the sketch after this list.)
The TOEs can be either processes communicating through sockets, or threads using either sockets or queues.
- For the XML, life is more awkward due to the hierarchical nature of XML. However, within the outer level of most xml documents there are usually many repetitions of smaller, self-contained subtrees. It should be possible to hand these off to separate TOEs running whichever xml parser you favour and have them processed as discrete documents.
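A minimal threaded sketch of that record-stream pipeline, using Thread::Queue; read_record() and parse_record() are hypothetical stand-ins for your own reader and parser:

    use strict;
    use warnings;
    use threads;
    use Thread::Queue;

    my $in  = Thread::Queue->new;       # reader -> workers
    my $out = Thread::Queue->new;       # workers -> collector

    my @workers = map {
        threads->create( sub {
            while ( defined( my $rec = $in->dequeue ) ) {
                $out->enqueue( parse_record($rec) );   # hypothetical parser
            }
        } );
    } 1 .. 8;                           # eight worker TOEs; tune to taste

    while ( defined( my $rec = read_record(\*STDIN) ) ) {  # hypothetical reader
        $in->enqueue($rec);             # feed whole records to the workers
    }
    $in->enqueue(undef) for @workers;   # one terminator per worker
    $_->join for @workers;
    $out->enqueue(undef);

    while ( defined( my $res = $out->dequeue ) ) {
        print $res;                     # collate/reorder here if order matters
    }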
Going beyond that vague description requires more information on the scale and nature of the problem.
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
Re: Multithreading Parsers
by djantzen (Priest) on Mar 22, 2005 at 01:06 UTC
A streaming parser doesn't seem like a very threadable kind of thing, insofar as you're talking about walking a document tree in parallel. I think you'd end up doing a great deal of work trying to rig something up where, for example, you'd have dedicated threads for processing particular node types.
I think I'd be more inclined to try to break the raw input into separate chunks that could be handled by parallel processes, something your 28-processor machine could do very well. So, maybe rather than reengineering your existing code to use threads, you could write a frontend script to divide the work into whole units and then fork off processes.
Hard to say much more without concrete examples.
"The dead do not recognize context" -- Kai, Lexx
Re: Multithreading Parsers
by perlfan (Parson) on Mar 22, 2005 at 01:49 UTC
First things first (as Napoleon Dynamite would say): luckeee....
Ok, now the beef. I would not look at "threading" anything. You have 28 processors, so if you can split things up, utilize AIX's built-in ability to distribute processes among the 28 CPUs. In a nutshell: split the data into 28 parts, run the script 28 times on the 28 chunks of data, and join the results back together.
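For instance, assuming the data has already been split into files chunk.000 .. chunk.027 (hypothetical names) and parse.pl is your existing, unmodified script, the driver could be as simple as:

    use strict;
    use warnings;

    my @chunks = map { sprintf 'chunk.%03d', $_ } 0 .. 27;
    my @kids;

    for my $chunk (@chunks) {
        my $pid = fork();
        die "fork: $!" unless defined $pid;
        if ( $pid == 0 ) {
            # AIX will spread these 28 processes across the 28 CPUs
            exec 'perl', 'parse.pl', $chunk;   # hypothetical: your existing script
            die "exec failed: $!";
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;
    # ... then join the 28 result files back together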
If you are considering truly "parallelizing" whatever it is you are doing, then I highly recommend you start over and use MPI. If it must be in perl, look at Parallel::MPI, the MPI interface for Perl.
If you describe what you are doing (and want to write a parallel app), I am sure some of us can suggest a domain/task decomposition scheme that would help you out.
BTW, bioinformatics screams, "parallelize me!!" :)
Re: Multithreading Parsers
by Robertn (Hermit) on Mar 22, 2005 at 08:32 UTC
You might wish to just move everything first and see how long it takes, before you jump into re-engineering. I moved a process that used to take 45 to 55 minutes to the new box, and it took only 14. At first I thought something must be broken because it completed so quickly, and it took a few runs before we were convinced everything was working properly. The users were so happy that no further action was necessary.
Re: Multithreading Parsers
by inq123 (Sexton) on Mar 22, 2005 at 08:56 UTC
I totally agree with the suggestion to parallelize the input/output, not the algorithm.
For example, if your parser is parsing XML files in GAME format, then simply glob all the files that need to be parsed, split that array into 28 smaller arrays, and fork 27 children, each working on one of them. I do similar things all the time. Most of the time the outputs are also independent of each other; the only thing to worry about is contention for things like DB connections/transactions (but changing settings can help).
Or, say, you're parsing the whole Genbank release -- even easier: just run 28 copies of your program on 28 files! A very simple divide-and-conquer strategy :)
Trying to parallelize the algorithm, especially since you're using modules others developed, would not be a good investment of time when you could simply parallelize the input and achieve the same savings.
Re: Multithreading Parsers
by crenz (Priest) on Mar 22, 2005 at 12:12 UTC
Don't read this. Read BrowserUK's reply. I'm wrong... but I learned something!
Just as an aside, since it hasn't been mentioned explicitly: I have no experience with AIX, but my guess is that it can only shift processes to different processors, but not threads, as a process can't run (efficiently) on two different processors. So if you use the divide & conquer strategies mentioned, make sure you fork().
"but my guess is that it can only shift processes to different processors, but not threads, as a process can't run (efficiently) on two different processors."
Why do you say that?
To quote from this AIX Programming Guide:
    On a uniprocessor system, threads execute one after another in a time-sliced manner. This contrasts with a multiprocessor system, where several threads execute at the same time, one on each available processor. Overall performance is improved by running different process threads on different processors. However, an individual program cannot take advantage of multiprocessing, unless it has multiple threads.
Indeed, I am not aware of any flavour of *nix that supports POSIX threads but does not allow threads to run on all available CPUs. Nor am I aware of any efficiency problems inherent in doing so.
Is this just FUD or do you know something I don't?
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco.
Rule 1 has a caveat! -- Who broke the cabal?
That is why I say forget about threads here :). Additionally, if you want to figure out whether you'll get any benefit out of the whole "divide and conquer" approach, check out the master theorem.
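For reference, the master theorem covers divide-and-conquer recurrences of the form

    T(n) = a\,T(n/b) + f(n), \qquad a \ge 1,\ b > 1,

and states (omitting the regularity condition on the third case):

    T(n) =
    \begin{cases}
      \Theta(n^{\log_b a})         & \text{if } f(n) = O(n^{\log_b a - \epsilon}) \text{ for some } \epsilon > 0, \\
      \Theta(n^{\log_b a} \log n)  & \text{if } f(n) = \Theta(n^{\log_b a}), \\
      \Theta(f(n))                 & \text{if } f(n) = \Omega(n^{\log_b a + \epsilon}) \text{ for some } \epsilon > 0.
    \end{cases}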