in reply to Multithreading Parsers

Personally, I would look at splitting up the work a bit differently. Look at what you're doing as processing, and do that in subprocesses. On Windows, I've avoided forking, largely because I thought it didn't work (I've since been corrected by some other monks), but also because forking on Windows is done via emulation, which is quite a bit slower than a real fork on unix/linux. Since you're running single-threaded, you must not be using fork, either.

However, you're now on unix, and looking to abuse the CPUs being thrown at you. So see if you can find 29+ items that can be processed without knowing about each other. For example, if you have 100 files, and you don't need to know anything from file 1 when reading file 40, just process each file in a separate process (probably using Parallel::ForkManager), wait for them to finish, and then pull the results together. You may end up generating some intermediary files that get pulled together by your parent process, but that should still be mounds faster. You can also take advantage of faster formats than XML for these intermediary files (say, Storable). I avoid that module, too, but that's because its format isn't compatible from version to version. In your case, it could be perfect: you're using the same perl interpreter, and thus physically the same Storable module, for both writing and reading. Much faster than generating XML and reinterpreting it.
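A minimal sketch of that shape, assuming one result file per input file (the `process_file` routine and the `data/*.xml` glob are placeholders for whatever your real work and inputs are):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Parallel::ForkManager;
use Storable qw(store retrieve);

# Placeholder for your real per-file processing.
sub process_file {
    my ($file) = @_;
    return { file => $file, count => 42 };
}

my @files = glob("data/*.xml");             # assumed input layout
my $pm    = Parallel::ForkManager->new(30); # up to 30 concurrent children

for my $file (@files) {
    $pm->start and next;                    # parent: spawn child, move on
    my $result = process_file($file);
    store($result, "$file.result");         # intermediary file via Storable
    $pm->finish;                            # child exits here
}
$pm->wait_all_children;

# Parent pulls the intermediary files back together.
my @results = map { retrieve("$_.result") } @files;
```

The children never talk to each other; the only coordination is the parent waiting and then slurping the `.result` files.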

Once you go down this road, you can investigate actual threads - threads would largely remove the need for intermediary files, but that is not likely a huge sticking point for your performance quite yet.
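For comparison, a minimal ithreads sketch (assuming your perl was built with thread support; the doubling is a stand-in for real processing) where results flow back through a shared queue instead of intermediary files:

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $work    = Thread::Queue->new(1 .. 100);   # items to process
my $results = Thread::Queue->new;

# Each worker drains the shared queue until it's empty.
sub worker {
    while (defined(my $item = $work->dequeue_nb)) {
        $results->enqueue($item * 2);         # stand-in for real work
    }
}

my @threads = map { threads->create(\&worker) } 1 .. 4;
$_->join for @threads;
$results->end;

my @out;
push @out, $_ while defined($_ = $results->dequeue_nb);
```

No files, no serialization - the queue does the hand-off in memory.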

I'm not sure I'm even on the right track for how you can speed this up. But I'm also not sure you're fully aware of where you can split up tasks, and which tasks take a long time. Likely, the actual parsing of the XML file(s) isn't what takes time; it's working with the data in memory that does. Thus, you could even parse the XML file in your parent process, then fork off 15-30 subprocesses to work on different trees in your XML data. Again, with fork, you may put the results in intermediary files, or you may investigate the many IPC modules to communicate back with the parent process, perhaps via shared memory - or you may bypass all that by simply having each child print to a pipe, and having the parent read from all the children's pipes (using IO::Select and IO::Pipe).
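The pipe variant can be sketched like this (all core modules; the "chunk" numbers stand in for the subtrees each child would actually work on):

```perl
use strict;
use warnings;
use IO::Select;
use IO::Pipe;

my $sel      = IO::Select->new;
my $children = 4;                        # pretend each child gets one subtree

for my $chunk (1 .. $children) {
    my $pipe = IO::Pipe->new;
    my $pid  = fork;
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                     # child: write result, exit
        $pipe->writer;
        print {$pipe} "result for chunk $chunk\n";
        exit 0;
    }
    $pipe->reader;                       # parent: keep the read end
    $sel->add($pipe);
}

# Parent drains whichever pipes have data until every child hits EOF.
my @results;
while ($sel->count) {
    for my $fh ($sel->can_read) {
        my $line = <$fh>;
        if (defined $line) { push @results, $line }
        else               { $sel->remove($fh); close $fh }
    }
}
1 while wait != -1;                      # reap the children
```

Because IO::Select multiplexes the read ends, a slow child never blocks the parent from collecting the fast ones.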

Let us know if there are more details you can divulge to help with ;-)