ethrbunny has asked for the wisdom of the Perl Monks concerning the following question:
Re: Threading - getting better use of my MP box
by BrowserUk (Patriarch) on Jul 02, 2007 at 18:57 UTC
There are at least two questions here. Basically, you'll need to profile your application to work out where the bottlenecks are. Armed with that information and a little more detail (such as where the GB of data comes from: a single source as a single document; multiple documents from a single source; or multiple documents from multiple sources?), it may then be possible to see ways of using threads and/or processes to speed up your processing.

Even if this is one large XML document, it may contain repetitive sub-documents that could easily be recognised without recourse to a proper XML parser, or (maybe) using mirod's brilliant and very well supported XML::Twig, and split out from the document as it is being downloaded. You could then pass these on to other threads for further parsing before finally sending them to a DB thread for upload to the DB.

Bottom line: more info required :)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
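BrowserUk's suggestion of splitting out repetitive sub-documents with XML::Twig might look something like the sketch below. The element name 'record' and the input filename are assumptions for illustration; the key point is that purge() releases each sub-document after it is handled, so the whole multi-GB file is never held in memory at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use XML::Twig;

# Hypothetical element name 'record': fire a handler for each repeated
# sub-document as it is parsed, rather than loading the whole file.
my $twig = XML::Twig->new(
    twig_handlers => {
        record => sub {
            my ( $t, $elt ) = @_;
            my $xml = $elt->sprint;   # serialised sub-document
            # ...queue $xml to a worker thread/process here...
            $t->purge;                # release everything parsed so far
        },
    },
);
$twig->parsefile('big_feed.xml');     # assumed input file
```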
by ethrbunny (Monk) on Jul 06, 2007 at 15:54 UTC
- connecting to 60+ clients each night. This number will rise to 500+ over the next year.
- from each client we download 10-500 files of XML data. Some of it is encrypted but most is straight-up XML.
- each file ranges from 0.1Kb to 50Mb.
- from each client we get 1-10 different types of data. Each data type goes into 1 or more tables across several databases.
- all databases are running on one machine. All parsing is done on one machine.
- each file is copied from a remote client to our MP box. It's then parsed and moved to a backup area.

Any IO wait issues will be on the client side. Each is connected to the net via a Linksys router into a standard cable modem. The upload speed is throttled by the provider. The (highly simplified) LAN diagram is: clients -> MP box that does the copy and parse -> DB box. The DB box is not 'exposed' to the outside.

From what I'm reading so far it sounds like I could / should break out the copy process into > 1 thread. I can also thread the DBD section, but I shouldn't share the same connection handle; each thread should create / delete its own DBD connection. Accurate?

Are there any 'rules of thumb' for the number of concurrent processes I should create? I'm assuming (naively) that RH Enterprise will task-swap them appropriately. For an 8-way box maybe I should start with 4 concurrent procs and see how it handles it?
by BrowserUk (Patriarch) on Jul 06, 2007 at 17:43 UTC
"I can also thread the DBD section but I shouldn't share the same connection handle. Each thread should create / delete its own DBD connection."

I saw another poster suggest that it was 'safe to thread DBI provided each thread has its own handle'. I was reluctant to intercede, as that poster may know better than I, but certainly at least at one point in time this was not the case. Some or all DBI drivers (and/or the shared libraries that underlie them) have been shown not to be thread-safe on some or all platforms at some point in the past. The problem is that, for at least some drivers/libraries, the process ID is used to key data to the connection, so using multiple handles from a single process causes things to get mixed up. You will have to go to the DBI mailing list for the current skinny on which drivers are thread-safe where--if any.

You will also probably be advised that you "should not use DBI with or from threads". This seems to be the de facto standard position. I will say that you can use DBI from a threaded program provided that you only use DBI from one thread of that program. Basically, if the program uses DBI from one (say the main) thread, then the presence of other threads that do not use DBI should not compromise anything, as the DBI code will not even be aware that they are there.

In any case, as you are processing large numbers of mostly relatively small files, and running under *nix, there seems to be little or no advantage in using threads over processes for this. In this case I suggest using processes rather than threads. Parallel::ForkManager seems almost ideal for the application. Have the main process load an array of the urls to be fetched and then fork a child to do a simple, linear read->parse->upload-to-DB-and-die. The parent process' role is simply to monitor the child processes and start another when one completes. Tuning the process is simply a case of varying the set_max_procs() at startup.
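A minimal sketch of that read->parse->upload-to-DB-and-die scheme with Parallel::ForkManager might look as follows. The DSN, credentials, and the load_url_list/fetch/parse/insert_records helpers are all hypothetical placeholders; the point is that each forked child opens its own DBI connection, sidestepping any driver thread-safety questions entirely.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
use Parallel::ForkManager;

my @urls = load_url_list();                 # assumed helper
my $pm   = Parallel::ForkManager->new(8);   # tune later via set_max_procs()

for my $url (@urls) {
    $pm->start and next;                    # parent: spawn child, move on

    # Child: open its OWN connection; nothing is shared with the parent.
    my $dbh = DBI->connect( 'dbi:mysql:feeds', 'user', 'pass',
        { RaiseError => 1 } );              # assumed DSN/credentials
    my $xml = fetch($url);                  # assumed helper: download
    insert_records( $dbh, parse($xml) );    # assumed helpers: parse, load
    $dbh->disconnect;

    $pm->finish;                            # child exits here
}
$pm->wait_all_children;
```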
With an 8-way box (assuming nothing else heavy is running concurrently on that box), you should run at least one process per CPU for maximum throughput. As each process will be IO-bound doing the download for at least some large percentage of its time, you should be able to achieve greater throughput by having double or even treble the number of processes running as you have CPUs. The CPU-bound processes (those that have completed their downloads and are into the parsing phase) will be able to usefully use the time-slices that the IO-bound processes relinquish in recv wait states.

For best effect, you should decrease (nice) the priority of the processes once they complete their downloads and move into the parsing phase. That will allow the IO-bound processes to respond quickly to the receipt of data, but will have little effect on the CPU-bound processes, as the IO-bound ones will move back into another wait state almost immediately.

A further option for tuning is to arrange the urls, or the picking of urls for newly forked children, so that you avoid downloading more than one file from the same host concurrently. That probably means keeping the urls for each host in separate arrays and only using a url for a host once the previous download from that host completes. That complicates the model somewhat, as you would need to use 'nested parallel managers'. The original process would fork a subprocess to process the urls for a given host. That child would download the first file, then fork again. The grandchild process would then parse the data downloaded while its parent starts another download from the same host. Once the parent completes its download, it would wait until the child completes before forking again and repeating the process. The grandparent process would only fork another once that host has been completed. It's hard to describe, but not too difficult to program.

Anyway, a few ideas for you to mull over.
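The per-host nesting described above is easier to show than to describe. A sketch, with download() and parse_and_upload() as assumed helpers and @urls assumed loaded elsewhere: each per-host child serialises its downloads, overlaps each download with the previous file's parse in a niced grandchild, and waits for the parse before forking the next one.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(nice);
use URI;
use Parallel::ForkManager;

# Group the urls by host so no host sees more than one concurrent download.
my %by_host;
for my $url (@urls) {                       # @urls assumed loaded elsewhere
    push @{ $by_host{ URI->new($url)->host } }, $url;
}

my $pm = Parallel::ForkManager->new(8);     # one child per host, up to 8
for my $host ( keys %by_host ) {
    $pm->start and next;                    # grandparent moves to next host

    my $parser_pid = 0;
    for my $url ( @{ $by_host{$host} } ) {
        my $file = download($url);          # assumed helper; IO-bound
        waitpid $parser_pid, 0 if $parser_pid;  # previous parse must finish
        $parser_pid = fork // die "fork: $!";
        if ( $parser_pid == 0 ) {           # grandchild: CPU-bound phase
            nice(10);                       # drop priority once parsing
            parse_and_upload($file);        # assumed helper
            exit 0;
        }
        # parent loops: the next download overlaps the running parse
    }
    waitpid $parser_pid, 0 if $parser_pid;
    $pm->finish;
}
$pm->wait_all_children;
```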
Given the architecture of your setup, I see little benefit in using threads for this.
by Joost (Canon) on Jul 06, 2007 at 18:00 UTC
by BrowserUk (Patriarch) on Jul 06, 2007 at 19:06 UTC
Re: Threading - getting better use of my MP box
by samtregar (Abbot) on Jul 02, 2007 at 18:44 UTC
As for #2, the answer is complicated. InnoDB supports real transactions with configurable semantics (read-committed is my favorite). It does support row-level locking, but that doesn't tell you as much as you might think. I recommend some extended quality time with the manual.

#3 - yes. Your 8-way boxes need parallel code to perform up to their potential. From your description it sounds like just separating downloading, parsing and DB inserts into three processes would be a big improvement. Beyond that you might consider processing multiple streams of download/parse/insert in parallel.

-sam
Re: Threading - getting better use of my MP box
by archfool (Monk) on Jul 02, 2007 at 17:20 UTC
(Though some DBD implementations like DBD::Oracle DO support multiple threads) According to this, though, you can make a private DBI connection in each thread safely.
From the DBI docs, "Threads and Thread Safety":

Perl 5.7 and later support a new threading model called iThreads. (The old "5.005 style" threads are not supported by the DBI.) In the iThreads model each thread has its own copy of the perl interpreter. When a new thread is created, the original perl interpreter is 'cloned' to create a new copy for the new thread.

If the DBI and drivers are loaded and handles created before the thread is created, then it will get a cloned copy of the DBI, the drivers and the handles. However, the internal pointer data within the handles will refer to the DBI and drivers in the original interpreter. Using those handles in the new interpreter thread is not safe, so the DBI detects this and croaks on any method call using handles that don't belong to the current thread (except for DESTROY). Because of this (possibly temporary) restriction, newly created threads must make their own connections to the database. Handles can't be shared across threads.

But BEWARE, some underlying database APIs (the code the DBD driver uses to talk to the database, often supplied by the database vendor) are not thread safe. If it's not thread safe, then allowing more than one thread to enter the code at the same time may cause subtle/serious problems. In some cases allowing more than one thread to enter the code, even if not at the same time, can cause problems. You have been warned.

Using DBI with perl threads is not yet recommended for production environments.
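Following the quoted docs, a minimal sketch of the "private connection per thread" pattern might look like this. The DSN and credentials are placeholders; note that no handle is created before the threads spawn, so nothing gets cloned across interpreters, and the driver-level thread-safety caveats quoted above still apply.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use threads;
use DBI;

# Each thread makes its OWN connection inside the thread body; handles
# are never created before threads->create, so none are cloned.
my @workers = map {
    threads->create( sub {
        my $dbh = DBI->connect( 'dbi:mysql:feeds', 'user', 'pass',
            { RaiseError => 1 } );      # assumed DSN/credentials
        # ...use $dbh in this thread only...
        $dbh->disconnect;
    } );
} 1 .. 4;
$_->join for @workers;
```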
Re: Threading - getting better use of my MP box
by wind (Priest) on Jul 02, 2007 at 18:50 UTC
Most likely the parsing and inputting of data into the database are the quickest operations by far, so you can leave these as single-threaded for simplicity's sake.

- Miller
Re: Threading - getting better use of my MP box
by aufflick (Deacon) on Jul 04, 2007 at 05:50 UTC
If you're simply reading records and then inserting, you can simply split the file into chunks and process each in a separate forked process (which, on Unix, is probably faster than perl ithreads), each with its own DB connection.

If you're doing intensive processing, perhaps you can split that into 2 (or more) phases - then you can implement each phase in a separate Perl program and connect them with pipes, the last phase of which will have the DB connection. This is still only using one DB connection. I'm not sure if you need to use multiple connections for MySQL to use multiple backend processes/threads, but you could possibly run this 'pipeline' multiple times also.

Another completely unrelated option you can consider is to do all the processing into a flat file, slurp that over to the db machine and use LOAD DATA LOCAL (or whatever it is - that's from memory) to bulk load the data. That will save you a huge amount of database overhead on that sort of volume of data.
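The bulk-load idea above might be sketched as follows, assuming MySQL. The file path, table name, column layout, and credentials are all placeholders, and @records is assumed to come from the parse phase; note DBD::mysql needs mysql_local_infile=1 in the DSN for LOAD DATA LOCAL to be allowed from the client side.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

# Write the parsed records to a tab-separated flat file first.
open my $fh, '>', '/tmp/records.tsv' or die "open: $!";
print {$fh} join( "\t", @$_ ), "\n" for @records;   # @records assumed
close $fh;

# mysql_local_infile=1 permits the client to feed the file to the server.
my $dbh = DBI->connect(
    'dbi:mysql:database=feeds;mysql_local_infile=1', # assumed DSN
    'user', 'pass', { RaiseError => 1 },
);
$dbh->do(q{
    LOAD DATA LOCAL INFILE '/tmp/records.tsv'
    INTO TABLE feed_records
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
});
```

One LOAD DATA statement per file avoids a round trip and an INSERT parse per row, which is where most of the per-record overhead goes at this volume.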
Re: Threading - getting better use of my MP box
by weismat (Friar) on Jul 06, 2007 at 04:21 UTC