ethrbunny has asked for the wisdom of the Perl Monks concerning the following question:

I'm working on a study where we bring in nearly a GB of XML data each day, parse it, and put it into MySQL for various academics to pore over. It takes 4-6 hours to download all the bits from the places they're collected at and parse them. This volume will rise 3-5 fold over the next year.

I recently took over / inherited the code for this and found that it's running only a single thread: sequentially read, parse, repeat. I have two boxes for this job: the db machine and the read / parse machine. Both are 8-way with 8 GB of RAM.

My questions are thus:
1) how well do DBI and DBD react to being threaded?
2) does DBI use row locking or table locking when 'AutoCommit = 0' is used? (the tables in question are InnoDB)
3) would threading this process gain me anything or should I be happy with its current (functional) condition?

Suggestions are appreciated.

Re: Threading - getting better use of my MP box
by BrowserUk (Patriarch) on Jul 02, 2007 at 18:57 UTC

    There are at least two questions here.

    1. Will the DB make effective use of its multiple-processor box?

      You'll need to consult the docs for your version of MySQL.

    2. Can you multi-thread the 'parse XML and feed the DB' end?

      As archfool and the documentation point out, it's not safe to use DBI from multiple threads for most flavours of DBI driver. However, it is safe to use DBI from within a multi-threaded Perl application, provided that you only interact with the DB from a single thread.

      So you could have (for example) several threads running fetching the XML data from the various sources and parsing it and then passing the processed data to a single DB thread for insertion. Whether that is an effective strategy will depend upon where the bottlenecks occur in your existing code.

      • If most of your runtime is spent in IO waits downloading, overlapping the downloads on different threads may be effective if you are downloading from multiple sources.
      • If a large proportion of the time is spent cpu-bound, parsing the XML, then parsing different documents on different threads should spread the cpu-bound parts of the processing across the multiple cpus. I say should because apparently some threaded OSs do not do this by default.
      • But if the bottleneck is uploading the data to the DB, making the above parts of the processing more efficient and then funnelling the data through a single thread for DB interaction could simply exacerbate the problem. In this case (multiple documents from multiple sources) there would be little advantage in using threads. Better to use multiple processes, each operating on documents from different sources.
      • If your datasource is a single huge XML document that has to be downloaded as a single entity and parsed as a single entity, it might be a bit harder to make effective use of multiprocessing on that document. See below.
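    The "several parser threads feeding a single DB thread" arrangement described above can be sketched with core ithreads and Thread::Queue. This is a minimal sketch, not production code: parse_xml() and insert_row() are stand-ins for the real XML parsing and DBI work, and it assumes a threads-enabled perl.

    ```perl
    #!/usr/bin/perl
    # Sketch: several parser threads feed one queue; a single "DB" thread
    # drains it, so only one thread ever touches the database layer.
    use strict;
    use warnings;
    use threads;
    use threads::shared;
    use Thread::Queue;

    my $rows_inserted :shared = 0;          # stand-in for "rows sent to DB"
    my $parsed_q = Thread::Queue->new;      # parsed records flow through here

    sub parse_xml  { my ($doc) = @_; return uc $doc }                 # stand-in
    sub insert_row { my ($row) = @_; lock $rows_inserted; $rows_inserted++ }

    # One thread owns all DB interaction; undef on the queue means "shut down".
    my $db_thread = threads->create(sub {
        while (defined(my $row = $parsed_q->dequeue)) {
            insert_row($row);
        }
    });

    # Several parser threads, one per document here for simplicity.
    my @docs = ('<a/>', '<b/>', '<c/>', '<d/>');
    my @parsers = map {
        my $doc = $_;
        threads->create(sub { $parsed_q->enqueue(parse_xml($doc)) });
    } @docs;

    $_->join for @parsers;
    $parsed_q->enqueue(undef);   # signal the DB thread to finish
    $db_thread->join;

    print "rows inserted: $rows_inserted\n";   # prints "rows inserted: 4"
    ```

    In the real application the parser threads would dequeue downloaded files from a second queue rather than starting one thread per document.
    
    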

    Basically, you'll need to profile your application to work out where the bottlenecks are. Armed with that information and a little more detail (like whether the GB of data comes from a single source as a single document, multiple documents from a single source, or multiple documents from multiple sources), it may be possible to see ways of using threads and/or processes to speed up your processing.

    Even if this is one large XML document, it may contain repetitive sub-documents that could be easily recognised without recourse to a proper XML parser, or (maybe) using mirod's brilliant and very well supported XML::Twig, and split out from the document as it is being downloaded. You could then pass these on to other threads for further parsing before finally sending to a DB thread for upload to the DB.
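    The "split out repetitive sub-documents without a proper parser" idea can be sketched in a few lines of core Perl. This assumes a hypothetical record layout where the interesting chunks are delimited by a fixed `<record>...</record>` pair and records never nest; real feeds would need their own delimiter and more care (CDATA, attributes on the tag, etc.).

    ```perl
    #!/usr/bin/perl
    # Sketch: peel complete <record>...</record> chunks out of a stream as
    # it arrives, so each chunk can be handed to a parser thread/process
    # while the download is still in progress.
    use strict;
    use warnings;

    my $buffer = '';
    my @records;

    # In real use, each $chunk would come from the in-progress download;
    # here we feed the splitter two arbitrary chunks, one split mid-tag.
    for my $chunk ('<feed><record>one</record><rec', 'ord>two</record></feed>') {
        $buffer .= $chunk;
        # Remove every complete record currently in the buffer.
        while ($buffer =~ s{<record>(.*?)</record>}{}s) {
            push @records, $1;   # hand this to a worker for full parsing
        }
    }

    print "got: @records\n";   # prints "got: one two"
    ```

    Anything left in `$buffer` at the end of a download (besides wrapper markup) would indicate a truncated record worth logging.
    
    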

    Bottom line: more info required :)


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Some more info:
      -connecting to 60+ clients each night. This number will rise to 500+ over the next year.
      -from each client we download 10-500 files of XML data. Some of it is encrypted but most is straight-up XML.
      -each file ranges from 0.1 KB to 50 MB
      -from each client we get 1-10 different types of data. Each data type goes into 1 or more tables across several databases.
      -all databases are running on one machine. All parsing is done on one machine.
      -each file is copied from a remote client to our MP box. It's then parsed and moved to a backup area.


      Any IO wait issues will be on the client side. Each is connected to the net via a linksys router into a standard cable modem. The upload speed is throttled by the provider.

      The (highly simplified) LAN diagram is:
      clients -> MP box that does the copy and parse -> DB box. The DB box is not 'exposed' to the outside.

      From what I'm reading so far, it sounds like I could / should break out the copy process into more than one thread. I can also thread the DBD section, but I shouldn't share the same connection handle; each thread should create / delete its own DBD connection.

      Accurate?

      Are there any rules of thumb for the number of concurrent processes I should create? I'm assuming (naively) that RH Enterprise will task-swap them appropriately. For an 8-way box, maybe I should start with 4 concurrent procs and see how it handles it?
        I can also thread the DBD section, but I shouldn't share the same connection handle. Each thread should create / delete its own DBD connection.

        I saw another poster suggest that it was 'safe to thread DBI provided each thread has its own handle'. I was reluctant to intercede, as that poster may know better than I, but certainly at at least one point in time this was not the case.

        Some or all DBI drivers (and/or the shared libraries that underlie them) have been shown not to be thread-safe on some or all platforms at some point in the past. The problem is that, for at least some drivers/libraries, the process ID is used to key data to the connection, so using multiple handles from a single process causes things to get mixed up.

        You will have to go to the DBI mailing list for the current skinny on which drivers are thread-safe where, if any. You will also probably be advised that you "should not use DBI with or from threads". This seems to be the de facto standard position.

        I will say that you can use DBI from a threaded program provided that you only use DBI from one thread of that program. Basically, if the program uses DBI from one (say the main) thread, then the presence of other threads that do not use DBI should not compromise anything as the DBI code will not even be aware that they are there.

        In any case, as you are processing large numbers of mostly relatively small files, and running under *nix, there seems to be little or no advantage in using threads over processes for this, so I suggest using processes rather than threads.

        Parallel::ForkManager seems almost ideal for the application. Have the main process load an array of the URLs to be fetched and then fork a child to do a simple, linear read->parse->upload-to-DB-and-die. The parent process's role is simply to monitor the child processes and start another when one completes.

        Tuning the process is simply a case of varying set_max_procs() at startup. With an 8-way box (assuming nothing else heavy is running concurrently on it), you should run at least one process per CPU for maximum throughput. As each process will be IO-bound doing the download for at least some large percentage of its time, you should be able to achieve greater throughput by running double or even treble the number of processes as you have CPUs.
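        A minimal sketch of that loop, assuming Parallel::ForkManager is installed. The URL list and fetch_parse_load() are hypothetical stand-ins for the real work; the constructor argument (or set_max_procs()) is the tuning knob discussed above.

        ```perl
        #!/usr/bin/perl
        # Sketch: fork one short-lived child per URL, capped at 2x CPUs
        # because each child spends much of its life IO-bound downloading.
        use strict;
        use warnings;
        use Parallel::ForkManager;

        my @urls = map { "http://client$_.example/export.xml" } 1 .. 20;  # hypothetical
        sub fetch_parse_load { my ($url) = @_; }   # stand-in: read -> parse -> insert

        my $cpus = 8;
        my $pm   = Parallel::ForkManager->new( 2 * $cpus );
        # or adjust later with $pm->set_max_procs( 3 * $cpus );

        for my $url (@urls) {
            $pm->start and next;     # parent: schedule the next URL; child: fall through
            fetch_parse_load($url);  # simple, linear work, then die
            $pm->finish;
        }
        $pm->wait_all_children;
        ```

        Because each child exits after one URL, there is no handle sharing to worry about: every child makes (and loses) its own DB connection.
        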

        The CPU-bound processes (those that have completed their downloads and are into the parsing phase) will be able to usefully use the time-slices that the IO-bound processes relinquish in recv wait states. For best effect, you should decrease (nice) the priority of the processes once they complete their downloads and move into the parsing phase. That will allow the IO-bound processes to respond quickly to the receipt of data, but will have little effect on the CPU-bound processes, as the IO-bound ones will move back into another wait state almost immediately.

        A further option for tuning is to arrange the URLs, or the picking of URLs for newly forked children, so that you avoid downloading more than one file from the same host concurrently. That probably means keeping the URLs for each host in separate arrays and only starting a new download for a host once the previous download from that host completes. That complicates the model somewhat, as you would need to use 'nested parallel managers'.

        The original process would fork a subprocess to process the URLs for a given host. That child would download the first file, then fork again. The grandchild process would then parse the downloaded data while its parent starts another download from the same host. Once the parent completes its download, it would wait until the child completes before forking again and repeating the process. The grandparent process would only fork another subprocess once that host has been completed. It's hard to describe, but not too difficult to program.
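        The nested scheme can be sketched in core Perl with plain fork/waitpid. This is a sketch under stated assumptions: download() and parse() are stand-ins (parse() just records that a file was handled), and error handling is minimal.

        ```perl
        #!/usr/bin/perl
        # Sketch: one child per host works through that host's URLs serially
        # (never two concurrent downloads from one host), forking a grandchild
        # to parse each file while the next download runs.
        use strict;
        use warnings;

        my %urls_by_host = (          # hypothetical work lists
            hostA => [ 'a1', 'a2' ],
            hostB => [ 'b1' ],
        );

        sub download { my ($url) = @_; return "data-for-$url" }   # stand-in
        sub parse {                                               # stand-in
            my ($url, $data) = @_;
            open my $fh, '>', "parsed_$url.tmp" or die "parsed_$url.tmp: $!";
            print {$fh} $data;
            close $fh;
        }

        my @children;
        for my $host (keys %urls_by_host) {
            my $pid = fork;
            die "fork: $!" unless defined $pid;
            if ($pid == 0) {                                # child: one host
                my $parser_pid;
                for my $url (@{ $urls_by_host{$host} }) {
                    my $data = download($url);              # serial per host
                    waitpid $parser_pid, 0 if $parser_pid;  # previous parse done?
                    $parser_pid = fork;
                    die "fork: $!" unless defined $parser_pid;
                    if ($parser_pid == 0) {                 # grandchild: parse
                        parse($url, $data);
                        exit 0;
                    }
                }
                waitpid $parser_pid, 0 if $parser_pid;
                exit 0;
            }
            push @children, $pid;
        }
        waitpid $_, 0 for @children;
        ```

        Parallel::ForkManager's nested managers tidy up the bookkeeping, but this is essentially what they do underneath.
        
        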

        Anyway, a few ideas for you to mull over. Given the architecture of your setup, I see little benefit in using threads for this.


Re: Threading - getting better use of my MP box
by samtregar (Abbot) on Jul 02, 2007 at 18:44 UTC
    I recommend you avoid threading in Perl - it's really not a very good implementation. However, that doesn't mean you have to avoid multi-processing! Check out Parallel::ForkManager, a module that makes forking off sub-processes and collecting results reasonably pain-free. Just make sure you make new DBI connections in the child processes since multiple processes can't share a connection. Also, you'll have to set InactiveDestroy on any parent handles in the child so they don't get closed when the child exits.
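    The handle hygiene sam describes looks roughly like this. A sketch only, assuming DBI is installed and using placeholder DSN/credentials and a hypothetical `parsed` table; it cannot run without a reachable MySQL server.

    ```perl
    # Sketch: fork a worker, give it its own DBI connection, and mark the
    # inherited parent handle InactiveDestroy so the child's exit does not
    # tear down the parent's server-side connection.
    use strict;
    use warnings;
    use DBI;

    my $parent_dbh = DBI->connect('dbi:mysql:study', 'user', 'pass',   # placeholders
                                  { RaiseError => 1 });

    my $pid = fork;
    die "fork: $!" unless defined $pid;
    if ($pid == 0) {
        # Child: never use the inherited handle; just defuse its DESTROY.
        $parent_dbh->{InactiveDestroy} = 1;

        my $child_dbh = DBI->connect('dbi:mysql:study', 'user', 'pass',
                                     { RaiseError => 1 });
        $child_dbh->do('INSERT INTO parsed (doc) VALUES (?)', undef, 'payload');
        $child_dbh->disconnect;
        exit 0;
    }
    waitpid $pid, 0;
    # $parent_dbh is still fully usable here.
    ```
    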

    As for #2, the answer is complicated. InnoDB supports real transactions with configurable semantics (read-committed is my favorite). It does support row-level locking, but that doesn't tell you as much as you might think. I recommend some extended quality time with the manual.

    #3 - yes. Your 8-way boxes need parallel code to perform up to their potential. From your description it sounds like just separating downloading, parsing and DB inserts into three processes would be a big improvement. Beyond that you might consider processing multiple streams of download/parse/insert in parallel.

    -sam

Re: Threading - getting better use of my MP box
by archfool (Monk) on Jul 02, 2007 at 17:20 UTC
    From the DBI.pm perldoc:
    (Though some DBD implementations like DBD::Oracle DO support multiple threads)
    According to this, though, you can make a private DBI connection in each thread safely.

    (DBI) Threads and Thread Safety

    Perl 5.7 and later support a new threading model called iThreads. (The old "5.005 style" threads are not supported by the DBI.)

    In the iThreads model each thread has its own copy of the perl interpreter. When a new thread is created the original perl interpreter is 'cloned' to create a new copy for the new thread.

    If the DBI and drivers are loaded and handles created before the thread is created then it will get a cloned copy of the DBI, the drivers and the handles.

    However, the internal pointer data within the handles will refer to the DBI and drivers in the original interpreter. Using those handles in the new interpreter thread is not safe, so the DBI detects this and croaks on any method call using handles that don't belong to the current thread (except for DESTROY).

    Because of this (possibly temporary) restriction, newly created threads must make their own connections to the database. Handles can't be shared across threads.

    But BEWARE, some underlying database APIs (the code the DBD driver uses to talk to the database, often supplied by the database vendor) are not thread safe. If it's not thread safe, then allowing more than one thread to enter the code at the same time may cause subtle/serious problems. In some cases allowing more than one thread to enter the code, even if not at the same time, can cause problems. You have been warned.

    Using DBI with perl threads is not yet recommended for production environments.
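    What the quoted restriction means in practice is that each thread must connect after it starts, so no handle ever crosses a thread boundary. A sketch only, with placeholder DSN/credentials, assuming DBI and a threads-enabled perl; note the caveat elsewhere in this thread that some driver libraries may still not be thread-safe at all.

    ```perl
    # Sketch: every worker thread creates (and destroys) its own private
    # DBI connection inside the thread body; nothing is shared or cloned.
    use strict;
    use warnings;
    use threads;
    use DBI;

    my @workers = map {
        threads->create(sub {
            my $dbh = DBI->connect('dbi:mysql:study', 'user', 'pass',  # placeholders
                                   { RaiseError => 1 });
            $dbh->do('SELECT 1');    # stand-in for real per-thread work
            $dbh->disconnect;
        });
    } 1 .. 4;
    $_->join for @workers;
    ```
    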

Re: Threading - getting better use of my MP box
by wind (Priest) on Jul 02, 2007 at 18:50 UTC
    Where exactly is your bottleneck? Add debugging information that keeps track of the time used in each of the three primary functions of this program: downloading the data, parsing it, and inserting it into the database. I presume the place where your program is most likely spending its time is the downloading, so look into creating a separate thread for downloading from each data source.
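    The suggested instrumentation is a few lines with core Time::HiRes. The three phase subs here are stand-ins for the real work; `timed()` is a hypothetical helper name.

    ```perl
    #!/usr/bin/perl
    # Sketch: accumulate wall-clock seconds per phase so the real
    # bottleneck (download vs parse vs insert) shows up in the numbers.
    use strict;
    use warnings;
    use Time::HiRes qw(gettimeofday tv_interval);

    my %elapsed;   # phase name -> seconds

    sub timed {
        my ($phase, $code) = @_;
        my $t0 = [gettimeofday];
        my @result = $code->();
        $elapsed{$phase} += tv_interval($t0);
        return @result;
    }

    sub download_data { return '<xml/>' }                      # stand-in
    sub parse_data    { my ($xml) = @_; return { row => 1 } }  # stand-in
    sub insert_data   { my ($row) = @_; 1 }                    # stand-in

    my ($xml)  = timed(download => \&download_data);
    my ($rows) = timed(parse    => sub { parse_data($xml) });
    timed(insert => sub { insert_data($rows) });

    printf "%-8s %.3fs\n", $_, $elapsed{$_} for sort keys %elapsed;
    ```

    Summing across a night's run gives the per-phase totals needed to decide what to parallelise first.
    
    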

    Most likely the parsing and inputting of data into the database are the quickest operations by far, so you can leave these single-threaded for simplicity's sake.

    - Miller
Re: Threading - getting better use of my MP box
by aufflick (Deacon) on Jul 04, 2007 at 05:50 UTC
    As wind says, you want to know where your bottleneck is, and it also depends on what you're doing between parsing and insert.

    If you're simply reading records and then inserting, you can simply split the file into chunks and process each in a separate forked process (which, on Unix, is probably faster than perl ithreads), each with its own DB connection.

    If you're doing intensive processing, perhaps you can split that into two (or more) phases: implement each phase as a separate Perl program and connect them with pipes, with the last phase holding the DB connection. This still uses only one DB connection. I'm not sure whether you need multiple connections for MySQL to use multiple backend processes/threads, but you could possibly run this 'pipeline' multiple times as well.

    Another completely unrelated option you can consider is to do all the processing into a flat file, slurp that over to the db machine, and use LOAD DATA LOCAL INFILE to bulk load the data. That will save you a huge amount of database overhead on that sort of volume of data.
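    The bulk-load idea in sketch form: write parsed rows to a tab-delimited file, then hand it to MySQL. The column layout, filename, and table name are hypothetical; the SQL is shown as a comment since it runs on the DB side.

    ```perl
    #!/usr/bin/perl
    # Sketch: emit parsed rows as a TSV flat file suitable for
    # MySQL's LOAD DATA LOCAL INFILE bulk loader.
    use strict;
    use warnings;

    my @rows = ( [ 1, 'alpha' ], [ 2, 'beta' ] );   # stand-in for parsed data

    open my $out, '>', 'bulk.tsv' or die "bulk.tsv: $!";
    print {$out} join("\t", @$_), "\n" for @rows;
    close $out or die "close: $!";

    # Then, on the DB side (via the mysql client or a $dbh->do):
    #   LOAD DATA LOCAL INFILE 'bulk.tsv'
    #   INTO TABLE parsed_data
    #   FIELDS TERMINATED BY '\t'
    #   LINES TERMINATED BY '\n';
    ```

    Fields containing tabs or newlines would need escaping before this is safe for arbitrary XML payloads.
    
    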

Re: Threading - getting better use of my MP box
by weismat (Friar) on Jul 06, 2007 at 04:21 UTC
    Please check whether DBI is really the bottleneck; if so, I would look more into the MySQL configuration. You may use DBI::Profile to profile the DBI, or insert test data to check the time spent on the data inserts. Otherwise I would suggest using Devel::Profile. My guess is that there are a lot more performance problems related to the parsing. Something that might have a big performance impact is if you have lots of functions that return lists/hashes; it is better to return references instead. From my experience the Perl threading implementation is better than most people think, but it is cumbersome because the debugger cannot be used well.