in reply to Re^3: Multithreaded xml writing using XML::Writer
in thread Multithreaded xml writing using XML::Writer

The source data comes from 3 MySQL databases residing on the same server. There is already quite a heavy load on them, which means access to them might be quite slow. And there's quite a lot of data to be exported. So, to gain some time, I want to fetch data from them in parallel, using multiple database handles. (Possibly I will end up with the same end time since the data resides on the same server; what do you think?)

My thought here was to dump the data to file as quickly as possible so that the server keeps as little data in memory as possible (of course, there is a possibility that my current "way" of using XML::Writer isn't memory-efficient, who knows). So yes, I want to fetch the data in parallel and write it to the output file.

Re^5: Multithreaded xml writing using XML::Writer
by BrowserUk (Patriarch) on May 03, 2010 at 12:22 UTC

    Are you fetching 3 large chunks of data, or lots of small chunks from each of the servers?

    And does the ordering of the data in the xml file matter?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      There are 3 large chunks of data.

        Then the first thing you need to determine is whether performing your 3 queries in parallel actually improves the time taken to complete the overall task. And that is something that can only be determined empirically. I.e., test it.

        In theory, with the queries operating upon different databases, there should be no need for any internal locking or synchronisation. (Between these three queries--other client queries are a different matter.) So, if the server is multi-cored, then it is possible that you might save some time by overlapping them.

        There is also the question of whether all the elements of the chain of software between you and the DB server are thread-safe--DBI, DBD::mysql, the C libraries used by the DBD, etc. At one time the answer was definitively "no". Things have moved on--certainly there have been some moves to make the Perl parts thread-safe--but I don't know what the current state of play is for MySQL.
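
        To illustrate the kind of test I mean, here is a minimal, untested sketch: it runs the same three queries serially and then one-per-thread, with each thread opening its own connection (DBI handles must never be shared across threads). The DSNs, credentials and SELECT are placeholders for your own.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use DBI;
        use Time::HiRes qw( time );

        # Hypothetical DSNs, credentials and query -- substitute your own.
        my @dsns = map { "DBI:mysql:database=db$_;host=dbserver" } 1 .. 3;

        sub fetch_all {
            my( $dsn ) = @_;
            # Each thread opens (and closes) its own connection; DBI handles
            # must not be shared across threads.
            my $dbh  = DBI->connect( $dsn, 'user', 'pass', { RaiseError => 1 } );
            my $rows = $dbh->selectall_arrayref( 'SELECT * FROM some_table' );
            $dbh->disconnect;
            return scalar @$rows;    # return a count only, to avoid copying rows back
        }

        # Serial baseline.
        my $start = time;
        fetch_all( $_ ) for @dsns;
        printf "serial:   %.2fs\n", time - $start;

        # The same three queries, one thread per database.
        $start = time;
        my @threads = map { threads->create( \&fetch_all, $_ ) } @dsns;
        $_->join for @threads;
        printf "parallel: %.2fs\n", time - $start;

        If the parallel figure isn't convincingly smaller on your real queries and real load, stop there.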

        But, assuming that you can demonstrate (to yourself), that there is something to be gained from running the queries in parallel, then comes the question of how to safely and efficiently combine those results into a single file using XML::Writer.

        The problems here are:

        1. Sharing objects between threads--in any language--does not work well. And doing it via closure (which is essentially copying) does not work at all.

          Even if you used the threads::shared version of bless to create a shared object, the class has to be written specifically to anticipate and cater for sharing, and XML::Writer, in common with most CPAN modules, is not.

          There is also a persuasive argument that says that shared objects do not work well in any language, even when specially constructed for the purpose. And that goes doubly so for objects that need to serialise access to a common resource--like a file.

        2. Wholesale copying-transfers of large data objects between threads (via queues, or directly through join), is expensive.

          So, assuming that you can successfully achieve gains by threading your queries, the question becomes how can you serialise the processing of the returns by a single XML::Writer object efficiently. And the answer to that will depend upon the nature of the data in the results set.

          By which I mean that bulk data queries to DBI are usually returned as arrays of arrays, or arrays of hashes. And sharing nested data structures is non-trivial and involves a lot of copying. Inevitably, the bigger the data structures, the more costly that becomes; and as you're considering threading, one assumes yours are pretty big.

          Two methods for dealing with this present themselves:

          1. Pass the data from the threads, via shared data structures or queues, back to the main thread for XMLising by a single XML::Writer object (a minimal sketch follows this list).

            Sharing structured data involves copying and is therefore costly, thereby potentially negating any gains through parallelising your queries--assuming there are any.

          2. Pass a cloned or (externally) shared XML::Writer object to your threads and (externally) serialise use of it through locking.

            Is external locking of the cloned object sufficient to ensure safety? Your original problem perhaps suggests not.
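
        Here is a rough, untested sketch of the first method, assuming a recent Thread::Queue (which clones references onto the shared queue for you) and hypothetical DSNs, credentials and a two-column result set. The workers only fetch and enqueue; the main thread owns the single XML::Writer, so the object is never shared and needs no locking.

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use DBI;
        use XML::Writer;

        # Hypothetical DSNs, credentials and query -- substitute your own.
        my @dsns = map { "DBI:mysql:database=db$_;host=dbserver" } 1 .. 3;

        my $queue = Thread::Queue->new;

        sub worker {
            my( $dsn ) = @_;
            # Each worker owns its own connection and statement handle.
            my $dbh = DBI->connect( $dsn, 'user', 'pass', { RaiseError => 1 } );
            my $sth = $dbh->prepare( 'SELECT id, name FROM some_table' );
            $sth->execute;
            while( my $row = $sth->fetchrow_arrayref ) {
                # Enqueueing copies the row onto the shared queue; this is the
                # copying cost referred to above, but a row at a time is cheap.
                $queue->enqueue( [ @$row ] );
            }
            $dbh->disconnect;
            $queue->enqueue( undef );    # sentinel: this worker is finished
            return;
        }

        my @workers = map { threads->create( \&worker, $_ ) } @dsns;

        # The main thread holds the only XML::Writer instance.
        open my $fh, '>', 'export.xml' or die "export.xml: $!";
        my $xml = XML::Writer->new( OUTPUT => $fh, DATA_MODE => 1, DATA_INDENT => 2 );
        $xml->xmlDecl( 'UTF-8' );
        $xml->startTag( 'rows' );

        my $running = @workers;
        while( $running ) {
            my $row = $queue->dequeue;           # blocks until data or a sentinel
            if( ! defined $row ) { --$running; next; }
            $xml->startTag( 'row' );
            $xml->dataElement( id   => $row->[0] );
            $xml->dataElement( name => $row->[1] );
            $xml->endTag( 'row' );
        }

        $_->join for @workers;
        $xml->endTag( 'rows' );
        $xml->end;
        close $fh;

        Note that with this pattern the rows from the three databases will be interleaved in whatever order they arrive, which brings us back to the ordering question below.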

        The upshot of all the above 'thought experiments' is that you first need to test whether parallelising the queries buys you time.

        And if it buys you enough to consider the additional complexity of threads, then you need to answer my earlier question about the ordering of data.

        And then, assuming you're still considering this, explain the nature of the data returned and how it will be XMLised.

        One final thought is that both the mysql and mysqldump command-line tools have --xml options, and they usually work much more quickly than Perl scripts. It might be both simpler and quicker to use them to produce separate XML files (in parallel), and then combine the files by stripping the redundant duplicate headers and top-level tags.
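
        An untested sketch of that approach, for illustration: fork one mysqldump per database, then stitch the three files together. The host, credentials, database and table names are placeholders, and the merge assumes the root element is <mysqldump>, which is what recent versions produce--check your version's actual output before trusting the regexes.

        #!/usr/bin/perl
        use strict;
        use warnings;

        # Hypothetical host, credentials, database and table names.
        my @dbs = qw( db1 db2 db3 );

        # One mysqldump per database, run in parallel, each writing its own file.
        my @pids;
        for my $db ( @dbs ) {
            my $pid = fork;
            die "fork failed: $!" unless defined $pid;
            if( $pid == 0 ) {                                   # child
                open STDOUT, '>', "$db.xml" or die "$db.xml: $!";
                exec 'mysqldump', '--xml', '-h', 'dbserver', '-u', 'user',
                     '--password=secret', $db, 'some_table';
                die "exec failed: $!";
            }
            push @pids, $pid;
        }
        waitpid $_, 0 for @pids;

        # Stitch the files together: keep the first file's XML declaration and
        # opening root tag, skip those lines in the rest, and skip every closing
        # root tag until the very end.
        open my $out, '>', 'export.xml' or die "export.xml: $!";
        my $first = 1;
        for my $db ( @dbs ) {
            open my $in, '<', "$db.xml" or die "$db.xml: $!";
            while( <$in> ) {
                next if !$first && ( /^<\?xml/ || /^<mysqldump\b/ );
                next if m{^</mysqldump>};
                print {$out} $_;
            }
            close $in;
            $first = 0;
        }
        print {$out} "</mysqldump>\n";
        close $out;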


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.