in reply to Re^5: Multithreaded xml writing using XML::Writer
in thread Multithreaded xml writing using XML::Writer

There are 3 large chunks of data.

Re^7: Multithreaded xml writing using XML::Writer
by BrowserUk (Patriarch) on May 03, 2010 at 15:38 UTC

    Then the first thing you need to determine is whether performing your 3 queries in parallel actually improves the time taken to complete the overall task. And that is something that can only be determined empirically. I.e., test it.
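
    By way of illustration, a minimal timing sketch of that test--run_query() here is just a placeholder (a sleep) standing in for your real DBI connect-and-fetch, and the DSNs are made up:

        use strict; use warnings;
        use threads;
        use Time::HiRes qw( time );

        sub run_query {
            my( $dsn ) = @_;
            ## Placeholder: connect and fetch here with DBI, e.g.
            ##   my $dbh  = DBI->connect( $dsn, $user, $pass );
            ##   my $rows = $dbh->selectall_arrayref( $sql );
            sleep 1;    ## simulate a slow query
        }

        my @dsns = map "dbi:mysql:db$_", 1 .. 3;

        my $start = time;
        run_query( $_ ) for @dsns;    ## serial
        printf "serial:   %.2f secs\n", time() - $start;

        $start = time;
        my @thr = map threads->create( \&run_query, $_ ), @dsns;
        $_->join for @thr;            ## overlapped
        printf "parallel: %.2f secs\n", time() - $start;

    If the parallel figure isn't usefully smaller with your real queries in place, you can stop right there.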

    In theory, with the queries operating upon different databases, there should be no need for any internal locking or synchronisation. (Between these three queries--other clients' queries are a different matter.) So, if the server is multi-cored, then it is possible that you might save some time by overlapping them.

    There is also the question of whether all the elements of the chain of software between you and the DB server are thread-safe--DBI, DBD::mysql, the C libraries used by the DBD, etc. At one time the answer was definitively "no". Things have moved on--certainly there have been some moves to make the Perl parts thread-safe--but I don't know what the current state of play is for MySQL.

    But, assuming that you can demonstrate (to yourself), that there is something to be gained from running the queries in parallel, then comes the question of how to safely and efficiently combine those results into a single file using XML::Writer.

    The problems here are:

    1. Sharing objects between threads--in any language--does not work well. And doing it via closure (which, with Perl's ithreads, essentially means copying) does not work at all.

      Even if you used the threads::shared version of bless to create a shared object, the class has to be written specifically to anticipate and cater for sharing, and XML::Writer, in common with most CPAN modules, is not.

      There is also a persuasive argument that says that shared objects do not work well in any language, even when specially constructed for the purpose. And that goes doubly so for objects that need to serialise access to a common resource--like a file.

    2. Wholesale copying of large data objects between threads (via queues, or directly through join) is expensive.

      So, assuming that you can successfully achieve gains by threading your queries, the question becomes how to serialise the processing of the returned data through a single XML::Writer object efficiently. And the answer to that will depend upon the nature of the data in the result sets.

      By which I mean that bulk data queries via DBI are usually returned as arrays of arrays, or arrays of hashes. And sharing nested data structures is non-trivial and involves a lot of copying. Inevitably, the bigger the data structures, the more costly that becomes; and as you're considering threading, one assumes yours are pretty big.

      Two methods for dealing with this present themselves:

      1. Pass the data from the threads, via shared data structures or queues, back to the main thread for XMLising by a single XML::Writer object (see the first sketch after this list).

        Sharing structured data involves copying and is therefore costly, thereby potentially negating any gains from parallelising your queries--assuming there are any.

      2. Pass a cloned or (externally) shared XML::Writer object to your threads and (externally) serialise use of it through locking (see the second sketch after this list).

        Is external locking of the cloned object sufficient to ensure safety? Your original problem perhaps suggests not.
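
      First, a minimal sketch of option 1 using Thread::Queue. The workers here enqueue flat "source|value" strings (cheap to copy) in place of real DBI fetch loops, and only the main thread ever touches the XML::Writer instance, so no object sharing or locking is required:

        use strict; use warnings;
        use threads;
        use Thread::Queue;
        use XML::Writer;

        my $Q = Thread::Queue->new;

        ## Each worker stands in for one of the three queries.
        my @thr = map {
            threads->create( sub {
                my( $src ) = @_;
                $Q->enqueue( "$src|row $_" ) for 1 .. 5;  ## placeholder fetch loop
                $Q->enqueue( undef );                     ## per-worker end-of-data marker
            }, "source$_" );
        } 1 .. 3;

        my $writer = XML::Writer->new(
            OUTPUT => \*STDOUT, DATA_MODE => 1, DATA_INDENT => 2
        );
        $writer->xmlDecl( 'UTF-8' );
        $writer->startTag( 'operations' );

        my $done = 0;
        while( $done < @thr ) {
            my $item = $Q->dequeue;
            if( !defined $item ) { ++$done; next }
            my( $src, $val ) = split /\|/, $item, 2;
            $writer->dataElement( row => $val, source => $src );
        }

        $writer->endTag( 'operations' );
        $writer->end;
        $_->join for @thr;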
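
      And a sketch of option 2. Each thread receives its own clone of the writer at spawn time, and a shared token serialises the writes; but note that the clones' internal state is not shared--which is exactly why the safety question above matters:

        use strict; use warnings;
        use threads;
        use threads::shared;
        use IO::Handle;
        use XML::Writer;

        my $lock :shared;

        open my $fh, '>', 'out.xml' or die $!;
        my $writer = XML::Writer->new(
            OUTPUT => $fh, DATA_MODE => 1, DATA_INDENT => 2
        );
        $writer->xmlDecl( 'UTF-8' );
        $writer->startTag( 'operations' );
        $fh->flush;    ## flush BEFORE spawning, or each clone re-flushes the header

        my @thr = map {
            threads->create( sub {
                my( $src ) = @_;
                for my $n ( 1 .. 5 ) {    ## placeholder fetch loop
                    lock $lock;           ## one writer clone at a time
                    $writer->dataElement( row => "row $n", source => $src );
                    $fh->flush;           ## write out while still holding the lock
                }
            }, "source$_" );
        } 1 .. 3;
        $_->join for @thr;

        $writer->endTag( 'operations' );
        $writer->end;
        close $fh;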

    The upshot of all the above 'thought experiments' is that you first need to test whether parallelising the queries buys you time.

    And if it buys you enough to consider the additional complexity of threads, then you need to answer my earlier question about the ordering of data.

    And then, assuming you're still considering this, explain the nature of the data returned and how it will be XMLised.

    One final thought is that both the mysql & mysqldump command-line tools have --xml options, and they usually work much more quickly than Perl scripts. It might be both simpler and quicker to use them to produce separate XML files (in parallel), and then combine the files by stripping the redundant duplicate headers and top-level tags.
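
    Something along these lines--the database names and the <operations> wrapper are invented, and real mysqldump --xml output would need its <database> wrappers handled more carefully than this simple strip-and-concatenate:

        use strict; use warnings;

        my @dbs = qw( products sites1 sites2 );    ## made-up database names

        ## One child process per dump, run in parallel.
        my @pids;
        for my $db ( @dbs ) {
            defined( my $pid = fork ) or die "fork: $!";
            if( !$pid ) {
                exec 'mysqldump', '--xml', "--result-file=$db.xml", $db
                    or die "exec: $!";
            }
            push @pids, $pid;
        }
        waitpid $_, 0 for @pids;

        ## Combine, dropping the per-file declarations and root tags.
        open my $out, '>', 'combined.xml' or die $!;
        print {$out} qq{<?xml version="1.0"?>\n<operations>\n};
        for my $db ( @dbs ) {
            open my $in, '<', "$db.xml" or die $!;
            while( <$in> ) {
                next if m{^\s*<\?xml};          ## duplicate declaration
                next if m{^\s*</?mysqldump};    ## per-file top-level tag
                print {$out} $_;
            }
        }
        print {$out} "</operations>\n";
        close $out;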


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Ok, thank you for a very detailed and good answer!
      To answer your unanswered question first: No, the ordering of data is irrelevant; my thought is that whichever of the subroutines has data ready should write it immediately, unless another of them is currently writing data. I.e., a queue.

      I will benchmark the time it takes to execute the queries in parallel and in serial fashion, to determine if it buys me time.

      But for now, I'd be glad to learn how to do this, in case it actually buys me time :-)

      I can't go into too much detail about how the fetched data looks. What I can say is that the data contains:
      1. Product information, such as article number, article description, etc.: a mix of IDs and text fields, around 10 of each. The number of rows can vary from 10 to 15,000, where the latter is more common.
      2. A website database. This one holds websites, and can generate circa 1,000 rows.
      3. Another website database, with approximately the same specs as number 2.

      It feels like "Pass a cloned or (externally) shared XML::Writer object to your threads and (externally) serialise use of it through locking." would be an option. Can you describe how you would do it?
        It feels like "Pass a cloned or (externally) shared XML::Writer object to your threads and (externally) serialise use of it through locking." would be an option. Can you describe how you would do it?

        Still not enough info. I don't need to see the details of your content, nor even the real field names. But I do need some idea of the structure. For example, would something like this be acceptable as output:

        <?xml version="1.0" encoding="UTF-8"?>
        <operations>
          <sites-list2>website 1</sites-list2>
          <product>product 1</product>
          <sites-list1>website 1</sites-list1>
          <product>product 2</product>
          <sites-list1>website 2</sites-list1>
        </operations>

        Or this (or some other essentially random ordering of it?):

        <?xml version="1.0" encoding="UTF-8"?>
        <operations>
          <website-1-1>url</website-1-1>
          <product-2>
          <sites-list1>
          </product-1>
          <detail-2>stuff</detail-2>
          <website-2-1>url</website-2-1>
          <detail-1>stuff</detail-1>
          <detail-1>stuff</detail-1>
          <sites-list2>
          <website-2-2>url</website-2-2>
          </sites-list1>
          <products>
          <product-1>
          </product-2>
          </product>
          <website-1-2>url</website-1-2>
          </sites-list1>
          <detail-2>stuff</detail-2>
        </operations>

        That's shuffled rather than interleaved, but it makes my point. Depending upon the structure of the data from the different sources, the locking requirements vary. If each of the data sources only produced a single line--one tag (pair) and lots of attributes--and it didn't matter how the lines from the 3 sources were interleaved, then you would use a different solution than if the data sources produce deeply nested structures of tags for each row.

        If, as I suspect, you want all the rows from each data source grouped together (possibly under a top-level tag for that data source), and your "No, the ordering of data is irrelevant" means that you don't care which order the 3 blocks are in the file, you'd be better off creating the output from each data source separately (perhaps in RAM files, using three separate instances of XML::Writer), and then combining them into an output file as 3 large chunks.
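
        For illustration, a minimal sketch of that approach--the grouping tags and rows are invented placeholders for your real data, and the fetch loop would be your DBI code:

          use strict; use warnings;
          use threads;
          use XML::Writer;

          sub xml_chunk {
              my( $src ) = @_;
              open my $ram, '>', \my $buf or die $!;    ## in-memory file
              my $w = XML::Writer->new(
                  OUTPUT => $ram, DATA_MODE => 1, DATA_INDENT => 2
              );
              $w->startTag( $src );                     ## per-source grouping tag
              $w->dataElement( row => "row $_" ) for 1 .. 5;   ## placeholder fetch loop
              $w->endTag( $src );
              $w->end;
              close $ram;
              return $buf;    ## copied back to the main thread just once, via join
          }

          my @thr = map threads->create( \&xml_chunk, $_ ),
              qw( products sites-list1 sites-list2 );

          print qq{<?xml version="1.0" encoding="UTF-8"?>\n<operations>\n};
          print $_->join for @thr;    ## the 3 large chunks
          print "</operations>\n";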

        But if that is the case, I would seriously consider using the --xml flag on the mysql or mysqldump commands.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.