DreamT has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

Mission: Collect data from various sources and write it to an XML file as soon as the data is available, using multithreading.

Code:
use XML::Writer;
use IO::File;
use threads;
use strict;

my $output = new IO::File(">output.xml");
my $writer = new XML::Writer(OUTPUT => $output, DATA_MODE=>1, DATA_INDENT=>1,CHECK_PRINT=>1);
$writer->xmlDecl("UTF-8");
$writer->startTag('operations');

my $thr1 = threads->new(\&writeArticleData1);
my $thr2 = threads->new(\&writeArticleData2);

while (!(($thr1->join) && ($thr2->join))) {
    # Wait for threads to finish
}

$writer->endTag("operations");
$writer->end();
$output->close();

sub writeArticleData1() {
    lock($writer);
    $writer->startTag('product');
    $writer->characters("Hello, world!");
    $writer->endTag("product");
    sleep 2;
    return 1;
}

sub writeArticleData2() {
    lock($writer);
    $writer->startTag('product');
    $writer->characters("Hello, world2!");
    $writer->endTag("product");
    return 1;
}

1;

Output:
<?xml version="1.0" encoding="UTF-8"?>
<operations>
<product>Hello, world2!</product>
<product>Hello, world!</product></operations>

I'm very close to reaching my goal, as you can see, but for some reason the closing tag for operations doesn't get its own row as it should. That makes me question my way of doing it. So, questions:

1. Am I using threads in the right way?
2. Am I thinking correctly regarding the "lock" usage?
3. Do you see any other pitfalls regarding scoping and such?
4. Can I fix the particular problem?

The resulting document _doesn't_ need to be formatted as such, I'm just questioning my own ways of using threading in this case. Maybe XML::Writer doesn't support it? Or maybe a filehandle issue?

Re: Multithreaded xml writing using XML::Writer
by BrowserUk (Patriarch) on May 03, 2010 at 11:25 UTC

    No. No. Yes. Yes, by starting again and doing things differently.

    Sharing an object ($writer), via closure, means that each thread gets its own copy of the object. Now if the object needs to remember any information internally, say about what it has already done, then each copy will not know what the other copy has done, and both will get very confused.
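
The copying described above can be seen with a small sketch. The hash `$state` and its `tags_written` key are hypothetical stand-ins for an object's internal bookkeeping; they are not part of XML::Writer.

```perl
use strict;
use warnings;
use threads;

# A plain (unshared) hash standing in for an object with internal state.
# The name and key are illustrative only.
my $state = { tags_written => 0 };

# The thread body closes over $state, but it works on the thread's own clone.
my $thr = threads->create(sub {
    $state->{tags_written}++;
    return $state->{tags_written};
});

my $seen_in_thread = $thr->join;

print "thread saw: $seen_in_thread\n";          # value in the thread's copy
print "main sees:  $state->{tags_written}\n";   # the main thread's copy
```

The increment made inside the thread never reaches the main thread's copy, which is the same kind of divergence two copies of `$writer` would suffer.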

    Also, lock() can only be used upon shared variables. And since you aren't using threads::shared, you should probably be getting warnings about the way you are using it. If you aren't, then it must mean you aren't using warnings either, and that is a very unwise decision if you intend to use threading.
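
A minimal sketch of lock() on a variable that really is shared, assuming threads::shared is loaded. It shares a plain array of strings rather than the `$writer` object; `@fragments` and `add_fragment` are hypothetical names.

```perl
use strict;
use warnings;
use threads;
use threads::shared;

# Declared shared, so every thread sees the same array and lock() applies to it.
my @fragments :shared;

sub add_fragment {
    my ($text) = @_;
    lock(@fragments);    # held until the enclosing scope (this sub) exits
    push @fragments, "<product>$text</product>";
    return 1;
}

my @workers = map { threads->create(\&add_fragment, $_) }
              'Hello, world!', 'Hello, world2!';
$_->join for @workers;

# The order of arrival depends on scheduling, so sort before printing.
print "$_\n" for sort @fragments;
```

With `use warnings` in place, mistakes in how threads and shared data are used are more likely to be reported than to pass silently.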


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Ok, I will have a look at threads::shared.
      But other than that, is it a good thing to use XML::Writer in a multithread fashion?

        I wonder what you are trying to gain by using multiple threads with what is essentially an IO-bound operation.

        I wouldn't try to do IO on the same channel with more than one thread. If you want to create XML fragments and output them in parallel, I would create the fragments in separate threads and send them to one output thread via Thread::Queue.
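
The Thread::Queue arrangement suggested above might look roughly like this. It is a sketch only: `build_fragment` and `drain_queue` are hypothetical names, the fragments are plain strings, and an `undef` item is used here as an end-of-input marker by convention.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $queue = Thread::Queue->new;

# Hypothetical producer: builds one fragment and hands it to the queue.
sub build_fragment {
    my ($text) = @_;
    $queue->enqueue("<product>$text</product>");
    return;
}

# The only thread that touches the output; stops when it dequeues undef.
sub drain_queue {
    my @written;
    while (defined(my $fragment = $queue->dequeue)) {
        print "$fragment\n";
        push @written, $fragment;
    }
    return scalar @written;
}

my $output_thread = threads->create(\&drain_queue);
my @producers = map { threads->create(\&build_fragment, $_) }
                'Hello, world!', 'Hello, world2!';

$_->join for @producers;
$queue->enqueue(undef);            # tell the output thread there is no more work
my $count = $output_thread->join;  # number of fragments written
```

Because only one thread prints, fragments cannot be interleaved mid-line, whatever order the producers finish in.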

        is it a good thing to use XML::Writer in a multithread fashion?

        The way you are doing it, no.

        Since both threads are writing to the same file, and they cannot safely both be writing at the same time, you would have to serialise them to prevent them intermingling their outputs. And that means there would be nothing gained by threading that part of the process.

        But, reading the subtext of your question, the gain you are hoping for is not in the writing of the XML output, but in the sourcing of the data written. That is to say, you imply that you are fetching data from two (or more) sources. As you do not show where or how you are sourcing that data, it is impossible to say whether there would be any advantage in using threading for that part of the process.

        If, for example, you are fetching the data from two different servers, there may be some gains to be had by overlapping the fetching of the data, and then feeding the data fetched back to a single thread for writing to a file as XML.
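
One way the overlap described above could be sketched: the fetches run in threads and return their data through join, while XML::Writer is used only in the main thread. `fetch_article`, the `sleep`, and the source names `server-a` and `server-b` are hypothetical placeholders, since the real data sources are not shown in this thread.

```perl
use strict;
use warnings;
use threads;
use XML::Writer;

# Hypothetical stand-in for a slow fetch from one server.
sub fetch_article {
    my ($source) = @_;
    sleep 1;                          # pretend network latency
    return "data from $source";
}

# Start both fetches first so that their waiting time overlaps.
my @fetchers = map { threads->create(\&fetch_article, $_) }
               'server-a', 'server-b';

# The writer is created and used in the main thread only.
my $xml = '';
my $writer = XML::Writer->new(OUTPUT => \$xml, DATA_MODE => 1, DATA_INDENT => 1);
$writer->xmlDecl('UTF-8');
$writer->startTag('operations');
for my $thr (@fetchers) {
    my ($data) = $thr->join;          # blocks until that fetch has finished
    $writer->dataElement(product => $data);
}
$writer->endTag('operations');
$writer->end;

print $xml;
```

Joining in creation order writes results in that order rather than in order of completion; if "as soon as available" matters, a queue feeding the writing thread would suit better.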

        But you'll need to describe the whole process from end to end to get good wisdom on that.

