jabarin has asked for the wisdom of the Perl Monks concerning the following question:

Sorry, there was a missing tag in the XML file from my previous post.

---------------------


Hello,

I have the following simple script to parse shift-jis XML. However, the script crashes every time I run it with a "Segmentation fault" after the first sleep. That is, it crashes when the second thread starts, on the second iteration of the while loop. Specifically, the crash occurs just before safe_parsefile() is called in the second thread. I tried the script on both perl 5.8.0 and 5.8.5. Could it be that the first thread is not cleaning up some file handle? I have no way of telling. Please help! I don't know what's happening.

    #!/usr/bin/perl
    use XML::Twig;
    use LWP::UserAgent;
    use Thread;

    $userAgent = new LWP::UserAgent;
    $userAgent->timeout($connectTimeout);

    while (1) {
        my $thread = threads->new(\&querySystem);
        $thread->join;
        sleep 5;
    }

    sub querySystem {
        my $statusCallback = queryStatus();
        my $statusTwig = new XML::Twig(twig_roots => {'Result/Status' => $statusCallback },
                                       ProtocolEncoding=>"x-sjis-unicode");
        my $result = $statusTwig->safe_parsefile("xmlfile.xml");
    }

    sub queryStatus {
        my $newfunc = sub {
            my($twig, $element) = @_;
            my $mmsVersion = $element->first_child('MMSVersion')->text;
            my $partVersion = $element->first_child('ParticipantVersion')->text;
            print "mmsVersion: $mmsVersion, partVersion: $partVersion\n";
        };
        return $newfunc;
    }

The XML file looks as follows:
    <?xml version="1.0" encoding="shift-jis"?>
    <!DOCTYPE AdminRequest SYSTEM "AdminAPI.dtd">
    <AdminRequest>
      <Header>
        <ReturnCode>0</ReturnCode>
        <ReturnString></ReturnString>
      </Header>
      <Results>
        <Result Id="123455">
          <Action>query_status</Action>
          <ReturnCode>0</ReturnCode>
          <ReturnString></ReturnString>
          <Status>
            <MMSVersion>2.1.74G</MMSVersion>
            <ParticipantVersion>4.8.1565B</ParticipantVersion>
            <MMSMajorVersion>5.1 jp</MMSMajorVersion>
          </Status>
        </Result>
      </Results>
    </AdminRequest>

Replies are listed 'Best First'.
Re: Script crashes when parsing XML
by jdtoronto (Prior) on Sep 12, 2006 at 15:39 UTC
    XML::Twig is obviously dying, so my next step would be to wrap this:
        my $statusTwig = new XML::Twig(twig_roots => {'Result/Status' => $statusCallback }, ProtocolEncoding=>"x-sjis-unicode");
    in an eval { } and see what error you get in $@.
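
A minimal sketch of the eval/$@ idiom suggested above. The helper name try_call is hypothetical, and the code ref that dies is only a stand-in for the XML::Twig constructor call, so that the error path is visible without the module. Note that eval traps a Perl-level die; if the process is killed by a genuine segmentation fault, nothing will be left in $@ to inspect.

```perl
use strict;
use warnings;

# Run a code ref under eval and return (result, error).
# try_call is a hypothetical helper name, not part of any module.
sub try_call {
    my ($code) = @_;
    my $result = eval { $code->() };
    my $error  = $@;    # copy immediately, before anything can reset it
    return ($result, $error);
}

# Stand-in for the XML::Twig->new(...) call from the question;
# it dies on purpose so the failure branch is exercised.
my ($obj, $err) = try_call(sub { die "constructor failed\n" });

if (defined $obj) {
    print "ok\n";
}
else {
    print "error: $err";
}
```

In the original script the code ref would wrap the `new XML::Twig(...)` call, and the error branch would report whatever message XML::Twig or XML::Parser raised.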

    jdtoronto

Re: Script crashes when parsing XML
by gellyfish (Monsignor) on Sep 12, 2006 at 16:01 UTC

    It appears to work fine without the join(). Running the program with Devel::Trace seems to indicate that it is segfaulting at:

    >> /usr/lib/perl5/5.8.5/i586-linux-thread-multi/IO/Handle.pm:431: read($_[0], $_[1], $_[2], $_[3] || 0);
    although of course that could be messed up by the threading. I'd suggest paring this down to an even smaller example that exhibits the same behaviour (I would imagine that the actual content of the XML and what the callback does have no bearing on it), and submitting a report via perlbug and possibly also to the author of XML::Twig.

    I do notice that you don't get the coredump if you change the safe_parsefile to:

        my $fh;
        open $fh,'xmlfile.xml' or die $!;
        my $result = $statusTwig->safe_parse($fh);
        close $fh;
    But then it blocks which is probably not what is wanted either.

    /J\

Re: Script crashes when parsing XML
by mirod (Canon) on Sep 13, 2006 at 08:32 UTC

    OK, so as a follow-up, here is the simplest test I found that triggers the segfault:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use XML::Twig;

        foreach my $i (1..2) {
            warn "creating thread $i\n";
            my $thread = threads->new(\&create_twig);
            $thread->join;
            sleep 1;
        }

        sub create_twig {
            warn " creating twig\n";
            my $twig = XML::Twig->new( protocol_encoding=>"x-sjis-unicode")
                                ->safe_parse( '<doc/>');
        }

    The problem happens only when I add the protocol_encoding=>"x-sjis-unicode" option.

    At this point it is worth checking whether the problem lies with XML::Twig, or with the underlying module, XML::Parser:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use threads;
        use XML::Parser;

        foreach my $i (1..3) {
            warn "creating thread $i\n";
            my $thread = threads->new(\&create_parser);
            $thread->join;
            sleep 1;
        }

        sub create_parser {
            warn " creating parser\n";
            my $parser = XML::Parser->new( ProtocolEncoding=>"x-sjis-unicode");
            $parser->parse( '<doc/>');
        }

    This code crashes too!

    So on one hand, the problem is not in XML::Twig, so I am sort of off the hook ;--). On the other hand, this doesn't help you much :--(

    Is there any way you could pre-process your data to make it UTF-8? That would simplify processing, and make it cleaner. In any case, as it is, having to use the protocol_encoding option is quite shady. You don't even have to keep the "fixed" data around, you can just open a pipe that would change the encoding (iconv is great for that) and the encoding declaration, process it, then convert the data back (if needed) when you output it. Does it make sense?
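
A rough sketch of the pre-processing idea, under a few assumptions: it uses the core Encode module rather than an iconv pipe (a swapped-in alternative, not what was proposed above), the helper name sjis_xml_to_utf8 is hypothetical, and the sample bytes are made up for illustration. It decodes the Shift-JIS bytes, rewrites the encoding declaration, and re-encodes as UTF-8, so the parser would no longer need the protocol_encoding option.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Convert Shift-JIS XML bytes to UTF-8 bytes and fix the XML declaration.
# sjis_xml_to_utf8 is a hypothetical helper name.
sub sjis_xml_to_utf8 {
    my ($bytes) = @_;
    my $text = decode('shiftjis', $bytes);    # bytes -> Perl characters
    # Rewrite only the encoding pseudo-attribute of the XML declaration.
    $text =~ s/\A(<\?xml[^>]*?encoding=)(["'])[^"']*\2/${1}${2}UTF-8${2}/;
    return encode('UTF-8', $text);            # characters -> UTF-8 bytes
}

# Illustrative input: two Shift-JIS encoded characters inside <doc>.
my $sjis = qq{<?xml version="1.0" encoding="shift-jis"?>\n}
         . qq{<doc>\x93\xFA\x96\x7B</doc>\n};

my $utf8 = sjis_xml_to_utf8($sjis);
print $utf8;
```

The resulting string could then be handed to safe_parse() directly, without any encoding option; whether that sidesteps the crash would still need to be tested.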

Re: Script crashes when parsing XML
by zentara (Cardinal) on Sep 12, 2006 at 16:28 UTC
    I haven't run the code, but it looks to me that your thread code

    my $statusCallback = queryStatus();

    is trying to use a sub located in main. I'm not surprised that it segfaults.

    Is there any way you can write this so the XML code is all contained within the thread's code block?

    UPDATE

    Just for fun, I took out the while(1){} loop, and it does not segfault. I think that is where your problem may be. When that second thread in the while(1) loop gets created, it segfaults. Maybe you need to undefine your objects in the thread, reuse the thread (instead of making a new one), or figure out why XML::Twig isn't thread safe.

    It will segfault on this too, as soon as the second thread is launched:

        #while (1)
        #{
        my $thread = threads->new(\&querySystem);
        $thread->join;
        sleep 5;
        #}
        print "Starting second thread\n";
        $thread = threads->new(\&querySystem);
        $thread->join;
        sleep 5;

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Script crashes when parsing XML
by zentara (Cardinal) on Sep 12, 2006 at 17:10 UTC
    Here is a way to make it work, which only creates a single XML::Twig object in the thread, then reuses it. (Also, it's 'use threads', not 'use Thread'.) I made it as simple as I could for clarity. Threads need to reach the end of their code block in order to be joined, but a return will work as well. Anyway, this creates one thread and one XML::Twig object, then reuses it.
        #!/usr/bin/perl
        use warnings;
        use strict;
        use threads;
        use threads::shared;

        my $tgo;
        my $tdie;
        my $tfile;
        share $tgo;
        share $tdie;
        share $tfile;
        $tgo = 0;
        $tdie = 0;
        $tfile = '';

        my $thread = threads->new(\&querySystem);

        foreach my $file('xmlfile.xml', 'xmlfile.xml' ){
            $tfile = $file;
            $tgo = 1;
            sleep 5;
        }

        $tdie = 1;
        $thread->join;

        print "done press any key to exit\n";
        <>;

        ###################################################
        sub querySystem {
            use XML::Twig;
            $|++;
            my $statusTwig = new XML::Twig(
                twig_roots => {'Result/Status' => sub{
                        my($twig, $element) = @_;
                        my $mmsVersion = $element->first_child('MMSVersion')->text;
                        my $partVersion = $element->first_child('ParticipantVersion')->text;
                        print "mmsVersion: $mmsVersion, partVersion: $partVersion\n";
                    }
                },
                ProtocolEncoding=>"x-sjis-unicode");

            while(1){
                if($tdie == 1){ goto END };
                if ( $tgo == 1 ){
                    my $result = $statusTwig->safe_parsefile($tfile);
                    $tgo = 0; #turn off self before returning
                }else { sleep 1 }
            }
            END:
        }
        __END__
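
A variation on the single-worker idea above, sketched with the core Thread::Queue module instead of polled shared flags. This is only an illustration of the pattern: the parsing step is a placeholder comment rather than a real XML::Twig call, the queue names are made up, and nothing here has been checked against the segfault itself. It assumes a perl built with ithreads.

```perl
use strict;
use warnings;
use threads;
use Thread::Queue;

my $jobs    = Thread::Queue->new;    # file names handed to the worker
my $results = Thread::Queue->new;    # one status line per finished job

# One long-lived worker: set up once, then serve jobs until an undef
# sentinel arrives on the queue.
my $worker = threads->create(sub {
    # Assumption: the XML::Twig object would be constructed here, once.
    while (defined(my $file = $jobs->dequeue)) {
        # Placeholder for $statusTwig->safe_parsefile($file).
        $results->enqueue("processed $file");
    }
});

$jobs->enqueue($_) for 'xmlfile.xml', 'xmlfile.xml';
$jobs->enqueue(undef);               # tell the worker to finish
$worker->join;

# Drain the results without blocking now that the worker is done.
my @done;
while (defined(my $line = $results->dequeue_nb)) {
    push @done, $line;
}
print "$_\n" for @done;
```

Compared with the shared-flag version, dequeue blocks until work arrives, so the worker needs no sleep loop and the main thread needs no fixed five-second wait per file.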

    I'm not really a human, but I play one on earth. Cogito ergo sum a bum
Re: Script crashes when parsing XML
by mirod (Canon) on Sep 13, 2006 at 08:00 UTC

    From the Thread docs:

    For new code the use of the Thread module is discouraged and the direct use of the threads and threads::shared modules is encouraged instead. Finally, note that there are many known serious problems with the 5005threads, one of the least of which is that regular expression match variables like $1 are not threadsafe, that is, they easily get corrupted by competing threads. Other problems include more insidious data corruption and mysterious crashes. You are seriously urged to use ithreads instead.

    So I'd suggest that for a start you switch to threads and see what happens. Then, in order to eliminate some possible exotic reasons for the bug, try with pure ASCII XML, then without handlers... in short, try simplifying your code until you find the simplest version that triggers the problem. Then you can send me the code (if it comes as a self-contained test case, that's even better). Note that I have never worked with threads, so don't expect too much, but at least I'll look into it.

    Thanks