jabarin has asked for the wisdom of the Perl Monks concerning the following question:

Hello. Despite extensive searching online for previous work on this topic, I could not find any useful information. So I hope someone here will be able to help me out.

I am trying to parse an XML file that is shift-jis encoded using XML::Twig. From what I gather, XML::Twig can parse shift-jis, assuming the XML is well formed. However, I haven't been able to get it to do that yet. So far I've done the following: in my XML::Parser directory under Parser/Encoding I set my shift-jis encoding file to be x-sjis-unicode.enc (renamed the file to shift-jis.enc).

When I try to parse my XML file I get the following error:

unknown encoding at line 1, column 30, byte 30 at /usr/lib/perl5/vendor_perl/5.8.0/i386-linux-thread-multi/XML/Parser.pm line 185

The first few lines of my XML file look as follows (note that the <ReturnString> tag should contain shift-jis characters):

<?xml version="1.0" encoding="shift-jis" ?>
<!DOCTYPE AdminRequest (View Source for full doctype...)>
<AdminRequest>
  <Header>
    <ReturnCode>0</ReturnCode>
    <ReturnString></ReturnString>
  </Header>


Can you please let me know what I am doing wrong or how I can go about parsing shift-jis? If you can point me to examples that'd be appreciated. Thanks.

Replies are listed 'Best First'.
Re: Parsing shift-jis XML with XML::TWIG
by shmem (Chancellor) on Aug 18, 2006 at 22:19 UTC
    So far I've done the following: in my XML::Parser directory under Parser/Encoding I set my shift-jis encoding file to be x-sjis-unicode.enc (renamed the file to shift-jis.enc)
    The encoding files contain the encoding name in the file header. If you just rename x-sjis-unicode.enc, its header still reads x-sjis-unicode. Change the encoding attribute value in your XML file to x-sjis-unicode and try again.

    You could also change the header of the copied file. I don't have advice for that, since I don't know the format of the header; a simple s/$old/$new/ on the header section (padding with null bytes) doesn't do the job. Doing just that, I get:

    syntax error at line 2, column 23, byte 68 at /path/to/XML/Parser.pm
    But the line
    <!DOCTYPE AdminRequest (View Source for full doctype...)> <AdminRequest>
    looks dubious to me anyways...

    --shmem

    _($_=" "x(1<<5)."?\n".q/)Oo.  G\        /
                                  /\_/(q    /
    ----------------------------  \__(m.====.(_("always off the crowd"))."
    ");sub _{s./.($e="'Itrs `mnsgdq Gdbj O`qkdq")=~y/"-y/#-z/;$e.e && print}
Re: Parsing shift-jis XML with XML::TWIG
by graff (Chancellor) on Aug 19, 2006 at 01:39 UTC
    In the man page for XML::Parser (which is the module generating the error), there is a section titled ENCODINGS, which explains how to work with non-unicode data.

    Part of the explanation mentions @XML::Parser::Expat::Encoding_Path, which is a list of one or more directories where encoding definitions are kept. You're likely to find a directory called "Encodings" under XML/Parser/ (wherever this module was installed on your system).
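    A minimal sketch of how one might list the shift-jis related encoding maps found on that search path. The helper name find_encoding_maps is made up for illustration, and the sketch assumes the maps are ordinary files ending in ".enc"; the XML::Parser part only runs if the module is installed.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of XML::Parser): return the names, without
# the ".enc" suffix, of every encoding map in @dirs that matches $pattern.
sub find_encoding_maps {
    my ($pattern, @dirs) = @_;
    my @found;
    for my $dir (@dirs) {
        opendir(my $dh, $dir) or next;    # skip directories we cannot read
        for my $file (grep { /\.enc\z/ && /$pattern/i } readdir $dh) {
            (my $name = $file) =~ s/\.enc\z//;
            push @found, $name;
        }
        closedir $dh;
    }
    my @sorted = sort @found;
    return @sorted;
}

# Consult XML::Parser's own search path, if the module is available.
if (eval { require XML::Parser; 1 }) {
    no warnings 'once';    # the variable is only mentioned once in this file
    my @dirs = @XML::Parser::Expat::Encoding_Path;
    print "$_\n" for find_encoding_maps(qr/sjis/, @dirs);
}
```

    Any name it prints is a candidate value for the encoding attribute in the XML declaration.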

    On the system I'm using at the moment (macosx, perl 5.8.6, XML::Parser version 2.34), that Encodings directory contains the following shift-jis related encoding map files:

    x-sjis-cp932.enc x-sjis-jdk117.enc x-sjis-jisx0221.enc x-sjis-unicode.enc
    If I take the xml snippet from the OP, and change the name of the encoding from "shift-jis" to "x-sjis-unicode", it does not generate an error.

    Also, rather than altering the xml data file, if I create my parser object like this:

    my $parser = XML::Parser->new(ProtocolEncoding => "x-sjis-unicode");
    this encoding spec overrides the encoding named in the xml file, and it generates no error. (And of course, if I use the data as posted and do not set ProtocolEncoding for the parser object, I get the same error as the OP.)
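
    A fuller sketch of the ProtocolEncoding approach, using XML::Parser directly. The sample document, the handler logic, and the variable names are made up for illustration, and it assumes x-sjis-unicode.enc is present on the encoding path, as in the listing above.

```perl
use strict;
use warnings;
use XML::Parser;

# Two Shift_JIS characters written as hex escapes, so that this source
# file itself stays plain ASCII. (Illustrative data, not from the OP.)
my $sjis_text = "\x93\xFA\x96\x7B";

# The declaration still says "shift-jis", as in the original file.
my $xml = qq{<?xml version="1.0" encoding="shift-jis" ?>\n}
        . qq{<AdminRequest><Header><ReturnCode>0</ReturnCode>}
        . qq{<ReturnString>$sjis_text</ReturnString></Header></AdminRequest>};

my $return_string = '';
my $inside        = 0;

my $parser = XML::Parser->new(
    # Overrides the encoding named in the XML declaration.
    ProtocolEncoding => 'x-sjis-unicode',
    Handlers         => {
        Start => sub { $inside = 1 if $_[1] eq 'ReturnString' },
        End   => sub { $inside = 0 if $_[1] eq 'ReturnString' },
        # Char may fire more than once per element, so accumulate.
        Char  => sub { $return_string .= $_[1] if $inside },
    },
);
$parser->parse($xml);

# Show the decoded text as Unicode code points.
printf "U+%04X\n", ord($_) for split //, $return_string;
```

    XML::Twig's documentation says its constructor recognizes the same options as XML::Parser, so passing ProtocolEncoding to XML::Twig->new should work the same way, though that is not tested here.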

    I'm just guessing that the one I picked is "the right one". Good luck with that.

Re: Parsing shift-jis XML with XML::TWIG
by kettle (Beadle) on Aug 19, 2006 at 00:45 UTC
    You should check out the documentation for binmode: ( http://www.icewalkers.com/Perl/5.8.0/pod/func/binmode.html )

    binmode(STDIN, ":encoding(shiftjis)");
    binmode(STDOUT, ":encoding(YOUR_DESIRED_OUTPUT_ENCODING)");
    and probably the Encode module: http://www.ayni.com/perldoc/perl5.8.0/lib/Encode.html

    In case you can read Japanese: http://www.fl.reitaku-u.ac.jp/~schiba/perl/perlEncoding.html

    It is possible to specify or guess the encoding of the file that you are working on, and these links should hopefully put you on the right track. I recently had a similar problem and the above listings led to the solution. However, it isn't precisely clear to me what your problem is: just parsing a Japanese-encoded page? Viewing the results? Hopefully this is of some small help.
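
    One way the Encode module could be applied to the XML case is to transcode the document to UTF-8 before parsing. This is only a sketch: the helper name sjis_xml_to_utf8 and the sample data are made up for illustration, and it assumes the whole file fits in memory and starts with an XML declaration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: take raw Shift_JIS bytes and return UTF-8 bytes,
# with the encoding named in the XML declaration rewritten to match.
sub sjis_xml_to_utf8 {
    my ($sjis_bytes) = @_;

    # Die on malformed input instead of silently substituting characters.
    my $text = decode('shiftjis', $sjis_bytes, Encode::FB_CROAK);

    # Replace whatever encoding the declaration names with UTF-8.
    $text =~ s/\A(<\?xml[^>]*?encoding\s*=\s*)(["'])[^"']*\2/${1}${2}UTF-8${2}/;

    return encode('UTF-8', $text);
}

# Sample input: the OP's declaration plus two Shift_JIS characters,
# written as hex escapes to keep this file plain ASCII.
my $sjis = qq{<?xml version="1.0" encoding="shift-jis" ?>\n}
         . "<ReturnString>\x93\xFA\x96\x7B</ReturnString>\n";

my $utf8 = sjis_xml_to_utf8($sjis);
print $utf8;
```

    The returned bytes both declare and use UTF-8, which the parser handles without any encoding map, so they can be handed to XML::Twig's parse method as they are.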