Never having used Unicode in any way, this looked like a good way for me to start teaching myself. In Excel, it’s certainly more complicated than I expected. The first thing I needed was a way of getting the characters into Excel. I was cursing the OP mildly for not doing this, but it turns out to be very difficult. AFAICT, the characters used are UTF16, while Win32:Ole doesn’t export anything beyond UTF8. Recording a VBA macro results in ActiveCell.FormulaR1C1 = "'??" which, needless to say, is utterly unhelpful. The reason I suspect UTF8 to be inadequate is that passing a single code from Perl results in two characters appearing in the cell. Extensive Gargling reveals nothing on getting more advanced Unicode from Perl to Excel.

I should perhaps point out here that my setup may be partly to blame. Never having had the slightest need to use any of these characters – and not even knowing their language – I have not installed lots of strange fonts in the hope that one of them might fit. But I don’t think that’s the case. I can copy & paste the characters without problems and I CAN enter them using the Insert | Symbol dialogue in Excel.

There is a way to get these characters in from VBA, using the ChrW function, but this isn’t available from the face of the spreadsheet. Wanting a SSCCE (http://sscce.org/ https://web.archive.org/web/20160926072757/http://sscce.org), I went down the route shown in http://dailydoseofexcel.com/archives/2012/06/23/unicode-and-diacritic-characters/, getting the VBA in via Perl using the technique shown in RFC Tutorial: Adding and extracting VBA to and from Excel files.

At this stage, I had the data in, but getting it out again was the problem. The OP’s “hex_dump_cell” routine includes an error infelicity, in that “Value” should not be is not usually quoted, but the real issue is the one I mentioned in the first paragraph, namely that Win32:Ole doesn’t export anything beyond UTF8. This is the cause of the “?” characters that appeared in the recorded macro and in the OP’s output. The “?” character is 3F in hex, explaining the result of the unpack. So again, we need some VBA to get the characters out using the AscW function, since this isn’t available from the face of the spreadsheet.

It isn’t possible to combine the call to the extraction routine in the same cell as the characters to be decoded. Something within Excel converts these to “?” characters when I tried. Therefore, I have had to put formulae in the second row. Since I am not saving the spreadsheet, this should be no problem. The only thing needing care is that the formulae don’t overwrite a cell that will be decoded later. If this happens, things can get recursive very quickly.

use strict; use warnings; use Win32::OLE; my $xl = Win32::OLE->new('Excel.Application'); $xl->{Visible} = 1; my $wb = $xl->Workbooks->Add; if ($wb->Sheets->{Count} > 1) { for (2..$wb->Sheets->{Count}) { $wb->Sheets(2)->Delete; } } $xl->VBE->ActiveVBProject->VBComponents->Add(1); my $cm = $xl->VBE->ActiveVBProject->VBComponents(3)->CodeModule; my $line = int($cm->CountOfLines); $cm->InsertLines(++$line, "Function CHARW(code As Variant) As String") +; $cm->InsertLines(++$line, "'Use a Leading \"U\" or \"u\" to indicate U +nicode values"); $cm->InsertLines(++$line, " code = VBA.Replace(code, \"+\", \"\", 1, + 1, vbTextCompare)"); $cm->InsertLines(++$line, " If UCase(Left\$(code, 1)) = \"U\" Then c +ode = VBA.Replace(code, \"U\", \"&H\", 1, 1, vbTextCompare)"); $cm->InsertLines(++$line, " CHARW = ChrW(code)"); $cm->InsertLines(++$line, "End Function"); $cm->InsertLines(++$line, ""); $cm->InsertLines(++$line, "Function AscDec(char As Variant) As String" +); $cm->InsertLines(++$line, " Dim i As Long"); $cm->InsertLines(++$line, " For i = 1 To Len(char)"); $cm->InsertLines(++$line, " AscDec = AscDec & AscW(Mid(char, i, + 1))"); $cm->InsertLines(++$line, " If i < Len(char) Then AscDec = AscD +ec & \":\""); $cm->InsertLines(++$line, " Next i"); $cm->InsertLines(++$line, "End Function"); my $sht = $wb->Sheets(1); $sht->Cells(1, 1)->{Formula} = '=charw(1576)&charw(1744)&charw(1610)&c +harw(1580)&charw(1609)&charw(1709)'; $sht->Cells(1, 2)->{Formula} = '=charw(21271)&charw(20140)'; $sht->Cells(1, 3)->{Value} = 'Beijing'; $sht->Range("A1:C1")->Copy; $sht->Range("A1:C1")->PasteSpecial(-4163); #xlPasteValues $sht->Cells(2, 1)->{Formula} = "=ascdec(A1)"; $sht->Range("A2")->Copy ($sht->Range("B2:C2")); print $sht->Cells(2, 1)->{Value} . "\n"; print $sht->Cells(2, 2)->{Value} . "\n"; print $sht->Cells(2, 3)->{Value} . "\n"; $xl->{DisplayAlerts} = 0; $xl->Quit;

I think this mess is best summarised by Chesterton: http://www.gkc.org.uk/gkc/books/rolling.html.

Regards,

John Davies


In reply to Re: Extracting Chinese characters from Excel by davies
in thread Extracting Chinese characters from Excel by puti

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.