Hi all. I'm mangling the current Unihan because the one on CPAN is out of date. You'll find a link there to the database text file, which is 6 MB zipped and 28 MB unzipped.

This is what I've got.

# parse-unihan.pl

# mangle Unihan.txt on STDIN into tab-separated columns
# for feeding into an RDBMS
# most populated columns come first

use strict;
use diagnostics;
use Data::Dumper;

my %tags;
{
    my $i    = 0;
    for (qw(
        kRSUnicode
        kIRGKangXi
        kRSKangXi
        kIRG_GSource
        kHanYu
        kIRGHanyuDaZidian
        kIRG_TSource
        kTotalStrokes
        kMandarin
        kIRG_KPSource
        kMorohashi
        kKangXi
        kDefinition
        kCantonese
        kCCCII
        kSBGY
        kKPS1
        kIRGDaiKanwaZiten
        kIRG_KSource
        kCangjie
        kCNS1992
        kCNS1986
        kDaeJaweon
        kIRGDaeJaweon
        kCihaiT
        kIRG_JSource
        kRSAdobe_Japan1_6
        kEACC
        kJapaneseOn
        kBigFive
        kPhonetic
        kJapaneseKun
        kIICore
        kXerox
        kIRG_VSource
        kKorean
        kTaiwanTelegraph
        kMatthews
        kVietnamese
        kGSR
        kMeyerWempe
        kMainlandTelegraph
        kGB1
        kGB0
        kJis0
        kFennIndex
        kJis1
        kNelson
        kFrequency
        kFenn
        kKSC0
        kGB3
        kHKGlyph
        kCowles
        kKPS0
        kIRG_HSource
        kHKSCS
        kTang
        kHanyuPinlu
        kJIS0213
        kLau
        kSemanticVariant
        kKSC1
        kGB5
        kSimplifiedVariant
        kTraditionalVariant
        kGradeLevel
        kZVariant
        kKarlgren
        kCompatibilityVariant
        kGB8
        kSpecializedSemanticVariant
        kIBMJapan
        kHDZRadBreak
        kRSJapanese
        kRSKanWa
        kPseudoGB1
        kGB7
        kIRG_USource
        kOtherNumeric
        kAccountingNumeric
        kRSKorean
        kPrimaryNumeric
    )) {
        $tags{$_}    = $i;
        $i++;
    };
    # $tags{kRSUnicode}    = 0; $tags{kIRGKangXi}    = 1; and so on
};

my %unihan;
while (<>) {
    next if /^#/;    # skip comments
    chomp;
    my ($codepoint, $tag, $content)    = split /\t/;
    $codepoint    =~ s/^U\+/0x/;        # replace U+ with 0x
    $codepoint    = hex $codepoint;    # treat 0x number as hex, convert to dec
    $unihan{$codepoint}[$tags{$tag}]    = $content;
};

foreach (keys %unihan) {
    $unihan{$_}[82]    = $unihan{$_}[82];
    # autovivify the last field for correct number of columns
    # else SQL COPY command throws a hissy fit

    my $s    = "$_\t";            # codepoint in dec + tab
    $s    .= join "\t", @{$unihan{$_}};    # append all content, tab separated
    $s    .= "\n";            # append final newline

    $s    =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
    # work around: http://google.com/search?q=0x10000+site:postgresql.org+inurl:docs
    # replace U+10000 upwards with its perl escaped \x{HEXNUM} form

    print $s;
};

It's too damn slow. I haven't benchmarked or profiled, but in the course of developing I noticed that the join seems to be the big time killer.
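To test that suspicion rather than guess, here is a minimal sketch that times the join on a single made-up row. The row contents and the sub names `join_raw` and `join_defined` are hypothetical, not taken from the real data. It rests on one assumption worth checking: since `use diagnostics` also switches warnings on, every undef column that gets joined may raise an "uninitialized value" warning, so the sketch counts those warnings and compares against a join where undef is mapped to the empty string first.

```perl
use strict;
use warnings;
use Benchmark qw(timeit timestr);

# Hypothetical stand-in for one sparse row: 83 columns, only three filled.
my @row;
$row[82] = undef;                                   # extend to 83 columns
@row[0, 7, 12] = ('9.3', '12', 'some definition');  # made-up sample values

# Join the row as-is and count the warnings raised by undef fields.
sub join_raw {
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    my $s = join "\t", @row;
    return ($s, $warnings);
}

# Map undef to '' before joining, so no warning should be raised.
sub join_defined {
    my $warnings = 0;
    local $SIG{__WARN__} = sub { $warnings++ };
    my $s = join "\t", map { defined $_ ? $_ : '' } @row;
    return ($s, $warnings);
}

# Time both variants over the same number of iterations.
for my $case ([raw => \&join_raw], [defined => \&join_defined]) {
    my ($name, $code) = @$case;
    my $t = timeit(20_000, $code);
    printf "%-8s %s\n", $name, timestr($t);
}
```

Both subs should produce the same string. If the raw variant reports warnings, then the full script is probably sending one diagnostics message per empty column to STDERR, and that output, not the join itself, may be where the time goes.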

If done right, of course this needs to be run only once, but I have the feeling that I'm going to need this a couple of times again in the future.

Speculate how this program can be improved speedwise.

In reply to parsing textfile is too slow by Anonymous Monk
