Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I'm mangling the current Unihan database because the one on CPAN is out of date. There you'll find a link to the db textfile, which is 6 MB zipped, 28 MB unzipped.

This is what I've got.

# parse-unihan.pl
# mangle Unihan.txt on STDIN into tab separated columns
# for feeding it into an RDBMS
# most populated columns come first

use strict;
use diagnostics;
use Data::Dumper;

my %tags;
{
    my $i = 0;
    for (qw(
        kRSUnicode kIRGKangXi kRSKangXi kIRG_GSource kHanYu
        kIRGHanyuDaZidian kIRG_TSource kTotalStrokes kMandarin
        kIRG_KPSource kMorohashi kKangXi kDefinition kCantonese kCCCII
        kSBGY kKPS1 kIRGDaiKanwaZiten kIRG_KSource kCangjie kCNS1992
        kCNS1986 kDaeJaweon kIRGDaeJaweon kCihaiT kIRG_JSource
        kRSAdobe_Japan1_6 kEACC kJapaneseOn kBigFive kPhonetic
        kJapaneseKun kIICore kXerox kIRG_VSource kKorean
        kTaiwanTelegraph kMatthews kVietnamese kGSR kMeyerWempe
        kMainlandTelegraph kGB1 kGB0 kJis0 kFennIndex kJis1 kNelson
        kFrequency kFenn kKSC0 kGB3 kHKGlyph kCowles kKPS0 kIRG_HSource
        kHKSCS kTang kHanyuPinlu kJIS0213 kLau kSemanticVariant kKSC1
        kGB5 kSimplifiedVariant kTraditionalVariant kGradeLevel
        kZVariant kKarlgren kCompatibilityVariant kGB8
        kSpecializedSemanticVariant kIBMJapan kHDZRadBreak kRSJapanese
        kRSKanWa kPseudoGB1 kGB7 kIRG_USource kOtherNumeric
        kAccountingNumeric kRSKorean kPrimaryNumeric
    )) {
        $tags{$_} = $i;
        $i++;
    };
    # $tags{kRSUnicode} = 0; $tags{kIRGKangXi} = 1; and so on
};

my %unihan;

while (<>) {
    next if /^#/;                        # skip comments
    chomp;
    my ($codepoint, $tag, $content) = split /\t/;
    $codepoint =~ s/^U\+/0x/;            # replace U+ with 0x
    $codepoint = hex $codepoint;         # treat 0x number as hex, convert to dec
    $unihan{$codepoint}[$tags{$tag}] = $content;
};

foreach (keys %unihan) {
    $unihan{$_}[82] = $unihan{$_}[82];
    # autovivify the last field for correct number of columns
    # else SQL COPY command throws a hissy fit
    my $s = "$_\t";                      # codepoint in dec + tab
    $s .= join "\t", @{$unihan{$_}};     # append all content, tab separated
    $s .= "\n";                          # append final newline
    $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
    # work around: http://google.com/search?q=0x10000+site:postgresql.org+inurl:docs
    # replace U+10000 upwards with its perl escaped \x{HEXNUM} form
    print $s;
};

It's too damn slow. I haven't benchmarked or profiled, but in the course of developing I noticed that the join seems to be the big time killer.
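I suppose I could pin it down with Devel::DProf, or with a quick Benchmark comparison along these lines (untested sketch; the 83-element row is just made up to mimic one %unihan entry):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# made-up 83-element row, roughly the shape of one %unihan entry
my @row = map { "field$_" } 0 .. 82;

cmpthese( -3, {
    join_row   => sub { my $s = join "\t", 12345, @row },
    concat_row => sub { my $s = "12345"; $s .= "\t$_" for @row },
} );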

If done right, this of course needs to be run only once, but I have the feeling that I'm going to need it a couple more times in the future.

Any speculation on how this program can be made faster is welcome.

Replies are listed 'Best First'.
Re: parsing textfile is too slow
by graff (Chancellor) on Aug 17, 2005 at 06:54 UTC
    (apologies for all the messy updates)

    I think the big problem is trying to hold everything in an HoA with about 70K hash keys and 82 array elements in each key (update: there are over a million lines of input data, with multiple lines per codepoint/hash key). Since you are retrieving the elements in roughly random order, you might be doing a fair bit of virtual memory swapping.

    You don't have to store the whole input stream before printing any of it out -- just keep the data for one codepoint in memory at a time, and print it out as you move to the next code point. Like this:

    ## ... %tags as defined in the OP ...

    my $tagcnt = scalar keys %tags;   # let's be sure about how many tags there are

    warn scalar localtime(), $/;      # initial time mark

    my $prev_codepoint = 0;
    my @outrec;
    my $ndone = 0;

    while (<>) {
        next if /^#/;                 # skip comments
        chomp;
        my ($codepoint, $tag, $content) = split /\t/;
        $codepoint =~ s/^U\+//;       # strip the U+ prefix
        $codepoint = hex $codepoint;  # treat number as hex, convert to dec
        if ( $codepoint != $prev_codepoint ) {
            printrec( $prev_codepoint, \@outrec ) if ( $prev_codepoint );
            @outrec = map { '' } (1..$tagcnt);
            $prev_codepoint = $codepoint;
        }
        $outrec[$tags{$tag}] = $content;

        # ongoing time-stamping:
        warn "$ndone done at ".scalar localtime()."\n" if ( ++$ndone % 1000 == 0 );
    }
    printrec( $prev_codepoint, \@outrec );

    warn scalar localtime(), $/;      # final timestamp

    sub printrec {
        my ( $cp, $rec ) = @_;
        my $s = join( "\t", $cp, @$rec ) . "\n";
        $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
        print $s;
    };
    I put similar "localtime" prints to stderr in your original version, and found it taking upwards of 40 sec per 1000 lines printed out. The fixed version shown here clocked over 14,000 input lines per sec. (Both versions running on a PowerBook G4 with 768 MB RAM, using perl 5.8.1.)

    I compared a few of the lines from the two versions, and they matched as intended (though I didn't do a complete run of your version -- I wanted to reply and get some sleep tonight :).

    UPDATE: The reason this version of the code should work as desired is that the Unihan.txt file from www.unicode.org should already be "sorted", in the sense that all the data lines for a given codepoint are contiguous. (I confirmed this was the case by comparing the output line count against  cut -f1 output | sort -u | wc -l , which gave the same line count; sorting the Unihan file yourself is unnecessary.) But if this turns out to be a false conclusion, and some codepoints have their tag/value tuples scattered around at different points in the file, all you have to do is sort the file before feeding it to this version of the script ( sort Unihan.txt | your_script.pl > loader.tab ) -- that will still be tons faster than trying to hold the whole file in a single perl HoA structure. This version finished (71,226 output codepoints) in 1m5s.
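    If the contiguity assumption ever needs re-checking on a future Unihan release, a throwaway one-liner along these lines should do it (untested sketch; it warns for any codepoint whose lines are scattered):

    # prints nothing when every codepoint's data lines are adjacent
    perl -ane 'warn "$F[0] is scattered\n" if $F[0] ne $prev and $seen{$F[0]}++; $prev = $F[0];' Unihan.txt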

Re: parsing textfile is too slow
by cowboy (Friar) on Aug 17, 2005 at 06:31 UTC
    I'm really not that sure about what you're trying to do, but at least one thing you could do to give a slight boost in speed is to avoid the assignment of:
    $unihan{$_}[82] = $unihan{$_}[82];
    That's a totally wasted op, unless I'm missing something.
    Depending on the size of the %unihan hash, you may or may not be able to save some time by changing:
    foreach (keys %unihan) {

    to:

    while ( ($key,$val) = each %unihan ) { }
    keys will create a new list containing all the keys. If your hash is large, this can be an expensive procedure. You might also try something similar to replace the join, depending on the sizes of the things you're dereferencing.
    Anyway, in a nutshell, the more records you have in a hash/arrayref, the more it hurts to copy them (dereference, keys/values/etc.).
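    Putting that together, the output loop might look something like this (an untested sketch; it pads the row to 83 columns instead of the self-assignment above):

    while ( my ($codepoint, $fields) = each %unihan ) {
        $#{$fields} = 82;    # make sure all 83 columns exist
        my $s = join( "\t", $codepoint, map { defined $_ ? $_ : '' } @$fields ) . "\n";
        $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
        print $s;
    }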

    Update: fixed one code example (oops), thanks QM
      Shouldn't
      while ($key,$val) %unihan { }
      be
      while ( ($key,$val)= each %unihan ) { }
      ??

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Oops, yes, corrected it. I knew something didn't look right, but I had just written it off to the fact I don't use each() very often. Thanks.
Re: parsing textfile is too slow
by newroz (Monk) on Aug 17, 2005 at 06:35 UTC
    Hi,
    "use diagnostic;" makes your program slower.
    Instead, use the -w flag; then the script will still produce the warnings.
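    For example (just a sketch; the idea is to keep the warnings without the overhead of diagnostics):

    use strict;
    # use diagnostics;   # pulls in perldiag and expands every warning
    use warnings;        # plain warnings; or run the script with perl -w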
Re: parsing textfile is too slow
by tlm (Prior) on Aug 17, 2005 at 13:11 UTC

    On my laptop, the version below takes 24 seconds to process the data. It assumes that the data is sorted, so there's a one-time additional cost (which may not be necessary) of sorting the data file. Actually, the important thing is not the sorting itself, but that all the lines corresponding to a given codepoint are adjacent to each other in the input stream. The idea is not to build the large %unihan hash, but to output the data for each codepoint as soon as it is available.
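    In outline, the idea looks something like this (only a rough, untested sketch, not the exact code referred to above; it assumes the %tags mapping from the OP is in scope and that all lines for a codepoint are adjacent in the input):

    my $prev = '';
    my %fields;
    while (<>) {
        next if /^#/;
        chomp;
        my ($cp, $tag, $content) = split /\t/;
        if ( $cp ne $prev and $prev ne '' ) {
            flush( $prev, \%fields );
            %fields = ();
        }
        $fields{$tag} = $content;
        $prev = $cp;
    }
    flush( $prev, \%fields ) if $prev ne '';

    sub flush {
        my ( $cp, $f ) = @_;
        my @row = ('') x scalar( keys %tags );    # one empty slot per tag
        $row[ $tags{$_} ] = $f->{$_} for keys %$f;
        print join( "\t", hex( substr $cp, 2 ), @row ), "\n";
    }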

    the lowliest monk