Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi all. I'm mangling the current Unihan database because the one on CPAN is out of date. There you'll find a link to the db textfile, which is 6 MB zipped, 28 MB unzipped.

This is what I've got.

# parse-unihan.pl
# mangle Unihan.txt on STDIN into tab separated columns
# for feeding it into an RDBMS
# most populated columns come first

use strict;
use diagnostics;
use Data::Dumper;

my %tags;
{
    my $i = 0;
    for (qw(
        kRSUnicode kIRGKangXi kRSKangXi kIRG_GSource kHanYu
        kIRGHanyuDaZidian kIRG_TSource kTotalStrokes kMandarin
        kIRG_KPSource kMorohashi kKangXi kDefinition kCantonese kCCCII
        kSBGY kKPS1 kIRGDaiKanwaZiten kIRG_KSource kCangjie kCNS1992
        kCNS1986 kDaeJaweon kIRGDaeJaweon kCihaiT kIRG_JSource
        kRSAdobe_Japan1_6 kEACC kJapaneseOn kBigFive kPhonetic
        kJapaneseKun kIICore kXerox kIRG_VSource kKorean
        kTaiwanTelegraph kMatthews kVietnamese kGSR kMeyerWempe
        kMainlandTelegraph kGB1 kGB0 kJis0 kFennIndex kJis1 kNelson
        kFrequency kFenn kKSC0 kGB3 kHKGlyph kCowles kKPS0 kIRG_HSource
        kHKSCS kTang kHanyuPinlu kJIS0213 kLau kSemanticVariant kKSC1
        kGB5 kSimplifiedVariant kTraditionalVariant kGradeLevel
        kZVariant kKarlgren kCompatibilityVariant kGB8
        kSpecializedSemanticVariant kIBMJapan kHDZRadBreak kRSJapanese
        kRSKanWa kPseudoGB1 kGB7 kIRG_USource kOtherNumeric
        kAccountingNumeric kRSKorean kPrimaryNumeric
    )) {
        $tags{$_} = $i;
        $i++;
    };
    # $tags{kRSUnicode} = 0; $tags{kIRGKangXi} = 1; and so on
};

my %unihan;

while (<>) {
    next if /^#/;                        # skip comments
    chomp;
    my ($codepoint, $tag, $content) = split /\t/;
    $codepoint =~ s/^U\+/0x/;            # replace U+ with 0x
    $codepoint = hex $codepoint;         # treat 0x number as hex, convert to dec
    $unihan{$codepoint}[$tags{$tag}] = $content;
};

foreach (keys %unihan) {
    $unihan{$_}[82] = $unihan{$_}[82];
    # autovivify the last field for correct number of columns
    # else SQL COPY command throws a hissy fit
    my $s = "$_\t";                      # codepoint in dec + tab
    $s .= join "\t", @{$unihan{$_}};     # append all content, tab separated
    $s .= "\n";                          # append final newline
    $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
    # work around: http://google.com/search?q=0x10000+site:postgresql.org+inurl:docs
    # replace U+10000 upwards with its perl escaped \x{HEXNUM} form
    print $s;
};

It's too damn slow. I haven't benchmarked or profiled, but in the course of developing I noticed that the join seems to be the big time killer.
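I suppose I could pin it down with Devel::DProf, or with a quick Benchmark comparison along these lines (untested sketch; the 83-element row is just made up to mimic one %unihan entry):

use strict;
use warnings;
use Benchmark qw(cmpthese);

# made-up 83-element row, roughly the shape of one %unihan entry
my @row = map { "field$_" } 0 .. 82;

cmpthese( -3, {
    join_row   => sub { my $s = join "\t", 12345, @row },
    concat_row => sub { my $s = "12345"; $s .= "\t$_" for @row },
} );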

If done right, this of course needs to be run only once, but I have the feeling that I'm going to need it a couple more times in the future.

Any speculation on how this program can be made faster is welcome.

Replies are listed 'Best First'.
Re: parsing textfile is too slow
by graff (Chancellor) on Aug 17, 2005 at 06:54 UTC
    (apologies for all the messy updates)

    I think the big problem is trying to hold everything in an HoA with about 70K hash keys and 82 array elements in each key (update: there are over a million lines of input data, with multiple lines per codepoint/hash key). Since you are retrieving the elements in roughly random order, you might be doing a fair bit of virtual memory swapping.

    You don't have to store the whole input stream before printing any of it out -- just keep the data for one codepoint in memory at a time, and print it out as you move to the next code point. Like this:

    ## ... %tags as defined in the OP ...

    my $tagcnt = scalar keys %tags;   # let's be sure about how many tags there are

    warn scalar localtime(), $/;      # initial time mark

    my $prev_codepoint = 0;
    my @outrec;
    my $ndone = 0;

    while (<>) {
        next if /^#/;                 # skip comments
        chomp;
        my ($codepoint, $tag, $content) = split /\t/;
        $codepoint =~ s/^U\+//;       # strip the U+ prefix
        $codepoint = hex $codepoint;  # treat number as hex, convert to dec
        if ( $codepoint != $prev_codepoint ) {
            printrec( $prev_codepoint, \@outrec ) if ( $prev_codepoint );
            @outrec = map { '' } (1..$tagcnt);
            $prev_codepoint = $codepoint;
        }
        $outrec[$tags{$tag}] = $content;

        # ongoing time-stamping:
        warn "$ndone done at ".scalar localtime()."\n" if ( ++$ndone % 1000 == 0 );
    }
    printrec( $prev_codepoint, \@outrec );

    warn scalar localtime(), $/;      # final timestamp

    sub printrec {
        my ( $cp, $rec ) = @_;
        my $s = join( "\t", $cp, @$rec ) . "\n";
        $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
        print $s;
    };
    I put similar "localtime" prints to stderr in your original version, and found it taking upwards of 40 sec per 1000 lines printed out. The fixed version shown here clocked over 14,000 input lines per sec. (Both versions running on a PowerBook G4 with 768 MB RAM, using perl 5.8.1.)

    I compared a few of the lines from the two versions, and they matched as intended (though I didn't do a complete run of your version -- I wanted to reply and get some sleep tonight :).

    UPDATE: The reason this version of the code should work as desired is that the Unihan.txt file from www.unicode.org should already be "sorted", in the sense that all the data lines for a given codepoint are contiguous. (I confirmed this was the case by comparing the output line count against  cut -f1 output | sort -u | wc -l , which gave the same line count; sorting the Unihan file yourself is unnecessary.) But if this turns out to be a false conclusion, and some codepoints have their tag/value tuples scattered around at different points in the file, all you have to do is sort the file before feeding it to this version of the script ( sort Unihan.txt | your_script.pl > loader.tab ) -- that will still be tons faster than trying to hold the whole file in a single perl HoA structure. This version finished (71,226 output codepoints) in 1m5s.
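    If the contiguity assumption ever needs re-checking on a future Unihan release, a throwaway one-liner along these lines should do it (untested sketch; it warns for any codepoint whose lines are scattered):

    # prints nothing when every codepoint's data lines are adjacent
    perl -ane 'warn "$F[0] is scattered\n" if $F[0] ne $prev and $seen{$F[0]}++; $prev = $F[0];' Unihan.txt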

Re: parsing textfile is too slow
by cowboy (Friar) on Aug 17, 2005 at 06:31 UTC
    I'm really not that sure about what you're trying to do, but at least one thing you could do to give a slight boost in speed is to avoid the assignment of:
    $unihan{$_}[82] = $unihan{$_}[82];
    That's a totally wasted op, unless I'm missing something.
    Depending on the size of the %unihan hash, you may or may not be able to save some time by changing:
    foreach (keys %unihan) {

    to:

    while ( ($key,$val) = each %unihan ) { }
    keys will create a new list containing all the keys. If your hash is large, this can be an expensive procedure. You might also try something similar to replace the join, depending on the sizes of the things you're dereferencing.
    Anyway, in a nutshell, the more records you have in a hash/arrayref, the more it hurts to copy them (dereference, keys/values/etc.).
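    Putting that together, the output loop might look something like this (an untested sketch; it pads the row to 83 columns instead of the self-assignment above):

    while ( my ($codepoint, $fields) = each %unihan ) {
        $#{$fields} = 82;    # make sure all 83 columns exist
        my $s = join( "\t", $codepoint, map { defined $_ ? $_ : '' } @$fields ) . "\n";
        $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
        print $s;
    }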

    Update: fixed one code example (oops), thanks QM
      Shouldn't
      while ($key,$val) %unihan { }
      be
      while ( ($key,$val)= each %unihan ) { }
      ??

      -QM
      --
      Quantum Mechanics: The dreams stuff is made of

        Oops, yes, corrected it. I knew something didn't look right, but I had just written it off to the fact I don't use each() very often. Thanks.
Re: parsing textfile is too slow
by newroz (Monk) on Aug 17, 2005 at 06:35 UTC
    Hi,
    "use diagnostic;" makes your program slower.
    Instead, use the -w flag; then the script will still produce the warnings.
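    For example (just a sketch; the idea is to keep the warnings without the overhead of diagnostics):

    use strict;
    # use diagnostics;   # pulls in perldiag and expands every warning
    use warnings;        # plain warnings; or run the script with perl -w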
Re: parsing textfile is too slow
by tlm (Prior) on Aug 17, 2005 at 13:11 UTC

    On my laptop, the version below takes 24 seconds to process the data. It assumes that the data is sorted, so there's a one-time additional cost (which may not be necessary) of sorting the data file. Actually, the important thing is not the sorting itself, but that all the lines corresponding to a given codepoint are adjacent to each other in the input stream. The idea is not to build the large %unihan hash, but to output the data for each codepoint as soon as it is available.
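    In outline, the idea looks something like this (only a rough, untested sketch, not the exact code referred to above; it assumes the %tags mapping from the OP is in scope and that all lines for a codepoint are adjacent in the input):

    my $prev = '';
    my %fields;
    while (<>) {
        next if /^#/;
        chomp;
        my ($cp, $tag, $content) = split /\t/;
        if ( $cp ne $prev and $prev ne '' ) {
            flush( $prev, \%fields );
            %fields = ();
        }
        $fields{$tag} = $content;
        $prev = $cp;
    }
    flush( $prev, \%fields ) if $prev ne '';

    sub flush {
        my ( $cp, $f ) = @_;
        my @row = ('') x scalar( keys %tags );    # one empty slot per tag
        $row[ $tags{$_} ] = $f->{$_} for keys %$f;
        print join( "\t", hex( substr $cp, 2 ), @row ), "\n";
    }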

    the lowliest monk