in reply to "parsing textfile is too slow"
I think the big problem is trying to hold everything in an HoA (hash of arrays), with 70K hash keys and 82 array elements in each key (update: there are over a million lines of input data, with multiple lines per codepoint/hash key). Since you are retrieving the elements in roughly random order, you might be doing a fair bit of virtual memory swapping.
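To make the contrast concrete, here is a minimal sketch (illustrative, not your actual code) of the hold-everything approach being described, assuming the `%tags` hash from the OP:

```perl
# Build the entire hash-of-arrays before printing anything: ~70K keys,
# 82 slots per key, filled from over a million input lines. The working
# set can outgrow physical RAM and push the process into swap.
my %by_codepoint;
while (<>) {
    next if /^#/;                  # skip comments
    chomp;
    my ($codepoint, $tag, $content) = split /\t/;
    $by_codepoint{$codepoint}[ $tags{$tag} ] = $content;
}
# ... only after the whole file is read does any printing start ...
```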
You don't have to store the whole input stream before printing any of it out -- just keep the data for one codepoint in memory at a time, and print it out as you move to the next codepoint. Like this:
I put similar "localtime" prints to stderr in your original version, and found it taking upwards of 40 sec per 1000 lines printed out. The fixed version shown here clocked over 14,000 lines## ... %tags as defined in the OP ... my $tagcnt = scalar keys %tags; # let's be sure about how many tags t +here are warn scalar localtime(), $/; # initial time mark my $prev_codepoint = 0; my @outrec; my $ndone = 0; while (<>) { next if /^#/; # skip comments chomp; my ($codepoint, $tag, $content) = split /\t/; $codepoint =~ s/^U\+//; # replace U+ with 0x $codepoint = hex $codepoint; # treat 0x number as hex, conv +ert to dec if ( $codepoint != $prev_codepoint ) { printrec( $prev_codepoint, \@outrec ) if ( $prev_codepoint ); @outrec = map { '' } (1..$tagcnt); $prev_codepoint = $codepoint; } $outrec[$tags{$tag}] = $content; # ongoing time-stamping: warn "$ndone done at ".scalar localtime()."\n" if ( ++$ndone % 100 +0 == 0 ); } printrec( $prev_codepoint, \@outrec ); warn scalar localtime(), $/; # final timestamp sub printrec { my ( $cp, $rec ) = @_; my $s = join( "\t", $cp, @$rec ) . "\n"; $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}' +/ge; print $s; };
I compared a few of the lines from the two versions, and they matched as intended (though I didn't do a complete run of your version -- I wanted to reply and get some sleep tonight :).
UPDATE: The reason this version of the code should work as desired is that the Unihan.txt file from www.unicode.org should already be "sorted", in the sense that all the data lines for a given codepoint are contiguous. (I concluded this from a casual inspection of the file, and confirmed it by comparing the output line count against `cut -f1 Unihan.txt | sort -u | wc -l`, which gave the same count; sorting the Unihan file yourself is unnecessary.) But if this turns out to be a false conclusion, and some codepoints have their tag/value tuples scattered around at different points in the file, all you have to do is sort the file before feeding it to this version of the script (`sort Unihan.txt | your_script.pl > loader.tab`) -- that will still be tons faster than trying to hold the whole file in a single Perl HoA structure.
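If you'd rather verify the contiguity assumption directly instead of trusting the line counts, here is a minimal sketch (mine, not part of the script above) that assumes the same U+XXXX&lt;TAB&gt;tag&lt;TAB&gt;value layout and dies on the first codepoint whose lines turn out to be scattered:

```perl
# Hypothetical contiguity check: a codepoint's lines are contiguous
# unless we see a codepoint again after having moved past it.
my (%seen, $prev);
while (<>) {
    next if /^#/;                  # skip comments
    my ($codepoint) = split /\t/;
    die "non-contiguous: $codepoint reappears at line $.\n"
        if $seen{$codepoint} && $codepoint ne $prev;
    $seen{$codepoint} = 1;
    $prev = $codepoint;
}
print "all codepoints contiguous\n";
```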
This version finished (71,226 output codepoints) in 1m5s.