(apologies for all the messy updates)

I think the big problem is trying to hold everything in an HoA (hash of arrays), with about 70K hash keys and 82 array elements in each key (there are over a million lines of input data, with multiple lines per codepoint/hash key). Since you are retrieving the elements in roughly random order, you might be doing a fair bit of virtual memory swapping.

You don't have to store the whole input stream before printing any of it out -- just keep the data for one codepoint in memory at a time, and print it out as you move to the next code point. Like this:

    ## ... %tags as defined in the OP ...

    my $tagcnt = scalar keys %tags;  # let's be sure about how many tags there are

    warn scalar localtime(), $/;  # initial time mark

    my $prev_codepoint = 0;
    my @outrec;
    my $ndone = 0;
    while (<>) {
        next if /^#/;  # skip comments
        chomp;
        my ($codepoint, $tag, $content) = split /\t/;
        $codepoint =~ s/^U\+//;       # strip the leading "U+"
        $codepoint = hex $codepoint;  # parse the remaining hex digits, convert to decimal
        if ( $codepoint != $prev_codepoint ) {
            # a new codepoint starts here: flush the finished record, then reset
            printrec( $prev_codepoint, \@outrec ) if ( $prev_codepoint );
            @outrec = map { '' } (1..$tagcnt);
            $prev_codepoint = $codepoint;
        }
        $outrec[$tags{$tag}] = $content;

        # ongoing time-stamping:
        warn "$ndone done at ".scalar localtime()."\n" if ( ++$ndone % 1000 == 0 );
    }
    printrec( $prev_codepoint, \@outrec );  # flush the last record

    warn scalar localtime(), $/;  # final timestamp

    sub printrec {
        my ( $cp, $rec ) = @_;
        my $s = join( "\t", $cp, @$rec ) . "\n";
        # write characters beyond the BMP as \x{...} escapes
        $s =~ s/([\x{10000}-\x{1FFFFF}])/'\x{'.(sprintf '%X', ord $1).'}'/ge;
        print $s;
    }

I put similar "localtime" prints to stderr in your original version, and found it taking upwards of 40 sec per 1000 lines printed out. The fixed version shown here clocked over 14,000 input lines per sec. (Both versions running on a PowerBook G4 with 768 MB RAM, using perl 5.8.1.)
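
If you want a lines-per-second figure directly instead of eyeballing localtime stamps, something along these lines might do. It is only a sketch: make_progress_reporter is a name I made up for illustration, and it relies on the core Time::HiRes module for sub-second timing.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Hypothetical helper: returns a closure that counts processed lines and,
# every $every lines, writes the running lines-per-second rate to STDERR.
sub make_progress_reporter {
    my ($every) = @_;
    my $start = time();
    my $count = 0;
    return sub {
        $count++;
        if ( $count % $every == 0 ) {
            my $elapsed = time() - $start;
            my $rate = $elapsed > 0 ? $count / $elapsed : 0;
            warn sprintf "%d lines done, %.0f lines/sec\n", $count, $rate;
        }
        return $count;   # total lines seen so far
    };
}
```

Create it once before the read loop (my $tick = make_progress_reporter(1000);) and call $tick->() once per input line.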

I compared a few of the lines from the two versions, and they matched as intended (though I didn't do a complete run of your version -- I wanted to reply and get some sleep tonight :).

UPDATE: The reason this version of the code should work as desired is that the Unihan.txt file from www.unicode.org should already be "sorted", in the sense that all the data lines for a given codepoint are contiguous. (I confirmed this was the case by comparing the output line count against  cut -f1 output | sort -u | wc -l  which gave the same line count, so sorting the Unihan file yourself is unnecessary.) But if this turns out to be a false conclusion, and some codepoints have their tag/value tuples scattered around at different points in the file, all you have to do is sort the file before feeding it to this version of the script ( sort Unihan.txt | your_script.pl > loader.tab ) -- that will still be tons faster than trying to hold the whole file in a single perl HoA structure. This version finished (71,226 output codepoints) in 1m5s.
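
The contiguity assumption can also be checked on the input itself. Here is a rough sketch, not something I ran against Unihan.txt: noncontiguous_keys is a made-up name, and the sample lines are invented values in the tab-separated layout, only there to show a key coming back after another one has intervened.

```perl
use strict;
use warnings;

# Hypothetical helper: return the keys (first tab-separated field) that
# show up again after a different key has intervened, i.e. keys whose
# lines are not contiguous in the input.
sub noncontiguous_keys {
    my ($lines) = @_;    # array ref of input lines
    my ( %finished, %bad );
    my $current;
    for my $line (@$lines) {
        next if $line =~ /^#/ or $line !~ /\S/;    # skip comments and blank lines
        my ($key) = split /\t/, $line;
        next if defined $current and $key eq $current;    # still in the same run
        $finished{$current} = 1 if defined $current;      # previous run is over
        $bad{$key} = 1 if $finished{$key};                # key seen in an earlier run
        $current = $key;
    }
    return sort keys %bad;
}

# Invented sample data (not real Unihan values)
my @sample = (
    "# comment line\n",
    "U+3400\ttagA\tvalue1\n",
    "U+3400\ttagB\tvalue2\n",
    "U+3401\ttagA\tvalue3\n",
    "U+3400\ttagC\tvalue4\n",    # U+3400 returns after U+3401
);
print join( ",", noncontiguous_keys( \@sample ) ), "\n";    # prints U+3400
```

An empty result means the file can go straight into the script; anything else means it needs the sort first. For the real file you would pass [ <> ] or adapt the loop to read line by line, since the helper only needs to remember the keys it has seen, not the data.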


In reply to Re: parsing textfile is too slow by graff
in thread parsing textfile is too slow by Anonymous Monk
