in reply to sorting type question- space problems

G'day baxy77bax,

Without knowing the (average) record size, it's a little hard to estimate whether this solution fits the 100MB memory limit. If the records are of a similar size to your dummy data, then it probably won't meet that requirement; if they are on the order of hundreds of bytes each, then you may be in with a chance.

Anyway, here's a potential solution [Updated - see below]:

$ perl -Mstrict -Mwarnings -Mautodie -le '
    use Tie::File;
    tie my @input_records, "Tie::File", "./pm_1054101_input.txt";
    open my $out_fh, ">", "./pm_1054101_output.txt";
    print $out_fh $input_records[$_]
        for map  { $_->[0] }
            sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
            map  { [ $_, map { substr $_, 3 } (split " ", $input_records[$_])[0,1] ] }
            0 .. $#input_records;
    close $out_fh;
    untie @input_records;
'

Input file:

$ cat pm_1054101_input.txt
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...

Output file:

$ cat pm_1054101_output.txt
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...
key43 key21 sdkjfhdghdbgbd

File sizes unchanged:

$ ls -l pm_1054101_*
-rw-r--r--  1 ken  staff  159 14 Sep 22:41 pm_1054101_input.txt
-rw-r--r--  1 ken  staff  159 14 Sep 23:04 pm_1054101_output.txt

Update: I removed an intermediary array, which might save some memory. Here's the original code:

-- Ken

Re^2: sorting type question- space problems
by BrowserUk (Patriarch) on Sep 14, 2013 at 14:27 UTC

    That will never work. It will run the machine out of memory.

    It will attempt to build 4 huge lists (each stage is marked in the annotated pipeline after this list):

    1. a list of integers, one for every record in the file.

      Input to the first map.

    2. a list of anonymous arrays -- each containing two elements -- one for every record in the file.

      Output from the map, input to sort.

    3. Same again, but reordered.

      Output from sort, input to second map.

    4. An ordered list of all the records.

      Output from second map, input to for.
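
    To make the stages concrete, here is the one-liner's pipeline again with each stage labelled against the four lists above (the annotations are added here for illustration, not part of the parent node):

    print $out_fh $input_records[$_]                              # records written one at a time
        for map  { $_->[0] }                                      # (4) ordered indices, input to for
            sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }   # (3) the triples, reordered
            map  { [ $_, map { substr $_, 3 }
                         (split " ", $input_records[$_])[0,1] ] } # (2) one anonymous array per record
            0 .. $#input_records;                                 # (1) one integer per record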

    If the OP's records average 64 bytes, that would require:

    • 4 * 20GB = 80GB
    • + 6 * 335544320 * 24 bytes (for the scalars) = 45GB
    • + 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB
    • + lots of other bits and pieces of overhead within Tie::File and Perl's internals stacks = ??

    For a grand total memory requirement of at least 197.5 gigabytes.
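
    For anyone checking the arithmetic, those figures reproduce like so (a plain calculation, taking the 64-byte record size and the per-item costs above as given):

    $ perl -le '
        my $n = 20 * 2**30 / 64;                               # 335544320 records
        printf "scalars: %.1f GiB\n", 6 * $n * 24  / 2**30;    # ~45 GiB
        printf "arrays:  %.1f GiB\n",     $n * 232 / 2**30;    # ~72.5 GiB
    '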


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      Thank you for this analysis. I have some points to raise and some questions. At the outset, I'd like to say that this isn't intended as a nitpicking exercise (although it may look like that in parts); I'm more interested in learning from it.

      "It will attempt to build 4 huge lists:"

      Possibly wishful thinking on my part, but I had imagined that as each list was used up, its data would no longer persist and the memory it used could be reclaimed.

      To clarify: the first map generates a list, then passes the elements of that list on to sort which, in turn, generates another list that is passed to the second map, and so on. When, say, the second map is being processed, $_ is an alias into the sort list (so sort's data is still in use); however, at this point, the list generated by the first map is no longer used by anything and its memory could, at least in theory, be reclaimed without causing problems. Anyway, that was my thinking; I haven't seen any documentation indicating whether this is how it actually works, so I could be entirely wrong.

      "1. a list of integers, one for every record in the file. Input to the first map."

      Quite right. I did think about this when coding it, but clearly misremembered what I'd previously read. Going back to "perlop: Range Operators", I see "... no temporary array is created when the range operator is used as the expression in foreach loops, ...": I had thought that was more of a general rule for generating lists with the ".." operator (clearly, it isn't). So in retrospect, "for (0 .. $#input_records) {...}", instead of "... 0 .. $#input_records;", would have removed this overhead and been a better choice.
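
      A minimal sketch of that restructuring (assuming the same tied @input_records as in the one-liner; @index is a name introduced here purely for illustration):

      my @index;
      for my $i ( 0 .. $#input_records ) {   # range as the foreach expression: no temporary list
          push @index, [ $i, map { substr $_, 3 } (split " ", $input_records[$i])[0,1] ];
      }
      # @index can then be sorted and walked exactly as before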

      "2. & 3. a list of anonymous arrays -- each containing two elements ... (map1 to sort and sort to map2)"

      Actually, that's three elements: an integer and two strings. Looking back at it, I can see that changing "substr $_, 3" to "0 + substr $_, 3" would have produced three integers, which would've saved some memory.
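
      A quick way to see the difference (assuming the CPAN module Devel::Size is installed):

      use Devel::Size qw(size);

      my $as_string = substr "key42", 3;       # "42" held as a string (PV)
      my $as_number = 0 + substr "key42", 3;   # 42 held as an integer (IV)
      print size($as_string), " vs ", size($as_number), " bytes\n";   # the IV typically reports smaller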

      "4. An ordered list of all the records. Output from second map, input to for."

      That's not a list of the records; it's a list of integers (i.e. (0 .. $#input_records) after being sorted). Possibly that's what you meant, but I thought I'd just clear it up in case it wasn't.

      "If the OPs records average 64 bytes, that would require:"

      The OP has indicated "there are abbout 5 mil records (lines) which sum up to 20GB in total." [sic]. I appreciate that was possibly written while you were composing your reply. Anyway, that gives an average closer to 4,295 bytes (20 * 1024**3 / 5e6); and your calculated ~335 million records (instead of the given ~5 million records) will mean the results will be out by a couple of orders of magnitude (where that figure is used).
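
      That average is straightforward to reproduce:

      $ perl -le 'printf "%.0f bytes/record\n", 20 * 1024**3 / 5e6'
      4295 bytes/record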

      In the dot points that followed (with the various calculations), I have some questions regarding where some of the numbers came from. Where appropriate, if you could point me to documentation (e.g. some section in perlguts), that would be preferable as I'd be interested in reading about general cases rather than just specifics relating to this question: that might also be a lot less typing for you.

      • "4 * 20GB = 80GB": What does the 4 refer to? (I've assumed 20GB is the file size.)
      • "+ 6 * 335544320 * 24 bytes (for the scalars) = 45GB": I've guessed the 6 is from the earlier numbered points (1+2+2+1); if so, that should be 8 (1+3+3+1). 335,544,320 is the number of records which, as noted above, should be 5,000,000. I thought 24 might be the size of a scalar, but some of those are integers (I thought they would've been 8), so I'm not sure on this (and it would be one I'd be interested in documentation for).
      • "+ 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB": 335544320 (number of records, already mentioned). 232 has me stumped (possibly another doco link here).
      • "+ lots of other bits and pieces ...": Agreed. Unknown.

      -- Ken

        (I'd be interested in documentation for ... possibly another doco link here)

        See Perlguts Illustrated. In the light of the OP's later information regarding the number and size of his records, and with careful consideration of the build of Perl he is using, a more accurate estimate of the total memory requirement can be calculated.

        But the point is simply this: even using rough estimates, it is easy to see that unless the OP (or you) has access to some particularly high-end hardware, then regardless of whether the memory overuse is 5x or 50x more than the typical 4GB to (say) 24GB available on most machines, it ain't gonna fly.

Re^2: sorting type question- space problems
by baxy77bax (Deacon) on Sep 14, 2013 at 13:50 UTC
    Nice solution, thnx!!

    However, the records are not <= 100 bytes; they are much larger. There are abbout 5 mil records (lines) which sum up to 20GB in total. Some records are big (> 5MB) and some are small (5 bytes), so an in-memory sort is not an option :(

      First, RickardK's solution of using sort(1) makes sense. If it's there, use it: it's written in C, highly optimized, and well tested.
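
      For example (a sketch assuming GNU coreutils sort; -S caps the in-memory buffer so the merge spills to temp files, and LC_ALL=C forces plain byte comparison):

      $ LC_ALL=C sort -k1,1 -k2,2 -S 80M input > sorted

      Note this compares the keys as strings; if the numeric parts of keys like "key2" vs "key43" must sort numerically, you'd need zero-padded keys or a more elaborate -k specification.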

      That said, if your keys tend to be small compared to the rest of the records, in-memory sort may be feasible by recording only the keys and the file offsets of their respective lines:

      my @a;
      open my $fh, "<", "input" or die $!;
      while (1) {
          last if eof($fh);
          my $pos = tell($fh);
          my ($k1, $k2) = split /\s+/, <$fh>;
          push @a, [$k1, $k2, $pos];
      }
      foreach ( sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } @a ) {
          seek($fh, $_->[2], 0);
          print scalar <$fh>;
      }

      Probably not the fastest, but if you want to avoid external sorts, both in the sense of shelling out and of temp files, it may be worth a try. At 5M lines it will likely break the 100MB limit; and without tricks like assuming something about characters that don't occur in the keys, and thus encoding each record's keys and offset in one string, it's unlikely to get much smaller.
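
      For illustration, a fixed-width variant of that packed-string trick (a sketch only: it assumes the keys fit in 8 bytes, and a perl new enough for the big-endian "Q>" template; longer keys would be silently truncated):

      my @a;
      open my $fh, "<", "input" or die $!;
      until ( eof $fh ) {
          my $pos = tell $fh;
          my ($k1, $k2) = split /\s+/, <$fh>;
          push @a, pack "a8 a8 Q>", $k1, $k2, $pos;   # one flat string per record
      }
      for ( sort @a ) {               # plain string sort: null-padded keys order by k1, then k2,
          seek $fh, unpack( "x16 Q>", $_ ), 0;        # then big-endian offset (i.e. file order)
          print scalar <$fh>;
      }

      The index becomes one flat string per record instead of an array of three scalars, which cuts the per-record overhead considerably, though 5M Perl strings still carry enough per-scalar overhead that the 100MB limit remains doubtful.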

      Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(

      Actually, it is. Or at least, it may be so ... if the lengths of the keys in your OP are somewhat representative of your real data, and you have a couple of GB of RAM available.

      The lengths of the records are immaterial; the 5 million records can be represented in memory by the two keys plus a 64-bit integer offset of the start of each record within the file. If the two keys are less than ~8 bytes each, then the total memory requirement for 5 million anonymous arrays, each containing the two keys plus a file offset, is ~1.8GB.
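
      One way to check that estimate empirically (again assuming the CPAN module Devel::Size is installed):

      use Devel::Size qw(total_size);

      # one representative index entry: two short keys plus a byte offset
      print total_size( [ "key1", "key2", 1_234_567_890 ] ), " bytes\n";
      # multiply by 5 million (plus @index's own slot overhead) for a ballpark figure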

      This code builds an index of the two keys + file offset in an AoA; sorts that AoA in memory; and then writes the output file by seeking to each record's position in the input file, reading the record, and writing it to the output file. For a 5 million record file it takes a little over 2 minutes on my machine:

      #! perl -sw
      use strict;

      open IN, '<', $ARGV[0] or die $!;

      my @index;
      my $pos = 0;
      while( <IN> ) {
          my( $k1, $k2 ) = m[(^\S+)\s+(\S+)];
          push @index, [ $k1, $k2, $pos ];
          $pos = tell IN;
      }

      @index = sort{ $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @index;

      open OUT, '>', 'sorted';
      for my $r ( @index ) {
          seek IN, $r->[2], 0;
          print OUT scalar <IN>;
      }
      close OUT;
      close IN;

      __END__
      C:\test>dir unsorted
      15/09/2013  14:40       117,501,066 unsorted

      C:\test>wc -l unsorted
      5000000 unsorted

      C:\test>head unsorted
      key25 key15 xxxxxxxxxxx
      key28 key05 xxxxx
      key30 key18 xxxxxxxxxxxxxx
      key24 key03 xxxxxxxxx
      key41 key01 xxxxxxxxxxxxxx
      key12 key16 xxxxxxxxxxxx
      key38 key20 xx
      key19 key19 xxxxxxxxxxxxxxxxxx
      key30 key13 xxxxxxxx
      key16 key19 xxxxxxxxxxxxx

      [14:41:03.25] C:\test>1054101 unsorted
      [14:43:19.59] C:\test>

      [14:44:38.83] C:\test>head sorted
      key01 key01 xxxxxxxxxxxxx
      key01 key01 xxxxxxx
      key01 key01 x
      key01 key01 xxxxxxxx
      key01 key01 xxxxx
      key01 key01 xxxxxx
      key01 key01 xxxxxxxxxxxxxxxxx
      key01 key01 xxxxxx
      key01 key01 xx
      key01 key01 xxxxxxxxxx