baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Well I need some advice and pointers. Let me first explain my situation. I have this huge (ca. 20GB) text file that looks like this:

key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...

i believe the structure is clear:
- there are two keys
- keys can be repeated
What I need to do is sort my text file first according to the key in the first column, so that all records having key1 come first, followed by key2, up to key43. Next, the records inside each bucket need to be sorted again according to the second key column. (There are only two columns, that is, two keys.) Now, the fastest way I can imagine is to create 43 bucket files and then just iterate through the main file, printing records into the appropriate bucket. Once done, repeat the process inside each bucket. Afterwards, join the files and delete the unnecessary buckets (files).

The downside is that if the sort is interrupted, my temporary bucket files remain on disk and have to be removed by hand. Alternatively, I could intercept SIGINT and delete the buckets before the program terminates.
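Something like this minimal sketch is what I have in mind for the SIGINT cleanup (untested; the bucket file names are just placeholders):

    use strict;
    use warnings;

    # Placeholder bucket names -- adjust to however the buckets are actually created.
    my @buckets = map { "bucket_key$_.tmp" } 1 .. 43;

    # On Ctrl-C, remove whatever temporary buckets exist, then bail out.
    $SIG{INT} = sub {
        unlink grep { -e $_ } @buckets;
        die "Interrupted -- temporary bucket files removed\n";
    };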

What I came here to ask is: does someone have a better solution, one that is faster, does not consume a lot of memory (100MB tops), and does not create this file mess on my disk?

Any comment is more than welcome.

Thank you

baxy

UPDATE:

To keep the "file explosion" under control, would it be possible to create a "virtual file" - one file, but divided into sections - and then print into the different sections (something like using fseek in C)? Would that be advisable?
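For illustration, this is roughly what I mean (untested sketch; the 512MB section size and file name are made-up assumptions - in Perl, seek/tell play the role of fseek/ftell):

    use strict;
    use warnings;

    # Assumed layout: one scratch file carved into 43 fixed-size sections, one per primary key.
    my $section_size = 512 * 1024 * 1024;

    open my $scratch, '+>', 'scratch.tmp' or die $!;
    my @next_free = map { $_ * $section_size } 0 .. 42;    # next write offset per section

    sub append_to_section {
        my ($key_no, $record) = @_;                        # $key_no is 1 .. 43
        seek $scratch, $next_free[ $key_no - 1 ], 0        # like fseek(fp, off, SEEK_SET)
            or die "seek failed: $!";
        print {$scratch} $record;
        $next_free[ $key_no - 1 ] = tell $scratch;
    }

The obvious catch is that the section size has to be chosen in advance and a section overflows as soon as one key gets more than its share of the data.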

Replies are listed 'Best First'.
Re: sorting type question- space problems
by kcott (Archbishop) on Sep 14, 2013 at 13:21 UTC

    G'day baxy77bax,

    Without knowing the (average) record size, it's a little hard to estimate whether this solution fits the 100MB memory limit. If the records are of a similar size to your dummy data, then it probably won't meet that requirement; if they are in the order of hundreds of bytes each, then you may be in with a chance.

    Anyway, here's the potential solution [Updated - see below]:

    $ perl -Mstrict -Mwarnings -Mautodie -le '
        use Tie::File;
        tie my @input_records, "Tie::File", "./pm_1054101_input.txt";
        open my $out_fh, ">", "./pm_1054101_output.txt";
        print $out_fh $input_records[$_]
            for map  { $_->[0] }
                sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
                map  { [ $_, map { substr $_, 3 } (split " ", $input_records[$_])[0,1] ] }
                0 .. $#input_records;
        close $out_fh;
        untie @input_records;
    '

    Input file:

    $ cat pm_1054101_input.txt
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key43 key21 sdkjfhdghdbgbd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...

    Output file:

    $ cat pm_1054101_output.txt
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...
    key43 key21 sdkjfhdghdbgbd

    File sizes unchanged:

    $ ls -l pm_1054101_*
    -rw-r--r--  1 ken  staff  159 14 Sep 22:41 pm_1054101_input.txt
    -rw-r--r--  1 ken  staff  159 14 Sep 23:04 pm_1054101_output.txt

    Update: I removed an intermediary array which might save some memory. Here's the original code:

    -- Ken

      That will never work. It will run the machine out of memory.

      It will attempt to build 4 huge lists:

      1. a list of integers, one for every record in the file.

        Input to the first map.

      2. a list of anonymous arrays -- each containing two elements -- one for every record in the file.

        Output from the map, input to sort.

      3. Same again, but reordered.

        Output from sort, input to second map.

      4. An ordered list of all the records.

        Output from second map, input to for.

      If the OP's records average 64 bytes, that would require:

      • 4 * 20GB = 80GB
      • + 6 * 335544320 * 24 bytes (for the scalars) = 45GB
      • + 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB
      • + lots of other bits and pieces of overhead within Tie::File and Perl's internals stacks = ??

      For a grand total memory requirement of at least 197.5 Gigabytes


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Thank you for this analysis. I have some points to raise and some questions. At the outset, I'd like to say that this isn't intended as a nitpicking exercise (although it may look like that in parts); instead, I'm more interested in learning from it.

        "It will attempt to build 4 huge lists:"

        Possibly wishful thinking on my part, but I had imagined that as each list was used up, its data would no longer persist and the memory it used could be reclaimed.

        To clarify, the first map generates a list then passes the elements of that list onto sort which, in turn, generates another list which is passed to the second map, and so on. When, say, the second map is being processed, $_ is an alias into the sort list (so sort's data is still in use); however, at this point, the list generated by the first map is not being used by anything and its memory could, at least in theory, be reused without causing problems. Anyway, that was my thinking: I haven't seen any documentation indicating whether this is how it works - I could be entirely wrong.

        "1. a list of integers, one for every record in the file. Input to the first map."

        Quite right. I did think about this when coding it, but clearly misremembered what I'd previously read. Going back to "perlop: Range Operators", I see "... no temporary array is created when the range operator is used as the expression in foreach loops, ...": I had thought that was more of a general rule for generating lists with the ".." operator (clearly, it isn't). So in retrospect, "for (0 .. $#input_records) {...}", instead of "... 0 .. $#input_records;", would have removed this overhead and been a better choice.

        "2. & 3. a list of anonymous arrays -- each containing two elements ... (map1 to sort and sort to map2)"

        Actually, that's three elements (numbers): integer, string, string. Looking back at it, I can see that changing "substr $_, 3" to "0 + substr $_, 3" would have produced three integers which would've saved some memory.

        "4. An ordered list of all the records. Output from second map, input to for."

        That's not a list of the records, it's a list of integers (i.e. (0 .. $#input_records) after being sorted). Possibly what you meant, but I thought I'd just clear that up in case it wasn't.

        "If the OPs records average 64 bytes, that would require:"

        The OP has indicated "there are abbout 5 mil records (lines) which sum up to 20GB in total." [sic]. I appreciate that was possibly written while you were composing your reply. Anyway, that gives an average closer to 4,295 bytes (20 * 1024**3 / 5e6); and your calculated ~335 million records (instead of the given ~5 million records) will mean the results will be out by a couple of orders of magnitude (where that figure is used).

        In the dot points that followed (with the various calculations), I have some questions regarding where some of the numbers came from. Where appropriate, if you could point me to documentation (e.g. some section in perlguts), that would be preferable as I'd be interested in reading about general cases rather than just specifics relating to this question: that might also be a lot less typing for you.

        • "4 * 20GB = 80GB": What does the 4 refer to? (I've assumed 20GB is the file size.)
        • "+ 6 * 335544320 * 24 bytes (for the scalars) = 45GB": I've guessed the 6 is from the earlier numbered points (1+2+2+1); if so, that should be 8 (1+3+3+1). 335,544,320 is the number of records which, as noted above, should be 5,000,000. I thought 24 might be the size of a scalar, but some of those are integers (I thought they would've been 8), so I'm not sure on this (and it would be one I'd be interested in documentation for).
        • "+ 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB": 335544320 (number of records, already mentioned). 232 has me stumped (possibly another doco link here).
        • "+ lots of other bits and pieces ...": Agreed. Unknown.

        -- Ken

      Nice solution , thnx!!

      However, the records are not <= 100 bytes they are much larger. there are abbout 5 mil records (lines) which sum up to 20GB in total. Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(

        First, RichardK's solution of using sort(1) makes sense. If it's there, use it, as it's written in C, highly optimized and well tested.

        That said, if your keys tend to be small compared to the rest of the records, in-memory sort may be feasible by recording only the keys and the file offsets of their respective lines:

        my @a;
        open my $fh, "<", "input" or die $!;
        while (1) {
            last if eof($fh);
            my $pos = tell($fh);
            my ($k1, $k2) = split /\s+/, <$fh>;
            push @a, [$k1, $k2, $pos];
        }
        foreach (sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } @a) {
            seek($fh, $_->[2], 0);
            print scalar <$fh>;
        }

        Probably not the fastest, but if you want to avoid external sorts both in the sense of shelling out and tempfiles, it may be worth a try. At 5M lines it will likely break the 100 MB limit but without tricks like assuming something about characters that don't occur in the keys and thus encoding all the keys and offsets in one string it's unlikely to get much smaller.
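        For what it's worth, one way to read that last suggestion is to keep one packed string per record instead of a three-element arrayref. A rough, untested sketch (the 16-byte key width is an assumption, and the Q template needs a 64-bit perl):

        use strict;
        use warnings;

        open my $fh, '<', 'input' or die $!;

        my @idx;
        until (eof $fh) {
            my $pos  = tell $fh;
            my $line = <$fh>;
            my ($k1, $k2) = split /\s+/, $line;
            # Fixed-width packing means a plain string sort orders by key1, then key2.
            push @idx, pack 'A16 A16 Q>', $k1, $k2, $pos;
        }

        for my $rec (sort @idx) {
            my $pos = unpack 'x32 Q>', $rec;    # skip the two 16-byte keys
            seek $fh, $pos, 0 or die $!;
            print scalar <$fh>;
        }

        Like the cmp-based code above, this compares the keys as strings, so key10 would land before key2 unless the keys are zero-padded.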

        Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(

        Actually, it is. Or at least, it may be so ... if the lengths of the keys in your OP are somewhat representative of your real data, and you have a couple of GB of RAM available.

        The lengths of the records are immaterial; the 5 million records can be represented in memory by the two keys + a (64-bit) integer offset of the start of each record's position within the file. If the two keys are less than ~8 bytes each, then the total memory requirement for 5 million anonymous arrays, each containing 2 keys + a file offset, is ~1.8GB.

        This code builds an index of the two keys + file offset in an AoA; sorts that AoA in memory and then writes the output file by seeking the input file, reading the appropriate record and writing it to the output file. For a 5 million record file it takes a little over 2 minutes on my machine:

        #! perl -sw
        use strict;

        open IN, '<', $ARGV[0] or die $!;
        my @index;
        my $pos = 0;
        while( <IN> ) {
            my( $k1, $k2 ) = m[(^\S+)\s+(\S+)];
            push @index, [ $k1, $k2, $pos ];
            $pos = tell IN;
        }
        @index = sort{ $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @index;
        open OUT, '>', 'sorted';
        for my $r ( @index ) {
            seek IN, $r->[2], 0;
            print OUT scalar <IN>;
        }
        close OUT;
        close IN;
        __END__
        C:\test>dir unsorted
        15/09/2013  14:40       117,501,066 unsorted

        C:\test>wc -l unsorted
        5000000 unsorted

        C:\test>head unsorted
        key25 key15 xxxxxxxxxxx
        key28 key05 xxxxx
        key30 key18 xxxxxxxxxxxxxx
        key24 key03 xxxxxxxxx
        key41 key01 xxxxxxxxxxxxxx
        key12 key16 xxxxxxxxxxxx
        key38 key20 xx
        key19 key19 xxxxxxxxxxxxxxxxxx
        key30 key13 xxxxxxxx
        key16 key19 xxxxxxxxxxxxx

        [14:41:03.25] C:\test>1054101 unsorted
        [14:43:19.59] C:\test>

        [14:44:38.83] C:\test>head sorted
        key01 key01 xxxxxxxxxxxxx
        key01 key01 xxxxxxx
        key01 key01 x
        key01 key01 xxxxxxxx
        key01 key01 xxxxx
        key01 key01 xxxxxx
        key01 key01 xxxxxxxxxxxxxxxxx
        key01 key01 xxxxxx
        key01 key01 xx
        key01 key01 xxxxxxxxxx

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: sorting type question- space problems
by hdb (Monsignor) on Sep 14, 2013 at 14:41 UTC

    Your "virtual" file could be a directory tree structure. Assuming the keys are valid names for directories, you could do something like this:

    use strict;
    use warnings;

    my $dir = "data.tmp";
    mkdir $dir unless -d $dir;

    while(<DATA>){
        my( $key1, $key2, $data ) = split /\s+/, $_, 3;
        mkdir "$dir/$key1" unless -d "$dir/$key1";
        mkdir "$dir/$key1/$key2" unless -d "$dir/$key1/$key2";
        open my $fh, ">>", "$dir/$key1/$key2/data.txt";
        print $fh $data;
        close $fh;
    }

    opendir( my $dh, $dir );
    for my $key1 (sort grep { !/^\./ } readdir $dh) {
        opendir( my $dh1, "$dir/$key1" );
        for my $key2 (sort grep { !/^\./ } readdir $dh1) {
            open my $fh, "<", "$dir/$key1/$key2/data.txt";
            while( <$fh> ) {
                print "$key1\t$key2\t$_";
            }
            close $fh;
        }
    }

    system( "rm -rf $dir" );

    __DATA__
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key43 key21 sdkjfhdghdbgbd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...

    If things go wrong, you only have to delete the one directory... WARNING: No error checking in this code, you need to add that yourself!

Re: sorting type question- space problems
by hdb (Monsignor) on Sep 14, 2013 at 12:36 UTC

    You may want to look at some database, e.g. DBD::CSV.

    UPDATE: I was thinking of something like this but I do not have a 20GB test file to see how fast it is.

    use strict;
    use warnings;
    use DBI;

    my $dbn = DBI->connect ("dbi:CSV:", "", "", { f_dir => "dbdcsv" } );

    $dbn->do ("DROP TABLE data");
    $dbn->do ("CREATE TABLE data (key1 CHAR (10), key2 CHAR(10), data CHAR(255) )");

    my $sth = $dbn->prepare( "INSERT INTO data (key1, key2, data) VALUES (?, ?, ?)" );
    while(<DATA>){
        chomp;
        $sth->execute( split /\s+/, $_, 3 );
    }

    $sth = $dbn->prepare ( "SELECT * FROM data ORDER BY key1, key2" );
    $sth->execute( );
    $"="\t";
    while( my $row = $sth->fetchrow_arrayref() ) {
        print "@$row\n";
    }

    __DATA__
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key43 key21 sdkjfhdghdbgbd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...

    UPDATE 2: I have tested this on some random 20GB test file. I cannot comment on performance but it consumes a LOT of memory...

Re: sorting type question- space problems
by RichardK (Parson) on Sep 14, 2013 at 13:16 UTC

    Does your platform have sort available? If so, I'd use that as it will do what you need without having to write any code.

Re: sorting type question- space problems
by BrowserUk (Patriarch) on Sep 14, 2013 at 13:45 UTC

    Since you know what keys you've got, you do not need to create 43 bucket files, just one final output file and do 43 passes.
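    A rough, untested sketch of that multi-pass idea (file names are placeholders, and this covers the first key only - each key1 group would still need a second-key pass or an in-memory sort):

    use strict;
    use warnings;

    open my $out, '>', 'sorted.txt' or die $!;
    for my $k (1 .. 43) {
        # One full pass over the input per key1 value; append only matching records.
        open my $in, '<', 'input.txt' or die $!;
        while (<$in>) {
            print {$out} $_ if /^key$k\s/;
        }
        close $in;
    }
    close $out;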


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: sorting type question- space problems
by Laurent_R (Canon) on Sep 14, 2013 at 18:04 UTC

    First, whichever way you go for sorting 20 GB of data, you will need quite a bit of disk space for intermediate storage. Even if you use a sort utility provided by your OS, it will create a number of temporary files (probably at least several hundred) on your disk. So the first thing your program should do is to check that there is ample disk space wherever the temp files will go (or you need to check manually before launching the sort). These sort utilities usually take care of removing temporary files when they don't need them anymore, but they might not be able to do it if they crash really badly.

    Second, what you describe is not really what I would call sorting, but rather dispatching your data into 43 x 21 buckets (assuming from your example that there are 21 secondary keys) and then merging the buckets in the specific order of the keys, and this can be much faster than actual sorting.

    I would suggest that you do create 43 x 21 = 903 files in a temporary directory on disk. You then just read your file once and dispatch the records into the proper files. This requires opening 900+ filehandles. Perl can handle 1000 open filehandles without problem (you'll have to use an array or hash of filehandles), so it should work; if you hit an operating system limit, then you'll have to go for two passes. Then it is just a matter of merging the files back in the adequate order, and deleting them as soon as you no longer need them. If it crashes, all your files are in the same temp directory, so it is no big deal to get rid of them.
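    A rough, untested sketch of that dispatch-and-merge idea (the file names and the key ranges 1..43 and 1..21 are assumptions taken from the example data):

    use strict;
    use warnings;
    use File::Path qw(make_path remove_tree);

    my $tmpdir = 'sort_buckets.tmp';
    make_path($tmpdir);

    my %fh;                                        # one filehandle per (key1, key2) pair
    open my $in, '<', 'input.txt' or die $!;
    while (my $line = <$in>) {
        my ($k1, $k2) = split /\s+/, $line;
        my $bucket = "$tmpdir/$k1.$k2";
        unless ($fh{$bucket}) {
            open $fh{$bucket}, '>>', $bucket or die $!;
        }
        print { $fh{$bucket} } $line;
    }
    close $_ for values %fh;

    # Merge the buckets back in key order, then drop the whole temp directory.
    open my $out, '>', 'sorted.txt' or die $!;
    for my $k1 (map { "key$_" } 1 .. 43) {
        for my $k2 (map { "key$_" } 1 .. 21) {
            my $bucket = "$tmpdir/$k1.$k2";
            next unless -e $bucket;
            open my $bfh, '<', $bucket or die $!;
            print {$out} $_ while <$bfh>;
            close $bfh;
        }
    }
    close $out;
    remove_tree($tmpdir);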

    I do not think that any other method can be faster than that.

Re: sorting type question- space problems
by zork42 (Monk) on Sep 15, 2013 at 02:01 UTC
    If you choose a solution that needs to tidy up (delete files, etc.) after it finishes, or dies, then an END block might be useful.

    See BEGIN, UNITCHECK, CHECK, INIT and END
    An END code block is executed as late as possible, that is, after perl has finished running the program and just before the interpreter is being exited, even if it is exiting as a result of a die() function. (But not if it's morphing into another program via exec, or being blown out of the water by a signal--you have to trap that yourself (if you can).) You may have multiple END blocks within a file--they will execute in reverse order of definition; that is: last in, first out (LIFO). END blocks are not executed when you run perl with the -c switch, or if compilation fails.
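    A minimal illustration of that cleanup pattern (the temp file name here is just a placeholder):

    use strict;
    use warnings;

    my @tmp_files;

    END {
        # Runs on normal exit and after die(), but not after exec or an
        # untrapped signal, as the quoted docs note.
        unlink grep { -e $_ } @tmp_files;
    }

    push @tmp_files, 'bucket_key1.tmp';
    open my $fh, '>', $tmp_files[-1] or die $!;
    # ... work that might die() goes here ...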
Re: sorting type question- space problems
by TJPride (Pilgrim) on Sep 14, 2013 at 18:48 UTC
    The best approach is to read records from the file until you hit some size benchmark (whatever fits in memory, including sorting requirements), sort the records, then print them out to a sequentially-numbered file. Once the entire 20 GB are written out this way, probably to like 50-100 files, you then go through the files in pairs and merge using a line-by-line merge routine. Then go through those files in pairs, and so on until you only have one file. If this is something that has to run automatically, just have a cleanup routine run some reasonable amount of time later and wipe the files if there are more than one.
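    For what it's worth, a sketch of the line-by-line merge step for two already-sorted chunk files (file names and the key extraction are illustrative assumptions, not code from this thread):

    use strict;
    use warnings;

    sub sort_key {
        my ($line) = @_;
        my ($k1, $k2) = split /\s+/, $line;
        return "$k1\t$k2";
    }

    sub merge_pair {
        my ($file_a, $file_b, $file_out) = @_;
        open my $fha, '<', $file_a   or die $!;
        open my $fhb, '<', $file_b   or die $!;
        open my $out, '>', $file_out or die $!;

        my ($la, $lb) = (scalar <$fha>, scalar <$fhb>);
        while (defined $la and defined $lb) {
            if (sort_key($la) le sort_key($lb)) {
                print {$out} $la;
                $la = <$fha>;
            }
            else {
                print {$out} $lb;
                $lb = <$fhb>;
            }
        }
        # Drain whichever input still has lines left.
        print {$out} $la, <$fha> if defined $la;
        print {$out} $lb, <$fhb> if defined $lb;
        close $_ for $fha, $fhb, $out;
    }

    # e.g. merge_pair('chunk_000.txt', 'chunk_001.txt', 'merged_000.txt');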

      Your approach is great if a full sort were required. But what I was trying to say in my earlier post above is that the OP does not actually require a full sort, because there are only about 900 sort keys for 5 million records, so simply splitting the data into 900 buckets (files) and then rereading the files in the right order is all that is needed. In short, each piece of data is read and written twice and compared only once. This is basically an O(n) algorithm and will be much faster than any general-purpose sorting algorithm.

        True, if there are only 900 keys.
Re: sorting type question- space problems
by Random_Walk (Prior) on Sep 14, 2013 at 22:18 UTC

    If your keys only go up to 43, how about this pseudocode:

    # open infile;
    # open outfile;
    for my $k_out (1..43) {
        for my $k_in (1..43) {
            read $line from infile until /^$k_out $k_in/;
            write $line to outfile
        }
        set pointer of infile back to 0;
    }
    Clunky and slow, but it will not use much memory or write temp files. A little caching should help a bunch; line starting offsets would be a quick win.

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: sorting type question- space problems (semi-OT)
by ww (Archbishop) on Sep 14, 2013 at 14:14 UTC

    Maybe merely a nitpick, but it sure is nice when the code, data and comments are consistent with themselves (and match the narrative exposition though that's not relevant in this case):

    key1...
    key1...
    key43...
    key1...
    key2....

    - i believe the structure is clear:
    - there are two keys
    - ....

    Maybe you're using zero-based counting?

    perl -E "my @keys = qw/1 2 43/; say $_ for( @keys); say \"count of \$# +keys: $#keys\";" 1 2 43 count of $#keys: 2

    Accuracy in your post makes it easier for us to help!

      You are counting possible values of keys...

Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 02:13 UTC

    Okay, live-ish results:

    Summary:

    • 5M small records sorted in 1.0 minutes on my system
    • RAM usage appeared to peak at about 522 MB.

    Details:

    • 2.78GB sitting at the command prompt
    • 2.78GB running a doNothing.pl script (negligible overhead for Perl)
    • gendat.pl ran 2.5 minutes
      • Generated 5M lines
      • Lengths of lines were randomized from 1-100
      • Key pairs were randomly generated as key1-key43 each.
    • 2.76GB sitting at the command prompt
    • TwoKeySort.pl ran 1.0 minutes
    • 3.27GB peak RAM during TwoKeySort.pl
      • (0.51GB = 0.51 * 1024 = 522.25 MB.)
      • A little larger than anticipated
      • Either my guess at Perl's hash overhead needs to be tripled, or
      • I missed other intermediate storage requirements
    • 2.77GB after perl script ended and about 45 seconds for memory to be reclaimed

      So it takes a little over 500MB on my machine. This is a 64-bit system so if my version of Perl does 64-bit integers this should be fairly indicative. It might need to double otherwise; leaving you with, potentially, a 1GB commit using this technique.

      If 100MB was just a wag, then I would hope you wouldn't balk at using a half GB of RAM to sort a 20GB file using no intermediate disk space.

      Alternatively, you could write the key/key/offset/length values to a file which would be closer to 120MB; sort it, then read it in to guide the binary random I/O process. Just replace 522MB of RAM with a 120MB file, and use an external sort routine.

      The time estimate is not indicative since I had no 5MB lines to read, and I/O is still probably the slowest function on the system (though I haven't really kept up on the industry tech specs, going on assumption here).

      So there you have it. I didn't cache any of the data lines (that was kind of the point of this approach), so this should be a fair representation of space consumed for 5M lines, since the size of the lines doesn't matter.

      If you really have to keep it to 100MB, the 43 x 43 = 1849 passes through the source file might be your best bet. It will be slow, but effective.

      One other possibility, if you really want to over-engineer this thing, is you could segment the work. One approach for this has already been suggested, but given the low memory usage for the hash and array data, you could consider writing just the hash out to intermediate files to be gang-sorted using some kind of segmented iterative approach.

      That would be fun, but almost certainly not worth the effort.

      Have fun!

Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 03:23 UTC

    Okay, I converted it to write the keys to a file rather than collect them in a hash.

    For simplicity I wrote the keys, offsets and lengths as strings; to conserve space on disk (the sort files were about 91MB each) you could probably use pack to get them into a fixed-width format.
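    For illustration, a minimal sketch of that packing idea (the field widths - 8-byte keys, a 64-bit offset and a 32-bit length - are assumptions, and the Q template needs a 64-bit perl):

    use strict;
    use warnings;

    my ($key1, $key2, $offset, $length) = ('key01', 'key43', 123_456_789, 4096);

    my $packed = pack 'A8 A8 Q> L>', $key1, $key2, $offset, $length;   # fixed 28 bytes per entry
    my ($k1, $k2, $off, $len) = unpack 'A8 A8 Q> L>', $packed;
    print "$k1 $k2 $off $len\n";                                       # key01 key43 123456789 4096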

    In any regard, I then split the functions so pre-sort and post-sort activities are in separate scripts. The sort was then run separately as a command line utility.

    The results are favorable:

    TwoKeySortDisk1.pl (extracts keys/offset/length and writes to file)
        Starting RAM:    2.74GB
        Start Time:      05:03:13
        Peak RAM:        2.78GB
        Ending Time:     05:03:32
        Run Time:        19 sec
        Peak RAM Usage:  0.04GB * 1024 = 40MB

    Sort (probably Cygwin, not Windows native)
        Starting RAM:    2.75GB
        Start Time:      05:04:39
        Peak RAM:        3.02GB
        Ending Time:     05:04:49
        Run Time:        10 sec
        Peak RAM Usage:  0.27GB * 1024 = 276MB

    TwoKeySortDisk2.pl (reads sorted keys, reads original file in random mode, writes output text file)
        Starting RAM:    2.73GB
        Start Time:      05:07:56
        Peak RAM:        2.82GB
        Ending Time:     05:08:45
        Run Time:        49 sec
        Peak RAM Usage:  0.09GB * 1024 = 92MB

    Well, at least the Perl portion remains under 100MB. :-)

    Here's the first script, which pulls out the keys/offsets/lengths and writes them for external sorting.

    Here's the second script, which reads the sorted keys/offsets/lengths and performs the actual big data sort as before.

Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 00:28 UTC

    Okay. This won't quite fit in 100MB but it's pretty close (~120MB, revised to ~160MB in the update below, plus general Perl overhead) unless Perl is a lot less efficient in its hash storage than I suspect.

    Update: Correction: The sort on the hash keys will add another array of about 40MB, so we're up to ~160MB plus general overhead. Grrr. If you're really constrained to 100MB, you're gonna' have to go with Random_Walk's iterative crawl. The trade-off is space vs. time. Typical issue. You get to make the call.

    The technique I employed (I believe I saw others suggest it but not sure) was to scan the file once in order to collect keys, offsets, and lengths of the original lines. Then a second pass reads them in order from the original file in binary mode (we used to call this Random IO), and writes them sequentially to the output file. As a side benefit of the approach, the original sequence of full matching keysets is preserved.

    Some potential issues with this approach:

    1. For a 20GB file, this requires 64-bit integers. Dunno how to do that, hope you do.
    2. Commensurate with the 32-bit issue, dunno if your Perl can handle files > 2GB.
    3. The keys in your first and second examples are different; the first example shows key1-key43, but the second example shows key01-key43. So long as key1 and key01 are considered synonymous, then this deeper optimization will work fully.
    4. If the keys are not actually keynn, skip the deep optimization, store the full keys, and consider making a single key by combining the two keys with something unique like a pipe separator to gain some slight optimization (e.g. my $optimizedKey = "$primaryKey\|$secondaryKey";).

    There are other considerations but you sound sharp enough to catch them.

    To try to cram this stuff into as small a space on PerlMonks as possible, I left out all the error checking, so bad opens and bad seeks will send you flying into the pasture without a helmet. 'nuff said.

    Without further ado:

    Results: