Re: sorting type question- space problems
by kcott (Archbishop) on Sep 14, 2013 at 13:21 UTC
G'day baxy77bax,
Without knowing the (average) record size, it's a little hard to estimate whether this solution fits the 100MB memory limit.
If the records are of a similar size to your dummy data, then it probably won't meet that requirement; if they are in the order of hundreds of bytes each, then you may be in with a chance.
Anyway, here's the potential solution [Updated - see below]:
$ perl -Mstrict -Mwarnings -Mautodie -le '
use Tie::File;
tie my @input_records, "Tie::File", "./pm_1054101_input.txt";
open my $out_fh, ">", "./pm_1054101_output.txt";
print $out_fh $input_records[$_] for
map { $_->[0] }
sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
map { [ $_, map { substr $_, 3 } (split " ", $input_records[$_])[0,1] ] }
0 .. $#input_records;
close $out_fh;
untie @input_records;
'
Input file:
$ cat pm_1054101_input.txt
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...
Output file:
$ cat pm_1054101_output.txt
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...
key43 key21 sdkjfhdghdbgbd
File sizes unchanged:
$ ls -l pm_1054101_*
-rw-r--r-- 1 ken staff 159 14 Sep 22:41 pm_1054101_input.txt
-rw-r--r-- 1 ken staff 159 14 Sep 23:04 pm_1054101_output.txt
Update:
I removed an intermediary array which might save some memory.
Here's the original code:
| [reply] [d/l] [select] |
That will never work. It will run the machine out of memory.
It will attempt to build 4 huge lists:
1. a list of integers, one for every record in the file. Input to the first map.
2. a list of anonymous arrays -- each containing two elements -- one for every record in the file. Output from the map, input to sort.
3. Same again, but reordered. Output from sort, input to second map.
4. An ordered list of all the records. Output from second map, input to for.
If the OP's records average 64 bytes, that would require:
- 4 * 20GB = 80GB
- + 6 * 335544320 * 24 bytes (for the scalars) = 45GB
- + 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB
- + lots of other bits and pieces of overhead within Tie::File and Perl's internals stacks = ??
For a grand total memory requirement of at least 197.5 gigabytes.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
| [reply] |
Thank you for this analysis.
I have some points to raise and some questions.
At the outset, I'd like to say that this isn't intended as a nitpicking exercise (although it may look like that in parts); instead, I'm more interested in learning from it.
"It will attempt to build 4 huge lists:"
Possibly wishful thinking on my part, but I had imagined that as each list was used up, its data would no longer persist and the memory it used could be reclaimed.
To clarify, the first map generates a list then passes the elements of that list onto sort which, in turn, generates another list which is passed to the second map, and so on. When, say, the second map is being processed, $_ is an alias into the sort list (so sort's data is still in use); however, at this point, the list generated by the first map is not being used by anything and its memory could, at least in theory, be reused without causing problems.
Anyway, that was my thinking: I haven't seen any documentation indicating whether this is how it works - I could be entirely wrong.
"1. a list of integers, one for every record in the file. Input to the first map."
Quite right. I did think about this when coding it, but clearly misremembered what I'd previously read.
Going back to "perlop: Range Operators", I see "... no temporary array is created when the range operator is used as the expression in foreach loops, ...": I had thought that was more of a general rule for generating lists with the ".." operator (clearly, it isn't).
So in retrospect, "for (0 .. $#input_records) {...}", instead of "... 0 .. $#input_records;", would have removed this overhead and been a better choice.
"2. & 3. a list of anonymous arrays -- each containing two elements ... (map1 to sort and sort to map2)"
Actually, that's three elements: an integer and two strings.
Looking back at it, I can see that changing "substr $_, 3" to "0 + substr $_, 3" would have produced three integers which would've saved some memory.
"4. An ordered list of all the records. Output from second map, input to for."
That's not a list of the records, it's a list of integers (i.e. (0 .. $#input_records) after being sorted).
Possibly what you meant, but I thought I'd just clear that up in case it wasn't.
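For what it's worth, pulling the changes from points 1 to 3 together, the middle of the code might be reworked along these lines. This is only an untested sketch; it reuses @input_records and $out_fh from the code above:
# Untested sketch: a foreach block over the range (so no temporary list of
# indices is built) plus numified keys, so the index holds integers.
my @index;
for (0 .. $#input_records) {
    my @keys = map { 0 + substr $_, 3 } (split " ", $input_records[$_])[0, 1];
    push @index, [ $_, @keys ];
}
print $out_fh $input_records[ $_->[0] ]
    for sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] } @index;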
"If the OPs records average 64 bytes, that would require:"
The OP has indicated "there are abbout 5 mil records (lines) which sum up to 20GB in total." [sic].
I appreciate that was possibly written while you were composing your reply.
Anyway, that gives an average closer to 4,295 bytes (20 * 1024**3 / 5e6);
and your calculated ~335 million records (instead of the given ~5 million records) will mean the results will be out by a couple of orders of magnitude (where that figure is used).
In the dot points that followed (with the various calculations), I have some questions regarding where some of the numbers came from. Where appropriate, if you could point me to documentation (e.g. some section in perlguts), that would be preferable as I'd be interested in reading about general cases rather than just specifics relating to this question: that might also be a lot less typing for you.
- "4 * 20GB = 80GB": What does the 4 refer to? (I've assumed 20GB is the file size.)
- "+ 6 * 335544320 * 24 bytes (for the scalars) = 45GB": I've guessed the 6 is from the earlier numbered points (1+2+2+1); if so, that should be 8 (1+3+3+1). 335,544,320 is the number of records which, as noted above, should be 5,000,000. I thought 24 might be the size of a scalar, but some of those are integers (I thought they would've been 8), so I'm not sure on this (and it would be one I'd be interested in documentation for).
- "+ 335544320 * 232 bytes (for the anonymous arrays) = 72.5GB": 335544320 (number of records, already mentioned). 232 has me stumped (possibly another doco link here).
- "+ lots of other bits and pieces ...": Agreed. Unknown.
| [reply] [d/l] [select] |
First, RichardK's suggestion of using sort(1) makes sense. If it's there, use it: it's written in C, highly optimized, and well tested.
That said, if your keys tend to be small compared to the rest of the records, in-memory sort may be feasible by recording only the keys and the file offsets of their respective lines:
my @a;
open my $fh, "<", "input" or die $!;

# First pass: record the two keys and the byte offset of each line.
until ( eof($fh) ) {
    my $pos  = tell($fh);
    my $line = <$fh>;
    my ($k1, $k2) = split /\s+/, $line;
    push @a, [ $k1, $k2, $pos ];
}

# Second pass: sort the small index, then seek back for each full line.
foreach ( sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } @a ) {
    seek( $fh, $_->[2], 0 );
    print scalar <$fh>;
}
Probably not the fastest, but if you want to avoid external sorts (both in the sense of shelling out and in the sense of temp files), it may be worth a try. At 5M lines it will likely break the 100MB limit, and without tricks such as encoding each pair of keys plus the offset into a single string (which assumes some character or width that never occurs in the keys), it's unlikely to get much smaller.
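For what it's worth, here is a rough sketch of that single-string trick (pack each key pair and offset into one fixed-width scalar, sort the flat list of strings, then unpack the offset). The 10-byte key width and the 64-bit "Q" pack format are assumptions, not something from the thread:
use strict;
use warnings;

open my $fh, '<', 'input' or die $!;
my @index;
while ( !eof($fh) ) {
    my $pos = tell $fh;
    my ($k1, $k2) = split ' ', scalar <$fh>;
    # One flat scalar per record: space-padded keys followed by the offset.
    push @index, pack 'A10 A10 Q', $k1, $k2, $pos;
}

# A plain string sort compares key1 then key2 (the space padding sorts
# before any printable key character), so no comparison sub is needed.
for my $rec ( sort @index ) {
    seek $fh, unpack( 'x20 Q', $rec ), 0;   # skip the 20 key bytes, take the offset
    print scalar <$fh>;
}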
Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(
Actually, it is. Or at least, it may be so ... if the lengths of the keys in your OP are somewhat representative of your real data, and you have a couple of GB of RAM available.
The lengths of the records are immaterial; the 5 million records can be represented in memory by the two keys plus a 64-bit integer offset of the start of each record's position within the file. If the two keys are less than ~8 bytes each, then the total memory requirement for 5 million anonymous arrays, each containing the two keys plus a file offset, is ~1.8GB.
This code builds an index of the two keys + file offset in an AoA, sorts that AoA in memory, and then writes the output file by seeking into the input file, reading the appropriate record and writing it out. For a 5 million record file it takes a little over 2 minutes on my machine:
#! perl -sw
use strict;

open IN, '<', $ARGV[0] or die $!;

## Pass 1: build an index of [ key1, key2, byte offset of the record ].
my @index;
my $pos = 0;
while( <IN> ) {
    my( $k1, $k2 ) = m[^(\S+)\s+(\S+)];
    push @index, [ $k1, $k2, $pos ];
    $pos = tell IN;
}

## Sort the index on both keys.
@index = sort {
    $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1]
} @index;

## Pass 2: re-read the records in sorted order and write them out.
open OUT, '>', 'sorted' or die $!;
for my $r ( @index ) {
    seek IN, $r->[2], 0;
    print OUT scalar <IN>;
}
close OUT;
close IN;
__END__
C:\test>dir unsorted
15/09/2013 14:40 117,501,066 unsorted
C:\test>wc -l unsorted
5000000 unsorted
C:\test>head unsorted
key25 key15 xxxxxxxxxxx
key28 key05 xxxxx
key30 key18 xxxxxxxxxxxxxx
key24 key03 xxxxxxxxx
key41 key01 xxxxxxxxxxxxxx
key12 key16 xxxxxxxxxxxx
key38 key20 xx
key19 key19 xxxxxxxxxxxxxxxxxx
key30 key13 xxxxxxxx
key16 key19 xxxxxxxxxxxxx
[14:41:03.25] C:\test>1054101 unsorted
[14:43:19.59] C:\test>
[14:44:38.83] C:\test>head sorted
key01 key01 xxxxxxxxxxxxx
key01 key01 xxxxxxx
key01 key01 x
key01 key01 xxxxxxxx
key01 key01 xxxxx
key01 key01 xxxxxx
key01 key01 xxxxxxxxxxxxxxxxx
key01 key01 xxxxxx
key01 key01 xx
key01 key01 xxxxxxxxxx
| [reply] [d/l] |
Re: sorting type question- space problems
by hdb (Monsignor) on Sep 14, 2013 at 14:41 UTC
Your "virtual" file could be a directory tree structure. Assuming the keys are valid names for directories, you could do something like this:
use strict;
use warnings;
my $dir = "data.tmp";
mkdir $dir unless -d $dir;
while(<DATA>){
my( $key1, $key2, $data ) = split /\s+/, $_, 3;
mkdir "$dir/$key1" unless -d "$dir/$key1";
mkdir "$dir/$key1/$key2" unless -d "$dir/$key1/$key2";
open my $fh, ">>", "$dir/$key1/$key2/data.txt";
print $fh $data;
close $fh;
}
opendir( my $dh, $dir );
for my $key1 (sort grep { !/^\./ } readdir $dh) {
opendir( my $dh1, "$dir/$key1" );
for my $key2 (sort grep { !/^\./ } readdir $dh1) {
open my $fh, "<", "$dir/$key1/$key2/data.txt";
while( <$fh> ) {
print "$key1\t$key2\t$_";
}
close $fh;
}
}
system( "rm -rf $dir" );
__DATA__
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...
If things go wrong, you only have to delete the one directory... WARNING: No error checking in this code, you need to add that yourself!
| [reply] [d/l] |
Re: sorting type question- space problems
by hdb (Monsignor) on Sep 14, 2013 at 12:36 UTC
You may want to look at some database, e.g. DBD::CSV.
UPDATE: I was thinking of something like this but I do not have a 20GB test file to see how fast it is.
use strict;
use warnings;
use DBI;
my $dbn = DBI->connect ("dbi:CSV:", "", "", { f_dir => "dbdcsv" } );
$dbn->do ("DROP TABLE data");
$dbn->do ("CREATE TABLE data (key1 CHAR (10), key2 CHAR(10), data CHAR
+(255) )");
my $sth = $dbn->prepare( "INSERT INTO data (key1, key2, data) VALUES (
+?, ?, ?)" );
while(<DATA>){
chomp;
$sth->execute( split /\s+/, $_, 3 );
}
$sth = $dbn->prepare ( "SELECT * FROM data ORDER BY key1, key2" );
$sth->execute( );
$"="\t";
while( my $row = $sth->fetchrow_arrayref() ) {
print "@$row\n";
}
__DATA__
key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...
UPDATE 2: I have tested this on some random 20GB test file. I cannot comment on performance but it consumes a LOT of memory...
| [reply] [d/l] |
Re: sorting type question- space problems
by RichardK (Parson) on Sep 14, 2013 at 13:16 UTC
Does your platform have sort available? If so, I'd use that as it will do what you need without having to write any code.
| [reply] |
Re: sorting type question- space problems
by BrowserUk (Patriarch) on Sep 14, 2013 at 13:45 UTC
Since you know what keys you've got, you do not need to create 43 bucket files, just one final output file and do 43 passes.
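One way that might look, sketched below. This is untested and makes assumptions not stated in the thread: the primary keys really are key1 .. key43 as in the example data, the index of (secondary key, offset) pairs for one primary key fits comfortably in memory, and secondary keys are compared as strings as in the other examples here:
use strict;
use warnings;

open my $out, '>', 'sorted' or die $!;
for my $k1 ( 1 .. 43 ) {                 # one pass over the input per primary key
    open my $in, '<', 'unsorted' or die $!;
    my @idx;                             # [ secondary key, byte offset ] for this pass
    my $pos = 0;
    while ( my $line = <$in> ) {
        push @idx, [ $1, $pos ] if $line =~ /^key$k1\s+(\S+)/;
        $pos = tell $in;
    }
    # Order this primary key's records by secondary key, then copy them out.
    for my $r ( sort { $a->[0] cmp $b->[0] } @idx ) {
        seek $in, $r->[1], 0;
        print {$out} scalar <$in>;
    }
    close $in;
}
close $out;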
| [reply] |
Re: sorting type question- space problems
by Laurent_R (Canon) on Sep 14, 2013 at 18:04 UTC
First, whichever way you go for sorting 20 GB of data, you will need quite a bit of disk space for intermediate storage. Even if you use a sort utility provided by your OS, it will create a number of temporary files (probably at least several hundred) on your disk. So the first thing your program should do is to check that there is ample disk space wherever the temp files will go (or you need to check manually before launching the sort). These sort utilities usually take care of removing temporary files when they don't need them anymore, but they might not be able to do it if they crash really badly.
Second, what you describe is not really what I would call sorting, but rather dispatching your data into 43 x 21 buckets (assuming from your example that there are 21 secondary keys) and then merging the buckets in the specific order of the keys, and this can be much faster than actual sorting.
I would suggest that you do create 43 x 21 = 903 files in a temporary directory on disk. You then just read your file once and dispatch the records into the proper files. This will require opening 900+ filehandles. Perl can handle 1000 open filehandles without problem (you'll have to use an array or hash of filehandles), so it should work; if you hit an operating system limit, then you'll have to go for two passes. Then it is just a matter of merging back the files in the appropriate order, and deleting them as soon as you no longer need them. If it crashes, all your files are in the same temp directory, so it's no big deal to get rid of them.
I do not think that any other method can be faster than that.
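A rough sketch of what that two-pass dispatch-and-merge might look like. The filenames, the key-to-filename mapping, and the zero-padded key format (so that a plain string sort of the bucket names gives the right order) are all assumptions, not taken from the thread:
use strict;
use warnings;
use File::Path qw(make_path remove_tree);

my $tmp = 'buckets.tmp';
make_path($tmp);

# Pass 1: dispatch each record into one bucket file per (key1, key2) pair,
# keeping all ~900 filehandles open in a hash.
my %fh;
open my $in, '<', 'input' or die $!;
while ( my $line = <$in> ) {
    my ($k1, $k2) = split ' ', $line;
    my $bucket = "$tmp/$k1-$k2";
    unless ( $fh{$bucket} ) {
        open $fh{$bucket}, '>>', $bucket or die "$bucket: $!";
    }
    print { $fh{$bucket} } $line;
}
close $in;
close $_ for values %fh;

# Pass 2: concatenate the buckets back in key order, then clean up.
open my $out, '>', 'sorted' or die $!;
for my $bucket ( sort glob "$tmp/*" ) {
    open my $bfh, '<', $bucket or die "$bucket: $!";
    print {$out} $_ while <$bfh>;
    close $bfh;
}
close $out;
remove_tree($tmp);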
| [reply] |
Re: sorting type question- space problems
by zork42 (Monk) on Sep 15, 2013 at 02:01 UTC
If you choose a solution that needs to tidy up (delete files, etc.) after it finishes or dies, then an END block might be useful.
See "BEGIN, UNITCHECK, CHECK, INIT and END" in perlmod:
An END code block is executed as late as possible, that is, after perl has finished running the program and just before the interpreter is being exited, even if it is exiting as a result of a die() function. (But not if it's morphing into another program via exec, or being blown out of the water by a signal--you have to trap that yourself (if you can).) You may have multiple END blocks within a file--they will execute in reverse order of definition; that is: last in, first out (LIFO). END blocks are not executed when you run perl with the -c switch, or if compilation fails.
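For instance (the temp-directory name here is made up, just to show the shape):
use strict;
use warnings;
use File::Path qw(remove_tree);

my $tmpdir = "sort_tmp.$$";    # hypothetical per-run scratch directory

END {
    # Runs on normal exit and after die(), but not after exec or an
    # untrapped signal, as the perlmod text above notes.
    remove_tree($tmpdir) if defined $tmpdir && -d $tmpdir;
}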
| [reply] [d/l] |
Re: sorting type question- space problems
by TJPride (Pilgrim) on Sep 14, 2013 at 18:48 UTC
The best approach is to read records from the file until you hit some size benchmark (whatever fits in memory, including sorting overhead), sort those records, then print them out to a sequentially numbered file. Once the entire 20GB has been written out this way, probably to 50-100 files, you then go through the files in pairs and merge them using a line-by-line merge routine. Then go through those files in pairs, and so on, until you only have one file. If this is something that has to run automatically, just have a cleanup routine run some reasonable amount of time later and wipe the files if there is more than one.
| [reply] |
Your approach would be great if a full sort were required. But what I was trying to say in my earlier post above is that the OP does not actually require a full sort, because there are only about 900 distinct key pairs for 5 million records. So simply splitting the data into 900 buckets (files) and then rereading the files in the right order is all that is needed. In short, each piece of data is read and written twice and compared only once. This is basically an O(n) algorithm and will be much faster than any general-purpose sorting algorithm.
| [reply] |
True, if there are only 900 keys.
| [reply] |
Re: sorting type question- space problems
by Random_Walk (Prior) on Sep 14, 2013 at 22:18 UTC
# Runnable version of the sketch: assumes keys of the form "keyN" as in
# the example data, and rescans the input once per key pair (43 x 43 passes).
use strict;
use warnings;

open my $in,  '<', 'infile'  or die $!;
open my $out, '>', 'outfile' or die $!;
for my $k_out ( 1 .. 43 ) {
    for my $k_in ( 1 .. 43 ) {
        seek $in, 0, 0;                    # back to the start for each pair
        while ( my $line = <$in> ) {
            print {$out} $line if $line =~ /^key$k_out\s+key$k_in\s/;
        }
    }
}
Clunky and slow, but it will not use much memory or write temp files. A little caching should help a bunch; line-start offsets would be a quick win.
Cheers, R.
Pereant, qui ante nos nostra dixerunt!
| [reply] [d/l] |
Re: sorting type question- space problems (semi-OT)
by ww (Archbishop) on Sep 14, 2013 at 14:14 UTC
Maybe merely a nitpick, but it sure is nice when the code, data and comments are consistent with one another (and match the narrative exposition, though that's not relevant in this case):
key1...
key1...
key43...
key1...
key2....
- i believe the structure is clear:
- there are two keys
- ....
Maybe you're using zero-based counting?
perl -E "my @keys = qw/1 2 43/; say $_ for( @keys); say \"count of \$#
+keys: $#keys\";"
1
2
43
count of $#keys: 2
Accuracy in your post makes it easier for us to help!
| [reply] [d/l] [select] |
Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 02:13 UTC
Okay, live-ish results:
Summary:
- 5M small records sorted in 1.0 minutes on my system
- RAM usage appeared to peak at about 522 MB.
Details:
- 2.78GB sitting at the command prompt
- 2.78GB running a doNothing.pl script (negligible overhead for Perl)
- gendat.pl ran 2.5 minutes
- Generated 5M lines
- Lengths of lines were randomized from 1-100
- Key pairs were randomly generated as key1-key43 each.
- 2.76GB sitting at the command prompt
- TwoKeySort.pl ran 1.0 minutes
- 3.27GB peak RAM during TwoKeySort.pl
- (0.51GB = 0.51 * 1024 = 522.25 MB.)
- A little larger than anticipated
- Either my guess at Perl's hash overhead needs to be tripled, or
- I missed other intermediate storage requirements
- 2.77GB after perl script ended and about 45 seconds for memory to be reclaimed
So it takes a little over 500MB on my machine. This is a 64-bit system so if my version of Perl does 64-bit integers this should be fairly indicative. It might need to double otherwise; leaving you with, potentially, a 1GB commit using this technique.
If 100MB was just a wag, then I would hope you wouldn't balk at using a half GB of RAM to sort a 20GB file using no intermediate disk space.
Alternatively, you could write the key/key/offset/length values to a file which would be closer to 120MB; sort it, then read it in to guide the binary random I/O process. Just replace 522MB of RAM with a 120MB file, and use an external sort routine.
The time estimate is not indicative since I had no 5MB lines to read, and I/O is still probably the slowest function on the system (though I haven't really kept up on the industry tech specs, going on assumption here).
So there you have it. I didn't cache any of the data lines (that was kind of the point of this approach) so this should be a fair representation of space consumed for 5M lines, since the size of the lines don't matter.
If you really have to keep it to 100MB, the 43 x 43 = 1849 passes through the source file might be your best bet. It will be slow, but effective.
One other possibility, if you really want to over-engineer this thing, is you could segment the work. One approach for this has already been suggested, but given the low memory usage for the hash and array data, you could consider writing just the hash out to intermediate files to be gang-sorted using some kind of segmented iterative approach.
That would be fun, but almost certainly not worth the effort.
Have fun!
| [reply] |
Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 03:23 UTC
Okay, I converted it to write the keys to a file rather than collect them in a hash.
For simplicity I wrote the keys, offsets and lengths as strings; to conserve space on disk (the sort files were about 91MB each) you could probably use pack to get them into a fixed box.
In any regard, I then split the functions so pre-sort and post-sort activities are in separate scripts. The sort was then run separately as a command line utility.
The results are favorable:
TwoKeySortDisk1.pl (extracts keys/offset/length and writes to file)
Starting RAM: 2.74GB
Start Time: 05:03:13
Peak RAM: 2.78GB
Ending Time: 05:03:32
Run Time: 19 sec
Peak RAM Usage: 0.04GB * 1024 = 40MB
Sort (probably Cygwin, not Windows native)
Starting RAM: 2.75GB
Start Time: 05:04:39
Peak RAM: 3.02GB
Ending Time: 05:04:49
Run Time: 10 sec
Peak RAM Usage: 0.27GB * 1024 = 276MB
TwoKeySortDisk2.pl (reads sorted keys, reads original file in random mode, writes output text file)
Starting RAM: 2.73GB
Start Time: 05:07:56
Peak RAM: 2.82GB
Ending Time: 05:08:45
Run Time: 49 sec
Peak RAM Usage: 0.09GB * 1024 = 92MB
Well, at least the Perl portion remains under 100MB. :-)
Here's the first script, which pulls out the keys/offsets/lengths and writes them for external sorting.
Here's the second script, which reads the sorted keys/offsets/lengths and performs the actual big data sort as before.
| [reply] [d/l] [select] |
Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 00:28 UTC
Okay. This won't quite fit in 100MB but it's pretty close (~120MB by my first estimate, revised to ~160MB in the update below, plus general Perl overhead) unless Perl is a lot less efficient with its hash storage than I suspect.
Update: Correction: The sort on the hash keys will add another array of about 40MB, so we're up to ~160MB plus general overhead. Grrr. If you're really constrained to 100MB, you're gonna' have to go with Random_Walk's iterative crawl. The trade-off is space vs. time. Typical issue. You get to make the call.
The technique I employed (I believe I saw others suggest it but not sure) was to scan the file once in order to collect keys, offsets, and lengths of the original lines. Then a second pass reads them in order from the original file in binary mode (we used to call this Random IO), and writes them sequentially to the output file. As a side benefit of the approach, the original sequence of full matching keysets is preserved.
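Not the actual script used here, but a bare-bones illustration of that two-pass idea, with made-up filenames, no error checking, keys compared as strings (fine for the zero-padded key01 .. key43 style), and Unix-style line endings assumed so the byte offsets and lengths line up:
use strict;
use warnings;

# Pass 1: index every line as keys => [ byte offset, byte length ].
open my $in, '<', 'unsorted' or die $!;
my %index;                          # "key1|key2" => list of [offset, length]
my $pos = 0;
while ( my $line = <$in> ) {
    my ($k1, $k2) = split ' ', $line;
    push @{ $index{"$k1|$k2"} }, [ $pos, length $line ];
    $pos = tell $in;
}

# Pass 2: random reads in key order, sequential writes; records with the
# same key pair keep their original order.
open my $out, '>', 'sorted' or die $!;
for my $key ( sort keys %index ) {
    for my $rec ( @{ $index{$key} } ) {
        seek $in, $rec->[0], 0;
        read $in, my $buf, $rec->[1];
        print {$out} $buf;
    }
}
close $out;
close $in;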
Some potential issues with this approach:
- For a 20GB file, this requires 64-bit integers. Dunno how to do that, hope you do.
- Related to the 32-bit issue: dunno if your Perl can handle files > 2GB.
- The keys in your first and second examples are different; the first example shows key1-key43, but the second example shows key01-key43. So long as key1 and key01 are considered synonymous, then this deeper optimization will work fully.
- If the keys are not actually keynn, skip the deep optimization, store the full keys, consider making a single key by combining the two keys with something unique like a pipe separator to gain some slight optimization (e.g. my $optimizedKey = "$primaryKey\|$secondaryKey";).
There are other considerations but you sound sharp enough to catch them.
To try to cram this stuff into as small a space on PerlMonks as possible, I left out all the error checking, so bad opens and bad seeks will send you flying into the pasture without a helmet. 'nuff said.
Without further ado:
Results:
| [reply] [d/l] [select] |