baxy77bax has asked for the wisdom of the Perl Monks concerning the following question:

Hi

Well I need some advice and pointers. Let me first explain my situation. I have this huge (ca. 20GB) text file that looks like this:

key1 key2 ndnjfgdsjfjjkjjfjf...
key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
key43 key21 sdkjfhdghdbgbd
key1 key3 jujdejnsduhffnjj
key2 key2 jhzezhdjjf...

i believe the structure is clear:
- there are two keys
- keys can be repeated
What I need to do is sort my text file first according to the key in the first column, so that all records having key1 come first, followed by key2, up to key43. Next, the records inside each bucket need to be sorted again according to the second key column. (There are only two columns, that is, two keys.) Now, the fastest way I can imagine is to create 43 bucket files and then just iterate through the main file, printing records into the appropriate bucket. Once done, repeat the process inside each bucket. Afterwards, join the files and delete the unnecessary buckets (files).

The downside is that if the sort is interrupted, my temporary bucket files remain on disk and have to be removed by hand. Alternatively, I could intercept SIGINT and delete the buckets before the program terminates.
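Something like this minimal sketch is what I have in mind for the SIGINT cleanup (untested; the bucket file names are just placeholders):

    use strict;
    use warnings;

    # Placeholder bucket names -- adjust to however the buckets are actually created.
    my @buckets = map { "bucket_key$_.tmp" } 1 .. 43;

    # On Ctrl-C, remove whatever temporary buckets exist, then bail out.
    $SIG{INT} = sub {
        unlink grep { -e $_ } @buckets;
        die "Interrupted -- temporary bucket files removed\n";
    };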

What I came here to ask is: does someone have a better solution, one that is faster, does not consume a lot of memory (100MB tops), and does not create this file mess on my disk?

Any comment is more than welcome.

Thank you

baxy

UPDATE:

To keep the "file explosion" under control, would it be possible to create a "virtual file" - one file, but divided into sections - and then print into the different sections (something like using fseek in C)? Would that be advisable?
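For illustration, this is roughly what I mean (untested sketch; the 512MB section size and file name are made-up assumptions - in Perl, seek/tell play the role of fseek/ftell):

    use strict;
    use warnings;

    # Assumed layout: one scratch file carved into 43 fixed-size sections, one per primary key.
    my $section_size = 512 * 1024 * 1024;

    open my $scratch, '+>', 'scratch.tmp' or die $!;
    my @next_free = map { $_ * $section_size } 0 .. 42;    # next write offset per section

    sub append_to_section {
        my ($key_no, $record) = @_;                        # $key_no is 1 .. 43
        seek $scratch, $next_free[ $key_no - 1 ], 0        # like fseek(fp, off, SEEK_SET)
            or die "seek failed: $!";
        print {$scratch} $record;
        $next_free[ $key_no - 1 ] = tell $scratch;
    }

The obvious catch is that the section size has to be chosen in advance and a section overflows as soon as one key gets more than its share of the data.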

Replies are listed 'Best First'.
Re: sorting type question- space problems
by kcott (Archbishop) on Sep 14, 2013 at 13:21 UTC

    G'day baxy77bax,

    Without knowing the (average) record size, it's a little hard to estimate whether this solution fits the 100MB memory limit. If the records are of a similar size to your dummy data, then it probably won't meet that requirement; if they are in the order of hundreds of bytes each, then you may be in with a chance.

    Anyway, here's the potential solution [Updated - see below]:

    $ perl -Mstrict -Mwarnings -Mautodie -le '
        use Tie::File;
        tie my @input_records, "Tie::File", "./pm_1054101_input.txt";
        open my $out_fh, ">", "./pm_1054101_output.txt";
        print $out_fh $input_records[$_]
            for map  { $_->[0] }
                sort { $a->[1] <=> $b->[1] || $a->[2] <=> $b->[2] }
                map  { [ $_, map { substr $_, 3 } (split " ", $input_records[$_])[0,1] ] }
                0 .. $#input_records;
        close $out_fh;
        untie @input_records;
    '

    Input file:

    $ cat pm_1054101_input.txt
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key43 key21 sdkjfhdghdbgbd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...

    Output file:

    $ cat pm_1054101_output.txt
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...
    key43 key21 sdkjfhdghdbgbd

    File sizes unchanged:

    $ ls -l pm_1054101_*
    -rw-r--r--  1 ken  staff  159 14 Sep 22:41 pm_1054101_input.txt
    -rw-r--r--  1 ken  staff  159 14 Sep 23:04 pm_1054101_output.txt

    Update: I removed an intermediary array which might save some memory. Here's the original code:

    -- Ken

      That will never work. It will run the machine out of memory.

      It will attempt to build 4 huge lists:

      1. a list of integers, one for every record in the file.

        Input to the first map.

      2. a list of anonymous arrays -- each containing two elements -- one for every record in the file.

        Output from the map, input to sort.

      3. Same again, but reordered.

        Output from sort, input to second map.

      4. An ordered list of all the records.

        Output from second map, input to for.

      If the OP's records average 64 bytes, that would require:

      • 4 * 20GB = 80GB
      • + 6 * 335544320 * 24 bytes (for the scalars) = 45GB
      • + 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB
      • + lots of other bits and pieces of overhead within Tie::File and Perl's internals stacks = ??

      For a grand total memory requirement of at least 197.5 Gigabytes


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Thank you for this analysis. I have some points to raise and some questions. At the outset, I'd like to say that this isn't intended as a nitpicking exercise (although it may look like that in parts); instead, I'm more interested in learning from it.

        "It will attempt to build 4 huge lists:"

        Possibly wishful thinking on my part, but I had imagined that as each list was used up, its data would no longer persist and the memory it used could be reclaimed.

        To clarify, the first map generates a list then passes the elements of that list onto sort which, in turn, generates another list which is passed to the second map, and so on. When, say, the second map is being processed, $_ is an alias into the sort list (so sort's data is still in use); however, at this point, the list generated by the first map is not being used by anything and its memory could, at least in theory, be reused without causing problems. Anyway, that was my thinking: I haven't seen any documentation indicating whether this is how it works - I could be entirely wrong.

        "1. a list of integers, one for every record in the file. Input to the first map."

        Quite right. I did think about this when coding it, but clearly misremembered what I'd previously read. Going back to "perlop: Range Operators", I see "... no temporary array is created when the range operator is used as the expression in foreach loops, ...": I had thought that was more of a general rule for generating lists with the ".." operator (clearly, it isn't). So in retrospect, "for (0 .. $#input_records) {...}", instead of "... 0 .. $#input_records;", would have removed this overhead and been a better choice.

        "2. & 3. a list of anonymous arrays -- each containing two elements ... (map1 to sort and sort to map2)"

        Actually, that's three elements (numbers): integer, string, string. Looking back at it, I can see that changing "substr $_, 3" to "0 + substr $_, 3" would have produced three integers which would've saved some memory.

        "4. An ordered list of all the records. Output from second map, input to for."

        That's not a list of the records, it's a list of integers (i.e. (0 .. $#input_records) after being sorted). Possibly what you meant, but I thought I'd just clear that up in case it wasn't.

        "If the OPs records average 64 bytes, that would require:"

        The OP has indicated "there are abbout 5 mil records (lines) which sum up to 20GB in total." [sic]. I appreciate that was possibly written while you were composing your reply. Anyway, that gives an average closer to 4,295 bytes (20 * 1024**3 / 5e6); and your calculated ~335 million records (instead of the given ~5 million records) will mean the results will be out by a couple of orders of magnitude (where that figure is used).

        In the dot points that followed (with the various calculations), I have some questions regarding where some of the numbers came from. Where appropriate, if you could point me to documentation (e.g. some section in perlguts), that would be preferable as I'd be interested in reading about general cases rather than just specifics relating to this question: that might also be a lot less typing for you.

        • "4 * 20GB = 80GB": What does the 4 refer to? (I've assumed 20GB is the file size.)
        • "+ 6 * 335544320 * 24 bytes (for the scalars) = 45GB": I've guessed the 6 is from the earlier numbered points (1+2+2+1); if so, that should be 8 (1+3+3+1). 335,544,320 is the number of records which, as noted above, should be 5,000,000. I thought 24 might be the size of a scalar, but some of those are integers (I thought they would've been 8), so I'm not sure on this (and it would be one I'd be interested in documentation for).
        • "+ 335544320 * 232 bytes (for the anonymous arrays ) = 72.5GB": 335544320 (number of records, already mentioned). 232 has me stumped (possibly another doco link here).
        • "+ lots of other bits and pieces ...": Agreed. Unknown.

        -- Ken

      Nice solution , thnx!!

      However, the records are not <= 100 bytes they are much larger. there are abbout 5 mil records (lines) which sum up to 20GB in total. Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(

        First, RichardK's solution of using sort(1) makes sense. If it's there, use it, as it's written in C, highly optimized and well tested.

        That said, if your keys tend to be small compared to the rest of the records, in-memory sort may be feasible by recording only the keys and the file offsets of their respective lines:

        my @a;
        open my $fh, "<", "input" or die $!;
        while (1) {
            last if eof($fh);
            my $pos = tell($fh);
            my ($k1, $k2) = split /\s+/, <$fh>;
            push @a, [$k1, $k2, $pos];
        }
        foreach (sort { $a->[0] cmp $b->[0] or $a->[1] cmp $b->[1] } @a) {
            seek($fh, $_->[2], 0);
            print scalar <$fh>;
        }

        Probably not the fastest, but if you want to avoid external sorts both in the sense of shelling out and tempfiles, it may be worth a try. At 5M lines it will likely break the 100 MB limit but without tricks like assuming something about characters that don't occur in the keys and thus encoding all the keys and offsets in one string it's unlikely to get much smaller.
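        For what it's worth, one way to read that last suggestion is to keep one packed string per record instead of a three-element arrayref. A rough, untested sketch (the 16-byte key width is an assumption, and the Q template needs a 64-bit perl):

        use strict;
        use warnings;

        open my $fh, '<', 'input' or die $!;

        my @idx;
        until (eof $fh) {
            my $pos  = tell $fh;
            my $line = <$fh>;
            my ($k1, $k2) = split /\s+/, $line;
            # Fixed-width packing means a plain string sort orders by key1, then key2.
            push @idx, pack 'A16 A16 Q>', $k1, $k2, $pos;
        }

        for my $rec (sort @idx) {
            my $pos = unpack 'x32 Q>', $rec;    # skip the two 16-byte keys
            seek $fh, $pos, 0 or die $!;
            print scalar <$fh>;
        }

        Like the cmp-based code above, this compares the keys as strings, so key10 would land before key2 unless the keys are zero-padded.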

        Some records are big > 5MB and some are small (5 bytes) so in-memory sort is not an option :(

        Actually, it is. Or at least, it may be so ... if the lengths of the keys in your OP are somewhat representative of your real data, and you have a couple of GB of RAM available.

        The lengths of the records are immaterial; the 5 million records can be represented in memory by the two keys + a (64-bit) integer offset of the start of each record's position within the file. If the two keys are less than ~8 bytes each, then the total memory requirement for 5 million anonymous arrays, each containing 2 keys + a file offset, is ~1.8GB.

        This code builds an index of the two keys + file offset in an AoA; sorts that AoA in memory and then writes the output file by seeking the input file, reading the appropriate record and writing it to the output file. For a 5 million record file it takes a little over 2 minutes on my machine:

        #! perl -sw
        use strict;

        open IN, '<', $ARGV[0] or die $!;
        my @index;
        my $pos = 0;
        while( <IN> ) {
            my( $k1, $k2 ) = m[(^\S+)\s+(\S+)];
            push @index, [ $k1, $k2, $pos ];
            $pos = tell IN;
        }
        @index = sort{ $a->[0] cmp $b->[0] || $a->[1] cmp $b->[1] } @index;
        open OUT, '>', 'sorted';
        for my $r ( @index ) {
            seek IN, $r->[2], 0;
            print OUT scalar <IN>;
        }
        close OUT;
        close IN;
        __END__
        C:\test>dir unsorted
        15/09/2013  14:40       117,501,066 unsorted

        C:\test>wc -l unsorted
        5000000 unsorted

        C:\test>head unsorted
        key25 key15 xxxxxxxxxxx
        key28 key05 xxxxx
        key30 key18 xxxxxxxxxxxxxx
        key24 key03 xxxxxxxxx
        key41 key01 xxxxxxxxxxxxxx
        key12 key16 xxxxxxxxxxxx
        key38 key20 xx
        key19 key19 xxxxxxxxxxxxxxxxxx
        key30 key13 xxxxxxxx
        key16 key19 xxxxxxxxxxxxx

        [14:41:03.25] C:\test>1054101 unsorted
        [14:43:19.59] C:\test>

        [14:44:38.83] C:\test>head sorted
        key01 key01 xxxxxxxxxxxxx
        key01 key01 xxxxxxx
        key01 key01 x
        key01 key01 xxxxxxxx
        key01 key01 xxxxx
        key01 key01 xxxxxx
        key01 key01 xxxxxxxxxxxxxxxxx
        key01 key01 xxxxxx
        key01 key01 xx
        key01 key01 xxxxxxxxxx

        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: sorting type question- space problems
by hdb (Monsignor) on Sep 14, 2013 at 14:41 UTC

    Your "virtual" file could be a directory tree structure. Assuming the keys are valid names for directories, you could do something like this:

    use strict;
    use warnings;

    my $dir = "data.tmp";
    mkdir $dir unless -d $dir;

    while(<DATA>){
        my( $key1, $key2, $data ) = split /\s+/, $_, 3;
        mkdir "$dir/$key1" unless -d "$dir/$key1";
        mkdir "$dir/$key1/$key2" unless -d "$dir/$key1/$key2";
        open my $fh, ">>", "$dir/$key1/$key2/data.txt";
        print $fh $data;
        close $fh;
    }

    opendir( my $dh, $dir );
    for my $key1 (sort grep { !/^\./ } readdir $dh) {
        opendir( my $dh1, "$dir/$key1" );
        for my $key2 (sort grep { !/^\./ } readdir $dh1) {
            open my $fh, "<", "$dir/$key1/$key2/data.txt";
            while( <$fh> ) {
                print "$key1\t$key2\t$_";
            }
            close $fh;
        }
    }

    system( "rm -rf $dir" );

    __DATA__
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key43 key21 sdkjfhdghdbgbd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...

    If things go wrong, you only have to delete the one directory... WARNING: No error checking in this code, you need to add that yourself!

Re: sorting type question- space problems
by hdb (Monsignor) on Sep 14, 2013 at 12:36 UTC

    You may want to look at some database, e.g. DBD::CSV.

    UPDATE: I was thinking of something like this but I do not have a 20GB test file to see how fast it is.

    use strict;
    use warnings;
    use DBI;

    my $dbn = DBI->connect ("dbi:CSV:", "", "", { f_dir => "dbdcsv" } );

    $dbn->do ("DROP TABLE data");
    $dbn->do ("CREATE TABLE data (key1 CHAR (10), key2 CHAR(10), data CHAR(255) )");

    my $sth = $dbn->prepare( "INSERT INTO data (key1, key2, data) VALUES (?, ?, ?)" );
    while(<DATA>){
        chomp;
        $sth->execute( split /\s+/, $_, 3 );
    }

    $sth = $dbn->prepare ( "SELECT * FROM data ORDER BY key1, key2" );
    $sth->execute( );
    $"="\t";
    while( my $row = $sth->fetchrow_arrayref() ) {
        print "@$row\n";
    }

    __DATA__
    key1 key2 ndnjfgdsjfjjkjjfjf...
    key1 key2 kdfkjdfgdfugbjndkfgkjgndkjfjkd
    key43 key21 sdkjfhdghdbgbd
    key1 key3 jujdejnsduhffnjj
    key2 key2 jhzezhdjjf...

    UPDATE 2: I have tested this on some random 20GB test file. I cannot comment on performance but it consumes a LOT of memory...

Re: sorting type question- space problems
by RichardK (Parson) on Sep 14, 2013 at 13:16 UTC

    Does your platform have sort available? If so, I'd use that as it will do what you need without having to write any code.

Re: sorting type question- space problems
by BrowserUk (Patriarch) on Sep 14, 2013 at 13:45 UTC

    Since you know what keys you've got, you do not need to create 43 bucket files, just one final output file and do 43 passes.
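    A rough, untested sketch of that multi-pass idea (file names are placeholders, and this covers the first key only - each key1 group would still need a second-key pass or an in-memory sort):

    use strict;
    use warnings;

    open my $out, '>', 'sorted.txt' or die $!;
    for my $k (1 .. 43) {
        # One full pass over the input per key1 value; append only matching records.
        open my $in, '<', 'input.txt' or die $!;
        while (<$in>) {
            print {$out} $_ if /^key$k\s/;
        }
        close $in;
    }
    close $out;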


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: sorting type question- space problems
by Laurent_R (Canon) on Sep 14, 2013 at 18:04 UTC

    First, whichever way you go for sorting 20 GB of data, you will need quite a bit of disk space for intermediate storage. Even if you use a sort utility provided by your OS, it will create a number of temporary files (probably at least several hundred) on your disk. So the first thing your program should do is to check that there is ample disk space wherever the temp files will go (or you need to check manually before launching the sort). These sort utilities usually take care of removing temporary files when they don't need them anymore, but they might not be able to do it if they crash really badly.

    Second, what you describe is not really what I would call sorting, but rather dispatching your data into 43 x 21 buckets (assuming from your example that there are 21 secondary keys) and then merging the buckets in the specific order of the keys, and this can be much faster than actual sorting.

    I would suggest that you do create 43 x 21 = 903 files in a temporary directory on disk. You then just read your file once and dispatch the records into the proper files. This requires opening 900+ filehandles. Perl can handle 1000 open filehandles without problem (you'll have to use an array or hash of filehandles), so it should work; if you hit an operating system limit, then you'll have to go for two passes. Then it is just a matter of merging the files back in the adequate order, and deleting them as soon as you no longer need them. If it crashes, all your files are in the same temp directory, so it is no big deal to get rid of them.
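    A rough, untested sketch of that dispatch-and-merge idea (the file names and the key ranges 1..43 and 1..21 are assumptions taken from the example data):

    use strict;
    use warnings;
    use File::Path qw(make_path remove_tree);

    my $tmpdir = 'sort_buckets.tmp';
    make_path($tmpdir);

    my %fh;                                        # one filehandle per (key1, key2) pair
    open my $in, '<', 'input.txt' or die $!;
    while (my $line = <$in>) {
        my ($k1, $k2) = split /\s+/, $line;
        my $bucket = "$tmpdir/$k1.$k2";
        unless ($fh{$bucket}) {
            open $fh{$bucket}, '>>', $bucket or die $!;
        }
        print { $fh{$bucket} } $line;
    }
    close $_ for values %fh;

    # Merge the buckets back in key order, then drop the whole temp directory.
    open my $out, '>', 'sorted.txt' or die $!;
    for my $k1 (map { "key$_" } 1 .. 43) {
        for my $k2 (map { "key$_" } 1 .. 21) {
            my $bucket = "$tmpdir/$k1.$k2";
            next unless -e $bucket;
            open my $bfh, '<', $bucket or die $!;
            print {$out} $_ while <$bfh>;
            close $bfh;
        }
    }
    close $out;
    remove_tree($tmpdir);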

    I do not think that any other method can be faster than that.

Re: sorting type question- space problems
by zork42 (Monk) on Sep 15, 2013 at 02:01 UTC
    If you choose a solution that needs to tidy up (delete files, etc.) after it finishes, or dies, then an END block might be useful.

    See BEGIN, UNITCHECK, CHECK, INIT and END
    An END code block is executed as late as possible, that is, after perl has finished running the program and just before the interpreter is being exited, even if it is exiting as a result of a die() function. (But not if it's morphing into another program via exec, or being blown out of the water by a signal--you have to trap that yourself (if you can).) You may have multiple END blocks within a file--they will execute in reverse order of definition; that is: last in, first out (LIFO). END blocks are not executed when you run perl with the -c switch, or if compilation fails.
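    A minimal illustration of that cleanup pattern (the temp file name here is just a placeholder):

    use strict;
    use warnings;

    my @tmp_files;

    END {
        # Runs on normal exit and after die(), but not after exec or an
        # untrapped signal, as the quoted docs note.
        unlink grep { -e $_ } @tmp_files;
    }

    push @tmp_files, 'bucket_key1.tmp';
    open my $fh, '>', $tmp_files[-1] or die $!;
    # ... work that might die() goes here ...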
Re: sorting type question- space problems
by TJPride (Pilgrim) on Sep 14, 2013 at 18:48 UTC
    The best approach is to read records from the file until you hit some size benchmark (whatever fits in memory, including sorting requirements), sort the records, then print them out to a sequentially-numbered file. Once the entire 20 GB are written out this way, probably to like 50-100 files, you then go through the files in pairs and merge using a line-by-line merge routine. Then go through those files in pairs, and so on until you only have one file. If this is something that has to run automatically, just have a cleanup routine run some reasonable amount of time later and wipe the files if there are more than one.
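    For what it's worth, a sketch of the line-by-line merge step for two already-sorted chunk files (file names and the key extraction are illustrative assumptions, not code from this thread):

    use strict;
    use warnings;

    sub sort_key {
        my ($line) = @_;
        my ($k1, $k2) = split /\s+/, $line;
        return "$k1\t$k2";
    }

    sub merge_pair {
        my ($file_a, $file_b, $file_out) = @_;
        open my $fha, '<', $file_a   or die $!;
        open my $fhb, '<', $file_b   or die $!;
        open my $out, '>', $file_out or die $!;

        my ($la, $lb) = (scalar <$fha>, scalar <$fhb>);
        while (defined $la and defined $lb) {
            if (sort_key($la) le sort_key($lb)) {
                print {$out} $la;
                $la = <$fha>;
            }
            else {
                print {$out} $lb;
                $lb = <$fhb>;
            }
        }
        # Drain whichever input still has lines left.
        print {$out} $la, <$fha> if defined $la;
        print {$out} $lb, <$fhb> if defined $lb;
        close $_ for $fha, $fhb, $out;
    }

    # e.g. merge_pair('chunk_000.txt', 'chunk_001.txt', 'merged_000.txt');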

      Your approach is great if a full sort were required. But what I was trying to say in my earlier post above is that the OP does not actually require a full sort, because there are only about 900 sort keys for 5 million records, so simply splitting the data into 900 buckets (files) and then rereading the files in the right order is all that is needed. In short, each piece of data is read and written twice and compared only once. This is basically an O(n) algorithm and will be much faster than any general-purpose sorting algorithm.

        True, if there are only 900 keys.
Re: sorting type question- space problems
by Random_Walk (Prior) on Sep 14, 2013 at 22:18 UTC

    If your keys only go up to 43, how about this pseudocode:

    # open infile;
    # open outfile;
    for my $k_out (1..43) {
        for my $k_in (1..43) {
            read $line from infile until /^$k_out $k_in/;
            write $line to outfile
        }
        set pointer of infile back to 0;
    }
    Clunky and slow, but it will not use much memory or write temp files. A little caching should help a bunch; line starting offsets would be a quick win.

    Cheers,
    R.

    Pereant, qui ante nos nostra dixerunt!
Re: sorting type question- space problems (semi-OT)
by ww (Archbishop) on Sep 14, 2013 at 14:14 UTC

    Maybe merely a nitpick, but it sure is nice when the code, data and comments are consistent with themselves (and match the narrative exposition though that's not relevant in this case):

    key1...
    key1...
    key43...
    key1...
    key2....

    - i believe the structure is clear:
    - there are two keys
    - ....

    Maybe you're using zero-based counting?

    perl -E "my @keys = qw/1 2 43/; say $_ for( @keys); say \"count of \$# +keys: $#keys\";" 1 2 43 count of $#keys: 2

    Accuracy in your post makes it easier for us to help!

      You are counting possible values of keys...

Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 02:13 UTC

    Okay, live-ish results:

    Summary:

    • 5M small records sorted in 1.0 minutes on my system
    • RAM usage appeared to peak at about 522 MB.

    Details:

    • 2.78GB sitting at the command prompt
    • 2.78GB running a doNothing.pl script (negligible overhead for Perl)
    • gendat.pl ran 2.5 minutes
      • Generated 5M lines
      • Lengths of lines were randomized from 1-100
      • Key pairs were randomly generated as key1-key43 each.
    • 2.76GB sitting at the command prompt
    • TwoKeySort.pl ran 1.0 minutes
    • 3.27GB peak RAM during TwoKeySort.pl
      • (0.51GB = 0.51 * 1024 = 522.25 MB.)
      • A little larger than anticipated
      • Either my guess at Perl's hash overhead needs to be tripled, or
      • I missed other intermediate storage requirements
    • 2.77GB after perl script ended and about 45 seconds for memory to be reclaimed

      So it takes a little over 500MB on my machine. This is a 64-bit system so if my version of Perl does 64-bit integers this should be fairly indicative. It might need to double otherwise; leaving you with, potentially, a 1GB commit using this technique.

      If 100MB was just a wag, then I would hope you wouldn't balk at using a half GB of RAM to sort a 20GB file using no intermediate disk space.

      Alternatively, you could write the key/key/offset/length values to a file which would be closer to 120MB; sort it, then read it in to guide the binary random I/O process. Just replace 522MB of RAM with a 120MB file, and use an external sort routine.

      The time estimate is not indicative since I had no 5MB lines to read, and I/O is still probably the slowest function on the system (though I haven't really kept up on the industry tech specs, going on assumption here).

      So there you have it. I didn't cache any of the data lines (that was kind of the point of this approach), so this should be a fair representation of space consumed for 5M lines, since the size of the lines doesn't matter.

      If you really have to keep it to 100MB, the 43 x 43 = 1849 passes through the source file might be your best bet. It will be slow, but effective.

      One other possibility, if you really want to over-engineer this thing, is you could segment the work. One approach for this has already been suggested, but given the low memory usage for the hash and array data, you could consider writing just the hash out to intermediate files to be gang-sorted using some kind of segmented iterative approach.

      That would be fun, but almost certainly not worth the effort.

      Have fun!

Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 03:23 UTC

    Okay, I converted it to write the keys to a file rather than collect them in a hash.

    For simplicity I wrote the keys, offsets and lengths as strings; to conserve space on disk (the sort files were about 91MB each) you could probably use pack to get them into a fixed-width format.
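    For illustration, a minimal sketch of that packing idea (the field widths - 8-byte keys, a 64-bit offset and a 32-bit length - are assumptions, and the Q template needs a 64-bit perl):

    use strict;
    use warnings;

    my ($key1, $key2, $offset, $length) = ('key01', 'key43', 123_456_789, 4096);

    my $packed = pack 'A8 A8 Q> L>', $key1, $key2, $offset, $length;   # fixed 28 bytes per entry
    my ($k1, $k2, $off, $len) = unpack 'A8 A8 Q> L>', $packed;
    print "$k1 $k2 $off $len\n";                                       # key01 key43 123456789 4096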

    In any regard, I then split the functions so pre-sort and post-sort activities are in separate scripts. The sort was then run separately as a command line utility.

    The results are favorable:

    TwoKeySortDisk1.pl (extracts keys/offset/length and writes to file)
        Starting RAM:    2.74GB
        Start Time:      05:03:13
        Peak RAM:        2.78GB
        Ending Time:     05:03:32
        Run Time:        19 sec
        Peak RAM Usage:  0.04GB * 1024 = 40MB

    Sort (probably Cygwin, not Windows native)
        Starting RAM:    2.75GB
        Start Time:      05:04:39
        Peak RAM:        3.02GB
        Ending Time:     05:04:49
        Run Time:        10 sec
        Peak RAM Usage:  0.27GB * 1024 = 276MB

    TwoKeySortDisk2.pl (reads sorted keys, reads original file in random mode, writes output text file)
        Starting RAM:    2.73GB
        Start Time:      05:07:56
        Peak RAM:        2.82GB
        Ending Time:     05:08:45
        Run Time:        49 sec
        Peak RAM Usage:  0.09GB * 1024 = 92MB

    Well, at least the Perl portion remains under 100MB. :-)

    Here's the first script, which pulls out the keys/offsets/lengths and writes them for external sorting.

    Here's the second script, which reads the sorted keys/offsets/lengths and performs the actual big data sort as before.

Re: sorting type question- space problems
by marinersk (Priest) on Sep 16, 2013 at 00:28 UTC

    Okay. This won't quite fit in 100MB but it's pretty close (~120MB, revised to ~160MB in the update below, plus general Perl overhead) unless Perl is a lot less efficient in its hash storage than I suspect.

    Update: Correction: The sort on the hash keys will add another array of about 40MB, so we're up to ~160MB plus general overhead. Grrr. If you're really constrained to 100MB, you're gonna' have to go with Random_Walk's iterative crawl. The trade-off is space vs. time. Typical issue. You get to make the call.

    The technique I employed (I believe I saw others suggest it but not sure) was to scan the file once in order to collect keys, offsets, and lengths of the original lines. Then a second pass reads them in order from the original file in binary mode (we used to call this Random IO), and writes them sequentially to the output file. As a side benefit of the approach, the original sequence of full matching keysets is preserved.

    Some potential issues with this approach:

    1. For a 20GB file, this requires 64-bit integers. Dunno how to do that, hope you do.
    2. Commensurate with the 32-bit issue, dunno if your Perl can handle files > 2GB.
    3. The keys in your first and second examples are different; the first example shows key1-key43, but the second example shows key01-key43. So long as key1 and key01 are considered synonymous, then this deeper optimization will work fully.
    4. If the keys are not actually keynn, skip the deep optimization, store the full keys, and consider making a single key by combining the two keys with something unique like a pipe separator to gain some slight optimization (e.g. my $optimizedKey = "$primaryKey\|$secondaryKey";).

    There are other considerations but you sound sharp enough to catch them.

    To try to cram this stuff into as small a space on PerlMonks as possible, I left out all the error checking, so bad opens and bad seeks will send you flying into the pasture without a helmet. 'nuff said.

    Without further ado:

    Results: