Re: join on 6 huge files
by BrowserUk (Patriarch) on Jun 10, 2004 at 13:06 UTC
Where's the problem?
Open all six files, read one line from each, output one munged line, and repeat.
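For what it's worth, a minimal (untested) sketch of that loop might look like the following. It assumes the six files really are in the same key order, that each line looks like SHA-1<tab>7_hex_chars, and that the desired output is the SHA followed by the six hex columns; the file names are placeholders.
#!/usr/bin/perl
use strict;
use warnings;

# Placeholder names; assumes all six inputs share the same key order.
my @names = map { "file$_.txt" } 1 .. 6;

my @fh;
for my $name (@names) {
    open my $h, '<', $name or die "Cannot open $name: $!";
    push @fh, $h;
}

open my $out, '>', 'merged.txt' or die "Cannot open merged.txt: $!";

while ( defined( my $first = readline $fh[0] ) ) {
    chomp $first;
    my ( $sha, @hex ) = split /\t/, $first;

    for my $h ( @fh[ 1 .. $#fh ] ) {
        my $line = readline $h;
        die "Input files are different lengths\n" unless defined $line;
        chomp $line;
        push @hex, ( split /\t/, $line )[1];    # keep only the hex column
    }
    print {$out} join( "\t", $sha, @hex ), "\n";
}
close $out or die "Cannot close merged.txt: $!";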
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: join on 6 huge files
by Art_XIV (Hermit) on Jun 10, 2004 at 13:31 UTC
If you are certain that the SHA entries are going to be in the same order in all six of the files, then reading from each file, one entry at a time, would probably be best.
If the SHA entries in the files aren't in the same order, then I would try to cat the files together, sort them with the sort tool (which can probably sort 56 million lines much faster than Perl), and then let a script read the sorted result.
Slurping shouldn't be necessary... just keep an array of hex strings for the current SHA number and dump it to the out file when you hit a new SHA.
This is just a SWAG, though, and I can't guarantee that sort won't get pathological on you with 56 million lines.
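If it helps, here is a rough sketch of that second pass over the sorted concatenation, collecting the hex strings for the current SHA and dumping them when the key changes. The file names 'sorted.txt' (the output of cat + sort) and 'grouped.txt' are placeholders.
#!/usr/bin/perl
use strict;
use warnings;

open my $in,  '<', 'sorted.txt'  or die "Cannot open sorted.txt: $!";
open my $out, '>', 'grouped.txt' or die "Cannot open grouped.txt: $!";

my ( $current, @hex );
while ( my $line = <$in> ) {
    chomp $line;
    my ( $sha, $hex ) = split /\t/, $line;

    if ( defined $current && $sha ne $current ) {
        # Hit a new SHA: dump the strings collected for the previous one.
        print {$out} join( "\t", $current, @hex ), "\n";
        @hex = ();
    }
    $current = $sha;
    push @hex, $hex;
}
print {$out} join( "\t", $current, @hex ), "\n" if defined $current;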
Hanlon's Razor - "Never attribute to malice that which can be adequately explained by stupidity"
You don't even need to resort the files if they are already sorted,
as sort can merge sorted files:
From (textutils.info)sort:
`sort' has three modes of operation: sort (the default), merge, and
check for sortedness. The following options change the operation mode:
`-m'
     Merge the given files by sorting them as a group. Each input file
     must always be individually sorted. It always works to sort
     instead of merge; merging is provided because it is faster, in the
     case where it works.
This means you won't have all six files open at once:
you open two for reading, merge them into a temporary file,
merge the third with the result etc.
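A rough Perl driver for that pairwise merge might look like this; the file names are placeholders, and it assumes a sort that supports -m and -o.
#!/usr/bin/perl
use strict;
use warnings;

# Merge the already-sorted files two at a time via sort -m, feeding each
# result into the next step. File names are placeholders.
my @files  = map { "file$_.txt" } 1 .. 6;
my $merged = shift @files;

my $step = 0;
for my $next (@files) {
    my $tmp = 'merge_tmp_' . ++$step . '.txt';
    system( 'sort', '-m', '-o', $tmp, $merged, $next ) == 0
        or die "sort -m failed on $merged + $next: $?";
    $merged = $tmp;
}
print "Merged result is in $merged\n";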
...I can't guarantee that sort won't get pathological on you with 56 million lines.
But it may just squeak by with 54 million. Sorry, couldn't resist.
Re: join on 6 huge files
by pbeckingham (Parson) on Jun 10, 2004 at 12:57 UTC
If the six files are sorted by that first column, then this is just (just!?) a six-way merge, which does not require you to slurp.
Is this a one-off? Or are you going to be running this beast often?
The value of "how often is this going to run" cannot be overstated. Along the same lines, how quick does it need to be? If there is a need for speed, I'd start whipping out an RDBMS.
The value of "how often is this going to run" cannot be overstated. Along the same lines, how quick does it need to be? If there is a need for speed, I'd start whipping out an RDBMS.
Sorry, but just how would "whipping out an RDBMS" speed up the merging of six flat data files?
Once they are merged, they're merged. There is no point in repeating the process.
If, however, new flatfiles are received on a regular basis and require merging, how would an RDBMS help?
Even if the process(es) producing the flatfiles could be persuaded to write the data directly to an RDBMS, just querying the 9,000,000 records from the RDBMS would take longer than merging the flatfiles. Considerably longer.
And that's without taking into consideration the time taken to insert the data in the first place. Never mind the cost of amending the (potentially 6) applications to write directly to the RDBMS--if that is even possible.
Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
Re: join on 6 huge files
by Fletch (Bishop) on Jun 10, 2004 at 13:43 UTC
Man, flashbacks to my CS prof talking about the good ol' days of magtapes (of course I've got a couple of 9-track reels somewhere in a box in the garage so I can't really talk).
At any rate, as has been said there's nothing wrong with having several files open. You're just doing the last phase of a merge sort with six inputs (since you've guaranteed the inputs are already sorted). Just maintain a list of [ $filehandle, $next_token ] pairs and pull off the smallest (and replace it from the handle or remove the pair from the list when the handle hits EOF).
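Something along these lines, perhaps (an untested sketch: the file names are placeholders, and it leans on the SHA-1 field being fixed-width and first on each line, so comparing whole lines orders them by key).
#!/usr/bin/perl
use strict;
use warnings;

# Last phase of the merge sort: keep a list of [ $filehandle, $next_line ]
# pairs, repeatedly emit the pair with the smallest key, and refill it from
# its handle, dropping the pair when the handle hits EOF.
my @pairs;
for my $name ( map { "file$_.txt" } 1 .. 6 ) {
    open my $fh, '<', $name or die "Cannot open $name: $!";
    my $line = <$fh>;
    push @pairs, [ $fh, $line ] if defined $line;
}

while (@pairs) {
    # Find the pair whose pending line has the smallest key.
    my $min = 0;
    for my $i ( 1 .. $#pairs ) {
        $min = $i if $pairs[$i][1] lt $pairs[$min][1];
    }
    print $pairs[$min][1];

    my $next = readline $pairs[$min][0];
    if ( defined $next ) {
        $pairs[$min][1] = $next;
    }
    else {
        splice @pairs, $min, 1;    # this handle hit EOF
    }
}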
Re: join on 6 huge files
by dba (Monk) on Jun 10, 2004 at 14:06 UTC
The Unix utilities sort and join should do the work for you.
Are the rows in each file unique and sorted? Depending on your requirements, use sort with the -u option. You may want to check out the -y option too.
Absolutely right. Doing this in Perl would be a waste of time if this is a one-off, and if there is no prior or subsequent processing. The only value is in implementing a merge sort, which is good practice, I suppose.
For Perl to sort, it first must slurp. The question of whether the input is already sorted therefore matters greatly.
But our OP is absent, and hasn't responded to any questions. So given our lack of information, I agree with dba.
Re: join on 6 huge files
by dragonchild (Archbishop) on Jun 10, 2004 at 13:59 UTC
Have 6 handles open and simulate peeking at the next line for each filehandle. You have to make sure that you don't attempt to read from a handle that's finished, but that's easy. I wrote one of these in less than 4 hours, with testing.
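For instance, one hand-rolled way to fake the peek is a small buffered reader like the one below. The class name and file names are made up for illustration, and the code is (in keeping with the sig below) untested.
#!/usr/bin/perl
use strict;
use warnings;

# A tiny "peekable" reader: the next unread line sits in a buffer, so you
# can look at it without consuming it and can tell when a handle is done.
package PeekReader;

sub new {
    my ( $class, $name ) = @_;
    open my $fh, '<', $name or die "Cannot open $name: $!";
    return bless { fh => $fh, buffer => scalar <$fh> }, $class;
}

sub peek     { $_[0]{buffer} }              # look at the next line, don't consume it
sub finished { !defined $_[0]{buffer} }     # true once the handle is exhausted

sub next_line {
    my $self = shift;
    my $line = $self->{buffer};
    $self->{buffer} = readline $self->{fh};
    return $line;
}

package main;

my @readers = grep { !$_->finished }
              map  { PeekReader->new("file$_.txt") } 1 .. 6;
print $readers[0]->peek if @readers;    # inspect without reading past it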
------
We are the carpenters and bricklayers of the Information Age.
Then there are Damian modules.... *sigh* ... that's not about being less-lazy -- that's about being on some really good drugs -- you know, there is no spoon. - flyingmoose
I shouldn't have to say this, but any code, unless otherwise stated, is untested
Re: join on 6 huge files
by graff (Chancellor) on Jun 10, 2004 at 22:54 UTC
I have 6 text files of the format SHA-1<tab>7_hex_chars<newline> and each file has 9,000,000 lines. The 6 files are in the same order, with the SHA-1 being the key column of all the files.
If that's accurate (don't need to sort or worry about unmatched keys), then something like this would be the first thing I would try -- the unix paste and cut commands:
paste file1 file2 file3 file4 file5 file6 | cut -f2,4,6,8,10,12 > hex.cols
That should be pretty close to optimal in terms of performance.
Re: join on 6 huge files
by SageMusings (Beadle) on Jun 12, 2004 at 03:49 UTC
Dwite,
If I understand your problem correctly (and you were a bit hazy), you simply do not want to work with these huge files in memory. Right?
I would open all the files in a stream fashion, much in the spirit of that old Unix standby "sed". Go through each file line by line, as if you were executing a batch process. One line from each input stream is munged at a time, not all at once. Then take the resulting concatenation and write it to the destination file. It's quick, simple, and elegant.
I have written several tools that take a sed-like tack. Some of my files are as large as 12MB and it never chokes.