Re: split large CSV file >9.1MB into files of equal size that can be opened in excel
by davido (Cardinal) on Sep 28, 2016 at 23:26 UTC
You have a choice when splitting a file: split on its physical size, or split on its logical structure. If you wish to treat each half as its own unit, you have to split logically. Splitting blindly on size only works if you intend to reassemble the pieces later.
To split on logical units that closely approximate physical size, read half the physical file size, then continue forward until you hit the next logical break, and split there.
If you choose to read half, then continue forward, you could scan byte by byte until you find a record separator, but that's not the most efficient approach. It's often more efficient to read in, say, 4KB or 8KB segments, and then split such a segment on the first record separator found. Once you perform that split, append the left-hand portion to the first document, and prepend the right-hand portion to the second document. Then read the entire remainder into the second document.
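A rough sketch of that approach in Perl (the file names, the 4KB segment size, and a plain "\n" record separator are all assumptions for illustration, not the OP's actual code):

use strict;
use warnings;

my $in_file = 'all_tags2.csv';            # input name taken from the thread
my $half    = int( (-s $in_file) / 2 );   # half the physical size, in bytes

open my $in,    '<', $in_file     or die "open $in_file: $!";
open my $out_a, '>', 'half_a.csv' or die "open half_a.csv: $!";
open my $out_b, '>', 'half_b.csv' or die "open half_b.csv: $!";

# Copy the first (approximate) half verbatim.
read $in, my $buf, $half;
print {$out_a} $buf;

# Read a 4KB segment and split it on the first record separator.
if ( read $in, my $seg, 4096 ) {
    if ( $seg =~ /\n/ ) {
        my ( $left, $right ) = split /\n/, $seg, 2;
        print {$out_a} $left, "\n";   # completes the record straddling the middle
        print {$out_b} $right;        # begins the second document
    }
    else {
        print {$out_a} $seg;          # no separator yet; a full script would read again
    }
}

# The entire remainder belongs to the second document.
local $/;                             # slurp mode
my $rest = <$in>;
print {$out_b} $rest if defined $rest;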
This would allow chunking my long-line scenario more equally. But it's a matter of whether it's tolerable to the OP to have an input record (line) split across multiple files: if it's tolerable, follow ++davido's advice; otherwise, you will be stuck with potential size imbalances in the output.
And if you weren't expecting a size imbalance in the lines, you need to look at whatever's generating all_tags2.csv, and figure out why one line is significantly longer than the others.
Correct, I think. If I was unclear on the tradeoffs, I did intend to imply that splitting on a delimiter, even one near the middle of the file, will almost never result in both sides of the split being of identical size. If you're forced to split into identical sizes on a file where the physical middle is not guaranteed to fall on a record boundary, then you must necessarily lose record-oriented semantics, or deal with the record that spans the physical middle of the file being broken.
I suspect that in this individual's case, it's a reasonable tradeoff for each file to be approximately equal in size to the extent that retaining logical record integrity permits. That is, records are more important than exact size.
Re: split large CSV file >9.1MB into files of equal size that can be opened in excel
by pryrt (Abbot) on Sep 28, 2016 at 23:45 UTC
I ran the OP code on a random CSV with 5-80 characters on each line and 1M lines (about 44MiB total): it behaved exactly as the OP described the desired output: all the files were ~10KB each, with the final file holding the leftover lines. With reasonably sized lines, I could not replicate the failure where one file (specifically, the second) ended up hugely bigger than the others. I played with changing the chunk sizes, and it always did what I expected.
While typing up that paragraph, I realized: what if one line were hugely longer than all the others? If one line is significantly larger than the rest (my new test data has line #1023 being 1M characters, rather than ≤80 characters), then when the 10KB read finished for the OP's second output file, it read the remainder of that huge line using the $chunk = <FH>; in the "second if", making the second file huge. By removing the "second if" (presumably the whole if-block, not just the if statement, as I originally interpreted it), then, like stevieb said, the complete line was not read, and the files all ended up the same size.
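Something like this throwaway generator (not my actual test script; the sizes and the long line's position are just illustrative) produces that kind of test data:

use strict;
use warnings;

open my $fh, '>', 'test.csv' or die "open: $!";
for my $n ( 1 .. 1_000_000 ) {
    if ( $n == 1023 ) {
        # One pathologically long record of roughly 1M characters.
        print {$fh} join( ',', ('x' x 9) x 100_000 ), "\n";
    }
    else {
        # Ordinary short records of random comma-separated numbers.
        print {$fh} join( ',', map { int rand 100 } 1 .. 1 + int rand 8 ), "\n";
    }
}
close $fh;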
And, as advice to the OP, based on recommendations I've often seen in the Monastery: use the 3-argument form of open, use lexical filehandles (my $fh and my $out instead of FH and OUT; Super Search for "lexical filehandles" to see some explanations as to why), consider autodie instead of manually adding the or die "..." to each (in a larger program, you might forget it on one or more calls to open... but bonus points for actually doing the check in this program.)
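Put together, that advice looks something like this minimal sketch (the file names are placeholders):

use strict;
use warnings;
use autodie;    # open/close/read now die on failure automatically

# Three-argument open with lexical filehandles: no global FH/OUT,
# and autodie removes the need for "or die ..." on every call.
open my $fh,  '<', 'all_tags2.csv';
open my $out, '>', 'output_1.csv';

while ( my $line = <$fh> ) {
    print {$out} $line;
}

close $out;
close $fh;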
Re: split large CSV file >9.1MB into files of equal size that can be opened in excel
by dasgar (Priest) on Sep 29, 2016 at 01:11 UTC
Just thought that I'd point out a few other aspects that you might want to consider.
Although it looks like you're wanting to reduce the CSV file by limiting the number of lines in each new CSV file (which will become rows in Excel), I thought that I'd point out that Excel has a maximum number of both columns and rows that it can handle, and those limits vary across versions of Excel. The reason for bringing up this point is that if your original CSV file has more columns than Excel supports, all of the split CSV files that you create from the original will have the same issue when you try to open them in Excel.
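If you want to check for that up front, a quick field count on the first line works; here's a naive sketch (it assumes no commas inside quoted fields, and the file name is a placeholder):

use strict;
use warnings;

# Excel 2007 and later supports 1,048,576 rows by 16,384 columns;
# the older .xls format allows only 65,536 rows by 256 columns.
my $max_cols = 16_384;

open my $fh, '<', 'all_tags2.csv' or die "open: $!";
my $first = <$fh>;
close $fh;

# Naive field count: one more field than there are comma separators.
my $cols = 1 + ( $first =~ tr/,// );
warn "Too many columns for Excel: $cols > $max_cols\n" if $cols > $max_cols;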
Also, does the first line of the CSV file contain the column headers? If so, you'll want to capture that first line and reproduce it as the first line in all of your new CSV files. Otherwise, only the first new file will have the headers.
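A sketch of that header handling, grafted onto a simple line-oriented split (the file names and lines-per-file count are arbitrary, not the OP's values):

use strict;
use warnings;

open my $fh, '<', 'all_tags2.csv' or die "open: $!";
my $header = <$fh>;                  # capture the column headers once

my $lines_per_file = 50_000;         # arbitrary split point for this sketch
my ( $count, $n, $out ) = ( 0, 0 );

while ( my $line = <$fh> ) {
    if ( $count++ % $lines_per_file == 0 ) {
        close $out if $out;
        open $out, '>', sprintf( 'split_%03d.csv', ++$n ) or die "open: $!";
        print {$out} $header;        # every file gets the same first line
    }
    print {$out} $line;
}
close $out if $out;
close $fh;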
Re: split large CSV file >9.1MB into files of equal size that can be opened in excel
by Tux (Canon) on Sep 29, 2016 at 18:41 UTC
If your first 10000-byte read ends up inside a quoted field that contains an embedded newline, you will break the CSV's validity. Just warning here and playing the devil's advocate.
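One way to stay safe is to split on parsed records rather than raw bytes; here is a minimal sketch using Text::CSV, which handles quoted fields with embedded newlines (the file names and size target are placeholders):

use strict;
use warnings;
use Text::CSV;    # Text::CSV_XS is used automatically if installed

my $csv = Text::CSV->new( { binary => 1, auto_diag => 1 } );

open my $in,  '<', 'all_tags2.csv' or die "open: $!";
open my $out, '>', 'part_1.csv'    or die "open: $!";

my $limit = 10_000;                  # approximate bytes per part, as in the OP
my ( $written, $n ) = ( 0, 1 );

# getline() returns one logical record, even when a quoted
# field contains embedded newlines.
while ( my $row = $csv->getline($in) ) {
    if ( $written >= $limit ) {
        close $out;
        open $out, '>', sprintf( 'part_%d.csv', ++$n ) or die "open: $!";
        $written = 0;
    }
    $csv->combine(@$row);            # re-emit with correct quoting
    my $line = $csv->string . "\n";
    print {$out} $line;
    $written += length $line;
}
close $out;
close $in;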
Enjoy, Have FUN! H.Merijn
Re: split large CSV file >9.1MB into files of equal size that can be opened in excel
by Anonymous Monk on Sep 28, 2016 at 20:40 UTC
Have you taken a look at the split CSV files?
You're splitting by byte count, which means that unless your CSV files are *very* reliably structured, you will likely break the file up into chunks that cut lines in the middle.
Say you have a file like this:
1,2,3,4,5
and a read statement like this:
read $fh, $chunk, 5;
You'll end up with split file one with:
1,2,3
and the second split file with:
,4,5
Which might not be what you wanted.
I think that's why the OP ran $chunk = <FH>; after the read(FH, $chunk, 10000); block: it reads whatever is left of the current line (or, if the read happened to end exactly at the end of a line, it reads one extra line) and includes that extra bit in the same file. (This of course still assumes that there are no quote-protected newlines in the source CSV, but with a name like "all_tags2", I am guessing that there aren't newlines in the "tags".)
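In isolation, that read-then-readline pattern looks something like this (the OP's actual script isn't shown in this thread; this sketch uses lexical filehandles and the OP's 10000-byte chunk size):

use strict;
use warnings;

open my $fh, '<', 'all_tags2.csv' or die "open: $!";

my $n = 0;
while ( read $fh, my $chunk, 10_000 ) {
    # Finish the line the 10KB read stopped in the middle of (or, if the
    # read ended exactly at a newline, pick up one extra full line).
    my $rest = <$fh>;
    $chunk .= $rest if defined $rest;

    open my $out, '>', sprintf( 'out_%d.csv', ++$n ) or die "open: $!";
    print {$out} $chunk;
    close $out;
}
close $fh;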
Re: split large CSV file >9.1MB into files of equal size that can be opened in excel
by Anonymous Monk on Sep 28, 2016 at 20:36 UTC
It's the second output file that doesn't get split properly. It's over 1.9MB. Thought I should clarify. Thanks.