Re: Complex file manipulation challenge

by Marshall (Canon)
on Aug 13, 2019 at 20:10 UTC


in reply to Complex file manipulation challenge

Update: I got a negative vote for questioning the purpose of the OP's question. My bad. However, if the question is modified to reflect a more realistic scenario, there would be some interesting answers applicable in the real world.

I guess this is a Golf question?
I was working on a solution until I got to step 3 and realized that this requirement makes no sense.
It is so silly that I can't imagine a real-world use for it!

A more real-world scenario might be having to write a humongous amount of data to a multi-CD data set where no single file can span a CD boundary. I saw this sort of requirement in the olden floppy-disk days. When loading, say, 20 diskettes in a data set, you want to keep going even if diskette #5 has a fatal error. At the end, say 19 diskettes loaded and one didn't. Now we can get one replacement diskette and patch the system with that single diskette in a straightforward way.
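For what it's worth, that multi-volume constraint is the classic bin-packing problem. Here is a minimal first-fit-decreasing sketch; the 700 MB volume size and the *.dat glob are made up for illustration, not anything from the thread:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: assign files to 700 MB volumes, first-fit decreasing,
# so that no file spans a volume boundary.
my $volume_size = 700 * 1024 * 1024;    # bytes per CD (assumed)

my @files = map { +{ name => $_, size => -s $_ } } glob("*.dat");

# Largest first improves packing; a file bigger than a volume is fatal.
my @volumes;    # each element: { free => bytes, files => [names] }
for my $f (sort { $b->{size} <=> $a->{size} } @files) {
    die "$f->{name} cannot fit on any volume\n" if $f->{size} > $volume_size;
    my ($slot) = grep { $_->{free} >= $f->{size} } @volumes;   # first fit
    if (!$slot) {
        $slot = { free => $volume_size, files => [] };
        push @volumes, $slot;
    }
    $slot->{free} -= $f->{size};
    push @{ $slot->{files} }, $f->{name};
}

printf "volume %d: %s\n", $_ + 1, join(", ", @{ $volumes[$_]{files} })
    for 0 .. $#volumes;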

I have no idea of a practical use for this requirement.
Here is where I stopped:
BTW, I see no need to parse the .csv file. My gosh, I am unaware of any CSV file that is \n field delimited; what that would even mean boggles my mind, and it would result in a confusing display in a text editor.

#!/usr/bin/perl
use strict;
use warnings;

# node: 11104399

use constant DIR     => ".";   # set these as needed...
use constant N_FILES => 3;

# step 1: List all .csv files in a directory by increasing order of file size
my @files_by_size = sort { -s $a <=> -s $b } glob(DIR . "/*.csv");
print join("\n", @files_by_size), "\n";

# step 2: Drop the first line of each file and concat the rest into a single output file
open my $out, '>', "BigFile" or die "unable to open BigFile $!";
foreach my $infile (@files_by_size)
{
    open my $in, '<', $infile or die "unable to open $infile $!";
    <$in>;                     # throw away first line of file
    print $out $_ while <$in>;
}
close $out;    # $in is already closed at the end of each iteration...

# step 3: Split the above output file into "n" smaller files
#         without breaking up the lines in the input files
#
# This is a strange requirement! A VERY strange requirement!
# The obvious thing to do is to make n-1 "big" files and throw
# what is leftover into the nth file (which will be very small).
#
# The tricky part here is to make sure that at least one line
# winds up in the nth file. Now that I think about it...
#
# geez, if n==3 and outsize = total bytes:
# create file1 and write lines >= total_bytes/2 to it,
# write one line to file2,
# write the rest of the lines to file3.

my $big_files = N_FILES - 1;

# Stopped at this point because this sounds like a Golf situation
# with a very contrived scenario and I'm not good at that.

# step 4: this is easy
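Picking up where the script stops, here is one minimal way step 3 could go: split BigFile into N_FILES pieces of roughly equal byte size, breaking only at line boundaries. The even-split interpretation and the part1..partN file names are assumptions on my part, not the OP's spec:

# step 3 (sketch): split BigFile into N_FILES pieces, never mid-line.
my $total  = -s "BigFile";
my $target = int($total / N_FILES) + 1;    # byte budget per output file

open my $big, '<', "BigFile" or die "unable to open BigFile $!";
my $part  = 1;
my $bytes = 0;
open my $piece, '>', "part$part" or die "unable to open part$part $!";
while (my $line = <$big>) {
    # start a new piece once the budget is spent, but only between lines;
    # if BigFile has fewer lines than N_FILES, fewer pieces are created
    if ($bytes >= $target && $part < N_FILES) {
        close $piece;
        $part++;
        $bytes = 0;
        open $piece, '>', "part$part" or die "unable to open part$part $!";
    }
    print $piece $line;
    $bytes += length $line;
}
close $piece;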

Replies are listed 'Best First'.
Re^2: Complex file manipulation challenge
by swl (Parson) on Aug 13, 2019 at 21:54 UTC

    I am unaware of any CSV file that is \n field delimited

    Neither am I, but I have handled CSV files with embedded newlines in quoted fields. Usually these are exported from a spreadsheet program.
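    For reference, this is exactly the case Text::CSV (or Text::CSV_XS) handles once binary mode is on. A minimal sketch using an in-memory file; the names and addresses are invented:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;   # or Text::CSV_XS

# A quoted field with an embedded newline: one CSV record, two physical lines.
my $data = qq{name,address\n"Bob","123 Main St\nAnytown"\n};

open my $fh, '<', \$data or die $!;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

my $header = $csv->getline($fh);    # [ 'name', 'address' ]
my $row    = $csv->getline($fh);    # getline() spans the embedded newline
print "address field: $row->[1]\n"; # prints both lines of the address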

      That is indeed a good point++!

      In Excel, the "Wrap Text" formatting option flows a long value onto multiple display lines depending upon the column width, without changing the stored data. There is also Alt+Enter, which inserts a hard line break inside a cell, and that line break does survive a CSV export as an embedded newline in a quoted field.

      However, you are quite correct that multiple lines within a single column are something to be considered -- think of a single field for an address instead of a separate column for each line of the address.

      All of the CSV files that I currently work with containing addresses are | delimited, have separate columns for each potential line of the address and disallow the | char within an address. So a bit of tunnel vision on my part! Sorry!

      You are quite correct to point out this possibility.

      BTW: I've seen CSV files with 512 or 1024 fields. These things can have humongous line lengths. Perl is very good at getting me the dozen or so fields that I care about.
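      For illustration, extraction from a format like that really is a one-line split, safe only because the format bans | inside fields. The column positions and field names below are invented:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: pull 3 of many pipe-delimited fields per line. No CSV
# parser needed because '|' can never appear inside a field.
while (my $line = <DATA>) {
    chomp $line;
    my @f = split /\|/, $line, -1;          # -1 keeps trailing empty fields
    my ($id, $name, $city) = @f[0, 3, 7];   # invented column positions
    print "$id: $name ($city)\n";
}

__DATA__
42|x|x|Alice|x|x|x|Springfield|x
43|x|x|Bob|x|x|x|Shelbyville|x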

        Your vision on CSV is indeed very limited :)

        Consider not only Excel (or other spreadsheet application) exports, but also:

        • Database exports (including images, BLOBs, XML, Unicode, …)
        • Log exports (I know of a situation that has to read 4 TB (terabytes!) a day)
        • CSV exports where not only the data but also the header row has embedded newlines (and commas) in the fields
        • CSV files with mixed encodings (you should know that Oracle supports field-scoped encodings in its most recent versions)
        • Nested CSV: each/any field in the CSV is itself CSV (correctly or incorrectly quoted), but the final result is valid CSV
        • CSV files with more than 65535 columns (I have seen them)

        All of the above should remind you never to use regular expressions or read-by-line algorithms to parse CSV (see the sketch below); hand-rolled parsing only looks easy.

        Now reconsider your last line: a CSV file does not have a humongous line length, it is likely to have a humongous record length. (Think of a database export where a table stores movies in parts and each record holds up to 4 pieces of a movie, so each CSV record can be GBs. People use databases and CSV for weird things.)
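        To make that concrete, a minimal streaming sketch with Text::CSV_XS's getline(), which reads one record at a time no matter how many physical lines (or gigabytes) the record spans, instead of a line-by-line while (<$fh>) loop. The file name huge.csv is a placeholder:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# getline() streams record by record, so memory use is bounded by the
# largest single record, not by the size of the file.
open my $fh, '<', 'huge.csv' or die "huge.csv: $!";
while (my $row = $csv->getline($fh)) {
    # $row is an arrayref of fields; embedded newlines arrive intact
    print scalar(@$row), " fields in this record\n";
}
close $fh;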


        Enjoy, Have FUN! H.Merijn
