Re: Complex file manipulation challenge

by Marshall (Canon)
on Aug 13, 2019 at 20:10 UTC


in reply to Complex file manipulation challenge

Update: I got a negative vote for questioning the purpose of the OP's question. My bad. However, if the question is modified to reflect a more realistic scenario, there would be some interesting answers applicable in the real world.

I guess this is a Golf question?
I was working on a solution until I got to step 3 and realized that this requirement makes no sense.
It is so silly that I can't imagine a real-world use for it!

A more real-world scenario might be having to write a humongous amount of data to a multi-CD data set where no single file can span a CD boundary. I saw this sort of requirement in the olden floppy-disk days. When loading, say, 20 diskettes in a data set, you want to keep going even if diskette #5 has a fatal error. At the end, say 19 diskettes loaded and one didn't. Now we can get one replacement diskette and patch the system with that single diskette in a straightforward way.
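For what it's worth, that multi-volume constraint is the classic bin-packing problem. Here is a minimal first-fit-decreasing sketch; the 700 MB volume size and the *.dat glob are made up for illustration, not anything from the thread:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: assign files to 700 MB volumes, first-fit decreasing,
# so that no file spans a volume boundary.
my $volume_size = 700 * 1024 * 1024;    # bytes per CD (assumed)

my @files = map { +{ name => $_, size => -s $_ } } glob("*.dat");

# Largest first improves packing; a file bigger than a volume is fatal.
my @volumes;    # each element: { free => bytes, files => [names] }
for my $f (sort { $b->{size} <=> $a->{size} } @files) {
    die "$f->{name} cannot fit on any volume\n" if $f->{size} > $volume_size;
    my ($slot) = grep { $_->{free} >= $f->{size} } @volumes;   # first fit
    if (!$slot) {
        $slot = { free => $volume_size, files => [] };
        push @volumes, $slot;
    }
    $slot->{free} -= $f->{size};
    push @{ $slot->{files} }, $f->{name};
}

printf "volume %d: %s\n", $_ + 1, join(", ", @{ $volumes[$_]{files} })
    for 0 .. $#volumes;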

I have no idea of a practical use for this requirement.
Here is where I stopped:
BTW, I see no need to parse the .csv file. My gosh, I am unaware of any CSV file that is \n field delimited; what that would even mean boggles my mind, and it would result in a confusing display in a text editor.

#!/usr/bin/perl
use strict;
use warnings;

# node: 11104399

use constant DIR     => ".";   # set these as needed...
use constant N_FILES => 3;

# step 1: List all .csv files in a directory by increasing order of file size
my @files_by_size = sort { -s $a <=> -s $b } glob(DIR . "/*.csv");
print join("\n", @files_by_size), "\n";

# step 2: Drop the first line of each file and concat the rest into a single output file
open my $out, '>', "BigFile" or die "unable to open BigFile $!";
foreach my $infile (@files_by_size)
{
    open my $in, '<', $infile or die "unable to open $infile $!";
    <$in>;                     # throw away first line of file
    print $out $_ while <$in>;
}
close $out;    # $in is already closed at the end of each iteration...

# step 3: Split the above output file into "n" smaller files
#         without breaking up the lines in the input files
#
# This is a strange requirement! A VERY strange requirement!
# The obvious thing to do is to make n-1 "big" files and throw
# what is leftover into the nth file (which will be very small).
#
# The tricky part here is to make sure that at least one line
# winds up in the nth file. Now that I think about it...
#
# geez, if n==3 and outsize = total bytes:
# create file1 and write lines >= total_bytes/2 to it,
# write one line to file2,
# write the rest of the lines to file3.

my $big_files = N_FILES - 1;

# Stopped at this point because this sounds like a Golf situation
# with a very contrived scenario and I'm not good at that.

# step 4: this is easy
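Picking up where the script stops, here is one minimal way step 3 could go: split BigFile into N_FILES pieces of roughly equal byte size, breaking only at line boundaries. The even-split interpretation and the part1..partN file names are assumptions on my part, not the OP's spec:

# step 3 (sketch): split BigFile into N_FILES pieces, never mid-line.
my $total  = -s "BigFile";
my $target = int($total / N_FILES) + 1;    # byte budget per output file

open my $big, '<', "BigFile" or die "unable to open BigFile $!";
my $part  = 1;
my $bytes = 0;
open my $piece, '>', "part$part" or die "unable to open part$part $!";
while (my $line = <$big>) {
    # start a new piece once the budget is spent, but only between lines;
    # if BigFile has fewer lines than N_FILES, fewer pieces are created
    if ($bytes >= $target && $part < N_FILES) {
        close $piece;
        $part++;
        $bytes = 0;
        open $piece, '>', "part$part" or die "unable to open part$part $!";
    }
    print $piece $line;
    $bytes += length $line;
}
close $piece;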

Replies are listed 'Best First'.
Re^2: Complex file manipulation challenge
by swl (Parson) on Aug 13, 2019 at 21:54 UTC

    I am unaware of any CSV file that is \n field delimited

    Neither am I, but I have handled CSV files with embedded newlines in quoted fields. Usually these are exported from a spreadsheet program.
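    For reference, this is exactly the case Text::CSV (or Text::CSV_XS) handles once binary mode is on. A minimal sketch using an in-memory file; the names and addresses are invented:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;   # or Text::CSV_XS

# A quoted field with an embedded newline: one CSV record, two physical lines.
my $data = qq{name,address\n"Bob","123 Main St\nAnytown"\n};

open my $fh, '<', \$data or die $!;
my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

my $header = $csv->getline($fh);    # [ 'name', 'address' ]
my $row    = $csv->getline($fh);    # getline() spans the embedded newline
print "address field: $row->[1]\n"; # prints both lines of the address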

      That is indeed a good point++!

      In Excel, the "Wrap Text" formatting option flows a long value onto multiple display lines depending upon the column width, without changing the stored data. There is also Alt+Enter, which inserts a hard line break inside a cell, and that line break does survive a CSV export as an embedded newline in a quoted field.

      However, you are quite correct that multiple lines within a single column are something to be considered -- think of a single field for an address instead of a separate column for each line of the address.

      All of the CSV files that I currently work with containing addresses are | delimited, have separate columns for each potential line of the address and disallow the | char within an address. So a bit of tunnel vision on my part! Sorry!

      You are quite correct to point out this possibility.

      BTW: I've seen CSV files with 512 or 1024 fields. These things can have humongous line lengths. Perl is very good at getting me the dozen or so fields that I care about.
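      For illustration, extraction from a format like that really is a one-line split, safe only because the format bans | inside fields. The column positions and field names below are invented:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical: pull 3 of many pipe-delimited fields per line. No CSV
# parser needed because '|' can never appear inside a field.
while (my $line = <DATA>) {
    chomp $line;
    my @f = split /\|/, $line, -1;          # -1 keeps trailing empty fields
    my ($id, $name, $city) = @f[0, 3, 7];   # invented column positions
    print "$id: $name ($city)\n";
}

__DATA__
42|x|x|Alice|x|x|x|Springfield|x
43|x|x|Bob|x|x|x|Shelbyville|x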

        Your vision on CSV is indeed very limited :)

        Consider not only Excel (or other spreadsheet application) exports, but also:

        • Database exports (including images, BLOBs, XML, Unicode, …)
        • Log exports (I know of a situation that has to read 4 TB (terabytes!) a day)
        • CSV exports where not only the data but also the header row has embedded newlines (and commas) in the fields
        • CSV files with mixed encodings (you should know that Oracle supports field-scoped encodings in its most recent versions)
        • Nested CSV: each/any field in the CSV is itself CSV (correctly or incorrectly quoted), but the final result is valid CSV
        • CSV files with more than 65535 columns (I have seen them)

        All of the above should remind you never to use regular expressions or read-by-line algorithms to parse CSV (see the sketch below); hand-rolled parsing only looks easy.

        Now reconsider your last line: a CSV file does not have a humongous line length, it is likely to have a humongous record length. (Think of a database export where a table stores movies in parts and each record holds up to 4 pieces of a movie, so each CSV record can be GBs. People use databases and CSV for weird things.)
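        To make that concrete, a minimal streaming sketch with Text::CSV_XS's getline(), which reads one record at a time no matter how many physical lines (or gigabytes) the record spans, instead of a line-by-line while (<$fh>) loop. The file name huge.csv is a placeholder:

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;

my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

# getline() streams record by record, so memory use is bounded by the
# largest single record, not by the size of the file.
open my $fh, '<', 'huge.csv' or die "huge.csv: $!";
while (my $row = $csv->getline($fh)) {
    # $row is an arrayref of fields; embedded newlines arrive intact
    print scalar(@$row), " fields in this record\n";
}
close $fh;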


        Enjoy, Have FUN! H.Merijn
