PerlMonks  

Re^2: Complex file manipulation challenge

by swl (Parson)
on Aug 13, 2019 at 21:54 UTC ( [id://11104418] )


in reply to Re: Complex file manipulation challenge
in thread Complex file manipulation challenge

I am unaware of any CSV file that is \n field delimited

Neither have I, but I have handled CSV files with embedded newlines in quoted fields, usually exported from a spreadsheet program.
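
For example, a minimal sketch of reading such a file record by record with Text::CSV_XS (the file name is made up; Text::CSV has the same interface):

    use strict;
    use warnings;
    use Text::CSV_XS;

    # binary => 1 allows quoted fields to contain embedded newlines
    my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

    # 'export.csv' is a hypothetical spreadsheet export
    open my $fh, '<:encoding(utf8)', 'export.csv' or die "export.csv: $!";
    while (my $row = $csv->getline($fh)) {
        # $row holds one full record, even when a quoted
        # field spans several physical lines
        printf "%d fields, first: %s\n", scalar @$row, $row->[0];
    }
    close $fh;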


Re^3: Complex file manipulation challenge
by Marshall (Canon) on Aug 13, 2019 at 22:16 UTC
    That is indeed a good point++!

    In Excel, there is a formatting option to wrap text onto multiple lines depending upon the column width. There may also be an option to insert a line break in the GUI that doesn't appear in the CSV (maybe Ctrl-Enter)? I'm not sure that is possible.

    However, you are quite correct that multiple lines within a single field are something to consider -- think of a single field for an address instead of separate columns for each line of the address.

    All of the CSV files containing addresses that I currently work with are |-delimited, have a separate column for each potential line of the address, and disallow the | char within an address. So a bit of tunnel vision on my part! Sorry!

    You are quite correct to point out this possibility.

    BTW: I've seen CSV files with 512 or 1024 fields. These things can have humongous line lengths. Perl is very good at getting me the dozen or so fields that I care about.
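
    For what it's worth, a rough sketch of that kind of extraction, assuming the |-delimited layout described above (the column positions and file name are invented):

        use strict;
        use warnings;

        # A plain split is only safe because the | char is disallowed
        # inside fields; it would NOT be safe for general quoted CSV.
        my @wanted = (0, 3, 17, 42);    # hypothetical column positions
        open my $fh, '<', 'addresses.psv' or die "addresses.psv: $!";
        while (my $line = <$fh>) {
            chomp $line;
            my @fields = split /\|/, $line, -1;   # -1 keeps trailing empty fields
            print join("\t", @fields[@wanted]), "\n";
        }
        close $fh;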

      Your view of CSV is indeed very limited :)

      Consider not only Excel (or other spreadsheet application) exports, but also:

      • Database exports (including images, BLOBs, XML, Unicode, …)
      • Log exports (I know of a situation that has to read 4 TB (terabytes!) a day)
      • CSV exports where not only the data, but also the header row, has embedded newlines (and commas) in the fields
      • CSV files with mixed encoding (you should know that Oracle supports field-scoped encodings in its most recent versions)
      • Nested CSV: each/any field in the CSV is itself CSV (correctly or incorrectly quoted), but the final result is valid CSV
      • I've seen CSV files with more than 65535 columns.

      All of the above should remind you never to use regular expressions or read-by-line algorithms to parse CSV. It looks easier than it is.
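
      A minimal sketch of the difference, using Text::CSV_XS on inlined data (the record content is invented): the parser returns one record at a time, not one physical line.

          use strict;
          use warnings;
          use Text::CSV_XS;

          # one record whose quoted field spans two physical lines
          my $data = qq{id,address\n1,"12 Main St\nSpringfield"\n};

          open my $fh, '<', \$data or die $!;   # read from an in-memory scalar
          my $csv = Text::CSV_XS->new({ binary => 1, auto_diag => 1 });

          my $header = $csv->getline($fh);   # one *record*, not one line
          my $row    = $csv->getline($fh);
          print $row->[1], "\n";             # the two-line address, intact

      A naive while (<$fh>) loop would hand you half of that record per iteration, which is exactly why read-by-line parsing breaks.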

      Now reconsider your last line: a CSV file does not have a humongous line length. It is likely to have a humongous record length. (Think of a database export where a table has stored movies in parts and each record holds up to 4 pieces of a movie, so each CSV record can be gigabytes. People use databases and CSV for weird things.)


      Enjoy, Have FUN! H.Merijn

        I'm wondering why CSV isn't replaced with JSON. Doesn't PostgreSQL have row_to_json? One row as a JSON array and you're done? I don't know offhand what Sybase, Oracle or MySQL provide, but I guess they come with something similar. Writing a stored procedure might be an option. And probably there is some fubar Excel macro that does the same. Processing such a file line by line with JSON::Tiny and/or something from the MCE toolbox should work like a charm. Regards, Karl
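
        A small sketch under that assumption: one JSON document per line (the file name is invented, and the database-side export is left out):

            use strict;
            use warnings;
            use JSON::Tiny qw(decode_json);

            # Hypothetical export: one JSON array (one row) per line.
            # Embedded newlines arrive as "\n" escapes inside JSON strings,
            # so reading line by line is safe here.
            open my $fh, '<', 'rows.json' or die "rows.json: $!";
            while (my $line = <$fh>) {
                chomp $line;
                my $row = decode_json($line);           # array ref per record
                print scalar(@$row), " columns in this row\n";
            }
            close $fh;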

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
