in reply to Re^2: Complex file manipulation challenge
in thread Complex file manipulation challenge

That is indeed a good point++!

In Excel, there is some kind of formatting option to wrap a line onto another line depending upon the column width. There may be some kind of option to insert a GUI line break that doesn't appear in the CSV (maybe CTL-Enter)? Not sure that is possible.

However, you are quite correct in that multiple lines within a column is something to be considered -- think about a single field for an address instead of multiple columns for each line of the address.

All of the CSV files that I currently work with containing addresses are | delimited, have separate columns for each potential line of the address and disallow the | char within an address. So a bit of tunnel vision on my part! Sorry!

You are quite correct to point out this possibility.

BTW: I've seen CSV files with 512 or 1024 fields. These things can have humongous line lengths. Perl is very good at getting me the dozen or so fields that I care about.

  • Comment on Re^3: Complex file manipulation challenge

Replies are listed 'Best First'.
Re^4: Complex file manipulation challenge
by Tux (Canon) on Aug 14, 2019 at 06:32 UTC

    Your vision on CSV is indeed very limited :)

    Consider not only Excel (or other spreadsheet application) exports, but also:

    • Database exports (including images, BLOB's, XML, Unicode, …)
    • Log exports (I know of a situation that has to read 4Tb (tera-byte!) a day
    • CSV exports where not only the data, but also the header-row has embedded newlines in the fields (and comma's)
    • CSV files with mixed encoding (you should know that Oracle supports field-scoped encodings in their most recent versions)
    • Nested CSV: each/any field in the CSV is (correctly or incorrectly quoted) CSV, but the final result is valid CSV
    • I've seen CSV files with more than 65535 columns.

    All of the above should remember you never to use regular expressions or read-by-line algorithms to parse CSV. It looks too easy to be true.

    Now reconsider you last line: a CSV file does not have a humongous line length. It is likely to have a humongous record length. (Think of a database export where a table has stored movies in parts and each record has up to 4 pieces of the movies, so each CSV record can be Gb's. People use databases and CSV for weird things.


    Enjoy, Have FUN! H.Merijn

      I‘m wondering why CSV isn’t replaced with JSON. Doesn’t have PostgreSQL have row_to_json? One row as JSON array and good is? I don’t know in a hurry what Sybase, Oracle or MySQL provide but i guess that they come with something similar. Writing some stored procedure might be an option. And probably there is some fubar Excel macro that does the same. Processing such a file line by line with JSON::Tiny and/or with something from the MCE toolbox should work like a charm. Regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

        I‘m wondering why CSV isn’t replaced with JSON.

        "Millionen Fliegen können nicht irren." (1)

        CSV is so widely used to connect systems you don't even have dreamed of, it won't go away any soon. If all parties agree on a common subset of CSV (i.e. they define away the evil edge cases Text::CSV cares about), a CSV parser is trivial to implement in many systems. A fully featured JSON parser is much harder, plus it may generate unexpected data structures. Assume your CSV-replacement JSON is defined as an array of arrays. What if your CSV-replacement-JSON parser is suddenly fed an array of hashes, or an AoAoA? What if array elements are not simple scalars? Plus, JSON requires some interpunction (e.g. for keys) that is optional in CSV, so the JSON files might get bigger.

        Alexander

        (1) Did you know that electronic delivery notes for blood products are specified as floppy disks containing CSV files? See for yourself: https://dgti.de/docs/doclink/10158/Lieferscheindiskette_Version2_2.doc (MS Word Format, german). And yes, this format is still in use, still on floppy disks, but also via e-mail. I know at least 10 different systems implemented in at least five different languages that use exactly this format, at least in Germany and Austria. Every change in the format requires changing ALL of that systems, followed by a complete verification of the systems. No one wants to pay for that, so changes are rare, even if the format sucks.

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)