
Re^6: Complex file manipulation challenge

by afoken (Chancellor)
on Aug 14, 2019 at 15:21 UTC

in reply to Re^5: Complex file manipulation challenge
in thread Complex file manipulation challenge

"I'm wondering why CSV isn't replaced with JSON."

"Millionen Fliegen können nicht irren." (1)

CSV is so widely used to connect systems you haven't even dreamed of that it won't go away anytime soon. If all parties agree on a common subset of CSV (i.e. they define away the evil edge cases Text::CSV cares about), a CSV parser is trivial to implement in many systems. A fully featured JSON parser is much harder, plus it may generate unexpected data structures. Assume your CSV-replacement JSON is defined as an array of arrays. What if your CSV-replacement-JSON parser is suddenly fed an array of hashes, or an AoAoA? What if the array elements are not simple scalars? Plus, JSON requires some punctuation (e.g. quotes around keys) that is optional in CSV, so the JSON files may get bigger.
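Such a defined-away subset really is trivial to parse. A minimal sketch in Perl (the ';' separator and the sample record are made up for illustration, not taken from any real spec):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Parse one line of a restricted CSV subset: no quoting, no embedded
# separators, no embedded newlines. That is the whole parser.
sub parse_simple_csv_line {
    my ($line) = @_;
    chomp $line;
    return [ split /;/, $line, -1 ];   # -1 keeps trailing empty fields
}

my $fields = parse_simple_csv_line("4711;O pos;2019-08-14;;done\n");
print scalar(@$fields), "\n";   # 5 - the empty field is preserved
```

Compare that with what a conforming JSON parser has to handle (nesting, escapes, Unicode, numbers) and the attraction of the subset becomes obvious.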


(1) Did you know that electronic delivery notes for blood products are specified as floppy disks containing CSV files? See for yourself: (MS Word format, German). And yes, this format is still in use, still on floppy disks, but also via e-mail. I know at least ten different systems, implemented in at least five different languages, that use exactly this format, at least in Germany and Austria. Every change to the format requires changing ALL of those systems, followed by a complete verification of each of them. No one wants to pay for that, so changes are rare, even if the format sucks.

Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)

Re^7: Complex file manipulation challenge
by karlgoethebier (Abbot) on Aug 14, 2019 at 17:30 UTC
    ""Millionen Fliegen..."

    Sure, I know. And some day the sky will fall on our heads.

    But wait: data meant for exchange (which is what CSV is made for) should be sound and well-defined.

    To me, a line of such a file is an array with n slots. If anything else shows up, something went wrong. A reason to die.
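    That "reason to die" can be spelled out directly with Text::CSV. A sketch (the file name and the column count of 5 are invented for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $expected = 5;   # the agreed-upon number of slots per line
my $csv = Text::CSV->new({ binary => 1, auto_diag => 2 })   # 2 = die on parse error
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

open my $fh, '<', 'delivery.csv' or die "delivery.csv: $!";
while (my $row = $csv->getline($fh)) {
    # anything other than an array with n slots is a reason to die
    die "expected $expected fields, got " . @$row
        unless @$row == $expected;
    # ... process @$row ...
}
close $fh;
```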

    And using CSV for data exchange is a total abuse. Often it's because the different departments in some company restrict or don't allow access to their holy grail, AKA their database. Remember the Turnschuhnetzwerk (sneakernet). They send it as an e-mail attachment, hand over USB sticks, a DVD, etc. A practice that makes admins shudder. Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

      "using CSV for data exchange is a total abuse"

      OK, let's play a game of delivering blood products:

      It's about 1990. The Internet is a place for nerds who know ftp, telnet, Unix, and VMS; the WWW is in its infancy. "Data" means either text files or some proprietary binary format. There is no JSON, because JavaScript has not yet been invented. Java is six years in the future; SSH and TLS will be implemented in five years.

      You have hospitals all over the country needing blood, and about twenty to thirty independent blood services of very different sizes. Making all of those parties cooperate is a heroic effort, despite the fact that about 50% of the blood services and several of the hospitals operate under the same very famous brand name. You end up with a lot of doctors and very few people from the IT departments in a room, discussing everything but the exchange of data. To make things more interesting, only a few of the IT people actually know anything about IT. That's the meat-bag side of the problem.

      The technical side of the problem: each hospital and each blood service has its very own set of hard- and software. In the medical sector, companies can easily survive by selling and supporting a two-digit number of installations if they find their niche. So a large number of the systems you find are proprietary, and if you're lucky, they can read text from a floppy. Some software runs on Windows NT, other software on CP/M, OS/2, DOS, Multi-User DOS, QNX, classic MacOS, and some even runs on the bare metal, without an operating system. Some systems are networked using Novell NetWare, Windows shares, switched RS-232 lines, ISDN, LapLink cables, sneakernet, ARCNET, capturing the parallel printer output of one system on another system, you name it. Other systems are isolated, some intentionally, some due to lack of knowledge. Software is written in Pascal, Delphi, C, C++, FORTRAN, COBOL, MUMPS, proprietary C dialects, BASIC, DOS batch files, SQL dialects, early versions of VBA, and other nightmares, and, to drive you really mad, in random combinations of those languages. Backups are rare, verified backups even rarer. Absolutely no one wants to restore backups. Test environments exist only in your dreams.

      Now find a data exchange format that is supported on all of those systems, on all of those networks, and can be written in all of those languages, with as little effort and knowledge as possible, by people qualified as medical lab assistants, scientific divers, software sellers, or, if you are lucky, database or network admins. In short: people who can barely spell "parser" and have absolutely no clue how to implement one. People who think that strings and arrays are highly complex data structures. Often, software is written in-house, or comes from one of those niche companies that will charge you a fortune for every line of code written by their "professionals" of similar qualification.

      To summarize: whatever data format you choose, it basically has to be parseable by reading lines from a text file and doing very basic string operations (substr() and index()). Writing is limited in a similar way: write a line, pad a string to a given number of characters, concatenate strings.
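      A parser at exactly that skill level fits in a dozen lines. A sketch using nothing but index() and substr() (the ';' separator and the sample record are assumptions, not the real spec):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a line on ';' using only index() and substr() -- the level of
# string handling available on even the most limited of those systems.
sub split_fields {
    my ($line) = @_;
    chomp $line;
    my @fields;
    my $start = 0;
    while ((my $pos = index($line, ';', $start)) >= 0) {
        push @fields, substr($line, $start, $pos - $start);
        $start = $pos + 1;
    }
    push @fields, substr($line, $start);   # last field
    return @fields;
}

my @f = split_fields("4711;O pos;2019-08-14");
print "$_\n" for @f;   # 4711 / O pos / 2019-08-14
```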

      If you have a closer look at the DGTI format spec, you will find that the file format is not only CSV, but can also be read and written using fixed-width columns. This is very intentional, because some of the systems can handle fixed width but not CSV, and vice versa.
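      The fixed-width flavor is just as cheap in Perl via sprintf and unpack. A sketch (the column widths 6/10/10 are invented for illustration; they are not DGTI's):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# The same record as fixed-width columns: write with sprintf padding,
# read back with unpack. 'A' strips trailing spaces when unpacking.
my $template = 'A6 A10 A10';

my $line = sprintf "%-6s%-10s%-10s", '4711', 'O pos', '2019-08-14';
my ($id, $group, $date) = unpack $template, $line;
print "$id|$group|$date\n";   # 4711|O pos|2019-08-14
```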

      And if you have worked with the IT side of medical lab equipment, you will know that CSV and fixed-width files, or serial data streams, are very commonly used, for similar reasons. Yes, every lab equipment company has a wide variety of data management systems that allow centrally managing all data from all machines they have ever sold, but the sad reality is that they have to connect to legacy systems, different for each customer. And, as with the delivery notes for blood products, CSV and fixed width are ugly, but they do work.

      Using a simple serial line with CSV or fixed width to transport a batch job to the device and a result back has a huge security advantage: you can keep the computers on both sides of the serial line isolated from each other. One internal network with all of your precious data, and one "dirty" lab network that allows remote management by the manufacturer. There is no way for malware to get into your internal network (assuming the CSV parser can't be exploited), as there is only a CSV parser. No command interface, no PPP connection, no Windows shares. Everything not looking like the expected data is simply discarded. So it's no problem if one of the field technicians for your lab equipment arrives with a laptop full of malware collected in other networks. The malware won't get into your internal network. See also Re: What's the right way to include a config file as a .pm bareword? and Re: Accessing variables in an external hash without eval
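      That "discard everything else" gateway is essentially a whitelist filter on the serial link. A sketch (the record layout -- a numeric ID, a short alphanumeric field, and a date, ';'-separated -- is an assumption for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Whitelist filter between the "dirty" lab side and the internal side:
# only lines matching the expected record shape pass; everything else
# is silently dropped, no matter what it contains.
while (my $line = <STDIN>) {
    chomp $line;
    next unless $line =~ /\A\d{1,6};[A-Za-z0-9 ]{1,10};[0-9][0-9-]{9}\z/;
    print "$line\n";
}
```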

      Fast forward to 2019:

      Those legacy systems still work, and more have grown over the decades. DIN/EN/IEC 60601, ISO 13485, and related standards have grown too, requiring tons of paperwork if you dare to change any of the systems, and even more if you try to build a new one. Plus, changing an established standard to something completely new requires that all parties agree on using the new standard. You need to get all of those nasty meat bags into one room and yell "you are doing it wrong, and you will have to pay to fix it" at them. Good luck!


        Thank you very much for this great in-depth digression about IT history.

        But I don't give up so easily. Remember what day it is:

        Karls-Mac-mini:~ karl$ date
        Sa 17 Aug 2019 19:44:17 CEST
        "...legacy systems still work..."

        That is O.K. But I'm still convinced that JSON is the way to go now. On the other hand, decades ago I "endorsed" XML in the companies I was with. Yet another "reach into the toilet" (a German idiom for a flop). Best regards, Karl

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
