in reply to Re^3: looping efficiency
in thread looping efficiency

I will have to check that out, but superficially I wouldn't think so. Each run of the script typically involves 2,000-7,000 files (not the entire 10,000), and the files range from 3 MB to over 5 MB, which would make a fairly massive database. The secondary processing of the files after writing takes advantage of the sequential numbering, and the operations involved don't really lend themselves to database lookups (and involve other external programs). In the second stage I typically process subsets of 400 to 1,000 files at a time, just to keep the final output files (which combine the data from the original files) to a reasonable size.
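
To illustrate that second stage, here is a minimal sketch of picking one numbered subset of files for processing; the "chunk_NNNNNN.dat" pattern and the range are placeholders, not the actual naming scheme:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: pick one numbered subset (here files 401-800) out
# of a directory of sequentially numbered files. The "chunk_NNNNNN.dat"
# pattern is a placeholder, not the real naming scheme.
my ($first, $last) = (401, 800);
my @subset = grep {
    my ($n) = /chunk_(\d+)\.dat$/;
    defined $n && $n >= $first && $n <= $last;
} glob('chunk_*.dat');

# ... hand @subset to the second-stage processing ...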

Replies are listed 'Best First'.
Re^5: looping efficiency
by Marshall (Canon) on Dec 30, 2020 at 08:24 UTC
    Geez, as a middle point, 5,000 files times 4 MB each is 20,000 MB => 20 GB. You are writing that much data to the file system in the first place. Your app is much bigger than I thought. But a DB can handle that, even SQLite on a 64-bit processor. The processing that you do with this data is unclear, as is the size of the result set.
      "...even SQLite..."

      See limits.

      From ibidem:

      "Every database consists of one or more "pages". Within a single database, every page is the same size, but different database can have page sizes that are powers of two between 512 and 65536, inclusive. The maximum size of a database file is 4294967294 pages. At the maximum page size of 65536 bytes, this translates into a maximum database size of approximately 1.4e+14 bytes (281 terabytes, or 256 tebibytes, or 281474 gigabytes or 256,000 gibibytes)."

      Should be enough, I guess.

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

        Thank you for the excellent link!

        The SQLite file will have to fit within the maximum file size of a particular O/S's filesystem, and some of the SQLite parameters referenced in your link may have to be adjusted. I have never done so, but I am confident that my 64-bit Windows NTFS file system can deal with a 25 GB file.
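
        As a rough, untested sketch (assuming DBD::SQLite is installed; "big.db" is a placeholder name), the relevant limits can be inspected with PRAGMAs:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Rough sketch (assumes DBD::SQLite; "big.db" is a placeholder):
        # inspect the page size and page-count limit that together determine
        # the maximum database size discussed in the linked limits page.
        my $dbh = DBI->connect('dbi:SQLite:dbname=big.db', '', '',
                               { RaiseError => 1 });

        my ($page_size) = $dbh->selectrow_array('PRAGMA page_size');
        my ($max_pages) = $dbh->selectrow_array('PRAGMA max_page_count');

        printf "page_size=%d bytes, max_page_count=%d, max DB size ~ %.0f GB\n",
               $page_size, $max_pages, $page_size * $max_pages / 2**30;

        $dbh->disconnect;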

      > Geez, as a middle point, 5,000 files times 4 MB each is 20,000 MB => 20 GB.

      Yep. I get data sets greater than 25 GB all the time.

      >You are writing that much data to the file system in the first place.

      The bottleneck (of course) comes from writing the data to an old-fashioned hard disk (things would obviously proceed much more quickly if I had 64 GB of RAM and could write everything to a RAM disk :) ), so I never considered the possibility that writing 7,000 separate files slowed things down much.

      > Your app is much bigger than I thought.

      I wouldn't describe the app as large at all (obviously the data sets weigh in heavily). :) Beyond the loops, the script only reads in a one-line second file and a small (less than 1 MB) third file, tests for EOF on the third file, and sets a variable based on changing values in that file; the remainder (and only significant part) consists of a one-line system call. The entire script doesn't exceed 1,500 bytes, commented and not obfuscated in any way.
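
      For what it's worth, the overall shape is roughly this purely illustrative sketch (the filenames, the field layout, and "external_prog" are all placeholders, not the real ones):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Purely illustrative sketch of the structure described above; the
      # filenames, the field layout, and "external_prog" are placeholders.
      chomp(my $param = do {
          open my $fh, '<', 'second_file.txt' or die "second_file.txt: $!";
          scalar <$fh>;                      # the one-line second file
      });

      open my $third, '<', 'third_file.txt' or die "third_file.txt: $!";
      my $current = '';
      until (eof $third) {                   # test for EOF on the third file
          chomp(my $line = <$third>);
          my ($tag, $value) = split ' ', $line, 2;
          next unless defined $value;
          $current = $tag if $tag ne $current;   # variable tracks changing values

          # the one-line system call that does the real work
          system("external_prog --param $param --tag $current $value") == 0
              or warn "external_prog failed on '$value': $?";
      }
      close $third;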

      > But a DB can handle that, even SQLite for 64 bit processor.

      Again, the external programs that do the subsequent processing don't work with databases, but I will definitely keep the technique in mind.

      > The processing that you do with this data is unclear as well as the size of the result set.

      I have kept that rather vague (nothing illegal, but the possibility exists of a "terms of use" violation somewhere along the way). :) The final result ends up as only slightly smaller than the input files (with reformatting and some selective discarding).

      That said, I have come up with a few ideas for the "upstream" script that would completely eliminate the need for this one. It would replace the naming system with one that maintains sort order, and it could combine some of the files at acquisition time (doing in that script the processing that this script does), which would reduce the number of files (but not the total size of the data) by about an order of magnitude.
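
      For the renaming idea, a zero-padded counter is probably the simplest way to keep lexical and numeric order identical; a trivial sketch (the pattern and width are made up):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Trivial sketch: zero-pad the sequence number to a fixed width so a
      # plain string sort matches numeric order. Pattern and width are made up.
      for my $n (1, 9, 10, 123, 7000) {
          printf "chunk_%06d.dat\n", $n;     # chunk_000001.dat, chunk_000009.dat, ...
      }
      # Without padding, "chunk_10.dat" sorts before "chunk_9.dat" lexically.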

      Thanks again for your help and suggestions.

        Writing to a single DB versus 5,000 separate files could speed things up considerably.

        There are various buffering issues that affect throughput to the hard disk.
        For INSERTs into a DB, the concept of a "transaction" matters a lot.
        The database will attempt to maintain ACID properties.
        The number of transactions per second is important.

        A million INSERTs can be run as a single transaction.
        In my SQLite testing, about 128K INSERT rows per transaction works just fine;
        waiting until a million rows have accumulated doesn't make any significant further performance difference.
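
        A rough sketch of that batching with DBD::SQLite (the table, column, and sample data are made up) might look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Sketch of batched INSERTs (assumes DBD::SQLite; the table, column,
        # and @rows data are made up). Grouping many INSERTs into one
        # transaction avoids paying the per-transaction durability cost
        # (fsync) on every single row.
        my $dbh = DBI->connect('dbi:SQLite:dbname=big.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, payload TEXT)');

        my $sth        = $dbh->prepare('INSERT INTO records (payload) VALUES (?)');
        my $batch_size = 128_000;                 # ~128K rows per transaction
        my $count      = 0;

        my @rows = map { "payload $_" } 1 .. 5;   # stand-in for the real data
        for my $row (@rows) {
            $sth->execute($row);
            $dbh->commit unless ++$count % $batch_size;   # commit a full batch
        }
        $dbh->commit;                             # flush the final partial batch
        $dbh->disconnect;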

        An obvious thing to do is to buy an SSD. That eliminates the rotational delays associated with a normal HD.
        Do not use that SSD for the O/S; use it just for this big 25 GB file.

        Update: I am still not sure why you need those leading zeroes.