in reply to Re^2: looping efficiency
in thread looping efficiency

I had no idea that you were going to generate 10,000 separate files!
A single SQLite DB file is likely to be far, far more efficient for whatever you are doing.
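
Something along these lines would do it (a minimal sketch, assuming DBD::SQLite is installed; the database/table names and the generate_chunk() routine are placeholders for whatever currently produces each file):

    use strict;
    use warnings;
    use DBI qw(:sql_types);

    # Minimal sketch: write every chunk of data as a BLOB row in a single
    # SQLite file instead of creating thousands of separate files.
    my $dbh = DBI->connect( 'dbi:SQLite:dbname=chunks.db', '', '',
        { RaiseError => 1, AutoCommit => 0 } );

    $dbh->do(q{
        CREATE TABLE IF NOT EXISTS chunks (
            seq  INTEGER PRIMARY KEY,   -- integer key keeps the rows ordered
            data BLOB NOT NULL
        )
    });

    my $sth = $dbh->prepare('INSERT INTO chunks (seq, data) VALUES (?, ?)');

    for my $seq ( 1 .. 10_000 ) {
        my $data = generate_chunk($seq);        # placeholder for the real data source
        $sth->bind_param( 1, $seq );
        $sth->bind_param( 2, $data, SQL_BLOB );
        $sth->execute;
    }

    $dbh->commit;    # one transaction instead of thousands of file creates
    $dbh->disconnect;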

Re^4: looping efficiency
by Anonymous Monk on Dec 30, 2020 at 06:36 UTC
    I will have to check that out, but (superficially) I wouldn't think so. Each run of the entire script typically involves 2000 to 7000 files (not the entire 10,000), but the files range from 3 MB to over 5 MB, and that would make a fairly massive database. The secondary processing of the files after writing takes advantage of the sequential numbering, and the operations involved don't really lend themselves to database lookups (and involve other external programs). In the second stage I typically process subsets of 400 to 1000 files at a time just to keep the final output files (which combine the data from the original files) to a reasonable size.
      Geez, as a midpoint, 5,000 files times 4 MB each is 20,000 MB => 20 GB. You are writing that much data to the file system in the first place. Your app is much bigger than I thought. But a DB can handle that, even SQLite on a 64-bit processor. The processing that you do with this data is unclear, as is the size of the result set.
        "...even SQLite..."

        See the SQLite limits page.

        From ibidem:

        "Every database consists of one or more "pages". Within a single database, every page is the same size, but different database can have page sizes that are powers of two between 512 and 65536, inclusive. The maximum size of a database file is 4294967294 pages. At the maximum page size of 65536 bytes, this translates into a maximum database size of approximately 1.4e+14 bytes (281 terabytes, or 256 tebibytes, or 281474 gigabytes or 256,000 gibibytes)."

        Should be enough, I guess.

        «The Crux of the Biscuit is the Apostrophe»

        perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

        > Geez, as a midpoint, 5,000 files times 4 MB each is 20,000 MB => 20 GB.

        Yep. I get data sets greater than 25 GB all the time.

        >You are writing that much data to the file system in the first place.

        The bottleneck (of course) comes from writing the data to an old-fashioned hard disk (things would obviously proceed much more quickly if I had 64 GB of RAM and could write everything to a RAM disk :) ), so I never considered the possibility that writing 7000 separate files slowed things down much.

        > Your app is much bigger than I thought.

        I wouldn't describe the app as large at all (obviously the sets of data weigh in heavily). :) Beyond the loops, reading in a one-line second file and a small (less than 1 MB) third file, testing for EOF on the third file, and setting a variable based on changing values in that third file, the remainder (and only significant part) of the script consists of a one-line system call. The entire script doesn't exceed 1500 bytes, commented and not obfuscated in any way.
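
        For what it's worth, a hypothetical skeleton of that structure (the file names, the tab-separated layout, and the external command are all invented; only the overall shape matches what I described):

            use strict;
            use warnings;

            # Hypothetical skeleton only; everything named here is made up.
            open my $param_fh, '<', 'params.txt' or die "params.txt: $!";
            chomp( my $fixed_param = <$param_fh> );          # the one-line second file
            close $param_fh;

            open my $list_fh, '<', 'filelist.txt' or die "filelist.txt: $!";

            my $current_group = '';
            until ( eof $list_fh ) {                         # explicit EOF test on the third file
                chomp( my $line = <$list_fh> );
                my ( $group, $file ) = split /\t/, $line;

                # variable set from changing values in the third file
                $current_group = $group if $group ne $current_group;

                # the one-line system call that does the real work
                system("some_external_tool --group $current_group --param $fixed_param $file") == 0
                    or warn "some_external_tool failed on $file: $?";
            }
            close $list_fh;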

        > But a DB can handle that, even SQLite on a 64-bit processor.

        Again, the external programs that do the subsequent processing don't work with databases, but I will definitely keep the technique in mind.
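
        If it ever becomes worth trying, the external tools wouldn't need to know about the database at all. A sketch (table, column, and tool names are made up) of dumping each row back out to a temporary file before calling them:

            use strict;
            use warnings;
            use DBI;
            use File::Temp qw(tempfile);

            # Sketch only: hand each stored chunk to an external program by
            # dumping it to a temporary file first.  Names are made up.
            my $dbh = DBI->connect( 'dbi:SQLite:dbname=chunks.db', '', '',
                { RaiseError => 1 } );
            my $sth = $dbh->prepare('SELECT seq, data FROM chunks ORDER BY seq');
            $sth->execute;

            while ( my ( $seq, $data ) = $sth->fetchrow_array ) {
                my ( $fh, $tmpname ) = tempfile( SUFFIX => '.dat', UNLINK => 1 );
                binmode $fh;
                print {$fh} $data;
                close $fh;
                system( 'some_external_tool', $tmpname ) == 0
                    or warn "some_external_tool failed on chunk $seq: $?";
            }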

        > The processing that you do with this data is unclear, as is the size of the result set.

        I have kept that rather vague (nothing illegal, but the possibility exists of a "terms of use" violation somewhere along the way). :) The final result ends up only slightly smaller than the input files (after reformatting and some selective discarding).

        That said, I have come up with a few ideas for the "upstream" script which would completely eliminate the need for this one. It would replace the naming scheme with one that maintains sort order, and it could combine some of the files at acquisition time (moving this script's processing into that one), which would reduce the number of files (but not the total size of the data) by about an order of magnitude.
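
        For the sort-order-preserving names, zero-padded sequence numbers would probably be enough; a tiny sketch (the prefix and field width are arbitrary):

            # Zero-padded names sort the same way lexically and numerically,
            # e.g. chunk_00042.dat always comes before chunk_00999.dat.
            my $name = sprintf 'chunk_%05d.dat', $seq;   # $seq: running counter in the upstream script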

        Thanks again for your help and suggestions.