in reply to Re^3: looping efficiency
in thread looping efficiency

I will have to check that out, but superficially I wouldn't think so. Each run of the script typically involves 2,000-7,000 files (not the entire 10,000), and the files range from 3 MB to over 5 MB, which would make a fairly massive database. The secondary processing of the files after writing takes advantage of the sequential numbering, and the operations involved don't really lend themselves to database lookups (and involve other external programs). In the second stage I typically process subsets of 400 to 1,000 files at a time, just to keep the final output files (which combine the data from the original files) to a reasonable size.
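
To illustrate that second stage, here is a minimal sketch of picking one numbered subset of files for processing; the "chunk_NNNNNN.dat" pattern and the range are placeholders, not the actual naming scheme:

#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: pick one numbered subset (here files 401-800) out
# of a directory of sequentially numbered files. The "chunk_NNNNNN.dat"
# pattern is a placeholder, not the real naming scheme.
my ($first, $last) = (401, 800);
my @subset = grep {
    my ($n) = /chunk_(\d+)\.dat$/;
    defined $n && $n >= $first && $n <= $last;
} glob('chunk_*.dat');

# ... hand @subset to the second-stage processing ...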

Replies are listed 'Best First'.
Re^5: looping efficiency
by Marshall (Canon) on Dec 30, 2020 at 08:24 UTC
    Geez, as a middle point, 5,000 files times 4 MB each is 20,000 MB => 20 GB. You are writing that much data to the file system in the first place. Your app is much bigger than I thought. But a DB can handle that, even SQLite on a 64-bit processor. The processing that you do with this data is unclear, as is the size of the result set.
      "...even SQLite..."

      See limits.

      From ibidem:

      "Every database consists of one or more "pages". Within a single database, every page is the same size, but different database can have page sizes that are powers of two between 512 and 65536, inclusive. The maximum size of a database file is 4294967294 pages. At the maximum page size of 65536 bytes, this translates into a maximum database size of approximately 1.4e+14 bytes (281 terabytes, or 256 tebibytes, or 281474 gigabytes or 256,000 gibibytes)."

      Should be enough, I guess.

      «The Crux of the Biscuit is the Apostrophe»

      perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'

        Thank you for the excellent link!

        The SQLite file will have to fit within the maximum file size of a particular O/S's filesystem, and some of the SQLite parameters referenced in your link may have to be adjusted. I have never done so, but I am confident that my 64-bit Windows NTFS file system can deal with a 25 GB file.
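
        As a rough, untested sketch (assuming DBD::SQLite is installed; "big.db" is a placeholder name), the relevant limits can be inspected with PRAGMAs:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Rough sketch (assumes DBD::SQLite; "big.db" is a placeholder):
        # inspect the page size and page-count limit that together determine
        # the maximum database size discussed in the linked limits page.
        my $dbh = DBI->connect('dbi:SQLite:dbname=big.db', '', '',
                               { RaiseError => 1 });

        my ($page_size) = $dbh->selectrow_array('PRAGMA page_size');
        my ($max_pages) = $dbh->selectrow_array('PRAGMA max_page_count');

        printf "page_size=%d bytes, max_page_count=%d, max DB size ~ %.0f GB\n",
               $page_size, $max_pages, $page_size * $max_pages / 2**30;

        $dbh->disconnect;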

      > Geez, as a middle point, 5,000 files times 4 MB each is 20,000 MB => 20 GB.

      Yep. I get data sets greater than 25 GB all the time.

      >You are writing that much data to the file system in the first place.

      The bottleneck (of course) comes from writing the data to an old-fashioned hard disk (things would obviously proceed much more quickly if I had 64 GB of RAM and could write everything to a RAM disk :) ), so I never considered the possibility that writing 7,000 separate files slowed things down much.

      > Your app is much bigger than I thought.

      I wouldn't describe the app as large at all (obviously the data sets weigh in heavily). :) Beyond the loops, the script only reads in a one-line second file and a small (less than 1 MB) third file, tests for EOF on the third file, and sets a variable based on changing values in that file; the remainder (and only significant part) consists of a one-line system call. The entire script doesn't exceed 1,500 bytes, commented and not obfuscated in any way.
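
      For what it's worth, the overall shape is roughly this purely illustrative sketch (the filenames, the field layout, and "external_prog" are all placeholders, not the real ones):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Purely illustrative sketch of the structure described above; the
      # filenames, the field layout, and "external_prog" are placeholders.
      chomp(my $param = do {
          open my $fh, '<', 'second_file.txt' or die "second_file.txt: $!";
          scalar <$fh>;                      # the one-line second file
      });

      open my $third, '<', 'third_file.txt' or die "third_file.txt: $!";
      my $current = '';
      until (eof $third) {                   # test for EOF on the third file
          chomp(my $line = <$third>);
          my ($tag, $value) = split ' ', $line, 2;
          next unless defined $value;
          $current = $tag if $tag ne $current;   # variable tracks changing values

          # the one-line system call that does the real work
          system("external_prog --param $param --tag $current $value") == 0
              or warn "external_prog failed on '$value': $?";
      }
      close $third;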

      > But a DB can handle that, even SQLite for 64 bit processor.

      Again, the external programs that do the subsequent processing don't work with databases, but I will definitely keep the technique in mind.

      > The processing that you do with this data is unclear as well as the size of the result set.

      I have kept that rather vague (nothing illegal, but the possibility exists of a "terms of use" violation somewhere along the way). :) The final result ends up as only slightly smaller than the input files (with reformatting and some selective discarding).

      That said, I have come up with a few ideas for the "upstream" script that would completely eliminate the need for this one. It would replace the naming system with one that maintains sort order, and it could combine some of the files at acquisition time (doing in that script the processing that this script does), which would reduce the number of files (but not the total size of the data) by about an order of magnitude.
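
      For the renaming idea, a zero-padded counter is probably the simplest way to keep lexical and numeric order identical; a trivial sketch (the pattern and width are made up):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # Trivial sketch: zero-pad the sequence number to a fixed width so a
      # plain string sort matches numeric order. Pattern and width are made up.
      for my $n (1, 9, 10, 123, 7000) {
          printf "chunk_%06d.dat\n", $n;     # chunk_000001.dat, chunk_000009.dat, ...
      }
      # Without padding, "chunk_10.dat" sorts before "chunk_9.dat" lexically.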

      Thanks again for your help and suggestions.

        Writing to a single DB versus 5,000 separate files could speed things up considerably.

        There are various buffering issues that affect throughput to the hard disk.
        For INSERTs into a DB, the concept of a "transaction" matters a lot.
        The database will attempt to maintain ACID properties.
        The number of transactions per second is important.

        A million INSERTs can be run as a single transaction.
        In my SQLite testing, about 128K INSERT rows per transaction works just fine;
        waiting until a million rows have accumulated doesn't make any significant further performance difference.
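
        A rough sketch of that batching with DBD::SQLite (the table, column, and sample data are made up) might look like this:

        #!/usr/bin/perl
        use strict;
        use warnings;
        use DBI;

        # Sketch of batched INSERTs (assumes DBD::SQLite; the table, column,
        # and @rows data are made up). Grouping many INSERTs into one
        # transaction avoids paying the per-transaction durability cost
        # (fsync) on every single row.
        my $dbh = DBI->connect('dbi:SQLite:dbname=big.db', '', '',
                               { RaiseError => 1, AutoCommit => 0 });
        $dbh->do('CREATE TABLE IF NOT EXISTS records (id INTEGER PRIMARY KEY, payload TEXT)');

        my $sth        = $dbh->prepare('INSERT INTO records (payload) VALUES (?)');
        my $batch_size = 128_000;                 # ~128K rows per transaction
        my $count      = 0;

        my @rows = map { "payload $_" } 1 .. 5;   # stand-in for the real data
        for my $row (@rows) {
            $sth->execute($row);
            $dbh->commit unless ++$count % $batch_size;   # commit a full batch
        }
        $dbh->commit;                             # flush the final partial batch
        $dbh->disconnect;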

        An obvious thing to do is to buy an SSD. That eliminates the rotational delays associated with a normal HD.
        Do not use that SSD for the O/S; use it just for this big 25 GB file.

        Update: I am still not sure why you need those leading zeroes.