Liebranca has asked for the wisdom of the Perl Monks concerning the following question:

Hello everyone,

My monolith makescript maker/syntax file generator/auto FFI-bindings emitter/parser/preprocessor/inliner/someday-to-be compiler thingy has a lot of hashes, alright. Translation tables, symbol tables, keywords organized by loose categories, lots of cool stuff.

Now, the data actually in use by the program is generated from Perl variables that are usually hashes as well; because there's some processing of these I need to do at init time, I figured I'd start saving these things to disk before it all gets too big and actually slows down startup.

I'm doing that with store/retrieve and already have a mechanism in place to either load the file, if it exists and no update is needed, or else regenerate it. This is done automatically in INIT blocks. Looks something like this:

my $result;
INIT { load_cache('name', \$result, \&generator, @data) };

^slightly abbreviated for clarity, but you get the idea. Now, this is fine but it essentially means I need a separate file for each instance of some structure, which is undesirable in my case.
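For completeness, a stripped-down version of the idea looks roughly like this (assuming Storable's store/retrieve, a naive mtime check and a generator that returns a reference; the real load_cache does a bit more bookkeeping):

    use Storable qw(store retrieve);

    sub load_cache {
        my ($name, $dst, $generator, @data) = @_;
        my $path = "./.cache/$name";    # illustrative path; assumes the dir exists

        if (-e $path && -M $path < -M $0) {   # cache is newer than the script
            $$dst = retrieve $path;
        }
        else {
            $$dst = $generator->(@data);      # regenerate ...
            store $$dst, $path;               # ... and save for next time
        }
    }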

I'd much rather do this per-package, or a multitude of packages even, and it wouldn't really be too difficult to implement. So what's the question? There's no question. But I'd like to request some general advice on *local* databases, meaning my own computer: no cloud, no net, no servers no mambo, I save things to disk and no one else needs to know.

See, I can't duck for "database" without getting flooded with absolutely irrelevant results about frameworks for whatever it is modern web developers and Java mongers are concerned with. It's ridiculous and it's driving me crazy.

So... tips? Conventional wisdom? Pitfalls? What to watch out for? That kind of stuff. It might be mostly just things I already know but I'd rather hear them twice than never.

Just for context, I'm on a half-burned, half-dead, decade-old two-core CPU, and the biggest file in this scenario is, what, 64 KB. Absolutely *gargantuan* quantities of data. But I'm interested in efficiently storing this program data uncompressed, so that I don't end up with a million small files that need to be read individually at startup.

Cheers, lyeb.

free/libre post licensed under gnu gplv3; your quotes will inherit.

Replies are listed 'Best First'.
Re: Big cache
by Corion (Patriarch) on Jul 29, 2022 at 05:41 UTC

    Personally, I'm storing JSON blobs in SQLite as a crude form of persistence. As key columns I use the SHA256 of __FILE__ and some application-specific key. This automatically invalidates the cache whenever I change the code.
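
    Roughly, a sketch of what I mean (table layout, file name and helper names here are just for illustration, not my actual code; the id is a hash of this file's contents read from __FILE__):

        use DBI;
        use JSON::PP    qw(encode_json decode_json);
        use Digest::SHA qw(sha256_hex);

        my $dbh = DBI->connect('dbi:SQLite:dbname=cache.sqlite', '', '',
                               { RaiseError => 1 });
        $dbh->do('CREATE TABLE IF NOT EXISTS cache (
                      code_id TEXT, key TEXT, blob TEXT,
                      PRIMARY KEY (code_id, key))');

        # hash of this file's contents: changing the code changes the id
        my $code_id = sha256_hex(do { local (@ARGV, $/) = (__FILE__); <> });

        sub cache_set {
            my ($key, $data) = @_;
            $dbh->do('INSERT OR REPLACE INTO cache VALUES (?, ?, ?)',
                     undef, $code_id, $key, encode_json($data));
        }

        sub cache_get {
            my ($key) = @_;
            my ($blob) = $dbh->selectrow_array(
                'SELECT blob FROM cache WHERE code_id = ? AND key = ?',
                undef, $code_id, $key);
            return defined $blob ? decode_json($blob) : undef;
        }

    Entries written by an older version of the code simply stop matching; an occasional DELETE of rows with a different code_id keeps the file from growing.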

    Another way to potentially add transparent caching is Memoize, together with its persistence options.
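
    For persistence, something along these lines works (again just a sketch; the DBM file name and the example function are made up, and the cached values need to be plain scalars to round-trip through the DBM file):

        use Memoize;
        use DB_File;
        use Fcntl qw(O_RDWR O_CREAT);

        sub expensive_lookup {
            my ($key) = @_;
            # stand-in for the slow work being cached
            return scalar reverse $key;
        }

        # tie the cache to a DBM file so memoized results survive restarts
        tie my %disk_cache, 'DB_File', 'memoize.db', O_RDWR | O_CREAT, 0644
            or die "tie: $!";

        memoize('expensive_lookup', SCALAR_CACHE => [HASH => \%disk_cache]);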

Re: Big cache (my top ten software development practices)
by eyepopslikeamosquito (Archbishop) on Jul 29, 2022 at 08:52 UTC

    I'd like to request some general advice on *local* databases ... So... tips? Conventional wisdom? Pitfalls? What to watch out for? That kind of stuff.

    For fun, let's start with my top ten list of general software development practices (adapted from On Coding Standards and Code Reviews):

    • Correctness, simplicity and clarity come first. Avoid unnecessary cleverness.
    • Systems should be designed as a set of cohesive modules as loosely coupled as is reasonably feasible.
    • Minimize exposure of module implementation; provide stable interfaces to protect programs from the details of the implementation (which are likely to change).
    • Design components that can be easily tested in isolation.
    • Add new test cases before you start debugging.
    • Adopt a policy of zero tolerance for warnings and errors.
    • Establish a rational error handling policy and follow it strictly. Handle all errors (e.g. don't ignore error returns). Fail securely.
    • Use least privilege; only run with superuser privilege when you need to.
    • Don't optimize prematurely. Benchmark before you optimize. Comment why you are optimizing.
    • Always code as if the guy who ends up maintaining your code will be a violent psychopath who knows where you live.

    So I'd start by defining your "local database interface" based on your requirements. You could then try an implementation of it using, for example, the excellent suggestions above from Discipulus. If your interface is well designed, you could further experiment with different implementations of it. If performance were crucial, you could benchmark alternative implementations of your local database interface using different technologies - comparing the performance of built-in Perl hashes with an in-memory SQLite database and Judy arrays, for example.
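
    As a concrete illustration of that last comparison, a quick benchmark of a plain hash lookup against an in-memory SQLite lookup might look like this (schema and key names are made up; it only shows the shape of the comparison):

        use Benchmark qw(cmpthese);
        use DBI;

        my %hash = map { ("key$_" => $_) } 1 .. 10_000;

        my $dbh = DBI->connect('dbi:SQLite:dbname=:memory:', '', '',
                               { RaiseError => 1 });
        $dbh->do('CREATE TABLE kv (k TEXT PRIMARY KEY, v INTEGER)');
        my $ins = $dbh->prepare('INSERT INTO kv (k, v) VALUES (?, ?)');
        $ins->execute("key$_", $_) for 1 .. 10_000;
        my $sel = $dbh->prepare('SELECT v FROM kv WHERE k = ?');

        cmpthese(-2, {
            hash   => sub { my $v = $hash{key5000} },
            sqlite => sub {
                $sel->execute('key5000');
                my ($v) = $sel->fetchrow_array;
            },
        });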

    If performance is critical for your application, you might be interested in:

      Use least privilege; only run with superuser privilege when you need to.

      Yes, and also take it with a grain of salt if your OS vendor/distributor/kernel author/Linus Torvalds says that you need to be superuser to do a certain thing.

      For example, by default you need to be root to listen on network ports below 1024, meaning that nearly ALL default network programs on your server need to at least have elevated privileges while starting up (or need some sort of port forwarding stuff that's inflexible, awkward and easy to get wrong). This is especially annoying and potentially dangerous when you are actively developing software (like a webserver or a nameserver).

      I usually run my system with net.ipv4.ip_unprivileged_port_start=0. This way, root is no longer required to run your DIY webserver or nameserver (or to debug them in the IDE).

      PerlMonks XP is useless? Not anymore: XPD - Do more with your PerlMonks XP
        For example, by default you need to be root to listen on network ports below 1024, meaning that nearly ALL default network programs on your server need to at least have elevated privileges while starting up (or need some sort of port forwarding stuff that's inflexible, awkward and easy to get wrong).

        See below the line.

        This is especially annoying and potentially dangerous when you are actively developing software (like a webserver or a nameserver).

        Right.

        I usually run my system with net.ipv4.ip_unprivileged_port_start=0. This way no more root required to run your DIY webserver or nameserver (or to debug them in the IDE).

        And so, even the least privileged user can run DNS, Mail, Web, FTP, whatever servers. That's not secure.


        Safely starting an unprivileged TCP server on a privileged port (i.e. port < 1024) that runs entirely without root privileges is a solved problem. You need a tiny privileged program that opens the socket, then drops privileges, and finally exec()s the real server, which inherits the opened socket filehandle and listens on that handle.
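
        A minimal sketch of such a launcher in Perl (the target program name, user name and the LISTEN_FD convention are illustrative; real code should also drop supplementary groups and check things more carefully):

            #!/usr/bin/perl
            use strict;
            use warnings;
            use IO::Socket::INET;
            use POSIX qw(setuid setgid);
            use Fcntl qw(F_SETFD);

            # open the privileged port while still running as root
            my $sock = IO::Socket::INET->new(
                LocalPort => 80,
                Listen    => 128,
                ReuseAddr => 1,
            ) or die "bind: $!";

            # keep the socket open across exec (Perl marks fds above $^F close-on-exec)
            fcntl($sock, F_SETFD, 0) or die "fcntl: $!";

            # drop privileges: group first, then user
            my ($uid, $gid) = (getpwnam('www-data'))[2, 3];
            setgid($gid) or die "setgid: $!";
            setuid($uid) or die "setuid: $!";

            # hand the listening socket to the unprivileged server
            $ENV{LISTEN_FD} = fileno($sock);
            exec './real-server' or die "exec: $!";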

        Alexander

        --
        Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Big cache
by hippo (Archbishop) on Jul 29, 2022 at 09:50 UTC

    Whatever you decide on, I would strongly suggest fronting it with CHI so that the details are abstracted away from your calling code. You can then use (and subsequently change) the back-end of your choice.
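
    A minimal sketch of what that looks like (the File driver and the cache path are just placeholders; swapping back-ends means changing only this constructor):

        use CHI;

        my $cache = CHI->new(
            driver   => 'File',
            root_dir => './.chi-cache',
        );

        # compute() returns the cached value, or runs the code and caches it
        my $table = $cache->compute('symbol_table', '1 week', sub {
            # stand-in for the expensive generation step
            return { keyword => 'value' };
        });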


    🦛

      Fantastic hippo!

      Do you use it already? It seems like a really cool module.

      L*

      There are no rules, there are no thumbs..
      Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.

        Yes, I do use it quite a bit. It presents a uniform interface to a variety of back-ends and so essentially performs the same function for caching as DBI does for databases. My testing of it has shown negligible overhead for my use cases but of course every user should do their own benchmarking just in case.

        If you use caching at all then I thoroughly recommend CHI.


        🦛

        Disci+ I also use CHI a lot; note that there is a Dancer/Dancer2 plugin where you can use CHI as your cache layer and swap out back-ends by just changing the config of the app. Thus you can have the same code for the test environment and production, and just specify the actual cache engine in the relevant config, e.g. in-memory for testing and a DB for production.

        Hope this helps!


        The way forward always starts with a minimal test.
Re: Big cache -- serialization
by Discipulus (Canon) on Jul 29, 2022 at 07:13 UTC
    Hello Liebranca and welcome to the monastery,

    > But I'd like to request some general advice on local databases, meaning my own computer: no cloud, no net, no servers no mambo, I save things to disk and no one else needs to know.

    I often miss the point of questions, but you might be interested in the core module Storable (Data::Dumper can also be used for this), or in Sereal from CPAN, among others. If you need to inspect the saved structure you might prefer YAML.

    See also: data-serialization-in-perl
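
    A tiny sketch with Storable (the file name is chosen at random):

        use Storable qw(nstore retrieve);

        my %table = ( keyword => 'value', other => [1, 2, 3] );

        nstore \%table, 'table.cache';           # write (portable byte order)
        my $restored = retrieve 'table.cache';   # read it back as a hashref
        print $restored->{keyword}, "\n";        # prints "value"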

    L*

    There are no rules, there are no thumbs..
    Reinvent the wheel, then learn The Wheel; may be one day you reinvent one of THE WHEELS.
Re: Big cache
by Liebranca (Acolyte) on Jul 29, 2022 at 19:11 UTC

    Thanks everyone c: I got some useful pointers here.

    Came up with a pretty simple solution: I just added an import sub to the package that implements the cache loading; all it does is register other packages using it. The load function is then responsible for registering the objects that need to be read from disk or regenerated. Which gives us the following structure:

    $cache = {
        'package' => {
            'object' => [\$result, \&generator, @data],
        },
    };
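
    In rough terms, the registration looks like this (a sketch of the idea, not a verbatim copy of my code):

        my %cache;    # package => { object => [\$result, \&generator, @data] }
        my @order;    # registration order; used to build the index stack

        sub import {
            my $pkg = caller;
            $cache{$pkg} //= {};
            push @order, $pkg;
        }

        sub load_cache {
            my ($name, $dst, $generator, @data) = @_;
            my $pkg = caller;
            $cache{$pkg}{$name} = [$dst, $generator, @data];
        }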

    The order in which the packages and their objects are registered is also kept; from this we make a stack of indices. And then, this is how I handled writing to disk:

    for my $block (@blocks) {

        $block = freeze($block);
        $body .= $block;

        push @header, $header[-1] + length $block;

    }

    Where @blocks holds the objects themselves and @header is a list of offsets into the file, both corresponding to the index stack. One can then just write it all into a single file:

    unshift @header, int(@header);

    my $header = $signature.(
        pack 'L' x @header, @header
    );

    print {$FH} $header.$body;

    And then open, seek and read; so I can save data in big blocks but only keep in memory the ones I need.
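
    The read side, roughly (a sketch; it assumes the first offset in the table is 0 and that $sig_len is the length of the signature written before the header):

        use Storable qw(thaw);

        sub read_block {
            my ($path, $i, $sig_len) = @_;
            open my $FH, '<:raw', $path or die "open: $!";

            # offset table: a count followed by that many 'L'-packed offsets
            seek $FH, $sig_len, 0;
            read $FH, my $buf, 4;
            my $count = unpack 'L', $buf;
            read $FH, $buf, 4 * $count;
            my @offset = unpack "L$count", $buf;

            # block $i sits between offset[$i] and offset[$i+1], past the header
            my $base = $sig_len + 4 * (1 + $count);
            seek $FH, $base + $offset[$i], 0;
            read $FH, my $frozen, $offset[$i + 1] - $offset[$i];

            return thaw $frozen;
        }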

    I will have to implement a mechanism for minimizing reads on consecutive entries, but since the entries are sorted by access, successive cache lookups can be done in one go. Squeaky clean ;>

    So yeah, I wasn't sure how I was going to solve the problem, and less than a day later it is essentially fixed. Nice.

    free/libre node licensed under gnu gplv3; your quotes will inherit.

Re: Big cache
by LanX (Saint) on Jul 28, 2022 at 21:47 UTC
    From my experience, the OS's swapping is quite efficient as long as you can bundle the accesses to your data structures inside a concentrated window (hence minimizing the number of swaps).

    This can often be done by reorganizing the data and the processing.

    I wrote about this before; if you're interested, I can dig up those discussions from the archives.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      Absolutely, I'd love to read anything that can better inform my thinking here; I haven't run into any issues just yet, but I sense that if I don't rethink my approach it will become a problem later on.

      free/libre post licensed under gnu gplv3; your quotes will inherit.