in reply to Re: Unpacking and converting
in thread Unpacking and converting

I really don't understand this part.

There is a black box of software that dumps some data in text format every three seconds. The data is a representation of internal process state in a report format, with fixed-width fields containing numeric values or, in certain cases, empty strings. I need to gather this data, process it, and put it in a database. There is also a side requirement of placing the minimum possible load on that server, which is why the database and all the processing live on an external machine I control.

The amount of text is quite significant: as I said, each dump can easily contain 15-30 MB of text depending on server state, and these dumps arrive every three seconds. Some kind of compression is required even when the data is sent over a fast LAN to a nearby machine; otherwise I run the risk of not having enough time to send and process one batch before the next one is ready.

My first thought was to use agent software that would collect the data, parse it into arrays with unpack(), serialize it with Storable, and send it over to the processing machine. Now it appears to me that this approach is wrong altogether; it will be much easier to compress the text with gzip before sending. Nevertheless, the question of the fastest array iteration remains, since I still need to process the data.
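For reference, a minimal sketch of the agent approach described above, using unpack() on fixed-width fields and Storable for serialization. The three 8-character fields and the sample lines are made up for illustration; the real layout is whatever the black box emits.

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

# Hypothetical layout: each report line holds three 8-character fields.
my @lines = (
    "0000000100000234        ",   # third field is empty (all spaces)
    "000000020000009900000042",
);

my @records;
for my $line (@lines) {
    # 'A8 A8 A8' extracts three fields and trims trailing spaces,
    # so an all-space field comes back as the empty string.
    push @records, [ unpack 'A8 A8 A8', $line ];
}

my $frozen = freeze \@records;   # compact byte string, ready to send
my $back   = thaw $frozen;       # what the receiving machine would do
print scalar @$back, " records restored\n";
```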

Regards,
Alex.

Replies are listed 'Best First'.
Re^3: Unpacking and converting
by andal (Hermit) on Feb 16, 2011 at 10:50 UTC

    Well. Probably you are looking the wrong way. If your program connects directly to the DB using DBI or the like, then DBI takes care of all the necessary optimizations, and your attempts to convert strings to numbers just make the system less efficient. If your program simply passes data from one server to another, where another program picks it up, then again it makes sense to simply take the text, zip it (or bzip2 it :), and copy it to the remote machine. Probably rsync with the -z option would be best for this.

    The point is: converting a Perl variable from text to a number does not add any compactness.

    The other point is: iterating over the elements of an array with foreach is faster.
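    For anyone who wants to check the iteration claim on their own machine, here is a minimal benchmark sketch using the core Benchmark module. The array size and loop bodies are arbitrary; the exact ratio depends on the perl version and hardware.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @data = (1 .. 10_000);

# Two ways of iterating over the same array.
my %loops = (
    foreach_loop => sub {
        my $sum = 0;
        $sum += $_ for @data;
        $sum;
    },
    index_loop => sub {
        my $sum = 0;
        for (my $i = 0; $i < @data; $i++) { $sum += $data[$i] }
        $sum;
    },
);

# Run each variant for about one CPU second and print a comparison table.
cmpthese(-1, \%loops);
```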

      Probably you are looking the wrong way.

      Yes, I came to the same conclusion. In my case I will be able to use SSH with compression, and this indeed solves all my pains; somehow it wasn't my first choice. I was probably dumbstruck, as it seems obvious in hindsight.

      For consistency's sake, though, I'd like to mention that conversion from text to number does indeed add compactness to serialized data. Consider this:

      use Storable qw(freeze);
      my @a = '0001' .. '1000';
      my $foo = freeze \@a;
      $_ += 0 for @a;
      my $bar = freeze \@a;
      print "before: ", length $foo, ", after: ", length $bar, "\n";

      Output: before: 6016, after: 4635

      I fail to understand why so many people insist on ignoring the obvious. Granted, today's fast and abundant hardware resources may have spoiled us, but there are still situations where every byte counts. Make the dataset in the above example three orders of magnitude larger and the difference becomes quite distinct.

      Regarding array iteration, I feel this discussion was beneficial for me, as it cleared up some murky points. I wish all my questions were answered so productively in the future. :)

      Regards,
      Alex.

        For consistency's sake, though, I'd like to mention that conversion from text to number does indeed add compactness to serialized data. I fail to understand why so many people insist on ignoring the obvious.

        You fail to understand it because you don't understand what "serialized data" is and what "conversion from text to number" is. Serialized data is something stored in a sequential area (a file, a memory chunk, etc.). The "conversion" is something done at run time to provide context-specific information. Perl does the conversion internally and automatically, and you normally don't need to think about it. Conversion to or from numbers does not necessarily save any RAM, since you don't know whether perl has discarded the string buffer or keeps it around as a speed optimization for when the string is needed again.

        "Serialization" is needed when you copy data from perl into a file (for example). In that case, to save space, you may use the pack function. Here's a piece of code that stores '00000001' as a single byte in the file.

        print FILE pack('C', '00000001');
        Again, perl converts the string to a number automatically and stores a single byte with value 1 in the buffer.
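        For completeness, the round trip can be sketched with an in-memory value instead of a real file; unpack with the same template recovers the number on the other side.

```perl
use strict;
use warnings;

# pack('C', ...) numifies its argument and emits one unsigned byte;
# unpack('C', ...) turns that byte back into a number.
my $packed  = pack 'C', '00000001';
my ($value) = unpack 'C', $packed;

print length($packed), " byte(s), value $value\n";   # 1 byte(s), value 1
```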

        I hope this clarifies why explicit conversion from string to number is, in general, nonsense. It might be needed in certain cases, but definitely not for decreasing RAM usage :)

Re^3: Unpacking and converting
by flexvault (Monsignor) on Feb 17, 2011 at 15:57 UTC

    Alex,

    The amount of text is quite significant ... each dump can easily contain 15-30 MB of text ...

    You really have 2 problems that you are trying to solve with one script. First, you need to get the data on a separate server, and second, you need to process the data.

    For the first part of the problem, I would use IO::Compress::Gzip and then send the data to the second machine. Your mileage may vary, but I would expect your 15-30 MByte file to compress to 1-3 MByte. Fast, secure, and core code. Then use IO::Uncompress::Gunzip on the second machine to get back the original data.
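    A minimal in-memory sketch of that round trip with the two core modules. The sample text is a made-up stand-in for a report dump; file names or filehandles work the same way with these functions.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip   $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Stand-in for one report dump: highly repetitive fixed-width text.
my $text = "000000010000023400000042\n" x 1000;

# Compress into an in-memory buffer, then expand it back.
gzip \$text => \my $compressed
    or die "gzip failed: $GzipError";
gunzip \$compressed => \my $restored
    or die "gunzip failed: $GunzipError";

printf "%d bytes -> %d bytes compressed\n", length $text, length $compressed;
die "round trip mismatch" unless $restored eq $text;
```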

    For the second problem, IMO, use the power of Unix: use multiple scripts to process the data in parallel. The data should be time-stamped, so the data going into the database will be correct, which is more important than having the fastest script. I would use cron to check on the status of the running scripts. Save your pids ($$) in a common place and use a small, simple Perl script to check that they are still running. It's quite simple to send a "text message" to multiple admins if you discover problems with the scripts! And use 'sleep' or 'usleep' between script passes; you'll get a lot more work done in the long run.
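    A small sketch of the pid-file check described above. The /tmp path and the messages are assumptions, not a drop-in monitor; 'kill 0' probes whether a process exists without sending it a real signal.

```perl
use strict;
use warnings;

my $pid_file = "/tmp/worker.pid";   # path is an assumption for illustration

# Worker side: record our pid in the common place.
open my $out, '>', $pid_file or die "cannot write $pid_file: $!";
print {$out} $$, "\n";
close $out;

# Checker side (e.g. run from cron): read the pid and probe it.
open my $in, '<', $pid_file or die "cannot read $pid_file: $!";
chomp(my $pid = <$in>);
close $in;

if (kill 0, $pid) {
    print "process $pid is running\n";
} else {
    warn "process $pid is gone, time to alert the admins\n";
}
```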

    Good Luck!

    "Well done is better than well said." - Benjamin Franklin