dbush has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow Monks!

I humbly ask for help with the following problem:

I need to scan through a very large volume of data (potentially gigabytes, retrieved via DBI) and determine the most minimal relational data type for each "cell" of data. Below is my current attempt, wrapped in a test loop. It does what I want (I think!) but runs unacceptably slowly on large volumes. I suspect that my regular expression is not helping.

My approach in pseudo code is to use a regular expression to decide if the "cell" is numeric, then go through some conditional logic to find out the most appropriate numeric data type. The fall back is to use a character type.

Based on my research via Super Search, I found Matching floats according to perlfaq4 (by jmcnamara on Sep 25, 2002) useful, but the regex in that node does not capture its components (which I need). I have also looked at the regex in perlfaq4 itself.
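For reference, a capturing variant of that kind of float pattern might look like the sketch below. This is not the regex from the node referenced above, and parse_number is a name invented here for illustration; it just shows one way to pull out the integer part, fractional part, and exponent separately:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch: a float-matching regex that captures its components.
# Returns (integer part, decimal part, exponent) on a match, or an
# empty list if the string is not numeric.
sub parse_number {
    my ($str) = @_;
    if ($str =~ /^ [+-]? (\d*) (?: \. (\d*) )? (?: [Ee] ([+-]?\d+) )? $/x
        and $str =~ /\d/) {           # require at least one digit
        return ($1, $2, $3);
    }
    return;                           # not numeric
}

my ($int, $dec, $exp) = parse_number('1.24e5');
print "int=$int dec=$dec exp=$exp\n";   # int=1 dec=24 exp=5
```

Strings like 2.3.4 fall through to the empty return, so the caller can treat "no captures" as "use a character type".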

Another thing I am considering is caching, possibly with Memoize or similar. I just haven't had a chance to try it out yet. Any thoughts?
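For what it's worth, wiring in Memoize is a one-line change. The sketch below uses a toy classify() sub as a stand-in for the real function (Memoize is a core module; whether it actually helps depends entirely on how often the same cell values repeat, since the wrapper itself has a cost):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Memoize;

# Toy stand-in for the real data-type function, to show the wiring.
sub classify {
    my ($value) = @_;
    return $value =~ /^[+-]?\d+$/ ? 'INTEGER' : 'CHAR(' . length($value) . ')';
}

# After this call, repeated inputs are answered from a cache
# instead of re-running the body of classify().
memoize('classify');

print classify('12345'), "\n";   # computed: INTEGER
print classify('12345'), "\n";   # served from the cache: INTEGER
```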

Any help would be very gratefully received.

Regards,
Dom.

#!perl -w
use strict;

my ($szValue, $szDatatype);

while ($szValue = <DATA>) {
    chomp $szValue;
    $szDatatype = GetMinimalDataType($szValue);
    write;
}
exit;

format STDOUT_TOP =
Value                           Datatype
------------------------------- --------------------
.

format STDOUT =
@<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<<
$szValue,                       $szDatatype
.

#####################################################################
sub GetMinimalDataType
#####################################################################
{
    #This subroutine takes one parameter and returns the most minimal
    #datatype that would contain it.
    my $varValue = shift(@_);
    my ($iScale, $iPrecision, $iLength);
    my ($szIntegerPart, $szDecimalPart, $szFloatPart);

    if (not defined $varValue) { return 'NULL' }

    #Check for numeric types
    if ($varValue =~ /^ [+-]?
                       ( (\d+) (\.(\d*))? | (\.(\d+)) )
                       ([Ee][+-]?\d+)?
                      $/ox
       ) {
        $szIntegerPart = $2;
        $szDecimalPart = $4 . $6;
        $szFloatPart   = $7;

        #Is this a simple integer?
        if ($szIntegerPart ne '' and
            $szDecimalPart eq '' and
            $szFloatPart   eq '') {

            #Check for smallint values
            #SMALLINT 16 bits: -32,768 to 32,767 (signed short)
            if ($varValue >= -32768 and $varValue <= 32767) {
                return 'SMALLINT';
            }

            #Check for integer values
            #INTEGER 32 bits: -2,147,483,648 to 2,147,483,647 (signed long)
            if ($varValue >= -2147483648 and $varValue <= 2147483647) {
                return 'INTEGER';
            }
        }

        #Is this a simple decimal?
        if ($szFloatPart eq '') {
            $iScale     = length $szDecimalPart;
            $iPrecision = (length $szIntegerPart) + $iScale;
            if (
                #Precision = 1 to 18
                ($iPrecision >= 1 and $iPrecision <= 18) and
                #Scale = 0 to 18
                ($iScale >= 0 and $iScale <= 18) and
                #Scale must be less than or equal to precision
                ($iScale <= $iPrecision)
               ) {
                return "DECIMAL($iPrecision,$iScale)";
            }
        }

        #Float maybe?
        #1.175e-38 to 3.402e38
        if ($varValue >= 1.175e-38 and $varValue <= 3.402e38) {
            return 'FLOAT';
        }

        #2.225e-308 to 1.797e308
        if ($varValue >= 2.225e-308 and $varValue <= 1.797e308) {
            return 'DOUBLE PRECISION';
        }
    }

    #Dates here eventually
    #<TBD>

    #Check for strings
    $iLength = length $varValue;
    return('CHAR(' . $iLength . ')');
}

__DATA__
0e0
0
+0
-0
1.
0.14
.14
1.24e5
24e5
-24e-5
2.3.
2.3.4
1..2
.1.1
4.
.23
-2147483648
2147483647
+2147483647
-2147483649
2147483648
10e10
10e-10
Dom
Dominic
Dominic Bush

Replies are listed 'Best First'.
Re: (88% faster) Discovering minimal relational data types on large volumes
by BrowserUk (Patriarch) on Nov 21, 2002 at 00:45 UTC

    Continuing my quest to be familiar with what coding methods result in the most efficient code, I couldn't resist having a crack at speeding your GetMinimalDataType routine up a little. The results are:

    start 18.840 ms average per iteration
    end    2.222 ms average per iteration

    88% less time, or 847% improvement?

    Which way you view the improvement depends on whether you're a scientist or a salesman, I guess :^).

    Note: I realise that the level of improvement is highly dependent on the data used, and I'd love to see your profiling data after any changes you choose to adopt.
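    Timings like these are conventionally produced with the core Benchmark module; a minimal sketch of the comparison pattern is below. The two subs are placeholders invented for illustration, not BrowserUk's actual revisions:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Two placeholder ways of testing whether a value is an integer.
# cmpthese(-1, ...) runs each sub for about one CPU second and
# prints a table comparing their rates.
sub with_anchored_regex { my $v = '2147483647'; return $v =~ /^[+-]?\d+$/ ? 1 : 0 }
sub with_negated_class  { my $v = '2147483647'; return $v !~ /\D/         ? 1 : 0 }

cmpthese(-1, {
    anchored => \&with_anchored_regex,
    negated  => \&with_negated_class,
});
```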

    Step by step log of changes and the improvements derived

    The final cut of your test program with the modifications. Sorry, but some of your comments got 'lost' in the transitions.


    Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
    Pick up your cloud down the end, and "Yes", if you get allocated a grey one they are a bit damp underfoot, but someone has to get them.
    Get used to the wings fast, because it's an 8-hour day... unless the Governor calls for a cyclone or hurricane, in which case 16-hour shifts are mandatory.
    Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

      Many thanks to BrowserUk for the excellent reply, and apologies for the delay in responding. I have incorporated all these changes into my script. Are there any sources where these techniques are described? A cookbook for fast code, if you will.

      On attempting to measure the performance improvement, I discovered that the profile I posted above was in error. dws was absolutely correct that another part of the code was in large part responsible for the slow execution. That said, the version of GetMinimalDataType that BrowserUk created ran against my test data in approximately 60% of the time of my inferior version.

      Regards,
      Dom.

Re: Discovering minimal relational data types on large volumes
by dws (Chancellor) on Nov 20, 2002 at 21:53 UTC
    It does what I want (I think!) but runs unacceptably slowly on large volumes. I suspect that my regular expression is not helping.

    Check out that suspicion before you start hacking on code.

    If you're really getting data via DBI, particularly if you're going over the network, the overhead to get a row of data is very likely to dwarf any per-row computation you're doing. You can check that by measuring the time it takes to fetch that large volume of data without processing. Then compare that to the time you've measured for fetch+processing.
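    The fetch-only versus fetch-plus-processing comparison can be made with the core Time::HiRes module; the sketch below shows the measurement pattern. The result set is simulated with an array here, since the real DSN and query are specific to dbush's environment (in the real script the loop body would be a $sth->fetchrow_arrayref over a DBI statement handle):

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

# Simulated result set standing in for DBI fetches.
my @rows = map { [ $_, "value$_" ] } 1 .. 100_000;

# Time one pass over the rows, optionally doing per-row work.
sub time_loop {
    my ($process) = @_;
    my $t0 = [gettimeofday];
    for my $row (@rows) {
        $process->($row) if $process;
    }
    return tv_interval($t0);
}

my $fetch_only = time_loop(undef);
my $fetch_proc = time_loop(sub { my ($row) = @_; length $row->[1] });

printf "fetch only:    %.4fs\n", $fetch_only;
printf "fetch+process: %.4fs\n", $fetch_proc;
```

    If the two numbers are close, the per-row computation is not the bottleneck and optimising it will not help much.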

      Hi dws,

      You make a very fair point about not optimising before understanding where the time actually goes, and the overhead of fetching the data from the database may well be significant. To check this, I have profiled the execution of my script. Here is a clip from the dprofpp results:

      > dprofpp
      Total Elapsed Time = -82.1369 Seconds
        User+System Time = 497.6071 Seconds
      Exclusive Times
      %Time ExclSec CumulS #Calls sec/call Csec/c Name
      35.9  179.0  175.93 105289  0.0002  0.0002 DBI::st::execute
      33.9  168.9  137.38 105266  0.0000  0.0000 main::GetMinimalDataType
      snip...

      As you can see, the majority of the time is spent in DBI::st::execute, but a significant portion is taken by the GetMinimalDataType function. Any savings here would be very useful.

      Many thanks for the feedback.

      Regards,
      Dom.

      Update: Please note that this profile is on a small test set of data and not on the real volume.

Re: Discovering minimal relational data types on large volumes
by dbush (Deacon) on Nov 21, 2002 at 00:16 UTC
    Update: I have looked into using Memoize (admittedly without expiration) and have found that it actually slows down the process by about 30-50%. Apologies for the range of values: I'm estimating from a run that is only 1/5 of the way through, but I need to go to bed.
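    That result is plausible: Memoize's wrapper (an extra sub call plus argument normalisation on every invocation) can easily cost more than a cheap regex-based function saves. If caching is worth doing at all, a bare hash has far less overhead; a sketch follows, with classify() standing in for GetMinimalDataType (on gigabytes of distinct values the cache size would also need bounding, which this sketch does not do):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my %type_cache;

# Stand-in for the real data-type function.
sub classify {
    my ($value) = @_;
    return $value =~ /^[+-]?\d+$/ ? 'INTEGER' : 'CHAR(' . length($value) . ')';
}

# Hand-rolled cache: one hash lookup per repeated value, no wrapper sub.
sub classify_cached {
    my ($value) = @_;
    return $type_cache{$value} if exists $type_cache{$value};
    return $type_cache{$value} = classify($value);
}

print classify_cached('42'), "\n";   # computed and stored: INTEGER
print classify_cached('42'), "\n";   # answered from the hash: INTEGER
```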