dbush has asked for the wisdom of the Perl Monks concerning the following question:
Greetings fellow Monks!
I humbly ask for help with the following problem:
I need to scan through a very large volume of data (potentially gigs of data via DBI) and retrieve the most minimal relational data type for each "cell" of data. Below is my current attempt wrapped up in a test loop. It does what I want (I think!) but runs unacceptably slowly on large volumes. I suspect that my regular expression is not helping.
My approach in pseudo code is to use a regular expression to decide if the "cell" is numeric, then go through some conditional logic to find out the most appropriate numeric data type. The fall back is to use a character type.
Based on my research via Super Search I found the following useful Matching floats according to perlfaq4 (by jmcnamara on Sep 25, 2002) but this node does not provide capture (which I need). I have also looked at the regex in the perlfaq.
Another thing that I am considering is using caching, possibly using Memoize or similar. I just haven't had chance to try it out yet. Any thoughts?
Any help would be very gratefully received.
Regards,
Dom.
#perl -w use strict; my ($szValue, $szDatatype); while ($szValue = <DATA>) { chomp $szValue; $szDatatype = GetMinimalDataType($szValue); write; } exit; format STDOUT_TOP = Value Datatype ------------------------------- -------------------- . format STDOUT = @<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< @<<<<<<<<<<<<<<<<<<< $szValue, $szDatatype . ##################################################################### sub GetMinimalDataType ##################################################################### { #This subroutine takes one parameter and returns the most minimal #datatype that would contain it. my $varValue = shift(@_); my ($iScale, $iPrecision, $iLength); my ($szIntegerPart, $szDecimalPart, $szFloatPart); if (not defined $varValue) { return 'NULL' } #Check for numeric types if ($varValue =~ /^ [+-]? ( (\d+) (\.(\d*))? | (\.(\d+)) ) ([Ee][+-]?\d+)? $/ox ) { $szIntegerPart = $2; $szDecimalPart = $4 . $6; $szFloatPart = $7; #Is this a simple integer? if ($szIntegerPart ne '' and $szDecimalPart eq '' and $szFloatPart eq '' ) { #Check for smallint values #SMALLINT 16 bits –32,768 to 32,767 Signed short (word) if ( ($varValue >= -32768) and ($varValue <= 32767) ) { return 'SMALLINT'; } #Check for integer values #INTEGER 32 bits –2,147,483,648 to 2,147,483,647 Signed #long (longword) if ($varValue >= -2147483648 and $varValue <= 2147483647) +{ return 'INTEGER'; } } #Is this a simple decimal? if ($szFloatPart eq '') { $iScale = length $szDecimalPart; $iPrecision = (length $szIntegerPart) + $iScale; if ( #Precision = 1 to 18 ($iPrecision >= 1 and $iPrecision <= 18) and #Scale = 0 to 18 ($iScale >=0 and $iScale <=18) and #Scale must be less than or equal to precision ($iScale <= $iPrecision) ) { return "DECIMAL($iPrecision,$iScale)"; } } #Float maybe? #1.175 x 10 –38 to 3.402 x 10 38 if ($varValue >= 1.175 * 10 ** -38 and $varValue <= 3.402 * 10 ** 38) { return 'FLOAT'; } #2.225 x 10 –308 to 1.797 x 10 308 if ($varValue >= 1.175 * 10 ** -308 and $varValue <= 3.402 * 10 ** 308) { return 'DOUBLE PRECISION'; } } #Dates here eventually #<TBD> #Check for strings $iLength = length $varValue; return( 'CHAR(' . $iLength . ')' ); } __DATA__ 0e0 0 +0 -0 1. 0.14 .14 1.24e5 24e5 -24e-5 2.3. 2.3.4 1..2 .1.1 4. .23-2147483648 2147483647 +2147483647 -2147483649 2147483648 10e10 10e-10 Dom Dominic Dominic Bush
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: (88% faster) Discovering minimal relational data types on large volumes
by BrowserUk (Patriarch) on Nov 21, 2002 at 00:45 UTC | |
by dbush (Deacon) on Nov 26, 2002 at 11:57 UTC | |
|
Re: Discovering minimal relational data types on large volumes
by dws (Chancellor) on Nov 20, 2002 at 21:53 UTC | |
by dbush (Deacon) on Nov 20, 2002 at 22:15 UTC | |
|
Re: Discovering minimal relational data types on large volumes
by dbush (Deacon) on Nov 21, 2002 at 00:16 UTC |