in reply to Re: Strip non-numeric
in thread Strip non-numeric

You should consider using s/\D+//g, as that's often a lot faster than s/\D//g. Here's a benchmark:
#!/usr/bin/perl

use strict;
use warnings;
use Benchmark;

my @sizes = (10, 25, 50, 100, 250, 500, 1000);
my @chars = ('A' .. 'Z', 'a' .. 'z', 0 .. 9);
our @d = map {join "" => map {$chars [rand @chars]} 1 .. $_} @sizes;

map {
    Benchmark::cmpthese timethese (-2 => {
        "simple_$sizes[$_]"   => '$_ = $::d[' . $_ . ']; s/\D//g;',
        "multiple_$sizes[$_]" => '$_ = $::d[' . $_ . ']; s/\D+//g;'
    }, 'none');
} 0 .. $#sizes

__END__
                Rate simple_10 multiple_10
simple_10   196495/s        --        -15%
multiple_10 231225/s       18%          --

                Rate simple_25 multiple_25
simple_25    89788/s        --        -50%
multiple_25 180650/s      101%          --

                Rate simple_50 multiple_50
simple_50    47507/s        --        -64%
multiple_50 130727/s      175%          --

                 Rate simple_100 multiple_100
simple_100    23206/s         --         -77%
multiple_100 103096/s       344%           --

                Rate simple_250 multiple_250
simple_250   10488/s         --         -71%
multiple_250 36407/s       247%           --

                Rate simple_500 multiple_500
simple_500    5046/s         --         -75%
multiple_500 20382/s       304%           --

                 Rate simple_1000 multiple_1000
simple_1000    2528/s          --          -76%
multiple_1000 10549/s        317%            --
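
One plausible reason for the gap, sketched here as an illustration rather than a measured explanation: s///g returns the number of substitutions it performed, and with the + quantifier each run of non-digits is removed in a single substitution instead of one per character. The sample string and variable names below are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample input: two runs of non-digits, four non-digit characters.
my $input = "ab12cd3";

# One substitution per non-digit character.
my $single   = $input;
my $n_single = ($single =~ s/\D//g);

# One substitution per run of non-digit characters.
my $multi   = $input;
my $n_multi = ($multi =~ s/\D+//g);

print "s/\\D//g : '$single' after $n_single substitutions\n";
print "s/\\D+//g: '$multi' after $n_multi substitutions\n";
```

Both forms leave the same string behind; only the amount of work per call differs, and the difference grows with the length of the non-digit runs.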

Abigail

Re^3: Strip non-numeric
by Aristotle (Chancellor) on Jan 14, 2003 at 09:47 UTC
    In this case transliteration is really the most efficient solution, though. Consider the results of adding
        "xlit_$sizes[$_]"     => '$_ = $::d[' . $_ . ']; tr/0-9//cd;',
    to the benchmark:

                    Rate simple_10 multiple_10 xlit_10
    simple_10    86400/s        --        -31%    -70%
    multiple_10 124615/s       44%          --    -57%
    xlit_10     292712/s      239%        135%      --

                    Rate simple_25 multiple_25 xlit_25
    simple_25    45324/s        --        -49%    -82%
    multiple_25  88062/s       94%          --    -65%
    xlit_25     248802/s      449%        183%      --

                    Rate simple_50 multiple_50 xlit_50
    simple_50    23823/s        --        -71%    -89%
    multiple_50  82566/s      247%          --    -62%
    xlit_50     218684/s      818%        165%      --

                     Rate simple_100 multiple_100 xlit_100
    simple_100    13397/s         --         -69%     -92%
    multiple_100  43191/s       222%           --     -74%
    xlit_100     168434/s      1157%         290%       --

                     Rate simple_250 multiple_250 xlit_250
    simple_250     5608/s         --         -71%     -95%
    multiple_250  19639/s       250%           --     -81%
    xlit_250     103656/s      1748%         428%       --

                    Rate simple_500 multiple_500 xlit_500
    simple_500    2832/s         --         -72%     -95%
    multiple_500 10189/s       260%           --     -83%
    xlit_500     59072/s      1986%         480%       --

                     Rate simple_1000 multiple_1000 xlit_1000
    simple_1000    1380/s          --          -77%      -96%
    multiple_1000  5939/s        330%            --      -83%
    xlit_1000     34457/s       2397%          480%        --
    Especially on large data sets, transliteration screams.

    Makeshifts last the longest.

      True, but my point was about the pattern s/PAT//g, which would benefit from being written as s/PAT+//g. tr isn't as flexible, not even in this case: \D follows the locale and Unicode rules when appropriate, whereas the tr has the digits hardcoded.
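
A minimal sketch of that difference, assuming a string that contains a digit outside ASCII. The sample string is made up for illustration; U+0663 (ARABIC-INDIC DIGIT THREE) is just one convenient Unicode decimal digit.

```perl
use strict;
use warnings;

# Hypothetical input: ASCII letters and digits plus one non-ASCII decimal digit.
my $input = "a1\x{0663}b2";

# \D is the complement of \d, and \d matches U+0663 under Unicode rules,
# so the regex keeps it.
(my $by_regex = $input) =~ s/\D+//g;

# tr has the ASCII digits hardcoded, so U+0663 is deleted with the letters.
(my $by_tr = $input) =~ tr/0-9//cd;

printf "regex kept %d characters, tr kept %d\n",
    length $by_regex, length $by_tr;
```

Which behaviour is right depends on the data. On Perl 5.14 and later, the /a modifier restricts \d and \D to ASCII if the hardcoded behaviour is wanted from the regex as well.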

      Abigail

        Ah, I didn't think of Unicode. I briefly considered the locale awareness, but couldn't think of any case where that would change the meaning of \d.

        Makeshifts last the longest.

Re: Strip non-numeric
by FireBird34 (Pilgrim) on Jan 14, 2003 at 04:49 UTC
    OK, thanks. I do have one more question about this: I didn't take valid non-numeric characters into consideration (only the decimal point). So, for example, from 1a2b3.4c5d6e I would want 123.456, not just 123456. I tried a few combinations, but nothing has worked yet. I'm missing something obvious, I know ;) Any pointers?
      It always helps if you are specific. No one enjoys a game of "How do I do X?", "This is how you do X", "But I don't really want to do X, I want to do Y".

      So, from 1a2b3.4c5d6e, you want 123.456. But what if you have 1a2b.3c4d.5e6f? What do you want then?

      Abigail

      If I understand you correctly, you want to remove floating point numbers as well, together with normal integer numbers but not if they are part of a normal word (like 'HAL1'), don't you?

      Then maybe you should have a look at Regexp::Common. Together with putting whitespace around the regex you should get it to work.

      -- Hofmator

Re: Strip non-numeric
by FireBird34 (Pilgrim) on Jan 14, 2003 at 21:08 UTC
    Basically, in any given string, I want only digits and decimal points retained. For example:

    1a => 1
    1a.2b => 1.2
    1a.2b.3c => 1.2.3

    ^^ Just as those show, no matter how many decimal points or digits there are, those are the only characters I want retained. Hope this clears it up a bit.
      So, you want to delete anything that isn't a number or a dot? Just do:
      s/[^\d.]+//g;
      or
      tr/0-9.//cd;
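
As a quick check, here is a small sketch that runs both suggestions over the three example strings from the question. The loop and variable names are illustrative only.

```perl
use strict;
use warnings;

# The three example inputs from the question.
my @inputs = ('1a', '1a.2b', '1a.2b.3c');
my (@by_regex, @by_tr);

for my $input (@inputs) {
    # Delete every run of characters that is neither a digit nor a dot.
    (my $r = $input) =~ s/[^\d.]+//g;

    # Delete the complement of the set "0-9 and dot".
    (my $t = $input) =~ tr/0-9.//cd;

    push @by_regex, $r;
    push @by_tr,    $t;
    print "$input => $r (s///), $t (tr)\n";
}
```

For these ASCII-only inputs both forms give 1, 1.2 and 1.2.3.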

      Abigail