in reply to Fetch Mysql huge database

For bulk processing like this, I'd sidestep DBI and use command-line tools.

Create a file called getnames.sql containing:

select UserName from table where UserName <> '' and UserName not like 'deleted%'

And two Perl scripts. 1) partNames.pl:

#! perl -w
use strict;

my %fhs;    # one output filehandle per leading character

while( <> ) {
    my $char = substr $_, 0, 1;
    # open the bucket file for this character on first sight
    if( not exists $fhs{ $char } ) {
        open $fhs{ $char }, '>', "names.$char" or die $!;
    }
    print { $fhs{ $char } } $_;
}

And 2) uniqNames.pl:

#! perl -w
use strict;

for my $fname ( glob 'names.*' ) {
    my %uniq;    # names already seen in this bucket
    open my $fh, '<', $fname or die $!;
    while( <$fh> ) {
        print unless exists $uniq{ $_ };    # emit the first occurrence only
        $uniq{ $_ } = 1;
    }
}

Then run the following pair of commands:

mysql --quick < getnames.sql | perl partNames.pl
perl uniqNames.pl > uniqedNames.txt
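(Untested caveat: depending on your setup, mysql will probably also need your usual connection options and the database name, e.g. mysql --quick -u user -p yourdb < getnames.sql; and --skip-column-names (-N) is worth adding so the UserName column header doesn't end up in the data.)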

The first line dumps the names and pipes them to the perl script that then writes them to a set of files (called names.?, where ? is the first character of the name).

The second script reads those files one at a time and writes the unique names to stdout where they are directed into a file called uniqedNames.txt.
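For illustration, with some made-up names, suppose the query returned:

alice
anna
bob
alice

partNames.pl would write the three a-names to names.a and bob to names.b; uniqNames.pl would then print alice, anna and bob, each exactly once.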

The whole (untested) process should take less time than sorting the 34 million names -- inside the DB or with your system sort utility. I'd hazard a guess that it would take less than 20 minutes.
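For comparison, if the full set of unique names happens to fit comfortably in memory, the classic (equally untested here) one-pass idiom

mysql --quick < getnames.sql | perl -ne 'print unless $seen{ $_ }++' > uniqedNames.txt

would do the same job without the intermediate files; the point of partitioning by first character is to bound how much any single pass must hold.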


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.

Re^2: Fetch Mysql huge database
by ym_chaitu (Initiate) on Sep 29, 2011 at 11:05 UTC
    Hi, thank you for taking the time to reply. The script as given here didn't show the expected results. I adapted your script and tried it on a test database first. It gave one error in partNames.pl, so I changed the line my $c = substr $, 0, 1; to my $c = substr $_, 0, 1;. Then it started working, but it stored all the values from the table in one file called "names", including the column name and the duplicate values. Then when I ran the uniqNames.pl script I got no result; the file created is 0KB. Can you kindly help me with this? Once again, thank you for your reply, and pardon me if there are any mistakes.
      didn't show the expected results.

      I did warn that the code was untested.

      I changed the line my $c = substr $, 0, 1; to my $c = substr $_, 0, 1;

      Correct correction :)

      but it stored all the values from the table in one file called "names"

      I assigned the first char to a variable $c, but then used $char everywhere else.

      You should have received a message: Name "main::char" used only once: possible typo ...?
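      For reference, that warning comes from -w noticing a package variable mentioned exactly once; a minimal illustrative snippet (not taken from the scripts above):

      #! perl -w
      $char = 1;    # package variable mentioned exactly once:
                    # Name "main::char" used only once: possible typo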

      I've updated the post above to correct the errors you've pointed out, but there may be more.


        Hi, thank you again for your reply. I modified your script and this worked for me. Here is the script I used. partNames.pl:
        while( <> ) {
            my $c = substr $_, 0, 1;
            #print "C value --> ".$c."\n";
            $file = "names".$c;
            &file_write( $file, $_ );
        }

        sub file_write {
            $_file = shift;
            $_data = shift;
            open( DATA_FILE, ">>$_file" ) or die( "Cannot open a file $_file -> $!\n" );
            print DATA_FILE "$_data";
            close( DATA_FILE );
        }
        uniqNames.pl:
        for my $fname ( glob 'names*' ) {
            my %uniq;
            open my $fh, '<', $fname or die $!;
            while( <$fh> ) {
                print unless exists $uniq{ $_ };
                $uniq{ $_ } = 1;
            }
        }
        Thank you once again for your help.