in reply to Re: Strategy for randomizing large files via sysseek
in thread Strategy for randomizing large files via sysseek

And how would these databases return all the rows in the table randomly?
  • Comment on Re^2: Strategy for randomizing large files via sysseek

Replies are listed 'Best First'.
Re^3: Strategy for randomizing large files via sysseek
by radiantmatrix (Parson) on Sep 09, 2004 at 15:08 UTC
    Random selection back out of the DB shouldn't be that hard if you build the table correctly.

    Use an ID feild that gives the order of insertion (i.e. first row inserted gets "0", next gets "1", etc.). Then populate an array with those values and randomize it using established methods. Your array might now look like:

    50, 30, 98, 3, 6, 127, 42 ...
    You can then do something like:
    # @rand_array contains the randomized array $sth = $dbh->prepare("SELECT data FROM my_table WHERE ID=?"); foreach $id (@rand_array) { $sth->execute($id); print $OUTFILE join('',$sth->fetchrow_array); #there will only be o +ne element, no worries. }
    As long as you keep track of which ID's you create during your INSERT session, this will work smashingly. I also recommend issuing a CREATE TABLE at the start and a DROP TABLE at the end to ensure that you aren't duplicating once the file is written.
    --
    $me = rand($hacker{perl});
Re^3: Strategy for randomizing large files via sysseek
by poqui (Deacon) on Sep 09, 2004 at 15:04 UTC
    How random do you need them to be?
    Oracle will return the rows in physical order (which may match input order if you haven't done a lot of page splits during the load) but not in sorted order unless you specifically say "Order By" in the query.

    How important is the randomness?
    You could approach some level of randomness by repeatedly catting the files together and then splitting them at different points and catting them together again... kind of like shuffling cards...
      (OP) They need to be pretty random. I imagine a database would pretty much keep them in the same order as how they loaded them - i.e. the existing order. So a lack of ORDER BY would not really help too much.
Re^3: Strategy for randomizing large files via sysseek
by RiotTown (Scribe) on Sep 09, 2004 at 15:16 UTC
    Are the initial files random, or are they already in some sorted order? If they are already random, just concatenate them together into one giant load file. Once loaded you can pull them out based on rownum so they will come out in the exact order they went into the table. Not truly random, but random as the initial files.

    If the initial files are ordered, the above solution doesn't work. For a mySQL based table, you could do something like

    SELECT * FROM BLA ORDER BY RAND()

    although I'm not sure of exactly how random the results will be.
      Should have read just a bit further on the mySQL page about RAND()

      Note that RAND() in a WHERE clause is re-evaluated every time the WHERE is executed. RAND() is not meant to be a perfect random generator, but instead a fast way to generate ad hoc random numbers that will be portable between platforms for the same MySQL version.
Re^3: Strategy for randomizing large files via sysseek
by Arunbear (Prior) on Sep 09, 2004 at 15:25 UTC
    Using Mysql you could do this
    SELECT * FROM table ORDER BY RAND() LIMIT 1;
    to select one record at a time, or leave out the LIMIT to select all records.
      MySQL does that by first creating a temp-table, then give every row a random number, and then sort it.. Not a good idea :)