Beefy Boxes and Bandwidth Generously Provided by pair Networks
There's more than one way to do things
 
PerlMonks  

Re: Strategy for randomizing large files via sysseek

by RiotTown (Scribe)
on Sep 09, 2004 at 14:28 UTC ( [id://389695]=note: print w/replies, xml ) Need Help??


in reply to Strategy for randomizing large files via sysseek

It may be overkill, but you can use a database for this functionality. Both mysql and Oracle have efficient ways of loading large datasets into a pre-defined table while obeying constraints on those tables. This may not be the fastest solution, but it should ease up on the memory requirements.

MySQL LOAD DATA

Oracle SQL*Loader
  • Comment on Re: Strategy for randomizing large files via sysseek

Replies are listed 'Best First'.
Re^2: Strategy for randomizing large files via sysseek
by Anonymous Monk on Sep 09, 2004 at 14:41 UTC
    And how would these databases return all the rows in the table randomly?
      Random selection back out of the DB shouldn't be that hard if you build the table correctly.

      Use an ID feild that gives the order of insertion (i.e. first row inserted gets "0", next gets "1", etc.). Then populate an array with those values and randomize it using established methods. Your array might now look like:

      50, 30, 98, 3, 6, 127, 42 ...
      You can then do something like:
      # @rand_array contains the randomized array $sth = $dbh->prepare("SELECT data FROM my_table WHERE ID=?"); foreach $id (@rand_array) { $sth->execute($id); print $OUTFILE join('',$sth->fetchrow_array); #there will only be o +ne element, no worries. }
      As long as you keep track of which ID's you create during your INSERT session, this will work smashingly. I also recommend issuing a CREATE TABLE at the start and a DROP TABLE at the end to ensure that you aren't duplicating once the file is written.
      --
      $me = rand($hacker{perl});
      How random do you need them to be?
      Oracle will return the rows in physical order (which may match input order if you haven't done a lot of page splits during the load) but not in sorted order unless you specifically say "Order By" in the query.

      How important is the randomness?
      You could approach some level of randomness by repeatedly catting the files together and then splitting them at different points and catting them together again... kind of like shuffling cards...
        (OP) They need to be pretty random. I imagine a database would pretty much keep them in the same order as how they loaded them - i.e. the existing order. So a lack of ORDER BY would not really help too much.
      Are the initial files random, or are they already in some sorted order? If they are already random, just concatenate them together into one giant load file. Once loaded you can pull them out based on rownum so they will come out in the exact order they went into the table. Not truly random, but random as the initial files.

      If the initial files are ordered, the above solution doesn't work. For a mySQL based table, you could do something like

      SELECT * FROM BLA ORDER BY RAND()

      although I'm not sure of exactly how random the results will be.
        Should have read just a bit further on the mySQL page about RAND()

        Note that RAND() in a WHERE clause is re-evaluated every time the WHERE is executed. RAND() is not meant to be a perfect random generator, but instead a fast way to generate ad hoc random numbers that will be portable between platforms for the same MySQL version.
      Using Mysql you could do this
      SELECT * FROM table ORDER BY RAND() LIMIT 1;
      to select one record at a time, or leave out the LIMIT to select all records.
        MySQL does that by first creating a temp-table, then give every row a random number, and then sort it.. Not a good idea :)

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://389695]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (1)
As of 2024-04-25 02:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found