PerlMonks  

Random entry from combined data set

by Anonymous Monk
on Jul 04, 2001 at 07:03 UTC ( [id://93758]=perlquestion )

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi folks. I have multiple files, each containing one text phrase per line. How do I read these files and randomly select one line from the combined data set? This snippet would be called from a CGI script, and several instances could be running simultaneously. I therefore believe that populating an array would be a better approach than creating a temp file, but I'm not sure whether there's a better way. Any help/pointers would be much appreciated. Thanks! -Jason

Replies are listed 'Best First'.
Re: Random entry from combined data set
by damian1301 (Curate) on Jul 04, 2001 at 07:10 UTC
    Hmmm... what you could do is open each file and push it (line by line) onto a single array.

    open(FH, '<', $file1) or die "Could not open $file1!!! Here's why: $!";
    push @array, <FH>;
    close(FH) or die "Could not close $file1. Here's why: $!";


    Etc, etc. Then once you have all the files in an array you can use a simple line to select a random phrase.

    print $array[rand(@array)]; # just for you, meowchow :)
    Note: This could drain some memory, so be careful. Also, it would probably be much tidier if you made a subroutine to open the files and return the data to you. That way you won't have to repeat the open calls all over your code. I will post an example in a bit.

    UPDATE:Got the sub.
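    sub openf {
        my $file = shift;
        open(FH, '<', $file) or die "There's a problem! Here's what's up: $!";
        my @array = <FH>;
        # "or" rather than "||": "close FH || die" parses as close(FH || die) and never dies
        close(FH) or die "There's a problem! Here's what's up: $!";
        @array = grep { /\S/ } @array;   # drop blank lines (each line still ends in "\n")
        return @array;
    }
    push @array, openf("test2.txt"), openf("test.txt");
    print $array[4];   # the fifth phrase; use $array[rand(@array)] for a random one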

    UPDATE2: Use wog's advice, as his does not drain memory and has the same functionality as mine...but shorter.

    UPDATE3: I don't know why but I felt that push @bla,$_ while <FH>; was the best solution...man I am stoopid tonight... thanks dvergin.

    $_.=($=+(6<<1));print(chr(my$a=$_));$^H=$_+$_;$_=$^H; print chr($_-39); # Easy but its ok.
      Actually, you don't need to:

            push @array,$_ while <FH>;

      'push' takes a list as its second parameter, so:

            push @array, <FH>;

      for each file works fine.

Re: Random entry from combined data set
by wog (Curate) on Jul 04, 2001 at 07:23 UTC
    The FAQ How do I select a random line from a file? answers the largest part of your question. In fact, you can use the code they give if you just set @ARGV to the list of files to read. (Thanks to how $. works with <>.) And you don't hog memory.
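As a rough sketch of what that could look like for several files: the sub below applies the FAQ's idea (keep line number k with probability 1/k) across every file in turn. The name random_line and the explicit $seen counter are my own choices for illustration; $seen stands in for $., so the sub does not depend on @ARGV or <>.

```perl
use strict;
use warnings;

# Pick one line uniformly at random from the combined contents of @files,
# holding only one candidate line in memory at a time.
sub random_line {
    my @files = @_;
    my ($pick, $seen) = (undef, 0);
    for my $file (@files) {
        open(my $fh, '<', $file) or die "Could not open $file: $!";
        while (my $line = <$fh>) {
            $seen++;                            # running line count over all files
            $pick = $line if rand($seen) < 1;   # keep this line with probability 1/$seen
        }
        close($fh) or die "Could not close $file: $!";
    }
    return $pick;    # undef if every file was empty
}
```

Something like print random_line('phrases1.txt', 'phrases2.txt'); would then print one phrase (the file names are placeholders). Every call still reads every file once, so this saves memory rather than time.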

    You can probably make your script run faster if you create (in advance) some data about the files and use it in your script, so that you can choose a line without going through every single one, and possibly seek to a specific line more quickly.
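One way to read that suggestion, purely as a sketch: build an index of byte offsets once, then have the CGI pick an index record and seek straight to the line. The index format ("offset<TAB>file", one record per line) and the names build_index and indexed_random_line are assumptions of mine, not something from the thread.

```perl
use strict;
use warnings;

# Run ahead of time: write one "byte offset<TAB>file name" record per data line.
sub build_index {
    my ($index_file, @files) = @_;
    open(my $out, '>', $index_file) or die "Could not write $index_file: $!";
    for my $file (@files) {
        open(my $in, '<', $file) or die "Could not open $file: $!";
        my $offset = tell($in);                 # start of the line about to be read
        while (defined(my $line = <$in>)) {
            print {$out} "$offset\t$file\n";
            $offset = tell($in);
        }
        close($in);
    }
    close($out) or die "Could not close $index_file: $!";
}

# Run per request: choose one record, then seek directly to that line.
sub indexed_random_line {
    my ($index_file) = @_;
    open(my $idx, '<', $index_file) or die "Could not open $index_file: $!";
    chomp(my @records = <$idx>);
    close($idx);
    return unless @records;
    my ($offset, $file) = split /\t/, $records[rand(@records)], 2;
    open(my $in, '<', $file) or die "Could not open $file: $!";
    seek($in, $offset, 0) or die "Could not seek in $file: $!";
    my $line = <$in>;
    close($in);
    return $line;
}
```

This version still slurps the index, which is usually much smaller than the data. Fixed-width index records would let you seek within the index as well. The index has to be rebuilt whenever a data file changes.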

      Cute. Does that really give the same distribution as knowing the number of lines in advance?

      Update: The definitive answer

      use strict;
      use warnings;

      sub trial($) {
          my $result;
          foreach (1..shift) {
              $result= $_ if rand($_) < 1;
          }
          return $result;
      }

      ############
      my $trials= shift || 1000;
      my $size= shift || 20;
      my %results;
      for (1..$trials) {
          ++$results{trial($size)};
      }
      # show the results
      for (1..$size) {
          printf "%5d: %5d\n", $_, ($results{$_} || 0);
      }
        Yes. Say we go over n lines in that algorithm. The last line has a 1 in n chance (the probability we want) of being selected. If it isn't selected, then we use whichever line was selected before it. That could be the line just before it, which had a 1 in n-1 chance, correct for selecting from the n-1 remaining values. If line n-1 wasn't chosen either, then the line before that had a correct 1 in n-2 chance of being chosen from the n-2 remaining values. And so on and so on, till you get to line 1, which will still be chosen if none of the others were.

        I hope this is understandable.

        Update: To attempt to clarify/summarize/whatever after seeing sierrathedog04's response: at any point in the algorithm, the chance of the current line being chosen is correct if the algorithm were to end there, and so is the chance of the previously chosen line staying chosen (1 out of n and n-1 out of n, respectively). So though earlier lines are chosen with greater likelihood at that point, they may be overridden by later choices, and everything turns out all right.
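The same argument can be written as a short calculation. Assuming line j replaces the current pick with probability 1/j (as rand($.) < 1 does), line k ends up as the final pick only if it is picked at step k and then survives every later step:

```latex
P(\text{line } k \text{ is the final pick})
  = \frac{1}{k} \prod_{j=k+1}^{n} \left(1 - \frac{1}{j}\right)
  = \frac{1}{k} \cdot \frac{k}{k+1} \cdot \frac{k+1}{k+2} \cdots \frac{n-1}{n}
  = \frac{1}{n}
```

The product telescopes, so every line from 1 to n comes out with the same 1/n chance.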

        Do you know the number of lines in advance?
        (if so, it may be more efficient to randomly select a file weighted by the number of lines in each, then randomly select a line from that file)
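A sketch of that weighted approach, assuming the line counts are already available as a hash of file name => number of lines (how you obtain and keep that hash up to date is left open). The names pick_weighted_file and random_line_weighted are made up for this example.

```perl
use strict;
use warnings;

# Choose a file with probability proportional to its line count.
# $count_for is a hash ref: file name => number of lines in that file.
sub pick_weighted_file {
    my ($count_for) = @_;
    my $total = 0;
    $total += $_ for values %$count_for;
    return unless $total > 0;
    my $target = int(rand($total));            # uniform over 0 .. $total-1
    for my $file (sort keys %$count_for) {
        return $file if $target < $count_for->{$file};
        $target -= $count_for->{$file};
    }
    return;
}

# Read only the chosen file, stopping at a uniformly chosen line number.
sub random_line_weighted {
    my ($count_for) = @_;
    my $file = pick_weighted_file($count_for);
    return unless defined $file;
    my $wanted = 1 + int(rand($count_for->{$file}));   # 1-based line number
    open(my $fh, '<', $file) or die "Could not open $file: $!";
    my $line;
    while (defined($line = <$fh>)) {
        last if $. == $wanted;
    }
    close($fh);
    return $line;    # undef if the file is shorter than its recorded count
}
```

A file with c lines out of N in total is chosen with probability c/N, and each of its lines with probability 1/c, so every line overall gets 1/N. Only one file is opened per request, and on average only half of it is read.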
Re: Random entry from combined data set
by cLive ;-) (Prior) on Jul 04, 2001 at 07:13 UTC
    oops, beaten to it by damian...
