Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

So I have a text file of arbitrary length, and each line contains a discrete piece of data. I would like to read a random line from this file into $_. One way, of course, would be as follows:
@lines = <file>;             # read the whole friggin' thing into memory
$_ = $lines[rand @lines];    # rand @lines yields 0 .. $#lines once truncated to an index
But that can be prohibitively expensive for large text files. What's the best way to do this? Thanks.

Replies are listed 'Best First'.
Re: reading a random line from a file
by belg4mit (Prior) on Nov 17, 2002 at 17:29 UTC
    Did you consult the FAQ, "How do I select a random line from a file?"

    --
    I'm not belgian but I play one on TV.

      For scalability's sake, I quote perlfaq5 (most of the solutions offered are redundant and wasteful):

      How do I select a random line from a file?

      Here's an algorithm from the Camel Book:

      srand; rand($.) < 1 && ($line = $_) while <>;

      This has a significant advantage in space over reading the whole file in. A simple proof by induction is available upon request if you doubt the algorithm's correctness.
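The one-liner quoted above can be spelled out as a subroutine. This is only a sketch of the same idea: `random_line` is a hypothetical name, and the demo reads from an in-memory file that is made up for illustration.

```perl
use strict;
use warnings;

# Return one line chosen uniformly at random from an open filehandle.
# The file is read once and only one candidate line is kept in memory.
# random_line() is a hypothetical name, not something perlfaq5 defines.
sub random_line {
    my ($fh) = @_;
    my ($line, $count);
    while (my $candidate = <$fh>) {
        # The k-th line replaces the current pick with probability 1/k.
        $line = $candidate if rand(++$count) < 1;
    }
    return $line;    # undef if the file was empty
}

# Demo on an in-memory file so the sketch runs without external data.
my $data = join '', map { "line $_\n" } 1 .. 5;
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print random_line($fh);
close $fh;
```

Why this is uniform: line k is picked with probability 1/k and then survives each later line j with probability (j-1)/j. The product (1/k) * (k/(k+1)) * ... * ((n-1)/n) collapses to 1/n for every line of an n-line file.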

        Scalability? If the OP only wants to ever read one random line from the file, the FAQ solution is fine.

        If, however, he wishes to read a second or subsequent random line, the FAQ solution is far from efficient.


        Okay you lot, get your wings on the left, halos on the right. It's one size fits all, and "No!", you can't have a different color.
        Pick up your cloud down the end and "Yes" if you get allocated a grey one they are a bit damp underfoot, but someone has to get them.
        Get used to the wings fast 'cos it's an 8 hour day...unless the Governor calls for a cyclone or hurricane, in which case 16 hour shifts are mandatory.
        Just be grateful that you arrived just as the tornado season finished. Them buggers are real work.

        If you haven't called srand yourself, the first call to rand will call it for you. You shouldn't call it explicitly anymore, since you run the risk of putting the srand call in a tight loop, and then your random numbers won't be very random anymore.

        Makeshifts last the longest.

Re: reading a random line from a file
by BrowserUk (Patriarch) on Nov 17, 2002 at 17:30 UTC

    If you can ensure that each line is of a constant length, by space-padding to the length of the longest piece of data, you could use seek to move directly to the line you pick randomly.

    A second way would be to pre-read the file line by line and record the start position of each line in an array. You could do this using something like

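    my @positions = (0);
    open FILE, '<', 'file' or die $!;
    while (<FILE>) {
        push @positions, tell FILE;    # offset at which the next line starts
    }
    pop @positions;                    # the last offset is end-of-file, not a line start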

    You then select your line by picking a random position from the array, seeking to it, and reading your data.

    my $randData = do {
        seek FILE, $positions[rand @positions], 0;
        <FILE>;
    };

    Note: Untested code. Read the docs and check the syntax etc.
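For anyone who wants something runnable, here is a self-contained sketch of the offset-index approach described above. The names `index_lines` and `pick_line` are hypothetical, lexical filehandles are used instead of the bareword FILE, and the demo data is invented.

```perl
use strict;
use warnings;

# Record the byte offset at which each line of the file starts.
# index_lines() and pick_line() are hypothetical helper names.
sub index_lines {
    my ($fh) = @_;
    seek $fh, 0, 0 or die "seek failed: $!";
    my @positions;
    my $start = tell $fh;
    while (defined(my $line = <$fh>)) {
        push @positions, $start;    # offset where this line began
        $start = tell $fh;          # offset where the next line would begin
    }
    return \@positions;
}

# Seek to a randomly chosen line start and read that one line.
sub pick_line {
    my ($fh, $positions) = @_;
    return undef unless @$positions;
    seek $fh, $positions->[rand @$positions], 0 or die "seek failed: $!";
    return scalar <$fh>;
}

# Demo on an in-memory file; the index is built once and reused per pick.
my $data = "alpha\nbeta\ngamma\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
my $index = index_lines($fh);
print pick_line($fh, $index) for 1 .. 3;
close $fh;
```

The index costs one integer per line, so it stays small even when the lines themselves are long. It goes stale if the file changes after indexing.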



Re: reading a random line from a file
by fruiture (Curate) on Nov 17, 2002 at 17:32 UTC

    Just jump to a very random position via seek() and then read the next line you find:

    open my $fh, '<', $some_file or die $!;
    my $size = -s $fh;
    seek $fh, rand($size), 0;
    <$fh>;                      # throw away current line
    my $randomline = <$fh>;
    close $fh;

    The only problem is that you might seek() to a position in the last line of the file, in which case the second read hits end-of-file and you get no line at all. But it's not too hard to solve this problem.
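One possible way to handle the last-line case is to wrap around to the first line when the second read comes back empty. This is an assumption about what a fix could look like, not necessarily what the poster had in mind; `seek_random_line` is a hypothetical name, and the size is found by seeking to the end so that the demo can use an in-memory file.

```perl
use strict;
use warnings;

# Seek to a random byte, discard the rest of that line, return the next one.
# If the seek landed in the last line, wrap around to the first line.
# seek_random_line() is a hypothetical name used only for this sketch.
sub seek_random_line {
    my ($fh) = @_;
    seek $fh, 0, 2 or die "seek failed: $!";    # whence 2: jump to end of file
    my $size = tell $fh;                        # the end offset is the file size
    return undef unless $size;
    seek $fh, int(rand($size)), 0 or die "seek failed: $!";
    my $discarded = <$fh>;                      # remainder of the line we landed in
    my $line = <$fh>;
    unless (defined $line) {                    # landed in the last line
        seek $fh, 0, 0 or die "seek failed: $!";
        $line = <$fh>;
    }
    return $line;
}

# Demo on an in-memory file.
my $data = "a\nbb\nccc\n";
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print seek_random_line($fh);
close $fh;
```

With the wrap-around every line can be returned, but the pick is still not uniform: a line is chosen whenever the seek lands in the line before it, so its odds depend on the length of its predecessor.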

    --
    http://fruiture.de

      Using seek for this is probably not a very good idea. Consider a file with these three lines:

      a
      bcdefghijklmnopqrstuvwxy
      z

      Most random seeks will end up in the second line. If you move backwards, you'll almost always pick the long line. If you move forwards, you'll almost always pick 'z'. One way or the other, your distribution isn't uniform at all but determined by line length.

      BrowserUk had the good sense to point out that the lines would have to be padded out to equal length to use this method.

      -sauoq
      "My two cents aren't worth a dime.";
      
Re: reading a random line from a file
by broquaint (Abbot) on Nov 17, 2002 at 17:33 UTC
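    Just pick a random position in the file and seek() backwards until you find the line separator or the beginning of the file, e.g.:

    open(my $f, '<', $your_file) or die("ack: $!");
    seek($f, int rand(-s $f), 0);
    {
        local $/ = \1;                 ## read one byte at a time; anyone care to optimise this?
        while (tell($f) > 0) {         # stop at the beginning of the file
            seek($f, -1, 1);           # step back one byte
            my $c = <$f>;              # read it, which moves forward again
            last if $c eq "\n";        # the previous line ends here, so we are at a line start
            seek($f, -1, 1);           # otherwise keep walking backwards
        }
    }
    print scalar <$f>;

    That should print out a random line from a file without having to do a slurp.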
    HTH

    _________
    broquaint

Re: reading a random line from a file
by sauoq (Abbot) on Nov 17, 2002 at 22:47 UTC

    If you have to pick a random line from the file many times over, the FAQ answer is not very good, as it would require reading the file once for each pick. If the file is big enough that slurping it isn't desirable, then using the second method BrowserUk suggested makes a lot of sense.

    I'd probably use Tie::File though. It would be easier and likely almost, if not just, as efficient, because Dominus went to some length to make it fast.

    #!/usr/bin/perl -w
    use strict;
    use Tie::File;

    my @lines;
    my $o = tie @lines, 'Tie::File', 'somefile.txt' or die $!;
    print "$lines[rand @lines]\n" for 1..10;
    -sauoq
    "My two cents aren't worth a dime.";
    
Probability of reading a random line from a file
by UnderMine (Friar) on Nov 17, 2002 at 20:14 UTC
    The two major methods:

    Prefetching the line locations and then randomly picking one gives an even distribution, so the chance of picking a short line is the same as that of picking a long line.

    Randomly picking a position and moving to the head/end of the current record will be distorted by the length of the record. Longer records will have a higher probability of being picked than short records.
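The size of that distortion can be computed exactly rather than sampled. The sketch below walks every byte offset, backs up to the start of the line containing it, and tallies which line a "seek, then move to the head of the record" strategy would return. `seek_backward_odds` is a hypothetical name, the sample data is invented for illustration, and the code assumes the data ends with a newline and that no two lines are identical.

```perl
use strict;
use warnings;

# Probability of each line being returned when a uniformly random byte offset
# is chosen and the reader then backs up to the start of that line.
# seek_backward_odds() is a hypothetical name used only for this sketch.
sub seek_backward_odds {
    my ($data) = @_;    # assumed to end with "\n", with no duplicate lines
    my $size = length $data;
    my %hits;           # line text => number of offsets that select it
    for my $offset (0 .. $size - 1) {
        # Back up until the previous byte is a newline or we reach offset 0.
        my $start = $offset;
        $start-- while $start > 0 && substr($data, $start - 1, 1) ne "\n";
        my $end = index($data, "\n", $start);    # newline that ends this line
        $hits{ substr($data, $start, $end - $start) }++;
    }
    return map { ($_, $hits{$_} / $size) } keys %hits;
}

# Two short lines around one long line.
my %odds = seek_backward_odds("a\nbcdefghijklmnopqrstuvwxy\nz\n");
printf "%-26s %.3f\n", $_, $odds{$_} for sort keys %odds;
```

Each line's share is simply its length in bytes (newline included) divided by the file size, which is why padding every record to the same length removes the bias.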

    However you can combine the methods if you are using fixed length records.

    my $record_length = 200;                      # set the record length (including the newline)
    open my $fh, '<', $filename or die $!;        # open handle
    my $size = -s $fh;                            # get the file size
    my $records = int($size / $record_length);    # work out the number of whole records
    seek $fh, int(rand($records)) * $record_length, 0;    # move to the start of a random record
    my $randomline = <$fh>;                       # read that record
    close $fh;                                    # close handle
    Technically you should lock the file so it does not change between getting its length and getting the random record, but generally that is not a massive issue. Hope this helps.
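If the locking does matter, a shared flock around the size check and the read is one way to do it. This is a sketch under assumptions: `random_record` is a hypothetical name, the demo file and its 8-byte records are invented, and flock is advisory, so it only helps if whatever writes the file takes an exclusive lock too.

```perl
use strict;
use warnings;
use Fcntl qw(:flock SEEK_SET);
use File::Temp qw(tempfile);

# Pick a random fixed-length record while holding a shared lock, so the file
# cannot change (for cooperating writers) between sizing it and reading it.
# random_record() is a hypothetical name used only for this sketch.
sub random_record {
    my ($filename, $record_length) = @_;
    open my $fh, '<', $filename or die "cannot open $filename: $!";
    flock($fh, LOCK_SH) or die "cannot lock $filename: $!";
    my $records = int((-s $fh) / $record_length);    # whole records only
    my $record;
    if ($records) {
        seek($fh, int(rand($records)) * $record_length, SEEK_SET)
            or die "seek failed: $!";
        $record = <$fh>;
    }
    close $fh;    # closing the handle also releases the lock
    return $record;
}

# Demo: a temporary file of 8-byte records (7 padded characters plus newline).
my ($tmp, $path) = tempfile(UNLINK => 1);
printf {$tmp} "%-7s\n", $_ for qw(red green blue);
close $tmp;
print random_record($path, 8);
```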

    UnderMine

Re: reading a random line from a file
by pg (Canon) on Nov 17, 2002 at 17:34 UTC
    One thing you can do is to seek to a random position, and then read what is between two newlines.