Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

What's the easiest way to search a file and take the first XX words (anything separated by one or more whitespace characters, whether numbers, letters, etc.)?

I am trying to mimic what some search spiders do: read just the first set of words and pick out special phrases from it.

To save time: I know how to open a file, but I don't know how to split what I read on whitespace.

Thanks for your time!

Replies are listed 'Best First'.
Re: Counting words
by Limbic~Region (Chancellor) on Sep 13, 2004 at 14:10 UTC
    Anonymous Monk,
    I think what you are asking is:

    How do I get a list of words, defined by white space, in a file and the number of times they appear. I realize that words like "book keeper" which can be spelled with or without a space, different case, and words that wrap lines are going to be an issue, but I want a 99% solution.

        #!/usr/bin/perl
        use strict;
        use warnings;

        my $file = $ARGV[0] || 'foo.txt';
        open (INPUT, '<', $file) or die "Unable to open $file for reading : $!";

        my %word;
        while ( <INPUT> ) {
            chomp;
            $word{$_}++ for split " ";
        }
        print "$_ : $word{$_}\n" for sort { $word{$b} <=> $word{$a} } keys %word;

    It isn't perfect (99% solution), and the sort routine is not the most efficient, but you get the idea.

    Cheers - L~R

Re: Counting words
by zejames (Hermit) on Sep 13, 2004 at 14:15 UTC
    Use the split function.
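
        use strict;
        use warnings;

        my $max_words = 8;
        open my $fh, '<', 'd:\\temp\\perl\\test.txt'
            or die "Unable to open file: $!\n";
        my $data = do { local $/; <$fh> };    # slurp the whole file
        close $fh;

        # split ' ' breaks on runs of whitespace, as the question asks;
        # /\W+/ would also break words at punctuation.
        my @words = split ' ', $data;

        # Avoid undefined elements when the file has fewer words.
        my $last = $#words < $max_words - 1 ? $#words : $max_words - 1;
        print join ':', @words[0 .. $last];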
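
For comparison, here is a minimal sketch of how the choice of split pattern changes what counts as a "word". The sample string is made up for illustration and is not from the thread.

```perl
use strict;
use warnings;

# Hypothetical sample text with leading whitespace and apostrophes.
my $text = "  It's 42  o'clock\n";

my @awk   = split ' ',   $text;   # special case: skips leading whitespace
my @space = split /\s+/, $text;   # leading whitespace yields an empty first field
my @nonw  = split /\W+/, $text;   # also breaks words at punctuation

print scalar(@awk),   " fields: ", join('|', @awk),   "\n";   # 3 fields: It's|42|o'clock
print scalar(@space), " fields: ", join('|', @space), "\n";   # 4 fields: |It's|42|o'clock
print scalar(@nonw),  " fields: ", join('|', @nonw),  "\n";   # 6 fields: |It|s|42|o|clock
```

For "anything separated by whitespace", the `' '` form is probably the closest match, since it never produces an empty leading field.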

    --
    zejames
Re: Counting words
by jbware (Chaplain) on Sep 13, 2004 at 14:40 UTC
    Here is a regex way to grab the first x words for each line (in my example x=4).

        use strict;

        open (IN, "<in.txt") or die $!;
        while (<IN>) {
            print "$1\n" if (/^\s*((?:[^\s]+\s+){4})/);
        }
        close(IN);

    -jbWare

      I modified your code a little to fit the problem description slightly better:

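          use strict;

          my $num_words = 4;
          my @words = ();
          open FILE, "<in.txt" or die $!;
          while ( <FILE> ) {
              last if ( push ( @words, m/\s*([^\s]+)\s*/g ) >= $num_words );
          }
          close FILE;
          # The last line read can push @words past the limit, so trim it.
          $#words = $num_words - 1 if @words > $num_words;
          print join( " ", @words ) . "\n";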

      As a side note, is there a reason to use [^\s] rather than \S, or is it just a matter of preference? Thanks.

      Zenon Zabinski | zdog | zdog@perlmonk.org

        Yeah, "last if" is a good call, I wasn't thinking. The OP wasn't clear exactly how they wanted the results back (string or array of words), so I chose string. Nice conversion to an array though.

        As far as [^\s] versus \S, force of habit. If my laziness virtue would kick in like it's supposed to, I'd have switched to \S by now and saved some keystrokes :)
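
A quick sketch to back up the "matter of preference" answer: on the same input, the two character classes should capture identical words. The sample line is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample line mixing spaces and a tab.
my $line = "  alpha beta\tgamma  delta epsilon\n";

my @negated   = $line =~ /([^\s]+)/g;   # negated class, as in the code above
my @shorthand = $line =~ /(\S+)/g;      # shorthand for the same class

print "identical\n" if "@negated" eq "@shorthand";
print join(' ', @shorthand[0 .. 2]), "\n";   # first three words
```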

        -jbWare
Re: Counting words
by tachyon (Chancellor) on Sep 14, 2004 at 02:31 UTC

    As you have a stream you probably want to use stream parsing logic. Here is an example:


        my ( $space, $wc );
        my $get_words = 5;
        while( read(DATA,$_,1) ) {
            if ( m/\s/ ) {
                $space = 1;
            }
            else {
                if ( $space ) {
                    print "\n";
                    $space = 0;
                    $wc++;
                    last if $wc >= $get_words;
                }
                print;
            }
        }
        __DATA__
        There was an old lady who lived in a shoe
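
The same stop-early idea can also be sketched line by line instead of character by character. This is only an illustration: `first_words` is a hypothetical helper name, and the in-memory filehandle stands in for a real file so the example is self-contained.

```perl
use strict;
use warnings;

# first_words is a hypothetical helper, not from the thread.
# It returns up to $n whitespace-separated words from a filehandle.
sub first_words {
    my ($fh, $n) = @_;
    my @words;
    while (my $line = <$fh>) {
        push @words, split ' ', $line;
        last if @words >= $n;           # stop reading once we have enough
    }
    $#words = $n - 1 if @words > $n;    # drop extras from the last line read
    return @words;
}

# An in-memory filehandle stands in for a real file here.
my $text = "There was an old lady\nwho lived in a shoe\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @first = first_words($fh, 7);
close $fh;

print join(' ', @first), "\n";   # There was an old lady who lived
```

Reading by line is less fine-grained than reading by character, but it still avoids slurping the whole file when only the first few words are needed.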

    cheers

    tachyon