This script, originally from http://www-2.cs.cmu.edu/~rosie/ and modified a bit by me, finds word pairs in a file and counts them. What I want it to do is find ONLY a single word pair from each line of the file it reads. So the line "Joe works hard" should return only "Joe works" as a word pair, and not also "works hard". In other words, I want exactly one pair per line. Heh! I'm no Perl guru, so please keep it as simple as possible. Any ideas? Thanks in advance for your help.
#!/usr/bin/perl -w

print "Content-type: text/html\n\n";

$databasefile = "/home/virtual/admin16/var/www/cgi-bin/DB_Search/Data_files/top.xml";
$wordfile = "/home/virtual/admin16/var/www/html/news/wordburst.txt";

my $file = $databasefile;
open (FILE,$file) || die "Cannot read from $file";
flock(FILE, 2); # Locking file
open (OUTFILE, ">$wordfile") || die "error opening $wordfile $!\n";

$lastword = "BEGINNING_OF_TEXT";
$wordcounts{$lastword}++;

while (<FILE>) {
    s/[^\w\s]//g;
    foreach $word (split /\s+/) {
        # we only want to deal with normal words
        # replace all non-alphabetic characters in the word
        $word =~ s/\W//g;

        # add one to the count of each word in this file
        # the curly brackets mean an associative array;
        # indexed by the word name
        $wordcounts{$word}++;
        $totalwords++;

        # we can make an associative array on the pair of words
        # if it's the first time we've seen this pair,
        # record how to split it back into two words
        $word_pair_counts{"$lastword,$word"}++ or
            $word_pair_split{"$lastword,$word"} = [ $lastword, $word ];

        # now remember what word we saw last for the next pair
        $lastword = $word;
    }
}

# now look at the most frequent word pairs
$word_pairs_printed = 0;
foreach (sort { $word_pair_counts{$b} <=> $word_pair_counts{$a} } keys %word_pair_counts) {
    ($word1, $word2) = split(/,/);
    printf OUTFILE ("\ $word1 $word2 $word_pair_counts{$_}\n");
    $word_pairs_printed++;
    # last ends the loop
    last if ($word_pairs_printed > 39);
}

flock(FILE, 8); # Unlocking file
close(OUTFILE);
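To make the goal concrete, here is a minimal sketch of the "one pair per line" behavior described above, assuming the pair wanted is simply the first two words on each line. The helper names first_pair and count_first_pairs are hypothetical and not part of the original script, and the sample lines are passed in directly instead of being read from top.xml.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not in the original script): return the first
# two words of a line, or an empty list if the line has fewer than two.
sub first_pair {
    my ($line) = @_;
    $line =~ s/[^\w\s]//g;    # strip punctuation, as the original script does
    # split can yield an empty leading field when the line starts with
    # whitespace, so drop empty strings before taking the first two words
    my @words = grep { length } split /\s+/, $line;
    return @words >= 2 ? @words[0, 1] : ();
}

# Hypothetical helper: count only the first pair found on each line.
sub count_first_pairs {
    my (@lines) = @_;
    my %word_pair_counts;
    for my $line (@lines) {
        my @pair = first_pair($line);
        next unless @pair;    # skip lines with fewer than two words
        $word_pair_counts{"$pair[0],$pair[1]"}++;
    }
    return \%word_pair_counts;
}

# Sample input standing in for the lines of the data file.
my $counts = count_first_pairs("Joe works hard", "Joe works late", "Solo");

# Most frequent pairs first; ties are broken alphabetically.
for my $pair (sort { $counts->{$b} <=> $counts->{$a} or $a cmp $b } keys %$counts) {
    my ($word1, $word2) = split /,/, $pair;
    print "$word1 $word2 $counts->{$pair}\n";    # prints "Joe works 2"
}
```

In the original script, the same idea could probably be applied by building the pair from the first two words of each line inside the while loop, in place of the inner foreach that pairs every word with the one before it.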

In reply to Word Pairs and Lines by bob
