Re: Word Pairs and Lines
by ikegami (Patriarch) on Oct 08, 2004 at 20:07 UTC
my ($word1, $word2) = split(/\s+/);
if (defined($word2)) {
...
}
instead of:
foreach $word (split /\s+/) {
...
}
In other words, don't loop over every word. Get the first two words of every line, and work with those.
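Here is a minimal sketch of that idea, assuming one headline or sentence per line. The sub name first_pair_counts and the sample lines are made up for illustration and are not from the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: count only the first word pair of each line.
sub first_pair_counts {
    my @lines = @_;
    my %count;
    for my $line (@lines) {
        # split ' ' skips leading whitespace, unlike split /\s+/
        my ($word1, $word2) = split ' ', $line;
        next unless defined $word2;      # fewer than two words on the line
        $count{"$word1 $word2"}++;
    }
    return %count;
}

my %count = first_pair_counts(
    "Joe works hard",
    "Joe works late",
    "Solo",
);
print "$_ $count{$_}\n" for sort keys %count;
```

Lines with fewer than two words are skipped rather than counted with an empty partner.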
Re: Word Pairs and Lines
by jeffa (Bishop) on Oct 08, 2004 at 20:24 UTC
Interesting ... I had to waste some time on this one. ;)
Now then, if you do perform a $word =~ s/\W//g; on the entire sentence, how do you
know when the sentence ends? You have to keep some punctuation around. Anyway, try this out.
Hopefully some other monks will have better answers for you, but this is a simple approach.
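#!/usr/bin/perl -l
use strict;
use warnings;
use Data::Dumper;

# Slurp everything after __DATA__ into one string
my $data = do { local $/; <DATA> };
# Split into sentences on ., ! or ? when not immediately followed by a word character
my @sent = split /[.!?]\B/, $data;
my @parsed;
for my $i (0 .. $#sent) {
    next if $sent[$i] =~ /^$/;
    # Drop empty fields (as written, this also drops a literal "0")
    my @word = map $_ || (), split /\s+/, $sent[$i];
    # Non-overlapping pairs; an odd word count leaves undef as the last partner
    for (my $j = 0; $j < @word; $j += 2) {
        push @{ $parsed[$i] }, [ $word[$j], $word[$j+1] ];
    }
}
print Dumper \@parsed;
# second pair from second sentence (should be 'a test')
print join ' ', @{$parsed[1]->[1]};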
__DATA__
This is Joe. This is a test. This is not a test.
Blah blah isn't this fun? I wish you were here! Nah ...
Re: Word Pairs and Lines
by TedPride (Priest) on Oct 08, 2004 at 22:17 UTC
I'm assuming you want to count how many times each word pair occurs in the document, and rank the pairs by that count. I'm also assuming that a line break is always the end of a sentence, and that you want periods to signify the end of a sentence.
use strict;
my (@lines, @words, $i, $pair, %hash);
foreach (<DATA>) {
    $_ =~ s/[^\w\. ]//g;            # Remove unneeded characters
    $_ =~ s/ +/ /g;                 # Many spaces to one
    $_ =~ s/ ?\. ?(\. ?)*/\./g;     # Boundaries with . change to .
    $_ =~ s/^ //; $_ =~ s/ $//;     # Spaces at start and end removed
    $_ = lc($_);                    # Lowercase
    @lines = split(/\./, $_);       # Split on sentence boundaries
    foreach (@lines) {              # Get words for each sentence
        @words = split(/[\. ]/, $_);
        for ($i = 0; $i < $#words; $i++) {    # For each word pair
            $pair = $words[$i] . ' ' . $words[$i+1];    # Scalar elements, not slices
            $hash{$pair}++;         # Increment count for word pair
        }
    }
}
foreach (sort {$hash{$b} <=> $hash{$a}} keys %hash) {
    print $_ . ' ' . $hash{$_} . "\n";
}
__DATA__
Four score and seven years ago our fathers brought forth, upon this continent, a new nation, conceived in liberty, and dedicated to the proposition that "all men are created equal"
Now we are engaged in a great civil war, testing whether that nation, or any nation so conceived, and so dedicated, can long endure. We are met on a great battle field of that war. We have come to dedicate a portion of it, as a final resting place for those who died here, that the nation might live. This we may, in all propriety do. But, in a larger sense, we can not dedicate -- we can not consecrate -- we can not hallow, this ground -- The brave men, living and dead, who struggled here, have hallowed it, far above our poor power to add or detract. The world will little note, nor long remember what we say here; while it can never forget what they did here.
It is rather for us, the living, we here be dedicated to the great task remaining before us -- that, from these honored dead we take increased devotion to that cause for which they here, gave the last full measure of devotion -- that we here highly resolve these dead shall not have died in vain; that the nation, shall have a new birth of freedom, and that government of the people by the people for the people, shall not perish from the earth.
Re: Word Pairs and Lines
by bob (Novice) on Oct 09, 2004 at 03:29 UTC
Trying to rethink this... First, these aren't sentences. They're lists of headlines -- so phrases, each ending with a hard return.
Second, the first script I posted above works just fine in listing the various word pairs and their frequency. So that's not a problem. The PROBLEM I'm having is massive redundancy. Below is a short example of the word pairs found, and their frequency - output from the script above.
OPEN SOURCE 9
WINDOWS XP 8
NERO BURNING 7
BURNING ROM 7
FLAW FOUND 6
Pairs 3 and 4 refer to the same headline. It's something like "Nero Burning ROM." I'd like the script to produce only one pair for each headline, so that once "Nero Burning" is output, "Burning Rom" is recognized as redundant and deleted.
Now there may be an easier way to do this than what I asked for above. As I said, maybe my thinking wasn't straight enough. Possibly a second script, which takes the output file, wordburst.txt, and removes every pair containing a word that already appeared in a previous pair. I've tried to formulate a regex to do this, but no luck.....
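A rough sketch of such a second pass, assuming the ranked output has the "WORD1 WORD2 COUNT" layout shown above. The sub name drop_redundant_pairs is hypothetical, and reading the lines out of wordburst.txt is left to the caller.

```perl
use strict;
use warnings;

# Hypothetical filter: walk the ranked "WORD1 WORD2 COUNT" lines in order
# and drop any pair that shares a word with an earlier pair, kept or not.
sub drop_redundant_pairs {
    my @ranked = @_;
    my (%seen, @kept);
    for my $line (@ranked) {
        my ($w1, $w2, $n) = split ' ', $line;
        next unless defined $n;                    # skip malformed lines
        push @kept, $line unless $seen{$w1} || $seen{$w2};
        $seen{$_} = 1 for ($w1, $w2);              # dropped pairs still mark their words
    }
    return @kept;
}

my @kept = drop_redundant_pairs(
    "OPEN SOURCE 9",
    "WINDOWS XP 8",
    "NERO BURNING 7",
    "BURNING ROM 7",
    "FLAW FOUND 6",
);
print "$_\n" for @kept;
```

A plain hash of seen words does the job here; no regex is needed. The filter is blunt, though: an unrelated pair such as "WINDOWS SERVER" would also be dropped once "WINDOWS XP" has been seen.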
|
|
Hmm. So what you want is the first word pair in each sentence - but a count for that pair across all sentences?
Re: Word Pairs and Lines
by bob (Novice) on Oct 08, 2004 at 21:00 UTC
I see my question was poorly formulated. I want the highest-frequency pairs to float to the top, so the first two words won't do. Let me come back to you with a better way to put this.... It might be that the whole doc will have to be parsed as in the original script, and then the results worked over...
Re: Word Pairs and Lines
by The_Rabbit (Acolyte) on Oct 08, 2004 at 20:02 UTC
What I want it to do is find ONLY a single word pair from each line in the file it looks at. So the line "Joe works hard" will only return "Joe works" as a word pair, and not also return "works hard." In other words, I only want it to find one pair per line.
I'm sort of confused by this statement. Are there any criteria for selecting a word pair? Or do you always want to select the first two words on a line as the word pair?
Re: Word Pairs and Lines
by bob (Novice) on Oct 09, 2004 at 03:44 UTC
Just to continue a bit: if I had a headline like "Nero Burning Rom Released Today," I'd wind up with:
Nero Burning
Burning Rom
Rom Released
Released Today
And all except the first would be deleted because they are redundant with the first, or with subsequent redundancies...???? (Hmmmm.... might want it to work backwards... deleting the first initially.... So redundancies would be matched....??? Not sure that's necessary...)
Re: Word Pairs and Lines
by bob (Novice) on Oct 09, 2004 at 04:07 UTC
The reason I can't go with the first two words only is that I might have a headline like "Nero Burning Rom Released Today" and then another like "New Release of Nero Burning Rom Out." I'd want to pick up and count a matched pair from both.
|
|
bob,
You have a hard problem. It is easy for a human to see that those two headlines are related, but a program only does what you tell it. One approach may be:
For each headline -
- Create a 2 element array of first two words and entire headline
- Go through all previous full headlines to see if it has been seen already
- If yes - increment the count, if no - add it as a new item
The problem is that there is a high probability that the same two words will show up in two different headlines that are not related. Other approaches might be to split out the words, sort them, and look for the total number in common. In any case, you are not going to come up with a foolproof system. If the logic above is what you want and you can't figure it out, let me know and I can whip up something.
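For what it's worth, here is one possible reading of the steps above as code. It is an assumption about the intent, not L~R's implementation: "seen already" is taken to mean that a stored pair occurs somewhere in the new headline, and the sub name count_by_first_pair is made up.

```perl
use strict;
use warnings;

# Each item is [ pair, headline, count ]. A new headline bumps the count of the
# first stored pair it contains; otherwise its own first two words become a new item.
sub count_by_first_pair {
    my @headlines = @_;
    my @items;
    HEADLINE: for my $headline (@headlines) {
        for my $item (@items) {
            if ($headline =~ /\b\Q$item->[0]\E\b/i) {
                $item->[2]++;
                next HEADLINE;
            }
        }
        my ($w1, $w2) = split ' ', $headline;
        next unless defined $w2;                 # headline too short to form a pair
        push @items, [ "$w1 $w2", $headline, 1 ];
    }
    return @items;
}

my @items = count_by_first_pair(
    "Nero Burning Rom Released Today",
    "New Release of Nero Burning Rom Out",
    "Flaw Found in Windows XP",
);
printf "%s %d\n", $_->[0], $_->[2] for sort { $b->[2] <=> $a->[2] } @items;
```

The result depends on headline order: had "New Release of Nero Burning Rom Out" come first, "New Release" would have been stored and the other Nero headline would not match it. That is the same lack of a foolproof answer mentioned above.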
Re: Word Pairs and Lines
by bob (Novice) on Oct 09, 2004 at 23:09 UTC
Oh Ted--you said that too... Sorry.. I missed your post
Re: Word Pairs and Lines
by bob (Novice) on Oct 09, 2004 at 15:09 UTC
L~R, if I'm following you, your suggestion may be the closest yet to what I want. I'll think about that.
Not sure about the mechanics....
Is this what you mean? Take first two words, count the frequency of each such word pair against the headlines....
(BTW, this would be done after removing "stopwords" (and, the, a, and the rest of a long list))
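A small sketch of the stopword step, assuming a tiny illustrative stopword list and lowercased comparison. Both the list and the sub name first_pair_after_stopwords are hypothetical, and punctuation is not stripped here.

```perl
use strict;
use warnings;

# Hypothetical stopword list; the real one would be much longer.
my %stop = map { $_ => 1 } qw(a an and of the in);

# Return the first two non-stopwords of a headline, lowercased,
# or undef if fewer than two words remain.
sub first_pair_after_stopwords {
    my ($headline) = @_;
    my @words = grep { !$stop{$_} } map { lc $_ } split ' ', $headline;
    return undef unless @words >= 2;
    return "$words[0] $words[1]";
}

my %count;
for my $headline ("The Nero Burning Rom Released", "Nero Burning Rom Update", "A Flaw Found") {
    my $pair = first_pair_after_stopwords($headline);
    $count{$pair}++ if defined $pair;
}
print "$_ $count{$_}\n" for sort { $count{$b} <=> $count{$a} || $a cmp $b } keys %count;
```

Filtering before pairing means "The Nero Burning ..." and "Nero Burning ..." land on the same pair, which is the point of removing stopwords first.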