Hello everyone!
Just to make everything clear: I need this for a project where I am being graded. However, the project is not about Perl. It is about doing statistics with unstructured data, i.e. text.
I can do this in Excel manually, but I think it would be nice to have code that generalizes my analysis and algorithm to any collection of texts.
The final goal is to create a process to classify text and information with no human interaction. That is what I am being graded on, and I do not need help with that. I only need help with the Perl: counting words in text and things like that.
Here is my question. Given this code:
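#!/usr/bin/perl
use strict;
use warnings;

my $filename = "tryit.txt";
open(my $in, '<', $filename) or die "Cannot open $filename: $!";

my %freq;   # $freq{word}[story index] = count
my @title;  # array of titles
my $story;  # index of the current story

while (<$in>) {
    if (/^<(.*)>\s*$/) {
        # It's a title
        push @title, $1;
        $story = $#title;
    } elsif (defined $story) {
        # It's plain text: delete punctuation, then count each word.
        # The old "(--)" inside the class was really the range ( ) * + , -
        # and is spelled out here, so the behavior is unchanged.
        s/[.,:;?"!()\[\]{}*+_-]//g;
        foreach my $word (/\w+/g) {
            $freq{lc $word}[$story]++;
        }
    }
}
close $in;

# Output a tab-delimited table: one row per word, one column per story
{
    local ($\, $,) = ("\n", "\t");
    print '', @title;
    foreach my $row (sort keys %freq) {
        print $row, map { $_ || '' } @{ $freq{$row} }[0 .. $#title];
    }
}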
This code takes multiple pieces of text and creates a table with the words that appear in the stories as rows and the story titles as columns. Each cell holds the number of times that word appears in that story.
So, for example:
Story One: Perl is great
Story Two: Perl is free perl
Story Three: Will I learn perl?
will return:
        Story1  Story2  Story3
perl    1       2       1
is      1       1
great   1
free            1
will                    1
i                       1
learn                   1
Now, in order to accomplish my final task, I need to sum the rows, that is, for example: how many times does the word perl appear across the stories?
Then I need to sum the columns: how many words does story one have?
And finally I need to find out how many words stories 1 and 2 or 3 have in common.
I know I could take the output and do this in Excel, but I need to hand in Perl code.
Thank you!!
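For reference, here is a minimal sketch of the three tallies being asked for. It assumes the same %freq layout as the script above (word as key, one count per story index) and hard-codes the three example stories so it runs on its own. The names %word_total, @story_total and common_words() are made up for illustration, and "in common" is read here as "appears at least once in every story listed", which may or may not be the intended meaning.

```perl
use strict;
use warnings;

# Same shape as the %freq hash in the script: $freq{word}[story index] = count.
# Hard-coded for the three example stories (an assumption for this sketch).
my @title = ('Story1', 'Story2', 'Story3');
my %freq  = (
    perl  => [1, 2, 1],
    is    => [1, 1],
    great => [1],
    free  => [undef, 1],
    will  => [undef, undef, 1],
    i     => [undef, undef, 1],
    learn => [undef, undef, 1],
);

my %word_total;                        # row sums: occurrences of each word overall
my @story_total = (0) x @title;        # column sums: number of words in each story

for my $word (keys %freq) {
    for my $s (0 .. $#title) {
        my $n = $freq{$word}[$s] || 0; # empty cells count as zero
        $word_total{$word} += $n;
        $story_total[$s]   += $n;
    }
}

# Hypothetical helper: distinct words that occur in every one of the given stories.
sub common_words {
    my ($freq, @stories) = @_;
    my @common = grep {
        my $w = $_;
        my $missing = grep { !$freq->{$w}[$_] } @stories;
        $missing == 0;
    } sort keys %$freq;
    return @common;
}

print "perl appears $word_total{perl} times\n";
print "$title[$_] has $story_total[$_] words\n" for 0 .. $#title;
print "Story1 and Story2 share: @{[ common_words(\%freq, 0, 1) ]}\n";
```

With the example data this reports 4 occurrences of perl, story sizes of 3, 4 and 4 words, and "is perl" as the words shared by the first two stories. In the real script the two loops could run right after the file has been read, before the table is printed.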