maxamillionk has asked for the wisdom of the Perl Monks concerning the following question:
Brief: I am trying to process the data from a file as one big string, but the diamond operator <> reads only one line at a time, and (I think) that's why my program is slow.
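As an aside to the question: reading an entire file into one string ("slurping") is done in Perl by undefining the input record separator `$/`. A minimal sketch (the helper name `slurp` is my own, not from the thread):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Read a whole file into one scalar. Locally undefining $/
# ("slurp mode") makes the readline operator return the entire
# file instead of a single line.
sub slurp {
    my ($path) = @_;
    open my $fh, '<', $path or die "can't open $path: $!";
    local $/;              # undef $/ => <$fh> reads to EOF
    return scalar <$fh>;
}
```

Note that slurping by itself rarely cures this kind of slowdown; line-at-a-time reading with <> is usually cheap, and the cost tends to be elsewhere in the loop body.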
Long explanation: I have a bash shell script that reads an input file, counts every word, and prints each word with the number of times it appears, in descending order. It also filters out common words and deletes empty lines. It takes about 4 seconds to finish on large files. I have tried to re-create this in Perl:

```perl
#!/usr/bin/perl
# word counting program
use strict;
use warnings;
use autodie;

# list of excluded words
my @excluded = qw( a about although also an and another are as at
    be been before between but by can do during for from has how
    however in into is it many may more most etc );

# list of excluded characters
my @excluded_chars = ( "\\'", "\\:", "\\@", "\\-", "\\~", "\\,", "\\.",
    "\\(", "\\)", "\\?", "\\*", "\\%", "\\/", "\\[", "\\]", "\\=", '"' );

my %count;    # this will contain many words

while (<>) {
    foreach (split) {
        s/ ([A-Z]) /\L$1/gx;    # lowercase each word

        # remove non-letter characters
        foreach my $char (@excluded_chars) {
            $_ =~ s/$char//g;
        }

        # remove excluded words
        foreach my $word (@excluded) {
            $_ =~ s/\b$word\b//g;
        }

        $count{$_}++;    # count each separate word
    }
}

foreach my $word (sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
    print "$count{$word} $word\n";
}
```

However, it takes way too long, about 40 to 50 seconds to finish. I am guessing it's because it reads one line at a time from the file. What is a good way to make Perl go faster? (I'm a noob!) For contrast, the bash shell script that runs much faster is below.
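For what it's worth, the usual diagnosis for code like this is not <>'s line-at-a-time reading but the nested loops: one regex substitution per excluded character and per excluded word, for every word of input. A sketch of the standard fix (this rewrite is my own, not taken from the thread's replies): test excluded words with a single hash lookup, and extract words with one pattern match per line so punctuation is skipped for free.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# An O(1) hash lookup per word replaces one regex substitution
# per excluded word per input word.
my %excluded = map { $_ => 1 } qw(
    a about although also an and another are as at be been before
    between but by can do during for from has how however in into
    is it many may more most etc
);

# Tally the non-excluded words of a chunk of text into %$count.
sub count_words {
    my ($text, $count) = @_;
    # One lowercase pass, then one global match: runs of letters
    # are the words, so punctuation needs no separate stripping.
    for my $word (lc($text) =~ /([a-z]+)/g) {
        $count->{$word}++ unless $excluded{$word};
    }
}

my %count;
count_words($_, \%count) while <>;

# descending count, ties broken alphabetically
for my $word (sort { $count{$b} <=> $count{$a} or $a cmp $b } keys %count) {
    print "$count{$word} $word\n";
}
```

Note the sort here is `$count{$b} <=> $count{$a}` (descending, to match the stated goal); the posted script sorts ascending.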
I apologize if a similar question was asked before - I tried searching and wasn't able to find a thread resembling my own.

```bash
#!/bin/bash
# input a file name like this:
#
#   count_mem.sh filename.txt
#
if [ $# -eq 0 ]; then
    echo "example usage: $(basename $0) file.txt" >&2
    exit 1
elif [ $# -ge 2 ]; then
    echo "too many arguments" >&2
    exit 2
fi

sed s/' '/\\n/g "$1" |
  tr -d '[\.[]{}(),\!\\'\'''\"'\`\~\@\#\$\%\^\&\*\+\=\|\;\:\<\>\?]' |
  tr [:upper:] [:lower:] |
  sed "s/\blong\b//gi" |
  sed "s/\blist\b//gi" |
  sed "s/\bof\b//gi" |
  sed "s/\bexcluded\b//gi" |
  sed "s/\bwords\b//gi" |
  sed "s/\bhere\b//gi" |
  sed '/^$/d' |
  sort | uniq -c | sort -nr | less
```
Replies are listed 'Best First'.
Re: Counting and Filtering Words From File
by haukex (Archbishop) on May 09, 2020 at 22:07 UTC
by maxamillionk (Acolyte) on May 10, 2020 at 00:55 UTC

Re: Counting and Filtering Words From File
by tybalt89 (Monsignor) on May 10, 2020 at 00:34 UTC
by hippo (Archbishop) on May 10, 2020 at 10:42 UTC

Re: Counting and Filtering Words From File
by jwkrahn (Abbot) on May 09, 2020 at 22:49 UTC

Re: Counting and Filtering Words From File
by perlfan (Parson) on May 12, 2020 at 02:33 UTC