Re: Help sorting contents of an essay

I'm confused by some of the statements of your requirements.

#1. Sort alphabetically (ignoring capitalization).
#2. Sort alphabetically with upper case words just in front of lower case words with the same initial characters.
[Emphases added.]

These seem like two separate requirements. Do you want to do #1 first and then use the result to do #2, or do you want to do both and save both sets of results?

#3. Sort by frequency, from high to low, (any order for equal frequency).
#4. Sort by frequency, with alphabetical order for words with the same frequency.
[Emphases added.]

Again, these requirements seem at odds. Can you please clarify?

Please see Short, Self-Contained, Correct Example for info on providing example input and desired output, and ideally also the actual code you've got so far. See also How to ask better questions using Test::More and sample data for a way to present desired input/output examples.

Be that as it may, here's an approach to extracting words from a multi-line block of text and then sorting first alphabetically (upper-case first) and second by word frequency.

c:\@Work\Perl\monks>perl
use strict;
use warnings;

use Data::Dump qw(dd);  # for debug

my $text = <<'EOT';
Now is the time, now is the hour.
The rain in Spain falls mainly in Spain.
The rain in Spain falls mainly in Spain.
Foo foo foo Bar bar bar FOO BAR FOO BAR
EOT
print "[[$text]] \n";  # for debug

my $rx_word = qr{ \S+ }xms;

my @words = $text =~ m{ $rx_word }xmsg;
# dd \@words;  # for debug

my %word_count;
++$word_count{$_} for @words;
# dd \%word_count;  # for debug

my @sorted =
    sort { $a->[0] cmp $b->[0]  # sort first by alpha ascending
                   or
           $a->[1] <=> $b->[1]  # then by frequency ascending
         }
    map  [ $_, $word_count{$_} ],
    keys %word_count
    ;

dd \@sorted;  # for debug

print "'$_->[0]' ($_->[1]) \n" for @sorted;

__END__
[[Now is the time, now is the hour.
The rain in Spain falls mainly in Spain.
The rain in Spain falls mainly in Spain.
Foo foo foo Bar bar bar FOO BAR FOO BAR
]]
[
  ["BAR", 2],
  ["Bar", 1],
  ["FOO", 2],
  ["Foo", 1],
  ["Now", 1],
  ["Spain", 2],
  ["Spain.", 2],
  ["The", 2],
  ["bar", 2],
  ["falls", 2],
  ["foo", 2],
  ["hour.", 1],
  ["in", 4],
  ["is", 2],
  ["mainly", 2],
  ["now", 1],
  ["rain", 2],
  ["the", 2],
  ["time,", 1],
]
'BAR' (2)
'Bar' (1)
'FOO' (2)
'Foo' (1)
'Now' (1)
'Spain' (2)
'Spain.' (2)
'The' (2)
'bar' (2)
'falls' (2)
'foo' (2)
'hour.' (1)
'in' (4)
'is' (2)
'mainly' (2)
'now' (1)
'rain' (2)
'the' (2)
'time,' (1)

Note that, e.g., 'Spain' and 'Spain.' are extracted and counted separately because of the period at the end of one of them, and punctuation like , ; : ! ? ... will have a similar effect. This effect is due to the naive definition of the $rx_word regex; a better definition could eliminate such punctuation, but just what constitutes a "word" is tricky to define in general.
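
As a rough illustration of a less naive definition, here is a minimal sketch that assumes a "word" is a run of letters, optionally joined by internal apostrophes or hyphens. That definition is my assumption, not a general answer: it still drops a trailing apostrophe as in "sons'", and it may not suit your text at all.

```perl
use strict;
use warnings;

# Assumption: a "word" is one or more letters, optionally followed by
# groups of an apostrophe or hyphen plus more letters (son's, well-known).
my $rx_word = qr{ \p{L}+ (?: ['-] \p{L}+ )* }xms;

my $text = "The rain in Spain falls mainly in Spain. "
         . "It's a well-known line, isn't it?";

my @words = $text =~ m{ $rx_word }xmsg;

my %word_count;
++$word_count{$_} for @words;

# Trailing punctuation no longer splits 'Spain' and 'Spain.' into two keys.
print "'$_' ($word_count{$_}) \n" for sort keys %word_count;
```

With this regex the period after the second 'Spain' is not part of the match, so both occurrences land under one key.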

Note also that the entire content of a file can be read to a scalar string with the idiom
my $text = do { local $/; <$filehandle>; };
See perlvar for $/ info.
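
For context, here is a small sketch of that idiom in a complete script. The temporary file and its contents are made up purely so the example runs on its own; in your case `$path` would name the essay file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway file so the sketch is runnable as-is
# (hypothetical content, for illustration only).
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "Now is the time,\nnow is the hour.\n";
close $tmp_fh or die "Cannot close '$path': $!";

open my $filehandle, '<', $path
    or die "Cannot open '$path': $!";

# local $/ leaves the input record separator undefined inside this
# block only, so the single read returns the whole file.
my $text = do { local $/; <$filehandle>; };

close $filehandle;

my @lines = split /\n/, $text;
printf "read %d characters in %d lines \n", length $text, scalar @lines;
```

Because `$/` is localized inside the `do` block, line-by-line reads elsewhere in the program are unaffected.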

Update: The idiom used to produce the @sorted array

my @sorted =
    sort { $a->[0] cmp $b->[0]  # sort first by alpha ascending
                   or
           $a->[1] <=> $b->[1]  # then by frequency ascending
         }
    map  [ $_, $word_count{$_} ],
    keys %word_count
    ;

is known as a Schwartzian Transform (ST). Please see A Fresh Look at Efficient Perl Sorting for more info on this and other sorting idioms. Also see "How do I sort an array by (anything)?" in perlfaq4 and sort.
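
If requirement #4 is the one you're really after, the same idiom can be rearranged. The sketch below is only my guess at what's wanted: frequency high to low, then alphabetical ignoring case, then upper case ahead of lower case for otherwise equal words. The `%word_count` data is a hand-made sample, and each record carries a precomputed `lc` key so it isn't recomputed on every comparison.

```perl
use strict;
use warnings;

# Hand-made sample counts (an assumption for illustration).
my %word_count = (
    in  => 4, Spain => 2, rain => 2,
    The => 2, the   => 2, Now  => 1, now => 1,
    );

my @by_freq =
    map  { $_->[0] }                      # unwrap: keep only the word
    sort { $b->[1] <=> $a->[1]            # frequency descending
                   or
           $a->[2] cmp $b->[2]            # then alpha, ignoring case
                   or
           $a->[0] cmp $b->[0]            # then upper case before lower
         }
    map  { [ $_, $word_count{$_}, lc $_ ] }   # wrap: word, count, lc key
    keys %word_count
    ;

print "'$_' ($word_count{$_}) \n" for @by_freq;
```

Note that when the words come from hash keys they are unique, so a word-first comparison already decides the order by itself; a frequency tie-break only matters once frequency is the leading sort key, as here.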

Give a man a fish: <%-{-{-{-<


Re^2: Help sorting contents of an essay by tobyink (Canon) on Apr 18, 2020 at 13:15 UTC
I would suggest `/\w+/` would be a pretty sensible place to start for matching words. Hyphenated words will be matched as two separate words, which may or may not be what you want, depending on the task at hand.
toby döt ink
Re^3: Help sorting contents of an essay by AnomalousMonk (Archbishop) on Apr 18, 2020 at 20:31 UTC
Ah, what's in a word? In addition to hyphenations, I was thinking of cases like `son's sons' wouldn't wouldn't've O'Brien ain't t'ain't` etc, etc. And that's just ASCII English! `\w+` might be perfect for harmattan_'s application, but I don't know what that application is.
Give a man a fish: `<%-{-{-{-<`