in reply to Help sorting contents of an essay

I'm confused by some of your stated requirements.

#1. Sort alphabetically (ignoring capitalization).
#2. Sort alphabetically with upper case words just in front of lower case words with the same initial characters.
[Emphases added.]
These seem like two separate requirements. Do you want to do #1 first and then use the result to do #2, or do you want to do both and save both sets of results?
#3. Sort by frequency, from high to low, (any order for equal frequency).
#4. Sort by frequency, with alphabetical order for words with the same frequency.
[Emphases added.]
Again, these requirements seem at odds. Can you please clarify?

Please see Short, Self-Contained, Correct Example for info on providing example input and desired output, and ideally also the actual code you've got so far. How to ask better questions using Test::More and sample data shows another way to present desired input/output examples.

Be that as it may, here's an approach to extracting words from a multi-line block of text and then sorting first alphabetically (upper-case first) and second by word frequency.

    c:\@Work\Perl\monks>perl
    use strict;
    use warnings;

    use Data::Dump qw(dd);   # for debug

    my $text = <<'EOT';
    Now is the time, now is the hour.
    The rain in Spain falls mainly in Spain.
    The rain in Spain falls mainly in Spain.
    Foo foo foo Bar bar bar
    FOO BAR FOO BAR
    EOT
    print "[[$text]] \n";   # for debug

    my $rx_word = qr{ \S+ }xms;

    my @words = $text =~ m{ $rx_word }xmsg;
    # dd \@words;   # for debug

    my %word_count;
    ++$word_count{$_} for @words;
    # dd \%word_count;   # for debug

    my @sorted =
        sort {
            $a->[0] cmp $b->[0]    # sort first by alpha ascending
            or
            $a->[1] <=> $b->[1]    # then by frequency ascending
        }
        map [ $_, $word_count{$_} ],
        keys %word_count
        ;
    dd \@sorted;   # for debug

    print "'$_->[0]' ($_->[1]) \n" for @sorted;
    __END__
    [[Now is the time, now is the hour.
    The rain in Spain falls mainly in Spain.
    The rain in Spain falls mainly in Spain.
    Foo foo foo Bar bar bar
    FOO BAR FOO BAR
    ]]
    [
      ["BAR", 2],
      ["Bar", 1],
      ["FOO", 2],
      ["Foo", 1],
      ["Now", 1],
      ["Spain", 2],
      ["Spain.", 2],
      ["The", 2],
      ["bar", 2],
      ["falls", 2],
      ["foo", 2],
      ["hour.", 1],
      ["in", 4],
      ["is", 2],
      ["mainly", 2],
      ["now", 1],
      ["rain", 2],
      ["the", 2],
      ["time,", 1],
    ]
    'BAR' (2)
    'Bar' (1)
    'FOO' (2)
    'Foo' (1)
    'Now' (1)
    'Spain' (2)
    'Spain.' (2)
    'The' (2)
    'bar' (2)
    'falls' (2)
    'foo' (2)
    'hour.' (1)
    'in' (4)
    'is' (2)
    'mainly' (2)
    'now' (1)
    'rain' (2)
    'the' (2)
    'time,' (1)
Note that, e.g., 'Spain' and 'Spain.' are extracted and counted separately because of the period at the end of one of them; punctuation like , ; : ! ? ... will have a similar effect. This is due to the naive definition of the $rx_word regex; a better definition could exclude such punctuation, but just what constitutes a "word" is tricky to define in general.
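As a rough illustration of a less naive definition, here is a sketch that treats a word as a run of \w characters, optionally joined by internal apostrophes or hyphens. The pattern and the extract_words() helper are my own assumptions, not anything from your requirements. Note that \w also matches digits and underscores, and that a trailing apostrophe (as in sons') is dropped.

```perl
use strict;
use warnings;

# Hypothetical alternative to the naive \S+ definition: word characters,
# optionally joined by internal apostrophes or hyphens.
my $rx_word = qr{ \w+ (?: ['-] \w+ )* }xms;

# Hypothetical helper: return every word found in a string.
sub extract_words {
    my ($text) = @_;
    my @words = $text =~ m{ ($rx_word) }xmsg;
    return @words;
}

my @words = extract_words("The rain in Spain. Spain, wouldn't well-known");
print "'$_' \n" for @words;
```

With this definition, 'Spain.' and 'Spain,' are both counted as 'Spain', while "wouldn't" and "well-known" each stay in one piece. Whether that is the right behaviour depends on what you want to count.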

Note also that the entire content of a file can be read to a scalar string with the idiom
    my $text = do { local $/;  <$filehandle>; };
See perlvar for info on $/.
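For completeness, a minimal sketch of that idiom wrapped in a sub, including the open and the error check. The slurp_file() name is hypothetical, and the temporary file exists only so the example runs on its own; in practice you would pass the path of the essay.

```perl
use strict;
use warnings;

use File::Temp qw(tempfile);

# Hypothetical helper: read the entire content of a file into one scalar.
sub slurp_file {
    my ($path) = @_;
    open my $filehandle, '<', $path
        or die "Cannot open '$path': $!";
    # local $/ undefines the input record separator inside this block only,
    # so a single read returns everything up to end-of-file.
    my $text = do { local $/;  <$filehandle>; };
    close $filehandle;
    return $text;
}

# Demo with a throwaway file so the example is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "Now is the time,\nnow is the hour.\n";
close $out;

my $text = slurp_file($path);
print "[[$text]] \n";
```

Because $/ is localized inside the do block, the usual line-at-a-time behaviour is restored as soon as the block exits.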

Update: The idiom used to produce the @sorted array

    my @sorted =
        sort {
            $a->[0] cmp $b->[0]    # sort first by alpha ascending
            or
            $a->[1] <=> $b->[1]    # then by frequency ascending
        }
        map [ $_, $word_count{$_} ],
        keys %word_count
        ;
is the decorate-and-sort part of a Schwartzian Transform (ST); a complete ST ends with a further map that recovers the original elements. Please see A Fresh Look at Efficient Perl Sorting for more info on this and other sorting idioms. Also see "How do I sort an array by (anything)?" in perlfaq4 and sort.
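One caveat about the sort block above: the keys of a hash are unique, so the cmp on the words never returns 0 and the frequency comparison never decides anything. Here is a sketch of two orderings that I am guessing are closer to requirements #2 and #4. The sub names and the sample counts are hypothetical, and the "upper case first" tie-break relies on upper-case letters sorting before lower-case ones under plain cmp, as they do in ASCII.

```perl
use strict;
use warnings;

# Hypothetical sample counts standing in for %word_count.
my %word_count = (foo => 2, Foo => 1, bar => 2, Bar => 1, in => 4);

# Alphabetical ignoring case; when two words differ only in case,
# plain cmp puts the upper-case form first.
sub by_alpha_nocase {
    my ($count) = @_;
    my @sorted = sort { lc($a) cmp lc($b) or $a cmp $b } keys %$count;
    return @sorted;
}

# Frequency from high to low; alphabetical for words with equal frequency.
sub by_frequency {
    my ($count) = @_;
    my @sorted = sort {
        $count->{$b} <=> $count->{$a}    # frequency descending
        or
        $a cmp $b                        # then alpha ascending
    } keys %$count;
    return @sorted;
}

print "alpha: @{[ by_alpha_nocase(\%word_count) ]} \n";
print "freq:  @{[ by_frequency(\%word_count) ]} \n";
```

Neither sub needs the [ word, count ] decoration, because a hash lookup inside the sort block is already cheap.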


Give a man a fish:  <%-{-{-{-<

Re^2: Help sorting contents of an essay
by tobyink (Canon) on Apr 18, 2020 at 13:15 UTC

    I would suggest /\w+/ would be a pretty sensible place to start for matching words. Hyphenated words will be matched as two separate words, which may or may not be what you want, depending on the task at hand.

      Ah, what's in a word? In addition to hyphenations, I was thinking of cases like son's, sons', wouldn't, wouldn't've, O'Brien, ain't, t'ain't, etc. And that's just ASCII English! \w+ might be perfect for harmattan_'s application, but I don't know what that application is.


      Give a man a fish:  <%-{-{-{-<