Help sorting contents of an essay

harmattan_ has asked for the wisdom of the Perl Monks concerning the following question:

Re: Help sorting contents of an essay by haukex (Archbishop) on Apr 18, 2020 at 06:34 UTC
"I have tried everything I've seen online but cant get anywhere."

Unfortunately, you don't show the code you've tried, so we can't help with the specific problems you had. We're happy to help you learn if you show your efforts (= code). Here are a few general hints:

- Always use strict and warnings.
- See "open" Best Practices.
- See the Basic debugging checklist.
- See perlintro.

Note that `@yes_finally = <IN>` reads the entire file into the array, leaving nothing for the following `while($in = <IN>)` loop. If you want to sort the words in each line, then yes, you'll need to use split.
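A minimal sketch of the last two hints, using an in-memory filehandle in place of the real essay file. The sample text and the helper name `sort_words_in_line` are illustrative assumptions, not part of the original code.

```perl
use strict;
use warnings;

# Hypothetical sample text; an in-memory filehandle stands in for the essay file.
my $essay = "the quick Brown fox\njumps over The lazy dog\n";

open my $in, '<', \$essay or die "Cannot open in-memory file: $!";

my @all_lines = <$in>;    # list context: reads every remaining line
my $next      = <$in>;    # scalar context: nothing is left, so this is undef

# Sort the words of one line, ignoring case; split ' ' splits on whitespace
# and also discards the trailing newline.
sub sort_words_in_line {
    my ($line) = @_;
    return join ' ', sort { lc $a cmp lc $b } split ' ', $line;
}

print sort_words_in_line($_), "\n" for @all_lines;
```

Because the first read consumes the whole file, any later `while` loop over the same handle never runs its body; working from the array (as above) or reading line by line from the start avoids that.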
Re: Help sorting contents of an essay by AnomalousMonk (Archbishop) on Apr 18, 2020 at 07:51 UTC
I'm confused by some of your statements of your requirements.

#1. Sort alphabetically (ignoring capitalization).
#2. Sort alphabetically with upper case words just in front of lower case words with the same initial characters.
[Emphases added.]

These seem like two separate requirements. Do you want to do #1 first and then use the result to do #2, or do you want to do both and save both sets of results?

#3. Sort by frequency, from high to low, (any order for equal frequency).
#4. Sort by frequency, with alphabetical order for words with the same frequency.
[Emphases added.]

Again, these requirements seem at odds. Can you please clarify?

Please see Short, Self-Contained, Correct Example for info on providing example input and desired output and maybe also the actual code you've got so far. Maybe even see How to ask better questions using Test::More and sample data for a way to posit desired input/output examples.

Be that as it may, here's an approach to extracting words from a multi-line block of text and then sorting first alphabetically (upper-case first) and second by word frequency.

    c:\@Work\Perl\monks>perl
    use strict;
    use warnings;

    use Data::Dump qw(dd); # for debug

    my $text = <<'EOT';
    Now is the time, now is the hour. The rain in Spain falls mainly in Spain. The rain in Spain falls mainly in Spain.
    Foo foo foo Bar bar bar FOO BAR FOO BAR
    EOT
    print "[[$text]] \n"; # for debug

    my $rx_word = qr{ \S+ }xms;

    my @words = $text =~ m{ $rx_word }xmsg;
    # dd \@words; # for debug

    my %word_count;
    ++$word_count{$_} for @words;
    # dd \%word_count; # for debug

    my @sorted =
        sort {
            $a->[0] cmp $b->[0]    # sort first by alpha ascending
                or
            $a->[1] <=> $b->[1]    # then by frequency ascending
        }
        map [ $_, $word_count{$_} ],
        keys %word_count
        ;
    dd \@sorted; # for debug

    print "'$_->[0]' ($_->[1]) \n" for @sorted;
    __END__
    [[Now is the time, now is the hour. The rain in Spain falls mainly in Spain. The rain in Spain falls mainly in Spain.
    Foo foo foo Bar bar bar FOO BAR FOO BAR
    ]]
    [
      ["BAR", 2],
      ["Bar", 1],
      ["FOO", 2],
      ["Foo", 1],
      ["Now", 1],
      ["Spain", 2],
      ["Spain.", 2],
      ["The", 2],
      ["bar", 2],
      ["falls", 2],
      ["foo", 2],
      ["hour.", 1],
      ["in", 4],
      ["is", 2],
      ["mainly", 2],
      ["now", 1],
      ["rain", 2],
      ["the", 2],
      ["time,", 1],
    ]
    'BAR' (2)
    'Bar' (1)
    'FOO' (2)
    'Foo' (1)
    'Now' (1)
    'Spain' (2)
    'Spain.' (2)
    'The' (2)
    'bar' (2)
    'falls' (2)
    'foo' (2)
    'hour.' (1)
    'in' (4)
    'is' (2)
    'mainly' (2)
    'now' (1)
    'rain' (2)
    'the' (2)
    'time,' (1)

Note that, e.g., `'Spain'` and `'Spain.'` are extracted and counted separately because of the period at the end of one of them, and punctuation like `, ; : ! ? ...` will have a similar effect. This effect is due to the naive definition of the `$rx_word` regex; a better definition could eliminate such punctuation, but just what constitutes a "word" is tricky to define in general.

Note also that the entire content of a file can be read to a scalar string with the idiom `my $text = do { local $/; <$filehandle>; };` See perlvar for `$/` info.

Update: The idiom used to produce the `@sorted` array

    my @sorted =
        sort {
            $a->[0] cmp $b->[0]    # sort first by alpha ascending
                or
            $a->[1] <=> $b->[1]    # then by frequency ascending
        }
        map [ $_, $word_count{$_} ],
        keys %word_count
        ;

is known as a Schwartzian Transform (ST). Please see A Fresh Look at Efficient Perl Sorting for more info on this and other sorting idioms. Also see "How do I sort an array by (anything)?" in perlfaq4 and sort.

Give a man a fish: `<%-{-{-{-<`
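One observation on the sort above: the first key is the word itself, and hash keys are unique, so the first comparison never ties and the frequency comparison is never reached. A hedged variant, assuming the intent of requirement #4 is frequency first (high to low) with alphabetical order as the tie-break, might look like this. The sub name `by_frequency` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Count words, then order by frequency (high to low); alphabetical order
# breaks ties. The word pattern \S+ is the same naive definition as above.
sub by_frequency {
    my ($text) = @_;
    my %count;
    ++$count{$_} for $text =~ m{ \S+ }xmsg;
    return
        sort {
            $b->[1] <=> $a->[1]    # frequency, descending
                or
            $a->[0] cmp $b->[0]    # then word, ascending
        }
        map { [ $_, $count{$_} ] }
        keys %count;
}

my @sorted = by_frequency("in Spain in Spain. in the rain in Spain");
print "'$_->[0]' ($_->[1])\n" for @sorted;
```

Swapping the two comparisons is the only structural change; plain `cmp` still places upper-case words ahead of lower-case ones among equal counts.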
Re^2: Help sorting contents of an essay by tobyink (Canon) on Apr 18, 2020 at 13:15 UTC
I would suggest `/\w+/` would be a pretty sensible place to start for matching words. Hyphenated words will be matched as two separate words, which may or may not be what you want, depending on the task at hand.

toby döt ink
Re^3: Help sorting contents of an essay by AnomalousMonk (Archbishop) on Apr 18, 2020 at 20:31 UTC
Ah, what's in a word? In addition to hyphenations, I was thinking of cases like `son's sons' wouldn't wouldn't've O'Brien ain't t'ain't` etc, etc. And that's just ASCII English! `\w+` might be perfect for harmattan_'s application, but I don't know what that application is.

Give a man a fish: `<%-{-{-{-<`
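As a rough illustration of the trade-off discussed in this sub-thread, here is one possible word pattern that keeps internal apostrophes and hyphens. The pattern and the helper name `extract_words` are assumptions for the sketch, not a general definition of a word; for instance, a trailing apostrophe as in `sons'` is dropped, and `\w` also matches digits and underscores.

```perl
use strict;
use warnings;

# Hypothetical word pattern: runs of word characters, optionally joined by
# single apostrophes or hyphens (so "wouldn't've" and "well-known" stay whole).
my $rx_word = qr{ \w+ (?: ['-] \w+ )* }xms;

sub extract_words {
    my ($text) = @_;
    return $text =~ m{ ($rx_word) }xmsg;
}

my @words = extract_words("O'Brien wouldn't've said it ain't well-known, t'ain't so.");
print "$_\n" for @words;
```

Whether hyphenated compounds should count as one word or two still depends on the application, so the `['-]` class is the part most likely to need adjusting.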
Re: Help sorting contents of an essay by kcott (Archbishop) on Apr 18, 2020 at 09:00 UTC
G'day harmattan_,

Welcome to the Monastery.

There are some issues with your code which aren't helping you.

- You haven't used the strict or warnings pragmata. You should include these in all of your code. See "perlintro: Safety net".
- You've used package variables for filehandles; choose lexical filehandles instead (and preferably in the smallest scope possible). If you get into the habit of using names like `IN` and `OUT`, you'll likely continue that usage in larger programs and possibly run into all sorts of problems. See "perlintro: Files and I/O".
- You haven't checked if I/O operations have been successful. The link above (perlintro: Files and I/O) shows one way to do this (i.e. '`... or die "...";`'). I find hand-crafting all of those `die` messages tedious and, frankly, error-prone: it's very easy to miss them out, not update them when other code changes, and so on. A far easier way is to let Perl handle all of that for you with the autodie pragma; there are some cases where that's not appropriate, but mostly it is: I use it whenever possible, including in production code.

There are some issues with your post which aren't helping us to help you.

- It helps us greatly if you provide a small, but representative, input sample, and exactly the output you expect from that data. Please post such data within `<code>...</code>` tags, as you did with your code, so we can use the [download] link to get a verbatim copy of your data. Prosaic descriptions of input, processing, and output are rarely useful: "a picture paints a thousand words" and so does data!
- Please show us what you've tried, rather than just saying you tried lots of (unspecified) things. You may be on the right track and we can nudge you closer to a solution; you may be working under some misapprehension and going completely down the wrong path, and we can help with that too if we know what you're doing wrong.
- Also show us exactly the output you're getting, including all error and warning messages. Please also provide that within `<code>...</code>` tags.

Have a look at these links: "How do I post a question effectively?" and "Short, Self-Contained, Correct Example".

You have two fundamental flaws in the code you have supplied.

- You are reading all input records with '`@yes_finally = <IN>`' and then attempting to read more records with '`while($in = <IN>)`'. I don't think you need to do this here; however, for future reference, you'd need to reposition the file pointer back to the start (see seek) and possibly reset the record counter (see "perlvar: $.").
- Your comments regarding sorting are all within the `while` loop. You won't be able to sort the data until you have the data to sort. I think you're completely on the wrong track here, although, without any code, I can't tell for certain; this may be your main stumbling block.

In the code below, I've shown a single pass through the input which collects the data (`@data_all`) as well as other information (`%data_info`) that is used in various places by sort: there's no need to recalculate counts, or perform transformations for case-insensitive checks, multiple times. Note: I used lc but fc would be a far better choice; `fc` requires Perl 5.16 or later, so use `fc` if you have an appropriate Perl version.

You'll note a map-sort-map pattern in the code. That's called a Schwartzian Transform. Take a look at "A Fresh Look at Efficient Perl Sorting" for a description of that and other sorting methods.

I've included example code for each of the four sorts you mentioned. I believe the first three are what you want. The fourth may not be exactly what you're after: this is an example where expected output, as I wrote about above, would have really helped.
    #!/usr/bin/env perl

    use strict;
    use warnings;

    use constant {
        NO_CASE => 0,
        COUNT => 1,
    };

    my (@data_all, %data_info);

    while (<DATA>) {
        chomp;
        push @data_all, $_;

        if (exists $data_info{$_}) {
            ++$data_info{$_}[COUNT];
        }
        else {
            $data_info{$_} = [lc, 1];
        }
    }

    print "Sort alphabetically - ignore case\n";
    print "$_\n" for
        map { $_->[0] }
        sort { $a->[1] cmp $b->[1] }
        map { [ $_, $data_info{$_}[NO_CASE] ] }
        @data_all;

    print "Sort alphabetically - capitalisation matters\n";
    print "$_\n" for
        map { $_->[0] }
        sort {
            $a->[1] cmp $b->[1]
            ||
            $a->[2] cmp $b->[2]
            ||
            $a->[0] cmp $b->[0]
        }
        map { [
            $_,
            substr($data_info{$_}[NO_CASE], 0, 1),
            substr($_, 0, 1)
        ] }
        @data_all;

    print "Sort by frequency - ignore alphabetical order\n";
    print "$_->[1]: $_->[0]\n" for
        sort { $b->[1] <=> $a->[1] }
        map { [ $_, $data_info{$_}[COUNT] ] }
        keys %data_info;

    print "Sort by frequency - then by alphabetical order\n";
    print "$_->[1]: $_->[0]\n" for
        sort {
            $b->[1] <=> $a->[1]
            ||
            $a->[0] cmp $b->[0]
        }
        map { [ $_, $data_info{$_}[COUNT] ] }
        keys %data_info;

    __DATA__
    bb
    Aa
    CC
    dD
    bb
    AA
    dD
    aa
    BB
    aa
    dD
    aA

Output:

    Sort alphabetically - ignore case
    Aa
    AA
    aa
    aa
    aA
    bb
    bb
    BB
    CC
    dD
    dD
    dD
    Sort alphabetically - capitalisation matters
    AA
    Aa
    aA
    aa
    aa
    BB
    bb
    bb
    CC
    dD
    dD
    dD
    Sort by frequency - ignore alphabetical order
    3: dD
    2: aa
    2: bb
    1: Aa
    1: BB
    1: aA
    1: CC
    1: AA
    Sort by frequency - then by alphabetical order
    3: dD
    2: aa
    2: bb
    1: AA
    1: Aa
    1: BB
    1: CC
    1: aA

-- Ken
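For the "future reference" point about reading the same input twice, a small sketch of rewinding a lexical filehandle with `seek` and resetting `$.` follows. An in-memory file and the variable names are assumptions for the example; explicit `or die` checks are used here, and `use autodie;` would be the alternative recommended above.

```perl
use strict;
use warnings;

# Hypothetical input: an in-memory file stands in for the essay on disk.
my $essay = "first line\nsecond line\n";

open my $in_fh, '<', \$essay or die "Cannot open: $!";

my @all_lines = <$in_fh>;          # first pass: reads everything

seek $in_fh, 0, 0 or die "Cannot seek: $!";   # back to the start of the file
$. = 0;                            # reset the record counter for this handle

my @numbered;
while (my $line = <$in_fh>) {      # second pass now sees the data again
    chomp $line;
    push @numbered, "$.: $line";
}
close $in_fh or die "Cannot close: $!";

print "$_\n" for @numbered;
```

In most cases a single pass that stores what it needs, as in the code above, is simpler than rewinding.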
Re: Help sorting contents of an essay by BillKSmith (Monsignor) on Apr 18, 2020 at 17:51 UTC
The first two things you must do are to read the essay and to divide it into words. The best way to read it depends on how you plan to divide it. The best way to divide it depends on the format of the essay and your definition of 'word'. You probably think that this is obvious and that if there is an occasional problem, you will deal with it later. That is a big mistake.

For simplicity, let us assume that the essay consists only of English words (only ASCII letters, no numbers, no hyphenated or foreign words) with standard English punctuation (,.'"?!). I will also assume that the essay is less than 10,000 characters long and that it is divided into lines less than 80 characters long. Lines are separated by newlines. Paragraphs are separated by blank lines. Sentences are separated by two spaces (or a newline). Words are separated by a single space. A program which handles this very well may be extremely difficult to modify. You should let us know which of these assumptions are not true and which are likely to change in the future.

You specify four outputs. Do you really want them all written to the same file? If so, how should they be identified (or at least separated)?

Bill
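A minimal sketch of dividing an essay into words under roughly these assumptions. It treats any run of whitespace (including a newline inside a wrapped sentence) as a word separator and trims the listed punctuation only from the ends of each token; both choices, and the names `words_by_paragraph` and `$essay`, are assumptions made for the example.

```perl
use strict;
use warnings;

# Return one array reference of words per paragraph.
sub words_by_paragraph {
    my ($text) = @_;
    my @paragraphs;
    for my $para (split /\n\s*\n/, $text) {    # blank lines separate paragraphs
        my @words;
        for my $token (split ' ', $para) {     # whitespace separates words
            # Trim the assumed punctuation set from both ends only,
            # so an apostrophe inside a word is kept.
            $token =~ s/\A[,.'"?!]+//;
            $token =~ s/[,.'"?!]+\z//;
            push @words, $token if length $token;
        }
        push @paragraphs, \@words if @words;
    }
    return @paragraphs;
}

my $essay = qq{"Now is the time," he said.  Don't wait!\nIs it the hour?\n\nThe rain falls.\n};
my @paragraphs = words_by_paragraph($essay);
print join(' ', @$_), "\n" for @paragraphs;
```

If any assumption changes (hyphenated words, numbers, non-ASCII text), the punctuation class and the splitting rule are the two places that would need revisiting.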
Re: Help sorting contents of an essay by leszekdubiel (Scribe) on Apr 18, 2020 at 21:33 UTC
    #!/usr/bin/perl -CSDA

    use Modern::Perl;
    use Data::Dump qw{dd};
    use Path::Tiny;

    # read
    my @essay = path('./essay.txt')->lines_utf8({chomp => 1});
    print "\n\n\nessay is: ";
    dd(\@essay);

    my %stats = (
        Sort_alphabetically_ignoring_capitalization => [
            sort { lc $a cmp lc $b } @essay
        ],

        #2. Sort alphabetically with upper case words just in front of lower case words with the same initial characters.
        # --- needs sophisticated algorithm to extract "same initial characters" first...
        # I don't code that

        #3. Sort by frequency, from high to low, (any order for equal frequency).
        # frequency of words or lines? if words then:
        Sort_by_fequency => do {
            # word count
            my %wc;
            $wc{$_}++ for                    # 3. and feed these words to "for" loop
                map { (split /\s+/, $_) }    # 2. each line becomes stream of words
                @essay;                      # 1. lines from file
            # return array ref... (read from bottom -- take words, sort comparing counts,
            # if counts are same then compare words, make pairs word -- word count
            [
                map { [$_, $wc{$_}] }
                sort { $wc{$b} <=> $wc{$a} or $a cmp $b }
                keys %wc
            ];
        },

        #4. Sort by frequency, with alphabetical order for words with the same frequency.
        # same as above
    );

    print "\n\n\nstats are: ";
    dd(\%stats);

Output:

    essay is: [
      "Lorem ipsum dolor sit amet, eos ei nihil feugait, ius id sonet volumus molestiae, no nonumes vivendo nam. Mea diam",
      "putant te. Volumus euripidis instructior id pro, et accusata instructior quo. Et sed facete alienum, duo cu audire",
      "expetendis. Pro nibh nostrum efficiendi te.",
      "",
      "Unum viderer mnesarchum eos no, dico liberavisse ius eu, ad dicant aliquid partiendo sed. Mea no vivendo persecuti",
      "abhorreant. Enim possim mei ut, nibh noluisse delectus ei his. Atqui convenire vituperatoribus his at, ut meliore",
      "senserit usu. Idque verear latine mel id, everti latine et qui, in alia erat vix. Ex his accusata elaboraret, quem illud",
      "in eam.",
      "",
    ]

    stats are: {
      Sort_alphabetically_ignoring_capitalization => [
        "",
        "",
        "abhorreant. Enim possim mei ut, nibh noluisse delectus ei his. Atqui convenire vituperatoribus his at, ut meliore",
        "expetendis. Pro nibh nostrum efficiendi te.",
        "in eam.",
        "Lorem ipsum dolor sit amet, eos ei nihil feugait, ius id sonet volumus molestiae, no nonumes vivendo nam. Mea diam",
        "putant te. Volumus euripidis instructior id pro, et accusata instructior quo. Et sed facete alienum, duo cu audire",
        "senserit usu. Idque verear latine mel id, everti latine et qui, in alia erat vix. Ex his accusata elaboraret, quem illud",
        "Unum viderer mnesarchum eos no, dico liberavisse ius eu, ad dicant aliquid partiendo sed. Mea no vivendo persecuti",
      ],
      Sort_by_fequency => [
        ["Mea", 2],
        ["accusata", 2],
        ["ei", 2],
        ["eos", 2],
        ["et", 2],
        ["his", 2],
        ["id", 2],
        ["in", 2],
        ["instructior", 2],
        ["ius", 2],
        ["latine", 2],
        ["nibh", 2],
        .......
        ["ut,", 1],
        ["verear", 1],
        ["viderer", 1],
        ["vituperatoribus", 1],
        ["vix.", 1],
        ["volumus", 1],
      ],
    }
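Requirement #2 is left uncoded above. One possible reading, offered only as an assumption about what was meant, is "compare case-insensitively, and when two words differ only in case, put the upper-case form first". Under that reading no special algorithm is needed; a two-key comparison is enough. The sub name `sort_upper_first` and the word list are invented for the sketch.

```perl
use strict;
use warnings;
use feature 'fc';    # fc needs Perl 5.16 or later; lc is a rough substitute

# Compare case-insensitively first; when two words differ only in case,
# plain cmp puts upper-case letters first (true for ASCII letters).
sub sort_upper_first {
    my @words = @_;
    return sort { fc($a) cmp fc($b) or $a cmp $b } @words;
}

my @sorted = sort_upper_first(qw(bar Foo foo Bar apple FOO BAR));
print "@sorted\n";
```

If "same initial characters" was meant to compare only a prefix of each word, the first key would need to be a `substr` of the folded word instead.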