grahambuck has asked for the wisdom of the Perl Monks concerning the following question:

It's a brand new year and I'm a brand new member! I want to thank you all for your help here.

I've used Perl mostly with regexes to do text modifications on files for digital publishing. Most of this is simple search/replace work, so that's about the extent of my knowledge.

I've started in on a new project and have found myself in a quandary. I've got sections in this text file that I need to re-sort. They look like this:

    [section start marker (shift-opt-5)] [some line of text]
    entry \(frequency data\)
    entry \(frequency data\)
    …
    [section start marker (shift-opt-5)] [and the pattern repeats]

I need to re-sort the entries (not the first line of text) in two ways, first by descending frequency and then within equal frequencies, by ascending alphabet.

I know that if I create an array of data I can do this kind of sort. Here is what I have in that regard:

    sub two_way_sort {
        ($b =~ /\((\d+)/)[0] <=> ($a =~ /\((\d+)/)[0]
            || lc($a) cmp lc($b)
    }

    my @unsorted_input;
    while (my $line = <Input>) {
        [section here of some text edits]
        push @unsorted_input, $line;
    }
    my @sorted = sort two_way_sort @unsorted_input;

This code sorts the entire text file, because I've placed the whole thing in one array. What I can't figure out is how to create another array into which I can place each section, do the sort, return the sorted text to the original file, and move on to the next section.

Any help you guys/gals could provide would be amazing. I'm really looking forward to growing in my understanding of Perl.

Replies are listed 'Best First'.
Re: Use Perl's Sort to only sort certain lines in a file?
by nlwhittle (Beadle) on Jan 01, 2015 at 19:50 UTC

    If I understand you correctly, you want to keep each section in place, and only sort the entries under the bracketed line. If that is the case, then one way to do that would be:

        my @unsorted_input;
        while (my $line = <Input>) {
            if ($line =~ /\A\[/) {            # line starts with a bracket
                if (@unsorted_input) {        # if array has entries
                    >>>call your sort routine here<<<
                    print @unsorted_input;
                    undef @unsorted_input;
                }
                print $line;                  # print the header line
            }
            else {                            # line is an entry line
                push @unsorted_input, $line;
            }
        }
        >> your sort routine <<               # added code per soonix's correction
        print @unsorted_input;

    How it works: The program will keep each section heading line (with the brackets) in place. All of the other lines for that section will go into your array. Once the next section line is encountered, the program will sort the entry lines currently in the array (for the last section), print them, then print the next section line and repeat the process.

    Update: I know the code looks backward, but when the first header line is encountered, there will be no data in the array, so the array printing will be skipped and just the header line gets printed. After the entries for that section have accumulated in the array, the next section header triggers printing of the array, followed by the section header itself.

    Update 2: Updated code per soonix's correction below

    --Nick
      ++, but if the input doesn't end with a "section" line, the last block will not be printed, so you have to put
      >>>call sort routine<<< print @unsorted_input;
      after the loop...

        I knew I was missing something...

        --Nick

      Nick,

      Thanks for your help. Would you (or someone else) be able to help me further?

      I'm using BBEdit as my text editor. When I copied your code and made a few edits (see below, NB. the html character code is that shift-opt-5 character) my outputted text became everything in the "if" section. The program then popped open a Unix Script Log window and contained within it is everything from the "else" section. However, nothing in this Log output is sorted.

          while (my $line = <Input>) {
              $line =~ s{(z0)}{&#64257;$1}g;
              # $line =~ s{(?<=^)\n\Z}{&#64257;\n}g;
              if ($line =~ /^&#64257;/) {       # line starts with a bracket
                  if (@unsorted_input) {        # if array has entries
                      # my @sorted = sort two_way_sort @unsorted_input;
                      print @unsorted_input;
                      undef @unsorted_input;
                  }
                  print {Output} $line;         # print the header line
              }
              else {                            # line is an entry line
                  my @sorted = sort two_way_sort @unsorted_input;
                  push @unsorted_input, $line;
              }
          }
          foreach my $line (@unsorted_input) {
              print Output $line;
          }

      I'm not certain about what to do next.
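Two things in the attempt above can be read straight from the code: `print @unsorted_input;` names no filehandle, so the entries go to STDOUT rather than to Output (which is presumably why they turn up in BBEdit's log window), and the result of `sort` is stored in `@sorted`, which is never printed. A minimal sketch of one way to address both, assuming every entry carries a "(number" somewhere on its line. The names `freq` and `sort_sections`, the in-memory filehandles, and the sample text are hypothetical and only there to make the example self-contained:

```perl
use strict;
use warnings;

# Frequency of an entry: the first "(digits" on the line, or 0 if there is none.
sub freq {
    my ($entry) = @_;
    return $entry =~ /\((\d+)/ ? $1 : 0;
}

# Highest frequency first; ties broken case-insensitively, A to Z.
sub two_way_sort {
    freq($b) <=> freq($a) || lc($a) cmp lc($b);
}

# Copy $in to $out, sorting the entry lines that follow each header line.
sub sort_sections {
    my ($in, $out, $is_header) = @_;
    my @entries;

    # Print the pending entries in sorted order, always to $out.
    my $flush = sub {
        my @sorted = sort two_way_sort @entries;
        print {$out} @sorted;
        @entries = ();
    };

    while (my $line = <$in>) {
        if ($line =~ $is_header) {
            $flush->();              # finish the previous section first
            print {$out} $line;      # then print the new header
        }
        else {
            push @entries, $line;    # entry line: hold it until the section ends
        }
    }
    $flush->();                      # the last section has no header after it
}

# Usage with in-memory filehandles standing in for Input and Output.
my $text = "[A] header\nbeta (2)\nalpha (2)\ngamma (5)\n[B] header\nsolo (1)\n";
open my $in,  '<', \$text       or die "Cannot read sample text: $!";
my $result = '';
open my $out, '>', \$result     or die "Cannot open output buffer: $!";
sort_sections($in, $out, qr/\A\[/);
close $out;
print $result;
```

Any lines that appear before the first header are treated as entries of an unnamed section and are sorted too; if that is unwanted, the header test would need adjusting.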

Re: Use Perl's Sort to only sort certain lines in a file?
by Anonymous Monk on Jan 01, 2015 at 22:44 UTC
    Basically, when something seems too complex, your first instinct as a programmer should be to throw more functions at it. If that doesn't help, your second instinct should be to use objects (as in OOP) :) Or so I heard :)
        use strict;
        use warnings;

        process_file( \*DATA );
        exit 0;

        sub process_file {
            my ($fh) = @_;
            while ( my $line = <$fh> ) {
                print $line;
                if ( $line =~ /section start marker/ ) {
                    handle_section($fh);
                }
            }
        }

        sub handle_section {
            my ($fh) = @_;
            my ( @entries, $line );
            while ( $line = <$fh> ) {
                last unless $line =~ /^entry/;
                push @entries, $line;
            }
            print map { $_->[1] }
                sort { $a->[0] cmp $b->[0] }
                map { [ get_cmp_key(), $_ ] } @entries;
            print $line if defined $line;
        }

        sub get_cmp_key {
            m{ \( (\d+) \s+ (.+) }x or die "Can't match '$_'!";
            # inspired by johngg :)
            return pack 'Na*', $1, $2;
        }

        __DATA__
        [section start marker (shift-opt-5)] [some line of text]
        1 and no entries
        0 none at all!
        [section start marker (shift-opt-5)] [some line of text]
        entry \(7 data\)
        entry \(5 data\)
        entry \(6 data\)
        [section start marker (shift-opt-5)] [some line of text]
        entry \(001 data\)
        entry \(1 data\)
        entry \(01 data\)
        ^ those are equivalent as numbers
        text C
        text B
        text A
        [section start marker (shift-opt-5)] [some line of text]
        entry \(001 dataC\)
        entry \(1 dataB\)
        entry \(01 dataA\)
    output:
        [section start marker (shift-opt-5)] [some line of text]
        1 and no entries
        0 none at all!
        [section start marker (shift-opt-5)] [some line of text]
        entry \(5 data\)
        entry \(6 data\)
        entry \(7 data\)
        [section start marker (shift-opt-5)] [some line of text]
        entry \(001 data\)
        entry \(1 data\)
        entry \(01 data\)
        ^ those are equivalent as numbers
        text C
        text B
        text A
        [section start marker (shift-opt-5)] [some line of text]
        entry \(01 dataA\)
        entry \(1 dataB\)
        entry \(001 dataC\)
      Oops, there is a bug, come to think of it! But subs make bugs much easier to fix.
          sub process_file {
              my ($fh) = @_;
              while ( my $line = <$fh> ) {
                  print $line;
                  if ( $line =~ /section start marker/ ) {
                      $line = handle_section($fh);
                      redo if defined $line;
                  }
              }
          }
          ...
          sub handle_section {
              ...
              return $line;
          }

        Thanks for your interesting help. The more I look at things the more I seem to think that I would like the cleanliness of subroutines. Sad admission, I've been doing most things inside one large "while" loop.

        When I used your code I added an open line called "Input" and exchanged my $line = <$fh> with my $line = <Input>. If I didn't do that I got an error:

            Name "main::Data" used only once: possible typo at untitled text line 9.
            readline() on unopened filehandle Data at untitled text line 16.
        Ought I have done that?

        The output, when I made that change, however looked exactly like the input file. If I gave a sample of the actual text would that help?

            &#64257; a bunch of text that isn't important :–
            dear (13)
            dear friends (22)
            love (10)
            dear friend (10)
            loved (3)
            dearly loved (1)
            friends (1)
            loved so much (1 [+(xi)1181(-i)])
            &#64257; more unimportant text :–
            competes in the games (1)
            contend (1)
            fight (2)
            fought (1)
            make every effort (1)
            strive (1)
            wrestling (1)

        No matter what I tried to mess with I could not get the data to change like you showed in your example.

        Thanks for all the help!
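With this sample, the earlier example would be expected to print the text unchanged: `process_file` only reacts to lines containing "section start marker", and `handle_section` only collects lines matching `/^entry/`, and neither pattern occurs in the real data. A minimal sketch of sorting one section's entries in the format shown above, assuming the frequency is the first "(number" on each line. The sub name `sort_entries` is hypothetical, and the keys are precomputed with the same map/sort/map idiom used earlier in the thread:

```perl
use strict;
use warnings;

# Sort entry lines by descending frequency, then case-insensitively A to Z.
# Each line is decorated as [ frequency, lowercased text, original line ],
# sorted on the first two fields, then stripped back to the original line.
sub sort_entries {
    my @entries = @_;
    return map  { $_->[2] }
           sort { $b->[0] <=> $a->[0] || $a->[1] cmp $b->[1] }
           map  { [ ( /\((\d+)/ ? $1 : 0 ), lc($_), $_ ] } @entries;
}

# The entries of the first sample section, without its header line.
my @entries = (
    "dear (13)\n",
    "dear friends (22)\n",
    "love (10)\n",
    "dear friend (10)\n",
    "loved (3)\n",
    "dearly loved (1)\n",
    "friends (1)\n",
    "loved so much (1 [+(xi)1181(-i)])\n",
);

my @sorted = sort_entries(@entries);
print @sorted;
# dear friends (22)
# dear (13)
# dear friend (10)
# love (10)
# loved (3)
# dearly loved (1)
# friends (1)
# loved so much (1 [+(xi)1181(-i)])
```

Header detection is left out on purpose; the header test would match the shift-opt-5 character at the start of a line instead of a bracket or the word "section".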

Re: Use Perl's Sort to only sort certain lines in a file?
by Laurent_R (Canon) on Jan 01, 2015 at 19:54 UTC
    It seems to me that you need to build an array of arrays (AoA) in which you would slice each of your sections into a sub-array and then sort the sub-arrays individually.
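A minimal sketch of the array-of-arrays idea, assuming header lines start with a bracket and entries carry a "(number" frequency. The names `read_sections`, `freq`, and `by_freq_then_alpha` and the sample lines are hypothetical. Each sub-array holds a header followed by its entries, and every sub-array is sorted on its own once the whole input has been read:

```perl
use strict;
use warnings;

# Frequency of an entry: the first "(digits" on the line, or 0 if there is none.
sub freq {
    my ($entry) = @_;
    return $entry =~ /\((\d+)/ ? $1 : 0;
}

# Highest frequency first; ties broken case-insensitively, A to Z.
sub by_freq_then_alpha {
    freq($b) <=> freq($a) || lc($a) cmp lc($b);
}

# Split the lines into sections: each element is [ header, entry, entry, ... ].
# A first line that is not a header starts a section of its own.
sub read_sections {
    my ($is_header, @lines) = @_;
    my @sections;
    for my $line (@lines) {
        if ( !@sections || $line =~ $is_header ) {
            push @sections, [$line];             # start a new sub-array
        }
        else {
            push @{ $sections[-1] }, $line;      # append to the current one
        }
    }
    return @sections;
}

my @lines = (
    "[A] fruit\n", "pear (1)\n", "apple (3)\n",
    "[B] veg\n",   "leek (2)\n", "kale (2)\n",
);

my @sections = read_sections( qr/\A\[/, @lines );

# Sort each sub-array's entries individually, keeping its header first.
my @out;
for my $section (@sections) {
    my ( $header, @entries ) = @$section;
    push @out, $header, sort by_freq_then_alpha @entries;
}
print @out;
```

This keeps the whole file in memory, which is the trade-off against handling one section at a time.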

      True. However, in the present case, it looks to me like the problem is to process a file that consists of “sections,” each with its own self-contained sorting requirements. Once the entirety of any particular section has been read into memory from the file, it can be processed, written to the output file, and then completely forgotten. So an array of arrays ought to be unnecessary.

      The file consists of (1) lines that are section headers, and (2) everything else, which is to be interpreted as part of the preceding section (if any). Processing of the preceding section begins when the next section header is read, and once again at the end of the file. The sort requirements of each section can easily be handled by one or several sort-comparison functions. The processing in each case is to sort the array of accumulated lines appropriately, then spit out the lines preceded by their section header, then get ready to process the next section. Only the lines from the current section need be retained in memory at any point.

        Yes, you're absolutely right, sundialsvc4, this can be done section by section (read one section, sort it, print it out, then proceed to the next section, etc.), and this is actually what I would most probably do in such a case, especially if the input file is large.