PerlMonks  

sort on numbers embedded in text

by d4vis (Chaplain)
on Oct 09, 2000 at 19:04 UTC ( [id://35912]=perlquestion )

d4vis has asked for the wisdom of the Perl Monks concerning the following question:

Howdy,
I've got a data file that's resisting all of my attempts to sort it correctly. Unfortunately, my skill with Perl is such that even a simple sort is still a challenge, much less a numerical sort on data that's embedded in text strings, like this (three-space separator):
    953-FMT-FMT   954-FMT-FMT   BLOGO93-FMT-FMT   955-FMT-FMT   BOXSTART2-FMT-FMT   BOXEND2-FMT-FMT   956-FMT-FMT   413-FMT-FMT   DATE1-FMT-FMT   414-FMT-FMT   415-FMT-FMT   416-FMT-FMT   BLOGO107-FMT-FMT   417-FMT-FMT   419-FMT-FMT   418-FMT-FMT   BLOGO63-FMT-FMT   393-FMT-FMT   394-FMT-FMT   BLOGO75-FMT-FMT   420-FMT-FMT   421-FMT-FMT   395-FMT-FMT
As you can see, there are 3 formats that I need ordered. Those that are ddd-FMT-FMT need a numerical sort. Those that are BLOGOddd-FMT-FMT need to be sorted separately, but also numerically. Everything else can be sorted normally, but I'm going to run the output through a 20-year-old ATEX system, so I need to retain the -FMT-FMT and BLOGO tags.
I've read the sort docs, to no avail. So far, my (somewhat feeble) attempts to get this done have all been variations on reading the data into an array and trying various sort and s/// combinations. Most of the code I've tried looks something like this:
    #!/usr/bin/perl -w
    open (FH1, "input.txt") || die "Couldn't open input.txt";
    $/ = " ";
    @records = <FH1>;
    open (FH2, ">>output.txt") || die "Couldn't open output.txt";
    foreach my $record (sort {$a cmp $b} @records) {
        print FH2 "$record\n";
    }

Am I approaching this all wrong? Should I be looking at split or map maybe? The prognosis for me figuring this one out anytime soon seems slim, so any help/guidance would be greatly appreciated.

a fronte praecipitium a tergo lupi

~scribe d4vis
#!/usr/bin/fnord

Replies are listed 'Best First'.
Re: sort on numbers embedded in text
by rlk (Pilgrim) on Oct 09, 2000 at 19:24 UTC
    If you have three sets of data that need to be sorted independently, I'd recommend you start by separating your @records into three separate arrays, like so:
        my (@records1, @records2, @records3);
        foreach (@records) {
            if (/^BLOGO/) {
                push @records1, $_;
            } elsif (/\d{3}-FMT-FMT/) {
                push @records2, $_;
            } else {
                push @records3, $_;
            }
        }
    Now you have 3 arrays, each with one data type. Assuming each piece of data has exactly one block of digits, you can sort them with
        foreach (\@records1, \@records2, \@records3) {
            my @recs = @$_;
            #--
            print join("\n", sort {
                my ($a1) = $a =~ /(\d+)/;   # list context, so we get the digits, not a true/false flag
                my ($b1) = $b =~ /(\d+)/;
                $a1 <=> $b1
            } @recs), "\n";
            #--
        }
    If you don't know about references yet, you can just do the code in between "#--"'s once for each array, substituting the array name for @recs.
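A minimal sketch of the same idea without references, under the assumption stated above that every record holds exactly one block of digits. The helper names `embedded_number` and `sort_by_number` are hypothetical and not from this thread; the sample records are a subset of the posted data.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first run of digits in a record (0 if none).
sub embedded_number {
    my ($record) = @_;
    my ($n) = $record =~ /(\d+)/;   # list assignment captures the digits themselves
    return defined $n ? $n : 0;
}

# Hypothetical helper: sort records numerically by their embedded number.
sub sort_by_number {
    my @sorted = sort { embedded_number($a) <=> embedded_number($b) } @_;
    return @sorted;
}

my @records = qw(BLOGO93-FMT-FMT 953-FMT-FMT BLOGO107-FMT-FMT 413-FMT-FMT BLOGO63-FMT-FMT);

# Separate the two numeric formats, then sort each group on its own.
my @blogo = grep { /^BLOGO/ } @records;
my @plain = grep { /^\d+-FMT-FMT$/ } @records;

print join("\n", sort_by_number(@plain), sort_by_number(@blogo)), "\n";
```

The pattern match is re-run for every comparison, which is fine for a file of this size; for much larger inputs it may be worth caching the extracted numbers first.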

    --
    Ryan Koppenhaver, Aspiring Perl Hacker
    "I ask for so little. Just fear me, love me, do as I say and I will be your slave."

(tye)Re: sort on numbers embedded in text
by tye (Sage) on Oct 09, 2000 at 20:07 UTC

    Well, the most common problem when sorting text that contains numbers is that values with different numbers of digits don't sort numerically. Here is one way to fix that. Here is another:

        my @data= qw(
            953-FMT-FMT 954-FMT-FMT BLOGO93-FMT-FMT 955-FMT-FMT
            BOXSTART2-FMT-FMT BOXEND2-FMT-FMT 956-FMT-FMT 413-FMT-FMT
            DATE1-FMT-FMT 414-FMT-FMT 415-FMT-FMT 416-FMT-FMT
            BLOGO107-FMT-FMT 417-FMT-FMT 419-FMT-FMT 418-FMT-FMT
            BLOGO63-FMT-FMT 393-FMT-FMT 394-FMT-FMT BLOGO75-FMT-FMT
            420-FMT-FMT 421-FMT-FMT 395-FMT-FMT
        );
        my %data;
        foreach my $data ( @data ) {
            ( my $sort= $data ) =~ s/(0*)(\d+)/ pack("C",length($2)) . $1 . $2 /ge;
            $data{$sort}= $data;
        }
        print join( "\n", @data{ sort keys %data } ), "\n";
    which produces
        393-FMT-FMT
        394-FMT-FMT
        395-FMT-FMT
        413-FMT-FMT
        414-FMT-FMT
        415-FMT-FMT
        416-FMT-FMT
        417-FMT-FMT
        418-FMT-FMT
        419-FMT-FMT
        420-FMT-FMT
        421-FMT-FMT
        953-FMT-FMT
        954-FMT-FMT
        955-FMT-FMT
        956-FMT-FMT
        BLOGO63-FMT-FMT
        BLOGO75-FMT-FMT
        BLOGO93-FMT-FMT
        BLOGO107-FMT-FMT
        BOXEND2-FMT-FMT
        BOXSTART2-FMT-FMT
        DATE1-FMT-FMT
    note how "BLOGO107" comes after "BLOGO93".
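To see why the length prefix matters, here is a small sketch that compares a plain string sort with a sort on the derived key. It reuses the substitution from the code above but wraps it in a hypothetical helper, `natural_key`, and sorts with a comparator rather than a hash, so duplicate records would not be collapsed.

```perl
use strict;
use warnings;

# Hypothetical helper: build a string-sortable key by prefixing each run of
# digits with one byte holding its length (leading zeros excluded), so that
# shorter numbers compare lower than longer ones.
sub natural_key {
    my ($text) = @_;
    (my $key = $text) =~ s/(0*)(\d+)/ pack("C", length($2)) . $1 . $2 /ge;
    return $key;
}

# Hypothetical helper: string-sort records on their derived keys.
sub natural_sort {
    my @sorted = sort { natural_key($a) cmp natural_key($b) } @_;
    return @sorted;
}

my @items = qw(BLOGO107-FMT-FMT BLOGO63-FMT-FMT BLOGO93-FMT-FMT);

# A plain string sort compares character by character, so "1" sorts before "6".
my @plain = sort @items;

# The derived key compares the length byte before any digit is looked at.
my @natural = natural_sort(@items);

print "plain:   @plain\n";
print "natural: @natural\n";
```

Because the length is packed into a single byte with the "C" template, this sketch assumes no run of digits is longer than 255 characters.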

    If you also want "BLOGO" to sort last, then either separate those beforehand as already mentioned or try this:

        my @data= qw(
            953-FMT-FMT 954-FMT-FMT BLOGO93-FMT-FMT 955-FMT-FMT
            BOXSTART2-FMT-FMT BOXEND2-FMT-FMT 956-FMT-FMT 413-FMT-FMT
            DATE1-FMT-FMT 414-FMT-FMT 415-FMT-FMT 416-FMT-FMT
            BLOGO107-FMT-FMT 417-FMT-FMT 419-FMT-FMT 418-FMT-FMT
            BLOGO63-FMT-FMT 393-FMT-FMT 394-FMT-FMT BLOGO75-FMT-FMT
            420-FMT-FMT 421-FMT-FMT 395-FMT-FMT
        );
        my %data;
        foreach my $data ( @data ) {
            ( my $sort= $data ) =~ s/(0*)(\d+)/ pack("C",length($2)) . $1 . $2 /ge;
            $sort =~ s/^BLOGO/~BLOGO/;
            $data{$sort}= $data;
        }
        print join( "\n", @data{ sort keys %data } ), "\n";

            - tye (but my friends call me "Tye")
      This did the trick.
      Sorted as one column, with ddd-FMT-FMT items followed by the BLOGOddd-FMT-FMT items, then an alphabet sort on the remaining items.
      Many thanks.

      ~d4vis the scribe
      #!/usr/bin/fnord
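For readers who want the ordering described just above (ddd-FMT-FMT items first, then BLOGOddd-FMT-FMT items, then everything else alphabetically) spelled out explicitly, here is one possible sketch. It is not the code posted in this thread: it swaps in a Schwartzian transform with a multi-level comparator, and the helper names `sort_group` and `sort_records` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: classify a record as (group, number).
# Group 0: ddd-FMT-FMT, group 1: BLOGOddd-FMT-FMT, group 2: everything else.
sub sort_group {
    my ($record) = @_;
    if (my ($n) = $record =~ /^(\d+)-/)      { return (0, $n); }
    if (my ($n) = $record =~ /^BLOGO(\d+)-/) { return (1, $n); }
    return (2, 0);
}

# Hypothetical helper: decorate each record with its group and number,
# sort on those fields, then strip the decoration again.
sub sort_records {
    my @sorted =
        map  { $_->[0] }
        sort {    $a->[1] <=> $b->[1]     # group first
               || $a->[2] <=> $b->[2]     # then the embedded number
               || $a->[0] cmp $b->[0] }   # then plain string order
        map  { [ $_, sort_group($_) ] } @_;
    return @sorted;
}

my @records = qw(954-FMT-FMT BLOGO93-FMT-FMT BOXSTART2-FMT-FMT 413-FMT-FMT
                 DATE1-FMT-FMT BLOGO107-FMT-FMT BOXEND2-FMT-FMT BLOGO63-FMT-FMT);

print "$_\n" for sort_records(@records);
```

Records are returned unchanged, so the -FMT-FMT and BLOGO tags survive the sort. Items in the third group are compared as plain strings, so a number embedded in them (as in BOXSTART2) is not treated numerically here.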

Re: sort on numbers embedded in text
by Shendal (Hermit) on Oct 09, 2000 at 19:21 UTC
    Although from your question it is a bit unclear what you are looking for, perhaps the following code segment will help.
        #!/usr/bin/perl -w
        use strict;
        foreach (sort split /\s+/,<DATA>) {
            /^\d/ ? print "FH1: $_\n" : print "FH2: $_\n";
        }
        __DATA__
        953-FMT-FMT   954-FMT-FMT   BLOGO93-FMT-FMT   955-FMT-FMT   BOXSTART2-FMT-FMT   BOXEND2-FMT-FMT   956-FMT-FMT   413-FMT-FMT   DATE1-FMT-FMT   414-FMT-FMT   415-FMT-FMT   416-FMT-FMT   BLOGO107-FMT-FMT   417-FMT-FMT   419-FMT-FMT   418-FMT-FMT   BLOGO63-FMT-FMT   393-FMT-FMT   394-FMT-FMT   BLOGO75-FMT-FMT   420-FMT-FMT   421-FMT-FMT   395-FMT-FMT
    Which outputs...
        FH1: 393-FMT-FMT
        FH1: 394-FMT-FMT
        FH1: 395-FMT-FMT
        FH1: 413-FMT-FMT
        FH1: 414-FMT-FMT
        FH1: 415-FMT-FMT
        FH1: 416-FMT-FMT
        FH1: 417-FMT-FMT
        FH1: 418-FMT-FMT
        FH1: 419-FMT-FMT
        FH1: 420-FMT-FMT
        FH1: 421-FMT-FMT
        FH1: 953-FMT-FMT
        FH1: 954-FMT-FMT
        FH1: 955-FMT-FMT
        FH1: 956-FMT-FMT
        FH2: BLOGO107-FMT-FMT
        FH2: BLOGO63-FMT-FMT
        FH2: BLOGO75-FMT-FMT
        FH2: BLOGO93-FMT-FMT
        FH2: BOXEND2-FMT-FMT
        FH2: BOXSTART2-FMT-FMT
        FH2: DATE1-FMT-FMT

    Cheers,
    Shendal
Re: sort on numbers embedded in text
by merlyn (Sage) on Oct 09, 2000 at 19:09 UTC
    Your specification is incomplete. When you say
    Those that are BLOGOddd-FMT-FMT need to be sorted separately, but also numerically.
    do you mean the result can show up in an entirely separate list, or all at the end of the ddd-FMT-FMT list as a group? And then there's
    Everything else can be sorted normally,
    which again doesn't say if those elements should end up as a string sort (aside: is that what you mean by "normal" {grin}?) in yet a third output list, or put after or before the other two sort lists in the same output.

    Most of the battle of the code is getting the specification right. And then the coding falls naturally from that. So, if you want a solution that fits your problem, please give us a codeable specification.

    -- Randal L. Schwartz, Perl hacker

      My bad.
      The answer to both questions is yes. ;)
      All three lists in the same output works just as well for this purpose as three separate lists. Whichever way is simpler. Heh...and by "normal" in this case I meant a simple alphabetical sort, though my definition of normal is subject to change without notice.
      Clarity, unfortunately, is not always my strong point.

      ~d4vis the scribe
      #!/usr/bin/fnord
