opolat has asked for the wisdom of the Perl Monks concerning the following question:

I have a partially sorted text file. Each line has 9 numbers separated by whitespace. Example:

1 4 6 7 8 9 0 1 0

1 3 4 8 8 9 0 3 2

1 6 7 8 8 8 8 7 1

4 3 6 7 9 1 3 2 1

4 3 4 5 6 7 7 7 7

4 3 4 5 6 3 3 3 7

4 3 4 5 6 3 3 3 7

5 4 6 6 6 6 6 2 4

5 3 4 8 8 9 0 3 2

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

I need to read all the lines that start with the same number and put them into a nested array. In this simplified example, I need to read all lines starting with 1, put them into a nested array, and later sort this array based on a column, then do the same for every other group of lines starting with the same number. Basically: read each group of lines starting with the same number, put them into a nested array, sort it, print it, and continue doing the same for the remaining groups. The nested array:

1 4 6 7 8 9 0 1 0

1 3 4 8 8 9 0 3 2

1 6 7 8 8 8 8 7 1

Sorting it based on column 9:

1 4 6 7 8 9 0 1 0

1 6 7 8 8 8 8 7 1

1 3 4 8 8 9 0 3 2

The nested array:

5 4 6 6 6 6 6 2 4

5 3 4 8 8 9 0 3 2

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

After sorting:

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 6 7 8 8 8 8 7 1

5 4 6 6 6 6 6 2 4

5 3 4 8 8 9 0 3 2

I know how to put lines into a nested array and sort them based on a column. My trouble is that I don't know how to read the lines starting with the same number, put them into an array, and then continue doing the same for the rest of the file. Please help! Thanks.

Re: How can I read multiple lines starting with the same number and put in to a nested array and print it to a file?
by Zaxo (Archbishop) on Jun 24, 2002 at 01:51 UTC

    Your examples appear to be sorted in reverse numerical order, rather than by column 9 as you state. Coincidentally, with this structure, that is the same as a reverse dictionary sort on the strings. You can save a lot of memory by storing them as strings instead of splitting them.

    There are still a couple of ways to go with it. You can either (A) read the file as an array of lines and do a custom sort, or else (B) build an array of arrays indexed by the first digit and sorted by the remainder:

        # data file is open to read on the FOO handle
        my @lines = <FOO>;

        # option A.
        @lines = sort {
            substr( $a, 0, 1) cmp substr( $b, 0, 1)
                or substr( $b, 1) cmp substr( $a, 1)
        } @lines;

        # option B.
        my @ary;
        push @{$ary[ substr( $_, 0, 1)]}, substr( $_, 1) for @lines;
        # first digits that never occur leave undef slots, so fall back to an empty list
        @ary = map {[ reverse sort @{ $_ || [] } ]} @ary;

    If you don't need fast selection grouped on the first digit, option A is the pick. It will be lighter on memory, and possibly faster.

    If you need the row data as an array, split will give you that as needed.
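As a small illustration of the split remark above, the sketch below keeps each row as a plain string and splits it only inside the sort comparison, when a column value is actually needed. The helper name `sort_by_column` and the sample rows are assumptions made for this example; they are not part of Zaxo's code.

```perl
use strict;
use warnings;

# Hypothetical helper: sort whole lines numerically on one 1-based column,
# splitting on whitespace only while comparing.
sub sort_by_column {
    my ($col, @lines) = @_;
    return sort {
        (split ' ', $a)[$col - 1] <=> (split ' ', $b)[$col - 1]
    } @lines;
}

# The three rows starting with 1 from the question.
my @group = (
    "1 4 6 7 8 9 0 1 0\n",
    "1 3 4 8 8 9 0 3 2\n",
    "1 6 7 8 8 8 8 7 1\n",
);

print sort_by_column(9, @group);
```

Splitting inside the comparison repeats work for every pair compared, so this trades some speed for the lower memory use of storing strings. For large groups, caching the key once per line may be the better choice.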

    Update, re: your followup: that would be option B. @{$ary[4]} is the sorted array of all entries starting with four. The four has been stripped for convenience in sorting, but since you know the index you're using, you can restore it for printing or whatever.

    After Compline,
    Zaxo

      As I said, I have no trouble putting lines into an array and sorting them the way I want. My trouble is how to read multiple lines starting with the same number, put only those into an array (sort it, print it) and continue. In my example: read the first line, then all the lines starting with that number (which happens to be 1), put them into an array, sort it, open the output file and print it; then get all the lines starting with the next number, which is 4 in this example, put them into an array, sort it, open the output file and print it, and so on until the end of the file. I need an algorithm. I feel quite stupid; I should be able to work this out, but my brain is not working at the moment! I could put the entire file into an array, but I do not want to do that because of memory problems. Thanks a lot for all the responses.
      Unfortunately I do not know what the starting line numbers are. I have a 3 GB file and I cannot put it into one big array. The example above is extremely simplified. I might have element 10 (the first number in the line) with 20 lines starting with that number; the next number might be 10000, again with 20 lines starting with it. I might have 20000 different elements, which means 20000 * 20 lines (let's say there are 20 lines starting with each element number). Basically I do not know in advance what the element numbers are; I need to find out as I go through the file.
Re: How can I read multiple lines starting with the same number and put in to a nested array and print it to a file?
by atcroft (Abbot) on Jun 24, 2002 at 00:50 UTC

    My mind tumbled through several permutations in reading your post, so I hope one of them will prove helpful or illuminative.

    When I first looked at this, the thought of "homework" passed briefly through my mind, but with just the ideas I am tossing out, it would still be up to you how best to implement the information, so I don't think that will be a factor herein.

    My first thought about the question regarding how to put it into a data structure was that perhaps you were looking at something reminiscent of a radix sort. There are several references for such things online, easily found through a search engine such as Google, so I'll forego that as well.

    One thing you may wish to look at is using a hash of arrays, so you might have something like @{ $yourhash{'5'} }[0..5] = ('567888871', '567888871', '567888871', '567888871', '546666624', '534889032'), for instance. You might wish to check my syntax, as I am writing this without a terminal window open to test it in, but you get the idea.

    Another option might be an array of arrays, so that in the above you end up with $yourarray[5][4] = '546666624'.

    You may wish to review perllol or another such reference for additional assistance.
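To make the hash-of-arrays idea concrete, here is a minimal sketch that groups lines by their first field. The variable name `%by_first` and the sample rows are assumptions for illustration. Note that it holds every line in memory, so it only suits input that fits in RAM.

```perl
use strict;
use warnings;

# Sample rows taken from the question; in practice these would come from a file.
my @lines = (
    "5 4 6 6 6 6 6 2 4",
    "1 4 6 7 8 9 0 1 0",
    "5 3 4 8 8 9 0 3 2",
    "1 3 4 8 8 9 0 3 2",
);

# Hash of arrays: the key is the first field, the value is a reference to
# an array holding every line that starts with that field.
my %by_first;
for my $line (@lines) {
    my ($first) = split ' ', $line;
    push @{ $by_first{$first} }, $line;
}

# Visit the groups in numeric key order and print each member.
for my $key (sort { $a <=> $b } keys %by_first) {
    print "$key: $_\n" for @{ $by_first{$key} };
}

# Individual entries are reached with two subscripts.
print "second line of group 5: $by_first{5}[1]\n";
```

A hash avoids the empty slots that an array indexed by element number would have when the numbers are sparse.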

    I look forward to others' responses, so I may learn as well.

Re: How can I read multiple lines starting with the same number and put in to a nested array and print it to a file?
by flounder99 (Friar) on Jun 24, 2002 at 12:32 UTC
    I think this is what you want. It only uses enough memory to hold the data from one element number. It assumes the file is already sorted by element number and not just sorted ascii-betically. The data will be sorted on column 1 and then on column 9.
    use strict;
    open (INFILE, "<filename") or die "could not open file";
    open (OUTFILE, ">outfilename") or die "could not open outfile";
    my $line = <INFILE>;
    my @record = split " ", $line;
    ELEMENT: while (defined $line) {
        my $element = $record[0];
        my @array;
        push @array, [$line, $record[8]];
        while (1) {
            $line = <INFILE>;
            @record = split " ", $line;
            # the defined test stops element 0 from matching the empty record at end of file
            if (defined $line and $element == $record[0]) {
                push @array, [$line, $record[8]];
            }
            else {
                # sort using pseudo-Schwartzian Transform
                @array = map {$_->[0]} sort {$a->[1] <=> $b->[1]} @array;
                print OUTFILE @array;
                next ELEMENT;
            }
        }
    }

    --

    flounder

      Thanks flounder once again, this program works and does a really good job. I would like to also thank everybody else who responded. Thank you.
Re: How can I read multiple lines starting with the same number and put in to a nested array and print it to a file?
by BrowserUk (Patriarch) on Jun 24, 2002 at 07:07 UTC

    Why do you want to process all the records of one type first rather than build 9 arrays as you move through the file and then sort and output them at the end?

    I think the missing pieces of information from your question are:

    a) How big is the input file?

    b) Is the input file pre-sorted by the first field? i.e. will all the "5 x x x x x x x x" lines be contiguous in the input file?

    If the input file is not pre-sorted, and especially if it is large, then it would probably be faster to use the system sort utility to pre-sort the input file, or a copy of it if you need to preserve the original.

    If your reason for doing one record type at a time is that each subset is very large, then it may well be quicker to split the input file into 9 output files in one pass and then either reload the 9 files, sort and output again, or use the sort utility on them.

    A clear picture of the scale of the problem would make the choice of solution easier.
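The one-pass split described above could be sketched roughly as follows. The helper name `split_by_first_field`, the `.txt` file naming, and the in-memory sample data are all assumptions for illustration. The sketch keeps one open handle per distinct first field, which is fine for nine record types but would likely run into the operating system's open-file limit with thousands of distinct values.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: copy each line of $in_fh into a file named after
# its first field inside $dir. Returns the first fields seen, sorted.
sub split_by_first_field {
    my ($in_fh, $dir) = @_;
    my %out;    # first field => open output handle
    while (my $line = <$in_fh>) {
        my ($first) = split ' ', $line;
        next unless defined $first;    # skip blank lines
        unless ($out{$first}) {
            open $out{$first}, '>', "$dir/$first.txt"
                or die "Can't open $dir/$first.txt: $!";
        }
        print { $out{$first} } $line;
    }
    for my $key (keys %out) {
        close $out{$key} or die "Couldn't close $dir/$key.txt: $!";
    }
    return sort { $a <=> $b } keys %out;
}

# Demo on an in-memory file and a temporary directory.
my $data = "4 3 6 7 9 1 3 2 1\n1 4 6 7 8 9 0 1 0\n4 3 4 5 6 7 7 7 7\n";
open my $in, '<', \$data or die "Can't open in-memory file: $!";
my $dir  = tempdir(CLEANUP => 1);
my @keys = split_by_first_field($in, $dir);
print "wrote $_.txt\n" for @keys;
```

Each output file can then be sorted on its own, either by reloading it or by handing it to the sort utility.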

      The initial input file can be 3 GB. I have a NASTRAN (Finite Element Analysis software) output file which is pre-sorted. The first number in each line represents an element. The element number can be anywhere between 1 and 9999999, and the numbers are not consecutive either. The example above was just a simplified example. I do not know the exact element numbering in advance. So I might have 10 lines starting with element ID 12000; the next element ID might be 13450, and I might have 10 lines starting with that number. So I have a huge file, and memory is crucial. I need an intelligent algorithm that goes through each line, finds the lines starting with the same number (which I do not know in advance), and puts only those into an array. I do not need to create a different array for each element number. Once I have printed the sorted nested array, I can reuse the same array name for the next group of lines starting with the same number. I am not sure if I am making sense. But thanks for listening.

        This is untested and bad Perl (I'm new to it), but the algorithm should be clear and work OK. If you're lucky, one of the experts here will be so appalled by my Perl that he will step in and clean it up or offer you better.

        # somewhere to remember the records we're processing
        my $lastFirstNum = "";
        my @nums; # work array

        while (<>) {
            # get the first number from the line (list context, so we get the field, not a count)
            my ($firstNum) = split ' ', $_;
            next unless defined $firstNum; # skip blank lines
            # prime the pump if it's the first time through
            $lastFirstNum = $firstNum if $lastFirstNum eq "";

            if ( $firstNum eq $lastFirstNum ) {
                # it's still the same type so save it
                push @nums, $_;
                next; # skip to next record
            }

            # the number changed: sort and write out the finished group
            flush_group( $lastFirstNum, \@nums );
            $lastFirstNum = $firstNum;
            @nums = ($_); # clean the array and push the new record
        }
        # write the final group, which no later line would trigger
        flush_group( $lastFirstNum, \@nums ) if @nums;

        sub flush_group {
            my ( $name, $records ) = @_;
            # open output using the first number as the name
            open( FH, ">$name" ) or die "Can't open $name: $!";
            print FH reverse sort @$records;
            close( FH ) or die "Couldn't close $name: $!";
        }

        Update: corrected my own (first) obvious mistake.