rookierabbit has asked for the wisdom of the Perl Monks concerning the following question:

Sigh! I come to you, Monks, out of frustration, desperation, etc. I am fairly new to Perl and programming. The hardest part isn't learning the language, rather understanding approaches. With that said, here is my problem. I have a file of lines in an array

group1 32 48 group1 31 49 group1 57 91 group1 52 89 group3 10 19 group4 23 77

I would like to compare each element with all the rest. If the groups are the same, I would like to look at the first numerical value. If they are within 10 of each other, I would like to take the lesser value and then look at the second number. If they are within 10, I would like to take the greater value.

It seems the best thing to do is put them into a hash of hashes and compare these. Is this smart/possible?

I read the many archived posts and think I have an idea how to do this, but not really an understanding. Also, if two or more lines are "collapsed" into one, I want to make sure to eliminate the old lines so that I have one representative that encompasses the largest range.


i.e.
group1 31 49 group1 52 91 group3 10 19 group4 23 77

Does this make sense? I know I excluded any code. I am looking more for theory right now than code as I would like to try to learn it while writing it. Like I said, it seems to be programming theory that I am having real troubles with. Thanks in advance!

Replies are listed 'Best First'.
Re: compare values within a hash
by JavaFan (Canon) on Oct 27, 2008 at 16:57 UTC
    What happens if you have:
    group1 10 20 group1 18 22 group1 25 21
    When looking at the first numbers, 10 and 18 are less than 10 apart, 18 and 25 are less than 10 apart, but 10 and 25 aren't. Do you want to keep the 10 (because it's the lesser of 10 and 18). Do you want to keep the 18? (because it's the lesser of 18 and 25). Keep them both? Is the 25 discarded? And if I have
    group1 10 20 group1 15 25
    What is the result?
    group1 10 25
    that is, take the lesser of (10, 15) and the greater of (20, 25)?

    Finally, what if I have:

    group1 10 15 group1 15 50
    The first numeral values are less than 10 apart, so you keep the 10. But 15 and 50 are more than 10 apart. What's the result?

      Thanks Java
      I didn't realize how vague I was until I read it again. The file is sorted lexically. Also, in all cases, the lesser of the two numbers is in the first column. I basically am trying to say that foreach same group, are the numbers in the first column within 10 of each other. If yes, are the numbers in the second column ALSO within 10 of each other.

      Your example would give me.

      group1 10 25

      This assumes that the last row would actually be

      group1 21 25

      But they idea is to collapse all three lines into one. I am looking for a "window" or frame that would essentially include all of the overlapping lines collapsed into one. Eventually, I will use the single lines in downstream applications.

      You folks are fantastic. I work with people that are great programmers, not so great at explaining why I need to do why they say!

        When I am interviewing programmers, one of the defining differences between a senior developer and an intermediate developer is the ability to explain what they are doing. I normally expect seniors to be able to mentor more junior people.

        G. Wade
Re: compare values within a hash
by gwadej (Chaplain) on Oct 27, 2008 at 17:21 UTC

    One of the most important things to learn when starting to program is precision in your specification of the problem. You must thoroughly understand what you want to program, because the computer is stupid enough to do exactly what you ask.

    Now, to help refine your requirements, we need to ask more questions:

    Can you have multiple output items in the same group?

    • It looks like the answer is yes, but we should be clear on it.
    • What determines the separate groups...non-overlapping ranges?

    What if the first values are not within 10 but the second ones are, what do we want to do?

    The main point of the questions is to make certain you are perfectly clear on what should happen for any inputs.

    If you can only have one output line for each group, then a hash is a really good way to organize the data. If you can have more than one output line, then you need to get more creative. A hash can only have one value per key, but that value could be a reference to an array.

    For example, the data structure for the initial input could be:

    my %data = ( 'group1' => [ [32,48], [31,49], [57,91], [52,89] ], 'group3' => [ [10,19] ], 'group4' => [ [23,77] ], );

    In this case, each group (key) has an array as a value, and each of those arrays has references to a 2 item list that has the min and max values.

    Although not really difficult, this structure might take a little study to become comfortable.

    G. Wade
      Gotcha! I appreciate everyone's help. These ideas are exactly what I am in need of.

      The computer might be stupid, but I know just enough to convince it to really screw up. I have already accomplished the infinite loop!

      I would like to collapse all overlaps into ONE line. I get what you are asking. If the min values are not within 10 but the max values are, then I would like those lines reduced to one representative.

      Your second question is harder for me, I think. I want at least 10 units of separation between the max of one element and the min of the next. 10 is an arbitrary number, but again, its theory I am looking for, I think.

      I always thought I was pretty bright and an abstract thinker, but you guys are in a completely different league. I have lurked for a long time. Whenever I needed to be humbled, I came here!

        I believe you are getting it. The key is that you need to understand what you want well enough to explain it to this stupid computer. As you get more practice, you will learn which requirements you really need to specify well and which you can slide on.

        Unfortunately, bright and abstract thinking are sort of independent of the way to think about this kind of problem. You almost have to look at it as if the computer is going to exploit anything you haven't specified well. The worst part is that sometimes an underspecified program will appear to work and lull you into a false sense of security.

        The answers you are getting are from people that have been bitten over and over again by times when they thought they knew what they were doing.

        Welcome to the wonderful world of programming!

        G. Wade

        The following two statements about what you are trying to do seem very different to me.

        If the groups are the same, I would like to look at the first numerical value. If they are within 10 of each other, I would like to take the lesser value and then look at the second number. If they are within 10, I would like to take the greater value.
        I want at least 10 units of separation between the max of one element and the min of the next.

        If you have the following:

        group1 10 15 group1 22 50

        Do you want these to be merged because the "separation" between 15 and 22 is less than 10 or not merged because neither 10 and 22 nor 15 and 50 are within 10 of each other?

Re: compare values within a hash
by GrandFather (Saint) on Oct 27, 2008 at 22:41 UTC

    So, you are trying to merge "overlapped" ranges except that you have a fixed size extension around each range? If that is what you are after then probably the easiest way to manage the problem is to build an array sorted by range start. Probably makes things easier if you subtract half the overlap value from the low value of each range and add half the overlap to the high value - that is, store the extended range in each case.

    Once you have build your data structure (a hash of groups where each group entry is an array of ranges), you can run through each range array and amalgamate adjacent ranges in the array that overlap. Because the ranges have already been extended the logic for that should be fairly straight forward. When you are done with amalgamating ranges you can run through the result and adjust each range end point by removing the extension value.


    Perl reduces RSI - it saves typing
Re: compare values within a hash
by toolic (Bishop) on Oct 27, 2008 at 17:12 UTC
    I think you should clarify your spec. Are you only comparing numbers within a group, or also between groups? If the former, maybe a hash-of-arrays might be a suitable approach. This assumes that your input file is not huge, and you can read it into memory all at once.

    Each hash key could be a group name: group1, group2, etc. You would push 2-element arrays for each group. Once you have everything read in, you could loop through the keys, perform your comparisons on each array by 1st element, then by 2nd element, if necessary.