in reply to Re: Size-limited, fitness-based lists
in thread Size-limited, fitness-based lists

Ah, thank you! Heaps sound like a useful data structure for this problem. (It really shows that I don't have a CS background; doesn't it? I don't know the lingo, or even many of the concepts.)

That said -- wouldn't a heap grow to include all items first before you extract the relevant ones? Depending on how large your data is and how much memory you can or cannot afford to throw at it, I can see advantages to my approach, too.

Heap::Simple doesn't work for me, unfortunately: it requires either Heap::Simple::XS or Heap::Simple::Perl, and both fail their test suites.


Re^3: Size-limited, fitness-based lists
by Athanasius (Archbishop) on Aug 09, 2015 at 12:42 UTC
Re^3: Size-limited, fitness-based lists (tests)
by tye (Sage) on Aug 09, 2015 at 17:41 UTC
    Heap::Simple doesn't work for me, unfortunately: it requires either Heap::Simple::XS or Heap::Simple::Perl, and both fail their test suites.

    Nice defeatist attitude. I strongly suspect that the test failures have almost nothing to do with the functionality of the module, and that if you simply "force" the install, Heap::Simple will work fine and be quite usable for you.

    Two minutes of investigation certainly reinforces this impression for me. The complexity of getting the unit tests to run correctly is much greater than and very different from the complexity of making the module functional, especially in this case.

    Update: And a quick glance at the test results leads me to suspect that just using version 0.11 would even lead to the unit tests just working. Update2: And a look at the Changes shows that there are likely no feature differences between 0.11 and the latest version.
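    [Editor's note: the two routes tye describes might look like the following console commands, assuming cpanm's Module@version syntax (available in recent App::cpanminus releases); this is an illustrative sketch, not something from the thread itself.]

    ```
    # Pin the older release whose tests passed:
    cpanm Heap::Simple::Perl@0.11

    # Or force past the failing tests on the current release:
    cpanm --force Heap::Simple
    ```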

    - tye        

      There's defeatist and then there's beating a dead horse. I'm curious what told you that the test results [c|sh]ould be ignored in this case?

      Dum Spiro Spero

        First, I simply understand that the machinery for getting modules installed is rather complex and must deal with differences in environments, while implementing a data structure in pure Perl (without XS) can easily be done in a way that is not sensitive to those differences. So the module having worked for one person while failing for another was simply more likely to be a problem with the installation machinery than with the functionality of a module that implements a data structure.

        So I downloaded Heap::Simple::Perl and ran "perl Makefile.PL" and then "make test" (actually "dmake test" in this case) under Windows, and saw that all of the tests failed but that the error messages were about not even being able to load modules. So the problems in my case had nothing to do with the functioning of the module, since the module under test wasn't even being found. I find that actually reading error messages is very often useful in getting past problems (a simple step I see even experienced software developers skip).

        Next, I looked at the published test results for the module and saw that the majority of them pass. This reinforced my initial assessment. Next, I noticed that the first few failures listed were all on Windows. So the most likely explanation was that the failure to get the tests to run correctly was due to a quirk of Windows or of a common build of Perl on Windows. Update: Actually, taking a couple more minutes looking this morning, http://matrix.cpantesters.org/?dist=Heap-Simple-Perl+0.14 shows that the problem is with newer versions of Perl and (as noted in another reply) with unit tests testing fragile things (unit tests being more sensitive to environment changes is another common problem).

        Next, since many modules are able to get tests to run fine under Windows, I was curious whether the module was doing something unusual or complex in how it arranged for the tests to be run. Looking at Makefile.PL, there was certainly some extra complexity there (at least in part to prompt the user for whether or not they wanted benchmarks to be run). Though, it still looked like it boiled down to a fairly vanilla invocation of ExtUtils::MakeMaker. So it'd take me more time to dive into that complexity to see if I could find a likely source for the failures on Windows, and I had no plans to debug to that level.

        Finally, I noticed that the "% pass" value for the module was 100% for several prior versions of the module. So just asking for a slightly older version to be installed would likely make all of the problems go away. And then reading the Changes file showed that those changes were all about how installation was done and not about what the module actually does, so it seemed extremely likely that both new and old versions of the module would work just fine but that slightly old versions of the module would likely install w/o issues.

        - tye        

        The failing tests in question seem to count the number of magic fetches (through tie), and the number of times a magic value callback is invoked varies between Perl versions; it's an implementation-specific detail. It doesn't really impact the correctness of the module.

Re^3: Size-limited, fitness-based lists
by jdporter (Paladin) on Aug 10, 2015 at 20:33 UTC
    wouldn't a heap grow to include all items first before you extract the relevant ones

    Yes, but I don't see a way to avoid that and yet satisfy your requirement to include all of the members of a class even if that causes the "minimum" to be exceeded. For example, what do you think we should do if the input list consists of a million words all of exactly the same length? If the "size-limited" aspect of the need is strong enough to make us discard some of that set of values, then you need to specify your "cut-off" heuristics somehow.

    I reckon we are the only monastery ever to have a dungeon stuffed with 16,000 zombies.
      wouldn't a heap grow to include all items first before you extract the relevant ones
      Yes

      In general, no. The reason I use a heap for "top 10 of a huge list" is because you can get that result from a heap that never gets above a size of 10. If you allow for ties, then you only have to let the heap grow beyond 10 based on how many items are currently tied for 10th place, which usually means that the heap only ever gets slightly longer than 10 items.

      Start with an empty heap configured to keep the worst item at the head. For each new item:

        • If the heap holds fewer than 10 items, push the new item in.
        • Else, if the new item is worse than the head (the worst kept item), simply discard it.
        • Else, if the new item is tied with the worst item, push it onto a separate stack of tied items.
        • Else, the new item is better than the worst item: replace the worst item with the new item and push it down, then compare the new worst item (the new head of the heap) with the item you just removed. If the new head is better than the old head, discard the old head and empty the list of tied items. Otherwise, push the old head onto the list of tied items.
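      [Editor's note: the procedure above can be sketched in Python (an illustrative translation added for this writeup, not tye's code) using the standard heapq module. heapq provides a min-heap, so when "better" means a larger key, the worst kept item naturally sits at the head; the sequence number in each tuple is just a tie-breaker so equal-keyed items are never compared directly.]

      ```python
      import heapq

      def top_n_with_ties(items, n, key=len):
          """Keep the n best items (largest key), plus anything tied with
          the worst kept item, without ever holding the whole input."""
          heap = []   # min-heap of (key, seq, item): worst kept item at heap[0]
          ties = []   # items currently tied with the worst kept item
          for seq, item in enumerate(items):
              k = key(item)
              if len(heap) < n:
                  heapq.heappush(heap, (k, seq, item))
              elif k < heap[0][0]:
                  pass                              # worse than the worst: discard
              elif k == heap[0][0]:
                  ties.append(item)                 # tied with the worst: set aside
              else:
                  # better than the worst: replace the head and sift down
                  old_k, _, old_item = heapq.heapreplace(heap, (k, seq, item))
                  if heap[0][0] > old_k:
                      ties = []                     # old worst no longer ties the new worst
                  else:
                      ties.append(old_item)         # old worst still ties the new head
          return [item for _, _, item in heap] + ties
      ```

      The heap itself never grows past n; only the ties list can, and only by however many items are currently tied for nth place, which is exactly jdporter's million-identical-words caveat.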

      - tye