in reply to Gay Dating; or, Thrifty Set Intersections with an Accumulator

The first thing I noticed about your post was the statement:
This node compares approaches in the range ( O(N) .. O(N²) ).
and I thought WTF? It's a statement that doesn't make sense.

You do not seem to realize that expressing a complexity in "big-Oh" notation gives you no more information than an asymptotic upper bound. Stating that an algorithm has an O(N²) run-time complexity doesn't mean the algorithm actually grows quadratically. Searching in an unsorted array, for instance, has a run-time complexity of O(N²). It also has a run-time complexity of O(N).
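
To spell that out (a sketch of the standard containment argument, using the definition given further down this thread): if a running time T satisfies

    ∀ m > M: T(m) ≤ c·m

then it also satisfies, for all m > max(M, 1),

    T(m) ≤ c·m ≤ c·m²

so anything in O(N) is automatically in O(N²) as well. An upper bound only bounds; it doesn't pin down the growth rate.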

So, I read your post more carefully. You're using a lot of woolly words, but it doesn't seem to be more than a logical expansion of how the FAQ tells you to intersect arrays. Your memory requirements seem to be Ω(dN), where N is the number of people involved and d the number of days to be considered; no naïve algorithm would use more. What I don't understand is why you need to rescan the file a second time: you store all the names anyway.

For the amount of text and code, I'm not really impressed. Here's a program that does essentially the same thing (it doesn't print a bounding box). About half of the code has to do with formatting, not with actual calculations, and half of the remaining half is used for reading the data.

use 5.010;
use warnings;
use strict;

my (@book, @people);

# Read the data: index each name under every day it appears on.
while (<DATA>) {
    chomp;
    next if /^#/ || !/\S/;
    my ($name, $days) = split;
    my @days = split /,/, $days;
    foreach my $day (@days) {
        push @{$book[$day]}, $name;
    }
    push @people, [$name, \@days];
}

# For each person, collect everyone else available on the same days.
foreach my $info (@people) {
    my ($name, $days) = @$info;
    my %matches;
    foreach my $day (@$days) {
        my @match = grep {$_ ne $name} @{$book[$day]} or next;
        $matches{$day} = \@match if @match;
    }
    show($name, $days, \%matches);
}

# Formatting only: print a person's days and their matches.
sub show {
    my ($name, $days, $matches) = @_;
    printf "%-20s %s\n", $name, join ", ", @$days;
    unless (%$matches) {
        print " Sorry! No matches.\n";
    }
    else {
        foreach my $day (@$days) {
            foreach my $name (@{$$matches{$day} || []}) {
                print " $name can meet you on the $day";
                given ($day) {
                    when (/1[0-9]$/) {print "th"}
                    when (/1$/)      {print "st"}
                    when (/2$/)      {print "nd"}
                    when (/3$/)      {print "rd"}
                    default          {print "th"}
                }
                print ".\n";
            }
        }
    }
    print "\n";
}

__DATA__
# name   days
Al       1,2
Bob      2,13
Chuck    12,4
Dick     3,7,30
Edgar    5,7
Fred     2,23
Greg     4,5,12
Harry    6
Ian      1,20
Jack     4,23

Re^2: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by SuicideJunkie (Vicar) on Aug 11, 2010 at 21:22 UTC

    A range of options between O(N) and O(N²) makes perfect sense to me. As long as you're not trying to mentally apply Perl's range operator, that is. :)

    It simply means you've got a choice between, for example: O(N) memory with a huge constant, O(N log N) with a modest constant, or O(N²) with a small constant. Which one you should use depends on the size of N.

    When you bring in different independent costs, you could also be considering O(N) CPU and O(N log N) memory, vs. O(N log N) CPU and O(N) memory. If you care to think of it this way, there is a multidimensional continuum of efficiencies between O(N) and O(N²) to consider.
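
    As a toy illustration of that kind of trade-off (a sketch of mine, not anything from the node above; the names and data are made up): repeated membership tests can be done with no extra memory but O(N) time per query, or with an O(N)-memory hash and O(1) expected time per query.

        use strict;
        use warnings;

        # No extra memory, O(N) time per query: scan the list every time.
        sub member_scan {
            my ($needle, $haystack) = @_;
            foreach my $item (@$haystack) {
                return 1 if $item eq $needle;
            }
            return 0;
        }

        # O(N) extra memory up front, O(1) expected time per query:
        # build a lookup hash once and close over it.
        sub make_member_hash {
            my ($haystack) = @_;
            my %seen = map { $_ => 1 } @$haystack;
            return sub { exists $seen{ $_[0] } };
        }

        my @names  = qw(Al Bob Chuck Dick Edgar);
        my $in_set = make_member_hash(\@names);

        print member_scan('Dick', \@names), "\n";    # 1: found by linear scan
        print $in_set->('Harry') ? 1 : 0, "\n";      # 0: hash lookup

    Which one wins depends on how many queries you make and on the size of N, which is exactly the point.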


    BTW, I don't know of anybody who bothers to be technical enough to use "Omega" instead of just saying Big-Oh. Except you I suppose.

    In practice, I've found it to always be a "Big-Crossed-Fingers" aka "Best Guess at Omega But Not Worth Strictly Proving; Nobody Wants The Cost To Make It Rigorous."

    Few people would know what you're talking about if you said Big-Omega, instead of Big-Oh. Big-Oh gets used as a generic term whenever there are non-mathies around (which is always unless you're still a math student). Bit of a self-sustaining "problem" but as far as I can tell, nobody really cares.

      It simply means you've got a choice between, for example: O(N) memory with a huge constant, O(N log N) with a modest constant, or O(N²) with a small constant. Which one you should use depends on the size of N.
      That doesn't make any sense.
      When you bring in different independent costs, you could also be considering O(N) CPU and O(N log N) memory
      I presume you mean "consider a program that manages to fill supra-linear memory in linear time". Even with the big-Oh rubbish removed, that still doesn't make sense: since you can only write a fixed amount of memory in a fixed amount of time, a program that runs in linear time cannot use more than a linear amount of memory.
      I don't know of anybody who bothers to be technical enough to use "Omega" instead of just saying Big-Oh. Except you I suppose.
      The difference between "Omega" and "Big-Oh" isn't an obscure technicality. It's as big as the difference between < and >. Or is that a difference you usually cannot be bothered about either?

        s/Omega/Theta/, my bad.

        It seems that you are refusing to acknowledge some basic reality checks in favor of theoretical purity.

        N does not actually get to go to infinity, and neither does your budget of time, money, RAM, disk space, network bandwidth, etc.

        If you can't imagine when one network request and N log N CPU cycles (e.g. querying a central server, then figuring out the sources from the data patterns) could be better than N network requests and N CPU cycles (e.g. asking all the sources directly), then I am truly sorry for you.

        Addendum: Consider also that you may want to not just think about, but actually implement, both an O(N log N) algorithm and an O(N²) one for the same task. Then choose which one to run with an if ($N > $breakEvenPoint) { $result = doNlogN() } else { $result = doNsquared() }
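
        For what it's worth, here is a minimal runnable sketch of that addendum (the intersection task, the data, and the breakeven value are placeholders of mine; doNlogN, doNsquared and $breakEvenPoint are the names from the addendum itself):

            use strict;
            use warnings;

            my $breakEvenPoint = 1_000;    # hypothetical; measure to find the real one

            # O(N log N): sort both lists, then merge-scan for common elements.
            sub doNlogN {
                my ($x, $y) = @_;
                my @a = sort @$x;
                my @b = sort @$y;
                my @common;
                my ($i, $j) = (0, 0);
                while ($i < @a && $j < @b) {
                    if    ($a[$i] lt $b[$j]) { $i++ }
                    elsif ($a[$i] gt $b[$j]) { $j++ }
                    else  { push @common, $a[$i]; $i++; $j++ }
                }
                return \@common;
            }

            # O(N²): compare every pair directly; tiny constant, fine for small N.
            sub doNsquared {
                my ($x, $y) = @_;
                return [ grep { my $e = $_; grep { $_ eq $e } @$y } @$x ];
            }

            my @list1 = qw(Al Bob Chuck Dick Edgar);
            my @list2 = qw(Chuck Dick Fred Greg);

            my $N      = @list1 + @list2;
            my $result = $N > $breakEvenPoint
                       ? doNlogN(\@list1, \@list2)
                       : doNsquared(\@list1, \@list2);
            print "@$result\n";    # Chuck Dick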

Re^2: Gay Dating; or, Thrifty Set Intersections with an Accumulator
by Xiong (Hermit) on Aug 11, 2010 at 16:09 UTC

    Big O notation is a little woolly all by itself. I think you know this, so I won't drive the point into the ground. For some value of "faster" (where faster is not better), an O(N²)-bounded algorithm grows faster than an O(N)-bounded one. That doesn't ensure the latter is always preferred; it's only an indication.

    I stated a range to exclude the possibility of algorithms better (for some value of "better") than O(N), or worse than O(N²). This is not a sophisticated or advanced node. This is a pretty basic introduction to thinking about how to attack a problem set.

    As stated clearly in my FINAL WARNING, the script shown is not any sort of solution to the bioinformatics exercise, here only hinted at. It's a demonstration of technique, only loosely related to the preceding essay. There are only ten simple records; almost anything would work. If we wait, somebody may golf it down to a one-liner.

    The difference between my code and your code is that, as the data set gets larger and the problem set more complex, my editing may go a bit more smoothly. I don't doubt that you'll wind up with a really efficient script; you're good. But as the stress piles on, I'll accommodate it; my pieces are flexible and modular (in the wider sense). To put it another way, I would fear to maintain your code, while you would only hate to maintain mine.

    Note: Contents may change.
      For some value of "faster" (where faster is not better), an O(N²)-bounded algorithm grows faster than an O(N)-bounded one.
      This shows you don't know much about big-Oh.

      Any algorithm whose running time is in "O(N)" (as N tends to infinity) is also in "O(N²)".

      And big-Oh notation is not "woolly". In fact, if the running time of an algorithm is expressed as T(N), with N the size of the input, and it's said that T(N) is in O(f(N)), it means that:

          ∃ M, c > 0: ∀ m > M: T(m) ≤ c·f(m)
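
      A small worked instance of that definition (my numbers, purely for illustration): take T(N) = 3N + 7. With c = 4 and M = 7 we get, for all m > 7,

          T(m) = 3m + 7 ≤ 3m + m = 4m = c·m

      so T(N) is in O(N). By the very same definition it is also in O(N²), which is the whole point about upper bounds.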
      
      The difference between my code and your code is that, as the data set gets larger and the problem set more complex, my editing may go a bit more smoothly.
      And you base this statement on what? (And what has editing to do with the size of the data set?)
      To put it another way, I would fear to maintain your code, while you would only hate to maintain mine.
        I don't know what to think of a programmer who fears to maintain code that consists of two small, self-contained blocks and a small subroutine.