temporal has asked for the wisdom of the Perl Monks concerning the following question:

I seek the wisdom of more statistically-minded monks.

I am writing a script in Perl which will tell me when a data feed has finished downloading the majority of its files based on a prediction of its total volume of files.

Example:
My calculated prediction (based on a neural network result): 1400 files
I have recv'd: 1389

Currently, I am simply checking whether the recv'd amount is at least 95% of the prediction. So for this example, this feed would be marked as "completed." The # of files recv'd for a feed this size might vary by ±50 files and still be OK.

However, I have since added feeds that recv smaller amounts of data.

Example:
Prediction: 10
Recv'd: 8

Now, 8/10 is only 80%. However, this is probably OK as the feed just has fewer files.

What I'm wondering is this:

Is there a clever way to set some kind of tolerance range that I can use to check feed completion of all sizes against their predictions which will scale better than a percentage?

Re: Calculating Completion of Feeds with Varying Volumes
by GrandFather (Saint) on Mar 28, 2012 at 23:05 UTC

    Something like:

    use strict;
    use warnings;

    my $kDone  = 0.95;    # ratio required of very large feeds
    my $kPivot = 100;     # feed size at which half of $kDelta is applied
    my $kDelta = 0.2;     # maximum relaxation, approached by tiny feeds

    for my $pair (
        [10000, 9500], [1400, 1389], [100, 90], [100, 80], [10, 8],
        [10, 7], [5, 4], [2, 1], [1, 0]
        )
    {
        my ($toDo, $done) = @$pair;

        next if !$toDo;

        my $ratio    = $done / $toDo;
        my $adjusted = $kDone - ($kDelta - $toDo / ($kPivot + $toDo) * $kDelta);

        printf "%5d %5d %5.2f %5.2f %s\n", $toDo, $done, $adjusted, $ratio,
            $ratio > $adjusted ? 'Done' : 'Working';
    }

    Prints:

    10000  9500  0.95  0.95 Done
     1400  1389  0.94  0.99 Done
      100    90  0.85  0.90 Done
      100    80  0.85  0.80 Working
       10     8  0.77  0.80 Done
       10     7  0.77  0.70 Working
        5     4  0.76  0.80 Done
        2     1  0.75  0.50 Working
        1     0  0.75  0.00 Working
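
The `$adjusted` expression in the code above can be read as a single formula. Writing N for the predicted count (`$toDo`), and using the constant names from the code, the required ratio t(N) (a symbol introduced here only for illustration) simplifies as follows:

```latex
t(N) = k_{\mathrm{Done}} - \left(k_{\mathrm{Delta}} - \frac{N}{k_{\mathrm{Pivot}} + N}\,k_{\mathrm{Delta}}\right)
     = k_{\mathrm{Done}} - k_{\mathrm{Delta}}\,\frac{k_{\mathrm{Pivot}}}{k_{\mathrm{Pivot}} + N}
```

With the values used in the code (0.95, 100, 0.2), the threshold approaches 0.75 as N goes to 0, is exactly 0.85 at N = 100, and approaches 0.95 for very large N, which matches the printed table.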
    True laziness is hard work
Re: Calculating Completion of Feeds with Varying Volumes
by ww (Archbishop) on Mar 28, 2012 at 23:53 UTC
    GrandFather has offered a Perl-ish answer to your actual question.

    This node is non-Perl and non-statistical. Rather, it questions your approach as outlined in your comment 'I am simply checking to see if the recv'd amount is within 95% of the prediction. So for this example, this feed would be marked as "completed."'

    I hope I'm wrong, but that sounds a lot like tossing lighted flares into a powder keg.

    But maybe not.

    • What's the significance of marking a feed as "completed"?
    • Does that cause any action other than removing it from your (figurative) todo list?
    • If so, do you dare risk an action based on a possibly misbegotten belief that a feed has been completed?
    • If you're going to wait for a feed to be 80 or 95% complete, why are you trying (or so I gather) to avoid waiting a small, further increment, for actual completion? After all, actual completion can be tested -- fairly rigorously -- in various reliable ways.
Re: Calculating Completion of Feeds with Varying Volumes
by Anonymous Monk on Mar 29, 2012 at 13:06 UTC
    The only way to know the progress of a feed is to somehow first ask the feed source how much data it has to send. Or, if you control that source, simply include the percent-done information as part of the records in the feed. Either way, the only party that knows the answer, and therefore must provide it, is the source itself. A simple number of bytes received just might be as useful to the human operators as anything; they would quickly learn how much data to expect. Storing the number of bytes received from recent feeds might be useful both as a diagnostic and as a predictor.
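
The last suggestion, using stored byte counts from recent feeds as a predictor, could be sketched roughly as below. This is only an illustration: the `predict_bytes` name, the trailing-window average, and the sample history are assumptions, not anything described in the thread.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Hypothetical helper: predict the next feed's size as the mean of the
# most recent $window byte counts. $history is an array ref, oldest first.
sub predict_bytes {
    my ($history, $window) = @_;

    # Take the last $window entries, or everything if the history is short.
    my @recent = @$history > $window
        ? @{$history}[ -$window .. -1 ]
        : @$history;

    return unless @recent;    # no history, no prediction
    return sum(@recent) / @recent;
}

# Made-up byte counts for four past days of one feed.
my @history = (100, 200, 300, 400);
printf "Predicted bytes: %.0f\n", predict_bytes(\@history, 2);
```

A plain average is the simplest choice; it will lag behind a feed that is trending steadily up or down.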
Re: Calculating Completion of Feeds with Varying Volumes
by temporal (Pilgrim) on Mar 29, 2012 at 14:48 UTC

    Thank you all for your replies.

    GrandFather's solution is interesting and I will give it a try. Definitely did not think of doing a check that way.

    The reason I have created this script is to track feeds which are trending in one direction or another at some varying rate - it is difficult to set a constant value to check against. Hence, I have created a neural network for each feed which uses extensive historical data to make fairly accurate predictions of feed volumes for the following day.

    To answer your questions ww:

    The significance of marking a feed "completed" is that I can stop worrying about it not being finished on time. I have a similar NN predicting a recv'd time for each feed as well so that I know when I should start worrying about a feed being overly late and can contact the content provider. There is no automated action that I would risk linking to this script, it is simply feeding a webapp that I use for monitoring.

    That said, I am willing to risk the possibility of mistakenly marking a feed "complete." This is very rare due to a feature of the feeds - the majority of the files will come in 1 or 2 of the updates. It is usually the case that after I recv these updates the feed can be marked complete.

    Waiting for actual completion is generally not an option due to the feature described above. The final files of the feed typically arrive much, much later than the bulk and I am only concerned with having recv'd that majority of files as it generally guarantees that the rest will follow. So it wouldn't be a small increment to wait =)

    Also, I do not control these feeds, nor do I have a way to contact the owners for a total size.

    So I was thinking that there must be a way to create some sort of continuous sliding tolerance value which I could use to calculate acceptable "complete" volumes for feeds of all sizes.

      ++ for your response... even though I still have some issues.
      1. You say "I am only concerned with having recv'd that majority of files as it generally guarantees that the rest will follow." My paranoia/pessimism (about the inclination of complicated processes to fail unexpectedly) tells me that if I don't have the whole package, I may not get it. OTOH, "...generally guarantees...." is likely a fair to good indicator if your "extensive historic data" allows you to infer a stage at which the feed is unlikely to fail.
      2. On the proverbial third hand, why mark a feed "completed" when it's not? You could just as well mark it "Lookin' good, so far at nn%" and report that to your app. And, perhaps even better, you could also use your historic data to call attention to any feed that is failing to satisfy your "likely to succeed" criterion at some stage of reception.

        A notice that one has a potential problem is likely, IMO, to be more useful than a notice that says "All's well on the Western Front."

      GrandFather's approach should be easy to adapt to identifying likely failures.
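
One way that adaptation might look is sketched below. The constants and threshold expression are copied from GrandFather's reply; everything else is an assumption made for illustration: the `threshold` and `feed_status` names, the status strings, and the `$overdue` flag, which the caller would set once the predicted arrival time has passed.

```perl
use strict;
use warnings;

# Constants copied from GrandFather's reply.
my $kDone  = 0.95;
my $kPivot = 100;
my $kDelta = 0.2;

# Required received/predicted ratio for a feed of the predicted size.
sub threshold {
    my ($predicted) = @_;
    return $kDone - ($kDelta - $predicted / ($kPivot + $predicted) * $kDelta);
}

# Hypothetical helper: report progress instead of "completed", and flag
# feeds that are still below their threshold after they were due.
sub feed_status {
    my ($predicted, $received, $overdue) = @_;

    return 'No prediction' if !$predicted;

    my $ratio = $received / $predicted;

    return sprintf('Looking good at %.0f%%', 100 * $ratio)
        if $ratio > threshold($predicted);

    return $overdue ? 'Possible problem' : 'Working';
}

# Sample feeds: [predicted, received, overdue].
for my $feed ([1400, 1389, 0], [10, 8, 0], [100, 80, 0], [100, 80, 1]) {
    printf "%5d %5d %s\n", $feed->[0], $feed->[1], feed_status(@$feed);
}
```

Keeping the "possible problem" state separate from "working" means the monitoring page can draw attention to late feeds rather than only confirming the healthy ones.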