in reply to best data structure

It seems a few questions would need to be asked first, but here is an approach that worked for us:

I tackled this problem years ago. We had a FIFO that other applications could write to (if they had write permission), and a process-flow control program read from it. We kept running into users who would submit the same job many times, even though each job could only run once per request. We wound up keeping a hash with the job name as the key. At times we would see up to 1000 jobs submitted within a few minutes, and performance was fine. Using the job name as the hash key means Perl prevents dups for you; alternatively, you can test for the existence of the key and handle dups any way you want to. We actually implemented the queue so that a job could be set to do one of several things on dup requests.
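A minimal sketch of that dedup check (the job name and the ignore-on-dup behavior here are illustrative, not our original code):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Jobs we have already accepted, keyed by job name.
    my %seen;

    sub submit_job {
        my ($name) = @_;
        # exists() spots a duplicate before we do any work.
        if (exists $seen{$name}) {
            warn "duplicate request for job '$name' ignored\n";
            return 0;
        }
        $seen{$name} = time();    # remember when we first saw it
        return 1;
    }

    submit_job('nightly_report');    # accepted
    submit_job('nightly_report');    # rejected as a dup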

Also, by filtering jobs this way, we could determine whether a dup was the same process with the same data, or merely the same process regardless of the data it carried. Using this model, we were able to create a very powerful tool that handled workflow (both programs and people) using a simple process-definition language.
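One way to draw that distinction (an illustration on my part, not our exact keying scheme) is to fold a digest of the job's data into the dedup key only when the data should matter:

    use Digest::MD5 qw(md5_hex);

    # Key by process name alone, or by name plus a digest of
    # the data, so the same process with *different* data is
    # not treated as a dup.
    sub dedup_key {
        my ($name, $data, $match_data) = @_;
        $data = '' unless defined $data;
        return $match_data ? $name . ':' . md5_hex($data) : $name;
    }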

The general idea of it is as follows:
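I no longer have the code, so this is only a bare-bones sketch of the pattern; the FIFO path, the tab-separated line format, and the per-job dup policies are all invented for illustration:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Hypothetical FIFO; each line is "jobname<TAB>data".
    my $fifo = '/var/spool/jobs.fifo';

    my %queued;          # jobs currently queued, by job name
    my %dup_policy = (   # per-job behavior on a dup request
        nightly_report => 'ignore',    # drop the dup silently
        reindex        => 'warn',      # log it, then drop it
        backup         => 'requeue',   # replace the queued copy
    );

    open my $fh, '<', $fifo or die "cannot open $fifo: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($name, $data) = split /\t/, $line, 2;

        if (exists $queued{$name}) {
            my $policy = $dup_policy{$name} || 'ignore';
            warn "dup request for '$name'\n" if $policy eq 'warn';
            next unless $policy eq 'requeue';
        }
        $queued{$name} = $data;    # accept (or replace) the job
        # ... hand $name/$data to the flow-control logic here ...
    }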

Though not necessarily the best solution, using a hash makes the queue much easier to manage (and allows tie), and it opens up many possibilities for growth.
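For example (assuming the standard Berkeley DB bindings; persistence was not part of what I described above), tie lets the same hash live on disk so queued jobs survive a restart, with no other code changes:

    use DB_File;

    # Tie the queue hash to an on-disk DB file.
    tie my %queued, 'DB_File', '/var/spool/jobs.db'
        or die "cannot tie job database: $!";

    $queued{nightly_report} = time();   # written through to disk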

Msg me for more on this if you are interested.

Digiryde