in reply to extracting from text

Your problem, as I see it, is twofold:
  1. How to parse the data and store it into a Perl data structure
  2. Your data consists of ranges... how to merge adjacent ranges into a single range

Let's start with the first part... Your idea to generate variable names out of data is a very bad one (IMnsHO). See Why it's stupid to 'use a variable as a variable name' for the reasons.

I'd rather use a single hash to store everything, something like:

    %presence = (
        'hi'  => [ { from => 65, to => 85 }, { from => 86, to => 106 } ],
        'bye' => [ { from => 12, to => 32 }, { from => 33, to => 53 } ],
    );
Instead of the 2 item hashes with "from" and "to" values, you could choose to use a 2 item array instead, which allegedly is better for resource usage (memory):
    %presence = (
        'hi'  => [ [65, 85], [86, 106] ],
        'bye' => [ [12, 32], [33, 53] ],
    );
With constants
    use constant FROM => 0;
    use constant TO   => 1;
access code to dig into the data structure could look quite similar.
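
To make that concrete, here is a small sketch that walks both variants of the structure. The sub names `describe_hashes` and `describe_arrays` are made up for this example, and the data is just the sample from above.

```perl
use strict;
use warnings;

use constant FROM => 0;
use constant TO   => 1;

my %by_hash  = ( hi => [ { from => 65, to => 85 }, { from => 86, to => 106 } ] );
my %by_array = ( hi => [ [ 65, 85 ], [ 86, 106 ] ] );

# Hash-based records: each field is looked up by key.
sub describe_hashes {
    my ($presence) = @_;
    my @out;
    for my $name ( sort keys %$presence ) {
        push @out, sprintf( '%s: %d-%d', $name, $_->{from}, $_->{to} )
            for @{ $presence->{$name} };
    }
    return @out;
}

# Array-based records: same shape, but fields are looked up by constant index.
sub describe_arrays {
    my ($presence) = @_;
    my @out;
    for my $name ( sort keys %$presence ) {
        push @out, sprintf( '%s: %d-%d', $name, $_->[FROM], $_->[TO] )
            for @{ $presence->{$name} };
    }
    return @out;
}

print "$_\n" for describe_hashes( \%by_hash ), describe_arrays( \%by_array );
```

Apart from `{from}` versus `[FROM]`, the two subs are the same, so switching representation later should be cheap.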

Now, how do you process the data and put it into the hash? Something like this:

    my %presence;
    while (<INPUT>) {
        my ($name, $from, $to) = /^(\w+):\s+(\d+)\s+.*?\s+(\d+)$/ or next;
        push @{ $presence{$name} }, { from => $from, to => $to };
        # or, with arrays:
        # push @{ $presence{$name} }, [ $from, $to ];
    }
Yes, that really is all it takes to build the data structure I showed above from your data files.

Part 2 of your problem is merging ranges that are adjacent or that may overlap. For that, you'll have to loop through the collected ranges in the array for each name, check whether each one touches any other range in the array, and if so, merge them.

You could do that with nested loops: for each item, loop through all the ranges already selected. However, I think this could prove bug-prone, as you may have to loop again after each merge to see whether you can merge them even more.
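
One way to sidestep the re-looping, offered as a sketch rather than a battle-tested solution: sort the ranges by their start first, so that each range only ever needs to be compared with the last merged one. The name `merge_ranges` is my own invention, and I'm assuming integer ranges that merely touch (85 and 86) should be merged, as your sample data suggests.

```perl
use strict;
use warnings;

# Merge overlapping or adjacent integer ranges, given as [from, to] pairs.
sub merge_ranges {
    my @sorted = sort { $a->[0] <=> $b->[0] } @_;
    my @merged;
    for my $range (@sorted) {
        my ( $from, $to ) = @$range;
        if ( @merged && $from <= $merged[-1][1] + 1 ) {
            # Touches or overlaps the last merged range: extend it if needed.
            $merged[-1][1] = $to if $to > $merged[-1][1];
        }
        else {
            # Start a new range; copying keeps the caller's data untouched.
            push @merged, [ $from, $to ];
        }
    }
    return @merged;
}

my @merged = merge_ranges( [ 65, 85 ], [ 12, 32 ], [ 86, 106 ], [ 33, 53 ], [ 40, 45 ] );
print join( ', ', map { sprintf '%d-%d', @$_ } @merged ), "\n";    # 12-53, 65-106
```

Because the input is sorted, a range that is completely contained in an earlier one (like 40-45 above) is simply absorbed, and no second pass is needed.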

Or you could use a module. Set::IntSpan::Fast looks like a good candidate. What's more, judging by its docs, it internally uses the data structure I would have thought best for merging range sets. I don't know of an official name, but I'd call it a "toggle list". That would have been my 3rd suggestion. :)

The internal representation used is extremely simple: a set is represented as a list of integers. Integers in even numbered positions (0, 2, 4 etc) represent the start of a run of numbers while those in odd numbered positions represent the ends of runs. As an example the set (1, 3-7, 9, 11, 12) would be represented internally as (1, 2, 3, 8, 9, 10, 11, 13).
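
To show the idea in isolation, here is my own toy illustration of a toggle list. It is not the module's actual code, and the sub names `to_toggle_list` and `contains` are hypothetical. Each run is stored as its first integer followed by the first integer after its end, so membership comes down to counting how many toggle points lie at or below a number.

```perl
use strict;
use warnings;

# Build a toggle list from already merged, sorted [from, to] ranges:
# each range contributes its start and the first integer past its end.
sub to_toggle_list {
    return map { ( $_->[0], $_->[1] + 1 ) } @_;
}

# A number is in the set when an odd number of toggle points are <= it.
# (A linear scan keeps the sketch short; a binary search would scale better.)
sub contains {
    my ( $n, @toggles ) = @_;
    my $passed = grep { $_ <= $n } @toggles;
    return $passed % 2;
}

my @toggles = to_toggle_list( [ 12, 53 ], [ 65, 106 ] );
print "@toggles\n";    # 12 54 65 107
print contains( 53, @toggles ) ? "53 is in\n" : "53 is out\n";
print contains( 54, @toggles ) ? "54 is in\n" : "54 is out\n";
```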

(Note: I was first introduced to this kind of representation by demerphq. Thanks for that. I understood he uses it to represent Unicode character classes in his updates to the perl5 regexp engine.)

Once you get your final IntSpan object, you could store it as is, as the value for a name in the hash; or you could convert it back to the representation I showed you above.
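
A minimal sketch of that round trip, assuming Set::IntSpan::Fast is installed from CPAN and that its documented `add_range` and `iterate_runs` methods behave as described there. The `%merged` hash is just a name I picked for the result.

```perl
use strict;
use warnings;
use Set::IntSpan::Fast;

my %presence = (
    hi  => [ [ 65, 85 ], [ 86, 106 ] ],
    bye => [ [ 12, 32 ], [ 33, 53 ] ],
);

my %merged;
for my $name ( keys %presence ) {
    # Feed every inclusive [from, to] pair into one set per name.
    my $set = Set::IntSpan::Fast->new;
    $set->add_range(@$_) for @{ $presence{$name} };

    # Convert the set back into a list of [from, to] pairs.
    my $iter = $set->iterate_runs;
    while ( my ( $from, $to ) = $iter->() ) {
        push @{ $merged{$name} }, [ $from, $to ];
    }
}

for my $name ( sort keys %merged ) {
    print "$name: ", join( ', ', map { sprintf '%d-%d', @$_ } @{ $merged{$name} } ), "\n";
}
```

Since the set holds integers, 65-85 and 86-106 come back as one run, which is exactly the merge you were after.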

Re^2: extracting from text
by oha (Friar) on Dec 10, 2007 at 13:44 UTC
    just for fun:
        my @data = ([1,30], [40, 50], [25, 37], [60, 70], [50, 60], [65, 99]);
        @data = map { my ($a, $b) = split/:/; [$a, $b]; }
                split / /, join ':',
                map { my @d = split / /; scalar @d > 1 && $d[0]>=$d[1] ? ():"@d" }
                split /:/, join ' ',
                map {"$_->[0]:$_->[1]"}
                sort {$a->[0] <=> $b->[0]} @data;
    Oha