heidi has asked for the wisdom of the Perl Monks concerning the following question:

I have a text file like this.
hi: 65 abcdefghijklmnopqrst 85 bye: 12 bcdefghijklmnopqrstu 32 hi: 86 sagfsdgsgwsehbbdgops 106 bye: 33 afasdfdfafasaafadfad 53 i just want to store the digits only as $hi_from=65 and $hi_to=106 $bye_from=12 and $bye_to=53
how do i go about it ?

Replies are listed 'Best First'.
Re: extracting from text
by bart (Canon) on Dec 10, 2007 at 11:24 UTC
    Your problem, as I see it, is twofold:
    1. How to parse the data and store it into a Perl data structure
    2. Your data consists of ranges... how to merge adjectant ranges into a single range

    Let's start with the first part... Your idea to generate variable names out of data is a very bad one (IMnsHO). See Why it's stupid to 'use a variable as a variable name' for the reasons.

    I'd rather use a single hash to store everything, something like:

    %presence = ( 'hi' => [ { from => 65, to => 85 }, { from => 86, to => 106} ], 'bye' => [ { from => 12, to => 32 }, { from => 33, to => 53 } ], );
    Instead of the 2 item hashes with "from" and "to" values, you could choose to use a 2 item array instead, which allegedly is better for resource usage (memory):
    %presence = ( 'hi' => [ [65, 85], [86, 106] ], 'bye' => [ [12, 32], [33, 53] ], );
    With constants
    use constant FROM => 0; use constant TO => 1;
    access code to dig into the data structure could look quite similar.

    Now, how do you process the data and put it into the hash? Something like this:

    my %presence; while(<INPUT>) { my($name, $from, $to) = /^(\w+):\s+(\d+)\s+.*?\s+(\d+)$/ or next; push @{$presence{$name}}, { from => $from, to => $to }; # or, with arrays: # push @{$presence{$name}}, [ $from, $to ]; }
    Yes, that really is all it takes to build the data structure I showed above from your data files.

    Part 2 of your problem is merging ranges that are adjectant, or possibly may overlap. For that, you'll have to loop through the collected values in the array for each name, and see if it touches any other range in the array, and if so, merge them.

    You could do that with nested loops, for each item, loop through all other all already selected ranges. However, I think this could prove bugprone, you may have to loop again after each merge to see if you can't merge them even more.

    Or you could use a module. Set::IntSpan::Fast looks like a good candidate, what's more: looking at its docs, it apparently internally uses the data structure I would have thought best for merging range sets. I don't know of an official name, but I'd call it a "toggle list". That would have been my 3rd suggestion. :)

    The internal representation used is extremely simple: a set is represented as a list of integers. Integers in even numbered positions (0, 2, 4 etc) represent the start of a run of numbers while those in odd numbered positions represent the ends of runs. As an example the set (1, 3-7, 9, 11, 12) would be represented internally as (1, 2, 3, 8, 11, 13).

    (Note: I was first introduced to this kind of representation by demerphq. Thanks for that. I understood he uses it to represent Unicode character classes in his updates to the perl5 regexp engine.)

    Once you get your final IntSpan object, you could store it as is, as the value for a name in the hash; or you could convert it back to the representation I showed you above.

      just for fun:
      my @data = ([1,30], [40, 50], [25, 37], [60, 70], [50, 60], [65, 99]); my @data = map { my ($a, $b) = split/:/; [$a, $b]; } split / /, join ':', map { my @d = split / /; scalar @d > 1 && $d[0]>=$d[1] ? ():"@ +d" } split /:/, join ' ', map {"$_->[0]:$_->[1]"} sort {$a->[0] <=> $b->[0]} @data;
      Oha
Re: extracting from text
by johngg (Canon) on Dec 10, 2007 at 11:48 UTC
    If you extract your data into a HoHoA keyed by 'hi'/'bye' then 'to'/'from' with the array elements being the values, you can then use min and max from List::Util to get what you want.

    use strict; use warnings; use Data::Dumper; use List::Util qw{ min max }; open my $textFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n}; hi: 65 abcdefghijklmnopqrst 85 bye: 12 bcdefghijklmnopqrstu 32 hi: 86 sagfsdgsgwsehbbdgops 106 bye: 33 afasdfdfafasaafadfad 53 END_OF_FILE my %dataItems = (); while ( <$textFH> ) { next unless m{^(hi|bye):\s+(\d+)\D+(\d+)}; push @{ $dataItems{ $1 }->{ from } }, $2; push @{ $dataItems{ $1 }->{ to } }, $3; } close $textFH or die qq{close: $!\n}; print Data::Dumper->Dumpxs( [ \ %dataItems], [ q{*dataItems} ] ); my $hiFrom = min( @{ $dataItems{ hi }->{ from } } ); my $hiTo = max( @{ $dataItems{ hi }->{ to } } ); my $byeFrom = min( @{ $dataItems{ bye }->{ from } } ); my $byeTo = max( @{ $dataItems{ bye }->{ to } } ); print qq{\$hiFrom = $hiFrom and \$hiTo = $hiTo\n}, qq{\$byeFrom = $byeFrom and \$byeTo = $byeTo\n};

    Here's the output (with Data::Dumper output included so you can see the data structure)

    %dataItems = ( 'hi' => { 'to' => [ '85', '106' ], 'from' => [ '65', '86' ] }, 'bye' => { 'to' => [ '32', '53' ], 'from' => [ '12', '33' ] } ); $hiFrom = 65 and $hiTo = 106 $byeFrom = 12 and $byeTo = 53

    I hope this is useful.

    Cheers,

    JohnGG

Re: extracting from text
by misc (Friar) on Dec 10, 2007 at 11:28 UTC
    Hi, I'd like to suggest a simple approach - It's more easy to change/understand if you read the code some time later.
    #!/usr/bin/perl -w use strict; my ($hi_from,$hi_to,$bye_from,$bye_to); while ( my $line = <DATA> ){ $line =~ /^(.?)\D*(\d*)\D*(\d*)/; next if ( !$1 ); # Empty line # print "1: $1 2: $2 3: $3\n"; if ( $1 eq 'h' ){ if ( !defined($hi_from) || $hi_from >= $2 ){ $hi_from = $2; } if ( !defined($hi_to) || $hi_to <= $3 ){ $hi_to = $3; } } if ( $1 eq 'b' ){ if ( !defined($bye_from) || $bye_from >= $2 ){ $bye_from = $2; } if ( !defined($bye_to) || $bye_to <= $3 ){ $bye_to = $3; } } } print "hi_from: $hi_from hi_to: $hi_to\n"; print "bye_from: $bye_from bye_to: $bye_to\n"; __DATA__ hi: 65 abcdefghijklmnopqrst 85 bye: 12 bcdefghijklmnopqrstu 32 hi: 26 sagfsdgsgwsehbbdgops 106 bye: 33 afasdfdfafasaafadfad 53 hi: 89 sagfsdgsgwsehbbdgops 188 bye: 37 afasdfdfafasaafadfad 89
    It would be possible to do some cool constructs with map, hashes, ...
    However, if you have to change something later, something like this is much more easy to understand and debug. (Although unelegant ).

    It's also a matter of the needs you have.
    • Which size could the file have ?
    • Is speed important ?
    • Do you need some warnings if the file is misformed ?
    • ...

    HTH, Michael
Re: extracting from text
by mwah (Hermit) on Dec 10, 2007 at 10:44 UTC

    I'd save an array of values under the hash key and sort afterwards (looks not very elegant but works):

    ... my @stuff= ( 'hi: 65 abcdefghijklmnopqrst 85', 'bye: 12 bcdefghijklmnopqrstu 32' +, 'hi: 86 sagfsdgsgwsehbbdgops 106', 'bye: 33 afasdfdfafasaafadfad 53 +'); my %cnt; /(\w+)\D+(\d+)\s+\w+\s+(\d+)/ && push @{$cnt{$1}},($2,$3) for @stuff; @$_ = sort { $a <=> $b } @$_ for values %cnt; print map "\$${_}_from=$cnt{$_}[0] and \$${_}_to=$cnt{$_}[-1]\n", keys %cnt; ...

    displays here:

    $hi_from=65 and $hi_to=106 $bye_from=12 and $bye_to=53

    Addendum: after reading the other posts I'd think one could also beam the values into predefined my-Variables by an eval. Im not sure if (w/error handling added) this would be that bad:

    use strict; # go for it use warnings; my @stuff= ( 'hi: 65 abcdefghijklmnopqrst 85', 'bye: 12 bcdefghijklmnopqrstu 32' +, 'hi: 86 sagfsdgsgwsehbbdgops 106', 'bye: 33 afasdfdfafasaafadfad 53 +'); my %cnt; /(\w+)\D+(\d+)\s+\w+\s+(\d+)/ && push @{$cnt{$1}},($2,$3) for @stuff; @$_ = sort { $a <=> $b } @$_ for values %cnt; my ($hi_from, $hi_to, $bye_from, $bye_to); # if we *know* what to exp +ect while( my ($k,$v) = each %cnt ) { my @evil_plan = ( "\$${k}_from=$$v[0]", "\$${k}_to=$$v[-1]" ); print join(' and ', @evil_plan), "\n"; # show what we prepared map eval, @evil_plan; # execute the unthinkable } print "\n$hi_from, $hi_to \n$bye_from, $bye_to\n"; # control

    Regards

    mwa

Re: extracting from text
by narainhere (Monk) on Dec 10, 2007 at 11:21 UTC
    You Have it here
    use strict; use warnings; my $contents; my $hi_from; my $hi_to; my $bye_from; my $bye_to; open (FH,"filename") or die ("Can't open"); while(<FH>) { chomp($_); $contents.=$_; } ($hi_from=$contents)=~s/hi:\s+(\d*)/$1/; $hi_from=$1; ($hi_to=$contents)=~s/(hi:\s+)(?!.*hi:)(.*\d+)(?=.*bye:)/$2/; ($hi_to=$2)=~s/\d+//; $hi_to=~s/.*\s(\d+)/$1/; ($bye_from=$contents)=~s/bye:\s+(\d*)/$1/; $bye_from=$1; ($bye_to=$contents)=~s/(bye:\s+)(?!.*bye:)(.*\d+)/$2/; ($bye_to=$2)=~s/\d+//; $bye_to=~s/.*\s(\d+)/$1/; print "\n\n\nHi From: $hi_from\n"; print "Hi to: $hi_to\n"; print "Bye from: $bye_from\n"; print "bye to: $bye_to\n";
    First I am reading from the file and creating a flat-string of the file contents.Then I go about doing a lot of regexe's to ensure we get what we want!!
    Sorry if you find the regexe's difficult.It will help you learn a lot of things about regexe's, if you can decipher it!!

    The world is so big for any individual to conquer

Re: extracting from text
by poolpi (Hermit) on Dec 10, 2007 at 11:46 UTC
    #!/usr/bin/perl use strict; use warnings; use Slurp; use Data::Dumper; my @lines = slurp('file.txt') or die; my $h = {}; my ($name, $from, $to ); for ( @lines) { next if /^$/; s/\s+//g; ( $name, $from, $to ) = /(\w+) : (\d{2,}) \w+? (\d{2,})/xms; $h->{$name}{'from'} = $from unless defined $h->{$name}{'from'}; $h->{$name}{'from'} = $from if $from < $h->{$name}{'from'}; $h->{$name}{'to'} = $to unless defined $h->{$name}{'to'}; $h->{$name}{'to'} = $to if $to > $h->{$name}{'to'}; } print Dumper($h);
    output :
    $VAR1 = { 'hi' => { 'to' => 106, 'from' => 65 }, 'bye' => { 'to' => 53, 'from' => 12 } };
    HTH PooLpi