extracting from text

Replies are listed 'Best First'.
Re: extracting from text by bart (Canon) on Dec 10, 2007 at 11:24 UTC
Your problem, as I see it, is twofold: How to parse the data and store it into a Perl data structure Your data consists of ranges... how to merge adjectant ranges into a single range Let's start with the first part... Your idea to generate variable names out of data is a very bad one (IMnsHO). See Why it's stupid to 'use a variable as a variable name' for the reasons. I'd rather use a single hash to store everything, something like: `%presence = ( 'hi' => [ { from => 65, to => 85 }, { from => 86, to => 106} ], 'bye' => [ { from => 12, to => 32 }, { from => 33, to => 53 } ], );` [download] Instead of the 2 item hashes with "from" and "to" values, you could choose to use a 2 item array instead, which allegedly is better for resource usage (memory): `%presence = ( 'hi' => [ [65, 85], [86, 106] ], 'bye' => [ [12, 32], [33, 53] ], );` [download] With constants `use constant FROM => 0; use constant TO => 1;` [download] access code to dig into the data structure could look quite similar. Now, how do you process the data and put it into the hash? Something like this: `my %presence; while(<INPUT>) { my($name, $from, $to) = /^(\w+):\s+(\d+)\s+.*?\s+(\d+)$/ or next; push @{$presence{$name}}, { from => $from, to => $to }; # or, with arrays: # push @{$presence{$name}}, [ $from, $to ]; }` [download] Yes, that really is all it takes to build the data structure I showed above from your data files. Part 2 of your problem is merging ranges that are adjectant, or possibly may overlap. For that, you'll have to loop through the collected values in the array for each name, and see if it touches any other range in the array, and if so, merge them. You could do that with nested loops, for each item, loop through all other all already selected ranges. However, I think this could prove bugprone, you may have to loop again after each merge to see if you can't merge them even more. Or you could use a module. Set::IntSpan::Fast looks like a good candidate, what's more: looking at its docs, it apparently internally uses the data structure I would have thought best for merging range sets. I don't know of an official name, but I'd call it a "toggle list". That would have been my 3rd suggestion. :) The internal representation used is extremely simple: a set is represented as a list of integers. Integers in even numbered positions (0, 2, 4 etc) represent the start of a run of numbers while those in odd numbered positions represent the ends of runs. As an example the set (1, 3-7, 9, 11, 12) would be represented internally as (1, 2, 3, 8, 11, 13). (Note: I was first introduced to this kind of representation by demerphq. Thanks for that. I understood he uses it to represent Unicode character classes in his updates to the perl5 regexp engine.) Once you get your final IntSpan object, you could store it as is, as the value for a name in the hash; or you could convert it back to the representation I showed you above.	[reply] [d/l] [select]
Re^2: extracting from text by oha (Friar) on Dec 10, 2007 at 13:44 UTC
just for fun: `my @data = ([1,30], [40, 50], [25, 37], [60, 70], [50, 60], [65, 99]); my @data = map { my ($a, $b) = split/:/; [$a, $b]; } split / /, join ':', map { my @d = split / /; scalar @d > 1 && $d[0]>=$d[1] ? ():"@ +d" } split /:/, join ' ', map {"$_->[0]:$_->[1]"} sort {$a->[0] <=> $b->[0]} @data;` [download] <Reveal this spoiler or all in this thread> Oha	[reply] [d/l]
Re: extracting from text by johngg (Canon) on Dec 10, 2007 at 11:48 UTC
If you extract your data into a HoHoA keyed by 'hi'/'bye' then 'to'/'from' with the array elements being the values, you can then use `min` and `max` from List::Util to get what you want. use strict; use warnings; use Data::Dumper; use List::Util qw{ min max }; open my $textFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n}; hi: 65 abcdefghijklmnopqrst 85 bye: 12 bcdefghijklmnopqrstu 32 hi: 86 sagfsdgsgwsehbbdgops 106 bye: 33 afasdfdfafasaafadfad 53 END_OF_FILE my %dataItems = (); while ( <$textFH> ) { next unless m{^(hi\|bye):\s+(\d+)\D+(\d+)}; push @{ $dataItems{ $1 }->{ from } }, $2; push @{ $dataItems{ $1 }->{ to } }, $3; } close $textFH or die qq{close: $!\n}; print Data::Dumper->Dumpxs( [ \ %dataItems], [ q{*dataItems} ] ); my $hiFrom = min( @{ $dataItems{ hi }->{ from } } ); my $hiTo = max( @{ $dataItems{ hi }->{ to } } ); my $byeFrom = min( @{ $dataItems{ bye }->{ from } } ); my $byeTo = max( @{ $dataItems{ bye }->{ to } } ); print qq{\$hiFrom = $hiFrom and \$hiTo = $hiTo\n}, qq{\$byeFrom = $byeFrom and \$byeTo = $byeTo\n}; [download] Here's the output (with Data::Dumper output included so you can see the data structure) `%dataItems = ( 'hi' => { 'to' => [ '85', '106' ], 'from' => [ '65', '86' ] }, 'bye' => { 'to' => [ '32', '53' ], 'from' => [ '12', '33' ] } ); $hiFrom = 65 and $hiTo = 106 $byeFrom = 12 and $byeTo = 53` [download] I hope this is useful. Cheers, JohnGG	[reply] [d/l] [select]
Re: extracting from text by misc (Friar) on Dec 10, 2007 at 11:28 UTC
Hi, I'd like to suggest a simple approach - It's more easy to change/understand if you read the code some time later. #!/usr/bin/perl -w use strict; my ($hi_from,$hi_to,$bye_from,$bye_to); while ( my $line = <DATA> ){ $line =~ /^(.?)\D(\d)\D(\d)/; next if ( !$1 ); # Empty line # print "1: $1 2: $2 3: $3\n"; if ( $1 eq 'h' ){ if ( !defined($hi_from) \|\| $hi_from >= $2 ){ $hi_from = $2; } if ( !defined($hi_to) \|\| $hi_to <= $3 ){ $hi_to = $3; } } if ( $1 eq 'b' ){ if ( !defined($bye_from) \|\| $bye_from >= $2 ){ $bye_from = $2; } if ( !defined($bye_to) \|\| $bye_to <= $3 ){ $bye_to = $3; } } } print "hi_from: $hi_from hi_to: $hi_to\n"; print "bye_from: $bye_from bye_to: $bye_to\n"; __DATA__ hi: 65 abcdefghijklmnopqrst 85 bye: 12 bcdefghijklmnopqrstu 32 hi: 26 sagfsdgsgwsehbbdgops 106 bye: 33 afasdfdfafasaafadfad 53 hi: 89 sagfsdgsgwsehbbdgops 188 bye: 37 afasdfdfafasaafadfad 89 [download] It would be possible to do some cool constructs with map, hashes, ... However, if you have to change something later, something like this is much more easy to understand and debug. (Although unelegant ). It's also a matter of the needs you have. Which size could the file have ? Is speed important ? Do you need some warnings if the file is misformed ? ... HTH, Michael	[reply] [d/l]
Re: extracting from text by mwah (Hermit) on Dec 10, 2007 at 10:44 UTC
I'd save an array of values under the hash key and sort afterwards (looks not very elegant but works): `... my @stuff= ( 'hi: 65 abcdefghijklmnopqrst 85', 'bye: 12 bcdefghijklmnopqrstu 32' +, 'hi: 86 sagfsdgsgwsehbbdgops 106', 'bye: 33 afasdfdfafasaafadfad 53 +'); my %cnt; /(\w+)\D+(\d+)\s+\w+\s+(\d+)/ && push @{$cnt{$1}},($2,$3) for @stuff; @$_ = sort { $a <=> $b } @$_ for values %cnt; print map "\$${_}_from=$cnt{$_}[0] and \$${_}_to=$cnt{$_}[-1]\n", keys %cnt; ...` [download] displays here: `$hi_from=65 and $hi_to=106 $bye_from=12 and $bye_to=53` [download] Addendum: after reading the other posts I'd think one could also beam the values into predefined my-Variables by an eval. Im not sure if (w/error handling added) this would be that bad: use strict; # go for it use warnings; my @stuff= ( 'hi: 65 abcdefghijklmnopqrst 85', 'bye: 12 bcdefghijklmnopqrstu 32' +, 'hi: 86 sagfsdgsgwsehbbdgops 106', 'bye: 33 afasdfdfafasaafadfad 53 +'); my %cnt; /(\w+)\D+(\d+)\s+\w+\s+(\d+)/ && push @{$cnt{$1}},($2,$3) for @stuff; @$_ = sort { $a <=> $b } @$_ for values %cnt; my ($hi_from, $hi_to, $bye_from, $bye_to); # if we know what to exp +ect while( my ($k,$v) = each %cnt ) { my @evil_plan = ( "\$${k}_from=$$v[0]", "\$${k}_to=$$v[-1]" ); print join(' and ', @evil_plan), "\n"; # show what we prepared map eval, @evil_plan; # execute the unthinkable } print "\n$hi_from, $hi_to \n$bye_from, $bye_to\n"; # control [download] Regards mwa	[reply] [d/l] [select]
Re: extracting from text by narainhere (Monk) on Dec 10, 2007 at 11:21 UTC
You Have it here use strict; use warnings; my $contents; my $hi_from; my $hi_to; my $bye_from; my $bye_to; open (FH,"filename") or die ("Can't open"); while(<FH>) { chomp($_); $contents.=$_; } ($hi_from=$contents)=~s/hi:\s+(\d)/$1/; $hi_from=$1; ($hi_to=$contents)=~s/(hi:\s+)(?!.hi:)(.\d+)(?=.bye:)/$2/; ($hi_to=$2)=~s/\d+//; $hi_to=~s/.\s(\d+)/$1/; ($bye_from=$contents)=~s/bye:\s+(\d)/$1/; $bye_from=$1; ($bye_to=$contents)=~s/(bye:\s+)(?!.bye:)(.\d+)/$2/; ($bye_to=$2)=~s/\d+//; $bye_to=~s/.\s(\d+)/$1/; print "\n\n\nHi From: $hi_from\n"; print "Hi to: $hi_to\n"; print "Bye from: $bye_from\n"; print "bye to: $bye_to\n"; [download] First I am reading from the file and creating a flat-string of the file contents.Then I go about doing a lot of regexe's to ensure we get what we want!! Sorry if you find the regexe's difficult.It will help you learn a lot of things about regexe's, if you can decipher it!! The world is so big for any individual to conquer*	[reply] [d/l]
Re: extracting from text by poolpi (Hermit) on Dec 10, 2007 at 11:46 UTC
`#!/usr/bin/perl use strict; use warnings; use Slurp; use Data::Dumper; my @lines = slurp('file.txt') or die; my $h = {}; my ($name, $from, $to ); for ( @lines) { next if /^$/; s/\s+//g; ( $name, $from, $to ) = /(\w+) : (\d{2,}) \w+? (\d{2,})/xms; $h->{$name}{'from'} = $from unless defined $h->{$name}{'from'}; $h->{$name}{'from'} = $from if $from < $h->{$name}{'from'}; $h->{$name}{'to'} = $to unless defined $h->{$name}{'to'}; $h->{$name}{'to'} = $to if $to > $h->{$name}{'to'}; } print Dumper($h);` [download] output : `$VAR1 = { 'hi' => { 'to' => 106, 'from' => 65 }, 'bye' => { 'to' => 53, 'from' => 12 } };` [download] HTH PooLpi	[reply] [d/l] [select]

How to parse the data and store it into a Perl data structure
Your data consists of ranges... how to merge adjectant ranges into a single range

Let's start with the first part... Your idea to generate variable names out of data is a very bad one (IMnsHO). See Why it's stupid to 'use a variable as a variable name' for the reasons.

I'd rather use a single hash to store everything, something like:

%presence = ( 'hi'  => [
                        { from => 65, to => 85 },
                        { from => 86, to => 106}
                       ],
              'bye' => [
                        { from => 12, to => 32 },
                        { from => 33, to => 53 }
                       ],
             );
[download]

%presence = ( 'hi'  => [
                        [65, 85],
                        [86, 106]                             
                       ],              
              'bye' => [
                        [12, 32],
                        [33, 53]                             
                       ],
             );
[download]

use constant FROM => 0;
use constant TO => 1;
[download]

Now, how do you process the data and put it into the hash? Something like this:

my %presence;
while(<INPUT>) {
    my($name, $from, $to) = /^(\w+):\s+(\d+)\s+.*?\s+(\d+)$/ or next;
    push @{$presence{$name}}, { from => $from, to => $to };
    # or, with arrays:
    # push @{$presence{$name}}, [ $from, $to ];
}
[download]

Part 2 of your problem is merging ranges that are adjectant, or possibly may overlap. For that, you'll have to loop through the collected values in the array for each name, and see if it touches any other range in the array, and if so, merge them.

You could do that with nested loops, for each item, loop through all other all already selected ranges. However, I think this could prove bugprone, you may have to loop again after each merge to see if you can't merge them even more.

Or you could use a module. Set::IntSpan::Fast looks like a good candidate, what's more: looking at its docs, it apparently internally uses the data structure I would have thought best for merging range sets. I don't know of an official name, but I'd call it a "toggle list". That would have been my 3rd suggestion. :)

The internal representation used is extremely simple: a set is represented as a list of integers. Integers in even numbered positions (0, 2, 4 etc) represent the start of a run of numbers while those in odd numbered positions represent the ends of runs. As an example the set (1, 3-7, 9, 11, 12) would be represented internally as (1, 2, 3, 8, 11, 13).

(Note: I was first introduced to this kind of representation by demerphq. Thanks for that. I understood he uses it to represent Unicode character classes in his updates to the perl5 regexp engine.)

Once you get your final IntSpan object, you could store it as is, as the value for a name in the hash; or you could convert it back to the representation I showed you above.

[reply]
[d/l]
[select]

my @data = ([1,30], [40, 50], [25, 37], [60, 70], [50, 60], [65, 99]);

my @data = map { my ($a, $b) = split/:/; [$a, $b]; }
        split / /,
        join ':',
        map { my @d = split / /; scalar @d > 1 && $d[0]>=$d[1] ? ():"@
+d" }
        split /:/,
        join ' ',
        map {"$_->[0]:$_->[1]"}
        sort {$a->[0] <=> $b->[0]}
        @data;
[download]

[reply]
[d/l]

min

max

List::Util

use strict;
use warnings;

use Data::Dumper;
use List::Util qw{ min max };

open my $textFH, q{<}, \ <<'END_OF_FILE' or die qq{open: $!\n};
hi:  65 abcdefghijklmnopqrst 85
bye: 12 bcdefghijklmnopqrstu 32

hi:  86 sagfsdgsgwsehbbdgops 106
bye: 33 afasdfdfafasaafadfad 53

END_OF_FILE

my %dataItems = ();

while ( <$textFH> )
{
    next unless m{^(hi|bye):\s+(\d+)\D+(\d+)};
    push @{ $dataItems{ $1 }->{ from } }, $2;
    push @{ $dataItems{ $1 }->{ to } }, $3;
}

close $textFH or die qq{close: $!\n};

print Data::Dumper->Dumpxs( [ \ %dataItems], [ q{*dataItems} ] );

my $hiFrom  = min( @{ $dataItems{ hi }->{ from } } );
my $hiTo    = max( @{ $dataItems{ hi }->{ to } } );
my $byeFrom = min( @{ $dataItems{ bye }->{ from } } );
my $byeTo   = max( @{ $dataItems{ bye }->{ to } } );

print
   qq{\$hiFrom = $hiFrom and \$hiTo = $hiTo\n},
   qq{\$byeFrom = $byeFrom and \$byeTo = $byeTo\n};
[download]

Here's the output (with Data::Dumper output included so you can see the data structure)

%dataItems = (
               'hi' => {
                         'to' => [
                                   '85',
                                   '106'
                                 ],
                         'from' => [
                                     '65',
                                     '86'
                                   ]
                       },
               'bye' => {
                          'to' => [
                                    '32',
                                    '53'
                                  ],
                          'from' => [
                                      '12',
                                      '33'
                                    ]
                        }
             );
$hiFrom = 65 and $hiTo = 106
$byeFrom = 12 and $byeTo = 53
[download]

I hope this is useful.

Cheers,

JohnGG

[reply]
[d/l]
[select]

#!/usr/bin/perl -w
use strict;

my ($hi_from,$hi_to,$bye_from,$bye_to);


while ( my $line = <DATA> ){
        $line =~ /^(.?)\D*(\d*)\D*(\d*)/;
        next if ( !$1 ); # Empty line

#        print "1: $1  2: $2   3: $3\n";


        if ( $1 eq 'h' ){
                if ( !defined($hi_from) || $hi_from >= $2 ){
                        $hi_from = $2;
                }
                if ( !defined($hi_to) || $hi_to <= $3 ){
                        $hi_to = $3;
                }
        }

            if ( $1 eq 'b' ){
                if ( !defined($bye_from) || $bye_from >= $2 ){
                        $bye_from = $2;
                }
                if ( !defined($bye_to) || $bye_to <= $3 ){
                        $bye_to = $3;
                }
        }
        
}

print "hi_from: $hi_from     hi_to:  $hi_to\n";
print "bye_from: $bye_from     bye_to:  $bye_to\n";


__DATA__
hi:  65 abcdefghijklmnopqrst 85
bye: 12 bcdefghijklmnopqrstu 32

hi:  26 sagfsdgsgwsehbbdgops 106
bye: 33 afasdfdfafasaafadfad 53

hi:  89 sagfsdgsgwsehbbdgops 188
bye: 37 afasdfdfafasaafadfad 89
[download]

Which size could the file have ?
Is speed important ?
Do you need some warnings if the file is misformed ?
...

[reply]
[d/l]

I'd save an array of values under the hash key and sort afterwards (looks not very elegant but works):

...
my @stuff= (
  'hi:  65 abcdefghijklmnopqrst 85', 'bye: 12 bcdefghijklmnopqrstu 32'
+,
  'hi:  86 sagfsdgsgwsehbbdgops 106', 'bye: 33 afasdfdfafasaafadfad 53
+');

my %cnt;
/(\w+)\D+(\d+)\s+\w+\s+(\d+)/ && push @{$cnt{$1}},($2,$3) for @stuff;

@$_ = sort { $a <=> $b } @$_ for values %cnt;

print
   map "\$${_}_from=$cnt{$_}[0] and \$${_}_to=$cnt{$_}[-1]\n",
      keys %cnt;
...
[download]

displays here:

$hi_from=65 and $hi_to=106
$bye_from=12 and $bye_to=53
[download]

Addendum: after reading the other posts I'd think one could also beam the values into predefined my-Variables by an eval. Im not sure if (w/error handling added) this would be that bad:

use strict;          # go for it
use warnings;
    
my @stuff= (
  'hi:  65 abcdefghijklmnopqrst 85', 'bye: 12 bcdefghijklmnopqrstu 32'
+,
  'hi:  86 sagfsdgsgwsehbbdgops 106', 'bye: 33 afasdfdfafasaafadfad 53
+');

my %cnt;
/(\w+)\D+(\d+)\s+\w+\s+(\d+)/ && push @{$cnt{$1}},($2,$3) for @stuff;

@$_ = sort { $a <=> $b } @$_ for values %cnt;

my ($hi_from, $hi_to, $bye_from, $bye_to);  # if we *know* what to exp
+ect

while( my ($k,$v) = each %cnt ) {
   my @evil_plan = ( "\$${k}_from=$$v[0]", "\$${k}_to=$$v[-1]" );
   print join(' and ', @evil_plan), "\n";   # show what we prepared
   map eval, @evil_plan;                    # execute the unthinkable
}

print "\n$hi_from, $hi_to \n$bye_from, $bye_to\n"; # control
[download]

Regards

mwa

[reply]
[d/l]
[select]

use strict;
use warnings;

my $contents;
my $hi_from;
my $hi_to;
my $bye_from;
my $bye_to;

open (FH,"filename") or die ("Can't open");
while(<FH>)
{
    chomp($_);
    $contents.=$_;
}
($hi_from=$contents)=~s/hi:\s+(\d*)/$1/;
$hi_from=$1;

($hi_to=$contents)=~s/(hi:\s+)(?!.*hi:)(.*\d+)(?=.*bye:)/$2/;
($hi_to=$2)=~s/\d+//;
$hi_to=~s/.*\s(\d+)/$1/;

($bye_from=$contents)=~s/bye:\s+(\d*)/$1/;
$bye_from=$1;
    
    
($bye_to=$contents)=~s/(bye:\s+)(?!.*bye:)(.*\d+)/$2/;
($bye_to=$2)=~s/\d+//;
$bye_to=~s/.*\s(\d+)/$1/;

print "\n\n\nHi From: $hi_from\n";
print "Hi to: $hi_to\n";
print "Bye from: $bye_from\n";
print "bye to: $bye_to\n";
[download]

The world is so big for any individual to conquer

[reply]
[d/l]

#!/usr/bin/perl 
use strict;
use warnings;
use Slurp;
use Data::Dumper;

my @lines = slurp('file.txt') or die; 
my $h = {}; 
my ($name, $from, $to );

for ( @lines) {
    next if /^$/;
    s/\s+//g;
    ( $name, $from, $to ) =  /(\w+) : (\d{2,}) \w+? (\d{2,})/xms;

    $h->{$name}{'from'} = $from unless defined $h->{$name}{'from'};
    $h->{$name}{'from'} = $from if $from <  $h->{$name}{'from'};
    
    $h->{$name}{'to'} = $to unless defined $h->{$name}{'to'};
    $h->{$name}{'to'} = $to if $to >  $h->{$name}{'to'};


}

print  Dumper($h);
[download]

$VAR1 = {
          'hi' => {
                    'to' => 106,
                    'from' => 65
                  },
          'bye' => {
                     'to' => 53,
                     'from' => 12
                   }
        };
[download]

[reply]
[d/l]
[select]