ada has asked for the wisdom of the Perl Monks concerning the following question:

Hi monks;

I want to capture output like this: Viridiplantae Crenarchaeota Fungi Crenarchaeota Fungi Fungi Fungi Metazoa Metazoa Euryarchaeota Fungi, using a regex:

    $line =~ /\bkingdom\b\|(.*?)\|/

The problem is that I want to capture the output not as a single scalar but as individual values, so they can be put in an @array and stored in a hash as individual keys.

Help much appreciated x.

OK, sorry, I should have posted my code and data.

My code is:

    my $key;
    while ($line = <>) {
        if ($line =~ /\bkingdom\b\|(.*?)\|/g) {
            $key .= $1;
        }
    }

and (part of) my data is (a bit of a jumble):
    species|Caragana arborescens | genus|Caragana | subfamily|Papilionoideae | family|Fabaceae | order|Fabales | no rank|eurosids I | subclass|Rosidae | no rank|core eudicots | no rank|eudicotyledons | no rank|Magnoliophyta | no rank|Spermatophyta | no rank|Euphyllophyta | no rank|Tracheophyta | phylum|Embryophyta | no rank|Charophyta/Embryophyta group | no rank|Streptophyta | kingdom|Viridiplantae | superkingdom|Eukaryota | no rank|cellular organisms | no rank|root ||
    species|syncytium endosymbiont of Diaphorina citri | no rank|unclassified beta proteobacteria (miscellaneous) | no rank|unclassified beta proteobacteria | class|beta subdivision | phylum|Proteobacteria | superkingdom|Bacteria | no rank|cellular organisms | no rank|root ||
    subspecies|Trypanosoma brucei brucei | species|Trypanosoma brucei | subgenus|Trypanozoon | genus|Trypanosoma | family|Trypanosomatidae | order|Kinetoplastida | no rank|Euglenozoa | superkingdom|Eukaryota | no rank|cellular organisms | no rank|root ||
    species|unculturable Mariana archaeon no. 1 | no rank|environmental samples | no rank|unclassified Crenarchaeota | kingdom|Crenarchaeota | superkingdom|Archaea | no rank|cellular organisms | no rank|root ||
    species|Suillus aeruginascens | genus|Suillus | family|Boletaceae | order|Boletales | subclass|Hymenomycetidae | class|Hymenomycetes | phylum|Basid

BUT I think I have managed it now using this, and I'm pretty sure it answers my question:

    @array = split(/\s+/, $key);
    print $array[0], $array[1], $array[2];    # etc.

So now I can access each element in the array and store each one in a hash as an individual key.

Edit: g0n - code tags

Replies are listed 'Best First'.
Re: capturing separately
by mwah (Hermit) on Dec 10, 2007 at 15:57 UTC

    As far as I could deduce from your description, the following should do:

        # values to be used in an array ...
        my @array = map /\bkingdom\|([^|]+)\|/g, <>;

        # and stored in a hash as individual keys
        my %hash;
        $hash{$_}++ for @array;

        # show:
        print map "kingdom $_ was $hash{$_} x <br />\n", keys %hash;

    Regards

    mwa

      OK, thanks for this, I will try it out xx
Re: capturing separately
by toolic (Bishop) on Dec 10, 2007 at 15:32 UTC
    ada,

    Please reformat your original post using code tags. Refer to Writeup Formatting Tips. If you have questions regarding posting, please ask.

    As you gather data from your regex capture, you can use push to build up an array. What do you want your hash to look like, i.e., what are the keys and what are the values? Perhaps if you provided more sample input data, output data and code, that would help.
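
A minimal sketch of the push approach suggested here, assuming each input line holds zero or more `kingdom|NAME` fields laid out as in the sample data. The `@lines` array is a hypothetical, shortened stand-in for reading from `<>`, and the names `@kingdoms` and `%count` are illustrative only:

```perl
use strict;
use warnings;

# Hypothetical sample lines, shortened from the data in the question.
my @lines = (
    'no rank|Streptophyta | kingdom|Viridiplantae | superkingdom|Eukaryota | no rank|root ||',
    'phylum|Proteobacteria | superkingdom|Bacteria | no rank|root ||',
    'no rank|unclassified Crenarchaeota | kingdom|Crenarchaeota | superkingdom|Archaea ||',
    'order|Boletales | kingdom|Fungi | no rank|root || family|Boletaceae | kingdom|Fungi ||',
);

my @kingdoms;
for my $line (@lines) {
    # In a while condition, /g steps through every match on the line.
    # \b stops "superkingdom" from matching; the \s* outside the
    # capture keeps surrounding whitespace out of $1.
    while ( $line =~ /\bkingdom\|\s*([^|]+?)\s*\|/g ) {
        push @kingdoms, $1;
    }
}

# Use each kingdom as a hash key, counting how often it occurs.
my %count;
$count{$_}++ for @kingdoms;

print "$_: $count{$_}\n" for sort keys %count;
```

Pushing each capture onto an array avoids gluing the values into one string and splitting it again, which would break as soon as a captured value contained a space.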

Re: capturing separately
by dwm042 (Priest) on Dec 10, 2007 at 15:37 UTC
    Ada,

    In order to propose a solution that yields your proposed output, we'll need to know what your input is. At this point, it's hard to know whether a regex or a split command is a better choice.

Re: capturing separately
by johngg (Canon) on Dec 11, 2007 at 12:31 UTC
    The data doesn't look too much of a jumble. It appears to be records delimited by double pipe symbols surrounded by whitespace, each record consisting of fields delimited by single pipes surrounded by whitespace, each field being a key/value pair delimited by a pipe with no surrounding whitespace. This code parses the data into a HoH(oA) structure, the (oA) for the repeated 'no rank' key. I use Data::Dumper to show the structure.

    use strict;
    use warnings;

    use Data::Dumper;

    my $rawData = <<END_OF_DATA;
    species|Caragana arborescens | genus|Caragana | subfamily|Papilionoideae | family|Fabaceae | order|Fabales | no rank|eurosids I | subclass|Rosidae | no rank|core eudicots | no rank|eudicotyledons | no rank|Magnoliophyta | no rank|Spermatophyta | no rank|Euphyllophyta | no rank|Tracheophyta | phylum|Embryophyta | no rank|Charophyta/Embryophyta group | no rank|Streptophyta | kingdom|Viridiplantae | superkingdom|Eukaryota | no rank|cellular organisms | no rank|root ||
    species|syncytium endosymbiont of Diaphorina citri | no rank|unclassified beta proteobacteria (miscellaneous) | no rank|unclassified beta proteobacteria | class|beta subdivision | phylum|Proteobacteria | superkingdom|Bacteria | no rank|cellular organisms | no rank|root ||
    subspecies|Trypanosoma brucei brucei | species|Trypanosoma brucei | subgenus|Trypanozoon | genus|Trypanosoma | family|Trypanosomatidae | order|Kinetoplastida | no rank|Euglenozoa | superkingdom|Eukaryota | no rank|cellular organisms | no rank|root ||
    species|unculturable Mariana archaeon no. 1 | no rank|environmental samples | no rank|unclassified Crenarchaeota | kingdom|Crenarchaeota | superkingdom|Archaea | no rank|cellular organisms | no rank|root ||
    species|Suillus aeruginascens | genus|Suillus | family|Boletaceae | order|Boletales | subclass|Hymenomycetidae | class|Hymenomycetes | phylum|Basid};
    END_OF_DATA

    $rawData =~ s{\n}{}g;
    my @rawRecords = split m{\s*\|\|\s*}, $rawData;
    print scalar @rawRecords, qq{ records found\n};

    my %parsedRecords;

    foreach my $rawRecord ( @rawRecords )
    {
        my ( $species ) =
            $rawRecord =~ m{(?<!\w)species\|(.+?)(?=\s+\|)};
        print qq{Species: $species\n};

        my @dataPairs = split m{\s+\|\s+}, $rawRecord;
        foreach my $dataPair ( @dataPairs )
        {
            my ( $key, $value ) = split m{\|}, $dataPair;
            unless ( exists $parsedRecords{ $species }->{ $key } )
            {
                $parsedRecords{ $species }->{ $key } = $value;
            }
            elsif ( ref $parsedRecords{ $species }->{ $key } eq q{ARRAY} )
            {
                push @{ $parsedRecords{ $species }->{ $key } }, $value;
            }
            else
            {
                $parsedRecords{ $species }->{ $key } =
                    [ $parsedRecords{ $species }->{ $key }, $value ];
            }
        }
    }

    my $dd = Data::Dumper->new( [ \ %parsedRecords ], [ q{*parsedRecords} ] );
    $dd->Indent( 1 );
    print $dd->Dumpxs;

    Here's the output.

    5 records found
    Species: Caragana arborescens
    Species: syncytium endosymbiont of Diaphorina citri
    Species: Trypanosoma brucei
    Species: unculturable Mariana archaeon no. 1
    Species: Suillus aeruginascens
    %parsedRecords = (
      'Trypanosoma brucei' => {
        'genus' => 'Trypanosoma',
        'species' => 'Trypanosoma brucei',
        'superkingdom' => 'Eukaryota',
        'subgenus' => 'Trypanozoon',
        'order' => 'Kinetoplastida',
        'subspecies' => 'Trypanosoma brucei brucei',
        'family' => 'Trypanosomatidae',
        'no rank' => [
          'Euglenozoa',
          'cellular organisms',
          'root'
        ]
      },
      'Caragana arborescens' => {
        'kingdom' => 'Viridiplantae',
        'genus' => 'Caragana',
        'species' => 'Caragana arborescens',
        'superkingdom' => 'Eukaryota',
        'subfamily' => 'Papilionoideae',
        'order' => 'Fabales',
        'subclass' => 'Rosidae',
        'phylum' => 'Embryophyta',
        'family' => 'Fabaceae',
        'no rank' => [
          'eurosids I',
          'core eudicots',
          'eudicotyledons',
          'Magnoliophyta',
          'Spermatophyta',
          'Euphyllophyta',
          'Tracheophyta',
          'Charophyta/Embryophyta group',
          'Streptophyta',
          'cellular organisms',
          'root'
        ]
      },
      'unculturable Mariana archaeon no. 1' => {
        'kingdom' => 'Crenarchaeota',
        'species' => 'unculturable Mariana archaeon no. 1',
        'superkingdom' => 'Archaea',
        'no rank' => [
          'environmental samples',
          'unclassified Crenarchaeota',
          'cellular organisms',
          'root'
        ]
      },
      'syncytium endosymbiont of Diaphorina citri' => {
        'phylum' => 'Proteobacteria',
        'class' => 'beta subdivision',
        'species' => 'syncytium endosymbiont of Diaphorina citri',
        'superkingdom' => 'Bacteria',
        'no rank' => [
          'unclassified beta proteobacteria (miscellaneous)',
          'unclassified beta proteobacteria',
          'cellular organisms',
          'root'
        ]
      },
      'Suillus aeruginascens' => {
        'subclass' => 'Hymenomycetidae',
        'order' => 'Boletales',
        'genus' => 'Suillus',
        'phylum' => 'Basid};',
        'class' => 'Hymenomycetes',
        'species' => 'Suillus aeruginascens',
        'family' => 'Boletaceae'
      }
    );
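
As a possible follow-up, here is a sketch of reading values back out of a structure shaped like the dump above, which is closer to what the original question asked for. The hash literal is an abridged hand copy of that output, and `%speciesByKingdom` and `values_for` are hypothetical names introduced only for this illustration:

```perl
use strict;
use warnings;

# Abridged hand copy of the %parsedRecords dump (not the full data).
my %parsedRecords = (
    'Caragana arborescens' => {
        'kingdom' => 'Viridiplantae',
        'genus'   => 'Caragana',
        'no rank' => [ 'eurosids I', 'core eudicots', 'root' ],
    },
    'Trypanosoma brucei' => {
        'genus'   => 'Trypanosoma',
        'no rank' => [ 'Euglenozoa', 'cellular organisms', 'root' ],
    },
    'unculturable Mariana archaeon no. 1' => {
        'kingdom' => 'Crenarchaeota',
        'no rank' => [ 'environmental samples', 'root' ],
    },
);

# A repeated key holds an array ref and a single one a plain string,
# so normalise both cases (and a missing key) to a list.
sub values_for {
    my ( $record, $key ) = @_;
    my $value = $record->{$key};
    return () unless defined $value;
    return @$value if ref $value eq 'ARRAY';
    return ($value);
}

# Group species names under their kingdom, skipping records without one.
my %speciesByKingdom;
foreach my $species ( sort keys %parsedRecords ) {
    foreach my $kingdom ( values_for( $parsedRecords{$species}, 'kingdom' ) ) {
        push @{ $speciesByKingdom{$kingdom} }, $species;
    }
}

foreach my $kingdom ( sort keys %speciesByKingdom ) {
    print "$kingdom: @{ $speciesByKingdom{$kingdom} }\n";
}
```

Going through a helper like `values_for` means the calling code does not need to know whether a given key was repeated in the record.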

    I hope this is of use.

    Cheers,

    JohnGG