The data doesn't look too much of a jumble. It appears to be records delimited by double pipe symbols surrounded by whitespace, each record consisting of fields delimited by single pipes surrounded by whitespace, each field being a key/value pair delimited by a pipe with no surrounding whitespace. This code parses the data into a HoH(oA) structure, the (oA) for the repeated 'no rank' key. I use Data::Dumper to show the structure.

use strict; use warnings; use Data::Dumper; my $rawData = <<END_OF_DATA; species|Caragana arborescens | genus|Caragana | subfamily|Papilionoideae | family|Fabaceae | order|Fabales | no rank|eurosids I | subclass|Rosidae | no rank|core eudicots | no rank|eudicotyledons | no rank|Magnoliophyta | no rank|Spermatophyta | no rank|Euphyllophyta | no rank|Tracheophyta | phylum|Embryophyta | no rank|Charophyta/Embryophyta group | no rank|Streptophyta | kingdom|Viridiplantae | superkingdom|Eukaryota | no rank|cellular organisms | no rank|root || species|syncytium endosymbiont of Diaphorina citri | no rank|unclassified beta proteobacteria (miscellaneous) | no rank|unclassified beta proteobacteria | class|beta subdivision | phylum|Proteobacteria | superkingdom|Bacteria | no rank|cellular organisms | no rank|root || subspecies|Trypanosoma brucei brucei | species|Trypanosoma brucei | subgenus|Trypanozoon | genus|Trypanosoma | family|Trypanosomatidae | order|Kinetoplastida | no rank|Euglenozoa | superkingdom|Eukaryota | no rank|cellular organisms | no rank|root || species|unculturable Mariana archaeon no. 1 | no rank|environmental samples | no rank|unclassified Crenarchaeota | kingdom|Crenarchaeota | superkingdom|Archaea | no rank|cellular organisms | no rank|root || species|Suillus aeruginascens | genus|Suillus | family|Boletaceae | order|Boletales | subclass|Hymenomycetidae | class|Hymenomycetes | phylum|Basid}; END_OF_DATA $rawData =~ s{\n}{}g; my @rawRecords = split m{\s*\|\|\s*}, $rawData; print scalar @rawRecords, qq{ records found\n}; my %parsedRecords; foreach my $rawRecord ( @rawRecords ) { my ( $species ) = $rawRecord =~ m{(?<!\w)species\|(.+?)(?=\s+\|)}; print qq{Species: $species\n}; my @dataPairs = split m{\s+\|\s+}, $rawRecord; foreach my $dataPair ( @dataPairs ) { my ( $key, $value ) = split m{\|}, $dataPair; unless ( exists $parsedRecords{ $species }->{ $key } ) { $parsedRecords{ $species }->{ $key } = $value; } elsif ( ref $parsedRecords{ $species }->{ $key } eq q{ARRAY} ) { push @{ $parsedRecords{ $species }->{ $key } }, $value; } else { $parsedRecords{ $species }->{ $key } = [ $parsedRecords{ $species }->{ $key }, $value ]; } } } my $dd = Data::Dumper->new( [ \ %parsedRecords ], [ q{*parsedRecords} ] ); $dd->Indent( 1 ); print $dd->Dumpxs;

Here's the output.

5 records found Species: Caragana arborescens Species: syncytium endosymbiont of Diaphorina citri Species: Trypanosoma brucei Species: unculturable Mariana archaeon no. 1 Species: Suillus aeruginascens %parsedRecords = ( 'Trypanosoma brucei' => { 'genus' => 'Trypanosoma', 'species' => 'Trypanosoma brucei', 'superkingdom' => 'Eukaryota', 'subgenus' => 'Trypanozoon', 'order' => 'Kinetoplastida', 'subspecies' => 'Trypanosoma brucei brucei', 'family' => 'Trypanosomatidae', 'no rank' => [ 'Euglenozoa', 'cellular organisms', 'root' ] }, 'Caragana arborescens' => { 'kingdom' => 'Viridiplantae', 'genus' => 'Caragana', 'species' => 'Caragana arborescens', 'superkingdom' => 'Eukaryota', 'subfamily' => 'Papilionoideae', 'order' => 'Fabales', 'subclass' => 'Rosidae', 'phylum' => 'Embryophyta', 'family' => 'Fabaceae', 'no rank' => [ 'eurosids I', 'core eudicots', 'eudicotyledons', 'Magnoliophyta', 'Spermatophyta', 'Euphyllophyta', 'Tracheophyta', 'Charophyta/Embryophyta group', 'Streptophyta', 'cellular organisms', 'root' ] }, 'unculturable Mariana archaeon no. 1' => { 'kingdom' => 'Crenarchaeota', 'species' => 'unculturable Mariana archaeon no. 1', 'superkingdom' => 'Archaea', 'no rank' => [ 'environmental samples', 'unclassified Crenarchaeota', 'cellular organisms', 'root' ] }, 'syncytium endosymbiont of Diaphorina citri' => { 'phylum' => 'Proteobacteria', 'class' => 'beta subdivision', 'species' => 'syncytium endosymbiont of Diaphorina citri', 'superkingdom' => 'Bacteria', 'no rank' => [ 'unclassified beta proteobacteria (miscellaneous)', 'unclassified beta proteobacteria', 'cellular organisms', 'root' ] }, 'Suillus aeruginascens' => { 'subclass' => 'Hymenomycetidae', 'order' => 'Boletales', 'genus' => 'Suillus', 'phylum' => 'Basid};', 'class' => 'Hymenomycetes', 'species' => 'Suillus aeruginascens', 'family' => 'Boletaceae' } );

I hope this is of use.

Cheers,

JohnGG


In reply to Re: capturing separately by johngg
in thread capturing separately by ada

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.