I'm trying to parse an html output file from a program called mirdeep. I have gotten the script to the point where the data has been extracted into an array of arrays and I am trying to extract the specific information that I want from the appropriate columns. For some odd reason, I keep getting output that shouldn't be possible.

Bascially it should be pulling the counts from the mirbase miRNAs that were detected by mirdeep AND and the undetected ones, but leave the stat table and novel section alone.

However, the counts currently being pulled for the undetected section are strangely getting stored incorrectly. (Because of the way they were output, Some of the so called "star" and "mature" data were put into separate lines instead of being in the same line, so I am trying to recombine them.) I'm fairly certain that an error in the recombination is the source of the problem. (my main focus is on the use_data subroutine.)

I was checking my progress with print statements and noticed that some of the data that should only have either a mature count or a star count have both, and that the second number seems to be always equal to the mature count of the row below in the html file, which doesnt make sense to me. Can somebody help me track down whats gone wrong? This is my first time using packages for object orientation extensively, so there could be any kind of error in here.

If you would prefer to look at the code here, you can look below. I uploaded the script and an example data file at http://kotoro.com/stuff.tgz

--- Update1 @ 5:10 PM 12/22/2010:

I adjusted the code to use HTML::TableReader which eliminated the part of my code that used switching and made it a little more tolerable to look at.

Whats weird is I'm still getting "Duplicate detected messages thrown up when on items with no duplicates, and weirdly, it prints the name from the item identified one iteration in the past instead of the one being identified during that iteration.
example: (part of output)

hsa-miR-382 shortens to hsa-miR-382
hsa-miR-99b* shortens to hsa-miR-99b
Duplicate Detected! hsa-miR-382
hsa-miR-196b shortens to hsa-miR-196b

The program takes a single tab separated batch file as input. The first column being the label that will ultimately be used to describe the source of the counts during output, and the second column being the path to the file.

I changed the tgz so it only includes one layer of zipping and has a simpler name for the test file and an included batch set up for the one test file.

#!/usr/bin/perl use strict; use warnings; use diagnostics; use HTML::TableExtract; use FileHandle; my $usage; $usage = "Usage:\n"; $usage .= "get_mirbase_fromhtml.pl batchfile\n"; $usage .= "batch file line format: sample_label (tab) path_to_file\n"; our $batchfile = shift or die $usage; our %paths; #store tag=>file here our %data; #tag=>desired data for output; our %mirnas ; #master list of mirnas, each file will have a list with counts in +it, this one is just name=>1 to keep record #read batch file readbatch(); #execute parse on html files do_parsing(); #do any transformations/replacements (blanks with zeroes, etc.) #output data exit 0; sub readbatch { my $f = FileHandle->new( '<' . $batchfile ); $f or die "File $f did not open!\n"; my @lines = <$f>; chomp(@lines); my @splt; foreach (@lines) { @splt = split( /\t/, $_ ); if ( $#splt + 1 != 2 ) { die "Bad Batch File Format!\n"; } $paths{ $splt[0] } = $splt[1]; } undef $f; } sub do_parsing { foreach ( sort( keys(%paths) ) ) { print "Label: $_ File: $paths{$_}\n"; $data{$_} = getMirb( $_,$paths{$_} ); } } sub printTable { my $table = shift or die "No table given to printTable!\n"; foreach (@$table) { foreach (@$_) { print $_, "\t"; } print "\n"; } } sub getMirb { my $label = shift or die "Must give label as first arg to getMirb\ +n"; my $file = shift or die "Must give filename as second arg to getM +irb\n"; my ( %Predmirb, %UndetMirb, %Combined ); #hashes of data items my ( @Predmirb_t, @UndetMirb_t ); #tables @Predmirb_t = getPredMirb($file); # printTable( \@Predmirb_t ); convertPredmirb( \@Predmirb_t, \%Predmirb ); @UndetMirb_t = getUndetMirb($file); # printTable( \@UndetMirb_t ); convertUndetmirb( \@UndetMirb_t, \%UndetMirb ); combinePred_Undet( \%Predmirb, \%UndetMirb, \%Combined ); return %Combined; } sub getshortname { my $name = shift; my ( $last1, $last3 ); my $short; $last1 = substr( $$_[1], -1 ); $last3 = substr( $$_[1], -3 ); if ( $last1 eq '*' ) { $short = substr( $$_[1], 0, -1 ); #print $$_[1],"\t",$short,"\n"; } elsif ( $last3 eq '-5p' || $last3 eq '-3p' ) { $short = substr( $$_[1], 0, -3 ); #print $$_[1],"\t",$short,"\n"; } else { $short = $$_[1]; } # print $short, "\n"; return $short; } sub convertPredmirb { my ( $arr, $hash ) = @_; my $data; foreach (@$arr) { $data = miRNA_data->new(); $data->{ID} = $$_[0]; $data->{mirbase_name} = $$_[1]; $data->{total} = $$_[2]; $data->{mature} = $$_[3]; $data->{loop} = $$_[4]; $data->{star} = $$_[5]; $data->{short_name} = getshortname( $data->{mirbase_name} ); #store in format structure $hash->{ $data->{mirbase_name} . '|' . $data->{ID} } = $data; } } sub convertUndetmirb { my ( $arr, $hash ) = @_; my ( $star, $data ); my ( $last1, $last3, $name ); my %counts={}; foreach (@$arr) { $last1 = substr( $$_[1], -1 ); $last3 = substr( $$_[1], -3 ); if ( $last1 eq '*' || $last3 eq '-3p' ) { #mirdeep thinks 3p is star #print "This is a star seq!\n"; $star = 1; } elsif ( $last3 eq '-5p' || $last1 ne '*' ) { # normally would have to check for -3p but that was alread +y done #mirdeep thinks 5p is mature #print "This is a mature seq!\n"; $star = 0; } else { # should not be possible to get here print "Column 1: $$_[0]\tColumn 10: $$_[1]\n"; die( "Something is wrong!! Sequence found doesn't seem to be mature or star +\n" ); } $name = getshortname( $$_[1] ); print "$$_[1] \t shortens to \t $name\n"; #determine if precursor is already marked in hash, if it is, get refer +ence, if not, make new one and store it if ( !defined( $hash->{$name} ) ) { $data = miRNA_data->new(); $data->{ID} = $$_[0]; $data->{mirbase_name} = $$_[1]; $data->{total} = $$_[2]; $data->{loop} = $$_[4]; $data->{short_name} = $name; if ($star) { $data->{star} = $$_[3]; $data->{mature} = ''; } else { $data->{mature} = $$_[3]; $data->{star} = ''; } $hash->{ $data->{short_name} } = $data; $counts{ $data->{short_name} } = 1; } else { ++$counts{ $data->{short_name} }; if ( $counts{ $data->{short_name} } > 2 ) { print $data->{mirbase_name}, "\n"; print "Name used: $name\n"; die "Same miRNA detected more than 2 times! (star & ma +ture)\n"; } if ( $counts{ $data->{short_name} } == 2 ) {print "Duplica +te Detected! \t", $data->{mirbase_name},"\n";} #determine if line represents star or mature & store count in appropr +iate place if ( $star == 1 ) { $data->{star} = $$_[3]; } elsif ( $star == 0 ) { $data->{mature} = $$_[3]; } } } } sub combinePred_Undet { my ( $pred, $undet, $comb ) = @_; #TODO fill this out warn "warning:: combinePred_Undet not implemented yet!\n"; } sub getPredMirb { my $file = shift or die "Must give filename as arg to getPredMirb\ +n"; my $PredMirbParser = HTML::TableExtract->new( headers => ['tag id'], slice_columns => 0, keep_headers => 0 ); my @arr; $PredMirbParser->parse_file($file); for my $table ( $PredMirbParser->tables ) { for my $row ( $table->rows ) { push( @arr, parsePredMirbRow($row) ); } } return @arr; } sub parsePredMirbRow { my $row = shift; my @data = (); #ID push( @data, $$row[0] ); #mirbase name push( @data, $$row[9] ); #total push( @data, $$row[4] ); #mature push( @data, $$row[5] ); #loop push( @data, $$row[6] ); #star push( @data, $$row[7] ); # # foreach (@$row) { # if ( defined($_) ) { # print $_, "\t"; # } # else { # print "_\t"; # } # } # print "\n"; # print join("\t",@data),"\n"; return \@data; } sub getUndetMirb { my $file = shift or die "Must give filename as arg to getUndetMirb +\n"; my $UndetMirbParser = HTML::TableExtract->new( headers => ['miRBase precursor id'], slice_columns => 0, keep_headers => 0 ); my @arr; $UndetMirbParser->parse_file($file); for my $table ( $UndetMirbParser->tables ) { for my $row ( $table->rows ) { push( @arr, parseUndetPredMirbRow($row) ); } } return @arr; } sub parseUndetPredMirbRow { my $row = shift; my @data; #ID push( @data, $$row[0] ); #mirbase name push( @data, $$row[9] ); #total push( @data, $$row[4] ); #mature push( @data, $$row[5] ); #loop push( @data, '' ); #star push( @data, '' ); # foreach (@$row) { # if ( defined($_) ) { # print $_, "\t"; # } # else { # print "_\t"; # } # } # print "\n"; # print join("\t",@data),"\n"; return \@data; } package miRNA_data; sub new { my $class = shift; my $self = { ID => '', #tag ID / provisional ID/precursor name (for undete +cted seqs) mirbase_name => '', #name of miRNA short_name => '', #mirbase name with */-3p/-5p removed total => 0, mature => 0, loop => 0, star => 0, duplicates => 0 #bool will be 1 if duplicates are found }; bless $self, $class; return $self; } package DataFormat; sub new { my $class = shift; my $self = { label => shift, predmirb => {}, #will store miRNA_data entries from predict +ed mirbase set undetmirb => {}, #will store miRNA_data entries from undete +cted mirbase set mirb => {} #This will be the consolidation point keys construc +ted from mirbase_name|ID }; bless( $self, $class ); #turns var into an object } package mirdeep_html; sub new { my $class = shift; my $self = { stats => data_table->new(), novel => data_table->new(), predmirb => data_table->new(), undetmirb => data_table->new() }; bless $self, $class; return $self; } package data_table; sub new { my $class = shift; my $self = { table => getarray(), col => -1, row => -1 }; bless $self, $class; return $self; } sub getarray { my @arr; return \@arr; } sub addcell { if ( $#_ + 1 != 2 ) { die "Wrong number of arguments to addcell! Need: 2, Given: " . + @_ . " \n"; } my $self = shift; my $item = shift; if ( !defined( ${ $self->{table} }[ $self->{row} ] ) ) { die "Row referenced by index $self->{row} not defined!\n"; } push @{ $self->{table}[ $self->{row} ] }, $item; ++$self->{col}; } sub addrow { my $self = shift; my @row; push( @{ $self->{table} }, \@row ); ++$self->{row}; $self->{col} = -1; }


In reply to Having a problem with my code! by kotoroshinoto

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.