HTML Parser strange Null Character in data

caind has asked for the wisdom of the Perl Monks concerning the following question:

I’m stuck. I’ve been trying to clean up my mistakes, and there are a few I just can’t seem to figure out. Maybe I’m too close to it, but I decided an extra set of eyes may be just what I need.

I have written a script that monitors the output of a database that is pushed to a website. I’m just scraping the site and then parsing the table, and this is where I run into problems. First, the site can take a little while to respond every now and then, which shuts down the script. I may have solved that with the goto statement, but if there is a better way, I’m always willing to learn. Next, either the data I get or the way I parse the table is putting something (a space, a null, a non-printable character? It looks like a degree symbol followed by a middle dot, °∙) in front of the incident number only. Every now and then another column will have this happen too and shut down the script. I’ve tried everything I can think of, to the point of pulling out my hair, and now I’m just taking shots in the dark hoping something will hit. The stray character will create a problem with naming, saving, and recalling the file.
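On the retry question: `getstore` from LWP::Simple returns the HTTP status code, which is a true value even for a failure such as 500, so `getstore(...) or goto GET` never actually loops. A minimal sketch of a bounded retry follows. The helper name `fetch_with_retry` and its `$max_tries` and `$delay` parameters are made up for illustration and are not part of LWP.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of LWP): call $fetch, a code ref that
# returns an HTTP status code, until it reports a 2xx status or
# $max_tries attempts have been used up.
sub fetch_with_retry {
    my ($fetch, $max_tries, $delay) = @_;
    for my $attempt (1 .. $max_tries) {
        my $status = $fetch->();
        return $status
            if defined $status && $status >= 200 && $status < 300;
        # Wait before the next attempt, but not after the last one.
        sleep $delay if $delay && $attempt < $max_tries;
    }
    return;    # every attempt failed
}

# Assumed usage with LWP::Simple, where $url holds the page address:
#   my $status = fetch_with_retry(
#       sub { getstore($url, 'tempincid.html') }, 5, 10,
#   ) or die "Site did not respond after 5 attempts\n";
```

Taking the fetch as a code ref keeps the loop independent of the network call, and capping the number of attempts avoids the endless loop that a bare `goto` could turn into if the site stays down.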

BTW, I’m not a developer; I’m just dangerous with a little bit of knowledge of Perl. I have a lot of gaps in my knowledge, but I’m willing to learn.

### Script trimmed down #####


    GET:
    getstore ('http://URL.com/tast.cfm?jump=true&dropboxvalue=0&nblock=50&Sort_By=INC_Incident.IncidentNumber&Sort_Type=ASC&displayColumnList=1,2,3,4,5,6,7', 'tempincid.html')or goto GET;


my $html = 'tempincid.html';

my $te = HTML::TableExtract->new( headers => [("",
                        "Incident # ",
                        "Dispatch Time",
                        "Incident Type",
                        "Address",
                        "Apt. #",
                        "Postal Code",
                        "Unit Dispatched")] );
$te->parse_file($html);

$config{'header'} =<<"EOF"; 

<html>
<head>
<meta http-equiv="Content-Type" content="text/html"><title>ESSI | Online Home</title>
<meta http-equiv="cache-control" content="no-cache" />
<META HTTP-EQUIV="PRAGMA" CONTENT="NO-CACHE">
<META HTTP-EQUIV="REFRESH" CONTENT="15">
<link href="/Page_style/page.css" rel="stylesheet" type="text/css"></head>
<body>   
EOF

my $file = "tempincid.html";

my $date = POSIX::strftime( 
             "%c", 
             localtime( 
                 ( stat $file )[9]
                 )
             );

my $row = @{$te->rows};
print $config{'header'};


print "<TABLE align=\"center\" border=\"1\" cellpadding=\"2\" cellspacing=\"0\" width=\"100%\" style=\"font-size: 12px;\">";

 foreach my $ts ($te->tables) 
 {
          foreach my $row ($ts->rows) 
    {
       print "<tr><td>  ", join('   </td><td>   ', @$row), "\n"; ######### Every now and then the cell prints a stream of errors. I'm guessing it is that character I can't identify ####
   
    print "</td></tr>" , "\n";
    }
 }

print "</Table>", "Dispatcher data last read $date", "</body></html>";



my $numColumns = @{$te->rows->[0]};
 
my $numRows = @{$te->rows};

for my $rowIndex ( 0..$numRows-3 ) 
{ 


    for my $columnIndex ( 0..$numColumns-1 ) 
   {

    my $cellvalue = $te->rows->[$rowIndex][7];

    foreach ($cellvalue) {chomp;}
    {
    $cellvalue=uc($cellvalue);
            
        if (($cellvalue =~ /BT/)  || ($cellvalue =~ /FB/))
    {
                  
           my $path = "C:/incidentnum/";
                
        my $cellmatch = $te->rows->[$rowIndex][$columnIndex];
        
#******************************************************************************************##############################
#---------------Read each cell and as they are opened trim all whitespaces from the left and right side-----------------#
#------There is still some non-viewable (Null space, non-printable character to the left of the incident number.--------#
#########################################################################################################################

        my $cell1=$te->rows->[$rowIndex][1];
        $cell1 =~ s/^\s+|\s+//g;
        $cell1 =~ s/^\s+|\s(?=\s)|\s+$//g;
        my $cell2=$te->rows->[$rowIndex][2];
        $cell2 =~ s/^\s+|\s+$//g;
        my $cell3=$te->rows->[$rowIndex][3];
        $cell3 =~ s/^\s+|\s+$//g;
        my $cell4=$te->rows->[$rowIndex][4];
        $cell4 =~ s/^\s+|\s+$//g;
        my $cell5=$te->rows->[$rowIndex][5];
        $cell5 =~ s/^\s+|\s+$//g;
        my $cell7=$te->rows->[$rowIndex][7];
        $cell7 =~ s/^\s+|\s+$//g;
        my $row="$cell1, $cell2, $cell3, $cell4, $cell5, $cell7";
        
        my $filename = $cell1."."."txt";
        $filename =~ s/^\s+|\s+$//g;
        #$filename =~ s/^S+|\S+$//g;
        $path =~ s/^\s+|\s+$//g;
        $path =~ s/^\s+|\s(?=\s)|\s+$//g;
        my $full ="$path$filename";
    
        open (FILE, '>', $filename) or die("Couldn't open $filename");
               print FILE "$row";
        close(FILE)or die $!;

        my $powershell = 'C:\Windows\System32\WindowsPowerShell\v1.0\powershell.exe';
        my $mboxScript = 'C:\inetpub\cgi-bin\EmailAlert.ps1';
        my $result = `$powershell -command "$mboxScript"`;

        print "$powershell\n";
        print "$mboxScript\n";
        goto EEND;
    } 
    }

   }

}

    EEND:

    exit 0;

Re: HTML Parser strange Null Character in data by choroba (Cardinal) on Mar 30, 2015 at 14:45 UTC
Your code is very long. Please remove the lines that are not related to the problem. In other words, try to provide a Short, Self Contained, Correct Example. لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re: HTML Parser strange Null Character in data by tangent (Parson) on Mar 30, 2015 at 17:42 UTC
Where you repeat `$cell1 =~ s/^\s+|\s+//g;` I would suggest instead that you create a subroutine (or two) to strip the white space and any problematic characters and then pass each value to that subroutine. Have a look at the docs on Regular Expression Character Classes. For example, if the problem character is a control character then you can use the `[:cntrl:]` character class to remove it.

    foreach my $row ($ts->rows) {
        @$row = map { clean_text( $_ ) } @$row;
        #...
    }
    #...
    my $cell1=$te->rows->[$rowIndex][1];
    clean_text( $cell1 );
    #...
    sub clean_text {
        # remove all control characters
        $_[0] =~ s/[[:cntrl:]]+//g;
        # remove white space at start/end
        $_[0] =~ s/^\s+//;
        $_[0] =~ s/\s+$//;
        # or remove all white space
        # $_[0] =~ s/\s+//g;
        # return the cleaned value so the map above gets text back
        return $_[0];
    }
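If the stray character turns out not to be a control character (a non-breaking space, for instance, is not in `[:cntrl:]`), a stricter filter may help. The sketch below assumes the cells are expected to contain only printable ASCII, which may not hold for every column. The name `clean_ascii` and the sample row values are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: return a cleaned copy of the text rather than
# modifying the argument, so it can be used directly inside map.
# ASSUMPTION: only printable ASCII (0x20 to 0x7E) is wanted in a cell.
sub clean_ascii {
    my ($text) = @_;
    return '' unless defined $text;
    $text =~ s/[^\x20-\x7E]+/ /g;    # anything else becomes one space
    $text =~ s/^\s+|\s+$//g;         # trim both ends
    $text =~ s/\s{2,}/ /g;           # collapse inner runs of spaces
    return $text;
}

# Made-up sample row: a non-breaking space and a bullet operator in
# front of an incident number, plus an address with a tab and newline.
my @row     = ("\xA0\x{2219} 15-0001234 ", "123  MAIN\tST\n");
my @cleaned = map { clean_ascii($_) } @row;
print join('|', @cleaned), "\n";    # prints "15-0001234|123 MAIN ST"
```

Because the result contains nothing but printable ASCII, a cleaned incident number is also less likely to cause trouble when it is reused as part of a file name.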
Re^2: HTML Parser strange Null Character in data by caind (Initiate) on Mar 30, 2015 at 18:29 UTC
Thank you. I was going to do the subroutine a bit later, but I'll take your advice and do it now. Looking at that document, I believe it might be what I was missing. I'll let you know if this gets it. Again, thank you.
Re: HTML Parser strange Null Character in data by marinersk (Priest) on Mar 30, 2015 at 18:54 UTC
Degree signs and middle-dot characters smack of nested unordered lists. Just tossing that into the mix in case it helps an AHAH moment come to light.
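Before guessing at where the characters come from, it may help to see exactly what they are. A small sketch that lists the code point of every character in a string is shown here. The helper name `codepoints` is hypothetical, and the sample string is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: show each character of a string as U+XXXX so
# that invisible or unfamiliar characters can be identified.
sub codepoints {
    my ($text) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $text;
}

# Made-up sample: a non-breaking space, a space, and two digits.
print codepoints("\xA0 42"), "\n";    # prints "U+00A0 U+0020 U+0034 U+0032"
```

Passing it the incident-number cell the script already reads, `$te->rows->[$rowIndex][1]`, should show which code points sit in front of the digits, and those can then be looked up or targeted in a regex.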