Perlbeginner1 has asked for the wisdom of the Perl Monks concerning the following question:

Good evening, dear community,
First of all, I am very happy to have found this great place. I like this forum very much, since it has a great and supportive community, and I learn a lot from you folks here! Each question gets some great reviewers, and each thread is a valuable learning resource.
I am fairly new to Perl and fairly new to this board. I am currently working on a little parser: I want to parse a table on this page:


http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=607.590098597145&SchulAdresseMapDO=154763

This page has a table with labels and values. HTML::TableExtract needs something that uniquely identifies the table in question; this can be the content of its headers or its HTML attributes. In this case there is only one table in the document, so strictly we don't even need that. But if I pass anything to the constructor, I would pass the class of the table.
We do not want the table column by column. The first column consists of labels and the second column consists of values, so to get each label together with its value we should process the table row by row. Can this be done like so:

#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;
use YAML;

my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_ergebnis_tab_info' },
);

# t.html is a saved copy of this page:
# http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=143960
$te->parse_file('t.html');

foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        print " ", join(',', @$row), "\n";
    }
}


See the results:

martin@suse-linux:~/perl> perl parser_perl_nrw2.pl
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schuldaten,
�,�Schule hat Schulbetrieb
Schulnummer,�143960

Amtliche Bezeichnung,�Franziskusschule Kath. Hauptschule Ahaus - Sekundarstufe I -
Strasse,�Hof zum Ahaus 6

Plz und Ort,�48683 Ahaus
Telefon,�02561 4291990
Fax,�02561 42919920
E-Mail-Adresse,�143960@schule.nrw.de
Internet,�http://www.franziskusschule.de
�,�Schule in �ffentlicher Tr�gerschaft
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
�,
Sch�lergesamtzahl,�648
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
�,
Ganztagsunterricht,�Ja (erweiterter Ganztagsbetrieb)
Sonstiges,�Teilnahme am Projekt 'Betrieb und Schule (BUS)'

Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Unterrichtsangebote,
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schule erteilt Unterricht in Fremdsprache(n)..., �,�Englisch



Well, I can probably do the following to get rid of the uninitialized value warnings?
Some of the table cells are empty (undefined), so I may want to test for them or filter them out, like this for example:

foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        my @values = grep {defined} @$row;
        print " ", join(',', @values), "\n";
    }
}


Another thing we could do is to disable the warning outright for this particular block with no warnings 'uninitialized', but that is generally not good practice.
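For comparison, here is a minimal sketch of what a scoped no warnings 'uninitialized' could look like. The helper name join_row_quietly and the sample row are made up for illustration; the point is only that the pragma is lexically scoped, so the warning stays enabled everywhere outside the sub.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: join one row's cells, letting undef cells
# become empty strings without a warning.
sub join_row_quietly {
    my ($row) = @_;
    # Lexically scoped: the warning is switched off only inside this sub.
    no warnings 'uninitialized';
    return join(',', @$row);
}

# Made-up sample row with an undef second cell.
my $row = [ 'Schuldaten', undef ];
print join_row_quietly($row), "\n";
```

This keeps the trailing comma in the output, so it hides the warning rather than cleaning the data.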


General question: what else can help to get rid of the unsanitized data? Note: I want to store everything in a MySQL database!

You see, this parser does not work badly; I only want to get rid of the unsanitized data (lines). I look forward to getting some ideas and starting points!
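One possible starting point, as a sketch only: normalise every cell before it goes anywhere near the database. The helper clean_cell is a made-up name, and the code assumes the saved page is ISO-8859-1, which is a guess based on the garbled characters in the output above. Storing the cleaned values would then ideally go through DBI placeholders, which is not shown here.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: normalise one cell before it is stored.
# Assumption: the saved page is ISO-8859-1, so the cell holds Latin-1 bytes.
sub clean_cell {
    my ($cell) = @_;
    return '' unless defined $cell;          # undef cell (rowspan/colspan)
    my $text = decode('ISO-8859-1', $cell);  # bytes -> Perl characters
    $text =~ s/\x{A0}/ /g;                   # non-breaking space -> plain space
    $text =~ s/\s+/ /g;                      # collapse newlines and runs of spaces
    $text =~ s/^ //;                         # trim leading space
    $text =~ s/ $//;                         # trim trailing space
    return $text;
}

binmode STDOUT, ':encoding(UTF-8)';          # assumes a UTF-8 terminal

# Made-up sample rows shaped like the ones in the output above.
my @rows = (
    [ "Sch\xFClergesamtzahl", "\xA0648" ],
    [ "\xA0", undef ],
    [ 'Strasse', "\xA0Hof zum Ahaus 6\n" ],
);

for my $row (@rows) {
    my ($label, $value) = map { clean_cell($_) } @$row;
    next unless length $label && length $value;   # skip spacer rows
    print "$label: $value\n";
}
```

Rows whose label or value is empty after cleaning are skipped, which removes the spacer lines that consist only of a non-breaking space.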

Any and all help will be greatly appreciated.
Regards
pb1

Replies are listed 'Best First'.
Re: optimizing a parser running HTML::TableExtract to fetch only some labels and values [row by row]
by Corion (Patriarch) on Dec 19, 2010 at 17:39 UTC

    If you don't want to print undefined values, don't call print on them. I'm not sure what the question is.

      Hello Corion,

      many thanks for the answer! I want to get the results out of the table that is shown here:

      http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=607.590098597145&SchulAdresseMapDO=154763

      Note that there are only a few (about 11) labels with corresponding values. I want to get all of those rows of data, but nothing more. See the overhead of extra lines and text in the result... I want to get rid of this!

      Do I have to be more descriptive or write more? Just let me know!

      I look forward to hearing from you.

      Regards, pb1

        Why don't you then just fetch the labels and values you want and leave the others alone?
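A minimal sketch of that idea, assuming the rows arrive as array references of [label, value] the way rows() returns them. The %wanted whitelist and the helper pick_wanted are made up for illustration; the labels are copied from the output posted above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical whitelist; labels copied from the output posted above.
my %wanted = map { $_ => 1 } (
    'Schulnummer', 'Amtliche Bezeichnung', 'Strasse', 'Plz und Ort',
    'Telefon', 'Fax', 'E-Mail-Adresse', 'Internet',
);

# Hypothetical helper: keep only whitelisted label/value pairs.
sub pick_wanted {
    my (@rows) = @_;
    my %record;
    for my $row (@rows) {
        my ($label, $value) = @$row;
        next unless defined $label && defined $value;   # rowspan/colspan gaps
        # Trim ordinary whitespace and non-breaking spaces (0xA0) at both ends.
        s/^[\s\xA0]+|[\s\xA0]+$//g for $label, $value;
        next unless $wanted{$label};
        $record{$label} = $value;
    }
    return \%record;
}

# Made-up sample rows shaped like the ones in the output above.
my $nbsp = "\xA0";
my @rows = (
    [ 'Schuldaten',  undef ],
    [ $nbsp,         $nbsp . 'Schule hat Schulbetrieb' ],
    [ 'Schulnummer', $nbsp . '143960' ],
    [ 'Telefon',     $nbsp . '02561 4291990' ],
);

my $record = pick_wanted(@rows);
print "$_: $record->{$_}\n" for sort keys %$record;
```

Collecting the pairs into a hash also gives one record per school, which maps naturally onto one database row.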

Re: optimizing a parser running HTML::TableExtract to fetch only some labels and values [row by row]
by Anonymous Monk on Dec 19, 2010 at 18:28 UTC

    From the documentation for HTML::TableExtract, emphasis added:

    rows()

    Return all rows within a matched table. Each row returned is a reference to an array containing the text, HTML, or reference to the HTML::Element object of each cell depending the mode of extraction. Tables with rowspan or colspan attributes will have some cells containing undef. Returns a list or a reference to an array depending on context.

    You need to decide how to handle the cases where cells span rows or columns. If you just want to ignore them, then you can use grep to filter them out, e.g. print "a cell: $_\n" for grep {defined} @$row;.
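The two obvious choices can be sketched like this; the helper names defined_cells and padded_cells are made up, and the sample row is invented. Dropping undef cells shifts the remaining cells to the left, while padding keeps every cell in its original column.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: drop undef cells entirely (later cells shift left).
sub defined_cells {
    my ($row) = @_;
    return grep { defined } @$row;
}

# Hypothetical helper: keep column positions by turning undef into ''.
sub padded_cells {
    my ($row) = @_;
    return map { defined $_ ? $_ : '' } @$row;
}

# Made-up sample row with an undef cell in the middle.
my $row = [ 'Telefon', undef, '02561 4291990' ];

print join('|', defined_cells($row)), "\n";
print join('|', padded_cells($row)),  "\n";
```

For a two-column label/value table, padding is probably the safer default, because a missing label then cannot be mistaken for a value.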

      ... or disable the warnings no warnings 'uninitialized';

      print+qq(\L@{[ref\&@]}@{['@'x7^'!#2/"!4']});


      Well, I can probably do the following to get rid of the uninitialized value warnings?
      Some of the table cells are empty (undefined), so we can test for them or filter them out, like this for example:

      foreach my $table ( $te->tables ) {
          foreach my $row ( $table->rows ) {
              my @values = grep {defined} @$row;
              print " ", join(',', @values), "\n";
          }
      }


      Another thing we could do is to disable the warning outright for this particular block with no warnings 'uninitialized', but that is generally not good practice.

      Watcha think !?