Good evening, dear community,
First of all, I am very happy to have found this great place. I like this forum very much, since it has a great and supportive community! I learn a lot from you folks here! Each question gets some great reviewers, and
each thread is a rich learning asset.
Well, I am fairly new to Perl and fairly new to this board. I am currently working on a little parser: I want to parse the table on this page:
http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=607.590098597145&SchulAdresseMapDO=154763
This page has a table with labels and values.
We need to provide something that uniquely identifies the table in question. This can be the content of its headers or its HTML attributes. In this case there is only one table in the document, so we don't even need to do that. But if I were to pass anything to the constructor, I would pass the class of the table.
We do not want the columns of the table. The first column of this table consists of labels and the second column consists of values. To get the labels and values at the same time, we should process the table row-by-row.
Well, can this be done like so?
#!/usr/bin/perl
use strict; use warnings;
use HTML::TableExtract;
use YAML;

# Select the table by its class attribute.
my $te = HTML::TableExtract->new(
    attribs => { class => 'bp_ergebnis_tab_info' },
);
# t.html is a saved copy of
# http://www.schulministerium.nrw.de/BP/SchuleSuchen?action=559.5361066995808&SchulAdresseMapDO=143960
$te->parse_file('t.html');

# Walk the table row by row; each row is an array ref of cells.
foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        print " ", join(',', @$row), "\n";
    }
}
See the results:
martin@suse-linux:~/perl> perl parser_perl_nrw2.pl
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schuldaten,
�,�Schule hat Schulbetrieb
Schulnummer,�143960
Amtliche Bezeichnung,�Franziskusschule Kath. Hauptschule Ahaus - Sekundarstufe I -
Strasse,�Hof zum Ahaus 6
Plz und Ort,�48683 Ahaus
Telefon,�02561 4291990
Fax,�02561 42919920
E-Mail-Adresse,�143960@schule.nrw.de
Internet,�http://www.franziskusschule.de
�,�Schule in �ffentlicher Tr�gerschaft
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
�,
Sch�lergesamtzahl,�648
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
�,
Ganztagsunterricht,�Ja (erweiterter Ganztagsbetrieb)
Sonstiges,�Teilnahme am Projekt 'Betrieb und Schule (BUS)'
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Unterrichtsangebote,
Use of uninitialized value $row in join or string at parser_perl_nrw2.pl line 17.
Schule erteilt Unterricht in Fremdsprache(n)...,
�,�Englisch
Well, I can probably do the following to get rid of the uninitialized value warnings.
Some of the table cells are empty, so you may want to test for them or filter them out, like this for example:
foreach my $table ( $te->tables ) {
    foreach my $row ( $table->rows ) {
        # keep only the cells that actually have a value
        my @values = grep {defined} @$row;
        print " ", join(',', @values), "\n";
    }
}
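Filtering with grep drops the empty cells, but it also shifts the remaining cells to the left and still prints rows that contain nothing useful. Here is a minimal sketch of a slightly more thorough cleanup. It is only an assumption on my side that the odd replacement characters in the output above are non-breaking spaces (character 0xA0) coming from the HTML; the sample rows and the helper name clean_row are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Clean one row (an array ref of cells, some of which may be undef).
# Returns a list of trimmed strings; empty cells become ''.
sub clean_row {
    my ($row) = @_;
    return map {
        my $cell = defined $_ ? $_ : '';   # undef -> empty string
        $cell =~ s/\x{A0}/ /g;             # assumed: non-breaking space -> plain space
        $cell =~ s/^\s+|\s+$//g;           # trim leading and trailing whitespace
        $cell;
    } @$row;
}

# Hypothetical sample rows, shaped like what $table->rows returns.
my @rows = (
    [ 'Schuldaten',  undef ],
    [ "\x{A0}",      "\x{A0}Schule hat Schulbetrieb" ],
    [ 'Schulnummer', "\x{A0}143960" ],
    [ "\x{A0}",      undef ],
);

foreach my $row (@rows) {
    my @cells = clean_row($row);
    next unless grep { length } @cells;    # skip rows with no content at all
    print join(',', @cells), "\n";
}
```

Because every cell is kept (as an empty string if need be), the label stays in the first position and the value in the second, which should make it easier to map them to database columns later.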
Another thing we could do is to disable the warnings outright for this particular block with no warnings 'uninitialized', but that is generally not good practice.
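If one goes that way anyway, the pragma can at least be confined to the smallest possible scope, since it is lexically scoped. A small sketch, with a made-up helper name join_row and a made-up sample row:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Join the cells of a row, treating undef cells as empty strings.
sub join_row {
    my ($row) = @_;
    no warnings 'uninitialized';   # in effect only until the end of this sub
    return join(',', @$row);
}

# Hypothetical row with one empty cell.
print join_row([ 'Schuldaten', undef ]), "\n";
```

Everywhere outside join_row the warnings stay switched on, so other mistakes with undefined values still get reported.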
General question: what else can help to get rid of the unsanitized data? Note: I want to store everything in a MySQL database!
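For the database side, one common approach is to never interpolate the scraped values into the SQL string and to pass them as bind values through placeholders instead. Below is a rough sketch of that idea. The table name schools, the column names and the helper build_insert are all hypothetical, and the DBI part is left commented out because it needs DBI, DBD::mysql and a running database.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a parameterised INSERT from a column => value hash ref.
# Identifiers cannot be bound as placeholders, so they are checked
# against a strict pattern instead.
sub build_insert {
    my ($table, $record) = @_;
    my @cols = sort keys %$record;
    for my $name ($table, @cols) {
        die "unsafe identifier: $name" unless $name =~ /^\w+$/;
    }
    my $sql = sprintf 'INSERT INTO %s (%s) VALUES (%s)',
        $table, join(', ', @cols), join(', ', ('?') x @cols);
    return ($sql, @{$record}{@cols});   # statement plus bind values in column order
}

# Hypothetical record built from the cleaned label/value pairs.
my %school = (
    schulnummer => '143960',
    plz_ort     => '48683 Ahaus',
);

my ($sql, @bind) = build_insert('schools', \%school);
print "$sql\n";

# With DBI and DBD::mysql installed, the statement could be run like this
# (database name, user and password are placeholders):
# use DBI;
# my $dbh = DBI->connect('DBI:mysql:database=schooldb;host=localhost',
#                        'user', 'password', { RaiseError => 1 });
# $dbh->do($sql, undef, @bind);
```

The values then travel separately from the SQL text, so quotes or other odd characters in the scraped data should not be able to break the statement.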
You see, this parser does not work badly; I only want to get rid of the unsanitized data (lines). I look forward to some ideas and starting points!
Any and all help will be greatly appreciated.
Regards
pb1