karmas has asked for the wisdom of the Perl Monks concerning the following question:

I'm implementing an application that will extract information about computer parts and their prices from several websites. Since most of them use awful HTML (from a parser's point of view), I'll need some sort of format description that the HTML parser can then use to extract the data. I thought about using XML for this purpose, something like:
<sites>
  <site>
    <name>Ralinga</name>
    <url>http://www.ralinga.lt/</url>
    <pages>
      <info type="memory">/kainos/index.php?rodyti=ram</info>
      <info type="processor">/kainos/index.php?rodyti=cpu</info>
    </pages>
    <regexp type="get_after">
      <extract type="info">TD class=kain width=500</extract>
      <extract type="price">TD class=kain align=middle width=100 +</extract>
    </regexp>
  </site>
As this is my first data mining project, maybe fellow monks could comment on the correctness of this approach?

(jeffa) Re: Data mining
by jeffa (Bishop) on Mar 23, 2002 at 19:08 UTC
    Looks good to me. (Don't forget the closing </sites> tag.)

    A small consideration, however, is this:

    <site> <name>Ralinga</name> <url>http://www.ralinga.lt/</url> ... </site>
    versus this:
    <site name="Ralinga" url="http://www.ralinga.lt/"> ... </site>
    It really is only a matter of personal choice, though. I find it very helpful to shove my XML through XML::Simple (with and without forcearray on) and then shove the resulting data structure through Data::Dumper to see the differences.
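A minimal sketch of that comparison, assuming XML::Simple is installed from CPAN. The two XML strings are trimmed-down versions of the snippets above, and `parse_variant` is a hypothetical helper name. `KeyAttr => []` is set only to switch off XML::Simple's default folding on `name`, so the two layouts stay comparable.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);
use Data::Dumper;

# Stable, compact dumps so the variants are easy to compare by eye.
$Data::Dumper::Sortkeys = 1;
$Data::Dumper::Indent   = 1;

# Same information, once as child elements and once as attributes.
my $as_elements   = '<sites><site><name>Ralinga</name>'
                  . '<url>http://www.ralinga.lt/</url></site></sites>';
my $as_attributes = '<sites><site name="Ralinga" '
                  . 'url="http://www.ralinga.lt/" /></sites>';

# Hypothetical helper: parse one XML string with ForceArray on or off.
sub parse_variant {
    my ($xml, $force_array) = @_;
    return XMLin($xml, ForceArray => $force_array, KeyAttr => []);
}

for my $variant ([ elements => $as_elements ], [ attributes => $as_attributes ]) {
    my ($label, $xml) = @$variant;
    for my $force_array (0, 1) {
        print "--- $label, ForceArray => $force_array\n";
        print Dumper(parse_variant($xml, $force_array));
    }
}
```

With ForceArray on, child elements should come back wrapped in array refs while attribute values stay plain scalars, which is where the two layouts start to differ.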

    UPDATE:
    After some reflection, I am curious why you didn't break this line up:

    <extract type="info">TD class=kain width=500</extract>
    # which would yield a data structure similar to
    'extract' => [ { 'content' => 'TD class=kain width=500', 'type' => 'info' } ]
    Wrapping the content in a simple tag would do the job of breaking up the atoms for you:
    <extract type="info"><TD class="kain" width="500"/></extract>
    # yields something like:
    'extract' => [ { 'TD' => { 'class' => 'kain', 'width' => '500', }, 'type' => 'info' } ]
    Let the XML parser do the work. ;)
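As a rough sketch of what that buys you, again assuming XML::Simple: once the TD is a real element, each extraction rule can be walked as a plain hash. The `load_rules` helper is a hypothetical name, and the trailing `+` from the original price rule is left out because its meaning isn't stated.

```perl
use strict;
use warnings;
use XML::Simple qw(XMLin);

# The <regexp> block from the question, with each TD written as an element.
my $rules_xml = <<'XML';
<regexp type="get_after">
  <extract type="info"><TD class="kain" width="500"/></extract>
  <extract type="price"><TD class="kain" align="middle" width="100"/></extract>
</regexp>
XML

# Hypothetical helper: always return 'extract' as a list, even for one rule.
sub load_rules {
    my ($xml) = @_;
    return XMLin($xml, ForceArray => ['extract'], KeyAttr => []);
}

my $config = load_rules($rules_xml);

for my $rule (@{ $config->{extract} }) {
    my $td    = $rule->{TD};    # hash of the TD element's attributes
    my $attrs = join ' ', map { "$_=$td->{$_}" } sort keys %$td;
    print "$rule->{type}: TD $attrs\n";
}
```

Forcing only `extract` into an array keeps the loop working whether a site defines one rule or several.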

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)