Oh Wise Ones,

In a sudden burst of sanity I created a module that can extract data from a bunch of generated or otherwise similar HTML pages. It's a bit template-like, but in the wrong direction.

To use it, you take a copy of one page as the "template" file and replace any value you want to extract with [% name_of_the_value %]. The module then returns a reference to a hash that maps these names to the corresponding values found in a second document.

Edit: I just found out about Template::Extract, so this can be moved to /dev/null.


Example:
Let's say I have a bunch of HTML documents that all look kind of like this:
    <html>
    <head>
    <title>Mammals</title>
    </head>
    <body>
    <h1>Mammals</h1>
    <h2 id="1">Monkeys</h2>
    </body>
    </html>
Now I want to extract certain values from that HTML document.
From the HTML document I create a template that looks like this:
    <html>
    <head>
    <title>[% title %]</title>
    </head>
    <body>
    <h1>Mammals</h1>
    <h2 id="[% myidentifier %]">[% animal %]</h2>
    </body>
    </html>
Now this piece of code:
    #!/usr/bin/perl
    use strict;
    use warnings;

    use ExtractDiff;
    use File::Slurp;

    my $template = read_file('template.html');
    my $document = read_file('document.html');

    my $resultRef = ExtractDiff::getValues(\$template, \$document);

    foreach (keys %$resultRef) {
        print "$_: $$resultRef{$_}\n";
    }
Would produce this (the order of the lines may vary, since hash keys are unordered):
    myidentifier: 1
    animal: Monkeys
    title: Mammals
The actual code is this:
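    package ExtractDiff;

    use strict;
    use warnings;
    use Algorithm::Diff qw(sdiff);
    use Data::Dumper;

    # Compare a template with a document (both passed as scalar refs) and
    # return a hash ref mapping each [% name %] to the text found in its place.
    sub getValues {
        my $template = shift;
        my $document = shift;
        my %result;

        foreach my $item (sdiff(splitFile($template), splitFile($document))) {
            # Only changed ('c') tokens whose template side holds a placeholder matter.
            if (($item->[0] eq 'c') && ($item->[1] =~ m/\[\%\s*(.+?)\s*\%\]/)) {
                my $name           = $1;
                my $templateString = $item->[1];
                my $documentString = $item->[2];
                # Literal text around the placeholder (one placeholder per token).
                if ($templateString =~ m/^(.*?)\[\%.*?\%\](.*?)$/) {
                    my $prefix  = $1;
                    my $postfix = $2;
                    if ($documentString =~ m/^\Q$prefix\E(.*)\Q$postfix\E$/) {
                        #print "$name: $1\n";
                        $result{$name} = $1;
                    }
                }
            }
        }
        return \%result;
    }

    # Split the referenced string into tags and the text between them.
    sub splitFile {
        my $ref = shift;
        # Drop empty or undefined pieces, but keep a literal "0".
        my @file = grep { defined && length } split(/\s*(<.+?>)\s*/, $$ref);
        return \@file;
    }

    1;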
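The prefix/postfix matching at the heart of getValues can be tried on its own, without Algorithm::Diff. The following is only a minimal sketch of that single step: extract_pair is a hypothetical helper that is not part of ExtractDiff, and the token pairs are written out by hand, assuming sdiff would pair them this way for the example documents.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of ExtractDiff): given one template token and
# the document token paired with it, return (name, value), or an empty list
# if the template token has no placeholder or the surrounding text differs.
sub extract_pair {
    my ($template_token, $document_token) = @_;

    # Split the template token around its single [% name %] placeholder.
    my ($prefix, $name, $postfix) =
        $template_token =~ m/^(.*?)\[%\s*(.+?)\s*%\](.*)$/s
        or return;

    # The document token must carry the same literal text on both sides.
    my ($value) = $document_token =~ m/^\Q$prefix\E(.*)\Q$postfix\E$/s
        or return;

    return ($name, $value);
}

# Hand-written token pairs, assumed to mirror what sdiff reports as changed.
my @pairs = (
    [ '[% title %]',                  'Mammals' ],
    [ '<h2 id="[% myidentifier %]">', '<h2 id="1">' ],
    [ '[% animal %]',                 'Monkeys' ],
);

for my $pair (@pairs) {
    my ($name, $value) = extract_pair(@$pair) or next;
    print "$name: $value\n";
}
```

Like the module itself, this sketch handles only one placeholder per token; a tag with two placeholders would need a different approach.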
Does anybody have any comments on this? Is it handy enough to put on CPAN? What would be a good name?

In reply to RFC: Module for extracting data from generated HTML pages by Jaap
