punkish has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to write some code to parse CC licenses using RDFa, much like the Python parser shown at http://wiki.creativecommons.org/License_Properties#Using_RDFa but am not making much headway. I am using RDF::RDFa::Parser.

use Data::Dumper;
use RDF::RDFa::Parser;

my $uri   = '';
my $xhtml =
    '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">'
  . '<img alt="Creative Commons License" style="border-width:0" src="http://i.creativecommons.org/l/by/3.0/us/88x31.png" /></a><br />'
  . '<span xmlns:dc="http://purl.org/dc/elements/1.1/" href="http://purl.org/dc/dcmitype/Text" property="dc:title" rel="dc:type">My licensor</span>'
  . ' by <a xmlns:cc="http://creativecommons.org/ns#" href="www.example.com/My-Licensor" property="cc:attributionName" rel="cc:attributionURL">Puneet Kishor</a>'
  . ' is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">Creative Commons Attribution 3.0 United States License</a>.<br />'
  . 'Based on a work at <a xmlns:dc="http://purl.org/dc/elements/1.1/" href="www.example.com/Source-Work" rel="dc:source">www.example/Source-Work</a>.<br />'
  . 'Permissions beyond the scope of this license may be available at <a xmlns:cc="http://creativecommons.org/ns#" href="www.example/More-Permissions" rel="cc:morePermissions">www.example.com/More-Permissions</a>.';

my $parser = RDF::RDFa::Parser->new($xhtml, $uri);
$parser->consume;
print Dumper $parser;

Needless to say, the above code croaks because the XHTML doesn't have a root node. So I wrap it in a node, like so: <lic>$xhtml</lic>. Now the code doesn't croak, but I get nothing useful back. Here is the Data::Dumper output:

$VAR1 = bless( {
    'named_graphs' => 0,
    'bnodes' => 0,
    'DOM' => bless( do{\(my $o = 4444944)}, 'XML::LibXML::Document' ),
    'baseuri' => '',
    'tdb' => 0,
    'RDF' => {},
    'xhtml' => '<lic>blah blah blah...</lic>',
    'Graphs' => {}
}, 'RDF::RDFa::Parser' );
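
The wrapping step described above can be sketched as follows. This is only an illustration: the helper name `wrap_fragment` is hypothetical, and using an XHTML skeleton instead of the ad hoc `<lic>` element is an assumption, not something the thread shows the module to require.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of RDF::RDFa::Parser): wrap a markup
# fragment in a single root element so an XML parser will accept it.
sub wrap_fragment {
    my ($fragment) = @_;
    return '<html xmlns="http://www.w3.org/1999/xhtml">'
         . '<head><title>License fragment</title></head>'
         . '<body>' . $fragment . '</body>'
         . '</html>';
}

# A shortened stand-in for the full license fragment in the question.
my $fragment = '<a rel="license" href="http://creativecommons.org/licenses/by/3.0/us/">CC BY 3.0 US</a>';
my $wrapped  = wrap_fragment($fragment);
print "$wrapped\n";
```

Wrapping only makes the markup well-formed; it does not by itself explain the empty result shown in the dump.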

Truth be told, I am not sure what I should expect. I was expecting the RDF key and/or the Graphs key to be populated with RDF triples. If I put my $xhtml fragment (without my fake root node) into the license validator at http://validator.creativecommons.org/, it parses it just fine and gives back a result that makes sense, so I guess that is what I am after.

Any suggestions anyone?

Note: There seems to be a dearth of Perl code for this. On the CC site, we have Python, PHP and Ruby, but no Perl. http://buzzword.org.uk/swignition/ is a new project by the author of RDF::RDFa::Parser. Perhaps I should try that at some point, but one step at a time.

--

when small people start casting long shadows, it is time to go to bed

Re: Parsing CC licenses
by Your Mother (Archbishop) on Apr 12, 2009 at 17:46 UTC

    A casual look through the Pod and source leads me to believe you have to use callbacks to catch the triples; nothing special is done with the triples by the object other than passing them along. This isn't what I would call the most fun UI but you should be able to get what you want out of it by writing and inserting callbacks.

      In some of the earlier versions of the module, you did need to catch triples via the callbacks, but as of 0.10 that is no longer necessary (though it's still possible, and allows you to do some interesting things).

      In this case, I think the empty string URI is causing the problem. You need to plug in the URI of the page being parsed here. (If you don't know the URI, you could always put something like 'http://invalid.invalid/'.)

      Here's an example usage which should work:

      use Data::Dumper;
      use LWP::Simple;
      use RDF::RDFa::Parser;

      my $uri    = 'http://www.w3.org/2006/07/SWD/RDFa/testsuite/xhtml1-testcases/0058.xhtml';
      my $parser = RDF::RDFa::Parser->new(get($uri), $uri);
      $parser->consume;
      print Dumper( $parser->graph );

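
The placeholder suggestion in the reply above could be applied with a small guard like the one below. This is a minimal sketch: the helper name `base_or_placeholder` is hypothetical, and the parser wiring is left in a comment because it simply repeats the constructor call already shown in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: fall back to a placeholder base URI when the
# real page URI is unknown or empty. The placeholder value is the one
# suggested in the reply above.
sub base_or_placeholder {
    my ($uri) = @_;
    return (defined $uri && length $uri) ? $uri : 'http://invalid.invalid/';
}

print base_or_placeholder(''), "\n";
print base_or_placeholder('http://www.example.com/My-Licensor'), "\n";

# Assumed wiring, following the constructor call shown in the thread:
#   my $parser = RDF::RDFa::Parser->new($xhtml, base_or_placeholder($uri));
#   $parser->consume;
```
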
        Right, tobyink, adding a URI fixed the problem. Many thanks for responding quickly. More documentation would definitely help, particularly with examples. As you say, adding custom callbacks would allow one to do "interesting things" -- some examples would be great. I would be happy to write the documentation up if I knew where to start.

        You mention the module is beta, and its version number reflects that. Do you have a roadmap for the module, or are you focusing your energy on Swignition now?

        What is the plan for Swignition? Is that going to become a CPAN resident module or will it develop as it is right now, a standalone program? I downloaded it, but it didn't run... it required some dependencies that it couldn't find.

        I would definitely like to see RDF::RDFa::Parser develop further and become more robust and a viable alternative to the Python/Ruby/PHP stuff out there. Doesn't matter if it develops as it is, or in a new incarnation as Swignition.

        By the way, your Swignition link on your website is broken. The link for 0.15 leads to the older Cognition 0.14 version.

        --

        when small people start casting long shadows, it is time to go to bed

      Frankly, I have no idea where to start with handling the triples; that is what I thought this module would do. Otherwise, why even bother with this module? I could just use XML::Parser or something similar and roll my own, no?

      In any case, the Pod of the module also says

      In place of either or both functions you can use the string 'print' which sets the callback to a built-in function which prints the triples to STDOUT as Turtle. Either or both can be set to undef, in which case, no callback is called when a triple is found.

      So, that is exactly what I did, but got nothing, nada, zilch.

      The relevant code in the module is

      if (lc($_[$n]) eq 'print') {
          $this->{'sub'}->[$n] = ($n==0 ? \&_print0 : \&_print1);
      }

      # .. a few lines down ..

      sub _print0 # Prints a Turtle triple.
      {
          my $this    = shift;
          my $element = shift;
          my $subject = shift;
          my $pred    = shift;
          my $object  = shift;

          printf("# Triple on element %s.\n", $element->nodePath);
          printf("%s %s %s .\n",
              ($subject =~ /^_:/ ? $subject : "<$subject>"),
              "<$pred>",
              ($object =~ /^_:/ ? $object : "<$object>"));
      }

      # .. and so on ..
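
For comparison, a custom callback that collects triples instead of printing them might look like the sketch below. It assumes the same argument order as `_print0` above (parser object, DOM element, subject, predicate, object). The names `collect_triple`, `@triples` and `FakeElement` are hypothetical, the sample triple is illustrative data rather than real parser output, and the callback is invoked by hand here because the thread does not show how callbacks are registered.

```perl
use strict;
use warnings;

my @triples;

# Hypothetical callback with the same argument order as _print0:
# parser object, DOM element, subject, predicate, object.
sub collect_triple {
    my ($parser, $element, $subject, $pred, $object) = @_;
    push @triples, {
        path      => $element->nodePath,
        subject   => $subject,
        predicate => $pred,
        object    => $object,
    };
    return;
}

# Stand-in for an XML::LibXML element so the sketch runs on its own.
{
    package FakeElement;
    sub new      { my ($class, $path) = @_; return bless { path => $path }, $class }
    sub nodePath { return $_[0]{path} }
}

# Simulate the parser reporting one resource triple (illustrative values).
collect_triple(
    undef,
    FakeElement->new('/lic/a[1]'),
    'http://invalid.invalid/',
    'http://www.w3.org/1999/xhtml/vocab#license',
    'http://creativecommons.org/licenses/by/3.0/us/',
);

# Print each collected triple, prefixed with the element path.
printf "%s: <%s> <%s> <%s> .\n", @{$_}{qw(path subject predicate object)}
    for @triples;
```

Collecting into an array keeps the triples available for later filtering, for example picking out only the license predicate.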

      --

      when small people start casting long shadows, it is time to go to bed