G'day dtoronto, hello cowboy - hello to my dear Perl folks,
many, many thanks for the reply.
I fixed the code and made up my mind. Here is what I am trying to accomplish.
First of all: I am happy to hear from you! This is probably one of the best places to ask such questions, so I am asking now.
First off, I have to explain something: I have to grab some data out of a phpBB forum in order to do some field research. I need the data from a forum that is run by a user community, so that I can analyze the discussions.
To give an example, let us take this forum here. How can I grab all the data out of this forum, get it onto my local machine, and afterwards put it into a local database of a phpBB forum? Is this possible?
http://www.nukeforums.com/forums/viewforum.php?f=17
Nothing harmful, nothing bad, nothing serious or dangerous. But the issue is that I have to get the data, so what now?
I need the data in an almost full and complete format. So I need all the data, like
username
forum
thread
topic
text of the posting, and so on.
How do I do that?
I need some kind of grabbing tool. Can I do it with that kind of tool? And how do I solve the issue of storing the data in the local MySQL database?
Well, you see, that is tricky work, and I am pretty sure that I will get help here. So for any and all help I am very thankful.
Many, many thanks in advance.
I am testing some code; this is a proof of concept. Please bear with me, as this is only a Perl snippet. Can you help me? The question is: if I apply this to another forum, can I get detailed results? Thanks for any and all help.
cheers
And here is a code example that is run against the forum.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::RobotUA;
use HTML::LinkExtor;
use HTML::TokeParser;
use URI::URL;
use Data::Dumper; # for show and troubleshooting

my $url = "http://www.phpBBhacks.com/forums/viewforum.php?f=17";

# LWP::RobotUA requires an agent name and a contact address;
# both values below are placeholders
my $ua = LWP::RobotUA->new('forum-grabber/0.1', 'me@example.com');
my $lp = HTML::LinkExtor->new(\&wanted_links);
my @links;

get_threads($url);

foreach my $page (@links) { # this loops over each link collected from the index
    my $r = $ua->get($page);
    if ($r->is_success) {
        my $stream = HTML::TokeParser->new(\$r->content)
            or die "Parse error in $page: $!";
        # just printing what was collected
        print Dumper get_thread($stream);
        # would instead have database insert statement at this point
    } else {
        warn $r->status_line;
    }
}

# Collect the title and the name/post pairs of one thread page
sub get_thread {
    my $p = shift;
    my ($title, $name, @thread);
    while (my $tag = $p->get_tag('a','span')) {
        if (exists $tag->[1]{'class'}) {
            if ($tag->[0] eq 'span') {
                if ($tag->[1]{'class'} eq 'name') {
                    $name = $p->get_trimmed_text('/span');
                } elsif ($tag->[1]{'class'} eq 'postbody') {
                    my $post = $p->get_trimmed_text('/span');
                    push @thread, {'name'=>$name, 'post'=>$post};
                }
            } else {
                if ($tag->[1]{'class'} eq 'maintitle') {
                    $title = $p->get_trimmed_text('/a');
                }
            }
        }
    }
    return {'title'=>$title, 'thread'=>\@thread};
}

# Fetch one index page and fill @links with its topic links
sub get_threads {
    my $page = shift;
    # fetch the page that was passed in, not the global $url
    my $r = $ua->request(HTTP::Request->new(GET => $page),
                         sub { $lp->parse($_[0]) });
    # Expand URLs to absolute ones (modifies @links in place)
    my $base = $r->base;
    $_ = url($_, $base)->abs for @links;
    return \@links;
}

# Callback for HTML::LinkExtor: keep only links to topics
sub wanted_links {
    my ($tag, %attr) = @_;
    return unless exists $attr{'href'};
    return if $attr{'href'} !~ /^viewtopic\.php\?t=/;
    # push only the href, not every attribute value of the tag
    push @links, $attr{'href'};
}
$VAR1 = {
    'thread' => [
        {
            'post' => 'Hello, I\'m pretty new to PHPNuke. I\'ve got my site up and running great! I\'m now starting to make modifications, add modules etc. I\'m using the most recent RavenPHP76. I want to display the 5 most recent forum posts at the top of the forum page. I\'m not sure if this functionality is built in, if so, how to activate. Or if there is a module or block made to do this. I looked at Raven\'s Collapsing Forum block but wasn\'t crazy about the format, and I don\'t want it to be collapsable. Thanks! mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there',
            'name' => 'sail'
        },
        {
            'post' => 'thanks for asking this; :not very sure if i got you right; Do you want to have a feed of the last forumthreads? guess the easiest way is to go to raven and ask how he did it. hth sail.',
            'name' => 'sail'
        },
        {
            'post' => 'Thanks. i found what I was looking for. It wasn\'t so easy to find! It\'s called glance_mod. mopho',
            'name' => 'mopho'
        },
        {
            'post' => 'hi there thx',
            'name' => 'sail'
        },
        {
            'post' => 'it sound interesting - i will have also a look i google after it - and try to find out more regards sailor',
            'name' => 'sail'
        }
    ],
    'title' => 'Recent Forum Posts Module'
};
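To get from a dumped structure like the one above to database rows, something along these lines might work. It is only a sketch: the table name `posts`, its columns, and the extra `$topic_url` argument are made up for illustration, and the caller is assumed to pass in an already connected DBI handle.

```perl
use strict;
use warnings;

# Flatten the hash returned by get_thread() into one row per post.
# $topic_url is a hypothetical extra value identifying the thread.
sub thread_to_rows {
    my ($topic_url, $thread) = @_;
    my $position = 0;    # running number of the post within the thread
    return map {
        [ $topic_url, $thread->{title}, ++$position, $_->{name}, $_->{post} ]
    } @{ $thread->{thread} };
}

# Insert the rows through a connected DBI handle.
# Table and column names are assumptions; adjust them to the real schema.
sub store_thread {
    my ($dbh, $topic_url, $thread) = @_;
    my $sth = $dbh->prepare(
        'INSERT INTO posts (topic_url, title, position, author, body)
         VALUES (?, ?, ?, ?, ?)'
    );
    my @rows = thread_to_rows($topic_url, $thread);
    # placeholders let the driver quote apostrophes in the post text
    $sth->execute(@$_) for @rows;
    return scalar @rows;
}
```

With DBD::mysql installed, the handle could come from something like `DBI->connect('dbi:mysql:database=forum', $user, $password, { RaiseError => 1 })`, where the database name and credentials are again placeholders.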
Hmm, I want to grab data out of a forum (for my studies):
http://www.nukeforums.com/forums/viewforum.php?f=17
This is really preliminary. It just grabs the basic text from the threads and doesn't handle quoted text correctly yet. Would this be hard to fix? There are many parsing approaches that can be taken in Perl.
We obviously also have to set up a database to capture the information that is to be stored.
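For the quoted-text problem, one possible approach is sketched below. It assumes the markup of the default phpBB 2 template, where the post text sits in `span class="postbody"`, the author in `span class="name"`, and a quote in a `td class="quote"` between two postbody spans. That is an assumption about the target forum and has to be checked against its real HTML. The function name `get_posts` is made up.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Group the pieces of each post by author and keep quoted text separate.
sub get_posts {
    my ($html) = @_;
    my $p = HTML::TokeParser->new(\$html) or die "Cannot parse HTML";
    my @posts;
    while (my $tag = $p->get_tag('span', 'td')) {
        my $class = $tag->[1]{class} // '';
        if ($tag->[0] eq 'span' && $class eq 'name') {
            # a name span starts a new post
            push @posts, {
                name   => $p->get_trimmed_text('/span'),
                text   => [],
                quotes => [],
            };
        }
        elsif (@posts && $tag->[0] eq 'span' && $class eq 'postbody') {
            # the poster's own words; skip empty spans around a quote
            my $text = $p->get_trimmed_text('/span');
            push @{ $posts[-1]{text} }, $text if length $text;
        }
        elsif (@posts && $tag->[0] eq 'td' && $class eq 'quote') {
            # text quoted from an earlier post
            push @{ $posts[-1]{quotes} }, $p->get_trimmed_text('/td');
        }
    }
    return \@posts;
}
```

Each element of the result holds one author with two lists, so a post that contains a quote no longer shows up as several separate entries.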
Additionally, this script only looped over the first index page; it didn't run over more than that.
A loop still has to be set up to grab each of the index pages.
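A loop over the index pages could start from a list of URLs built like this. The sketch assumes that the forum pages its index the way phpBB 2 usually does, with a `start` offset in the query string and 50 topics per page. Both are assumptions to verify on the real board, and the total number of topics has to be read off the forum. The function name `index_page_urls` is made up.

```perl
use strict;
use warnings;

# Build the URL of every index page of one forum.
# Assumes paging of the form viewforum.php?f=N&start=OFFSET.
sub index_page_urls {
    my ($forum_url, $total_topics, $per_page) = @_;
    $per_page ||= 50;    # assumed default; check the board's settings
    my @urls;
    for (my $start = 0; $start < $total_topics; $start += $per_page) {
        # the first page is the plain forum URL without an offset
        push @urls, $start ? "$forum_url&start=$start" : $forum_url;
    }
    return @urls;
}
```

Each URL in the list can then be fetched in turn and its topic links collected, the same way the first index page is handled.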
Well, dtoronto and cowboy, I am a true Perl newbie, and I need your help.
What about the complete parsing (and harvesting) of both of these forums:
http://www.nukeforums.com/forums/viewforum.php?f=17
http://www.nukeforums.com/forums/viewforum.php?f=3
I look forward to hearing from you both, dtoronto and cowboy.