in reply to Scrape a blog: a statistical approach

G'day epimenidecretese,

I'm not completely sure what you're after. A small, representative sample of data and expected output would have helped.

If by "pile up all the code" you mean process all of your HTML files in a single run, then yes, getting a frequency count of all the tags in all the files "is possibile in a few lines of code".

Here's pm_1082082_html_tag_count.pl:

#!/usr/bin/env perl -l

use strict;
use warnings;

my %tags;

while (<>) {
    # Count every opening tag; closing tags (which start with '/') are skipped.
    ++$tags{$1} while /<(?!\/)([^ >]+)/g;
}

# Report tags in descending order of frequency.
print "$_\t$tags{$_}" for sort { $tags{$b} <=> $tags{$a} } keys %tags;

Here's some dummy input data. You can view those in a browser if you want: they render OK but they're not very interesting.

$ ls -l pm_1082082_*.html
-rw-r--r-- 1 ken staff 237 13 Apr 12:13 pm_1082082_1.html
-rw-r--r-- 1 ken staff 237 13 Apr 12:12 pm_1082082_2.html

$ cat pm_1082082_1.html
<h1 id="H-1">Heading 1</h1>
<p class="sub-heading">
Some <strong>bold</strong> and <em>italic</em> text.</p>
<h2 id="H-1-2">Heading 1.2</h2>
<p>Para1 (1.2.1)</p><p>Para2 (1.2.2)</p>
<p><strong>Fake newlines:</strong><br /><br><br /></p>

$ cat pm_1082082_2.html
<h1 id="H-2">Heading 2</h1>
<p class="sub-heading">
Some <strong>bold</strong> and <em>italic</em> text.</p>
<h2 id="H-2-2">Heading 2.2</h2>
<p>Para1 (2.2.1)</p><p>Para2 (2.2.2)</p>
<p><strong>Fake newlines:</strong><br /><br><br /></p>

Here's a sample run:

$ pm_1082082_html_tag_count.pl pm_1082082_*.html
p       8
br      6
strong  4
h1      2
em      2
h2      2

If that's not what you're after, you'll need to clarify what you do want and, as already mentioned, sample input and expected output will help.

[If you're unsure of what information to provide, the guidelines in "How do I post a question effectively?" should help.]

-- Ken

Re^2: Scrape a blog: a statistical approach
by epimenidecretese (Acolyte) on Apr 13, 2014 at 12:26 UTC

    I did a search and, if I understood correctly, what I am trying to do is remove the boilerplate from some HTML pages: is that the right term?

    So, I have a lot of data (web pages from 2005 to 2014) from the same blog, and I only want the text of the posts. That means I have plenty of data to provide as examples of what I don't want. In fact, in all the HTML files I have, the post text (title, date, etc.) should be more or less the only thing that changes. So I am trying to figure out how to statistically identify those lines of code that don't change.

    So far I have processed all the pages into one big HTML file. In it there are lots of lines that are the same. I want to count how frequently those lines occur, so as to be able to identify them as boilerplate.
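    A minimal sketch of that count might look like the following: it tallies each distinct line once per file and flags lines present in (almost) every file as likely boilerplate. The blank-line filter and the 90% threshold are assumptions to tune, not anything from the original post.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my %count;             # line => number of files containing it
    my $files = @ARGV;     # total number of input files

    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        my %seen;          # count each distinct line once per file
        while (my $line = <$fh>) {
            chomp $line;
            next unless $line =~ /\S/;            # ignore blank lines
            $count{$line}++ unless $seen{$line}++;
        }
        close $fh;
    }

    # Lines appearing in at least 90% of the files are likely boilerplate.
    for my $line (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$count{$line}/$files\t$line\n" if $count{$line} >= 0.9 * $files;
    }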

    With the (bad) code I posted I've already been able to strip lots of markup from the original HTML pages. Now I want to clean the rest: since I have lots of pages, I thought it would be a good idea to somehow weight the boilerplate lines (maybe with mutual information?).
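    As a sketch of the cleaning step, assuming the flagged lines have been saved to a (hypothetical) boilerplate.txt, one per line, a simple filter could then drop them from every page:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    # 'boilerplate.txt' is an assumed name: one known boilerplate line
    # per line, e.g. the lines flagged by the frequency count above.
    open my $bp, '<', 'boilerplate.txt' or die "Can't open boilerplate.txt: $!";
    my %boilerplate;
    while (my $line = <$bp>) {
        chomp $line;
        $boilerplate{$line} = 1;
    }
    close $bp;

    # Echo the pages given on the command line, minus the boilerplate lines.
    while (my $line = <>) {
        chomp $line;
        print "$line\n" unless $boilerplate{$line};
    }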

    I'll post some examples of code and the results I've got so far as soon as I can.

      You could take a diff between consecutive pages instead of counting lines. You'd have to experiment with different modules, e.g. HTML::Diff or Text::Diff, but this approach could also help when the style/layout changes over time.
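      For instance, here's a minimal sketch using Text::Diff (the file names are just placeholders): a unified diff omits the lines common to both pages, so what survives on the '+' side is roughly the newer page's post content.

      #!/usr/bin/env perl

      use strict;
      use warnings;
      use Text::Diff;

      my ($old, $new) = @ARGV;
      die "usage: $0 page1.html page2.html\n" unless defined $new;

      # Unified diff of the two pages; lines common to both (the
      # boilerplate) are left out of the output.
      my $diff = diff($old, $new, { STYLE => 'Unified' });

      # Keep only the lines unique to the second page, i.e. its content.
      for my $line (split /\n/, $diff) {
          print substr($line, 1), "\n" if $line =~ /^\+(?!\+\+)/;
      }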