in reply to Scrape a blog: a statistical approach

G'day epimenidecretese,

I'm not completely sure what you're after. A small, representative sample of data and expected output would have helped.

If by "pile up all the code" you mean process all of your HTML files in a single run, then yes, getting a frequency count of all the tags in all the files "is possibile in a few lines of code".

Here's pm_1082082_html_tag_count.pl:

#!/usr/bin/env perl -l

use strict;
use warnings;

my %tags;

while (<>) {
    # Count every opening tag; closing tags (which start with '/') are skipped.
    ++$tags{$1} while /<(?!\/)([^ >]+)/g;
}

# Report tags in descending order of frequency.
print "$_\t$tags{$_}" for sort { $tags{$b} <=> $tags{$a} } keys %tags;

Here's some dummy input data. You can view those in a browser if you want: they render OK but they're not very interesting.

$ ls -l pm_1082082_*.html
-rw-r--r-- 1 ken staff 237 13 Apr 12:13 pm_1082082_1.html
-rw-r--r-- 1 ken staff 237 13 Apr 12:12 pm_1082082_2.html

$ cat pm_1082082_1.html
<h1 id="H-1">Heading 1</h1>
<p class="sub-heading">
Some <strong>bold</strong> and <em>italic</em> text.</p>
<h2 id="H-1-2">Heading 1.2</h2>
<p>Para1 (1.2.1)</p><p>Para2 (1.2.2)</p>
<p><strong>Fake newlines:</strong><br /><br><br /></p>

$ cat pm_1082082_2.html
<h1 id="H-2">Heading 2</h1>
<p class="sub-heading">
Some <strong>bold</strong> and <em>italic</em> text.</p>
<h2 id="H-2-2">Heading 2.2</h2>
<p>Para1 (2.2.1)</p><p>Para2 (2.2.2)</p>
<p><strong>Fake newlines:</strong><br /><br><br /></p>

Here's a sample run:

$ pm_1082082_html_tag_count.pl pm_1082082_*.html
p       8
br      6
strong  4
h1      2
em      2
h2      2

If that's not what you're after, you'll need to clarify what you do want and, as already mentioned, sample input and expected output will help.

[If you're unsure of what information to provide, the guidelines in "How do I post a question effectively?" should help.]

-- Ken

Re^2: Scrape a blog: a statistical approach
by epimenidecretese (Acolyte) on Apr 13, 2014 at 12:26 UTC

    I did a search and, if I understood correctly, what I am trying to do is remove the boilerplate from some HTML pages: is that the right term?

    So, I have a lot of data (web pages from 2005 to 2014) from the same blog, and I only want the text of the posts. That means I have plenty of data to provide as examples of what I don't want. In fact, in all the HTML files I have, the post text (title, date, etc.) should be more or less the only thing that changes. So I am trying to figure out how to statistically identify those lines of code that don't change.

    So far I have processed all the pages into one big HTML file. In it there are lots of lines that are the same. I want to count how frequently those lines occur, so as to be able to identify them as boilerplate.
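    A minimal sketch of that count might look like the following: it tallies each distinct line once per file and flags lines present in (almost) every file as likely boilerplate. The blank-line filter and the 90% threshold are assumptions to tune, not anything from the original post.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my %count;             # line => number of files containing it
    my $files = @ARGV;     # total number of input files

    for my $file (@ARGV) {
        open my $fh, '<', $file or die "Can't open $file: $!";
        my %seen;          # count each distinct line once per file
        while (my $line = <$fh>) {
            chomp $line;
            next unless $line =~ /\S/;            # ignore blank lines
            $count{$line}++ unless $seen{$line}++;
        }
        close $fh;
    }

    # Lines appearing in at least 90% of the files are likely boilerplate.
    for my $line (sort { $count{$b} <=> $count{$a} } keys %count) {
        print "$count{$line}/$files\t$line\n" if $count{$line} >= 0.9 * $files;
    }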

    With the (bad) code I posted I've already been able to strip lots of markup from the original HTML pages. Now I want to clean the rest: since I have lots of pages, I thought it would be a good idea to somehow weight the boilerplate lines (maybe with mutual information?).
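    As a sketch of the cleaning step, assuming the flagged lines have been saved to a (hypothetical) boilerplate.txt, one per line, a simple filter could then drop them from every page:

    #!/usr/bin/env perl

    use strict;
    use warnings;

    # 'boilerplate.txt' is an assumed name: one known boilerplate line
    # per line, e.g. the lines flagged by the frequency count above.
    open my $bp, '<', 'boilerplate.txt' or die "Can't open boilerplate.txt: $!";
    my %boilerplate;
    while (my $line = <$bp>) {
        chomp $line;
        $boilerplate{$line} = 1;
    }
    close $bp;

    # Echo the pages given on the command line, minus the boilerplate lines.
    while (my $line = <>) {
        chomp $line;
        print "$line\n" unless $boilerplate{$line};
    }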

    I'll post some examples of code and the results I've got so far as soon as I can.

      You could take a diff between consecutive pages instead of counting lines. You'd have to experiment with different modules, e.g. HTML::Diff or Text::Diff, but this approach could also help when the style/layout changes over time.
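      For instance, here's a minimal sketch using Text::Diff (the file names are just placeholders): a unified diff omits the lines common to both pages, so what survives on the '+' side is roughly the newer page's post content.

      #!/usr/bin/env perl

      use strict;
      use warnings;
      use Text::Diff;

      my ($old, $new) = @ARGV;
      die "usage: $0 page1.html page2.html\n" unless defined $new;

      # Unified diff of the two pages; lines common to both (the
      # boilerplate) are left out of the output.
      my $diff = diff($old, $new, { STYLE => 'Unified' });

      # Keep only the lines unique to the second page, i.e. its content.
      for my $line (split /\n/, $diff) {
          print substr($line, 1), "\n" if $line =~ /^\+(?!\+\+)/;
      }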