Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Fellow monks,

I am trying to retrieve the number of chars of an HTML page.

This is my first shot

#!/usr/local/bin/perl510
use warnings;
use strict;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( );
$mech->agent_alias( 'Windows IE 6' );

my $url = 'http://www.somewhere.tld/';
$mech->get( $url );
$mech->success()
    or die "Get $url failed. " , $mech->response->status_line();

my $content = $mech->content(format => 'text');
my $len = length $content;
print "Number of Chars: $len\n";

This gives a number of chars that is close to the one M$-Office reports. (That is, copying the content of the same page from a browser and pasting it into M$-Word.)

There's about a 3% difference.

Is there a better way to count the chars?

Thanks.

Replies are listed 'Best First'.
Re: Counting chars in a HTML-page
by Perlbotics (Archbishop) on Aug 01, 2008 at 09:55 UTC
    Differences might occur due to:
    • encoding & conversion: UTF (more than 1 byte per character) or HTML entities
    • counting the HTTP-Header
    • HTML envelopes (content-type)
    • interpretation of 'k': 1000 vs. 1024
    • obscure issues when using the clipboard (OLE?)
    • update: linefeeds (cr/lf, lf)
    • ...
      Whether whitespace compression has occurred or not.
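Several of the causes listed above can be checked on a hand-counted string. The following is only a minimal sketch: the sub name count_stages and the sample text are made up for illustration, and the entity step handles nothing but &amp;, standing in for a real entity decoder.

```perl
use strict;
use warnings;

# Return the character count of $text after each normalization stage.
# count_stages is a hypothetical helper, not part of any module.
sub count_stages {
    my ($text) = @_;
    my %count = ( raw => length($text) );

    # Toy entity decoding: only &amp; is handled here (assumption);
    # a real script would use a complete entity decoder.
    ( my $decoded = $text ) =~ s/&amp;/&/g;
    $count{entities_decoded} = length($decoded);

    # Convert CR/LF line endings to plain LF.
    ( my $unix = $decoded ) =~ s/\r\n/\n/g;
    $count{lf_only} = length($unix);

    # Collapse every run of whitespace into a single space.
    ( my $collapsed = $unix ) =~ s/\s+/ /g;
    $count{whitespace_collapsed} = length($collapsed);

    return \%count;
}

my $sample = "Tom &amp; Jerry\r\n  say   hi\r\n";
my $count  = count_stages($sample);
printf "%-22s %d\n", $_, $count->{$_}
    for qw(raw entities_decoded lf_only whitespace_collapsed);
```

For this sample the four counts are 29, 25, 23 and 19, so each step alone moves the total by more than 3 %.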
Re: Counting chars in a HTML-page
by moritz (Cardinal) on Aug 01, 2008 at 09:48 UTC
    You'd have to know how Office counts. Is &amp; one character for Office, or 5? Does it count markup at all? How does it count multibyte characters? Per byte, per codepoint, or per grapheme?

    "Character" is a very broad term which isn't defined precisely in this context.

    And do you consider "your" result or the one from Office to be better? Why?

    (and, does it matter? What do you want to do with that result?)
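To make the byte / codepoint / grapheme distinction concrete, here is a small sketch that counts one string in all three ways. The sub name count_units and the sample string are assumptions chosen for illustration; Encode is a core module, and \X is Perl's regex escape for an extended grapheme cluster.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Count a decoded character string in three different units.
# count_units is a hypothetical helper name.
sub count_units {
    my ($chars) = @_;

    # Each \X match is one grapheme cluster (base character plus combining marks).
    my $graphemes = () = $chars =~ /\X/g;

    return {
        bytes      => length( encode( 'UTF-8', $chars ) ),  # octets after UTF-8 encoding
        codepoints => length($chars),                       # what length() reports
        graphemes  => $graphemes,
    };
}

# 'e' + COMBINING ACUTE ACCENT, then 't', then precomposed e-acute (U+00E9)
my $sample = "e\x{301}t\x{e9}";
my $units  = count_units($sample);
print "$_: $units->{$_}\n" for qw(bytes codepoints graphemes);
```

The sample contains three user-visible characters, yet it is 4 codepoints and 6 bytes in UTF-8, so the answer depends entirely on which unit the counting tool uses.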

Re: Counting chars in a HTML-page
by Corion (Patriarch) on Aug 01, 2008 at 09:57 UTC

    Maybe Microsoft Word is wrong?

Re: Counting chars in a HTML-page
by jethro (Monsignor) on Aug 01, 2008 at 10:31 UTC
    You might make a small test case where you can count by hand to get an absolute value, and then experiment with small changes to see where the difference comes from.

Re: Counting chars in a HTML-page
by Illuminatus (Curate) on Aug 01, 2008 at 15:20 UTC
    From the documentation on WWW::Mechanize:
    "Returns a text-only version of the page, with all HTML markup stripped. This feature requires HTML::TreeBuilder to be installed, or a fatal error will be thrown."
    My guess is that "all HTML markup stripped" is the cause of your discrepancy
      Update on size check
      You can use the stat command to look at the size, if you really want to know the unparsed size.
Re: Counting chars in a HTML-page
by Anonymous Monk on Aug 01, 2008 at 11:41 UTC
    Count bytes instead
      Never mind counting chars or bytes per page - I recommend counting pages. Much easier.

      I'd also venture that the results are typically close to one page per page, plus or minus 3%.

Re: Counting chars in a HTML-page
by Ywleskvy (Initiate) on Aug 02, 2008 at 08:03 UTC
    Perl's length() will count spaces and tabs as "characters." It also counts newlines as either one or two characters, depending on your platform. You can probably dodge this by counting characters this way:
    my $temp_string = "put stuff here...\n\t...\n";
    my $char_count = 0;
    $char_count++ while $temp_string =~ m/\S/g;
    Word ignores newlines and optionally whitespace, plus it auto-replaces some characters. You'll get 18 from the method above, which is also what Word reports -- if you make sure the ... isn't replaced with a single-character Unicode ellipsis.
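The same idea can be written so that the three candidate counts sit side by side. This is a sketch only: char_counts is a made-up helper name, and whether any of these numbers matches what Word shows for a given page is an assumption to verify, not something the code guarantees.

```perl
use strict;
use warnings;

# Count the same string three ways. char_counts is a hypothetical helper.
sub char_counts {
    my ($text) = @_;

    # Number of non-whitespace characters (list assignment in scalar context).
    my $without_whitespace = () = $text =~ /\S/g;

    # Copy of the text with CR and LF removed; tabs and spaces are kept.
    ( my $no_newlines = $text ) =~ tr/\r\n//d;

    return {
        all                => length($text),
        without_newlines   => length($no_newlines),
        without_whitespace => $without_whitespace,
    };
}

my $counts = char_counts("put stuff here...\n\t...\n");
print "$_: $counts->{$_}\n" for qw(all without_newlines without_whitespace);
```

For the test string above this prints 23, 21 and 18, so dropping newlines or all whitespace shifts the result noticeably even on a very short text.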
Re: Counting chars in a HTML-page
by Anonymous Monk on Aug 02, 2008 at 14:34 UTC
    Remember that length() will give you the total number of bytes including control characters, while M$ Office will only give you the number of glyphs (visible, symbolic characters).
      perldoc -f length
      Note the *characters*: if the EXPR is in Unicode, you will get the number of characters, not the number of bytes. To get the length in bytes, use "do { use bytes; length(EXPR) }", see bytes.
Re: Counting chars in a HTML-page
by jimX11 (Friar) on Aug 04, 2008 at 14:58 UTC

    Is the html valid? If not, is it at least well formed?

    Consider using wget and wc as another measurement:
     wget -O - 'http://www.somewhere.tld/' | wc
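Note that wc without options reports lines, words and bytes of the raw download, markup included, while wc -m reports characters. A rough Perl equivalent on an in-memory string might look like the sketch below; wc_counts is a hypothetical name, the sample markup is invented, and the text is assumed to be UTF-8 on the wire.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Approximate the numbers wc would print for a decoded character string.
# wc_counts is a hypothetical helper; UTF-8 is assumed as the byte encoding.
sub wc_counts {
    my ($chars) = @_;

    my $lines = ( $chars =~ tr/\n// );   # number of newline characters
    my @words = split ' ', $chars;       # whitespace-separated tokens

    return {
        lines => $lines,
        words => scalar @words,
        bytes => length( encode( 'UTF-8', $chars ) ),
        chars => length($chars),
    };
}

my $page = "<p>caf\x{e9} au lait</p>\n";
my $wc   = wc_counts($page);
print "$_: $wc->{$_}\n" for qw(lines words bytes chars);
```

Here the markup alone accounts for 7 of the 20 characters, which is why a raw-download count is not directly comparable with the text-only count from the original script.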