PerlMonks  

help for naming a module that aims latin utf8 coded corpus statistical analysis

by fernandes (Monk)
on Jun 19, 2007 at 15:28 UTC ( [id://622036]=perlquestion )

fernandes has asked for the wisdom of the Perl Monks concerning the following question:

Hi,
First of all, English is not my native language, so… (and the code below was originally written in UTF-8, so...)
I have written a module and would like to discuss its name before sending it to CPAN. The description at the top of the module is:
# $Id: CStatiBR.pm,v 1.0 2007/06/12 09:17:36 rpfernandes Exp $
#Copyright (c) 2007 Rodrigo Panchiniak Fernandes. All rights reserved.
#
#This program is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
=head1 NAME
Text::CStatiBR - performs corpora statistical analyses
=head1 SYNOPSIS
use CText::CStatiBR;
&Text::CStatiBR::CSTATIBR();
=head1 DESCRIPTION
Text::CStatiBR creates a seven column CSV file output with one line each
token per text given as input a corpus that files names follows
'1 (1). txt', '1 (2). txt', ..., '1 (n).txt' or
1 \(([1-9]|[1-9][0-9]+)\)\.txt
Columns stores statistical information:
(1) number of word forms in document d;
(2) number of tokens in d;
(3) Id number of d, ie., n;
(4) frequency of term t in d;
(5) corpus frequency of t ;
(6) document frequency of t (number of documents where t occurs at least once);
(7) t, UTF8 latin coded token-string delimited by /[ -@]|[\[-`]|[{-¿]|[ɐ-˩]|[ʹ-�]/
Main output file name is '1 (n + 5).txt' and it is stored in the same directory
as the corpus, together with residual files on each input file with .txu and .txv
ad hoc extensions.
This code was written under CAPES BEX-09323-5
=head2 Methods
Example:
#!/usr/bin/perl
use strict;
use Text::CStatiBR;
&Text::CStatiBR::CSTATIBR("5");
#5 files are analised.
#Main output
#file created is
#1 (10).txt
=over
=cut
Thanks!
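
For readers trying to picture the seven columns listed in the description, here is a minimal sketch. It is not the module's code. It assumes the documents are already in memory as decoded character strings, reads "word forms" as distinct terms, and splits tokens on any run of non-letter characters instead of the exact character ranges quoted above. The name corpus_rows is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a list of document strings and returns one
# row per (document, term) pair holding the seven values described above.
sub corpus_rows {
    my @docs = @_;
    my (%cf, %df, @tf);

    # First pass: per-document term counts, corpus and document frequency.
    for my $i (0 .. $#docs) {
        # Simplified tokenizer (an assumption): split on non-letters.
        my @tokens = grep { length } split /\P{L}+/, $docs[$i];
        my %count;
        $count{$_}++ for @tokens;
        $tf[$i] = \%count;
        for my $t (keys %count) {
            $cf{$t} += $count{$t};   # occurrences of t in the whole corpus
            $df{$t}++;               # documents containing t at least once
        }
    }

    # Second pass: emit rows now that the corpus totals are known.
    my @rows;
    for my $i (0 .. $#docs) {
        my $count  = $tf[$i];
        my $types  = keys %$count;   # (1) distinct word forms in d
        my $tokens = 0;              # (2) tokens in d
        $tokens += $_ for values %$count;
        for my $t (sort keys %$count) {
            push @rows, [ $types, $tokens, $i + 1,
                          $count->{$t}, $cf{$t}, $df{$t}, $t ];
        }
    }
    return @rows;
}

# Two tiny in-memory "documents"; each row is printed as one CSV line.
print join(',', @$_), "\n" for corpus_rows('a casa da casa', 'a rua');
```

Two passes are needed because the corpus and document frequencies of a term are only known after every document has been read.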

Replies are listed 'Best First'.
Re: help for naming a module that aims latin utf8 coded corpus statistical analysis
by GrandFather (Saint) on Jun 19, 2007 at 22:31 UTC

    I don't know about your choice of module name, but you do need to tidy up your pod. You must have blank lines before and after command paragraphs (lines starting with =). Even a single space on an otherwise blank line breaks some POD parsers.

    Blocks of text without intervening blank lines will get reflowed by POD parsers so that your numbered list and the text preceding it become a single blob of text. Similar nastiness happens to your code sample.

    For the full story see perlpod and actually test your pod with several different renderers.
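
The two spacing problems described above can be caught with a rough line scan before trying any renderer. The sketch below is not a POD parser: it assumes every command fits on a single line, it treats any line starting with "=" plus a letter as a command, and the name pod_spacing_problems is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of warning strings for commands that
# lack a surrounding blank line and for "blank" lines that hold whitespace.
sub pod_spacing_problems {
    my ($text) = @_;
    my @lines = split /\n/, $text, -1;
    my @problems;
    for my $i (0 .. $#lines) {
        my $n = $i + 1;
        # A line holding only spaces or tabs is not a true blank line.
        push @problems, "line $n: whitespace on blank line"
            if $lines[$i] =~ /^[ \t]+$/;
        # Command paragraphs start with "=" followed by a letter.
        next unless $lines[$i] =~ /^=[a-zA-Z]/;
        push @problems, "line $n: no blank line before command"
            if $i > 0 && $lines[$i - 1] ne '';
        push @problems, "line $n: no blank line after command"
            if $i < $#lines && $lines[$i + 1] ne '';
    }
    return @problems;
}

# The first three lines of the posted POD, run together as in the original.
my $pod = "=head1 NAME\n"
        . "Text::CStatiBR - performs corpora statistical analyses\n"
        . "=head1 SYNOPSIS\n";
print "$_\n" for pod_spacing_problems($pod);
```

A scan like this only flags spacing. It says nothing about whether the POD is otherwise valid, so checking the output of real renderers is still worthwhile.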

    Immediate fixes are to indent each line by several spaces for your list, ensure paragraphs are separated from each other by a blank line and make sure that blank lines don't contain white space. Consider:

    =head1 NAME

    Text::CStatiBR - performs corpora statistical analyses

    =head1 SYNOPSIS

        use CText::CStatiBR;
        &Text::CStatiBR::CSTATIBR();

    =head1 DESCRIPTION

    C<Text::CStatiBR> creates a seven column CSV file output with one line each
    token per text given as input a corpus that files names follows
    '1 (1). txt', '1 (2). txt', ..., '1 (n).txt' or

        1 \(([1-9]|[1-9][0-9]+)\)\.txt

    Columns stores statistical information:

        (1) number of word forms in document d;
        (2) number of tokens in d;
        (3) Id number of d, ie., n;
        (4) frequency of term t in d;
        (5) corpus frequency of t ;
        (6) document frequency of t (number of documents where t occurs at least once);
        (7) t, UTF8 latin coded token-string delimited by /[ -@]|[\[-`]|[{-¿]|[ɐ-˩]|[ʹ-�]/

    Main output file name is '1 (n + 5).txt' and it is stored in the same directory
    as the corpus, together with residual files on each input file with .txu and .txv
    ad hoc extensions.

    This code was written under CAPES BEX-09323-5

    =head2 Methods

    Example:

        #!/usr/bin/perl
        use strict;
        use Text::CStatiBR;
        &Text::CStatiBR::CSTATIBR("5");
        #5 files are analised.
        #Main output
        #file created is
        #1 (10).txt

    =cut

    renders using pod2html (and copying as plain text) as:

    * NAME
    * SYNOPSIS
    * DESCRIPTION
        o Methods

    NAME

    Text::CStatiBR - performs corpora statistical analyses

    SYNOPSIS

        use CText::CStatiBR;
        &Text::CStatiBR::CSTATIBR();

    DESCRIPTION

    Text::CStatiBR creates a seven column CSV file output with one line each
    token per text given as input a corpus that files names follows
    '1 (1). txt', '1 (2). txt', ..., '1 (n).txt' or

        1 \(([1-9]|[1-9][0-9]+)\)\.txt

    Columns stores statistical information:

        (1) number of word forms in document d;
        (2) number of tokens in d;
        (3) Id number of d, ie., n;
        (4) frequency of term t in d;
        (5) corpus frequency of t ;
        (6) document frequency of t (number of documents where t occurs at least once);
        (7) t, UTF8 latin coded token-string delimited by C<< /[ -@]|[\[-`]|[{-¿]|[ɐ-˩]|[ʹ-�]/ >>

    Main output file name is '1 (n + 5).txt' and it is stored in the same directory
    as the corpus, together with residual files on each input file with .txu and .txv
    ad hoc extensions.

    This code was written under CAPES BEX-09323-5

    Methods

    Example:

        #!/usr/bin/perl
        use strict;
        use Text::CStatiBR;
        &Text::CStatiBR::CSTATIBR("5");
        #5 files are analised.
        #Main output
        #file created is
        #1 (10).txt

    DWIM is Perl's answer to Gödel
Re: help for naming a module that aims latin utf8 coded corpus statistical analysis
by graff (Chancellor) on Jun 20, 2007 at 03:59 UTC
    I hope you are not really going to insist on using data file names that include spaces and parentheses ("1 (1).txt" and so on) as either input files or output files. That makes lots of things a lot more difficult for command-line usage involving file names (everything would need to be quoted and/or escaped). Please stick to alphanumerics, underscore, hyphen and period for file names.
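
As an illustration of that naming advice, here is a small sketch that maps a name such as "1 (3).txt" onto the restricted character set. The function name shell_safe_name and the choice of "_" as the replacement are assumptions, and two different inputs can collapse to the same output, so real code would need a collision check.

```perl
use strict;
use warnings;

# Hypothetical helper: keeps only alphanumerics, underscore, hyphen and
# period, replacing every run of other characters with a single "_".
sub shell_safe_name {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9_.-]+/_/g;   # collapse unsafe runs into "_"
    $name =~ s/_+(?=\.)//g;            # drop "_" left just before a period
    $name =~ s/^_+|_+$//g;             # trim leading and trailing "_"
    return $name;
}

print shell_safe_name('1 (3).txt'), "\n";   # prints 1_3.txt
```

Names in this form can be passed to shell commands without quoting or escaping.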

    I understand the difficulty of trying to write documentation in a foreign language. I hope you will have a chance to go over it with someone who knows both your native language and English well enough so that you can discuss the module comfortably with them, and they can clarify the English description. As it is, I would have to read the program code to understand how to use the module. (You might want to consider posting (a pointer to) the code for preliminary review by other monks.)

    It may be worthwhile to create a CPAN layer called "Text::Corpus::", which at first would contain just a "Stats" module (Text::Corpus::Stats), and later could contain other support modules for building, maintaining and using text corpora.

      Thank you for your suggestions and comments. Text::Statistics::Latin is published and registered. Other languages (or Unicode intervals) are coming soon. Text::Statistics::Devanagari, Text::Statistics::GreekAndCoptic, Text::Statistics::Cyrillic and Text::Statistics::Arabic are indexed on CPAN. Enjoy. If you know someone who would like to provide better English documentation, please get in touch.
