PerlMonks  

Discussing a name space

by fernandes (Monk)
on Sep 11, 2007 at 14:42 UTC ( [id://638311]=perlquestion )

fernandes has asked for the wisdom of the Perl Monks concerning the following question:

I would like to register the namespaces Text::Statistics::Cyrillic, Text::Statistics::GreekAndCoptic, Text::Statistics::Devanagari and Text::Statistics::Arabic.
These modules perform the same analysis as the already registered module Text::Statistics::Latin, but on different Unicode ranges.
They extract seven statistics from corpora written in Cyrillic, Greek or Coptic, Devanagari and Arabic characters. The fields are: text ID, term, term frequency, collection frequency, document frequency, total number of types per text, and total number of tokens per text. The output is a CSV file.
Does it look nice?
Thanks for voting yes or no; additional comments are welcome too.
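
To make the seven fields concrete, here is a minimal sketch of how they could be computed for Cyrillic text. It is not the code of any Text::Statistics module: the function name corpus_stats, the hash-of-texts input, and the choice to lowercase terms and to treat each run of \p{Cyrillic} characters as one token are all assumptions made for illustration.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper, not the module's real API.
# Input:  hash ref of { text_id => text }.
# Output: array ref of rows, one per (text, term) pair:
#   [ text ID, term, term frequency, collection frequency,
#     document frequency, types in text, tokens in text ]
sub corpus_stats {
    my ($corpus) = @_;
    my (%tf, %cf, %df, %tokens);
    for my $id (keys %$corpus) {
        # Assumption: a token is a maximal run of Cyrillic-script characters.
        for my $term (map { lc($_) } $corpus->{$id} =~ /\p{Cyrillic}+/g) {
            $df{$term}++ unless $tf{$id}{$term}++;   # first sighting in this text
            $cf{$term}++;                            # occurrences in the whole corpus
            $tokens{$id}++;                          # tokens in this text
        }
    }
    my @rows;
    for my $id (sort keys %tf) {
        my $types = keys %{ $tf{$id} };              # distinct terms in this text
        for my $term (sort keys %{ $tf{$id} }) {
            push @rows, [ $id, $term, $tf{$id}{$term}, $cf{$term},
                          $df{$term}, $types, $tokens{$id} ];
        }
    }
    return \@rows;
}

# Usage: print the rows as naive CSV (no quoting of fields).
binmode STDOUT, ':encoding(UTF-8)';
my $rows = corpus_stats({ t1 => 'мир труд мир', t2 => 'Мир май' });
print join(',', @$_), "\n" for @$rows;
```

Swapping \p{Cyrillic} for another script property such as \p{Greek}, \p{Devanagari} or \p{Arabic} is the only change this sketch would need for the other scripts.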

Replies are listed 'Best First'.
Re: Discussing a name space
by moritz (Cardinal) on Sep 11, 2007 at 14:49 UTC
    You should provide a Perl interface for the output as well (as a hash ref or something), not just CSV output.

    If you provide only CSV output and somebody wants to access the statistics from within Perl, it has to be parsed again - which is rather ugly.

      Someone told me this before you did... But there are a lot of modules for CSV parsing, and the output may be very large. From 300 texts, I've obtained 25 MB of statistical data (some cells are redundant and can be reindexed in an inverted fashion in a second stage). By the way, I use SPSS for generating meta-data, like tf-idf scores, and a CSV file is perfect, because SPSS can easily parse it.
      But I'm really interested in improving the module API. So, if you have time, I would love to receive your code.
        A CSV is just a representation of a two-dimensional array, so each time you write print OUTFILE join(',', @row); you can just push @rows, \@row (or push @rows, [@row] if @row is reused across iterations).

        The best idea is probably to let the user decide what to do with the data.

        At some point in your module you certainly have the data in an internal format - adding that to an array or hash should be trivial, as well as making that accessible to the user.
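
A minimal sketch of that idea, assuming the module hands back its rows as an array of array refs: the caller can then write CSV, build a lookup table, or both. The names rows_to_csv and index_by_term are hypothetical, and the CSV writer is deliberately naive (no quoting or escaping); for real data a CPAN module such as Text::CSV is the safer choice.

```perl
use strict;
use warnings;

# Hypothetical helpers; $rows is an array ref of array refs,
# e.g. [ [ 't1', 'alpha', 2 ], [ 't2', 'alpha', 1 ] ].

# Option 1: the caller still wants CSV (naive: fields are not quoted).
sub rows_to_csv {
    my ($rows) = @_;
    return join '', map { join(',', @$_) . "\n" } @$rows;
}

# Option 2: the caller wants to look rows up by term (column 1)
# without parsing any CSV again.
sub index_by_term {
    my ($rows) = @_;
    my %by_term;
    push @{ $by_term{ $_->[1] } }, $_ for @$rows;
    return \%by_term;
}

my $rows = [ [ 't1', 'alpha', 2 ], [ 't2', 'alpha', 1 ], [ 't2', 'beta', 4 ] ];
print rows_to_csv($rows);
my $index = index_by_term($rows);
print "alpha appears in ", scalar @{ $index->{alpha} }, " texts\n";
```

Keeping the rows in memory does cost RAM on a 25 MB result, so offering both an in-memory return value and a streaming CSV writer is a reasonable compromise.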

Re: Discussing a name space
by ikegami (Patriarch) on Sep 11, 2007 at 15:12 UTC

    Uh, you've already uploaded those to CPAN (Text::Statistics::)?!

    I took a peek at the Latin one. It isn't a module in the common sense. Modules provide tools that programs and other modules can use. Your module is actually an entire program that's useless outside of the example script in the documentation. A prime example is the module's claim that the program has ended. I'm surprised you didn't hardcode the source of the input as well.

      It can be used by programs and other modules, and can be used outside of the example script too. For example, you can have a program (or a module used by a program) that strips every tag from HTML files, passes the result to Text::Statistics, and processes the CSV output to obtain some other data.
      And "hardcode the source of the input" makes no sense to me in this context. See the modules for NLP, disambiguation, WordNet, parsing, and other linguistic stuff. Many of them work on files and generate files as output.
        I beg to differ. It can't, because your function sends "fim de programa" ("end of program", among other things) to STDOUT, even when it's not true. Your "module" assumes the script is there to serve it, while it should be the other way around.
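
The division of labour described here can be sketched as follows. This is an illustration under assumptions, not the Text::Statistics::Latin code: the package name My::TermCounter and the function count_terms are made up, and the tokenizer is a plain \w+ match.

```perl
use strict;
use warnings;

package My::TermCounter;    # hypothetical name, for illustration only

# Takes text as a string and returns a hash ref of term => count.
# It prints nothing and does not open any files, so any program
# can call it as often as it likes.
sub count_terms {
    my ($text) = @_;
    my %count;
    $count{ lc $_ }++ for $text =~ /\w+/g;
    return \%count;
}

package main;

# The script owns input, output and status messages.
my $counts = My::TermCounter::count_terms('to be or not to be');
print "$_: $counts->{$_}\n" for sort keys %$counts;
print "done\n";    # the script, not the module, decides when it is done
```

With this split, a caller that strips HTML first, or one that processes many texts in a loop, never sees stray status lines on STDOUT.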
Re: Discussing a name space
by leocharre (Priest) on Sep 11, 2007 at 15:59 UTC
    I'm confused. You mention they do the same as Latin but on different Unicode ranges.

    First off: they do not do the same as Latin at all, in that they are completely different languages.

    What you are pointing out is that the ones you have for these other languages do what Text::Statistics::Latin does for 'Latin', only more (correct?)

    The only suggestion that comes to mind: could you suggest or contribute a patch to Text::Statistics::Latin so that it works the same way as your other modules (perhaps just as an option)? Perhaps even as Text::Statistics::Latin::Ext.

    It would seem to me that if I were using many of these modules that do such similar things and one happened to act differently, it would be disturbing.

      Thank you very much for your relevant comments.
      But there is a problem: if Latin and Devanagari are independent Unicode scripts, why should Devanagari functionality be embedded in the Latin module?
      It is really a namespace challenge!
      There are two competing concerns: descriptive names and parsimonious names.

Approved by ww