Dear Brothers in code:
First of all, I'll say that it's only been a few months since I started using Perl, but since I'm using it intensively (many hours every day) I suppose I'll soon have enough level to participate more actively ;)
I have a degree in Librarianship and Information Science, and I would like to talk about the possible uses of Perl in the 'cultural' sciences, that is, linguistics, literature, history, sociology, etc. (I'm sorry, I'm Spanish and I'm not sure what those sciences are called in English).
Traditionally, at least in my country, those scientists (I'm one of them, really, though a bit different) have rejected computers and technology, declaring that machines 'dehumanize' man, etc. Well, I think we MUST make them understand that the computer is only a TOOL and can be very useful in their work.
And in the case of Perl, a very powerful tool, thinking especially of regexps and the possible uses of arrays and hashes...
The title of this meditation says 'bibliometric'. Bibliometrics is the science that analyzes the advances, growth, relations, etc. of science, based on its results. In simple words: applying statistical methods to scientific works and, using a series of bibliometric laws, analyzing the results... and what is better than Perl to analyze INFORMATION, that is, THOUSANDS of records, mainly in plain text, preformatted by a database export? Suppose I have a database of 50000 records of scientific publications, and it includes data about the citations to other authors (or the author himself) in each work. It would be possible, using Perl, to recreate a 'net of cites' that says "the author Smith is cited by Jones and cites..." and if you use a graphical library such as GD to VIEW the relations... you know what I mean ;)
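As a rough sketch of that 'net of cites' idea (the record layout and the author names here are invented for illustration — a real export would need its own parsing), suppose each exported record reduces to a pair "citing author;cited author". Two hashes of arrays are enough to walk the network in both directions:

```perl
# Sketch only: @records stands in for lines read from a database export,
# each one "citing_author;cited_author".
my @records = ("Jones;Smith", "Brown;Smith", "Smith;Garcia");

my %cites;      # author => list of authors he cites
my %cited_by;   # author => list of authors who cite him

for my $rec (@records) {
    my ($citing, $cited) = split /;/, $rec;
    push @{ $cites{$citing} },   $cited;
    push @{ $cited_by{$cited} }, $citing;
}

for my $author (sort keys %cited_by) {
    print "$author is cited by: @{ $cited_by{$author} }\n";
}
# Garcia is cited by: Smith
# Smith is cited by: Jones Brown
```

From there, feeding %cites and %cited_by to GD (one node per author, one edge per citation) is mostly bookkeeping.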
Here I post a practical example I made myself using the 'Law of Zipf', which analyzes the frequency of each word in a text and draws a series of very interesting conclusions (for example, automatic extraction of a text's significant words, automatic abstracts, determination of a language's 'empty words', etc.). It isn't very optimized (surely it can be optimized severely), but it works. If you pass it a plain text file as an argument (I've tried texts from Project Gutenberg, available at http://www.promo.net/pg/), it generates a CSV file with the same name that can be imported directly into Excel or similar, containing 3 columns: word; frequency of the word in the text; relative frequency of the word:
$file = $ARGV[0];
open LIBRO, "<$file" or die $!;   # 'open ..., "<$file" || die' never dies: || binds to the filename, so use 'or'
$file =~ /(.*)\.(.*)/;
$ar = $1;
@contenido = <LIBRO>;
foreach (@contenido) { chomp; }   # chomp, not chop: chop would eat the last letter of an unterminated line
$contenido = "@contenido";
$contenido =~ tr/[\.;\,:\"\'\(\)\?\!\-_\*0123456789]/ /;
$contenido =~ tr/[a-z]/[A-Z]/;
@palabras = split /\s/, $contenido;
foreach $palabra (@palabras) {
    if ($palabra ne "") {
        $PF{$palabra}++;
    }
}
@palabrasOK = keys %PF;
$npalabras = @palabrasOK;
while (($k, $v) = each %PF) {
    $freq = $v / $npalabras;
    $freq =~ tr/\./\,/;           # decimal comma, so Spanish Excel reads the number
    $transfor .= "$k;$v;$freq\n";
}
close LIBRO;
open LIBROOUT, ">$ar.csv" or die $!;
print LIBROOUT $transfor;
close LIBROOUT;
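Zipf's law itself predicts that, in natural language, a word's frequency times its rank is roughly constant. Once you have a word-frequency hash like %PF above, a few more lines let you check that. This is a sketch with an invented toy %freq standing in for real counts:

```perl
# Sketch: Zipf's law says frequency * rank ~ constant.
# %freq is invented sample data standing in for a real word-count hash like %PF.
my %freq = (THE => 120, OF => 60, AND => 40, TO => 30);

my $rank = 0;
for my $word (sort { $freq{$b} <=> $freq{$a} } keys %freq) {
    $rank++;
    printf "%-4s rank %d  freq %3d  rank*freq %d\n",
           $word, $rank, $freq{$word}, $rank * $freq{$word};
}
# For this (perfectly Zipfian) sample, rank*freq stays at 120 on every line.
```

On a real Gutenberg text the product won't be exactly constant, but plotting rank against frequency on log-log axes should give you something close to a straight line.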
Well, after all this boring stuff ;) I finish, comments and suggestions are welcome, of course
Byes
Ignatius Monk, The Ciberlibrarian Monk on the Perl Order ;)

Re: Cultural and Bibliometric Perl
by jeroenes (Priest) on Jun 29, 2001 at 15:26 UTC
    Some comments:
    1. use strict warnings and diagnostics or die
    2. The filename regex may produce unwanted results with unix-like filenames such as file.txt.bak. Think about which part you want to retain. You can find the first and last dot with index and rindex, respectively.
    3. You slurp the contents in array context only to join the array. You can also set $/ to undef:
      { local $/ = undef; $contenido = <LIBRO>; }
      You can leave the newlines intact; they will be caught by '\s'. Even better, tr will take care of that.
    4. Use lc or uc to change the case.
    5. You can simplify the translation, by complementing the list to the alphabetic range (see perlop):
      $contenido = uc $contenido; $contenido =~ tr/A-Z/ /cs;
    6. Use '\s+' rather than '\s', so you don't have to test for empty cases.
    7. You can get the total number without an array assignment: $npalabras = keys %PF; The scalar context will force an immediate size return.
    8. I would print to LIBROUT inside the while loop, so the system gets a chance to buffer nicely.
    It's quite a list, but I hope it will give you the chance to learn new idiom. Result:
    #....
    my $contenido;
    {
        local $/ = undef;          # slurp mode: read the whole file at once
        $contenido = <LIBRO>;
    }
    $contenido = uc $contenido;
    $contenido =~ tr/A-Z/ /cs;     # everything that is not A-Z becomes a single space
    my %PF;
    $PF{$_}++ for split /\s+/, $contenido;
    open LIBROUT, ">$ar.csv" or die $!;
    my $npalabras = keys %PF;
    for (keys %PF) {               # a plain for over the keys; 'while (keys %PF)' would never end
        print LIBROUT join ';', $_, $PF{$_}, $PF{$_} / $npalabras;
        print LIBROUT "\n";
    }
    Well, you see how the use of $_ simplifies things.

    Hope this helps,

    Jeroen
    "We are not alone"(FZ)

      Dear friend:
      Thanks a lot for your suggestions :). I wrote this code at the very beginning of my Perl efforts and hadn't opened it again until I pasted it into the post, so of course the code is very simple :) I can see that yours is much better.
      I'm about to buy the 3rd edition of Programming Perl on the net (because it hasn't reached the Spanish market yet) to get a better knowledge of the language :) The problem is that I'm completely self-taught in computer science (I program in VB, C and now Perl, but learning in my own way: buying books, downloading tutorials, etc.) and sometimes I advance too slowly. I suppose that with that reference book I'll pick up better programming skills from now on ;)
      Best regards
      Ignatius, the Ciberlibrarian Monk on the Perl Order ;)
Re: Cultural and Bibliometric Perl
by VSarkiss (Monsignor) on Jun 29, 2001 at 19:16 UTC

    Brother Ignatius,

    I understand and appreciate your comments.

    If you haven't discovered it yet, run, don't walk, to CPAN, the Comprehensive Perl Archive Network. There you'll find an immense collection of tools to help you in this type of work. In particular, scan the list of modules by category. Categories that you may find of interest include "String, Language, and Text Processing", "Filehandles, Input, and Output", and possibly "Data Type Utilities".

    There are other interesting tools right here in the Monastery, in the Code Catacombs. That area is categorized also; you may want to scan the Text Processing area.

    HTH