Quicksilver has asked for the wisdom of the Perl Monks concerning the following question:

I'm trying to adapt the simple Concordance Generator to insert the data into a database so that I can get another part of my programme to access it and use the data for various purposes.

use strict; use warnings; use DBI; my @theseWords; my @theseLines; my @found; my %Count; my %Line; my %theseWords; my ( $line, $word, $count, $LineNum ); my $file = "c:\\webroot\\dickens\\46.txt"; open ( IN, $file ) || die "$file not found\n"; @theseLines = <IN>; close (IN); chomp @theseLines; foreach $line ( @theseLines ) { $count++; $line = lc $line; $line =~ s/[.,:;?!]//g; while ( $line =~ /\b\w+\b/g ) { $word = $&; if ( $word =~ /\s/ || $word eq "" ) { next } $Count{$word}++; if ( defined $Line{$word} ) { $Line{$word} =~ m/(\d*?)$/; if ( $1 == $count ) { next; } else { $Line{$word} .= ", $count"; } } else { $Line{$word} = $count; } } } my $db = DBI->connect('dbi:mysql:lit:localhost', 'user', 'pw'); my $statement = qq( INSERT INTO dickens VALUES ($word, $Line{$word}) ); my $sth = $db->prepare($statement); @theseWords = keys %Line; @theseWords = sort @theseWords; foreach $word ( @theseWords ) { print ("$word: $Line{$word}\n\n"); $sth->execute(); } $sth->finish; $db->disconnect();

The data that is coming out at the moment is
yours: 1731, 2051, 2756
youth: 2742
zeal: 2111
and what I'm trying to achieve is
yours: 1731
yours: 2051
yours: 2756
youth: 2742
zeal: 2111
to insert into the db (as it chokes on the lines where there are more than one line reference) but so far I have been unsuccesful at doing this and would like to seek some guidance as to the best way of achieving this as the first step towards my goal (which is to also load a few surrounding words around the found word).

Many thanks for any guidance.

Replies are listed 'Best First'.
Re: Inserting the line references in a text file
by moritz (Cardinal) on Feb 27, 2008 at 09:43 UTC
    my $statement = qq( INSERT INTO dickens VALUES ($word, $Line{$word}) );

    No, don't do that. Use placeholders instead:

    my $sth = $dbh->prepare('INSERT INTO dickens VALUES(?, ?)'); # and then, as often as you want: $sth->execute($line, $count);

    I guess the database looks something like this:

    CREATE TABLE dickens ( word VARCHAR(30), line INTEGER, );

    If that's the case, you can't store multiple line numbers in one row. In that case you have to make multiple rows with the same word, or change the database layout.

    foreach $word ( @theseWords ) { my @lines = split m/ ,/, $Line{$word}; for (@lines){ $sth->execute($word, $_); } }

    Of course it's better to store the line numbers as arrays in the first place.

      Hi Moritz,
      Thanks for the pointers on the database. I am trying to split the lines so that each occurence of the word will have a separate line number as there is to be an extension to this work which requires them to be separate lines.
      Printing the data shows that it is printing out:
      yourself: 1791, 1792, 2440
      but what I'm trying to get is
      yourself: 1791
      yourself: 1792
      yourself: 2440
      to accommodate the creation of the new column later.
        That's what my last example inserts into the DB, isn't it?

        You can use the same schema to print it, of course.

Re: Inserting the line references in a text file
by thundergnat (Deacon) on Feb 27, 2008 at 16:05 UTC

    As an aside, that's a fairly poor regex to capture "words". It will have trouble with contractions, hyphenated words and words which contain characters not in the current locale. It may be good enough for what you are doing, but you might want to consider using something a little more robust.

    Still not perfect, but covering a lot more cases:

    use warnings; use strict; while ( my $line = <DATA> ) { while ( $line =~ /(\p{Alnum}+([-']\p{Alnum}+)*)/g ) { my $word = $1; print "$word\n"; # blah } } __DATA__ Tom's wouldn't, they're McCain-Feigngold maņana.
      Hi Thundergnat, This is still very much a work in progress and I have quite a lot of work to do on the word regex as well as parsing for various bits of grammar. What I was aiming for is something that works and inserts the data so that I can watch what the output is and tailor it to my needs. Thanks for the code, I'll keep it in mind as I continue development.