fanticla has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I am searching for inspiration. I have many different texts in a single .txt file. Each text is annotated with a tag (at the heading of each text). The structure looks like the following:

<text>Meta information about text 1</text>
This is text one.
....
This was text one.
<text>Meta information about text 2</text>
This is text two.
....
This was text two.

My software allows users to search for a word in the .txt file, and shows in a Tk::Text widget the portion of the text containing that word. This works fine. What I would like to do now is retrieve the content of the <text> annotation related to the text where the word has been found. Any idea how I could implement it? I am not searching for a peace -> piece ;) of code, but just for the right inspiration... Thanks. Cla

Re: Retrieving meta information from txt
by chrestomanci (Priest) on Dec 10, 2010 at 12:07 UTC

    The way I see it, you have two possible approaches.

    You could construct an index where each word of interest is stored with the meta information that it is associated with.

    open my $srcFile, '<', 'source.txt' or die $!;
    my $metaText = "";
    while ( my $line = <$srcFile> ) {
        if ( $line =~ m/<text>(.*)<\/text>/ ) {
            # Replace the old meta text with the new one.
            $metaText = $1;
        }
        else {
            # Normal line. The old meta text applies.
            my @words = split /\s+/, $line;
            foreach my $word (@words) {
                store_in_index($metaText, $word);
            }
        }
    }
    close $srcFile;

    In the example above, the store_in_index() function could be anything from storage in a simple hash to a relational database. It would depend on how much data you have and how long you want to keep it.
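For the simple-hash end of that range, store_in_index() might look like the sketch below. This is only an illustration: the %index layout and the lookup_meta() helper are assumptions made here, not part of chrestomanci's reply, and the sample strings are made up.

```perl
use strict;
use warnings;

# Hypothetical in-memory index: lowercased word => { meta text => occurrence count }.
my %index;

sub store_in_index {
    my ($metaText, $word) = @_;
    # split /\s+/ yields an empty first field for lines with leading whitespace.
    return unless length $word;
    $index{ lc $word }{$metaText}++;
}

# Hypothetical helper: every meta text under which a word occurs, sorted.
sub lookup_meta {
    my ($word) = @_;
    my @metas = sort keys %{ $index{ lc $word } || {} };
    return @metas;
}

# Small demonstration with made-up data.
store_in_index('Meta information about text 1', $_) for split /\s+/, 'This is text one.';
store_in_index('Meta information about text 2', $_) for split /\s+/, 'This is text two.';

print join("\n", lookup_meta('text')), "\n";
```

Because the words are split on whitespace only, punctuation stays attached ("two." rather than "two"); a real index would probably normalise the words before storing them.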

    An alternative approach would be to read the file backwards: first find the word the user is looking for, then search backwards from there for the meta information line it belongs to. Off the top of my head, I am not sure how that would be done, but I am sure there are ways.
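One possible way to do the backward search, assuming the file is small enough to hold in memory as an array of lines, is sketched below. The meta_for_line() name and the sample data are invented for illustration.

```perl
use strict;
use warnings;

# Return the meta text governing line number $hit (0-based) in @$lines,
# found by walking backwards until a <text>...</text> heading appears.
sub meta_for_line {
    my ($lines, $hit) = @_;
    for (my $i = $hit; $i >= 0; $i--) {
        if (my ($meta) = $lines->[$i] =~ m{<text>(.*?)</text>}) {
            return $meta;
        }
    }
    return;    # no heading precedes the hit
}

# Made-up sample data in the layout from the question.
my @lines = (
    "<text>Meta information about text 1</text>\n",
    "This is text one.\n",
    "This was text one.\n",
    "<text>Meta information about text 2</text>\n",
    "This is text two.\n",
);

# Find the first line containing the search word, then look backwards from it.
my $word = 'two';
my ($hit) = grep { $lines[$_] =~ /\Q$word\E/ } 0 .. $#lines;
print meta_for_line(\@lines, $hit), "\n" if defined $hit;
```

For files too large to slurp, the same idea would need a reader that can step backwards through the file on disk; that is outside this sketch.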

Re: Retrieving meta information from txt
by jethro (Monsignor) on Dec 10, 2010 at 12:18 UTC

    Instead of first searching and then looking for the tag, you could read the file so that each text is already separated from its tag.

    For example you can change $/ to "<text>" so that when you read that file, you read it not line by line, but text by text.
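A minimal sketch of the $/ idea follows. It reads from an in-memory string so that it is self-contained; the sample data, the @records structure and the regular expression that splits each chunk are assumptions made for the example.

```perl
use strict;
use warnings;

# Made-up sample data in the layout from the question.
my $data = join '',
    "<text>Meta information about text 1</text>\n",
    "This is text one.\n",
    "This was text one.\n",
    "<text>Meta information about text 2</text>\n",
    "This is text two.\n";

my @records;    # each entry: { meta => ..., body => ... }
{
    open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
    local $/ = '<text>';    # a "line" now ends at each opening tag
    while (my $chunk = <$fh>) {
        chomp $chunk;       # removes the trailing '<text>', if present
        # Skip the empty chunk before the first tag; split the rest into meta and body.
        next unless $chunk =~ m{\A(.*?)</text>\s*(.*)\z}s;
        push @records, { meta => $1, body => $2 };
    }
    close $fh;
}

# Report the meta information of every text that contains the search word.
my $word = 'two';
for my $rec (@records) {
    print "$rec->{meta}\n" if $rec->{body} =~ /\Q$word\E/;
}
```

Localising $/ inside a block keeps the changed record separator from affecting other reads in the program.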

    Or just parse line by line and store into a new index of an array whenever a line begins with '<text>'. Have an array for the tags and an array for the texts where corresponding tag and text have the same index and you have all the information "finely sliced at your fingertips" ;-)

      my @something;
      {
          my $before = [];
          my $after  = [];
          my $curr   = $before;
          my $meta   = [];
        LOOP:
          while (<$fh>) {    # $fh: the already-opened input file
              if (/^\Q..../) {    # the literal "...." line separates "before" from "after"
                  $curr = $after;
                  next LOOP;
              }
              if (/^</) {    # a new <text> heading starts the next record
                  # Save the finished record, skipping the empty one before the first heading.
                  push @something, { meta => $meta, before => $before, after => $after }
                      if grep { @$_ } $meta, $before, $after;
                  $before = [];
                  $after  = [];
                  $curr   = $before;
                  $meta   = [$_];
                  next LOOP;
              }
              push @$curr, $_;
          }
          # Save the last record, if it has any content.
          push @something, { meta => $meta, before => $before, after => $after }
              if grep { @$_ } $meta, $before, $after;
      }
Re: Retrieving meta information from txt
by bart (Canon) on Dec 10, 2010 at 12:00 UTC
    shows in a TK::Text Widget the portion of the text containing that word. This works fine.
    If that is working, then you almost have what you want. Most likely, the program now searches back up from the found word to a line containing "<text>" to find the beginning of the section, and looks further down for the next occurrence to find its end.

    So hook into the first part: as it must have searched for the start, all you have to do is retrieve the contents:

    my($meta) = substr($everything, $sectionstart) =~ /<text>(.*?)<\/text>/;
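If the program does not already keep the section start around, it could be computed with rindex, as in the sketch below. It assumes the whole file is held in one string; the meta_at() helper and the sample data are invented for illustration, while $everything and $sectionstart follow bart's naming.

```perl
use strict;
use warnings;

# Whole file in one string (made-up sample data).
my $everything = join '',
    "<text>Meta information about text 1</text>\n",
    "This is text one.\n",
    "<text>Meta information about text 2</text>\n",
    "This is text two.\n";

# Return the meta text of the section containing character offset $pos.
sub meta_at {
    my ($text, $pos) = @_;
    # rindex finds the last '<text>' that starts at or before $pos.
    my $sectionstart = rindex($text, '<text>', $pos);
    return if $sectionstart < 0;
    my ($meta) = substr($text, $sectionstart) =~ m{<text>(.*?)</text>};
    return $meta;
}

my $pos = index($everything, 'two');    # offset of the search hit
print meta_at($everything, $pos), "\n" if $pos >= 0;
```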
Re: Retrieving meta information from txt
by cdarke (Prior) on Dec 10, 2010 at 13:59 UTC
    If you are on Microsoft Windows using NTFS, then you can use an Alternate Data Stream (ADS), which is meant for exactly this purpose and can be used with, for example, Notepad. An ADS is addressed with a :suffix after the file name. For example, from cmd.exe:
    echo secret > fred.txt:hidden
    more < fred.txt:hidden
    secret
    The hidden ADS is not normally visible otherwise. In Perl you can open, read, and write ADS files in the same way as any other file.
Re: Retrieving meta information from txt
by elef (Friar) on Dec 10, 2010 at 18:19 UTC
    If all you want to do is add the header (meta info) to every hit, this can probably be done in two very simple lines of code: always store the current (latest) header and add it to the results.
    If you do the main data lookup line by line with a while loop or similar construct, and supposing that no line has more than one <text> tag pair (i.e. new texts are on new lines) and that the opening and closing text tags are always on the same line (i.e. there are no line breaks within the header), something like this should work:
    my $currentheader = '';
    while (my $line = <FILE>) {
        # If the current line has a header, save it; otherwise keep the last saved header.
        if ($line =~ /<text>(.*)<\/text>/) { $currentheader = $1 }
        # Your data search code goes here. A plain match on $searchword
        # (assumed to hold the user's word) stands in for it.
        if ($line =~ /\Q$searchword\E/) {
            print "$currentheader: $line";    # the header along with the hit
        }
    }

    This way you only go through the file once. Of course this may not be doable if you use some exotic solution for your search word matching instead of a while loop.

    Edit: I just noticed that chrestomanci already proposed a more elaborate execution of the same idea.
Re: Retrieving meta information from txt
by Anonymous Monk on Dec 10, 2010 at 11:41 UTC
    I am not searching for a peace of code, but just for the right inspiration...Thanks. Cla

    Peace of code, is that like peace of mind? If peace doesn't inspire you, maybe you're not looking for inspiration?