Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks

I am reading files from a directory /../test1; each text file has a URL (http://...) on its first line. I want to parse the URL and save everything that follows the label "content" into an array, together with the filename.

Input:
a text file with the url on top

http://www.yyy.com/store/application/meraqf?origin=rrr.jsp&event=link(goto)&content=/asp/administrative/catalog/products/Network/benefits.jsp

output:
<Textfile>
filename: {some.txt}
Keys: {asp,administrative,catalog,products,Network,benefits}
{some.txt,asp,administrative,catalog,products,Network,benefits}
</Textfile>

Reading the files from a directory into an array was not a problem:

my @textfiles = ();
while (<*>) {
    push(@textfiles, $_) if -f $_;
}

but I don't really know how to parse the line and combine it with the filename (already in an array). Here is what I did:

# Formatting question
my $label = "content=";
while (defined($textline = <IN>)) {
    next unless $textline =~ /\S/;          # ignore blank lines
    next if $textline =~ /^\s*".*"\s*$/;    # ignore message lines
    chomp($textline);
    # Extract keys in an output file
    my @try = $textline =~ /$label(.*?)\/(.*?)/g;
    for (@try) {
        print "{" . $_ . "," . "}";
    }
}

Note: I think I should identify the first "http line" before extracting words from the "content=" part, so that I don't need to parse the whole document.
Can somebody help me?
Thanks in advance for your help.

Replies are listed 'Best First'.
(jeffa) Re: Extracting info from URL into an array
by jeffa (Bishop) on May 26, 2003 at 22:35 UTC
    UPDATE:
    I didn't see that you are pulling the URIs out of files the first time I read your question. There really is no reason to use URI::Find if you have already found the URIs. ;) This code reads all text files (.txt extension) in your test1 folder. I used an absolute path in the glob instead of the .. metacharacter. I also assume that the files will always have the URI on the first line (that starts with a scheme) and will always end with the .txt extension.
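    use strict;
    use warnings;
    use URI;
    use File::Basename;

    my @suffix = qw(.jsp .html .asp .htm);

    for (</path/to/test1/*.txt>) {
        open(FH, $_) or die "Cannot open $_: $!";
        my $uri = URI->new(scalar <FH>);    # read only the first line
        close FH;
        next unless $uri->scheme;           # skip files that do not start with a URI

        my %q = $uri->query_form;
        next unless defined $q{content};    # skip URIs without a content parameter

        # directory parts become keys; the leading empty field is discarded
        my (undef, @key) = split(/\//, dirname($q{content}));
        # add the file name minus its extension
        push @key, basename($q{content}, @suffix);

        print
            "<Textfile>\n",
            "filename: {", basename($_), "}\n",
            "Keys: {", join(',', @key), "}\n",
            "</Textfile>\n",
        ;
    }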
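    Since the question asks for an array holding the filename plus the keys (rather than printed output), here is a minimal sketch of that step. It assumes the value of "content" has already been extracted from the URI; make_record and @records are hypothetical names used only for illustration, and only the core File::Basename module is needed.

```perl
use strict;
use warnings;
use File::Basename;

my @suffix = qw(.jsp .html .asp .htm);

# Hypothetical helper: build one record [filename, key1, key2, ...]
# from a file name and the already-extracted value of "content".
sub make_record {
    my ($file, $content) = @_;
    # directory parts become keys; the leading empty field is discarded
    my (undef, @key) = split /\//, dirname($content);
    # add the last path component minus its extension
    push @key, basename($content, @suffix);
    return [ basename($file), @key ];
}

my @records;
push @records, make_record(
    '/path/to/test1/some.txt',
    '/asp/administrative/catalog/products/Network/benefits.jsp',
);

# prints {some.txt,asp,administrative,catalog,products,Network,benefits}
print '{', join(',', @$_), "}\n" for @records;
```

    Each element of @records is an array reference, so the filename is always $_->[0] and the keys follow it.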

    ORIGINAL POST:
    Well, your request is confusing at best. If you want to parse URIs, URI::Find is a fine tool for doing so. Simply pass its find method a reference to a scalar holding the text (in my example I slurp it from the built-in DATA filehandle) and it will find the URIs for you. The constructor takes a reference to a subroutine (or an anonymous sub), and URI::Find will call it every time it encounters a URI. Here is some code that sort of Does What You Want. File::Basename is used to remove the extension ... but I am starting to think that a better approach would be to remove any extension and split on the forward slash. Anyways, it's a start:
    use strict;
    use warnings;
    use URI::Find;
    use File::Basename;

    # add more if needed
    my @suffix = qw(.jsp .html .asp .htm);

    # optionally open a file here and replace DATA
    # with the name of the filehandle you opened
    my $data = do {local $/;<DATA>};

    my $finder = URI::Find->new(\&call_back);
    $finder->find(\$data);

    sub call_back {
        my $uri = shift;
        my %q = $uri->query_form;
        my $content = $q{content};

        # using split like this is a hack ... improvements anyone?
        my (undef,@key) = split(/\//,dirname($content));

        # this will add the file name minus its extension
        push @key, basename($content,@suffix);

        # you could push these to an array instead of printing
        print "Filename: {", basename($content), "}\n";
        print "Keys: {", join(',',@key), "}\n\n";
    }

    __DATA__
    http://www.yyy.com/store/application/meraqf?origin=rrr.jsp&event=link(goto)&content=/asp/administrative/catalog/products/Network/benefits.jsp
    is this text automatically 'ignored'? yes, it is ;)
    http://foo.com/?content=/asp/management/catalog/products/Network/propaganda.asp
    http://foo.com/?content=/path/to/bar.html
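    The alternative mentioned above (remove any extension, then split on the forward slash) could look roughly like the following. This is only a sketch in core Perl: keys_from_path is a hypothetical name, and it assumes the extension is whatever follows the last dot in the final path component.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a "content" path into a list of keys by
# dropping the extension and splitting on "/".
sub keys_from_path {
    my ($path) = @_;
    return () unless defined $path && length $path;
    $path =~ s/\.[^.\/]+$//;                    # strip a trailing extension, if any
    return grep { length } split m{/}, $path;   # grep drops the empty leading field
}

my @keys = keys_from_path('/asp/administrative/catalog/products/Network/benefits.jsp');

# prints asp,administrative,catalog,products,Network,benefits
print join(',', @keys), "\n";
```

    Unlike the @suffix list, this strips any extension without having to enumerate them, at the cost of also truncating a final component that merely contains a dot.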

    jeffa

    L-LL-L--L-LL-L--L-LL-L--
    -R--R-RR-R--R-RR-R--R-RR
    B--B--B--B--B--B--B--B--
    H---H---H---H---H---H---
    (the triplet paradiddle with high-hat)
    
      Jeffa
      Thank you very much for your help!! :). I will try it now!!