Extracting info from URL into an array

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Monks

I am reading files from a directory /../test1; each text file has a URL (http://...) on its first line. I want to parse the URL and save everything that follows the label "content" into an array, together with the filename.

Input:
a text file with the url on top

http://www.yyy.com/store/application/meraqf?origin=rrr.jsp&event=link(goto)&content=/asp/administrative/catalog/products/Network/benefits.jsp

output:
<Textfile>
filename: {some.txt}
Keys: {asp,administrative,catalog,products,Network,benefits}
{some.txt,asp,administrative,catalog,products,Network,benefits}
</Textfile>

Reading the files from a directory into an array was not a problem:

my @textfiles = ();

# collect every plain file in the current directory
while (<*>)
{
    push(@textfiles, $_) if -f $_;
}

But I don't really know how to parse the line and put the result together with the filename (already in an array). Here is what I tried:

# Formatting question
my $label = "content=";

while (defined(my $textline = <IN>))
{
    next unless $textline =~ /\S/;             # ignore blank lines
    next if $textline =~ /^\s*".*"\s*$/;       # ignore message lines
    chomp($textline);

    # Extract keys in an output file
    my @try = $textline =~ /$label(.*?)\/(.*?)/g;

    for (@try) {
        print "{" . $_ . "," . "}";
    }
}

Note: I think I should identify the first "http" line before extracting the words after "content=", so that I don't need to parse the whole document.
Can somebody help me?
Thanks in advance for your help.
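
The idea in the note above (look only at the first line, then take what follows "content=") can be sketched in core Perl with no extra modules. This is a minimal sketch under stated assumptions: the helper names `first_line` and `content_keys` are hypothetical, the files are assumed to end in .txt and sit in the current directory, and the URL is assumed not to need percent-decoding.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first line of a file, or nothing on failure.
sub first_line {
    my ($path) = @_;
    open my $fh, '<', $path or return;
    my $line = <$fh>;          # scalar context: reads only the first line
    close $fh;
    return unless defined $line;
    chomp $line;
    return $line;
}

# Hypothetical helper: split whatever follows "content=" into path segments,
# dropping the extension of the last segment.
sub content_keys {
    my ($url) = @_;
    return unless defined $url && $url =~ m{^https?://}i;
    my ($content) = $url =~ /[?&;]content=([^&;\s]*)/ or return;
    $content =~ s/\.\w+$//;                        # strip e.g. ".jsp"
    return grep { length } split m{/}, $content;   # skip the empty leading field
}

# Build one row per file: the filename followed by its keys.
for my $file (grep { -f } glob '*.txt') {
    my @row = ($file, content_keys(first_line($file)));
    print '{', join(',', @row), "}\n" if @row > 1;
}
```

Each `@row` holds the filename followed by the keys, so it could be pushed onto an array of array references instead of being printed.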

(jeffa) Re: Extracting info from URL into an array by jeffa (Bishop) on May 26, 2003 at 22:35 UTC
UPDATE: I didn't see that you are pulling the URIs out of files the first time I read your question. There really is no reason to use URI::Find if you have already found the URIs. ;) This code reads all text files (.txt extension) in your `test1` folder. I used an absolute path in the glob instead of the .. metacharacter. I also assume that the files will always have the URI on the first line (that starts with a scheme) and will always end with the .txt extension.

use strict;
use warnings;
use URI;
use File::Basename;

my @suffix = qw(.jsp .html .asp .htm);

for my $file (</path/to/test1/*.txt>) {
    open(my $fh, '<', $file) or next;   # skip unreadable files
    my $uri = URI->new(scalar <$fh>);   # scalar context: first line only
    close $fh;
    next unless $uri->scheme;

    my %q = $uri->query_form;
    my (undef,@key) = split( /\//, dirname($q{content}) );
    push @key, basename($q{content},@suffix);

    print
        "<Textfile>\n",
        "filename: {", basename($file), "}\n",
        "Keys: {", join(',',@key), "}\n",
        "</Textfile>\n",
    ;
}

ORIGINAL POST:

Well, your request is confusing at best. If you want to parse URIs, URI::Find is a fine tool for doing so. Simply pass it a reference to a scalar (in my example I use the built-in DATA filehandle) and it will find the URIs for you. You can also pass a reference to a subroutine (or an anonymous sub) and URI::Find will call it every time it encounters a URI.

Here is some code that sort of Does What You Want. File::Basename is used to remove the extension ... but I am starting to think that a better approach would be to remove any extension and split on the forward slash. Anyways, it's a start:

use strict;
use warnings;
use URI::Find;
use File::Basename;

# add more if needed
my @suffix = qw(.jsp .html .asp .htm);

# optionally open a file here and replace DATA
# with the name of the filehandle you opened
my $data = do {local $/;<DATA>};

my $finder = URI::Find->new(\&call_back);
$finder->find(\$data);

sub call_back {
    my $uri = shift;
    my %q = $uri->query_form;
    my $content = $q{content};

    # using split like this is a hack ... improvements anyone?
    my (undef,@key) = split(/\//,dirname($content));

    # this will add the file name minus its extension
    push @key, basename($content,@suffix);

    # you could push these to an array instead of printing
    print "Filename: {", basename($content), "}\n";
    print "Keys: {", join(',',@key), "}\n\n";
}

__DATA__
http://www.yyy.com/store/application/meraqf?origin=rrr.jsp&event=link(goto)&content=/asp/administrative/catalog/products/Network/benefits.jsp

is this text automatically 'ignored'? yes, it is ;)

http://foo.com/?content=/asp/management/catalog/products/Network/propaganda.asp

http://foo.com/?content=/path/to/bar.html

jeffa

L-LL-L--L-LL-L--L-LL-L--
-R--R-RR-R--R-RR-R--R-RR
B--B--B--B--B--B--B--B--
H---H---H---H---H---H---
(the triplet paradiddle with high-hat)
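
On the "improvements anyone?" comment in the callback: one possible variant is to let `fileparse` from File::Basename strip whatever final extension is present, using the regex suffix pattern shown in that module's documentation, so that no fixed suffix list is needed. This is only a sketch, not a replacement for the code above; the helper name `keys_from_content` is hypothetical, and Unix-style paths are assumed.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

# Hypothetical helper: turn a "content" path into its list of keys.
sub keys_from_content {
    my ($content) = @_;

    # qr/\.[^.]*/ matches any final extension (.jsp, .asp, .html, ...)
    my ($name, $dirs) = fileparse($content, qr/\.[^.]*/);

    # grep drops the empty field produced by the leading slash
    return ((grep { length } split m{/}, $dirs), $name);
}

print join(',',
    keys_from_content('/asp/administrative/catalog/products/Network/benefits.jsp')
), "\n";
```

Inside `call_back`, the returned list could stand in for `@key`. Note that only the last extension is removed, so a name such as `archive.tar.gz` would become `archive.tar`.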
Re: (jeffa) Re: Extracting info from URL into an array by Anonymous Monk on May 27, 2003 at 15:07 UTC
Jeffa, thank you very much for your help!! :) I will try it now!!