francisxavier1234 has asked for the wisdom of the Perl Monks concerning the following question:

I have a directory of HTML files. Each file contains multiple references to .txt files. I need to collect all the filenames referenced in them for further processing.

my @txtFiles = grep{/\.txt/}<$inputfile> ;
HTML file:
------------------
<a href= abc1.txt target="_blank">bla bla bla<a href= abc2.txt target="_blank"><b><Font face= "verdana" size = "0.5"> Click for snap shot </font></b></a></p></TD><TD> <Font face= "verdana" size = "2" >Report is generated with selected values sorted in Ascending order</Font></TD><TD align="center"> <Font face= "verdana" size = "2" </TD><TD align="center"> <Font face= "verdana" size bla bla bla...
------------------
I need abc1.txt and abc2.txt in @txtFiles.

Replies are listed 'Best First'.
Re: Get a list of all txt file names listed in a html file
by davido (Cardinal) on Jan 31, 2014 at 06:10 UTC

    Now that you've added code...

    Well, that's not going to work. Unless you've altered $/, <$inputfile> reads the input line by line, and grep returns every whole line that contains ".txt" anywhere in it (possibly several times), not the filenames themselves.
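
    To make the line-by-line behaviour concrete, here is a minimal sketch. The two-line sample string and the variable @matched_lines are made up for illustration, and an in-memory filehandle stands in for a real file:

```perl
use strict;
use warnings;

# Hypothetical two-line sample standing in for an HTML file.
my $html = qq{<a href= abc1.txt target="_blank">one</a> <a href= abc2.txt>two</a>\n}
         . qq{<p>no links here</p>\n};

# In-memory filehandle, so the example needs no file on disk.
open my $inputfile, '<', \$html or die "Cannot open in-memory file: $!";

# In list context <$inputfile> yields one element per line,
# and grep keeps or drops each line as a whole.
my @matched_lines = grep { /\.txt/ } <$inputfile>;
close $inputfile;

print scalar(@matched_lines), " line(s) matched\n";   # 1 line(s) matched
print $matched_lines[0];                              # the entire first line, both links included
```

    Both filenames sit on the same line, so grep hands back that single line intact rather than two filenames.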

    HTML files aren't typically too big to slurp. One problem with matching file names is that a lot of troublesome characters (spaces, for instance) are allowed in them. Your sample data doesn't seem to have any space characters embedded in file names, so maybe you could give this a shot:

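    local $/ = undef;            # no record separator: the next read returns the whole file
    my $input = <$inputfile>;
    my( @txtFiles ) = $input =~ m/\s(\S+\.txt)\s/g;   # whitespace-delimited names ending in .txt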

    It's always fragile to deal with HTML using regular expressions, and also fragile to try to detect filenames using regular expressions, so test thoroughly.
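
    One way to narrow the match, sketched here under assumptions rather than offered as a robust solution: anchor the pattern on the href attribute instead of on surrounding whitespace. The helper name extract_txt_links and the shortened sample string are hypothetical. It is still a regular expression, so the same fragility warning applies, and it will not handle names containing spaces or quotes.

```perl
use strict;
use warnings;

# extract_txt_links is a hypothetical helper, not from the thread.
# Returns every href value ending in .txt (case-insensitive),
# whether the attribute value is quoted or unquoted.
sub extract_txt_links {
    my ($html) = @_;
    # With /g in list context, the match returns all captured groups.
    my @found = $html =~ m{href\s*=\s*["']?([^"'\s>]+\.txt)(?=["'\s>]|\z)}gi;
    return @found;
}

# Shortened, made-up version of the sample data from the question.
my $sample = q{<a href= abc1.txt target="_blank">bla<a href= abc2.txt target="_blank">x</a>};

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $input = do { local $/; <$fh> };   # slurp; $/ is undef only inside this block
close $fh;

my @txtFiles = extract_txt_links($input);
print "$_\n" for @txtFiles;
```

    The lookahead at the end checks for a delimiter without consuming it, so a name such as z.txtfile is rejected and two links separated by a single space are both found.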


    Dave

      Thank you! Thank you! Thank you Dave!
Re: Get a list of all txt file names listed in a html file
by Not_a_Number (Prior) on Jan 30, 2014 at 21:08 UTC

    That's very interesting. What parts of your code are you having problems with?

Re: Get a list of all txt file names listed in a html file
by 2teez (Vicar) on Jan 30, 2014 at 21:40 UTC

    Hi francisxavier1234,
    Welcome to the Monastery.
    Monks here are more than willing to help in many ways, but you need to show some effort on your part. Please read "How do I post a question effectively?" to start with.

    If you tell me, I'll forget.
    If you show me, I'll remember.
    If you involve me, I'll understand.
    --- Author unknown to me
Re: Get a list of all txt file names listed in a html file
by CountZero (Bishop) on Jan 30, 2014 at 21:52 UTC
    Don't forget to show us a significant part of your HTML file. It will assist us in helping you.

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

    My blog: Imperial Deltronics