in reply to substitution of illegal chars in filename

Oh yeah, putting things like ampersands and quotes into file names is one of the "features" of wget that tends to land that tool on my "do not use" list. I'd rather spend a little more time probing a web site myself, using a Perl script with the LWP module to focus on the set of URLs I really want -- and as I fetch each page, assign it a sensible file name (with no shell-magic characters) and save it locally.
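
The fetch-and-rename idea can be sketched in a few lines of Perl with LWP. This is a minimal sketch, not a polished tool: the `sensible_name` sub and its exact substitution pattern are my own illustration, and the URLs are assumed to arrive on the command line.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;

# Turn a URL into a file name with no shell-magic characters.
# (The pattern here is illustrative -- adjust to taste.)
sub sensible_name {
    my ($url) = @_;
    (my $name = $url) =~ s{^https?://}{};    # drop the scheme
    $name =~ s/[^A-Za-z0-9._-]+/_/g;         # squash anything unsafe
    return $name . '.html';
}

my $ua = LWP::UserAgent->new;
for my $url (@ARGV) {                        # URLs given on the command line
    my $resp = $ua->get($url);
    next unless $resp->is_success;
    my $file = sensible_name($url);
    open my $fh, '>', $file or die "can't write $file: $!";
    print {$fh} $resp->decoded_content;
    close $fh;
}
```

With no arguments the loop simply does nothing, so the naming logic can be exercised on its own.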

But maintaining the linkages among the hrefs inside each file is a bit more challenging. jeffa's reply has the basic approach: convert all the wget-assigned file names to sensible names first (making sure to avoid collisions), rename the files, and keep the old-to-new relations in a hash; then, for each file in the harvest, replace every occurrence of a wget-style (CGI-based) file name string with the corresponding sensible name. Tedious, but not so difficult.
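
The two passes jeffa describes might look like this -- a sketch, with the naming rule and sub names invented here for illustration:

```perl
use strict;
use warnings;

# Pass 1: map each wget-assigned name to a sensible one, avoiding collisions.
sub build_rename_map {
    my (@old_names) = @_;
    my (%map, %taken);
    for my $old (@old_names) {
        (my $new = $old) =~ s/[^A-Za-z0-9._-]+/_/g;
        my ($base, $n) = ($new, 1);
        $new = $base . '.' . $n++ while $taken{$new}++;   # suffix on collision
        $map{$old} = $new;
    }
    return %map;
}

# Pass 2: replace every old name inside a page's HTML with its new name.
sub rewrite_links {
    my ($html, %map) = @_;
    for my $old (keys %map) {
        my $q = quotemeta $old;    # the old names are full of regex metachars
        $html =~ s/$q/$map{$old}/g;
    }
    return $html;
}
```

The `quotemeta` is the one easy-to-forget step: wget-style names like `show.cgi?id=1&page=2` are full of regex metacharacters, so they must be escaped before being used as a search pattern. Pass 1 would also be the point to actually `rename` the files on disk.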

Replies are listed 'Best First'.
Re: Re: substitution of illegal chars in filename
by lahf (Initiate) on Oct 10, 2003 at 14:06 UTC
    I wouldn't say it's so much a feature as an automatic file name: wget hasn't been given the chance to be clever and save, say, the address of this file with all those %20s (which are supposed to be spaces), the %3As (which are colons, I think), and the ?s as well, none of which are automatically replaced by an equivalent character. Maybe it would be better to isolate the code in wget so it changes them itself. The only thing is, I'm not a coder. I can do the odd thing, but I feel like I'd have to learn the whole language first, which I don't want to do. I just want to know what the pieces I need are, how to use them, and what other essential things I'd have to put in the script. I already spend hours poring through HTML & PHP, and VB, C, C++ -- but not Perl yet.
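
    For what it's worth, those escapes follow one fixed rule (%20 is a space, %3A a colon, %3F a question mark), so undoing them doesn't take learning the whole language -- one substitution does it. A sketch (the sub name is just for illustration):

```perl
use strict;
use warnings;

# Decode %XX escapes: each two-hex-digit code becomes the character it stands for.
sub unescape {
    my ($name) = @_;
    $name =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/ge;
    return $name;
}
```

    The `/e` modifier makes the replacement side run as code, so `chr(hex($1))` turns each pair of hex digits back into its character.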