in reply to Re: Trouble with some of IDDB Public Methods
in thread Trouble with some of IMDB Public Methods
It may be easier/more maintainable if you parse the data you want yourself. See Re^2: running an example script with WWW::Mechanize* module for a mojo based example of scraping data from IMDB.
I was actually able to fix, or should I say add some better error catching to, the IMDB::Film modules. Given my level, I'm a little bit proud of myself.
Looks like WWW::Mechanize is the way to go. Looks like I'm in for a long road of learning. I can definitely parse the main page, but for some of the stuff I'm going to have to find a way to follow links, yada yada yada. But hey, I've come this far!
Thanks again
|
Re^3: Trouble with some of IDDB Public Methods
by marto (Cardinal) on Dec 30, 2020 at 09:00 UTC | |
"I was actually able to fix, or should I say add some better error catching to, the IMDB::Film modules. Given my level, I'm a little bit proud of myself."
Well done. It seems there are various patches for different things in the rt.cpan queue, and the GitHub repo seems to have some merges. This could be worth pursuing depending on how much time you want to devote to it. Alternatively, you should be able to use the mojo based solution from earlier as a starting point to get just what you need. If you have any problems with that, just post and I'll take a look. | [reply] |
by Aldebaran (Curate) on Jan 01, 2021 at 04:45 UTC | |
OP seems to have found what he wanted, so I thought I might use the opportunity to ask marto (or anyone else who can bake from scratch with mojo) to further explore the script he posted in Re^5: polishing up a json fetching script for weather data. It might be an improvement to a script that marto characterized as suboptimal.

I certainly hope that we don't optimize away the comments and break up the logic as opposed to having just a train of arrows that online sources may have, with words whose provenance is unknown, like top in this example:
or json, there's nothing that makes keywords stand out, and where does one go to determine their provenance? How exactly are you going to disambiguate 'json'? The above came from the Mojo::UserAgent documentation. I understand that examples are selected for brevity. I would love to see a cache of them with many authors.

It seemed to me that having to hardcode the movie title like this was an area that can be improved:

my $imdburl = 'http://www.imdb.com/search/title?title=Caddyshack';

I couldn't get titles with multiple words to work at all. The search replaces spaces with plusses in the url, but interpolation with a lexical variable is just beneath mojo, even if it worked, which it doesn't. What I want is a script that shows me what's at this site from a mojo point of view, and this does so naively:
What does it show?
The first looks right, the second is empty, and the 3rd contains 61 k of javascript hell. The 4th and last was empty. Javascript isn't meant for human eyes, or let me be specific, I find it illegible, so I used the browser tools to look closer. I realize that I simply don't understand the javascript, and that's not mojo's fault. The browser tools give me this upon inspection and right click inside the search box:
Then I remembered that you can use mojo to do this instead:
Now I thought I was really in hot pursuit. I thought, "aha, I can find this id and post to it." So I went to look up find in Mojo::DOM, and I don't really understand the examples until I can work them myself and see them:
Finally, I got a usage for find that worked:
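For readers following along, here is a minimal sketch of what a working find call can look like. It is not the listing from this post: the HTML snippet and the ids `search-box` and `submit-btn` are made up for illustration, and it assumes Mojolicious is installed.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical markup standing in for a search form; not copied from IMDB.
my $html = <<'HTML';
<form action="/find">
  <input id="search-box" name="q" type="text">
  <input id="submit-btn" type="submit">
</form>
HTML

my $dom = Mojo::DOM->new($html);

# find() takes a CSS selector and returns a Mojo::Collection of matches;
# map(attr => 'id') calls ->attr('id') on each element in the collection.
my @ids = $dom->find('input[id]')->map(attr => 'id')->each;
print "$_\n" for @ids;

# at() returns only the first matching element, or undef if there is none.
my $box = $dom->at('#search-box');
print $box->attr('name'), "\n" if $box;
```

The selector strings are ordinary CSS selectors, so whatever the browser tools show for an element (an id, a class, an attribute) can usually be pasted into find or at more or less directly.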
Anyways, this was my final push and I seem to have come up short:
These are resources I drew from:
Thanks for comments, | [reply] [d/l] [select] |
by marto (Cardinal) on Jan 01, 2021 at 09:54 UTC | |
"I certainly hope that we don't optimize away the comments and break up the logic as opposed to having just a train of arrows that online sources may have, with words whose provenance is unknown, like top in this example:"
"or json, there's nothing that makes keywords stand out, and where does one go to determine their provenance? How exactly are you going to disambiguate 'json'?"
As with the cert attribute, just look at the post documentation. It's just encoding a Perl value to JSON, and posting it to an example site with TLS cert auth. Consider the longhand example of just the JSON part:
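A minimal sketch of that JSON step on its own, using the core JSON::PP module as a stand-in so it runs without Mojolicious (this is an illustration, not the original listing). The payload `{ top => 'secret' }` is an assumption about what the documentation example posts. The point is that `top` is only a hash key, plain data rather than a keyword.

```perl
use strict;
use warnings;
use JSON::PP ();

# Assumed payload: 'top' is just a key in an ordinary Perl hash.
my $payload = { top => 'secret' };

# Longhand version of the "json" step: encode the Perl value as a JSON
# string (UTF-8 bytes). canonical() sorts keys so the output is stable.
my $json = JSON::PP->new->utf8->canonical->encode($payload);

print "$json\n";    # prints {"top":"secret"}
```

That string is what ends up as the request body; the `json` word in the post call only selects which encoder is applied to the value that follows it.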
Following the appropriate links in the Mojo docs takes you to the relevant places.

"I couldn't get titles with multiple words to work at all. The search replaces spaces with plusses in the url, but interpolation with a lexical variable is just beneath mojo, even if it worked, which it doesn't. What I want is a script that shows me what's at this site from a mojo point of view, and this does so naively:"

A lazy way (since it's early on New Year's Day) would be to take my example, prompt for a film title and replace spaces with the plus sign. If you want to go down the route of automating forms, as mentioned before, make life easy on yourself and use the browser 'developer tools' to find the data you need for the form fields you care about. This is more effective than grepping in the dark from dumped results.
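A minimal sketch of just the "prompt and replace spaces" part, in core Perl only. It is not the script from this post: the helper name `search_url` and the fallback title are made up for illustration, and the base URL is the one quoted earlier in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: build the title-search URL for a free-text film title.
# Assumes $title holds raw bytes, e.g. as read from STDIN or @ARGV.
sub search_url {
    my ($title) = @_;
    $title =~ s/^\s+|\s+$//g;    # trim leading/trailing whitespace
    $title =~ s/\s+/ /g;         # collapse runs of whitespace to one space
    # Percent-encode everything except unreserved characters and spaces.
    $title =~ s/([^A-Za-z0-9\-._~ ])/sprintf('%%%02X', ord $1)/ge;
    $title =~ s/ /+/g;           # the search form uses '+' for spaces
    return "http://www.imdb.com/search/title?title=$title";
}

# Take the title from the command line, or prompt when run interactively.
my $title = join ' ', @ARGV;
if ($title eq '' && -t STDIN) {
    print 'Film title: ';
    $title = <STDIN> // '';
    chomp $title;
}
$title = 'The Blues Brothers' if $title eq '';    # arbitrary fallback example

print search_url($title), "\n";
```

The resulting string can then be handed to the get call in place of the hardcoded `$imdburl`.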
This example only differs from my original by a few verbose lines, and again is suboptimal, intended just to get you started. Obviously this is aimed at films, and if you search for a series rather than a film the resulting page has differences that you'd need to cater for.

If your intention is to take this further I'd strongly recommend using the browser developer tools. Don't get hung up on how Mojo can dump the page data and all its elements; this is mostly unimportant if you just want to automate an existing interface. Possible improvements include adding code to cater for different types of results (film, TV show), obvious error checking, and perhaps better prompting of results rather than assuming the first one is what the user means, e.g. a search for 'Batman' returns "The Batman (2022)" rather than "Batman (1966)".

Update: added spoiler tag explanation. | [reply] [d/l] [select] |
|
Re^3: Trouble with some of IDDB Public Methods
by Aldebaran (Curate) on Jan 01, 2021 at 06:16 UTC | |
I would love to see your source. I know that I've failed to get far with IMDB changing over the years.
"Looks like WWW::Mechanize is the way to go."
It's been my experience that WWW::Mechanize can't deal with javascript. This site is a 61 k clump of it. | [reply] |
by marto (Cardinal) on Jan 01, 2021 at 10:05 UTC | |
"It's been my experience that WWW::Mechanize can't deal with javascript. This site is a 61 k clump of it."
None of the Mojo stuff understands JavaScript either. If you want that, use WWW::Mechanize::Chrome, but you probably don't, since you're interested in the data displayed on the site, which isn't client side JavaScript. | [reply] |