PerlMonks  

Pixel scraping flash to extract text

by water (Deacon)
on Oct 22, 2005 at 11:24 UTC

water has asked for the wisdom of the Perl Monks concerning the following question:

Hi!

Sorry if this isn't a pure perl question; there might be perl involved in automating the solution....

I need to screen scrape a flash webapp -- yep, I can't get access to HTML.

The webapp presents tables of data I'm authorized to view, but I'd like to put the data in a spreadsheet so I can sort and plot it.

Maybe I can use perl to drive the flash app thru IE (haven't tried, but probably) using samie, but the flash app doesn't offer any way to dump data out of the darned thing...

ok, this is indeed horrible, but if I had to page thru the data screens and save screen shots as jpeg images or something, is there any way to pull text out of a jpeg using OCR or something? Quite horrible, indeed -- screen scraping at the pixel level -- but these data are worth it.

Ack!

Suggestions / ideas / comments most welcome --

Thanks!

water

(PS The folks providing this app won't take the time to modify it in any way or talk to me at this point, so the obvious "ask for a clean data dump" doesn't work here.)

(PPS The other fallback is to have an employee type in data from the screen -- that might take a few days of effort -- so there's a reasonable human fallback soln.)

Replies are listed 'Best First'.
Re: Pixel scraping flash to extract text
by fizbin (Chaplain) on Oct 22, 2005 at 13:05 UTC

    I think that your best bet is to hope that the data that populates the flash app is being retrieved by the flash application separately.

    That is, it is very likely that when the flash application starts, it makes a second connection back to the server to get the data. Otherwise, the people who wrote the app would have to constantly update their flash application when the data changes.

    So here's what you do: go get a network sniffer like ethereal. Start capturing packets and have it monitor any connection from your desktop machine to the domain that hosts the flash app. Then start the flash app, and go through to where you're looking at the data. At this point, stop capturing packets and look at what you have, and see if you can figure out how the flash app is getting the data from the server.

    If that doesn't tell you anything (for example, if the flash application connected back via https), then things start to get nastier. You might try taking the app apart using swftools (assuming that doing so doesn't violate some license agreement with the company that made the flash app); I'd start with SWFStrings to see if there are any urls mentioned in the app.
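If swftools isn't to hand, a rough stand-in for the "look for urls in the app" step can be sketched in Python. This is not SWFStrings itself, just a crude byte scan. It assumes the file is either an uncompressed SWF (signature `FWS`) or a zlib-compressed one (`CWS`), and that any URLs are stored as plain ASCII strings. The function names are made up for illustration.

```python
import re
import zlib

# Runs of printable, non-space ASCII that start with http:// or https://
URL_PATTERN = re.compile(rb"https?://[\x21-\x7e]+")


def swf_body(data):
    """Return the SWF payload after the 8-byte header, inflating it if needed."""
    signature = data[:3]
    if signature == b"FWS":   # uncompressed SWF
        return data[8:]
    if signature == b"CWS":   # everything after the header is zlib-compressed
        return zlib.decompress(data[8:])
    raise ValueError("not a plain or zlib-compressed SWF file")


def find_urls(data):
    """Return the distinct URL-looking strings in the file, in order of appearance."""
    seen = []
    for match in URL_PATTERN.findall(swf_body(data)):
        url = match.decode("ascii")
        if url not in seen:
            seen.append(url)
    return seen
```

Read the saved file in binary mode and pass the bytes to `find_urls`. Anything it prints is only a lead: URLs assembled at run time from pieces won't show up in a scan like this.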

    --
    @/=map{[/./g]}qw/.h_nJ Xapou cets krht ele_ r_ra/; map{y/X_/\n /;print}map{pop@$_}@/for@/
Re: Pixel scraping flash to extract text
by john_oshea (Priest) on Oct 22, 2005 at 13:08 UTC

    Long shot: if the originators of this data are in a country with reasonable disability laws, you could try the "well you have to make your data accessible" approach, which would then get you data scrapeable in a slightly less "ick" way...

    Failing that:

    If you can save the swf locally, I'd suggest looking at swfmill. Its swf2xml component will give you a (long) xml file with the source text accessible. This won't work if the text has been turned to outlines, or if the text is dynamically generated at run-time (which sounds like it might be the case).

    You could try one of the OCR packages/libraries available on Freshmeat, but I suspect you won't have sufficient resolution in your screenshots for them to work effectively (if at all).

    The only other thing that springs to mind (which is probably going to be very tedious) would be to manually capture each character as it currently appears as a separate graphic, and use those to then match against your screengrabs. That, of course, is going to be very fragile if pretty much anything in the output changes, but it might be worth a go.
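The character-matching idea can be sketched as follows, in Python. It assumes the screengrab has already been cut into single text lines and reduced to two colours, here written as strings of `#` (ink) and `.` (background), and that each captured glyph is the same height as the line and matches pixel for pixel. The tiny glyphs below are invented for the example, not taken from any real font.

```python
# Hypothetical 3-pixel-high glyph templates, one list of rows per character.
EXAMPLE_GLYPHS = {
    "7": ["###", "..#", "..#"],
    "0": ["###", "#.#", "###"],
    "1": ["#", "#", "#"],
}


def matches_at(image, glyph, x):
    """True if every row of the glyph equals the image slice starting at column x."""
    width = len(glyph[0])
    return all(row[x:x + width] == glyph_row
               for row, glyph_row in zip(image, glyph))


def read_line(image, glyphs):
    """Scan one line of pixels left to right and return the characters found."""
    # Try wider glyphs first so a narrow one can't steal part of a wide one.
    ordered = sorted(glyphs.items(), key=lambda item: -len(item[1][0]))
    total_width = len(image[0])
    text = []
    x = 0
    while x < total_width:
        for char, glyph in ordered:
            width = len(glyph[0])
            if x + width <= total_width and matches_at(image, glyph, x):
                text.append(char)
                x += width
                break
        else:
            x += 1  # nothing starts at this column; treat it as spacing
    return "".join(text)


EXAMPLE_IMAGE = [
    "###.###.#",
    "..#.#.#.#",
    "..#.###.#",
]
```

With the example data, `read_line(EXAMPLE_IMAGE, EXAMPLE_GLYPHS)` gives `"701"`. Exact matching like this breaks as soon as the renderer anti-aliases the text or the image is saved as a lossy jpeg, which is the fragility mentioned above, so lossless screenshots would be needed.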

    Good luck!

    Update:

    Having read fizbin's post (and had a brief forehead-slapping moment...) you could also try setting up HTTP::Recorder which looks like it might help with the SSL issue too.

Re: Pixel scraping flash to extract text
by johnnywang (Priest) on Oct 22, 2005 at 19:00 UTC
    I had a similar situation once, and I did the following: get a flash decompiler (google for it, I forgot which one I used), decompile the flash to see how it is getting the data (most likely by calling some http site), then write a perl script to get the data directly. That worked fine for me. The ethereal approach is also good if the protocol is open like http, but you won't be able to see cleartext https traffic.
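Once the data can be fetched directly, getting it into a spreadsheet is mostly a matter of flattening it to CSV. A minimal sketch, in Python rather than perl purely for illustration, assuming the server returns XML in which each record is a `<row>` element with one child element per column. That layout and the `row` tag name are guesses; the real format would have to be read off the captured response.

```python
import csv
import io
import xml.etree.ElementTree as ET


def xml_rows_to_csv(xml_text, row_tag="row"):
    """Flatten <row> elements into CSV text, one column per child tag."""
    root = ET.fromstring(xml_text)
    rows = [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.iter(row_tag)
    ]
    # Column order follows first appearance across all rows.
    columns = []
    for row in rows:
        for name in row:
            if name not in columns:
                columns.append(name)
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=columns, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()
```

Rows that lack a column get an empty cell, so ragged records still load cleanly into a spreadsheet.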
Re: Pixel scraping flash to extract text
by Joost (Canon) on Oct 23, 2005 at 01:53 UTC
Re: Pixel scraping flash to extract text
by Cody Pendant (Prior) on Oct 23, 2005 at 00:18 UTC
    If it was me, I'd be doing some very basic URL hacking. Flash these days is probably getting the data from an external XML file. There's a quiz application I know which gets the questions and answers from a file in the same directory called "quiz.xml" and a couple of others have their data in "data.xml", or "data/data.xml".

    A bit of creative noodling around along those lines and you might get lucky. Or maybe download their application to your HD and run it, to see if it chucks any useful error when it can't find the file.
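The guessing can be automated a little. A sketch in Python, assuming you know the URL the .swf itself is served from; the file names are just the ones mentioned above, and `candidate_urls` and `probe` are made-up helper names.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

# Guesses only: the names mentioned in this thread.
CANDIDATE_NAMES = ["quiz.xml", "data.xml", "data/data.xml"]


def candidate_urls(swf_url, names=CANDIDATE_NAMES):
    """Resolve each guessed file name relative to the directory of the .swf."""
    return [urljoin(swf_url, name) for name in names]


def probe(urls, timeout=10):
    """Return the URLs that answer with HTTP 200; failures are skipped quietly."""
    hits = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    hits.append(url)
        except (urllib.error.URLError, OSError):
            continue
    return hits
```

For a hypothetical `http://example.com/app/viewer.swf`, `candidate_urls` yields `http://example.com/app/quiz.xml`, `http://example.com/app/data.xml` and `http://example.com/app/data/data.xml`. If the app needs a login, `probe` as written sends no cookies, so a miss doesn't prove the file isn't there.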



    ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss')
    =~y~b-v~a-z~s; print
Re: Pixel scraping flash to extract text
by spiritway (Vicar) on Oct 23, 2005 at 01:41 UTC

    Not sure if this will help, but it seems to me that when you get this material, it resides in a cache on your computer. I've had good luck using Opera's cache to get hold of elusive data, though I don't recall ever getting anything from Flash that way. Once you have some sort of file, you can no doubt figure out a script to snag the data.

    If you try this approach, I highly recommend emptying your cache just before accessing the pages you want, to eliminate the often massive amount of old stuff you don't care about.
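Cache files often have meaningless names, so it can help to sort them by their first few bytes instead. A sketch in Python that walks a cache directory and flags files that start like a Flash movie (`FWS` or `CWS`) or an XML document (`<?xml`). The cache location differs per browser and is left as a parameter, and XML without a declaration would be missed.

```python
import os

# Leading bytes to look for, and a label for each.
SIGNATURES = {
    b"FWS": "swf",
    b"CWS": "swf (compressed)",
    b"<?xml": "xml",
}


def classify(path):
    """Return a label if the file starts with a known signature, else None."""
    with open(path, "rb") as handle:
        head = handle.read(16).lstrip()
    for magic, label in SIGNATURES.items():
        if head.startswith(magic):
            return label
    return None


def scan_cache(cache_dir):
    """Walk cache_dir and return (path, label) for every recognised file."""
    found = []
    for folder, _subdirs, files in os.walk(cache_dir):
        for name in sorted(files):
            path = os.path.join(folder, name)
            label = classify(path)
            if label:
                found.append((path, label))
    return found
```

Emptying the cache first, as suggested above, keeps the list that `scan_cache` returns short enough to check by hand.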
