Re: Overhead of Inline::Python?
by bliako (Abbot) on Jun 19, 2019 at 07:56 UTC
Shelling out or inlining once per file will cost you - that's common sense. If you can first compile the list of files to transfer in Perl and then send it *once* to the python script, the cost of shelling out or inlining will be much smaller, effectively insignificant.
Alternatively: modify the python script to sit in a loop and wait until it receives, via a socket, the full path of a file to be transferred. Once something is in the queue, python transfers it. From Perl you decide which files to send and just pass their names to the python "server", so the only per-file cost is opening a local socket connection. I assume the perl and python scripts sit locally and communicate with the provider's remote server, so what you pass to python is the filename on the local disk.
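The Perl side of that could be a simple sketch like this (untested; it assumes the python "server" listens on localhost:5000 and reads one full path per line - the port and file paths are made up):
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Socket::INET;

# hypothetical list of files already selected by your Perl logic
my @files_to_send = ('/data/export/file1.bin', '/data/export/file2.bin');

for my $file (@files_to_send) {
    # one short-lived local connection per file; the python "server"
    # does the actual transfer to the provider
    my $sock = IO::Socket::INET->new(
        PeerAddr => '127.0.0.1',
        PeerPort => 5000,
        Proto    => 'tcp',
    ) or die "cannot connect to the python server: $!";
    print $sock "$file\n";
    close $sock;
}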
bw, bliako
Re: Overhead of Inline::Python?
by stevieb (Canon) on Jun 18, 2019 at 19:54 UTC
Can you please link to the provider's API documentation?
It seems as though using Python through Perl is way overkill. If it's just a REST API using JSON, that can be done natively in Perl.
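For example, a plain JSON-over-HTTP call needs nothing exotic; the endpoint and token below are made up, just to show the shape of a native-Perl request:
#!/usr/bin/env perl
use strict;
use warnings;
use HTTP::Tiny;
use JSON::PP qw(encode_json decode_json);

# hypothetical endpoint and token
my $http = HTTP::Tiny->new;
my $res  = $http->post(
    'https://api.example.com/v1/objects',
    {
        headers => {
            'Content-Type'  => 'application/json',
            'Authorization' => 'Bearer YOUR_TOKEN_HERE',
        },
        content => encode_json({ name => 'file1.bin', size => 12345 }),
    },
);
die "request failed: $res->{status} $res->{reason}\n" unless $res->{success};
my $data = decode_json($res->{content});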
In response to the original question here, I have no experience with Inline::Python, so I can't speak to its performance.
Re: Overhead of Inline::Python?
by Corion (Patriarch) on Jun 19, 2019 at 07:25 UTC
If the provider offers Curl commands as "documentation", you can convert those Curl commands to Perl code using HTTP::Request::FromCurl or curl2lwp.
I haven't gotten around to writing generators for other HTTP backends. If performance is of interest to you, maybe Hijk is a bare-bones enough client.
Alternatively, take a look at running their command-line tool from within Perl in parallel, so you can handle more files at the same time. Also, maybe the command-line tool can handle multiple files/requests in one go, or maybe you can modify it to do so.
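A rough sketch of the parallel approach with Parallel::ForkManager (untested; the file list, upload command and destination are placeholders for whatever the provider's tool actually expects):
#!/usr/bin/env perl
use strict;
use warnings;
use Parallel::ForkManager;

# placeholder file list, upload command and destination
my @files       = glob('/data/export/*.bin');
my @upload_cmd  = ('their-upload-tool', 'cp');   # hypothetical tool
my $destination = 'remote://bucket/incoming';    # hypothetical target

my $pm = Parallel::ForkManager->new(4);          # at most 4 uploads at once
for my $file (@files) {
    $pm->start and next;                         # parent: queue the next file
    my $rc = system(@upload_cmd, $file, $destination);
    warn "upload failed for $file\n" if $rc != 0;
    $pm->finish($rc == 0 ? 0 : 1);
}
$pm->wait_all_children;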
Re: Overhead of Inline::Python?
by holli (Abbot) on Jun 19, 2019 at 09:52 UTC
Re: Overhead of Inline::Python?
by bliako (Abbot) on Jun 20, 2019 at 10:00 UTC
From the cloud documentation you supplied, it seems there is a way to do it via curl (albeit needing an auth token), which can be done with one of Perl's user agents (LWP::UserAgent, Hijk, etc.) once you translate the request parameters - see Corion's answer Re: Overhead of Inline::Python?. That's probably the best way, but it has the disadvantage of needing to obtain an auth token beforehand.
Here is my method. Your Perl script selects the files to upload and, when ready, writes a file with their details into a dir monitored by the following bash script. For each file appearing in that dir, the bash script initiates a transfer using the cloud provider's command-line tool gsutil cp. The transfers can be done in parallel. If there are no files to transfer, the bash script just sits there and waits. (A sketch of the Perl side follows the bash script below.)
Edit: sure, this is a bash script, but I think it is worth a mention in PM because it demonstrates sem, a Perl script by Ole Tange of GNU parallel fame that parallelises a list of tasks over at most N jobs. I am so happy to be reunited with my old friend GNU parallel that I hope my deviation into bash land will be excused...
bw, bliako
#!/bin/bash
# Unix GNU bash script to monitor a dir for '*.txt' files
# containing a tab-separated pair of local-filename remote-object
# which will be copied to the cloud.
# If successful, the signal file is moved to the done dir, else to the failed dir
# (subdirs of the monitor dir).
# The process can be parallelised over up to NUMTHREADS jobs using
# GNU's creme-de-la-creme sem (which is written in Perl).
# The monitor dir is given as the only input param and should already exist.
# The idea is that a separate Perl script will select the files to transfer
# and create a signal file inside the monitor dir containing the details
# of the transfer.
# by bliako
# for https://perlmonks.org/?node_id=11101534
# 20/06/2019
#####
NUMTHREADS=3
SLEEPTIME=2s
### nothing to change below
###
MONITOR_DIR=$1
if [ "${MONITOR_DIR}" == "" ] || [ ! -d "${MONITOR_DIR}" ]; then
    echo "$0 : a 'monitor-dir' name must be given as 1st param pointing to an existing, readable dir"
    exit 1
fi
DONE_DIR="${MONITOR_DIR}/done"
mkdir -p "${DONE_DIR}" &> /dev/null
FAILED_DIR="${MONITOR_DIR}/failed"
mkdir -p "${FAILED_DIR}" &> /dev/null
if [ ! -d "${DONE_DIR}" ] || [ ! -d "${FAILED_DIR}" ]; then
    echo "$0 : failed to create dir '${DONE_DIR}' and/or '${FAILED_DIR}'"
    exit 1
fi
# run one transfer command and move the signal file to done/ or failed/
function execu {
    local cmd="$1"
    local asignalfile="$2"
    local done_dir="$3"
    local failed_dir="$4"
    echo "execu() : called with cmd='${cmd}', asignalfile='${asignalfile}', done_dir='${done_dir}', failed_dir='${failed_dir}'"
    eval ${cmd}
    if [ $? -eq 0 ]; then
        echo "$0 : success executing ${cmd}" 1>&2
        mv "${asignalfile}" "${done_dir}"
    else
        echo "$0 : command has failed ${cmd}" 1>&2
        mv "${asignalfile}" "${failed_dir}"
    fi
}; export -f execu
totaldone=0
while true; do
    nowdone=0
    while IFS= read -r -d '' afwf; do
        # we found a signal file in the dir we are monitoring:
        # it must contain the fullpath of the file to transfer,
        # then a tab, then the remote object name
        echo "checking '${afwf}'"
        # NB: the array assignment word-splits, so names must not contain whitespace
        declare -a fde=($(head -1 "${afwf}" | cut -d$'\t' -f1,2))
        CMD="gsutil cp '${fde[0]}' 'gs://${fde[1]}'"
        echo "$0: executing ${CMD} ..."
        if [ "${NUMTHREADS}" -gt 1 ]; then
            echo "$0 : parallelising over ${NUMTHREADS} ..."
            sem -j${NUMTHREADS} execu "'${CMD}'" "'${afwf}'" "'${DONE_DIR}'" "'${FAILED_DIR}'"
        else
            echo "$0 : executing ..."
            execu "${CMD}" "${afwf}" "${DONE_DIR}" "${FAILED_DIR}"
        fi
        nowdone=$((nowdone+1))
    done < <(find "${MONITOR_DIR}" -maxdepth 1 -type f -name '*.txt' -print0)
    totaldone=$((totaldone+nowdone))
    echo "$0 : sleeping some before next monitor, done ${totaldone} so far"
    sleep ${SLEEPTIME} # sleep some
done
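And the Perl side only has to drop a signal file into the monitored dir, something like this sketch (the monitor dir, paths and object names below are made up):
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# hypothetical monitor dir, local file and remote object name
my $monitor_dir = '/var/spool/gsuploads';
my ($local, $remote) = ('/data/export/file1.bin', 'my-bucket/incoming/file1.bin');

# write to a .tmp name first, then rename to .txt, so the bash monitor
# never picks up a half-written signal file
my ($fh, $tmpname) = tempfile(DIR => $monitor_dir, SUFFIX => '.tmp', UNLINK => 0);
print $fh "$local\t$remote\n";
close $fh or die "close: $!";
(my $final = $tmpname) =~ s/\.tmp$/.txt/;
rename $tmpname, $final or die "rename: $!";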
Re: Overhead of Inline::Python?
by Anonymous Monk on Jun 18, 2019 at 20:01 UTC
It still requires a python interpreter, but it doesn't require forking new processes. I can't speak to whether that would be enough of a benefit for your use case. Consider Mojo::UserAgent and promises for managing complex HTTP request logic.
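A minimal sketch of the promise style, assuming a made-up upload endpoint and payloads:
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::UserAgent;
use Mojo::Promise;

# hypothetical upload URLs, just to show the shape of the code
my $ua   = Mojo::UserAgent->new;
my @urls = map { "https://upload.example.com/$_" } qw(a.dat b.dat c.dat);

# start several non-blocking requests, then wait for all of them
my @promises = map {
    my $url = $_;
    $ua->put_p($url => 'payload goes here')->then(sub {
        my $tx = shift;
        print "uploaded $url => ", $tx->res->code, "\n";
    })->catch(sub {
        warn "failed $url: @_\n";
    });
} @urls;

Mojo::Promise->all(@promises)->wait;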
Re: Overhead of Inline::Python?
by karlgoethebier (Abbot) on Jun 19, 2019 at 08:57 UTC
"...get some sense of what the overhead is..."
The only thing I can contribute about inlining is that Inline::Java produces a lot of "overhead". But this insight doesn't help much. Maybe it's time for some experiments? And some more information about your provider's API might be helpful, as stevieb already mentioned. Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'
Re: Overhead of Inline::Python?
by bliako (Abbot) on Jun 19, 2019 at 09:59 UTC
can involve the transfer of many thousands of files and potentially terabytes of data.
btw, it's only the number of files that is relevant to the overall cost of shelling out or inlining, not the total size of the data. It does not matter whether you have 1000 files of 10TB each or 1000 files of 10KB each: the cost of shelling out 1000 times is the same. (Edit: assuming the shelled-out or inlined code reads each file from disk rather than receiving it from the caller.)
Re: Overhead of Inline::Python?
by Anonymous Monk on Jun 19, 2019 at 13:30 UTC
OP here.
We're using the Google Cloud Platform. They have tons of docs on uploading files, but the basic options are all explained here, one per tab: gsutil is their command-line tool, Code Samples shows the client APIs for various languages (Python, Ruby, Go, Java, Node.js, etc.), and REST APIs covers the raw REST interface.
We haven't yet run detailed profiling. Our original system simply copied files to a remote filesystem, and thus could use regular filesystem tools. It was vastly faster, even though the raw network speed isn't appreciably different.
I appreciate that some kind of system to batch the files would be better. The problem is that the underlying code is extremely complicated: merely selecting the files for uploading takes hours, and every upload requires a number of database updates which should ideally happen at the same time as the upload (so that the location recorded in the database always matches the actual location of the file). Selecting all the files up front and then uploading them in one batch would therefore be a problem.
Using the REST APIs would require a lot of manual housekeeping (checking for successful completion, retrying interrupted uploads, etc.) that is handled automatically by gsutil and the language APIs, which is why we felt it was an advantage to use one of those tools. Perhaps we should just go the REST route, but I'd hoped that some way of using one of the higher-level tools would still be more stable than rolling it ourselves against the REST API.
Re: Overhead of Inline::Python?
by afoken (Chancellor) on Jun 19, 2019 at 19:34 UTC
Is this part 2 of using Inline Python with CGI?
Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)