SSH
It is easy to set up Dask on informally managed networks of machines using SSH.
This can be done manually using SSH and the Dask command line interface,
or automatically using either the SSHCluster Python class or the dask-ssh
command line tool. This document describes both of these automatic options.
Python Interface
distributed.deploy.ssh.SSHCluster(scheduler_addr, scheduler_port, worker_addrs, nthreads=0, nprocs=1, ssh_username=None, ssh_port=22, ssh_private_key=None, nohost=False, logdir=None, remote_python=None, memory_limit=None, worker_port=None, nanny_port=None)
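As a rough illustration of how the signature above might be used, the sketch below maps a list of hosts onto the three required arguments. It assumes that the installed version of distributed still exposes SSHCluster with exactly this signature (other releases may differ, so check your installed version), that the first host should run the scheduler as dask-ssh does by default, and that port 8786 is wanted. The helper names ssh_cluster_kwargs and start_cluster are hypothetical and not part of Dask.

```python
def ssh_cluster_kwargs(hosts, scheduler_port=8786, nprocs=1):
    """Build keyword arguments matching the SSHCluster signature shown above."""
    if not hosts:
        raise ValueError("at least one host is required")
    return {
        # Assumption: the first host runs the scheduler, mirroring dask-ssh.
        "scheduler_addr": hosts[0],
        "scheduler_port": scheduler_port,
        # Workers are started on every listed host, including the first.
        "worker_addrs": list(hosts),
        "nprocs": nprocs,
    }


def start_cluster(hosts):
    """Start a cluster over SSH (requires distributed, paramiko, and reachable hosts)."""
    # Imported lazily so the helper above works without distributed installed.
    from distributed.deploy.ssh import SSHCluster

    return SSHCluster(**ssh_cluster_kwargs(hosts))


kwargs = ssh_cluster_kwargs(["192.168.0.1", "192.168.0.2"])
print(kwargs)
```

Calling start_cluster(...) is left to the reader, since it opens real SSH connections to the listed machines.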
Command Line
The convenience script dask-ssh opens several SSH connections to your
target computers and initializes the network accordingly. You can
give it a list of hostnames or IP addresses:
$ dask-ssh 192.168.0.1 192.168.0.2 192.168.0.3 192.168.0.4
Or you can use shell brace expansion (supported by shells such as bash and zsh):
$ dask-ssh 192.168.0.{1,2,3,4}
Or you can specify a hostfile that includes a list of hosts:
$ cat hostfile.txt
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
$ dask-ssh --hostfile hostfile.txt
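The hostfile form above can also be produced from a script. The sketch below is a minimal illustration and not part of Dask: it builds the same four addresses, writes them one per line, and assembles the matching dask-ssh invocation. The helper name write_hostfile and the use of a temporary directory are assumptions made for the example, and nothing is actually launched.

```python
import shlex
import tempfile
from pathlib import Path

# Same addresses as the brace-expansion example: 192.168.0.{1,2,3,4}
hosts = [f"192.168.0.{i}" for i in (1, 2, 3, 4)]


def write_hostfile(path, hosts):
    """Write one host per line, matching the hostfile.txt layout shown above."""
    path = Path(path)
    path.write_text("\n".join(hosts) + "\n")
    return path


with tempfile.TemporaryDirectory() as tmp:
    hostfile = write_hostfile(Path(tmp) / "hostfile.txt", hosts)
    # Argument list suitable for subprocess.run; only the documented flag is used.
    argv = ["dask-ssh", "--hostfile", str(hostfile)]
    command = shlex.join(argv)
    # Read the file back while the temporary directory still exists.
    written = hostfile.read_text().splitlines()

print(command)
```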
The dask-ssh utility depends on the paramiko library:
python -m pip install paramiko
Note
The command line documentation here may differ depending on your installed
version. We recommend referring to the output of dask-ssh --help.
dask-ssh
Launch a distributed cluster over SSH. A ‘dask-scheduler’ process will run on the first host specified in [HOSTNAMES] or in the hostfile (unless --scheduler is specified explicitly). One or more ‘dask-worker’ processes will be run on each host in [HOSTNAMES] or in the hostfile. Use command line flags to adjust how many dask-worker processes are run on each host (--nprocs) and how many CPUs are used by each dask-worker process (--nthreads).
dask-ssh [OPTIONS] [HOSTNAMES]...
Options
--scheduler <scheduler>
    Specify scheduler node. Defaults to first address.
--scheduler-port <scheduler_port>
    Specify scheduler port number. Defaults to port 8786.
--nthreads <nthreads>
    Number of threads per worker process. Defaults to number of cores divided by the number of processes per host.
--nprocs <nprocs>
    Number of worker processes per host. Defaults to one.
--hostfile <hostfile>
    Textfile with hostnames/IP addresses.
--ssh-username <ssh_username>
    Username to use when establishing SSH connections.
--ssh-port <ssh_port>
    Port to use for SSH connections.
--ssh-private-key <ssh_private_key>
    Private key file to use for SSH connections.
--nohost
    Do not pass the hostname to the worker.
--log-directory <log_directory>
    Directory to use on all cluster nodes for the output of dask-scheduler and dask-worker commands.
--remote-python <remote_python>
    Path to Python on remote nodes.
--memory-limit <memory_limit>
    Bytes of memory that the worker can use. This can be an integer (bytes), float (fraction of total system memory), string (like 5GB or 5000M), ‘auto’, or zero for no memory management.
--worker-port <worker_port>
    Serving computation port, defaults to random.
--nanny-port <nanny_port>
    Serving nanny port, defaults to random.
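The --nthreads default described above (cores divided by processes per host) can be worked through with a few numbers. This is only a sketch of the stated rule: the function name default_nthreads is hypothetical, and the use of integer division is an assumption, since the help text does not say how a non-integer result is rounded.

```python
def default_nthreads(cores, nprocs=1):
    """Threads per worker process under the documented default rule."""
    if nprocs < 1:
        raise ValueError("nprocs must be at least 1")
    # Assumption: integer (floor) division; the installed tool may round differently.
    return cores // nprocs


# (cores on the host, worker processes per host)
for cores, nprocs in [(8, 1), (8, 2), (16, 4)]:
    print(f"{cores} cores, {nprocs} process(es): {default_nthreads(cores, nprocs)} threads each")
```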
Arguments
HOSTNAMES
    Optional argument(s)
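When dask-ssh is driven from a script, the options and arguments listed above can be assembled into an argument list programmatically. The following is a minimal sketch rather than an official helper: the function build_dask_ssh_argv is hypothetical, the username "dask" is a placeholder, and only flags documented on this page are emitted. As noted earlier, the flags accepted by your installed version may differ.

```python
import shlex


def build_dask_ssh_argv(hostnames=(), hostfile=None, scheduler=None,
                        scheduler_port=None, nthreads=None, nprocs=None,
                        ssh_username=None, ssh_port=None, ssh_private_key=None,
                        nohost=False, log_directory=None, remote_python=None,
                        memory_limit=None, worker_port=None, nanny_port=None):
    """Return an argument list for dask-ssh using only the flags documented above."""
    argv = ["dask-ssh"]
    valued_flags = {
        "--scheduler": scheduler,
        "--scheduler-port": scheduler_port,
        "--nthreads": nthreads,
        "--nprocs": nprocs,
        "--hostfile": hostfile,
        "--ssh-username": ssh_username,
        "--ssh-port": ssh_port,
        "--ssh-private-key": ssh_private_key,
        "--log-directory": log_directory,
        "--remote-python": remote_python,
        "--memory-limit": memory_limit,
        "--worker-port": worker_port,
        "--nanny-port": nanny_port,
    }
    # Emit only the options that were explicitly set, leaving the rest at their defaults.
    for flag, value in valued_flags.items():
        if value is not None:
            argv += [flag, str(value)]
    if nohost:
        argv.append("--nohost")  # boolean flag, takes no value
    # Positional HOSTNAMES come last.
    argv += list(hostnames)
    return argv


argv = build_dask_ssh_argv(
    hostnames=["192.168.0.1", "192.168.0.2"],
    nprocs=2,
    nthreads=4,
    ssh_username="dask",  # placeholder username
    memory_limit="5GB",
)
print(shlex.join(argv))
```

Passing the resulting list to subprocess.run(argv, check=True) would launch the cluster, assuming dask-ssh is on the PATH and the hosts accept your SSH credentials.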