Dask Redis Multinode Example
Dask Multinode Example running Docker
On main server with public IP address 172.16.2.210:
mkdir -p /home/$USER/docker/data ; chmod u+rwx /home/$USER/docker/data
mkdir -p /home/$USER/docker/log ; chmod u+rwx /home/$USER/docker/log
mkdir -p /home/$USER/docker/tmp ; chmod u+rwx /home/$USER/docker/tmp
mkdir -p /home/$USER/docker/license ; chmod u+rwx /home/$USER/docker/license
mkdir -p /home/$USER/docker/jupyter/notebooks
cp /home/$USER/.driverlessai/license.sig /home/$USER/docker/license/
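The directory setup above can also be written more compactly with bash brace expansion (same directories, same permissions; the license copy stays as shown above):

```shell
# Compact equivalent of the individual mkdir/chmod commands above (assumes bash).
mkdir -p /home/$USER/docker/{data,log,tmp,license,jupyter/notebooks}
chmod u+rwx /home/$USER/docker/{data,log,tmp,license}
```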
export server=172.16.2.210
docker run \
--net host \
--runtime nvidia \
--rm \
--init \
--pid=host \
--gpus all \
--ulimit core=-1 \
--shm-size=2g \
-u `id -u`:`id -g` \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /home/$USER/docker/license:/license \
-v /home/$USER/docker/data:/data \
-v /home/$USER/docker/log:/log \
-v /home/$USER/docker/tmp:/tmp \
-v /home/$USER/docker/jupyter:/jupyter \
-e dai_dask_server_ip=$server \
-e dai_redis_ip=$server \
-e dai_redis_port=6379 \
-e dai_main_server_minio_address=$server:9001 \
-e dai_local_minio_port=9001 \
-e dai_ip=$server \
-e dai_main_server_redis_password="<REDIS_PASSWORD>" \
-e dai_worker_mode='multinode' \
-e dai_enable_dask_cluster=1 \
-e dai_enable_jupyter_server=1 \
-e dai_enable_jupyter_server_browser=1 \
-e NCCL_SOCKET_IFNAME="enp5s0" \
-e NCCL_DEBUG=WARN \
-e NCCL_P2P_DISABLE=1 \
docker_image
The preceding example launches the following:
DAI main server on 12345
MinIO data server on 9001
Redis server on 6379
H2O-3 MLI server on 12348
H2O-3 recipe server on 50361
Jupyter on 8889
Dask CPU scheduler on 8786
Dask CPU scheduler’s dashboard on 8787
Dask GPU scheduler on 8790
Dask GPU scheduler’s dashboard on 8791
LightGBM Dask listening port on 12400
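As a quick sanity check, the ports listed above can be probed from another shell on the network. This is a minimal sketch using bash's /dev/tcp feature; `check_ports` is a hypothetical helper name, and it assumes bash is available and `$server` is exported as above:

```shell
# Probe a list of TCP ports on a host; prints open/closed per port.
check_ports() {
  local host=$1; shift
  local port
  for port in "$@"; do
    if (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; then
      echo "port $port: open"
    else
      echo "port $port: closed"
    fi
  done
}

# Defaults from the list above; adjust if you changed any ports.
check_ports "$server" 12345 9001 6379 12348 50361 8889 8786 8787 8790 8791 12400
```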
Notes:
$USER in bash gives the username.
Replace <REDIS_PASSWORD> with the default Redis password or a new one.
Replace various ports with alternative values if required.
Replace docker_image with the image (include the repository if the image is remote).
For GPU usage, --runtime nvidia is required. Systems without GPUs should remove this line.
Dask on the cluster can be disabled by passing dai_enable_dask_cluster=0. If Dask on the cluster is disabled, then dai_dask_server_ip does not need to be set.
Dask dashboard ports (for example, 8787 and 8791) and H2O-3 ports 12348, 50361, and 50362 do not need to be exposed; they provide user-level access to H2O-3 or Dask behavior.
Jupyter can be disabled by passing dai_enable_jupyter_server=0 and dai_enable_jupyter_server_browser=0.
Dask requires the host network so that the scheduler can tell workers where to find other workers. A subnet on a new IP therefore cannot be used, e.g. one created with docker network create --subnet=192.169.0.0/16 dainet.
To isolate access to a single user, instead of using -v /etc/passwd:/etc/passwd:ro -v /etc/group:/etc/group:ro, one can map user files with the same required information. These options ensure the container knows who the user is.
The created directories should not have previously existed, or should be from a prior run by the same user. Pre-existing directories should be moved or renamed to avoid conflicts.
Services like the Procsy server, H2O-3 MLI and Recipe servers, and Vis-data server are only used internally for each node.
The option -p 12400:12400 is only required for LightGBM Dask.
NCCL_SOCKET_IFNAME should specify the actual hardware device to use. This is required because of issues with NCCL obtaining the correct device automatically from the IP address.
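To find the device name to pass as NCCL_SOCKET_IFNAME, list the host's network interfaces and pick the one carrying the public IP used above (172.16.2.210 in this example):

```shell
# Each entry under /sys/class/net is a network interface name (e.g. enp5s0).
ls /sys/class/net

# With iproute2 installed, this shows which interface holds which IPv4 address;
# the interface holding the server's public IP is the one to use.
ip -o -4 addr show 2>/dev/null | awk '{print $2, $4}'
```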
On any number of workers for server with public IP address 172.16.2.210:
mkdir -p /home/$USER/docker/log ; chmod u+rwx /home/$USER/docker/log
mkdir -p /home/$USER/docker/tmp ; chmod u+rwx /home/$USER/docker/tmp
export server=172.16.2.210
docker run \
--runtime nvidia \
--gpus all \
--rm \
--init \
--pid=host \
--net host \
--ulimit core=-1 \
--shm-size=2g \
-u `id -u`:`id -g` \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-v /home/$USER/docker/log:/log \
-v /home/$USER/docker/tmp:/tmp \
-e dai_dask_server_ip=$server \
-e dai_redis_ip=$server \
-e dai_redis_port=6379 \
-e dai_main_server_minio_address=$server:9001 \
-e dai_local_minio_port=9001 \
-e dai_ip=$server \
-e dai_main_server_redis_password="<REDIS_PASSWORD>" \
-e dai_worker_mode='multinode' \
-e dai_enable_dask_cluster=1 \
-e NCCL_SOCKET_IFNAME="enp4s0" \
-e NCCL_DEBUG=WARN \
-e NCCL_P2P_DISABLE=1 \
docker_image --worker
Notes:
If the same disk is used for the main server and a worker, change "docker" to "docker_w1" for worker 1, and so on.
NCCL_SOCKET_IFNAME should specify the actual hardware device name, which in general differs on each node.
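For example, when a worker shares a disk with the main server, worker 1's directories might be set up as follows ("docker_w1" is just the naming convention from the note above):

```shell
# Per-worker directories when sharing a disk with the main server
# ("docker_w1" for worker 1, "docker_w2" for worker 2, and so on).
mkdir -p /home/$USER/docker_w1/log ; chmod u+rwx /home/$USER/docker_w1/log
mkdir -p /home/$USER/docker_w1/tmp ; chmod u+rwx /home/$USER/docker_w1/tmp
# Then mount these in the worker's docker run command:
#   -v /home/$USER/docker_w1/log:/log \
#   -v /home/$USER/docker_w1/tmp:/tmp \
```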
Dask Multinode Example running tar
On main server with public IP address 172.16.2.210:
export DRIVERLESS_AI_LICENSE_FILE=/home/$USER/.driverlessai/license.sig
export server=172.16.2.210
NCCL_SOCKET_IFNAME="enp5s0" \
NCCL_DEBUG=WARN \
NCCL_P2P_DISABLE=1 \
dai_dask_server_ip=$server dai_redis_ip=$server dai_redis_port=6379 \
dai_main_server_minio_address=$server:9001 dai_ip=$server dai_main_server_redis_password="<REDIS_PASSWORD>" \
dai_worker_mode='multinode' dai_enable_dask_cluster=1 \
dai_enable_jupyter_server=1 dai_enable_jupyter_server_browser=1 \
/opt/h2oai/dai/dai-env.sh python -m h2oai &> multinode_main.txt
On each worker node, run the same command with --worker added at the end, that is:
export DRIVERLESS_AI_LICENSE_FILE=/home/$USER/.driverlessai/license.sig
export server=172.16.2.210
NCCL_SOCKET_IFNAME="enp4s0" \
NCCL_DEBUG=WARN \
NCCL_P2P_DISABLE=1 \
dai_dask_server_ip=$server dai_redis_ip=$server dai_redis_port=6379 \
dai_main_server_minio_address=$server:9001 dai_ip=$server dai_main_server_redis_password="<REDIS_PASSWORD>" \
dai_worker_mode='multinode' dai_enable_dask_cluster=1 \
/opt/h2oai/dai/dai-env.sh python -m h2oai --worker &> multinode_worker.txt
Notes:
In this example, address 172.16.2.210 must be the public IP associated with the network device used for communication.
$USER in bash gives the username.
Replace <REDIS_PASSWORD> with the default Redis password or a new one.
Replace various ports with alternative values if required.
NCCL_SOCKET_IFNAME should be set to the actual hardware device name to use on each node.