Installation Guide

NOTE: DALiuGE is under heavy development and we are not regularly updating the versions on PyPI and Docker Hub at the moment. The best way to get going right now is to build and install from the latest sources, which you can get from here:

git clone https://github.com/ICRAR/daliuge
cd daliuge

Docker images

The recommended and easiest way to get started is to use the Docker container installation procedures provided to build and run the daliuge-engine and the daliuge-translator. The system is currently built as three images:

  1. icrar/daliuge-common contains all the basic DALiuGE libraries and dependencies.

  2. icrar/daliuge-engine is built on top of the daliuge-common image and includes the installation of the DALiuGE execution engine.

  3. icrar/daliuge-translator is also built on top of the daliuge-common image and includes the installation of the DALiuGE translator.

There are also pre-built images available on Docker Hub.

In this way the requirements of the daliuge-engine and daliuge-translator are kept separate from the rest of the framework, which has a less dynamic development cycle.

The daliuge-engine image by default runs a generic daemon, which can then be used to start the Master Manager, Node Manager or Data Island Manager. This approach makes it possible to change the actual manager deployment configuration dynamically and adjust it to the requirements of the environment.
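For example, once the engine container is up, a manager can be started through the daemon by running the dlg command-line tool inside the container. This is only a sketch: the container name daliuge-engine is an assumption that depends on the run scripts, and the -N option for the Data Island Manager is assumed here (the nm flags are the same ones used in the restart example further below):

docker exec -it daliuge-engine dlg nm -v -H 0.0.0.0 -d    # start a Node Manager (container name is illustrative)
docker exec -it daliuge-engine dlg dim -N localhost -d    # start a Data Island Manager over that node (hypothetical node list)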

NOTE: This guide is meant for people who are experimenting with the system. It does not cover specific needs of more complex, distributed operational deployments.

Creating the images

Building the three images is easy. Start with the daliuge-common image by running:

cd daliuge-common && ./build_common.sh dev && cd ..

then build the runtime:

cd daliuge-engine && ./build_engine.sh dev && cd ..

and finally build the translator:

cd daliuge-translator && ./build_translator.sh dev && cd ..

Running the images

Running the engine and the translator is equally simple:

cd daliuge-engine && ./run_engine.sh dev && cd ..

and:

cd daliuge-translator && ./run_translator.sh dev && cd ..

You can use EAGLE at https://eagle.icrar.org and point the translator URL in your EAGLE settings to http://localhost:8084. Congratulations! You now have access to a complete DALiuGE system on your local computer!
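To verify that everything is up, you can list the running containers and probe the translator port (the container names depend on the run scripts, so treat them as indicative only):

docker ps                      # should show the engine and translator containers
curl http://localhost:8084     # the translator web interface should respond on this port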

More detailed information about running and controlling the DALiuGE system can be found in the Startup and Shutdown Guide.

Direct Installation

NOTE: For most use cases the docker installation described above is recommended.

Requirements

The DALiuGE framework requires no packages apart from those listed in its setup.py file, which are retrieved automatically during installation. The spead2 library (one of the DALiuGE optional requirements), however, requires a number of libraries to be installed on the system (an example install command is shown after the list):

  1. boost-python

  2. boost-system

  3. boost-devel

  4. gcc >= 4.8
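On a Debian/Ubuntu system, for example, these could be installed with something like the command below; the package names differ between distributions (on RPM-based systems the boost-devel packages are used instead):

sudo apt-get install libboost-python-dev libboost-system-dev build-essential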

Installing into host Python

NOTE: DALiuGE requires Python 3.7 or later. It is always recommended to install DALiuGE inside a Python virtual environment; make sure you have one created and activated (a minimal example is shown after the next command). More often than not pip requires an update, otherwise it will keep issuing a warning. Thus first run:

pip install --upgrade pip
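If you do not have a virtual environment yet, it can be created and activated like this before upgrading pip (the environment name dlg_env is just an example):

python3 -m venv ~/dlg_env
source ~/dlg_env/bin/activate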

The local installation follows the same pattern as the docker installation.

Install from GitHub

The following commands install the DALiuGE parts directly from GitHub. In this case you won't have access to the sources, but the system will run. First install daliuge-common:

pip install 'git+https://github.com/ICRAR/daliuge.git#egg=daliuge-common&subdirectory=daliuge-common'

then install the daliuge-engine:

pip install 'git+https://github.com/ICRAR/daliuge.git#egg=daliuge-engine&subdirectory=daliuge-engine'

and finally, if required also install the daliuge-translator:

pip install 'git+https://github.com/ICRAR/daliuge.git#egg=daliuge-translator&subdirectory=daliuge-translator'
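As a quick sanity check, the dlg command-line tool should now be available in your virtual environment; running it without arguments should print a usage message listing the available sub-commands (e.g. nm, which is used later in this guide):

dlg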

Install from sources

If you want to have access to the sources you can run the installation in a slightly different way. Again, this should be done from within a virtual environment. First start by cloning the repository:

git clone https://github.com/ICRAR/daliuge

then install the individual parts:

cd daliuge
cd daliuge-common
pip install .
cd ../daliuge-engine
pip install .
cd ../daliuge-translator
pip install .
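If you plan to modify the sources, a pip editable install can be used instead, so that changes in the clone take effect without re-installation; this is standard pip behaviour and assumes the individual packages support it. From the top-level daliuge directory:

cd daliuge-common && pip install -e . && cd ..
cd daliuge-engine && pip install -e . && cd ..
cd daliuge-translator && pip install -e . && cd ..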

Notes on the merge project between DALiuGE and Ray

The objective of this activity was to investigate a feasible solution for the flexible and simple deployment of DALiuGE on various platforms. In particular, the deployment of DALiuGE on AWS in an autoscaling environment is of interest to us.

Ray (https://docs.ray.io/en/master/) is a fairly complete execution engine in its own right, targeting DL and ML applications and integrating a number of the major ML software packages. We are particularly interested in the Ray core software, which states the following goals:

  1. Providing simple primitives for building and running distributed applications.

  2. Enabling end users to parallelize single machine code, with little to zero code changes.

  3. Including a large ecosystem of applications, libraries, and tools on top of the core Ray to enable complex applications.

Internally, Ray uses a number of technologies that we are also using or evaluating within DALiuGE and/or the SKA. Ray manages and distributes computing very well and essentially covers a number of our target platforms, including AWS, SLURM, Kubernetes, Azure and Google Cloud.

The idea was thus to use Ray to distribute DALiuGE on those platforms, starting with AWS, while leaving the two systems otherwise essentially independent. In the future we may look into a tighter integration between the two.

Setup

Pre-requisites

First you need to install Ray into your local Python virtual environment:

pip install ray

Ray uses a YAML file to configure a deployment and allows additional setup commands to be run on both the head and the worker nodes. In general Ray runs inside docker containers on the target hosts, and the initial setup therefore pulls the Ray docker image from Docker Hub. Getting DALiuGE running inside that container is pretty straightforward, but it requires the installation of gcc, which is quite an overhead. We have therefore created a daliuge-ray docker image, which is available on the icrar Docker Hub repository and is downloaded instead of the standard Ray image.

The rest is then straightforward and just requires configuring a few AWS autoscaling settings: the AWS region, the type of the head node, the type and (minimum and maximum) number of worker nodes, and whether or not the spot market is used. In addition the virtual machine AMI ID has to be specified, which is a pain to obtain and differs between AWS regions.
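As an illustration, a suitable Ubuntu AMI ID for a given region can be looked up with the AWS CLI; the owner ID below is Canonical's and the image name filter is only an example, so adjust both to your region and base image of choice:

aws ec2 describe-images --region ap-southeast-2 --owners 099720109477 \
    --filters "Name=name,Values=ubuntu/images/hvm-ssd/ubuntu-focal-20.04-amd64-server-*" \
    --query 'sort_by(Images, &CreationDate)[-1].ImageId' --output text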

Starting the DALiuGE Ray cluster

Getting DALiuGE up and running in addition to Ray requires just two extra lines for the head and worker nodes in the YAML file, but there are some caveats, as outlined below. With the provided Ray configuration YAML file, starting a cluster running DALiuGE on AWS is very easy (provided you have your AWS environment set up):

cd <path_to_daliuge_git_clone>
ray up daliuge-ray.yaml
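While the cluster is coming up, the autoscaler output can be followed with the standard Ray monitor command for cluster configuration files:

ray monitor daliuge-ray.yaml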

Stopping the cluster is equally simple:

ray down daliuge-ray.yaml

For convenience, both DALiuGE and Ray require a number of ports to be exposed in order to monitor and connect the various parts. The easiest way to achieve this is to attach to a virtual terminal on the head node and forward all the ports at the same time:

ray attach -p 8265 -p 8001 -p 8000 -p 5555 -p 6666 daliuge-ray.yaml

More specifically, the command above opens a shell inside the docker container running on the head node AWS instance.

Issues

Bringing the cluster down by default only stops the instances, which makes the next startup quite a bit faster. There is just one 'small' issue: Ray v1.0 has a bug which prevents the second start from working! That is why the current default setting in daliuge-ray.yaml is to terminate the instances:

cache_stopped_nodes: False

To stop and start a node manager use the following two commands, replacing the SSH key file with the one generated when the cluster was created and the IP address with the public IP address of the AWS node on which the NM should be restarted:

ssh -tt -o IdentitiesOnly=yes -i /Users/awicenec/.ssh/ray-autoscaler_ap-southeast-2.pem ubuntu@54.253.243.145 docker exec -it ray_container dlg nm -s
ssh -tt -o IdentitiesOnly=yes -i /Users/awicenec/.ssh/ray-autoscaler_ap-southeast-2.pem ubuntu@54.253.243.145 docker exec -it ray_container dlg nm -v -H 0.0.0.0 -d

The commands above also show how to connect to a shell inside the docker container on a worker node. Unfortunately Ray does not expose this as easily as the connection to the head node.

Submitting and executing a Graph

This configuration only deploys the DALiuGE engine; EAGLE and a translator need to be deployed somewhere else. When submitting the physical graph (PG) from a translator web interface, the IP address to be entered there is the public IP address of the DIM (the Ray AWS head instance). After submitting, the DALiuGE monitoring page will pop up and show the progress bar. It is then also possible to click your way through to the sub-graphs running on the worker nodes.
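The public IP address of the head instance (and thus of the DIM) can be retrieved from the cluster configuration with the Ray CLI (available at least in the Ray 1.x command set; check ray --help if the name differs):

ray get-head-ip daliuge-ray.yaml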

Future thoughts

This implementation is the start of an integration between Ray and DALiuGE. Ray (like the AWS autoscaling) is a reactive execution framework and as such uses the autoscaling feature just in time, when the load exceeds a certain threshold. DALiuGE, on the other hand, is a proactive execution framework and pre-allocates the resources required to execute a whole workflow. Both approaches have pros and cons. In particular, in an environment where resources are charged by the second it is desirable to allocate them as dynamically as possible. On the other hand, dynamic allocation comes with the overhead of provisioning additional resources during run-time and is thus non-deterministic in terms of completion time. This is even more obvious when using the spot market on AWS. Fully dynamic allocation also does not fit well with bigger workflows, which require a lot of resources right from the start. The optimal solution is very likely somewhere between fully dynamic and fully static resource provisioning.

Dynamic workflow allocation

The first step in that direction is to connect the DALiuGE translator with the Ray deployment. After the translator has performed the workflow partitioning, the resource requirements are fixed and could in turn be used to start up the Ray cluster with the required number of worker nodes. Essentially this would also completely isolate one workflow from another. The next step could be to add workflow fragmentation to the DALiuGE translator and scale the Ray cluster according to the requirements of each fragment, rather than the whole workflow. It remains to be seen how to trigger the scaling of the Ray cluster far enough ahead of time that execution can continue without delay once the previous workflow fragment finishes.