Drops

Drops are at the center of DALiuGE. They are representations of data and applications, making both manageable by the framework.

Lifecycle

The lifecycle of a Drop is simple and follows the basic principle of write once, read many times. It also allows for data deletion.

A Drop starts in the INITIALIZED state, meaning that its data is not present yet. From there it transitions to COMPLETED once its data has been written, optionally passing through WRITING if the writing occurs through DALiuGE (see Input/Output). Once in the COMPLETED state the data can be read as many times as needed. Eventually the Drop transitions to EXPIRED, denying any further reads, after which the data is deleted and the Drop moves to the final DELETED state. If any I/O error occurs the Drop moves to the ERROR state.
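
The following minimal sketch illustrates these transitions, assuming the InMemoryDROP class and the DROPStates enumeration from dlg.ddap_protocol (import locations have moved between DALiuGE releases):

    # Lifecycle sketch: write once, then read-only; names assumed from
    # current dlg releases (module paths may differ in your version).
    from dlg.ddap_protocol import DROPStates
    from dlg.drop import InMemoryDROP

    drop = InMemoryDROP('oid_A', 'uid_A')
    assert drop.status == DROPStates.INITIALIZED  # no data present yet

    drop.write(b'hello')  # writing through DALiuGE: INITIALIZED -> WRITING
    drop.setCompleted()   # all data written: -> COMPLETED, read-only from now on
    assert drop.status == DROPStates.COMPLETED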

Events

Changes in a Drop's state, and other actions performed on a Drop, fire named events which are sent to all interested subscribers. Users can subscribe to particular named events, or to all events.

In particular, the Node Drop Manager subscribes to all events generated by the Drops it manages. By doing so it can monitor all their activities and perform any appropriate action as required. The Node Drop Manager, or any other entity, can thus become a Graph Event Manager, in the sense that it can subscribe to all events sent by all Drops and make use of them.
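
As a hedged sketch of the subscription mechanism: the listener class below is hypothetical, and drops are assumed to expose a subscribe method accepting any object with a handleEvent(event) callback, with state changes carried as 'status' events (details may differ between versions):

    # Hypothetical listener; subscription API assumed from current dlg releases.
    class PrintingListener:
        def handleEvent(self, event):
            # every event carries at least its type and the uid of the source drop
            print(event.type, event.uid)

    drop.subscribe(PrintingListener())                      # all events
    drop.subscribe(PrintingListener(), eventType='status')  # state changes only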

Relationships

Drops are connected and create a dependency graph representing an execution plan, where inputs and outputs are connected to applications, establishing the following possible relationships:

  1. Zero or more data Drops can be the inputs of an application Drop, in which case the application is a consumer of the data Drop(s).

  2. A data Drop can be a streaming input of an application Drop, in which case the application is seen as a streaming consumer from the data Drop’s point of view.

  3. Zero or more data Drops can be the outputs of an application Drop, in which case the application is the producer of the data Drop(s).

  4. An application is never a consumer or producer of another application; conversely, a data Drop never produces or consumes another data Drop.

The difference between normal inputs/consumers and their streaming counterparts is their granularity. In the normal case, inputs only notify their consumers when they have reached the COMPLETED state, after which the consumers can open the Drop and read its data. Streaming inputs, on the other hand, notify consumers each time data is written into them (along with the data itself), and thus allow for continuous operation of applications as data gets written into their inputs. Once all the data has been written, the normal event notifying that the Drop has moved to the COMPLETED state is also fired.
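
These relationships are established with the add* family of methods on application drops; the reciprocal consumer/producer links are then set up automatically. A sketch, where MyApp stands in for any application drop class:

    # 'MyApp' is a hypothetical application drop class; method names are
    # assumed from current dlg releases. InMemoryDROP is imported as in
    # the lifecycle sketch above.
    a = InMemoryDROP('oid_A', 'uid_A')
    b = InMemoryDROP('oid_B', 'uid_B')
    c = InMemoryDROP('oid_C', 'uid_C')
    app = MyApp('oid_app', 'uid_app')

    app.addInput(a)           # relationship 1: app consumes a
    app.addStreamingInput(c)  # relationship 2: app is a streaming consumer of c
    app.addOutput(b)          # relationship 3: app produces b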

Input/Output

I/O can be performed on the data that is represented by a Drop by obtaining a reference to its I/O object and calling the necessary POSIX-like methods; in this case the data passes through the Drop instance. An application is also free to bypass the Drop interface and perform I/O directly on the data, in which case it uses the data Drop's dataURL to find the data's location. It is then the responsibility of the application to ensure that the I/O occurs in the correct location and uses the format expected by the storage and by downstream application Drops for subsequent processing.
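
Both access paths can be sketched as follows, assuming the open/read/close methods and the dataURL attribute of data drops in current dlg releases:

    # 'drop' is a COMPLETED data drop, e.g. from the lifecycle sketch above.

    # Path 1: POSIX-like access through the Drop instance
    desc = drop.open()
    chunk = drop.read(desc, 4096)  # read up to 4096 bytes
    drop.close(desc)

    # Path 2: bypass the Drop and locate the data directly (the application
    # is then responsible for correct location and format)
    location = drop.dataURL        # e.g. a file:// or ngas:// style URL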

DALiuGE provides various commonly used data components with their associated I/O storage classes, including in-memory, Apache Arrow Plasma/Flight, file-based, S3 and NGAS storage. It is also possible to access the content of a plain URL and use that as a data source.
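
Because all of these storage classes expose the same POSIX-like interface, the backend is selected simply by choosing the drop class. A sketch assuming InMemoryDROP and FileDROP (import locations vary between releases):

    from dlg.drop import FileDROP, InMemoryDROP

    # identical calling code, different storage backends
    for cls in (InMemoryDROP, FileDROP):
        d = cls('oid_X', 'uid_X')
        d.write(b'payload')
        d.setCompleted()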

When using and developing a DALiuGE workflow the details of the I/O mechanisms are largely hidden, but users still need to be aware of the differences and limitations of each of them. Memory, files and remote data objects are simply not the same in terms of I/O capabilities and performance. The most important difference is between memory and all the other methods, since plain memory only works for Python and dynamic-library based components. A bash component, for example, simply does not know how to deal with a memory block handed over to it; that is why EAGLE prevents such connections between components in the first place.

When developing application components most of these details are also transparent, as long as the application component uses the provided POSIX-like access mechanisms. It is possible, though, to bypass those inside a component and perform all I/O independently of the framework. Even at that level there are two ways: one is to use the dataURL provided by the framework but not the I/O methods; the more extreme one is to simply open some named file or channel without DALiuGE knowing anything about it. The latter is strongly discouraged, since it creates unpredictable side effects which are almost impossible to identify in a large distributed environment. How to use the provided I/O methods from an application component is detailed in the DALiuGE Application Component Developers Guide chapter.
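
A minimal application component using the provided mechanisms might look as follows; the BarrierAppDROP module path and the droputils helper are assumed from current dlg releases:

    from dlg import droputils
    from dlg.apps.app_base import BarrierAppDROP  # dlg.drop in older releases

    class CopyApp(BarrierAppDROP):
        def run(self):
            # read the first input in full and copy it to every output,
            # all through the framework-provided I/O methods
            data = droputils.allDropContents(self.inputs[0])
            for out in self.outputs:
                out.write(data)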

When developing a new data component the developer needs to implement the interface between the DALiuGE POSIX-like methods and the underlying data storage mechanism. This is detailed in the DALiuGE Data Component Developers Guide chapter.
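
A rough sketch of such an interface, assuming the DataIO base class in dlg.io and its underscore-prefixed methods (the exact abstract method set may differ between versions):

    from dlg.io import DataIO

    class MyStoreIO(DataIO):
        # each method maps one POSIX-like call onto the backing storage
        def _open(self, **kwargs):
            ...  # return a descriptor for the underlying resource
        def _read(self, count=65536, **kwargs):
            ...  # return up to 'count' bytes from the descriptor
        def _write(self, data, **kwargs):
            ...  # persist 'data' and return the number of bytes written
        def _close(self, **kwargs):
            ...  # release the descriptor
        def exists(self):
            ...  # True if the underlying data exists
        def delete(self):
            ...  # remove the underlying data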

Drop Channels

During a DALiuGE workflow execution one application drop produces the data of a data drop, which in turn is consumed by another application drop. Data drops thus essentially provide the data transfer mechanism between applications. The DALiuGE translator tries to minimise data movement, and in many cases no transfer actually happens; the data drop transitions to the COMPLETED state once it has received all data and passes that event on to the consumer application(s). The consumer applications in turn use the provided read method to access the data directly.

In cases where data drops are accessed from separate nodes or islands, the managers automatically produce a drop proxy on the remote nodes, providing a remote method invocation (RMI) interface that allows the producers or consumers to execute the required I/O methods. It is the job of the Master and Island Drop Managers to generate and exchange these proxies and connect them to the correct Drop instances when the graph is deployed to potentially multiple data islands and nodes. If there is no Drop separation within a physical graph partition then it is implied that the Drops will execute within a single address space, and, as a result, plain method calls are used between Drop instances.

In addition to handing over the handle to the consumer once the data drop is COMPLETED, DALiuGE also supports streaming data directly from one application drop to another at run-time. As for most streaming applications, this is based on the completion of a block of bytes transferred; the intermediate data drop thus still has a meaning and could in principle be any standard data drop. In practice the only viable solutions are memory-based drops, such as plain memory, shared memory or Plasma.
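
A streaming consumer can be sketched like this, assuming the dataWritten and dropCompleted callbacks that DALiuGE invokes on streaming consumers (signatures and module path assumed from current dlg releases):

    from dlg.apps.app_base import AppDROP  # dlg.drop in older releases

    class ByteCounter(AppDROP):
        def initialize(self, **kwargs):
            super().initialize(**kwargs)
            self.total = 0

        def dataWritten(self, uid, data):
            # called for every write on the streaming input
            self.total += len(data)

        def dropCompleted(self, uid, drop_state):
            # the streaming input reached COMPLETED: publish the result
            self.outputs[0].write(str(self.total).encode())
            self.outputs[0].setCompleted()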

Drop Component Interface

The DALiuGE framework uses Docker containers as its primary interface to 3rd-party applications. Docker containers have the following benefits over traditional tool management:

  1. Portability.

  2. Versioning and component reuse.

  3. Lightweight footprint.

  4. Simple maintenance.

The application programmer can make use of the DockerApp, which is the interface between a Docker container and the Drop framework. Refer to the documentation for details.
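
A hedged sketch of instantiating a DockerApp drop; the image and command parameters are assumed from current dlg releases, and the %i0/%o0 placeholders for input/output paths are an assumption to be checked against the DockerApp documentation:

    from dlg.apps.dockerapp import DockerApp

    app = DockerApp('oid_dock', 'uid_dock',
                    image='ubuntu:22.04',   # container image to run
                    command='cp %i0 %o0')   # assumed placeholder syntax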

Other applications not based on Docker containers can be written as well. Any application must derive at least from AppDrop, but an easier-to-use base class is the BarrierAppDrop, which simply requires a run method to be written by the developer (see dlg.rpc for details). DALiuGE ships with a set of pre-existing applications that perform common operations, such as a TCP socket listener and a bash command executor, among others. See dlg.apps for more examples. In addition we have developed a stand-alone tool (dlg_paletteGen) which enables the automatic generation of DALiuGE-compatible component descriptions from existing code. In this way it is possible to enable the usage of big existing public or proprietary libraries of algorithms, such as Astropy, within the DALiuGE eco-system.