dlg.data

This package contains several general-purpose data stores in the form of DROPs that we have developed as examples and for real-life use. Most of them are based on the DataDROP.

dlg.data.io

class dlg.data.io.DataIO

A class used to read/write data stored in a particular kind of storage in an abstract way. This base class simply declares a number of methods that deriving classes must actually implement to handle different storage mechanisms (e.g., a local filesystem or an NGAS server).

An instance of this class represents a particular piece of data. Thus at construction time users must specify a storage-specific unique identifier for the data that this object handles (e.g., a filename in the case of a DataIO class that works with local filesystem storage, or a host:port/fileId combination in the case of a class that works with an NGAS server).

Once an instance has been created it can be opened via its open method indicating an open mode. If opened with OpenMode.OPEN_READ, only read operations will be allowed on the instance, and if opened with OpenMode.OPEN_WRITE only writing operations will be allowed.
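The open-mode contract described above can be sketched as follows. This is a minimal, self-contained illustration and not the dlg implementation: SketchOpenMode, SketchIO and BytesBackedIO are hypothetical names, and raising ValueError on a disallowed operation is an assumption.

```python
import enum
import io
from abc import ABC, abstractmethod


class SketchOpenMode(enum.Enum):
    """Hypothetical stand-in for OpenMode."""
    OPEN_READ = 1
    OPEN_WRITE = 2


class SketchIO(ABC):
    """Hypothetical base class: remembers the open mode and guards read/write."""

    def __init__(self):
        self._mode = None

    def open(self, mode):
        self._mode = mode
        self._open()

    def close(self):
        self._mode = None

    def isOpened(self):
        return self._mode is not None

    def read(self, count):
        # Reading is only allowed after open(OPEN_READ).
        if self._mode is not SketchOpenMode.OPEN_READ:
            raise ValueError("not opened for reading")
        return self._read(count)

    def write(self, data):
        # Writing is only allowed after open(OPEN_WRITE).
        if self._mode is not SketchOpenMode.OPEN_WRITE:
            raise ValueError("not opened for writing")
        return self._write(data)

    @abstractmethod
    def _open(self): ...

    @abstractmethod
    def _read(self, count): ...

    @abstractmethod
    def _write(self, data): ...


class BytesBackedIO(SketchIO):
    """Concrete sketch backed by an in-memory BytesIO object."""

    def __init__(self, buf):
        super().__init__()
        self._buf = buf

    def _open(self):
        # Readers start at the beginning; writers append at the end.
        whence = io.SEEK_SET if self._mode is SketchOpenMode.OPEN_READ else io.SEEK_END
        self._buf.seek(0, whence)

    def _read(self, count):
        return self._buf.read(count)

    def _write(self, data):
        return self._buf.write(data)
```

The storage-specific work lives in the underscore-prefixed hooks, so the mode check is written once in the base class.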

buffer() Union[memoryview, bytes, bytearray, <mocked type>]

Gets a buffer-protocol-compatible object of the drop data. This may be a zero-copy view of the data or a copy, depending on whether the drop stores its data in CPU memory or not.
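The difference between a zero-copy view and a copy can be seen with plain Python objects. The snippet below only uses built-in types and does not call any dlg code; the variable names are illustrative.

```python
data = bytearray(b"drop payload")

view = memoryview(data)  # zero-copy: shares memory with `data`
snapshot = bytes(data)   # copy: independent of `data`

# Mutate the underlying storage in place.
data[0] = ord("D")

# The view observes the change, the copy does not.
view_first = view[0]
snapshot_first = snapshot[0]
```

A caller that receives a view must therefore treat it as live data, while a copy is a stable snapshot.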

close(**kwargs)

Closes the underlying storage where the data represented by this instance is stored, freeing underlying resources.

abstract delete()

Deletes the data represented by this DataIO

abstract exists() bool

Returns True if the data represented by this DataIO actually exists in the underlying storage mechanism.

isOpened()

Returns True if the IO is currently opened for reading or writing.

open(mode: OpenMode, **kwargs)

Opens the underlying storage where the data represented by this instance is stored. Depending on the value of mode subsequent calls to self.read or self.write will succeed or fail.

read(count: int, **kwargs)

Reads count bytes from the underlying storage.

size(**kwargs) int

Returns the current total size of the underlying stored object. If the storage class does not support this it is supposed to return -1.

write(data, **kwargs) int

Writes data into the storage

class dlg.data.io.ErrorIO

A DataIO class that throws an exception if any of its methods is invoked.

class dlg.data.io.FileIO(filename, **kwargs)

A file-based implementation of DataIO

getFileName()

Returns the drop filename

dlg.data.io.IOForURL(url)

Returns a DataIO instance that handles the given URL for reading. If no suitable DataIO class can be found to handle the URL, None is returned.
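The dispatch-or-None behaviour can be sketched with the standard URL parser. The function below is hypothetical: it returns a plain description tuple instead of a DataIO instance, and the `file` and `ngas` scheme names are assumptions made for illustration. The default port of 7777 mirrors the NgasIO signature.

```python
from urllib.parse import urlparse


def describe_io_for_url(url):
    """Hypothetical dispatcher: returns (kind, details), or None if unsupported."""
    parsed = urlparse(url)
    if parsed.scheme == "file":
        return ("file", {"filename": parsed.path})
    if parsed.scheme == "ngas":
        return (
            "ngas",
            {
                "hostname": parsed.hostname,
                "port": parsed.port or 7777,
                "fileId": parsed.path.lstrip("/"),
            },
        )
    # No handler knows this scheme, so signal that with None.
    return None
```

Callers have to check the result for None before using it.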

class dlg.data.io.MemoryIO(buf: BytesIO, **kwargs)

A DataIO class that reads from and writes into the BytesIO object given at construction time.

class dlg.data.io.NgasIO(hostname, fileId, port=7777, ngasConnectTimeout=2, ngasTimeout=2, length=-1, mimeType='application/octet-stream')

A DataIO class whose data is ultimately stored in NGAS. Since NGAS doesn’t support appending data to existing files, all the data is stored temporarily in a file on the local filesystem and then moved to the NGAS destination.
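The stage-then-move pattern can be sketched with the standard library alone. StagedWriter is a hypothetical class, and a local file move stands in for the transfer to NGAS, which this sketch does not perform.

```python
import os
import shutil
import tempfile


class StagedWriter:
    """Hypothetical: accumulates writes in a local temp file, moves it on close."""

    def __init__(self, destination):
        self._destination = destination
        # delete=False so the file survives close() and can be moved.
        self._tmp = tempfile.NamedTemporaryFile(delete=False)

    def write(self, data):
        # Appending is cheap locally, even if the final store forbids it.
        return self._tmp.write(data)

    def close(self):
        self._tmp.close()
        # Stand-in for the final transfer: the destination sees one complete file.
        shutil.move(self._tmp.name, self._destination)
```

The destination never observes a partially written file, because it only appears once close() has run.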

delete()

Deletes the data represented by this DataIO

class dlg.data.io.NgasLiteIO(hostname, fileId, port=7777, ngasConnectTimeout=2, ngasTimeout=2, length=-1, mimeType='application/octet-stream')

An IO class whose data is finally stored into NGAS. It uses the ngaslite module of DALiuGE instead of the full client-side libraries provided by NGAS itself, since they might not be installed everywhere.

The ngaslite module doesn’t support the STATUS command yet, and because of that this class will throw an error if its exists method is invoked.

exists() bool

Returns True if the data represented by this DataIO actually exists in the underlying storage mechanism.

class dlg.data.io.NullIO

A DataIO that stores no data

class dlg.data.io.OpenMode

Open Mode for Data Drops

class dlg.data.io.PlasmaFlightIO(object_id: ObjectID, plasma_path='/tmp/plasma', flight_path: Optional[str] = None, expected_size: Optional[int] = None, use_staging=False)

A Plasma drop managed through the Arrow Flight network protocol.

class dlg.data.io.PlasmaIO(object_id: ObjectID, plasma_path='/tmp/plasma', expected_size: Optional[int] = None, use_staging=False)

A shared-memory IO reader/writer implemented using plasma store memory buffers. Note: not compatible with the PlasmaClient put()/get() methods, which pickle data before writing.

class dlg.data.io.SharedMemoryIO(uid, session_id, **kwargs)

A DataIO class that writes to a shared memory buffer

dlg.data.drops.data_base

class dlg.data.drops.data_base.DataDROP(oid, uid, **kwargs)

A DataDROP is a DROP that stores data for writing with an AppDROP, or reading with one or more AppDROPs.

DataDROPs have two different modes: “normal” and “streaming”. Normal DataDROPs wait until they reach the COMPLETED state before becoming available as input to an AppDROP, while streaming DataDROPs may be read simultaneously with writing by chunking drop bytes together.
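The contrast between the two modes can be sketched with a toy class. ChunkedSource and its `delivered` list are hypothetical and only model when data becomes visible to a consumer; they do not reproduce the DROP state machine.

```python
class ChunkedSource:
    """Hypothetical model of when written bytes become visible to a consumer."""

    def __init__(self, streaming):
        self.streaming = streaming
        self._chunks = []
        self.delivered = []  # what the consumer has seen so far

    def write(self, chunk):
        self._chunks.append(chunk)
        if self.streaming:
            # Streaming: each chunk is handed over while writing continues.
            self.delivered.append(chunk)

    def setCompleted(self):
        if not self.streaming:
            # Normal: the consumer sees the whole payload only on completion.
            self.delivered.append(b"".join(self._chunks))
```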

This class contains two methods that need to be overridden: getIO, invoked by AppDROPs when reading or writing to a drop, and dataURL, a getter for a data URI including the protocol and address, which is parsed by the function IOForURL.
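The override requirement can be illustrated with Python's abc module. DataHolderSketch and BytesHolder are hypothetical classes, and the `mem://` scheme is invented for this example; none of this is the DataDROP code itself.

```python
import io
from abc import ABC, abstractmethod


class DataHolderSketch(ABC):
    """Hypothetical base class with the two members subclasses must provide."""

    @property
    @abstractmethod
    def dataURL(self) -> str: ...

    @abstractmethod
    def getIO(self): ...


class BytesHolder(DataHolderSketch):
    """Concrete sketch that keeps its data in a BytesIO buffer."""

    def __init__(self, name):
        self._name = name
        self._buf = io.BytesIO()

    @property
    def dataURL(self) -> str:
        # The scheme identifies the storage type, the rest locates the data.
        return f"mem://{self._name}"

    def getIO(self):
        return self._buf
```

Instantiating the base class directly raises TypeError, which surfaces a missing override at construction time.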

property checksum

The checksum value for the data represented by this DROP. Its value is automatically calculated if the data was actually written through this DROP (using the self.write() method directly or indirectly). In the case that the data has been externally written, the checksum can be set externally after the DROP has been moved to COMPLETED or beyond.

See self.checksumType

property checksumType

The algorithm used to compute this DROP’s data checksum. Its value is automatically set if the data was actually written through this DROP (using the self.write() method directly or indirectly). In the case that the data has been externally written, the checksum type can be set externally after the DROP has been moved to COMPLETED or beyond.

See self.checksum
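A checksum that is maintained automatically while data is written can be sketched as a running CRC. ChecksummingWriter is hypothetical, and both the choice of CRC-32 and the "CRC_32" label are assumptions, since the algorithm is not specified here.

```python
import zlib


class ChecksummingWriter:
    """Hypothetical writer that updates a running checksum on every write."""

    def __init__(self):
        self.checksum = None
        self.checksumType = None
        self._chunks = []

    def write(self, data):
        # zlib.crc32 accepts the previous value, so chunks can be fed one by one.
        running = 0 if self.checksum is None else self.checksum
        self.checksum = zlib.crc32(data, running)
        self.checksumType = "CRC_32"  # assumed label
        self._chunks.append(bytes(data))
        return len(data)
```

Because the CRC is updated incrementally, the final value equals the CRC of the concatenated payload without a second pass over the data.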

close(descriptor, **kwargs)

Closes the given DROP descriptor, decreasing the DROP’s internal reference count and releasing the underlying resources associated to the descriptor.

abstract property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

decrRefCount()

Decrements the reference count of this DROP by one atomically.

delete()

Deletes the data represented by this DROP.

exists()

Returns True if the data represented by this DROP actually exists in the underlying storage mechanism.

abstract getIO() DataIO

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

incrRefCount()

Increments the reference count of this DROP by one atomically.

isBeingRead()

Returns True if the DROP is currently being read; False otherwise

open(**kwargs)

Opens the DROP for reading, and returns a “DROP descriptor” that must be used when invoking the read() and close() methods. DROPs maintain an internal reference count based on the number of times they are opened for reading; because of that, after a successful call to this method the corresponding close() method must eventually be invoked. Failing to do so will result in DROPs never expiring and never being deleted.

read(descriptor, count=65536, **kwargs)

Reads count bytes from the given DROP descriptor.
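The descriptor and reference-count protocol shared by open(), read() and close() can be sketched as below. RefCountedReader is a hypothetical class; using integer descriptors and a lock to make the count updates atomic are assumptions.

```python
import io
import itertools
import threading


class RefCountedReader:
    """Hypothetical reader handing out descriptors and counting open readers."""

    def __init__(self, payload):
        self._payload = payload
        self._lock = threading.Lock()
        self._refCount = 0
        self._open_streams = {}
        self._ids = itertools.count(1)

    def open(self):
        with self._lock:
            self._refCount += 1
            descriptor = next(self._ids)
            # Each descriptor gets its own read position.
            self._open_streams[descriptor] = io.BytesIO(self._payload)
        return descriptor

    def read(self, descriptor, count=65536):
        return self._open_streams[descriptor].read(count)

    def close(self, descriptor):
        with self._lock:
            self._open_streams.pop(descriptor).close()
            self._refCount -= 1

    def isBeingRead(self):
        with self._lock:
            return self._refCount > 0
```

A forgotten close() leaves the count above zero indefinitely, which is why every open() needs a matching close().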

write(data: Union[bytes, memoryview], **kwargs)

Writes the given data into this DROP. This method is only meant to be called while the DROP is in the INITIALIZED or WRITING state; once the DROP is COMPLETED or beyond, only reading is allowed. The underlying storage mechanism is responsible for implementing the final writing logic via the self.writeMeta() method.

class dlg.data.drops.data_base.EndDROP(oid, uid, **kwargs)

A DROP that ends the session when reached

class dlg.data.drops.data_base.NullDROP(oid, uid, **kwargs)

A DROP that doesn’t store any data.

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

class dlg.data.drops.data_base.PathBasedDrop

Base class for data drops that handle paths (i.e., file and directory drops)

get_dir(dirname)

dirname will be based on the current working directory. If there is a session, it goes into the path as well (there should almost always be a session; one is expected to be missing only during testing).

Parameters

dirname – str

Returns

dir

dlg.data.drops.directorycontainer

class dlg.data.drops.directorycontainer.DirectoryContainer(oid, uid, **kwargs)

A ContainerDROP that represents a filesystem directory. It only allows FileDROPs and DirectoryContainers to be added as children. Children can only be added if they are placed directly within the directory represented by this DirectoryContainer.
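The "directly within the directory" rule amounts to comparing a child's parent directory with the container's own path. The helper below is a hypothetical illustration of that check; it normalises paths with os.path.abspath and does not resolve symbolic links.

```python
import os


def is_direct_child(parent_dir, child_path):
    """Hypothetical check: True only if child_path sits directly in parent_dir."""
    parent = os.path.abspath(parent_dir)
    child = os.path.abspath(child_path)
    # A direct child's containing directory is exactly the parent.
    return os.path.dirname(child) == parent
```

Under this check a file in a nested subdirectory is rejected, because its containing directory is the subdirectory and not the container itself.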

delete()

Deletes the data represented by this DROP.

exists()

Returns True if the data represented by this DROP actually exists in the underlying storage mechanism.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).
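The remove-after-use convention for kwargs can be sketched with dict.pop. BaseSketch and DirSketch are hypothetical classes, and the `dirname` argument is only an example of a subclass-specific option.

```python
class BaseSketch:
    """Hypothetical base class that forwards constructor kwargs to initialize."""

    def __init__(self, **kwargs):
        self.initialize(**kwargs)

    def initialize(self, **kwargs):
        # Whatever reaches the base class was not consumed by any subclass.
        self.leftover = dict(kwargs)


class DirSketch(BaseSketch):
    def initialize(self, **kwargs):
        # pop() both reads the argument and removes it from kwargs,
        # so the parent implementation never sees it.
        self.dirname = kwargs.pop("dirname", None)
        super().initialize(**kwargs)
```

Using kwargs.get() here instead of pop() would leave `dirname` in the dictionary for the parent to interpret a second time.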

dlg.data.drops.file

class dlg.data.drops.file.FileDROP(*args, **kwargs)

A DROP that points to data stored in a mounted filesystem.

Users can fix both the path and the name of a FileDROP using the filepath parameter of each FileDROP. We distinguish four cases and their combinations.

  1. If filepath is not specified, the filename will be generated.

  2. If it has a ‘/’ at the end, it will be treated as a directory name and the filename will be generated.

  3. If it does not end with a ‘/’ and it is not an existing directory, it is treated as dirname plus filename.

  4. If filepath points to an existing directory, the filename will be generated.

In all cases above, if filepath does not start with ‘/’ (relative path) then the session directory will be prepended to make the path absolute.
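The four cases and the session-directory rule above can be sketched as one function. resolve_filepath and its parameters are hypothetical; in particular the generated filename is passed in by the caller here, whereas how FileDROP generates it is not covered by this sketch.

```python
import os


def resolve_filepath(filepath, session_dir, generated_name):
    """Hypothetical resolver for the four filepath cases; POSIX paths assumed."""
    if not filepath:
        # Case 1: nothing given, so only the generated name is used.
        dirname, filename = "", generated_name
    elif filepath.endswith("/"):
        # Case 2: trailing slash marks a directory.
        dirname, filename = filepath, generated_name
    else:
        candidate = filepath
        if not os.path.isabs(candidate):
            candidate = os.path.join(session_dir, candidate)
        if os.path.isdir(candidate):
            # Case 4: an existing directory.
            dirname, filename = filepath, generated_name
        else:
            # Case 3: directory part plus explicit filename.
            dirname, filename = os.path.split(filepath)
    if not dirname.startswith("/"):
        # Relative paths are anchored at the session directory.
        dirname = os.path.join(session_dir, dirname)
    return os.path.join(dirname, filename)
```

For example, with a session directory of `/s`, the relative value `out/data.bin` resolves to `/s/out/data.bin`, while `out/` resolves to `/s/out/` followed by the generated name.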

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

delete()

Deletes the data represented by this DROP.

generate_reproduce_data()

Provides reproducibility data specific to this DROP. The default behaviour is to return nothing; per-class behaviour is achieved by overriding this method.

Returns

A dictionary containing runtime-exclusive reproducibility data.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

FileDROP-specific initialization.

sanitize_paths(filepath: str) Union[None, str]

Expands environment variables and deals with relative and absolute paths. filepath can be a directory, a directory including a file name, or just a file name; directories can be relative or absolute.

Parameters

filepath – string, path and/or directory

Returns

filepath
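Environment-variable expansion followed by anchoring of relative paths can be sketched with os.path. expand_and_anchor is a hypothetical helper, not the sanitize_paths implementation; returning None for an empty input and anchoring at a caller-supplied base directory are both assumptions.

```python
import os


def expand_and_anchor(filepath, base_dir):
    """Hypothetical helper: expand $VARS, then make relative paths absolute."""
    if not filepath:
        return None  # assumed meaning of the None return value
    expanded = os.path.expandvars(filepath)
    if not os.path.isabs(expanded):
        expanded = os.path.join(base_dir, expanded)
    return expanded
```

Expansion has to happen before the absolute-path test, because a value such as `$DATA_DIR/x.bin` only becomes absolute once the variable is substituted.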

setCompleted()

Overrides the parent method so that the size of the drop is set once it is completed.

dlg.data.drops.memory

class dlg.data.drops.memory.InMemoryDROP(*args, **kwargs)

A DROP that points to data stored in memory.

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

generate_reproduce_data()

Provides reproducibility data specific to this DROP. The default behaviour is to return nothing; per-class behaviour is achieved by overriding this method.

Returns

A dictionary containing runtime-exclusive reproducibility data.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).

class dlg.data.drops.memory.SharedMemoryDROP(oid, uid, **kwargs)

A DROP that points to data stored in shared memory. This drop is functionally equivalent to an InMemoryDROP running in a concurrent environment. In this case, however, the requirement for shared memory is explicit.

Warning: currently implemented as writing to shmem, and there is no backup behaviour.

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).

dlg.data.drops.ngas

class dlg.data.drops.ngas.NgasDROP(oid, uid, **kwargs)

A DROP that points to data stored in an NGAS server

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

generate_reproduce_data()

Provides reproducibility data specific to this DROP. The default behaviour is to return nothing; per-class behaviour is achieved by overriding this method.

Returns

A dictionary containing runtime-exclusive reproducibility data.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).

setCompleted()

Overrides the parent method so that the size of the drop is set once it is completed.

dlg.data.drops.plasma

class dlg.data.drops.plasma.PlasmaDROP(oid, uid, **kwargs)

A DROP that points to data stored in a Plasma Store

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).

class dlg.data.drops.plasma.PlasmaFlightDROP(oid, uid, **kwargs)

A DROP that points to data stored in a Plasma Store

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).

dlg.data.drops.rdbms

class dlg.data.drops.rdbms.RDBMSDrop(oid, uid, **kwargs)

A Drop that stores data in a table of a relational database

property dataURL: str

A URL that points to the data referenced by this DROP. Different DROP implementations will use different URI schemes.

generate_reproduce_data()

Provides reproducibility data specific to this DROP. The default behaviour is to return nothing; per-class behaviour is achieved by overriding this method.

Returns

A dictionary containing runtime-exclusive reproducibility data.

getIO()

Returns an instance of one of the dlg.data.io.DataIO classes that handles the data contents of this DROP.

initialize(**kwargs)

Performs any specific subclass initialization.

kwargs contains all the keyword arguments given at construction time, except those used by the constructor itself. Implementations of this method should remove arguments from the kwargs dictionary once they have been interpreted, so that they are not accidentally interpreted again by other implementations of this method in the call hierarchy (a subclass implementation usually calls its parent’s implementation).

insert(vals: dict)

Inserts the values contained in the vals dictionary into the underlying table. The keys of vals are used as the column names.

select(columns=None, condition=None, vals=())

Returns the selected values from the table. Users can constrain the result set by specifying a list of columns to be returned (otherwise all table columns are returned) and a condition to be applied, in which case a list of vals to be applied as query parameters can also be given.
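The insert and select semantics can be sketched against an in-memory SQLite database. sketch_insert and sketch_select are hypothetical free functions that take the connection and table name explicitly; how RDBMSDrop obtains its connection and table is not shown, and the `?` placeholder style is specific to sqlite3.

```python
import sqlite3


def sketch_insert(conn, table, vals):
    """Hypothetical: the dict keys become column names, the values are bound."""
    columns = ", ".join(vals.keys())
    placeholders = ", ".join("?" for _ in vals)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    with conn:  # commits on success, rolls back on error
        conn.execute(sql, tuple(vals.values()))


def sketch_select(conn, table, columns=None, condition=None, vals=()):
    """Hypothetical: optional column list and condition with bound parameters."""
    column_list = ", ".join(columns) if columns else "*"
    sql = f"SELECT {column_list} FROM {table}"
    if condition:
        # The condition text carries placeholders; vals supplies their values.
        sql += f" WHERE {condition}"
    return conn.execute(sql, vals).fetchall()
```

Identifiers such as table and column names cannot be bound as parameters, so they are interpolated into the SQL text and must come from a trusted source; only the values go through placeholders.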

dlg.data.drops.json_drop

A DROP for a JSON file

dlg.data.drops.s3_drop

Drops that interact with S3