Graph Certification
‘Certifying’ a graph involves generating and publishing reproducibility signatures. These signatures can be integrated into a CI/CD pipeline, used to verify executions, or consulted during late-stage development when fine-tuning graphs.
By producing and sharing these signatures, subsequent changes to the execution environment, processing components, overall graph design and data artefacts can be tested easily and efficiently.
Certifying a Graph
The process of generating and storing workflow signatures is relatively straightforward.
1. From the root of the graph-storing directory (usually a repository), create a reprodata/[GRAPH_NAME] directory.
2. Run the graph with the ALL reproducibility flag, and move the produced reprodata.out file to the previously created directory.
3. (Optional) Run the dlg.common.reproducibility.reprodata_compare.py script with this file as input to generate a summary CSV file.
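The first two steps above can be sketched as follows. This is a minimal illustration, assuming the graph has already been executed with the ALL flag and has produced a reprodata.out file; the `certify` helper and its arguments are hypothetical names, not part of the DALiuGE API.

```python
import shutil
from pathlib import Path

def certify(repo_root: str, graph_name: str, reprodata_out: str) -> Path:
    """Publish a freshly generated reprodata.out as the stored
    signature for graph_name. Running the graph itself (with the
    ALL reproducibility flag) is not shown here."""
    # Create reprodata/[GRAPH_NAME] under the repository root.
    target_dir = Path(repo_root) / "reprodata" / graph_name
    target_dir.mkdir(parents=True, exist_ok=True)
    # Move the produced reprodata.out into the published location.
    target = target_dir / "reprodata.out"
    shutil.move(reprodata_out, target)
    return target
```

The returned path is what later comparison runs treat as the published signature.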
In subsequent executions or during CI/CD scripts:
* Note the reprodata.out file generated during the test execution
* Run the dlg.common.reproducibility.reprodata_compare.py script with the published reprodata/[GRAPH_NAME] directory and the newly generated signature file
* The resulting [SESSION_NAME]-comparison.csv will contain a simple True/False summary for each RMode, for use at your discretion.
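A CI job can act on that summary by failing the build when a required RMode does not match. The sketch below assumes the comparison CSV has a header row with one column per RMode and 'True'/'False' cell values; the exact column names in the file produced by reprodata_compare.py may differ, so treat `required` as a placeholder.

```python
import csv

def check_comparison(csv_path, required=("Rerun", "Repeat", "Reproduction")):
    """Return True only if every required RMode column reads 'True'
    in every row of the [SESSION_NAME]-comparison.csv summary."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return False  # an empty summary is treated as a failure
    return all(row.get(mode, "").strip() == "True"
               for row in rows
               for mode in required)
```

In a CI script, a False result would typically translate into a non-zero exit code.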
What is to be expected?
In general, all but the Recomputation and Replicate_Computational rmodes should match. Moreover:
* A failed Rerun indicates some fundamental structure is different
* A failed Repeat indicates changes to component parameters or a different execution scale
* A failed Recomputation indicates some runtime environment changes have been made
* A failed Reproduction indicates data artefacts have changed
* A failed Scientific Replication indicates a change in data artefacts or fundamental structure
* A failed Computational Replication indicates a change in data artefacts or runtime environment
* A failed Total Replication indicates a change in data artefacts, component parameters or a different execution scale
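The table of failure causes above can be encoded as a small lookup that turns per-RMode match results into human-readable diagnostics. The `diagnose` helper and its input shape (`{rmode: bool}`) are illustrative, not part of DALiuGE.

```python
# Likely causes when a given RMode signature fails to match,
# following the list above.
FAILURE_HINTS = {
    "Rerun": "fundamental graph structure differs",
    "Repeat": "component parameters changed or execution scale differs",
    "Recomputation": "runtime environment changed",
    "Reproduction": "data artefacts changed",
    "Scientific Replication": "data artefacts or fundamental structure changed",
    "Computational Replication": "data artefacts or runtime environment changed",
    "Total Replication": "data artefacts, component parameters or execution scale changed",
}

def diagnose(results):
    """Given {rmode: matched?} results, list likely causes for each failure."""
    return [f"{mode}: {FAILURE_HINTS.get(mode, 'unknown cause')}"
            for mode, ok in results.items() if not ok]
```

For example, a failed Repeat alongside a passing Rerun points at parameter or scale changes rather than structural ones.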
When attempting to re-create a known graph-derived result, Replication is the goal.
In an operational context, where data changes constantly, Rerunning is the goal.
When conducting science across multiple trials, Repeating is necessary to use the derived data artefacts in concert.
Tips on Making Graphs Robust
The most common ‘brittle’ aspects of graphs are hard-coded paths to data resources and access to referenced data. These issues can be ameliorated by:
* Using the $DLG_ROOT keyword in component parameters as a base path
* Providing comments on where to find referenced data artefacts
* Providing instructions on how to build referenced runtime libraries (in the case of Dynlib drops).
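To illustrate the base-path idea: a component parameter written as `$DLG_ROOT/...` stays portable because the keyword is resolved against the local deployment rather than one machine's filesystem. The snippet below mimics that resolution with `os.path.expandvars`; the actual expansion inside DALiuGE is handled by the runtime, and the paths shown are purely illustrative.

```python
import os

# Ensure an illustrative value exists if DLG_ROOT is not already set.
os.environ.setdefault("DLG_ROOT", "/home/user/dlg")

# A portable component parameter, anchored at $DLG_ROOT instead of
# a machine-specific absolute path.
param = "$DLG_ROOT/testdata/input.ms"
resolved = os.path.expandvars(param)
```

A hard-coded `/home/alice/testdata/input.ms` would break on any other machine; the `$DLG_ROOT` form only assumes the deployment defines that root.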