Graph Certification
‘Certifying’ a graph involves generating and publishing reproducibility signatures. These signatures can be integrated into a CI/CD pipeline, used for verification during executions, or used during late-stage development when fine-tuning graphs.
By producing and sharing these signatures, subsequent changes to the execution environment, processing components, overall graph design and data artefacts can be tested easily and efficiently.
Certifying a Graph
The process of generating and storing workflow signatures is relatively straightforward.
* From the root of the graph-storing directory (usually a repository), create a /reprodata/[GRAPH_NAME] directory.
* Run the graph with the ALL reproducibility flag, and move the produced reprodata.out file to the previously created directory.
* (optional) Run the dlg.common.reproducibility.reprodata_compare.py script with this file as input to generate a summary CSV file.
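The first two steps amount to creating a directory and moving a file into it. The following is a minimal sketch using only the Python standard library; the function name publish_signature and the example graph name are hypothetical and are not part of DALiuGE.

```python
import shutil
import tempfile
from pathlib import Path


def publish_signature(repo_root, graph_name, signature_file):
    """Move a signature file into <repo_root>/reprodata/<graph_name>/.

    Hypothetical helper for illustration; not part of the dlg package.
    """
    target_dir = Path(repo_root) / "reprodata" / graph_name
    # Create the directory (and any missing parents) if it does not exist yet.
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / Path(signature_file).name
    shutil.move(str(signature_file), str(destination))
    return destination


if __name__ == "__main__":
    # Demonstrate with a throwaway directory and a placeholder signature file.
    with tempfile.TemporaryDirectory() as tmp:
        produced = Path(tmp) / "reprodata.out"
        produced.write_text("{}")
        print(publish_signature(Path(tmp) / "repo", "my_graph", produced))
```

Committing the resulting directory alongside the graph keeps the published signature versioned together with the graph it certifies.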
In subsequent executions or during CI/CD scripts:
* Note the reprodata.out file generated during the test execution
* Run dlg.common.reproducibility.reprodata_compare.py with the published reprodata/[GRAPH_NAME] directory and the newly generated signature file
* The resulting [SESSION_NAME]-comparison.csv will contain a simple True/False summary for each RMode, for use at your discretion.
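A CI job could turn the True/False summary into a pass/fail decision. The exact layout of the comparison file is not described here, so the sketch below assumes a header row of rmode names followed by rows of True/False values. The column names, the sample data, and both function names are assumptions made for illustration.

```python
import csv
import io

# Modes that may legitimately differ between runs (see "What is to be expected?").
TOLERATED = {"Recomputation", "Replicate_Computational"}


def failed_rmodes(csv_text):
    """Return the sorted column names holding 'False' in any row.

    Assumes a header row of rmode names; this layout is a guess, so adapt
    the parsing to the file actually produced.
    """
    failed = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        for column, value in row.items():
            if isinstance(value, str) and value.strip().lower() == "false":
                failed.add(column)
    return sorted(failed)


def unexpected_failures(csv_text, tolerated=TOLERATED):
    """Drop the modes that are allowed to differ and return the rest."""
    return [mode for mode in failed_rmodes(csv_text) if mode not in tolerated]


sample = (
    "Rerun,Repeat,Recomputation,Reproduction\n"
    "True,True,False,False\n"
)
print(unexpected_failures(sample))  # ['Reproduction']
```

In a pipeline, a non-empty result would typically be mapped to a non-zero exit code so that the job fails.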
What is to be expected?
In general, all rmodes except Recomputation and Replicate_Computational should match. Moreover:
* A failed Rerun indicates some fundamental structure is different
* A failed Repeat indicates changes to component parameters or a different execution scale
* A failed Recomputation indicates some runtime environment changes have been made
* A failed Reproduction indicates data artefacts have changed
* A failed Scientific Replication indicates a change in data artefacts or fundamental structure
* A failed Computational Replication indicates a change in data artefacts or runtime environment
* A failed Total Replica indicates a change in data artefacts, component parameters or different execution scale
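The list above can be read as a lookup table from failed rmode to candidate causes. The sketch below encodes it and adds a rough narrowing heuristic: causes associated with a passing mode are treated as ruled out. That heuristic is an assumption made for illustration, not documented behaviour of the toolkit, and the function name is hypothetical.

```python
# Candidate causes per failed rmode, transcribed from the list above.
CAUSES = {
    "Rerun": {"fundamental structure"},
    "Repeat": {"component parameters", "execution scale"},
    "Recomputation": {"runtime environment"},
    "Reproduction": {"data artefacts"},
    "Scientific Replication": {"data artefacts", "fundamental structure"},
    "Computational Replication": {"data artefacts", "runtime environment"},
    "Total Replica": {"data artefacts", "component parameters", "execution scale"},
}


def candidate_causes(failed, passed=()):
    """Union the causes of failed modes, minus causes tied to passing modes.

    Subtracting the causes of passing modes is a heuristic assumption.
    """
    suspected = set()
    for mode in failed:
        suspected |= CAUSES[mode]
    ruled_out = set()
    for mode in passed:
        ruled_out |= CAUSES[mode]
    return sorted(suspected - ruled_out)


# Scientific Replication failed while Rerun passed: the structure is
# unchanged, which leaves the data artefacts as the likely difference.
print(candidate_causes(["Scientific Replication"], passed=["Rerun"]))
# ['data artefacts']
```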
When attempting to re-create some known graph-derived result, Replication is the goal.
In an operational context, where data changes constantly, Rerunning is the goal.
When conducting science across multiple trials, Repeating is necessary to use the derived data artefacts in concert.
Tips on Making Graphs Robust
The most common ‘brittle’ aspects of graphs are hard-coded paths to data resources and access to referenced data. These can be ameliorated by:
* Using the $DLG_ROOT keyword in component parameters as a base path.
* Providing comments on where to find referenced data artefacts.
* Providing instructions on how to build referenced runtime libraries (in the case of Dynlib drops).
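To illustrate why a base-path keyword makes a graph portable, the sketch below substitutes $DLG_ROOT into a parameter value using Python's string.Template. How DALiuGE itself resolves the keyword is not shown here. The function name and the example paths are hypothetical.

```python
from string import Template


def resolve_path(parameter, dlg_root):
    """Substitute $DLG_ROOT in a parameter string (illustrative only)."""
    # safe_substitute leaves any other $placeholders untouched.
    return Template(parameter).safe_substitute(DLG_ROOT=dlg_root)


# The same parameter resolves correctly on machines with different roots.
print(resolve_path("$DLG_ROOT/data/input.ms", "/home/alice/dlg"))
# /home/alice/dlg/data/input.ms
print(resolve_path("$DLG_ROOT/data/input.ms", "/scratch/dlg"))
# /scratch/dlg/data/input.ms
```

A hard-coded value such as /home/alice/dlg/data/input.ms would only work on one machine, whereas the keyword form moves with the installation.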