Graph Certification
‘Certifying’ a graph involves generating and publishing reproducibility signatures. These signatures can be integrated into a CI/CD pipeline, used to verify executions, or consulted during late-stage development when fine-tuning graphs.
By producing and sharing these signatures, subsequent changes to the execution environment, processing components, overall graph design and data artefacts can be tested easily and efficiently.
Certifying a Graph
The process of generating and storing workflow signatures is relatively straightforward.
1. From the root of the graph-storing directory (usually a repository), create a reprodata/[GRAPH_NAME] directory.
2. Run the graph with the ALL reproducibility flag, and move the produced reprodata.out file to the previously created directory.
3. (Optional) Run the dlg.common.reproducibility.reprodata_compare.py script with this file as input to generate a summary CSV file.
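The first two steps above can be sketched as follows. This is a minimal illustration, assuming the graph has already been executed with the ALL flag and has produced a reprodata.out file; the `certify` helper and its arguments are hypothetical names, not part of the DALiuGE API.

```python
import shutil
from pathlib import Path

def certify(repo_root: str, graph_name: str, reprodata_out: str) -> Path:
    """Publish a freshly generated reprodata.out as the stored
    signature for graph_name. Running the graph itself (with the
    ALL reproducibility flag) is not shown here."""
    # Create reprodata/[GRAPH_NAME] under the repository root.
    target_dir = Path(repo_root) / "reprodata" / graph_name
    target_dir.mkdir(parents=True, exist_ok=True)
    # Move the produced reprodata.out into the published location.
    target = target_dir / "reprodata.out"
    shutil.move(reprodata_out, target)
    return target
```

The returned path is what later comparison runs treat as the published signature.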
In subsequent executions or during CI/CD scripts:
* Note the reprodata.out file generated during the test execution
* Run the dlg.common.reproducibility.reprodata_compare.py script with the published reprodata/[GRAPH_NAME] directory and the newly generated signature file
* The resulting [SESSION_NAME]-comparison.csv will contain a simple True/False summary for each RMode, for use at your discretion.
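A CI job can act on that summary by failing the build when a required RMode does not match. The sketch below assumes the comparison CSV has a header row with one column per RMode and 'True'/'False' cell values; the exact column names in the file produced by reprodata_compare.py may differ, so treat `required` as a placeholder.

```python
import csv

def check_comparison(csv_path, required=("Rerun", "Repeat", "Reproduction")):
    """Return True only if every required RMode column reads 'True'
    in every row of the [SESSION_NAME]-comparison.csv summary."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return False  # an empty summary is treated as a failure
    return all(row.get(mode, "").strip() == "True"
               for row in rows
               for mode in required)
```

In a CI script, a False result would typically translate into a non-zero exit code.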
What is to be expected?
In general, all but the Recomputation and Replicate_Computational rmodes should match. Moreover:
* A failed Rerun indicates some fundamental structure is different
* A failed Repeat indicates changes to component parameters or a different execution scale
* A failed Recomputation indicates some runtime environment changes have been made
* A failed Reproduction indicates data artefacts have changed
* A failed Scientific Replication indicates a change in data artefacts or fundamental structure
* A failed Computational Replication indicates a change in data artefacts or runtime environment
* A failed Total Replication indicates a change in data artefacts, component parameters or a different execution scale
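The table of failure causes above can be encoded as a small lookup that turns per-RMode match results into human-readable diagnostics. The `diagnose` helper and its input shape (`{rmode: bool}`) are illustrative, not part of DALiuGE.

```python
# Likely causes when a given RMode signature fails to match,
# following the list above.
FAILURE_HINTS = {
    "Rerun": "fundamental graph structure differs",
    "Repeat": "component parameters changed or execution scale differs",
    "Recomputation": "runtime environment changed",
    "Reproduction": "data artefacts changed",
    "Scientific Replication": "data artefacts or fundamental structure changed",
    "Computational Replication": "data artefacts or runtime environment changed",
    "Total Replication": "data artefacts, component parameters or execution scale changed",
}

def diagnose(results):
    """Given {rmode: matched?} results, list likely causes for each failure."""
    return [f"{mode}: {FAILURE_HINTS.get(mode, 'unknown cause')}"
            for mode, ok in results.items() if not ok]
```

For example, a failed Repeat alongside a passing Rerun points at parameter or scale changes rather than structural ones.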
When attempting to re-create a known graph-derived result, Replication is the goal.
In an operational context, where data changes constantly, Rerunning is the goal.
When conducting science across multiple trials, Repeating is necessary to use the derived data artefacts in concert.
Tips on Making Graphs Robust
The most common ‘brittle’ aspects of graphs are hard-coded paths to data resources and access to referenced data. These issues can be ameliorated by:
* Using the $DLG_ROOT keyword in component parameters as a base path
* Providing comments on where to find referenced data artefacts
* Providing instructions on how to build referenced runtime libraries (in the case of Dynlib drops).
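To illustrate the base-path idea: a component parameter written as `$DLG_ROOT/...` stays portable because the keyword is resolved against the local deployment rather than one machine's filesystem. The snippet below mimics that resolution with `os.path.expandvars`; the actual expansion inside DALiuGE is handled by the runtime, and the paths shown are purely illustrative.

```python
import os

# Ensure an illustrative value exists if DLG_ROOT is not already set.
os.environ.setdefault("DLG_ROOT", "/home/user/dlg")

# A portable component parameter, anchored at $DLG_ROOT instead of
# a machine-specific absolute path.
param = "$DLG_ROOT/testdata/input.ms"
resolved = os.path.expandvars(param)
```

A hard-coded `/home/alice/testdata/input.ms` would break on any other machine; the `$DLG_ROOT` form only assumes the deployment defines that root.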