Registering a dataset

Users start the dataset registration process by logging into Provena. They then select Data Store from the side menu and click the Register Dataset button.

When registering a dataset with Provena’s Data Store, you must first complete a metadata record. The requested inputs are listed below; make sure you have them ready before filling out the form. After you submit the form, the system generates a persistent unique identifier that can be used similarly to a Digital Object Identifier (DOI). Metadata records facilitate the sharing and discovery of data.


Filling out form fields

The user-entered metadata fields are listed below. Some are pre-populated; others are selected or searched for with the help of form widgets.

Record Creator Organisation

  • Record Creator Organisation*: The registered Organisation which is registering the data. This is searchable by typing the Organisation’s name in the search bar or manually entering the known ID of the Organisation.
  • Dataset Custodian: The registered Person who could best be described as the custodian of this data. This is searchable by typing the Person’s name in the search bar or manually entering the known ID of the Person.
  • Point of Contact: Please provide a point of contact for enquiries about this data e.g. email address. Please ensure you have sought consent to include these details in the record.

Dataset Approvals

Warning! The Dataset Approvals section must be carefully considered by the registrant of the data. If you believe the dataset is subject to any of the below concerns, but the necessary consents, approvals or permissions have not been granted and/or provided, the dataset should not be registered in Provena's Data Store. Feel free to contact us if you are uncertain about registering a dataset.

  • Dataset Registration Ethics and Privacy*: Does this dataset include any human data or require ethics/privacy approval for its registration? If so, have you included any required ethics approvals, consent from the participants and/or appropriate permissions to register this dataset in this information system? Required consents or permissions can be reposited as part of the dataset files where appropriate.
    • Subject to ethics and privacy concerns for registration?: Use the tick box to specify whether this dataset is subject to ethical and privacy concerns for registration.
    • Necessary consents and permissions required?: This tick box will only appear if the dataset is subject to the aforementioned concerns. If you have not acquired the necessary consents and permissions, the dataset should not be registered and submission will fail.
  • Dataset Access Ethics and Privacy*: Does this dataset include any human data or require ethics/privacy approval for enabling its access by users of the information system? If so, have you included any required consent from the participants and/or appropriate permissions to facilitate access to this dataset in this information system? Required consents or permissions can be reposited as part of the dataset files where appropriate.
    • Subject to ethics and privacy concerns for data access?: Use the tick box to specify whether this dataset is subject to ethical and privacy concerns for data access.
    • Necessary consents and permissions required?: This tick box will only appear if the dataset is subject to the aforementioned concerns. If you have not acquired the necessary consents and permissions, the dataset should not be registered and submission will fail.
  • Indigenous Knowledge and Consent*: Does this dataset contain Indigenous Knowledge? If so, do you have consent from the relevant Indigenous communities for its use and access via this data store?
    • Contains Indigenous Knowledge?: Use the tick box to specify whether this dataset contains Indigenous Knowledge.
    • Necessary permission acquired?: This tick box will only appear if the dataset contains Indigenous Knowledge. If you have not acquired the necessary consents and permissions, the dataset should not be registered and submission will fail.
  • Export Controls*: Is this dataset subject to any export controls permits? If so, has this dataset cleared any required due diligence checks and have you obtained any required permits?
    • Subject to export controls?: Use the tick box to specify whether this dataset is subject to export controls.
    • Cleared due diligence checks and obtained required permits?: This tick box will only appear if the dataset was marked as subject to export controls. If you have not performed the necessary due diligence checks and acquired the relevant permits, the dataset should not be registered and submission will fail.
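The rule shared by all four approvals can be sketched as follows: a dataset is blocked whenever a concern applies but the matching consent, permission or permit has not been obtained. This is only an illustration of that rule; the `Approval` class, the `blocking_approvals` function and the dictionary keys are hypothetical names and are not part of Provena.

```python
from dataclasses import dataclass


@dataclass
class Approval:
    """One approvals item: does the concern apply, and was it cleared?"""
    relevant: bool          # first tick box, e.g. "Subject to export controls?"
    obtained: bool = False  # second tick box, only shown when relevant is True


def blocking_approvals(approvals: dict[str, Approval]) -> list[str]:
    """Return the names of approvals that would stop registration."""
    return [
        name
        for name, item in approvals.items()
        if item.relevant and not item.obtained
    ]


# Hypothetical keys standing in for the four approvals described above.
approvals = {
    "ethics_registration": Approval(relevant=False),
    "ethics_access": Approval(relevant=True, obtained=True),
    "indigenous_knowledge": Approval(relevant=True, obtained=False),
    "export_controls": Approval(relevant=False),
}

print(blocking_approvals(approvals))  # ['indigenous_knowledge']
```

An empty list means none of the approvals would prevent submission under this reading of the rule.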

Dataset Information

  • Dataset name*: A title to identify the dataset well enough to disambiguate it from other datasets, e.g. “Coral reef locations with turtle activity in the Capricorn Group (Great Barrier Reef)”.
  • Dataset description*: Short description of the dataset. This should include the nature of the data, the intended usage, and any other relevant information. For example: The dataset Coral reef locations in the Capricorn Group (Great Barrier Reef) contains polygons of 150 reefs and islands that have turtle activity. The data was collected from satellite and survey information. Please see the readme.txt file for details on data processing steps undertaken. The data was obtained as part of the Reef Turtle monitoring program.
  • Access Info*: Provides information about whether the dataset files will be stored in the Data Store, or hosted externally. Externally hosted datasets can be described and registered to enable data and activity provenance without enabling file upload or download. Use the checkbox “Store data in the Provena Data Store” to toggle this setting (checked indicates that the dataset is to be stored on the Provena Data Store, unchecked indicates that the dataset is stored externally). If the data is externally hosted, you must provide two additional fields:
    • URI: Provide a valid RFC 3986 URI which describes the location of the data. You should provide information about how to use this URI to access the data in the description below. Examples of valid URIs include: http://website.com/file/path, https://website.com/file/path, ftp://ftp.server.com/file/path, file:///path/to/file.
    • Description: Provide a description of how the above URI can be used to access the dataset files.
  • Publisher*: The registered Organisation which published or produced the data. This is searchable by typing the Organisation’s name in the search bar.
  • Dataset creation date*: The date on which this version of the dataset was produced or generated.
  • Dataset publish date*: The date on which this version of the dataset was first published. If the data has never been published before, please select today’s date.
  • Usage licence*: Select a licence from the dropdown list. The default will be ‘Copyright’. A list of licences is available here.
  • Dataset purpose: A brief description of the reason a data asset was created. Should be a good guide to the potential usefulness of a data asset to other users.
  • Dataset rights holder: Specify the party owning or managing rights over the resource. Please ensure you have sought consent to include these details in the record.
  • Usage limitations: A statement that provides information on any caveats or restrictions on access or on the use of the data asset, including legal, security, privacy, commercial or other limitations.
  • Preferred Citation: Optionally specify a citation which users of this dataset should use when referencing this dataset. To provide a preferred citation, tick the “Provide preferred citation” checkbox and enter your citation into the textfield.
  • Spatial Information: If your dataset includes spatial data, you can provide more information about the extent and resolution of this data.
    • Spatial Coverage: The geographic area applicable to the data asset. Please specify spatial coverage using the EWKT format.
    • Spatial Resolution: The spatial resolution applicable to the data asset. Please use the Decimal Degrees standard.
    • Spatial Extent: The range of spatial coordinates applicable to the data asset. Please provide a bounding box extent using the EWKT format.
  • Temporal Information: If your dataset includes data spanning a period of time, you can provide more information about the duration and resolution of this data.
    • Temporal Duration: The start and end date of the time period applicable to the data asset (note that a start and end date both must be provided if a temporal duration is to be specified).
    • Temporal Resolution: The temporal resolution (i.e. time step) of the data. Please use the ISO8601 duration format e.g. “P1Y2M10DT2H30M”.
  • Dataset File Formats: What file formats are present in this dataset? E.g. “pdf”, “csv” etc. You can use the plus and minus symbol to add and remove file formats.
  • Keywords: List of keywords which describe the dataset. These keywords are searchable.
  • Custom User Metadata: If you would like to include additional custom annotations to describe your dataset, you can do so here. Please tick “Include Custom User Metadata” and then click the “Add a new entry” plus icon to add a row. Your metadata is composed of a set of key value pairs. Click and enter a key, and value, for example “my_special_dataset_id” and “1234”. You can add another row using the plus icon on the right, or remove an entry using the red minus sign on the right. To remove all custom metadata, untick the “Include Custom User Metadata” box.

* Denotes required field.
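Several of the fields above expect a specific text format (an RFC 3986 URI, an ISO 8601 duration, an EWKT geometry). The following is a minimal sketch of how such values could be checked or built before filling out the form. The function names are hypothetical, the checks are deliberately loose rather than full validators, and the use of SRID 4326 and the example coordinates are assumptions for illustration only.

```python
import re
from urllib.parse import urlsplit

# ISO 8601 duration in the PnYnMnDTnHnMnS form, e.g. "P1Y2M10DT2H30M".
# The week form ("P2W") and fractional values are deliberately not covered.
_DURATION = re.compile(
    r"P(?=\d|T\d)(?:\d+Y)?(?:\d+M)?(?:\d+D)?"
    r"(?:T(?=\d)(?:\d+H)?(?:\d+M)?(?:\d+S)?)?"
)


def is_iso8601_duration(text: str) -> bool:
    """Loose check for the Temporal Resolution format."""
    return _DURATION.fullmatch(text) is not None


def looks_like_uri(text: str) -> bool:
    """Loose check: a scheme plus an authority or a path must be present."""
    parts = urlsplit(text)
    return bool(parts.scheme) and bool(parts.netloc or parts.path)


def bbox_to_ewkt(min_x: float, min_y: float, max_x: float, max_y: float,
                 srid: int = 4326) -> str:
    """Build an EWKT polygon for a bounding box (SRID 4326 is an assumption)."""
    if min_x > max_x or min_y > max_y:
        raise ValueError("minimum coordinates must not exceed maximum coordinates")
    ring = [(min_x, min_y), (max_x, min_y), (max_x, max_y),
            (min_x, max_y), (min_x, min_y)]  # closed ring: first point repeated
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"SRID={srid};POLYGON(({coords}))"


print(is_iso8601_duration("P1Y2M10DT2H30M"))   # True
print(looks_like_uri("file:///path/to/file"))  # True
print(looks_like_uri("website.com/file/path")) # False (no scheme)
print(bbox_to_ewkt(151.0, -24.0, 152.5, -23.0))
```

The form itself remains the authority on what is accepted; a check like this only helps catch obvious formatting mistakes early.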


Auto generated metadata fields

After you Submit the newly registered dataset, Provena will mint a Handle and allocate a directory in the Provena online data storage. The Handle identifier and data directory path will be included in the metadata record.

Auto generated dataset details

  • Handle: Persistent identifier [Auto generated]
  • URL: Path to the online dataset [Auto generated]

Missing fields

After clicking Submit, if a required field has not been completed, a popup will appear indicating that important information is missing.

You will not be able to progress unless all required fields are entered.
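If you prepare your metadata outside the form first, a small check like the one below can flag gaps before you start. It is a sketch only: the dictionary keys simply reuse the field labels marked with * in this page, `missing_required` is a hypothetical helper, and the approvals tick boxes are left out because they are yes/no values rather than text.

```python
# Field labels marked as required (*) in the form, excluding the approvals.
REQUIRED_FIELDS = (
    "Record Creator Organisation",
    "Dataset name",
    "Dataset description",
    "Access Info",
    "Publisher",
    "Dataset creation date",
    "Dataset publish date",
    "Usage licence",
)


def missing_required(record: dict) -> list[str]:
    """Return required field labels that are absent or blank in the record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(field)
    return missing


# Illustrative draft record with some fields still to be filled in.
draft = {
    "Record Creator Organisation": "Example Organisation",
    "Dataset name": "Coral reef locations",
    "Dataset description": "   ",
    "Publisher": "Example Organisation",
    "Usage licence": "Copyright",
}

print(missing_required(draft))
# ['Dataset description', 'Access Info', 'Dataset creation date', 'Dataset publish date']
```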


Usage licence

Select the appropriate licence from the dropdown list. There are ten licences to choose from. For details of each licence, please see the Licenses page.


What happens during the dataset minting process?

A Handle identifier is minted for each dataset that is registered and is associated with the dataset metadata. This minted identifier can be used to persistently locate the dataset in the future. See Digital Object Identifiers for further details.
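Handles generally take the form prefix/suffix and can usually be resolved through the public Handle.Net proxy at hdl.handle.net. The sketch below turns a Handle into such a resolver link. It assumes that Provena Handles resolve through that public proxy, which this page does not confirm, and both the `handle_to_url` helper and the example Handle are hypothetical.

```python
import re

# Public Handle.Net proxy (assumed to apply to Provena Handles).
HANDLE_PROXY = "https://hdl.handle.net/"

# A Handle is a prefix and a suffix separated by the first slash.
_HANDLE = re.compile(r"[^/\s]+/\S+")


def handle_to_url(handle: str) -> str:
    """Return a resolver URL for a Handle of the form prefix/suffix."""
    handle = handle.strip()
    if not _HANDLE.fullmatch(handle):
        raise ValueError(f"not a prefix/suffix Handle: {handle!r}")
    return HANDLE_PROXY + handle


# Hypothetical Handle, for illustration only.
print(handle_to_url("12345.6/example-dataset"))
# https://hdl.handle.net/12345.6/example-dataset
```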


File types and maximum file size

The Provena Data Store can store a variety of data files, e.g. text, CSV, netCDF, Word documents, images and video. Users can upload files or folders using the AWS web console (GUI), the AWS command line interface (AWS CLI) or a program like WinSCP.

While the AWS CLI can handle large files (>100GB) and the AWS Console GUI can handle up to 160GB uploads, please contact the Provena team if you know you will be uploading large or numerous files. For technical information about the storage limitations of the S3 service (which the data store is built on) you can review the AWS FAQ here.

The maximum file size that can be uploaded is 5TB.

To upload files, see Uploading a dataset.
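The size limits mentioned above can be sketched as a simple decision helper. This is an illustration only: `suggest_upload_method` is a hypothetical function, the thresholds are taken from this page (160GB for the AWS Console GUI, 5TB maximum file size), and treating GB and TB as decimal units is an assumption.

```python
GB = 10**9   # decimal gigabyte (assumption)
TB = 10**12  # decimal terabyte (assumption)

CONSOLE_LIMIT = 160 * GB  # AWS Console GUI upload limit stated above
MAX_FILE_SIZE = 5 * TB    # maximum file size stated above


def suggest_upload_method(size_bytes: int) -> str:
    """Suggest an upload route for a single file of the given size."""
    if size_bytes < 0:
        raise ValueError("size must not be negative")
    if size_bytes > MAX_FILE_SIZE:
        return "too large: split the file or contact the Provena team"
    if size_bytes > CONSOLE_LIMIT:
        return "AWS CLI"
    return "AWS Console GUI or AWS CLI"


print(suggest_upload_method(2 * GB))    # AWS Console GUI or AWS CLI
print(suggest_upload_method(500 * GB))  # AWS CLI
print(suggest_upload_method(6 * TB))    # too large: split the file or contact the Provena team
```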