Datasets#

The Encord SDK allows you to interact with the datasets you have added to Encord. The attributes and member functions of the Dataset class can be found here: Dataset.

Each dataset can have a list of DataRows which are the individual videos, image groups, images, or DICOM series within the Dataset.

Below, you can find tutorials on how to interact with your datasets when you have associated a public-private key pair with Encord.

Creating a dataset#

To create a dataset, first select where your data will be hosted by choosing the appropriate StorageLocation. The following example creates a dataset called “Example Title” that expects data hosted on AWS S3. If you simply wish to upload your data from local storage to Encord, CORD_STORAGE is the appropriate choice.
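A minimal sketch of the call described above, assuming the create_dataset method on EncordUserClient and the StorageLocation enum exposed by the SDK:

```python
from encord import EncordUserClient
from encord.orm.dataset import StorageLocation

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

# Create a dataset that expects data hosted on AWS S3;
# use StorageLocation.CORD_STORAGE to upload from local storage instead
dataset = user_client.create_dataset("Example Title", StorageLocation.AWS)
print(dataset)
```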

Listing existing datasets#

Using the EncordUserClient, you can easily query and list all the available datasets of a given user. In the example below, a user authenticates with Encord and then fetches all datasets available.

from typing import Dict, List

from encord import EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

datasets: List[Dict] = user_client.get_datasets()
print(datasets)

Note: the type attribute in the output refers to the StorageLocation used when Creating a dataset.

Note

EncordUserClient.get_datasets() has multiple optional arguments that allow you to query datasets with specific characteristics.

For example, if you only want datasets with titles starting with “Validation”, you could use user_client.get_datasets(title_like="Validation%"). Other keyword arguments such as created_before or edited_after may also be of interest.

Managing a dataset#

The default way to interact with a dataset is via User authentication.

from encord import Dataset, EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

dataset: Dataset = user_client.get_dataset("<dataset_hash>")

Data#

Adding data#

You can add data to datasets in multiple ways: you can use Encord storage, as described next, or you can add data from a private cloud to integrate pre-existing data.

Note

The following examples assume that you have an authenticated Dataset instance stored in the variable dataset.

Adding data to Encord-hosted storage#

Uploading videos#

Use the method upload_video() to upload a video to a dataset using Encord storage.

dataset.upload_video("path/to/your/video.mp4")

This uploads the given video file to Encord storage and adds it to the dataset.

Uploading images#

Use the method create_image_group() to upload images and create an image group using Encord storage.

dataset.create_image_group(
    [
        "path/to/your/img1.jpeg",
        "path/to/your/img2.jpeg",
    ]
)

This method uploads the given list of images to Encord storage and creates an image group in the dataset.

You can also upload individual images to a dataset using Encord storage with the method upload_image().

dataset.upload_image("path/to/your/img1.jpeg")

Note

Image groups consist of images with the same resolution. If img1.jpeg and img2.jpeg from the example above have shapes [1920, 1080] and [1280, 720], respectively, they will each end up in their own image group.

Note

Images in an image group are assigned a data_sequence number based on the order of the files listed in the argument to create_image_group above. If the ordering is important, make sure to provide the list of filenames in the correct order.

Adding data from a private cloud#

  1. Use the user_client.get_cloud_integrations() method to retrieve a list of available Cloud Integrations.

  2. Grab the id of the integration of your choice and call dataset.add_private_data_to_dataset() on the dataset with either the absolute path to a JSON file or a Python dictionary, in the format specified in the Private cloud section of the web-app datasets documentation.
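As a purely illustrative sketch (the linked Private cloud documentation is authoritative; the objectUrl key and overall layout here are assumptions), the JSON file in step 2 maps data types to lists of object URLs, e.g.:

```json
{
  "videos": [
    { "objectUrl": "s3://your-bucket/path/to/video.mp4" }
  ]
}
```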

from typing import List

from encord import Dataset, EncordUserClient
from encord.orm.cloud_integration import CloudIntegration
from encord.orm.dataset import (
    AddPrivateDataResponse,
    DatasetDataLongPolling,
    LongPollingStatus,
)

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")

# Choose integration
integrations: List[CloudIntegration] = user_client.get_cloud_integrations()
print("Integration Options:")
print(integrations)

integration_idx: int = [i.title for i in integrations].index("AWS")
integration: str = integrations[integration_idx].id

use_simple_api: bool = True

# Check our documentation here: https://docs.encord.com/datasets/private-cloud-integration/#json-format
# to make sure you upload your data in the correct format

if use_simple_api:
    # add_private_data_to_dataset starts the upload job and waits for it to
    # finish, raising an exception if any errors occur
    response: AddPrivateDataResponse = dataset.add_private_data_to_dataset(
        integration, "path/to/json/file.json"
    )
else:
    # add_private_data_to_dataset_start only initiates the job
    upload_job_id: str = dataset.add_private_data_to_dataset_start(
        integration, "path/to/json/file.json"
    )

    # at this point you can save upload_job_id externally, exit the Python
    # process, and check the status later with
    # add_private_data_to_dataset_get_result

    # you can get the job status without awaiting the final response by passing
    # timeout_seconds=0, which performs one quick status check against the
    # Encord backend
    print(
        dataset.add_private_data_to_dataset_get_result(
            upload_job_id,
            timeout_seconds=0,
        )
    )

    # calling add_private_data_to_dataset_get_result without timeout_seconds
    # waits for the job to finish
    res = dataset.add_private_data_to_dataset_get_result(upload_job_id)

    if res.status == LongPollingStatus.DONE:
        response = AddPrivateDataResponse(
            dataset_data_list=res.data_hashes_with_titles
        )
    elif res.status == LongPollingStatus.ERROR:
        raise Exception(res.errors)  # one can specify custom error handling
    else:
        raise ValueError(f"res.status={res.status}, this should never happen")

print(response.dataset_data_list)

Reading and updating data#

To inspect data within a dataset, use Dataset.data_rows to get a list of DataRows. Check the auto-generated documentation for the DataRow class for more information on which fields you can access and update.
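A short sketch of iterating over data rows; the uid and title attributes used here are assumptions based on the DataRow documentation:

```python
from encord import Dataset, EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")

# Iterate over the data rows and print basic identifying fields
for data_row in dataset.data_rows:
    print(data_row.uid, data_row.title)
```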

Deleting data#

You can remove both videos and image groups from datasets created using either the web-app or the Encord SDK. Use the method dataset.delete_data() to delete data from a dataset.

dataset.delete_data(
    [
        "<video1_data_hash>",
        "<image_group1_data_hash>",
    ]
)

If a video or image group is stored in Encord-hosted storage, the corresponding file is removed from Encord-hosted storage as well.

Please ensure that the list only contains videos and image groups from the dataset used to initialise the Dataset instance. Any videos or image groups that do not belong to that dataset are ignored.

Re-encoding videos#

As videos come in various formats, frame rates, etc., you may in rare cases experience frame-syncing issues in the web-app. For example, frame 100 in the web-app might not correspond to the hundredth frame that you load with Python. We provide a browser test in the web-app that can tell you whether you are at risk of experiencing this issue.

To mitigate such issues, you can re-encode your videos to obtain new versions that do not exhibit them.

Trigger a re-encoding task#

You can re-encode a list of videos by triggering a re-encoding task using the Encord SDK. Use the method dataset.re_encode_data() to re-encode the specified list of videos.

task_id = dataset.re_encode_data(
    [
        "video1_data_hash",
        "video2_data_hash",
    ]
)
print(task_id)

The returned task_id can be used to monitor the progress of the task.

Please ensure that the list only contains videos from the dataset used to initialise the Dataset instance. Any videos that do not belong to that dataset are ignored.

Check the status of a re-encoding task#

Use the method dataset.re_encode_data_status(task_id) to get the status of an existing re-encoding task.

from encord.orm.dataset import ReEncodeVideoTask

task: ReEncodeVideoTask = dataset.re_encode_data_status(task_id)
print(task)

The ReEncodeVideoTask contains a field called status, which can take the following values:

  1. "SUBMITTED": the task is currently in progress and the status should be checked again later

  2. "DONE": the task has completed successfully and the result field contains metadata about the re-encoded videos

  3. "ERROR": the task has failed and could not complete the re-encoding
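Combining the trigger and status calls, a minimal polling loop might look like the following sketch. It assumes a Dataset initialised as above, and that status compares equal to the string values listed; the five-second interval is an arbitrary choice.

```python
import time

from encord.orm.dataset import ReEncodeVideoTask

task_id = dataset.re_encode_data(["video1_data_hash"])

while True:
    task: ReEncodeVideoTask = dataset.re_encode_data_status(task_id)
    if task.status != "SUBMITTED":
        break
    time.sleep(5)  # arbitrary polling interval

if task.status == "ERROR":
    raise RuntimeError("Re-encoding failed")
print(task)  # on "DONE", contains metadata about the re-encoded videos
```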

API keys#

We recommend using a Dataset as described in Managing a dataset. This is simpler than dealing with the soon-to-be-deprecated API keys, which should only be used under the specific circumstances described in Resource authentication.

Creating a master API key with full rights#

It is also possible to create or get a master API key with both read and write access (both values of DatasetScope). The following example shows how to get hold of this key:

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

dataset_api_key: DatasetAPIKey = user_client.get_or_create_dataset_api_key(
    "<dataset_hash>"
)
print(dataset_api_key)

Creating a dataset API key with specific rights#

Resource authentication using an API key allows you to control which capabilities the dataset client will have. This can be useful if, for example, you want to share read-only access with a third party. You need to provide the <dataset_hash>, which uniquely identifies a dataset (see, for example, Listing existing datasets to obtain such a hash). If you haven’t created a dataset already, have a look at Creating a dataset.

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey, DatasetScope

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

dataset_api_key: DatasetAPIKey = user_client.create_dataset_api_key(
    "<dataset_hash>",
    "Full Access API Key",
    [DatasetScope.READ, DatasetScope.WRITE],
)
print(dataset_api_key)

Note

This capability is only available to an admin of a dataset.

With the API key in hand, you can now use Resource authentication.
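As a sketch of what Resource authentication with this key can look like (the EncordClient.initialise entry point is an assumption here; see the Resource authentication documentation for the authoritative usage):

```python
from encord.client import EncordClient

# Initialise a client scoped to a single dataset using its hash and API key
client = EncordClient.initialise(
    "<dataset_hash>",  # the resource id
    "<api_key>",
)
```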

Fetching dataset API keys#

To get a list of all API keys of a dataset, you need to provide the <dataset_hash>, which uniquely identifies the dataset (see, for example, Listing existing datasets to obtain such a hash). If you haven’t created a dataset already, have a look at Creating a dataset.

from typing import List

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

keys: List[DatasetAPIKey] = user_client.get_dataset_api_keys("<dataset_hash>")
print(keys)

You can now use this API key for Resource authentication.