Datasets#

The Encord SDK allows you to interact with the datasets you have added to Encord. The attributes and member functions of the Dataset class can be found here: Dataset.

Each dataset can have a list of DataRows, which are the individual videos, image groups, images, or DICOM series within the dataset.

Below, you can find tutorials on how to interact with your datasets when you have associated a public-private key pair with Encord.

Creating a dataset#

To create a dataset, first select where your data will be hosted by choosing the appropriate StorageLocation. The following example creates a dataset called “Example Title” that expects data hosted on AWS S3. If you wish to upload your data from local storage to Encord, CORD_STORAGE is the appropriate choice.

Listing existing datasets#

Using the EncordUserClient, you can easily query and list all the available datasets of a given user. In the example below, a user authenticates with Encord and then fetches all datasets available.

from typing import Dict, List

from encord import EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

datasets: List[Dict] = user_client.get_datasets()
print(datasets)

Note: the type attribute in the output refers to the StorageLocation used when Creating a dataset.

Note

EncordUserClient.get_datasets() has multiple optional arguments that allow you to query datasets with specific characteristics.

For example, if you only want datasets with titles starting with “Validation”, you could use user_client.get_datasets(title_like="Validation%"). Other keyword arguments such as created_before or edited_after may also be of interest.

Managing a dataset#

Your default choice for interacting with a dataset is via User authentication.

from encord import Dataset, EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

dataset: Dataset = user_client.get_dataset("<dataset_hash>")

Data#

Adding data#

You can add data to datasets in multiple ways: you can use Encord storage, as described next, or you can add data from a private cloud to integrate any pre-existing data.

Note

The following examples assume that you have an authenticated Dataset initialised as the variable dataset.

Adding data to Encord-hosted storage#

Uploading videos#

Use the method upload_video() to upload a video to a dataset using Encord storage.

dataset.upload_video("path/to/your/video.mp4")

This will upload the given video file to the dataset.

Uploading images#

Use the method create_image_group() to upload images and create an image group using Encord storage.

dataset.create_image_group(
    [
        "path/to/your/img1.jpeg",
        "path/to/your/img2.jpeg",
    ]
)

This method will upload the given list of images to the dataset and create an image group.

You can also upload individual images to a dataset using Encord storage with the method upload_image().

dataset.upload_image("path/to/your/img1.jpeg")

Note

Image groups consist of images with the same resolution, so if img1.jpeg and img2.jpeg from the example above are of shape [1920, 1080] and [1280, 720], respectively, they will each end up in their own image group.

Note

Images in an image group will be assigned a data_sequence number, which is based on the order of the files listed in the argument to create_image_group above. If the ordering is important, make sure to provide the filenames in the correct order.

Adding data from a private cloud#

  1. Use the user_client.get_cloud_integrations() method to retrieve a list of available cloud integrations.

  2. Grab the id from the integration of your choice and call dataset.add_private_data_to_dataset() with either the absolute path to a JSON file or a Python dictionary in the format specified in the Private cloud section of the web-app datasets documentation.

from typing import List

from encord import Dataset, EncordUserClient
from encord.orm.cloud_integration import CloudIntegration
from encord.orm.dataset import AddPrivateDataResponse

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")

# Choose integration
integrations: List[CloudIntegration] = user_client.get_cloud_integrations()
print("Integration Options:")
print(integrations)

integration_idx: int = [i.title for i in integrations].index("AWS")
integration: str = integrations[integration_idx].id

response: AddPrivateDataResponse = dataset.add_private_data_to_dataset(
    integration, "path/to/json/file.json"
)
print(response.dataset_data_list)

Note

We strongly encourage you to follow and adapt the script below for an idiomatic way to upload data to your dataset.

import logging
import time

import requests.exceptions
from encord import Dataset, EncordUserClient
from encord.exceptions import EncordException


def upload_item(object_url: str, integration_id: str, dataset: Dataset) -> None:
    """Adding data one by one to a dataset."""
    add_private_data_response = dataset.add_private_data_to_dataset(
        integration_id,
        # Check the https://docs.encord.com/datasets/private-cloud-integration/#json-format documentation to build
        # the correct format for your upload.
        {
            "images": [
                {"objectUrl": object_url},
            ]
        },
    )
    if len(add_private_data_response.dataset_data_list) != 1:
        # If the request returns but no item was added, there might be something wrong with the uploaded file.
        # You can reach out to the Encord team for support.
        logging.error(
            f"Error adding private data for object_url {object_url}. The add private data response "
            f"was: {add_private_data_response}"
        )


object_urls = [
    "https://bucket/object1.jpeg",
    "https://bucket/object2.jpeg",
]
# TODO: set these ^

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "YOUR PRIVATE SSH KEY"
)
# TODO: set this ^
dataset: Dataset = user_client.get_dataset("YOUR DATASET RESOURCE ID")
# TODO: set this ^

integration_title = "YOUR INTEGRATION TITLE"  # TODO: set this

integration = None
for available_integration in dataset.get_cloud_integrations():
    if available_integration.title == integration_title:
        integration = available_integration
        break
if integration is None:
    logging.error(
        f"Integration with title {integration_title} not found - aborting"
    )
    exit()

# Set the timeout as uploads of large files can take a long time
TIMEOUT = 60 * 60  # 1 h
dataset._client._config.read_timeout = TIMEOUT
dataset._client._config.write_timeout = TIMEOUT
dataset._client._config.connect_timeout = TIMEOUT

for object_url in object_urls:
    retries = 5
    backoff_seconds = 2
    while retries > 0:
        try:
            upload_item(object_url, integration.id, dataset)
            logging.info(f"Successfully uploaded {object_url}")
            break
        except requests.exceptions.ReadTimeout:
            logging.exception(
                f"Your request timed out. The file with object url {object_url} might be processed in "
                "the background. Check the upload at a later time. Do not retry it for now. You might want to adjust "
                "your set timeout.",
                stack_info=True,
            )
            break
        except EncordException:
            retries -= 1
            logging.exception(
                f"Caught exception when adding {object_url} to Encord. Sleeping for {backoff_seconds} seconds.",
                stack_info=True,
            )
            if retries == 0:
                logging.error(
                    f"All retries exhausted - upload for object url {object_url} will be skipped."
                )
                break
            time.sleep(backoff_seconds)
            backoff_seconds *= 2

Reading and updating data#

To inspect data within a dataset, use Dataset.data_rows to get a list of DataRows. Check the auto-generated documentation for the DataRow class for more information on which fields you can access and update.

Deleting data#

You can remove both videos and image groups from datasets created using either the web-app or the Encord SDK. Use the method dataset.delete_data() to delete data from a dataset.

dataset.delete_data(
    [
        "<video1_data_hash>",
        "<image_group1_data_hash>",
    ]
)

In case the video or image group belongs to Encord-hosted storage, the corresponding file will be removed from the Encord-hosted storage.

Please ensure that the list contains videos/image groups from the same dataset that was used to initialise the Dataset. Any videos or image groups which do not belong to that dataset will be ignored.

Re-encoding videos#

As videos come in various formats, frame rates, etc., one may - in rare cases - experience frame-syncing issues in the web-app. For example, frame 100 in the web-app might not correspond to the hundredth frame that you load with Python. We provide a browser test in the web-app that can tell you if you are at risk of experiencing this issue.

To mitigate such issues, you can re-encode your videos to get a new version of your videos that do not exhibit these issues.

Trigger a re-encoding task#

You can re-encode a list of videos by triggering a re-encoding task using the Encord SDK. Use the method dataset.re_encode_data() to re-encode the specified list of videos.

task_id = dataset.re_encode_data(
    [
        "video1_data_hash",
        "video2_data_hash",
    ]
)
print(task_id)

On completion, a task_id is returned which can be used for monitoring the progress of the task.

Please ensure that the list contains videos from the same dataset that was used to initialise the Dataset. Any videos which do not belong to that dataset will be ignored.

Check the status of a re-encoding task#

Use the method dataset.re_encode_data_status(task_id) to get the status of an existing re-encoding task.

from encord.orm.dataset import ReEncodeVideoTask

task: ReEncodeVideoTask = dataset.re_encode_data_status(task_id)
print(task)

The ReEncodeVideoTask contains a field called status which can take the following values:

  1. "SUBMITTED": the task is currently in progress and the status should be checked again later

  2. "DONE": the task has completed successfully and the field result contains metadata about the re-encoded video

  3. "ERROR": the task has failed and could not complete the re-encoding

API keys#

We recommend using a Dataset as described in Managing a dataset. This is simpler than dealing with the soon-to-be-deprecated API keys, which should only be used under the specific circumstances described in Resource authentication.

Creating a master API key with full rights#

It is also possible to create or get a master API key with both read and write access (both values of DatasetScope). The following example shows how to get hold of this key:

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

dataset_api_key: DatasetAPIKey = user_client.get_or_create_dataset_api_key(
    "<dataset_hash>"
)
print(dataset_api_key)

Creating a dataset API key with specific rights#

Resource authentication using an API key allows you to control which capabilities the dataset client will have. This can be useful if you, for example, want to share read-only access with a third party. You need to provide the <dataset_hash>, which uniquely identifies a dataset (see, for example, Listing existing datasets for how to obtain such a hash). If you haven’t created a dataset already, have a look at Creating a dataset.

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey, DatasetScope

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

dataset_api_key: DatasetAPIKey = user_client.create_dataset_api_key(
    "<dataset_hash>",
    "Full Access API Key",
    [DatasetScope.READ, DatasetScope.WRITE],
)
print(dataset_api_key)

Note

This capability is only available to an admin of a dataset.

With the API key in hand, you can now use Resource authentication.

Fetching dataset API keys#

To get a list of all API keys of a dataset, you need to provide the <dataset_hash>, which uniquely identifies the dataset (see, for example, Listing existing datasets for how to obtain such a hash). If you haven’t created a dataset already, have a look at Creating a dataset.

from typing import List

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)

keys: List[DatasetAPIKey] = user_client.get_dataset_api_keys("<dataset_hash>")
print(keys)

You can now use this API key for Resource authentication.