Datasets#
The Encord SDK allows you to interact with the datasets you have added to Encord.
The attributes and member functions of the Dataset class can be found here: Dataset.
Each dataset can have a list of DataRows, which are the individual videos, image groups, images, or DICOM series within the Dataset.
Below, you can find tutorials on how to interact with your datasets when you have associated a public-private key pair with Encord.
Creating a dataset#
To create a dataset, first select where your data will be hosted with the appropriate StorageLocation.
The following example will create a dataset called “Example Title” that will expect data hosted on AWS S3.
If you just wish to upload your data from local storage to Encord, CORD_STORAGE would be the appropriate choice.
from encord import EncordUserClient
from encord.orm.dataset import CreateDatasetResponse, StorageLocation

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: CreateDatasetResponse = user_client.create_dataset(
    "Example Title", StorageLocation.AWS
)
print(dataset)

{
    "title": "Example Title",
    "type": 1,
    "dataset_hash": "<dataset_hash>",
    "user_hash": "<user_hash>",
}
Listing existing datasets#
Using the EncordUserClient, you can easily query and list all the available datasets of a given user.
In the example below, a user authenticates with Encord and then fetches all datasets available.
from typing import Dict, List

from encord import EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
datasets: List[Dict] = user_client.get_datasets()
print(datasets)

[
    {
        "dataset": DatasetInfo(
            dataset_hash="<dataset_hash>",
            user_hash="<user_hash>",
            title="Example title",
            description="Example description ... ",
            type=0,  # encord.orm.dataset.StorageLocation
            created_at=datetime.datetime(...),
            last_edited_at=datetime.datetime(...)
        ),
        "user_role": DatasetUserRole.ADMIN
    },
    # ...
]
Note: the type attribute in the output refers to the StorageLocation used when Creating a dataset.
Note
EncordUserClient.get_datasets() has multiple optional arguments that allow you to query datasets with specific characteristics.
For example, if you only want datasets with titles starting with “Validation”, you could use user_client.get_datasets(title_like="Validation%").
Other keyword arguments such as created_before or edited_after may also be of interest.
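As a sketch of such a filtered query: the title_like, created_before, and edited_after keyword arguments are the ones named above; the assumption that the date filters accept a datetime object should be verified against the auto-generated EncordUserClient reference.

```python
from datetime import datetime
from typing import Dict, List

from encord import EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
# "%" is a wildcard, so this matches every dataset whose title starts with "Validation"
# and that has been edited after the given date.
datasets: List[Dict] = user_client.get_datasets(
    title_like="Validation%",
    edited_after=datetime(2023, 1, 1),
)
print([entry["dataset"].title for entry in datasets])
```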
Managing a dataset#
Your default choice for interacting with a dataset is via User authentication.
from encord import Dataset, EncordUserClient
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")
Data#
Adding data#
You can add data to datasets in multiple ways. You can both use Encord storage, as described next, and you can add data from a private cloud to integrate any pre-existing data.
Note
The following examples assume that you have a Dataset initialised as the variable dataset and that you are authenticated.
Adding data to Encord-hosted storage#
Uploading videos#
Use the method upload_video() to upload a video to a dataset using Encord storage.
dataset.upload_video("path/to/your/video.mp4")
This will upload the given video file to the dataset referenced by the dataset variable.
Uploading images#
Use the method create_image_group() to upload images and create an image group using Encord storage.
dataset.create_image_group(
    [
        "path/to/your/img1.jpeg",
        "path/to/your/img2.jpeg",
    ]
)
This method will upload the given list of images to the dataset referenced by the dataset variable and create an image group.
You can also upload individual images to a dataset using Encord storage with the method upload_image().
dataset.upload_image("path/to/your/img1.jpeg")
Note
Image groups consist of images of the same resolution, so if img1.jpeg and img2.jpeg from the example above are of shape [1920, 1080] and [1280, 720], respectively, they will end up in separate image groups.
Note
Images in an image group will be assigned a data_sequence number, which is based on the order of the files listed in the argument to create_image_group above.
If the ordering is important, make sure to provide a list with filenames in the correct order.
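One common pitfall is that plain lexicographic sorting puts "img10" before "img2". A small, self-contained helper (independent of the Encord SDK) that sorts numbered filenames numerically before building the image group could look like this:

```python
import re


def natural_key(filename: str):
    # Split into text and number chunks so numeric parts compare as integers,
    # e.g. "img10.jpeg" -> ["img", 10, ".jpeg"].
    return [
        int(chunk) if chunk.isdigit() else chunk
        for chunk in re.split(r"(\d+)", filename)
    ]


image_paths = sorted(
    ["img10.jpeg", "img2.jpeg", "img1.jpeg"],
    key=natural_key,
)
print(image_paths)  # ['img1.jpeg', 'img2.jpeg', 'img10.jpeg']
# dataset.create_image_group(image_paths)
```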
Adding data from a private cloud#
1. Use the user_client.get_cloud_integrations() method to retrieve a list of available Cloud Integrations.
2. Grab the id from the integration of your choice and call dataset.add_private_data_to_dataset() on the dataset with either the absolute path to a JSON file or a Python dictionary in the format specified in the Private cloud section of the web-app datasets documentation.
from typing import List

from encord import Dataset, EncordUserClient
from encord.orm.cloud_integration import CloudIntegration
from encord.orm.dataset import AddPrivateDataResponse

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")

# Choose integration
integrations: List[CloudIntegration] = user_client.get_cloud_integrations()
print("Integration Options:")
print(integrations)

integration_idx: int = [i.title for i in integrations].index("AWS")
integration: str = integrations[integration_idx].id

response: AddPrivateDataResponse = dataset.add_private_data_to_dataset(
    integration, "path/to/json/file.json"
)
print(response.dataset_data_list)
Note
We strongly encourage you to follow and adapt the script below for an idiomatic way to upload data to your dataset.
import logging
import time

import requests.exceptions

from encord import Dataset, EncordUserClient
from encord.exceptions import EncordException


def upload_item(object_url: str, integration_id: str, dataset: Dataset) -> None:
    """Add a single item to a dataset."""
    add_private_data_response = dataset.add_private_data_to_dataset(
        integration_id,
        # Check the https://docs.encord.com/datasets/private-cloud-integration/#json-format documentation to build
        # the correct format for your upload.
        {
            "images": [
                {"objectUrl": object_url},
            ]
        },
    )
    if len(add_private_data_response.dataset_data_list) != 1:
        # If the request returns but no item was added, there might be something wrong with the uploaded file.
        # You can reach out to the Encord team for support.
        logging.error(
            f"Error adding private data for object_url {object_url}. The add private data response "
            f"was: {add_private_data_response}"
        )


object_urls = [
    "https://bucket/object1.jpeg",
    "https://bucket/object2.jpeg",
]
# TODO: set these ^

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "YOUR PRIVATE SSH KEY"
)
# TODO: set this ^

dataset: Dataset = user_client.get_dataset("YOUR DATASET RESOURCE ID")
# TODO: set this ^

integration_title = "YOUR INTEGRATION TITLE"  # TODO: set this

integration = None
for available_integration in dataset.get_cloud_integrations():
    if available_integration.title == integration_title:
        integration = available_integration
        break

if integration is None:
    logging.error(
        f"Integration with title {integration_title} not found - aborting"
    )
    exit()

# Set the timeout as uploads of large files can take a long time
TIMEOUT = 60 * 60  # 1 h
dataset._client._config.read_timeout = TIMEOUT
dataset._client._config.write_timeout = TIMEOUT
dataset._client._config.connect_timeout = TIMEOUT

for object_url in object_urls:
    retries = 5
    backoff_seconds = 2
    while retries > 0:
        try:
            upload_item(object_url, integration.id, dataset)
            logging.info(f"Successfully uploaded {object_url}")
            break
        except requests.exceptions.ReadTimeout:
            logging.exception(
                f"Your request timed out. The file with object url {object_url} might be processed in "
                "the background. Check the upload at a later time. Do not retry it for now. You might want to adjust "
                "your set timeout.",
                stack_info=True,
            )
            break
        except EncordException:
            logging.exception(
                f"Caught exception when adding {object_url} to Encord. Sleeping for {backoff_seconds} seconds.",
                stack_info=True,
            )
            if retries == 1:
                # This was the last attempt, so report the failure and move on.
                logging.error(
                    f"All retries exhausted - upload for object url {object_url} will be skipped."
                )
                break
            time.sleep(backoff_seconds)
            retries -= 1
            backoff_seconds *= 2
Reading and updating data#
To inspect data within a dataset, use Dataset.data_rows, which returns a list of DataRows.
Check the auto-generated documentation for the DataRow class for more information on which fields you can access and update.
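As a sketch (assuming a dataset initialised as in Managing a dataset; the attribute names below, uid and title, should be verified against the auto-generated DataRow reference):

```python
from encord import Dataset, EncordUserClient

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")

for data_row in dataset.data_rows:
    # uid is the data hash of the row; title is the file name.
    print(data_row.uid, data_row.title)
```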
Deleting data#
You can remove both videos and image groups from datasets created using either the web-app or the Encord SDK.
Use the method dataset.delete_data() to delete from a dataset.
dataset.delete_data(
    [
        "<video1_data_hash>",
        "<image_group1_data_hash>",
    ]
)
If the video or image group belongs to Encord-hosted storage, the corresponding file will be removed from the Encord-hosted storage.
Please ensure that the list contains videos or image groups from the same dataset that was used to initialise the dataset.
Any videos or image groups which do not belong to the dataset used for initialisation will be ignored.
Re-encoding videos#
As videos come in various formats, frame rates, etc., one may - in rare cases - experience frame-syncing issues on the web-app. For example, frame 100 on the web-app might not correspond to the hundredth frame that you load with Python. We provide a browser test in the web-app that can tell you if you are at risk of experiencing this issue.
To mitigate such issues, you can re-encode your videos to get a new version of your videos that do not exhibit these issues.
Trigger a re-encoding task#
You can re-encode a list of videos by triggering a re-encoding task using the Encord SDK.
Use the method dataset.re_encode_data() to re-encode the specified list of videos.
task_id = dataset.re_encode_data(
    [
        "video1_data_hash",
        "video2_data_hash",
    ]
)
print(task_id)

1337  # Some integer
On completion, a task_id is returned, which can be used for monitoring the progress of the task.
Please ensure that the list contains videos from the same dataset that was used to initialise the EncordClient.
Any videos which do not belong to the dataset used for initialisation will be ignored.
Check the status of a re-encoding task#
Use the method dataset.re_encode_data_status(task_id) to get the status of an existing re-encoding task.
from encord.orm.dataset import ReEncodeVideoTask

task: ReEncodeVideoTask = dataset.re_encode_data_status(task_id)
print(task)

ReEncodeVideoTask(
    status="DONE",
    result=[
        ReEncodeVideoTaskResult(
            data_hash="<data_hash>",
            signed_url="<signed_url>",
            bucket_path="<bucket_path>",
        ),
        ...
    ]
)
The ReEncodeVideoTask contains a field called status which can take the following values:

- "SUBMITTED": the task is currently in progress and the status should be checked again later
- "DONE": the task has completed successfully and the result field contains metadata about the re-encoded videos
- "ERROR": the task has failed and could not complete the re-encoding
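These three states can be combined into a simple polling loop. The sketch below assumes a dataset initialised as in Managing a dataset and that status is the plain string shown in the example output; the 30-second interval is an arbitrary choice.

```python
import time

from encord.orm.dataset import ReEncodeVideoTask

task_id = dataset.re_encode_data(["<video1_data_hash>"])

while True:
    task: ReEncodeVideoTask = dataset.re_encode_data_status(task_id)
    if task.status == "SUBMITTED":
        # Still in progress; check again later.
        time.sleep(30)
    elif task.status == "DONE":
        for item in task.result:
            print(item.data_hash, item.bucket_path)
        break
    else:  # "ERROR"
        print("Re-encoding failed")
        break
```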
API keys#
We recommend using a Dataset as described in Managing a dataset.
This is simpler than dealing with the soon-to-be-deprecated API keys, which should only be used under the specific circumstances described in Resource authentication.
Creating a master API key with full rights#
It is also possible to create or get a master API key with both read and write access (both values of DatasetScope).
The following example shows how to get hold of this key:
from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset_api_key: DatasetAPIKey = user_client.get_or_create_dataset_api_key(
    "<dataset_hash>"
)
print(dataset_api_key)

DatasetAPIKey(
    dataset_hash="<dataset_hash>",
    api_key="<api_key>",
    title="",
    scopes=[
        DatasetScope.READ,
        DatasetScope.WRITE,
    ]
)
Creating a dataset API key with specific rights#
Resource authentication using an API key allows you to control which capabilities the dataset client will have.
This can be useful if you, for example, want to share read-only access with a third party.
You need to provide the <dataset_hash>, which uniquely identifies a dataset (see, for example, Listing existing datasets to learn how to obtain such a hash).
If you haven’t created a dataset already, you can have a look at Creating a dataset.
from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey, DatasetScope

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
dataset_api_key: DatasetAPIKey = user_client.create_dataset_api_key(
    "<dataset_hash>",
    "Full Access API Key",
    [DatasetScope.READ, DatasetScope.WRITE],
)
print(dataset_api_key)

DatasetAPIKey(
    dataset_hash="<dataset_hash>",
    api_key="<api_key>",
    title="Full Access API Key",
    scopes=[
        DatasetScope.READ,
        DatasetScope.WRITE,
    ]
)
Note
This capability is only available to an admin of a dataset.
With the API key in hand, you can now use Resource authentication.
Fetching dataset API keys#
To get a list of all API keys of a dataset, you need to provide the <dataset_hash>, which uniquely identifies the dataset (see, for example, Listing existing datasets to learn how to obtain such a hash).
If you haven’t created a dataset already, you can have a look at Creating a dataset.
from typing import List

from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey

user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
    "<your_private_key>"
)
keys: List[DatasetAPIKey] = user_client.get_dataset_api_keys("<dataset_hash>")
print(keys)

[
    DatasetAPIKey(
        dataset_hash="<dataset_hash>",
        api_key="<dataset_api_key>",
        title="Full Access API Key",
        scopes=[
            DatasetScope.READ,
            DatasetScope.WRITE,
        ]
    ),
    # ...
]
You can now use this API key for Resource authentication.