Datasets#
The Encord SDK allows you to interact with the datasets you have added to Encord.
The attributes and member functions of the Dataset class can be found here: Dataset.
Each dataset can have a list of DataRows, which are the individual videos, image groups, images, or DICOM series within the Dataset.
Below, you can find tutorials on how to interact with your datasets when you have associated a public-private key pair with Encord.
Creating a dataset#
To create a dataset, first select where your data will be hosted with the appropriate StorageLocation.
The following example will create a dataset called “Example Title” that expects data hosted on AWS S3.
If you just wish to upload your data from local storage to Encord, CORD_STORAGE would be the appropriate choice.
from encord import EncordUserClient
from encord.orm.dataset import CreateDatasetResponse, StorageLocation
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
dataset: CreateDatasetResponse = user_client.create_dataset(
"Example Title", StorageLocation.AWS
)
print(dataset)
{
"title": "Example Title",
"type": 1,
"dataset_hash": "<dataset_hash>",
"user_hash": "<user_hash>",
}
Listing existing datasets#
Using the EncordUserClient, you can easily query and list all the available datasets of a given user.
In the example below, a user authenticates with Encord and then fetches all datasets available.
from typing import Dict, List
from encord import EncordUserClient
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
datasets: List[Dict] = user_client.get_datasets()
print(datasets)
[
{
"dataset": DatasetInfo(
dataset_hash="<dataset_hash>",
user_hash="<user_hash>",
title="Example title",
description="Example description ... ",
type=0, # encord.orm.dataset.StorageLocation
created_at=datetime.datetime(...),
last_edited_at=datetime.datetime(...)
),
"user_role": DatasetUserRole.ADMIN
},
# ...
]
Note: the type attribute in the output refers to the StorageLocation used when Creating a dataset.
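For example, assuming the integer maps directly onto the StorageLocation enum values, you can recover the enum from the listing above:
from encord.orm.dataset import StorageLocation

# Assumed conversion: turn the integer type field back into the enum.
storage = StorageLocation(datasets[0]["dataset"].type)
print(storage)  # e.g. StorageLocation.CORD_STORAGE for type=0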
Note
EncordUserClient.get_datasets() has multiple optional arguments that allow you to query datasets with specific characteristics.
For example, if you only want datasets with titles starting with “Validation”, you could use user_client.get_datasets(title_like="Validation%").
Other keyword arguments such as created_before or edited_after may also be of interest.
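As a minimal sketch, assuming created_before accepts a datetime (the cut-off date below is a placeholder):
from datetime import datetime

datasets = user_client.get_datasets(
    title_like="Validation%",
    created_before=datetime(2023, 1, 1),
)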
Managing a dataset#
Your default choice for interacting with a dataset is via User authentication.
from encord import Dataset, EncordUserClient
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")
Data#
Adding data#
You can add data to datasets in multiple ways: you can use Encord storage, as described next, or you can add data from a private cloud to integrate any pre-existing data.
Note
The following examples assume that you have an authenticated Dataset initialised as the variable dataset.
Adding data to Encord-hosted storage#
Uploading videos#
Use the method upload_video() to upload a video to a dataset using Encord storage.
dataset.upload_video("path/to/your/video.mp4")
This will upload the given video file to the dataset associated with the dataset object.
Uploading images#
Use the method create_image_group() to upload images and create an image group using Encord storage.
dataset.create_image_group(
[
"path/to/your/img1.jpeg",
"path/to/your/img2.jpeg",
]
)
This method will upload the given list of images to the dataset associated with the dataset object and create an image group.
You can also upload individual images to a dataset using Encord storage with the method upload_image().
dataset.upload_image("path/to/your/img1.jpeg")
Note
Image groups consist of images of the same resolution, so if img1.jpeg and img2.jpeg from the example above are of shape [1920, 1080] and [1280, 720], respectively, they will each end up in their own image group.
Note
Images in an image group will be assigned a data_sequence number, which is based on the order of the files listed in the argument to create_image_group above.
If the ordering is important, make sure to provide a list with filenames in the correct order.
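For example, a minimal sketch that sorts the frame paths before creating the image group; the directory and naming scheme are placeholders:
from pathlib import Path

# Sort the frame paths so that data_sequence follows the intended order.
image_paths = sorted(str(p) for p in Path("path/to/frames").glob("*.jpeg"))
dataset.create_image_group(image_paths)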
Adding data from a private cloud#
Use the user_client.get_cloud_integrations() method to retrieve a list of available Cloud Integrations.
Grab the id from the integration of your choice and call dataset.add_private_data_to_dataset() on the dataset with either the absolute path to a JSON file or a Python dictionary in the format specified in the Private cloud section of the web-app datasets documentation.
from typing import List
from encord import Dataset, EncordUserClient
from encord.orm.cloud_integration import CloudIntegration
from encord.orm.dataset import (
AddPrivateDataResponse,
DatasetDataLongPolling,
LongPollingStatus,
)
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
dataset: Dataset = user_client.get_dataset("<dataset_hash>")
# Choose integration
integrations: List[CloudIntegration] = user_client.get_cloud_integrations()
print("Integration Options:")
print(integrations)
integration_idx: int = [i.title for i in integrations].index("AWS")
integration: str = integrations[integration_idx].id
use_simple_api: bool = True
# Check our documentation here: https://docs.encord.com/datasets/private-cloud-integration/#json-format
# to make sure you upload your data in the correct format
if use_simple_api:
    # add_private_data_to_dataset starts the upload job and waits for it to finish,
    # raising an exception if any errors occur
response: AddPrivateDataResponse = dataset.add_private_data_to_dataset(
integration, "path/to/json/file.json"
)
else:
    # add_private_data_to_dataset_start only initialises the upload job
upload_job_id: str = dataset.add_private_data_to_dataset_start(
integration, "path/to/json/file.json"
)
    # at this point you can save upload_job_id externally, exit the python process, and
    # check the status with add_private_data_to_dataset_get_result at any point in the future
    # passing timeout_seconds=0 returns the job status without awaiting the final
    # response; this performs one quick call to the Encord backend for a status check
print(
dataset.add_private_data_to_dataset_get_result(
upload_job_id,
timeout_seconds=0,
)
)
    # calling add_private_data_to_dataset_get_result without
    # timeout_seconds waits for the job to finish
res = dataset.add_private_data_to_dataset_get_result(upload_job_id)
if res.status == LongPollingStatus.DONE:
response = AddPrivateDataResponse(
dataset_data_list=res.data_hashes_with_titles
)
elif res.status == LongPollingStatus.ERROR:
raise Exception(res.errors) # one can specify custom error handling
else:
raise ValueError(f"res.status={res.status}, this should never happen")
print(response.dataset_data_list)
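If you prefer passing a Python dictionary instead of a path to a JSON file, the following is a minimal sketch; the "videos"/"objectUrl" keys and the bucket URL are illustrative, so consult the JSON format documentation linked above for the authoritative schema:
# Illustrative spec only; see the JSON format documentation for the schema.
private_data_spec = {
    "videos": [
        {"objectUrl": "https://my-bucket.s3.eu-west-2.amazonaws.com/videos/video1.mp4"},
    ]
}
response = dataset.add_private_data_to_dataset(integration, private_data_spec)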
Reading and updating data#
To inspect data within a dataset, you can use Dataset.data_rows, which gives you a list of DataRows.
Check the auto-generated documentation for the DataRow class for more information on which fields you can access and update.
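As a minimal sketch, assuming the uid and title fields documented on the DataRow class:
# Iterate over the rows in the dataset and print identifying fields.
for data_row in dataset.data_rows:
    print(data_row.uid, data_row.title)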
Deleting data#
You can remove both videos and image groups from datasets, whether they were created using the web-app or the Encord SDK.
Use the method dataset.delete_data() to delete data from a dataset.
dataset.delete_data(
[
"<video1_data_hash>",
"<image_group1_data_hash>",
]
)
If a video or image group belongs to Encord-hosted storage, the corresponding file will also be removed from the Encord-hosted storage.
Please ensure that the list contains videos/image groups from the same dataset that was used to initialise the dataset object.
Any videos or image groups which do not belong to the dataset used for initialisation will be ignored.
Re-encoding videos#
As videos come in various formats, frame rates, etc., you may, in rare cases, experience frame-syncing issues on the web-app. For example, frame 100 on the web-app might not correspond to the hundredth frame that you load with Python. We provide a browser test in the web-app that can tell you if you are at risk of experiencing this issue.
To mitigate such issues, you can re-encode your videos to get a new version that does not exhibit them.
Trigger a re-encoding task#
You re-encode a list of videos by triggering a re-encoding task using the Encord SDK.
Use the method dataset.re_encode_data() to re-encode the specified list of videos.
task_id = dataset.re_encode_data(
[
"video1_data_hash",
"video2_data_hash",
]
)
print(task_id)
1337 # Some integer
On completion, a task_id is returned, which can be used for monitoring the progress of the task.
Please ensure that the list contains videos from the same dataset that was used to initialise the dataset object.
Any videos which do not belong to the dataset used for initialisation will be ignored.
Check the status of a re-encoding task#
Use the method dataset.re_encode_data_status(task_id) to get the status of an existing re-encoding task.
from encord.orm.dataset import ReEncodeVideoTask
task: ReEncodeVideoTask = dataset.re_encode_data_status(task_id)
print(task)
ReEncodeVideoTask(
status="DONE",
result=[
ReEncodeVideoTaskResult(
data_hash="<data_hash>",
signed_url="<signed_url>",
bucket_path="<bucket_path>",
),
...
]
)
The ReEncodeVideoTask contains a field called status, which can take the following values:
"SUBMITTED": the task is currently in progress and the status should be checked again later
"DONE": the task has completed successfully and the result field contains metadata about the re-encoded video
"ERROR": the task has failed and could not complete the re-encoding
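A minimal polling sketch, assuming the status values are the strings shown above; the polling interval is arbitrary:
import time

task = dataset.re_encode_data_status(task_id)
while task.status == "SUBMITTED":
    time.sleep(30)  # arbitrary polling interval
    task = dataset.re_encode_data_status(task_id)
if task.status == "ERROR":
    raise Exception("the re-encoding task failed")
print(task.result)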
API keys#
We recommend using a Dataset as described in Managing a dataset.
This is simpler than dealing with the soon-to-be-deprecated API keys, which should only be used under the specific circumstances described in Resource authentication.
Creating a master API key with full rights#
It is also possible to create or get a master API key with both read and write access (both values of DatasetScope).
The following example shows how to get hold of this key:
from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
dataset_api_key: DatasetAPIKey = user_client.get_or_create_dataset_api_key(
"<dataset_hash>"
)
print(dataset_api_key)
DatasetAPIKey(
dataset_hash="<dataset_hash>",
api_key="<api_key>",
title="",
scopes=[
DatasetScope.READ,
DatasetScope.WRITE,
]
)
Creating a dataset API key with specific rights#
Resource authentication using an API key allows you to control which capabilities the dataset client will have.
This can be useful if you, for example, want to share read-only access with some third-party.
You need to provide the <dataset_hash>, which uniquely identifies a dataset (see, for example, Listing existing datasets on how to obtain this hash).
If you haven’t created a dataset already, you can have a look at Creating a dataset.
from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey, DatasetScope
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
dataset_api_key: DatasetAPIKey = user_client.create_dataset_api_key(
"<dataset_hash>",
"Full Access API Key",
[DatasetScope.READ, DatasetScope.WRITE],
)
print(dataset_api_key)
DatasetAPIKey(
dataset_hash="<dataset_hash>",
api_key="<api_key>",
title="Example api key title",
scopes=[
DatasetScope.READ,
DatasetScope.WRITE,
]
)
Note
This capability is only available to an admin of a dataset.
With the API key in hand, you can now use Resource authentication.
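As a minimal sketch, assuming the legacy EncordClient.initialise entry point described in Resource authentication:
from encord.client import EncordClient

# Authenticate directly against the dataset with the API key.
client = EncordClient.initialise("<dataset_hash>", "<api_key>")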
Fetching dataset API keys#
To get a list of all API keys of a dataset, you need to provide the <dataset_hash>, which uniquely identifies the dataset (see, for example, Listing existing datasets on how to obtain this hash).
If you haven’t created a dataset already, you can have a look at Creating a dataset.
from typing import List
from encord import EncordUserClient
from encord.orm.dataset import DatasetAPIKey
user_client: EncordUserClient = EncordUserClient.create_with_ssh_private_key(
"<your_private_key>"
)
keys: List[DatasetAPIKey] = user_client.get_dataset_api_keys("<dataset_hash>")
print(keys)
[
DatasetAPIKey(
dataset_hash="<dataset_hash>",
api_key="<dataset_api_key>",
title="Full Access API Key",
scopes=[
DatasetScope.READ,
DatasetScope.WRITE,
]
),
# ...
]
You can now use this API key for Resource authentication.