A dataset is a named collection of images at the organization level that you label and use for training. Before training, your dataset must meet these minimums:
| Requirement | Minimum |
|---|---|
| Total images | 15 |
| Labeled images | 80% of total |
| Examples per label | 10 |
| Label distribution | Roughly equal |
| Image source | Production environment |
For production use, aim for hundreds of images per label under varied conditions.
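The numeric minimums in the table can be checked locally before you start a training job. The following is a minimal sketch, not part of the Viam SDK: the function name `check_minimums` and its input format (one list of label strings per image, with an empty list for an unlabeled image) are assumptions made for illustration.

```python
from collections import Counter

# Thresholds taken from the requirements table above.
MIN_TOTAL_IMAGES = 15
MIN_LABELED_PERCENT = 80
MIN_EXAMPLES_PER_LABEL = 10


def check_minimums(image_labels):
    """Return a list of problems; an empty list means the minimums are met.

    image_labels is a hypothetical input format: one list of label strings
    per image, where an empty list means the image is unlabeled.
    """
    problems = []
    total = len(image_labels)
    if total < MIN_TOTAL_IMAGES:
        problems.append(f"only {total} images, need {MIN_TOTAL_IMAGES}")

    # Integer arithmetic avoids floating-point edge cases at exactly 80%.
    labeled = sum(1 for labels in image_labels if labels)
    if labeled * 100 < total * MIN_LABELED_PERCENT:
        problems.append(f"only {labeled}/{total} images labeled, need 80%")

    # Count each label at most once per image.
    per_label = Counter(
        label for labels in image_labels for label in set(labels)
    )
    for label, count in sorted(per_label.items()):
        if count < MIN_EXAMPLES_PER_LABEL:
            problems.append(
                f"label '{label}' has {count} examples, "
                f"need {MIN_EXAMPLES_PER_LABEL}"
            )
    return problems


if __name__ == "__main__":
    images = [["ok"]] * 10 + [["defect"]] * 4 + [[]]
    for problem in check_minimums(images):
        print(problem)
```

The sample input has 15 images with 14 labeled, so only the per-label minimum fails, for the `defect` label.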
You can create a dataset from the web UI, the CLI, or programmatically.
Web UI:
Give the dataset a descriptive name, such as inspection-parts-v1 or package-detection. Dataset names must be
unique within your organization. Your empty dataset now appears in the list.
CLI:
If you have the Viam CLI installed, create a dataset from the command line:
viam dataset create --org-id=YOUR-ORG-ID --name=my-inspection-dataset
The command returns the dataset ID, which you will need for subsequent CLI and SDK operations.
Python:
import asyncio

from viam.rpc.dial import DialOptions
from viam.app.viam_client import ViamClient

API_KEY = "YOUR-API-KEY"
API_KEY_ID = "YOUR-API-KEY-ID"
ORG_ID = "YOUR-ORGANIZATION-ID"


async def connect() -> ViamClient:
    dial_options = DialOptions.with_api_key(API_KEY, API_KEY_ID)
    return await ViamClient.create_from_dial_options(dial_options)


async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    dataset_id = await data_client.create_dataset(
        name="my-inspection-dataset",
        organization_id=ORG_ID,
    )
    print(f"Created dataset: {dataset_id}")

    viam_client.close()


if __name__ == "__main__":
    asyncio.run(main())
Go:
package main

import (
	"context"
	"fmt"

	"go.viam.com/rdk/app"
	"go.viam.com/rdk/logging"
)

func main() {
	apiKey := "YOUR-API-KEY"
	apiKeyID := "YOUR-API-KEY-ID"
	orgID := "YOUR-ORGANIZATION-ID"

	ctx := context.Background()
	logger := logging.NewDebugLogger("create-dataset")

	viamClient, err := app.CreateViamClientWithAPIKey(
		ctx, app.Options{}, apiKey, apiKeyID, logger)
	if err != nil {
		logger.Fatal(err)
	}
	defer viamClient.Close()

	dataClient := viamClient.DataClient()

	datasetID, err := dataClient.CreateDataset(
		ctx, "my-inspection-dataset", orgID)
	if err != nil {
		logger.Fatal(err)
	}
	fmt.Printf("Created dataset: %s\n", datasetID)
}
Replace all placeholder values (YOUR-API-KEY, YOUR-API-KEY-ID,
YOUR-ORGANIZATION-ID) with your actual values. To find your organization ID,
click your organization name in the top navigation bar, then click Settings.
Your organization ID is displayed on the settings page with a copy button.
With a dataset created, you need to populate it with images.
Web UI:
The selected images are now part of your dataset.
CLI:
Add images to a dataset using filter criteria:
viam dataset data add filter \
    --dataset-id=YOUR-DATASET-ID \
    --location-id=YOUR-LOCATION-ID \
    --tags=label1,label2
This adds all images matching the filter to the dataset. You can filter by location, machine, component, tags, or time range.
Python:
async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    await data_client.add_binary_data_to_dataset_by_ids(
        binary_ids=["binary-data-id-1", "binary-data-id-2"],
        dataset_id="YOUR-DATASET-ID",
    )
    print("Images added to dataset.")

    viam_client.close()
You can get binary data IDs by querying for images first using the data client’s
binary_data_by_filter method, which returns objects that include the binary ID.
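Queries for images can return results one page at a time, so collecting every ID usually means looping until no more pages remain. The following is a minimal sketch of that loop under stated assumptions: `collect_binary_ids`, `fetch_page`, and `get_id` are hypothetical names, and `fetch_page` is assumed to return a tuple of (items, total count, continuation token). In real use, `fetch_page` would wrap a call to `binary_data_by_filter`; check the SDK reference for the exact parameters and for the attribute that holds the binary ID in your SDK version.

```python
import asyncio


async def collect_binary_ids(fetch_page, get_id, max_pages=100):
    """Collect IDs from a paged query.

    fetch_page: hypothetical async callable that takes the previous
        continuation token ("" for the first page) and returns
        (items, total_count, next_token).
    get_id: hypothetical callable that extracts the binary ID from one item.
    max_pages: safety limit so a misbehaving query cannot loop forever.
    """
    ids = []
    last = ""
    for _ in range(max_pages):
        items, _count, last = await fetch_page(last)
        ids.extend(get_id(item) for item in items)
        # Stop when a page is empty or no continuation token is returned.
        if not items or not last:
            break
    return ids


async def demo():
    # Fake two-page result set standing in for a real query.
    pages = {
        "": (["id-1", "id-2"], 3, "token-1"),
        "token-1": (["id-3"], 3, ""),
    }

    async def fake_fetch(last):
        return pages[last]

    return await collect_binary_ids(fake_fetch, get_id=lambda item: item)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Passing the fetch and ID-extraction steps in as callables keeps the loop independent of any particular SDK version, and lets you test it without a network connection.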
Go:
err = dataClient.AddBinaryDataToDatasetByIDs(
	ctx,
	[]string{"binary-data-id-1", "binary-data-id-2"},
	"YOUR-DATASET-ID",
)
if err != nil {
	logger.Fatal(err)
}
fmt.Println("Images added to dataset.")
Before training, you need to label every image in your dataset with tags (for classification) or bounding boxes (for object detection).
See Annotate images for step-by-step instructions on manual labeling, or Automate annotation to use an existing ML model to speed up the process.
Before you train a model, check that your dataset meets the requirements.
In the web UI:
| Check | What to look for |
|---|---|
| Enough images | At least 15 total. More is better. |
| Labeling coverage | At least 80% of images have tags or bounding boxes |
| Examples per label | At least 10 images per label |
| Label balance | No label should have more than 3x the images of any other label |
| Production conditions | Images should represent real operating conditions, not staged or ideal setups |
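The label balance rule in the table is simple arithmetic on per-label image counts. The following is a minimal sketch, not part of the Viam SDK: the function name `overrepresented_labels` and the dictionary input (label name mapped to image count) are assumptions made for illustration.

```python
def overrepresented_labels(label_counts, max_ratio=3):
    """Return labels with more than max_ratio times the smallest label's count.

    label_counts is a hypothetical input: a dict mapping each label name to
    the number of images that carry it. An empty result means the dataset
    satisfies the 3x balance guideline.
    """
    if not label_counts:
        return []
    smallest = min(label_counts.values())
    limit = max_ratio * smallest
    return sorted(
        label for label, count in label_counts.items() if count > limit
    )


if __name__ == "__main__":
    counts = {"ok": 120, "scratch": 30, "dent": 45}
    # The smallest label has 30 images, so the limit is 90 images per label.
    print(overrepresented_labels(counts))
```

In this example, `ok` exceeds the limit, so you would either collect more `scratch` and `dent` images or remove some `ok` images before training.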
Common issues to fix before training:
Python:
async def main():
    viam_client = await connect()
    data_client = viam_client.data_client

    datasets = await data_client.list_datasets_by_organization_id(
        organization_id=ORG_ID,
    )
    for ds in datasets:
        print(f"Dataset: {ds.name}, ID: {ds.id}")

    viam_client.close()
Go:
datasets, err := dataClient.ListDatasetsByOrganizationID(ctx, orgID)
if err != nil {
	logger.Fatal(err)
}
for _, ds := range datasets {
	fmt.Printf("Dataset: %s, ID: %s\n", ds.Name, ds.ID)
}