A standard camera gives you a flat 2D image. You can see that there is a box on the table, but you cannot tell whether the box is 30 centimeters away or 3 meters away. This how-to shows you how to get depth data from your camera and extract useful distance measurements for robotics tasks that involve physical interaction.
Depth cameras capture both color (RGB) and distance information. There are several technologies:
| Technology | How it works | Common examples |
|---|---|---|
| Structured light | Projects a known pattern and measures distortion | Intel RealSense D400 series |
| Time of flight (ToF) | Measures how long light takes to bounce back | Intel RealSense L515, Azure Kinect |
| Stereo vision | Uses two cameras to calculate depth from disparity | Oak-D, ZED cameras |
All of these produce the same type of output in Viam: a point cloud or a depth map that you access through the standard camera API.
A point cloud is a collection of 3D points, each with an (x, y, z) position measured in millimeters from the camera’s optical center. Some point clouds also include color information for each point.
The coordinate system follows a standard convention: x points right, y points down, and z points forward from the camera's optical center, with all values in millimeters.
A typical indoor scene captured by a depth camera contains tens of thousands to hundreds of thousands of points, depending on the camera resolution and range.
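The depth resolution gives an upper bound on point count, since pixels with no valid depth reading (too close, too far, or too reflective) are dropped. A quick illustration:

```python
# Maximum possible points = depth-map width x height.
# Real clouds are smaller because invalid pixels are dropped.
for width, height in [(640, 480), (1280, 720)]:
    max_points = width * height
    print(f"{width}x{height}: up to {max_points:,} points")
```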
A depth map is a 2D image where each pixel value represents the distance from the camera to the surface at that pixel location. It is essentially a grayscale image where brighter pixels are farther away (or vice versa, depending on the encoding).
A point cloud is the 3D representation: each pixel in the depth map is projected into 3D space using the camera’s intrinsic parameters. The point cloud contains explicit (x, y, z) coordinates.
Viam provides point clouds through the GetPointCloud API. If you need a depth map instead, you can capture the depth image directly using GetImage with the appropriate MIME type.
Intrinsic parameters describe the internal geometry of the camera: focal length, principal point, and lens distortion. These parameters are required to accurately project 2D pixel coordinates into 3D space.
When you configure a depth camera in Viam, the intrinsic parameters are typically loaded automatically from the camera hardware. If your camera does not provide them, you can specify them manually in the camera configuration. Without correct intrinsic parameters, 3D projections will be inaccurate.
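The pinhole projection itself is short enough to sketch directly. This is a minimal illustration of how intrinsic parameters turn a pixel plus a depth reading into a 3D point; the intrinsic values are placeholders, not a real calibration:

```python
def pixel_to_3d(u, v, depth_mm, fx, fy, ppx, ppy):
    """Project pixel (u, v) with known depth into camera-frame
    coordinates (x right, y down, z forward), all in millimeters."""
    x = (u - ppx) * depth_mm / fx
    y = (v - ppy) * depth_mm / fy
    return (x, y, depth_mm)

# Placeholder intrinsics for a 640x480 sensor
fx = fy = 615.0
ppx, ppy = 320.0, 240.0

# The principal point projects straight ahead along the optical axis:
print(pixel_to_3d(320, 240, 1000, fx, fy, ppx, ppy))  # (0.0, 0.0, 1000)
```

If fx or ppx is wrong, every x coordinate is scaled or shifted by the same error, which is why incorrect intrinsics make whole-scene reconstructions skewed rather than noisy.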
If you need 3D positions for detected objects (for example, to guide a robot arm to pick up a cup), you combine 2D detections with depth data. The workflow is:

1. Run a 2D object detector on the color image to get bounding boxes.
2. Capture the matching depth data from the same camera.
3. For each detection, collect the depth values inside its bounding box.
4. Project those pixels into 3D using the camera's intrinsic parameters.
The result is a 3D point cloud for each detected object, with coordinates in the camera’s frame (x right, y down, z forward, in millimeters). To use these positions with other robot components, transform them through the frame system.
Step 5 of this guide shows how to combine detections with depth data to measure distance. For full 3D point clouds, use the vision service’s GetObjectPointClouds method if your vision service supports it.
Go to app.viam.com, navigate to your machine, and verify your depth camera appears in the component list. Open the test panel and confirm it is producing images.
For Intel RealSense cameras, the webcam model or the realsense module is commonly used. Check that the depth stream is enabled in the camera configuration.
Use the camera API to retrieve a point cloud from your depth camera.
import asyncio
from viam.robot.client import RobotClient
from viam.components.camera import Camera
async def main():
opts = RobotClient.Options.with_api_key(
api_key="YOUR-API-KEY",
api_key_id="YOUR-API-KEY-ID"
)
robot = await RobotClient.at_address("YOUR-MACHINE-ADDRESS", opts)
camera = Camera.from_robot(robot, "my-depth-camera")
# Get a point cloud
point_cloud, _ = await camera.get_point_cloud()
print("Point cloud retrieved")
print(f"Type: {type(point_cloud)}")
await robot.close()
if __name__ == "__main__":
asyncio.run(main())
package main
import (
"context"
"fmt"
"go.viam.com/rdk/components/camera"
"go.viam.com/rdk/logging"
"go.viam.com/rdk/robot/client"
"go.viam.com/utils/rpc"
)
func main() {
ctx := context.Background()
logger := logging.NewLogger("depth")
machine, err := client.New(ctx, "YOUR-MACHINE-ADDRESS", logger,
client.WithDialOptions(rpc.WithEntityCredentials(
"YOUR-API-KEY-ID",
rpc.Credentials{
Type: rpc.CredentialsTypeAPIKey,
Payload: "YOUR-API-KEY",
})),
)
if err != nil {
logger.Fatal(err)
}
defer machine.Close(ctx)
cam, err := camera.FromProvider(machine, "my-depth-camera")
if err != nil {
logger.Fatal(err)
}
pc, err := cam.NextPointCloud(ctx, nil)
if err != nil {
logger.Fatal(err)
}
fmt.Printf("Point cloud retrieved with %d points\n", pc.Size())
}
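In Python, get_point_cloud returns raw bytes, typically PCD-encoded. If you want to inspect the cloud without a point-cloud library, the PCD header is plain ASCII even when the data section is binary. A minimal sketch, assuming the standard PCD v0.7 header layout:

```python
def pcd_point_count(pcd_bytes: bytes) -> int:
    """Read the POINTS field from a PCD header.
    The header lines are ASCII even when DATA is binary."""
    for line in pcd_bytes.split(b"\n"):
        if line.startswith(b"POINTS"):
            return int(line.split()[1])
        if line.startswith(b"DATA"):
            break  # header ended without a POINTS field
    return 0

# Synthetic header for illustration:
header = (
    b"VERSION .7\nFIELDS x y z\nSIZE 4 4 4\nTYPE F F F\n"
    b"COUNT 1 1 1\nWIDTH 307200\nHEIGHT 1\n"
    b"VIEWPOINT 0 0 0 1 0 0 0\nPOINTS 307200\nDATA binary\n"
)
print(pcd_point_count(header))  # 307200
```

For real processing, hand the bytes to a point-cloud library rather than parsing the binary payload yourself.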
Instead of a full point cloud, you can capture a depth image. This is a 2D representation where each pixel value is the depth in millimeters.
from viam.components.camera import Camera
camera = Camera.from_robot(robot, "my-depth-camera")
# Get images from the depth camera
images, _ = await camera.get_images()
# The depth image is typically the second image
# Check source_name to identify the depth stream
for img in images:
print(f"Source: {img.source_name}, size: {img.width}x{img.height}")
cam, err := camera.FromProvider(machine, "my-depth-camera")
if err != nil {
logger.Fatal(err)
}
// Get images including the depth frame
images, _, err := cam.Images(ctx, nil, nil)
if err != nil {
logger.Fatal(err)
}
for _, img := range images {
fmt.Printf("Source: %s\n", img.SourceName)
}
Given a depth image, you can look up the distance at any pixel coordinate. This is useful when you know where an object is in the 2D image (from a detection) and want to know how far away it is.
import numpy as np
from viam.components.camera import Camera
camera = Camera.from_robot(robot, "my-depth-camera")
# Get images from the depth camera
images, _ = await camera.get_images()
# Find the depth image by checking available images
# Convert to numpy array for pixel-level access
depth_array = np.array(images[1].image)
# Read depth at a specific pixel (center of image)
center_x = depth_array.shape[1] // 2
center_y = depth_array.shape[0] // 2
depth_mm = depth_array[center_y, center_x]
print(f"Depth at center ({center_x}, {center_y}): {depth_mm} mm")
print(f"That is {depth_mm / 1000:.2f} meters")
# Read depth at a specific coordinate
target_x, target_y = 320, 240
depth_at_target = depth_array[target_y, target_x]
print(f"Depth at ({target_x}, {target_y}): {depth_at_target} mm")
import (
"fmt"
"image"
)
cam, err := camera.FromProvider(machine, "my-depth-camera")
if err != nil {
logger.Fatal(err)
}
images, _, err := cam.Images(ctx, nil, nil)
if err != nil {
logger.Fatal(err)
}
// Access the depth image (check SourceName to identify the depth stream)
depthImg, err := images[1].Image(ctx)
if err != nil {
logger.Fatal(err)
}
bounds := depthImg.Bounds()
centerX := (bounds.Min.X + bounds.Max.X) / 2
centerY := (bounds.Min.Y + bounds.Max.Y) / 2
// Read the depth value at center
r, _, _, _ := depthImg.At(centerX, centerY).RGBA()
depthMM := int(r)
fmt.Printf("Depth at center (%d, %d): %d mm\n", centerX, centerY, depthMM)
fmt.Printf("That is %.2f meters\n", float64(depthMM)/1000.0)
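One caveat when reading raw values: a depth of 0 usually means "no reading" rather than zero distance, so mask zeros out before computing statistics. A small numpy sketch with synthetic data:

```python
import numpy as np

# Synthetic 4x4 depth patch in mm; zeros are invalid readings
patch = np.array([
    [0,   820, 815, 0],
    [818, 822, 0,   819],
    [0,   817, 821, 820],
    [816, 0,   0,   0],
], dtype=np.uint16)

valid = patch[patch > 0]          # drop invalid zeros
print(f"valid pixels: {valid.size} of {patch.size}")
print(f"median depth: {np.median(valid):.0f} mm")
# A plain mean over the raw patch is dragged down by the zeros:
print(f"naive mean:   {patch.mean():.1f} mm")
```

The median over valid pixels is also robust to stray readings from the background bleeding into a bounding box, which is why step 5 below samples a region and takes its median.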
Combine 2D detections with depth data to measure the distance to each detected object. Use the center of the bounding box as the depth sample point.
from viam.components.camera import Camera
from viam.services.vision import VisionClient
import numpy as np
camera = Camera.from_robot(robot, "my-depth-camera")
detector = VisionClient.from_robot(robot, "my-detector")
# Get detections
detections = await detector.get_detections_from_camera("my-depth-camera")
# Get images including depth
images, _ = await camera.get_images()
depth_array = np.array(images[1].image)
for d in detections:
if d.confidence < 0.5:
continue
# Use the center of the bounding box
center_x = (d.x_min + d.x_max) // 2
center_y = (d.y_min + d.y_max) // 2
# Clamp to image bounds
center_x = max(0, min(center_x, depth_array.shape[1] - 1))
center_y = max(0, min(center_y, depth_array.shape[0] - 1))
depth_mm = depth_array[center_y, center_x]
# Sample a small region around center for more robust measurement
region = depth_array[
max(0, center_y - 5):min(depth_array.shape[0], center_y + 5),
max(0, center_x - 5):min(depth_array.shape[1], center_x + 5)
]
# Filter out zero (invalid) depth values
valid_depths = region[region > 0]
if len(valid_depths) > 0:
median_depth = np.median(valid_depths)
else:
median_depth = depth_mm
print(f"{d.class_name}: {d.confidence:.2f}, "
f"distance: {median_depth:.0f} mm ({median_depth/1000:.2f} m)")
import (
"fmt"
"sort"
"go.viam.com/rdk/components/camera"
"go.viam.com/rdk/services/vision"
)
cam, err := camera.FromProvider(machine, "my-depth-camera")
if err != nil {
logger.Fatal(err)
}
detector, err := vision.FromProvider(machine, "my-detector")
if err != nil {
logger.Fatal(err)
}
detections, err := detector.DetectionsFromCamera(ctx, "my-depth-camera", nil)
if err != nil {
logger.Fatal(err)
}
images, _, err := cam.Images(ctx, nil, nil)
if err != nil {
logger.Fatal(err)
}
// Access the depth image (check SourceName to identify the depth stream)
depthImg, err := images[1].Image(ctx)
if err != nil {
logger.Fatal(err)
}
bounds := depthImg.Bounds()
for _, d := range detections {
if d.Score() < 0.5 {
continue
}
bb := d.BoundingBox()
centerX := (bb.Min.X + bb.Max.X) / 2
centerY := (bb.Min.Y + bb.Max.Y) / 2
// Clamp to image bounds
if centerX < bounds.Min.X {
centerX = bounds.Min.X
}
if centerX >= bounds.Max.X {
centerX = bounds.Max.X - 1
}
if centerY < bounds.Min.Y {
centerY = bounds.Min.Y
}
if centerY >= bounds.Max.Y {
centerY = bounds.Max.Y - 1
}
// Sample center pixel depth
r, _, _, _ := depthImg.At(centerX, centerY).RGBA()
depthMM := int(r)
fmt.Printf("%s: %.2f, distance: %d mm (%.2f m)\n",
d.Label(), d.Score(), depthMM, float64(depthMM)/1000.0)
}
If you do not have a depth camera, configure a fake camera that generates simulated point cloud data. This lets you develop and test your depth-related code without hardware.
{
"name": "my-depth-camera",
"api": "rdk:component:camera",
"model": "fake",
"attributes": {}
}
The fake camera generates both color images and simulated point cloud data. The depth values are synthetic but structurally correct, so your code will work the same way with real hardware.
If your camera does not automatically provide intrinsic parameters, you can set them manually in the configuration. These parameters are needed for accurate 2D-to-3D projection.
{
"name": "my-depth-camera",
"api": "rdk:component:camera",
"model": "webcam",
"attributes": {
"intrinsic_parameters": {
"fx": 615.0,
"fy": 615.0,
"ppx": 320.0,
"ppy": 240.0,
"width_px": 640,
"height_px": 480
},
"distortion_parameters": {
"rk1": 0.0,
"rk2": 0.0,
"rk3": 0.0,
"tp1": 0.0,
"tp2": 0.0
}
}
}
| Parameter | Description |
|---|---|
| fx, fy | Focal length in pixels (horizontal and vertical) |
| ppx, ppy | Principal point (optical center) in pixels |
| width_px, height_px | Image dimensions |
| rk1, rk2, rk3 | Radial distortion coefficients |
| tp1, tp2 | Tangential distortion coefficients |
Most depth cameras (Intel RealSense, Oak-D) provide these automatically. You only need to set them manually for cameras without built-in calibration data.
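If you do have to enter intrinsics by hand, a quick plausibility check is to derive the horizontal field of view from the focal length and image width and compare it against the camera's datasheet. Using the placeholder values from the example config above:

```python
import math

# Placeholder intrinsics from the example config, not a real calibration
fx, width_px = 615.0, 640

# Horizontal FOV of a pinhole camera: 2 * atan(width / (2 * fx))
hfov = 2 * math.degrees(math.atan(width_px / (2 * fx)))
print(f"horizontal FOV: {hfov:.1f} degrees")  # roughly 55 degrees
```

A result wildly different from the spec-sheet FOV suggests fx is in the wrong units or scaled for a different resolution.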
If you need an image, its detections, and a point cloud together in one call, use CaptureAllFromCamera. This is more efficient than separate calls and ensures all results correspond to the same frame. See Detect Objects, step 7 for a full example.