CS285 DRL Notes - Lecture 2: Supervised Learning of Behaviors

These are notes on Berkeley's CS285, taught by Sergey Levine. This lecture covers imitation learning via behavioral cloning (BC), why it can fail, and how DAgger addresses the key issue. Original slides: CS285 Lecture 2 PDF.

From supervised learning to imitation learning

Lecture 2 starts from a familiar idea: treat policy learning as supervised learning.

  • Collect expert demonstrations:
    $$
    \tau = (o_1, a_1, o_2, a_2, \ldots, o_T, a_T)
    $$
  • Train a policy with maximum likelihood:
    $$
    \max_\theta \sum_{(o,a)\sim \mathcal{D}} \log \pi_\theta(a\mid o)
    $$
  • This is exactly behavioral cloning: map observation to action like a standard prediction problem.
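The maximum-likelihood objective can be made concrete with a small numerical sketch. The linear-softmax policy and toy data here are hypothetical stand-ins, just to show what $\sum \log \pi_\theta(a\mid o)$ computes:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def bc_log_likelihood(W, obs, acts):
    """Sum of log pi_theta(a | o) for a linear-softmax policy (toy stand-in)."""
    probs = softmax(obs @ W)                              # shape (N, num_actions)
    return np.log(probs[np.arange(len(acts)), acts]).sum()

# toy data: three 2-dim observations with discrete expert actions
obs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
acts = np.array([0, 1, 1])
W = np.zeros((2, 2))                       # all-zero weights give a uniform policy
print(bc_log_likelihood(W, obs, acts))     # 3 * log(0.5) ≈ -2.079
```

Behavioral cloning maximizes this quantity over $\theta$ (here, $W$) with any standard supervised-learning optimizer.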

Partially observed vs fully observed

The slide figure distinguishes two common setups:

  • Partially observed case: the policy acts on observations
    $$
    \pi_\theta(a_t\mid o_t)
    $$
    while the environment can be described with
    $$
    p(o_t\mid s_t),\quad p(s_{t+1}\mid s_t,a_t)
    $$
    where the true state $s_t$ is not directly available.

  • Fully observed case: observation equals state, i.e., $o_t=s_t$, so the policy is
    $$
    \pi_\theta(a_t\mid s_t)
    $$
    with dynamics
    $$
    p(s_{t+1}\mid s_t,a_t)
    $$

In practical BC training, we usually only need supervised pairs $(o_t,a_t)$ (or $(s_t,a_t)$), without explicitly modeling $p(o_t\mid s_t)$ or $p(s_{t+1}\mid s_t,a_t)$.

For action outputs:

  • Discrete actions: output logits / class probabilities.
  • Continuous actions: output distribution parameters (e.g., Gaussian mean and variance).
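The two output heads can be sketched as follows; the function names, weight shapes, and feature vector are illustrative, not from the lecture:

```python
import numpy as np

def discrete_head(features, W):
    """Map features to class probabilities over discrete actions (softmax of logits)."""
    logits = features @ W
    e = np.exp(logits - logits.max())
    return e / e.sum()

def gaussian_head(features, W_mu, log_std):
    """Map features to Gaussian parameters (mean, std) for a continuous action."""
    mu = features @ W_mu
    std = np.exp(log_std)      # log-std parameterization keeps std positive
    return mu, std

feat = np.array([0.5, -0.2])
probs = discrete_head(feat, np.zeros((2, 3)))                 # uniform over 3 actions here
mu, std = gaussian_head(feat, np.zeros((2, 1)), np.zeros(1))  # mean 0, std 1 here
```

In both cases the network outputs distribution parameters, so training can maximize log-likelihood and inference can sample or take the mean/mode.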

Behavioral cloning algorithm

Core pipeline:

  1. Ask an expert (human or oracle policy) to provide demonstration trajectories.
  2. Build dataset $\mathcal{D} = \{(o_i, a_i)\}$.
  3. Train $\pi_\theta(a\mid o)$ with supervised learning.
  4. Deploy the policy closed-loop in the environment, where each action influences the next observation.

This is simple and often effective in practice, and historically important (e.g., ALVINN for autonomous driving).

Specifically, this pipeline can be written more explicitly:

  • Dataset shape ($N$ trajectories, horizon $H$ each):
    $$
    \mathcal{D}=\left\{\left(o^{(i)}_1,a^{(i)}_1,\ldots,o^{(i)}_H,a^{(i)}_H\right)\right\}_{i=1}^N
    $$
  • Training objective (supervised learning = maximum likelihood):
    $$
    \theta^\star=\arg\max_\theta \sum_{i=1}^N\sum_{t=1}^H \log \pi_\theta\left(a_t^{(i)}\mid o_t^{(i)}\right)
    $$

Implementation-wise, the policy first maps $o_t$ to action-distribution parameters (logits for discrete actions, distribution parameters for continuous actions), then produces $a_t$ (log-likelihood supervision during training; sampling or mean/mode at inference).
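The explicit objective above can be optimized with plain gradient ascent. A minimal sketch, assuming a linear-softmax policy on toy separable data (all names and data here are hypothetical):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_bc(obs, acts, num_actions, lr=0.5, steps=200):
    """Fit a linear-softmax policy pi(a|o) by maximizing sum_t log pi(a_t|o_t)."""
    W = np.zeros((obs.shape[1], num_actions))
    onehot = np.eye(num_actions)[acts]
    for _ in range(steps):
        probs = softmax(obs @ W)
        grad = obs.T @ (onehot - probs)   # gradient of the log-likelihood w.r.t. W
        W += lr * grad / len(obs)
    return W

obs = np.array([[1.0, 0.0], [0.0, 1.0]])
acts = np.array([0, 1])
W = train_bc(obs, acts, num_actions=2)
probs = softmax(obs @ W)
# after training, the policy puts most of its mass on the expert action
```

The same structure carries over to neural-network policies; only the parameterization of $\pi_\theta$ and the optimizer change.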

Does behavioral cloning work?

Short answer from the lecture: sometimes yes, but not guaranteed.

Why can it fail even when supervised training loss is low?

  • Supervised learning relies on an i.i.d. assumption.
  • In control, the model’s own actions affect future states.
  • A small mistake can move the agent to unseen states.
  • On unseen states, policy error increases, causing more drift.

This is distributional shift (train-state distribution vs test-state distribution mismatch).

Why errors can compound with horizon

A key theoretical takeaway: BC error can scale poorly with trajectory length $T$.

  • Assume the per-step mistake probability under the training distribution is at most $\epsilon$.
  • Because a mistake shifts the state distribution away from the training data, the expected total cost can grow much faster than $\epsilon T$.
  • Worst-case analysis shows a quadratic dependence on the horizon, i.e., $\mathcal{O}(\epsilon T^2)$.

Intuition:

  • Early mistake changes later inputs.
  • Later predictions are made further off-distribution.
  • This creates cascading failures in long-horizon control.
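A simplified version of the standard worst-case argument makes the quadratic bound concrete: suppose the policy errs with probability at most $\epsilon$ on in-distribution states, and pessimistically assume that once a first mistake occurs, every remaining step incurs cost. By a union bound, the probability of having made a mistake by step $t$ is at most $\epsilon t$, so
$$
\mathbb{E}[\text{total cost}] \;\le\; \sum_{t=1}^{T} \epsilon\, t \;=\; \epsilon\,\frac{T(T+1)}{2} \;=\; \mathcal{O}(\epsilon T^2).
$$
If states were i.i.d. (no drift), the same per-step error would only give $\mathcal{O}(\epsilon T)$; the extra factor of $T$ is the price of compounding.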

Pessimism vs reality

The lecture also points out the nuance:

  • Worst-case analysis is pessimistic.
  • Real systems sometimes recover from mistakes.
  • But BC alone does not explicitly teach recovery unless recovery states appear in data.

A practical paradox:

  • Datasets containing some imperfect behavior + recoveries can sometimes make learned policies more robust.

Fixing distributional shift: DAgger

DAgger (Dataset Aggregation) addresses BC’s core issue by collecting data on states induced by the learned policy.

High-level loop:

  1. Train initial policy on expert demonstrations.
  2. Roll out current policy to visit its own state distribution.
  3. Query expert action labels on those visited states.
  4. Aggregate new labeled data into dataset.
  5. Retrain policy and repeat.
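The five steps above can be sketched as follows; `expert_policy`, `train`, and `rollout` are hypothetical stand-ins for the expert labeler, the supervised learner, and environment rollouts:

```python
import numpy as np

def dagger(expert_policy, train, rollout, n_iters=5):
    """DAgger sketch: aggregate expert labels on states visited by the learner."""
    dataset = []                # aggregated (observation, expert action) pairs
    policy = train(dataset)     # step 1: initial policy (here trained from empty data)
    for _ in range(n_iters):
        observations = rollout(policy)                     # step 2: learner's own states
        labels = [expert_policy(o) for o in observations]  # step 3: expert relabels them
        dataset += list(zip(observations, labels))         # step 4: aggregate, keep old data
        policy = train(dataset)                            # step 5: retrain and repeat
    return policy

# toy instantiation: scalar observations, expert picks the sign of the observation
expert = lambda o: 1 if o >= 0 else -1

def train(data):
    """Hypothetical toy learner: majority vote within each sign bucket."""
    pos = [a for o, a in data if o >= 0]
    neg = [a for o, a in data if o < 0]
    def policy(o):
        bucket = pos if o >= 0 else neg
        return max(set(bucket), key=bucket.count) if bucket else 1
    return policy

rollout = lambda policy: list(np.linspace(-1.0, 1.0, 5))   # states the learner visits
learned = dagger(expert, train, rollout)
```

The key design choice is step 4: old data is never discarded, which is what gives DAgger its no-regret guarantee relative to naive iterative retraining.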

Why this helps:

  • Training distribution becomes closer to deployment distribution.
  • Policy learns what to do in states caused by its own mistakes.
  • Theoretical guarantees are stronger than plain BC in sequential settings.

Practical notes from this lecture

Besides DAgger, the lecture lists additional directions:

  • Use stronger models to reduce per-step error.
  • Improve data collection/augmentation.
  • Use multi-task training for broader state coverage.

In real projects, BC is often the starting point, and DAgger-style relabeling/online aggregation is a key upgrade for robustness.

My takeaway

  • BC is the most direct bridge from supervised learning to control.
  • The broken i.i.d. assumption is the central problem in imitation learning.
  • Distributional shift is not a corner case; it is structural in closed-loop control.
  • DAgger is important because it fixes the data distribution mismatch, not because it changes network architecture.

Author: JackYFL. Posted on March 10, 2026.