CS285 DRL Notes-Lecture 3 Imitation Learning Models and Data

This is the reading note for Lecture 3 of Berkeley CS285, taught by Sergey Levine. Lecture 3 focuses on three questions: how to design models, how to organize data, and how to improve generalization through multi-task and goal-conditioned learning. Original slides: CS285 Lecture 3 PDF.

Recap: the core BC bottleneck

Following Lecture 2, the main issue of Behavioral Cloning (BC) is still distributional shift.

  • During training, we learn from the expert distribution.
  • During deployment, the policy visits its own induced state distribution.
  • Errors can accumulate in closed-loop control.

The key question of Lecture 3 is: given this setting, how can we make BC stronger and more robust?

Part 1: Models for imitation learning

1) Why might we fail to fit expert behavior?

The lecture gives two common reasons:

  • Non-Markovian behavior: expert actions depend on history, not just current observation.
  • Multimodal behavior: multiple reasonable action modes may exist for the same observation.

Corresponding improvement directions:

  • Use sequence models (e.g., Transformer) to encode history.
  • Use more expressive action distributions to model multimodal outputs.
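
As a minimal illustration of the second point, here is a sketch of the sampling step of a mixture-of-Gaussians ("mixture density") action head. All names are hypothetical; in practice `weights`, `means`, and `stds` would be produced by a network conditioned on the observation:

```python
import numpy as np

def sample_mixture_action(weights, means, stds, rng):
    # Multimodal action head: first pick one of M modes from a
    # categorical over the mixture weights, then sample a Gaussian
    # action around that mode's mean.
    m = rng.choice(len(weights), p=weights)
    return means[m] + stds[m] * rng.standard_normal(means[m].shape)
```

Because the mode index is sampled first, the head can place probability mass on several distinct action modes instead of averaging them into one.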

2) History modeling and causal confusion

Adding history is not always strictly better. The lecture mentions causal confusion (de Haan et al.).

  • The model may rely on features correlated with success but not truly causal.
  • History can mitigate this issue, but can also amplify it.
  • Whether DAgger can mitigate causal confusion is a separate question worth analyzing.

3) Multimodal action modeling

For high-dimensional continuous actions, naively discretizing the full joint action space is impractical: the number of bins grows exponentially with the action dimension. The lecture gives two routes:

  • Autoregressive discretization: discretize dimensions progressively and predict autoregressively.
  • Expressive continuous distributions: directly model complex continuous distributions (VAE / flow / diffusion).
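
The first route can be sketched as follows, assuming a hypothetical `logits_fn(d, prefix)` that scores the bins for dimension `d` given the dimensions sampled so far:

```python
import numpy as np

def autoregressive_sample(logits_fn, bins, rng):
    # Discretize one action dimension at a time: sample the bin for
    # dimension d from a categorical conditioned on the prefix of
    # already-sampled dimensions, then map the bin index to a value.
    action = []
    for d in range(len(bins)):
        logits = logits_fn(d, action)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        idx = rng.choice(len(probs), p=probs)
        action.append(bins[d][idx])
    return np.array(action)
```

This keeps the total bin count linear in the action dimension, instead of exponential as in a full joint discretization.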

A key point for diffusion/flow matching is: the policy must genuinely use noise variables, so different noise samples correspond to different action modes.

Using the intuition from the figure, both diffusion and flow matching learn a generative mapping from a noise distribution to a data distribution:

  • diffusion is often written as a discrete denoising chain: from pure noise $x_T\sim \mathcal{N}(0,I)$ back to a clean sample $x_0$;
  • flow matching directly learns a continuous-time vector field $v(x_t,t)$ that continuously deforms the noise manifold into the data manifold. Sampling can be written as
    $$
    x_1 = x_0 + \int_0^1 v(x_t,t)\,dt,\quad x_0\sim p_0
    $$

In imitation learning, when $x$ corresponds to actions (or action chunks), these methods naturally represent one-to-many multimodal action distributions, instead of collapsing all reasonable actions to a single mean.

A common flow matching algorithm can be written in two parts:

  • Sampling (inference)

    1. Sample initial noise $x_0\sim \mathcal{N}(0,I)$.
    2. Integrate forward using Euler steps:
      $$
      x_{t+\Delta t}\leftarrow x_t+v(x_t,t)\Delta t,\quad t\in\{0,\Delta t,\ldots,1-\Delta t\}
      $$
    3. Return $x_1$ as the generated sample (here it can represent an action or action chunk).
  • Training (velocity matching)

    1. Sample $x_0\sim \mathcal{N}(0,I)$.
    2. Sample $x_1\sim \mathcal{D}$ from the data distribution.
    3. Sample time $t\sim p(t)$ (commonly $p(t)=\mathcal{U}(0,1)$).
    4. Construct interpolation point $x_t=t x_1+(1-t)x_0$.
    5. Supervise the vector field with target velocity $x_1-x_0$:
      $$
      \min_v \left\|v(x_t,t)-(x_1-x_0)\right\|_2^2
      $$
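
The two procedures above can be sketched together. As a toy check (not the lecture's code), assume the data distribution is a single point $x_1$; then the velocity-matching objective has the closed-form minimizer $v^*(x_t,t)=(x_1-x_t)/(1-t)$ (substitute $x_0=(x_t-t\,x_1)/(1-t)$ into the target $x_1-x_0$), and the Euler sampler should land on $x_1$:

```python
import numpy as np

def optimal_velocity(x_t, t, x1):
    # Closed-form minimizer of the velocity-matching loss when the
    # data distribution is a point mass at x1:
    # v*(x_t, t) = (x1 - x_t) / (1 - t).
    return (x1 - x_t) / (1.0 - t)

def euler_sample(x1, n_steps=100, dim=2, seed=0):
    # Sampling: start from x_0 ~ N(0, I) and integrate the vector
    # field forward with Euler steps over t in {0, dt, ..., 1 - dt}.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(dim)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        x = x + optimal_velocity(x, t, x1) * dt
    return x
```

With a learned network in place of `optimal_velocity`, the same loop generates actions (or action chunks) from fresh noise, and different noise draws can land in different action modes.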

4) Action chunking

The lecture highlights a very practical trick: action chunking.

  • Predict a sequence of actions at once (instead of one step at a time).
  • This significantly improves stability in long-horizon control.
  • It works especially well when combined with diffusion policy.

Using the notation in the figure, the difference between standard and chunked policies can be written explicitly:

  • Standard step-wise policy
    first sample
    $$
    a_t \sim \pi_\theta(a_t\mid o_t)
    $$
    execute $a_t$, observe $o_{t+1}$, and repeat.

  • Action-chunked policy
    sample a full action segment
    $$
    a_{t:t+K}\sim \pi_\theta(a_{t:t+K}\mid o_t)
    $$
    then execute $a_t,a_{t+1},\ldots,a_{t+K}$ in sequence, observe $o_{t+K+1}$, and repeat.

Equivalent view: chunking changes the policy from “decision every step” to “decision every $K+1$ steps,” reducing decision frequency and improving short-horizon action consistency.
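
The chunked control loop can be sketched as follows, with hypothetical `env_step` and `policy` interfaces (the policy returns a length-$K+1$ action segment):

```python
def run_chunked(env_step, policy, obs, horizon):
    # Chunked execution: sample a_{t:t+K} ~ pi(. | o_t) once, execute
    # the whole segment open-loop, then observe again and replan.
    t, executed = 0, []
    while t < horizon:
        chunk = policy(obs)                    # length K + 1
        for a in chunk[: horizon - t]:
            obs = env_step(a)                  # only the last obs feeds the next chunk
            executed.append(a)
        t += min(len(chunk), horizon - t)
    return executed
```

Setting the chunk length to 1 recovers the standard step-wise policy.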

5) Case studies: diffusion policy in practice

  • Case 1: imitation with diffusion models (Chi et al., 2023)
    A common setup conditions on visual observation sequences and directly generates action sequences. In robot manipulation tasks, this often improves multimodal action modeling and execution stability.

  • Case 2: flow matching + action chunking + pre-training (Pi 0, Black et al., 2024)
    Pi 0 is a vision-language-action foundation model for robotics. It combines a pre-trained VLM backbone with an action expert trained via flow matching (a diffusion variant) and action chunking to generate dexterous continuous control across different robot embodiments. It is trained with a pre-training + post-training recipe on diverse cross-embodiment robot data (reported as 7 robot configurations and 68 tasks), then prompted or fine-tuned for harder multi-stage tasks such as laundry folding and box assembly, showing improved generalization and robustness in real-world manipulation.

Part 2: Narrow vs broad data

This part discusses the tension between data quality and coverage:

  • Good but narrow data: better actions, but limited state coverage.
  • Bad but broad data: wider coverage, but suboptimal actions.

The key practice from the lecture is:

  • pre-train on broad data to gain situation understanding and robustness;
  • post-train/SFT on high-quality task data to learn consistent and reliable strategies.

This aligns with the modern foundation-model paradigm:

  • pre-training gives the model broad knowledge;
  • post-training teaches the model how to apply that knowledge toward user goals.

Part 3: Multi-task learning to the rescue

1) Goal-conditioned behavioral cloning

Lecture 3 introduces goal-conditioned policies:

$$
\pi_\theta(a_t\mid s_t, g)
$$

The objective becomes “move toward the goal from current state,” rather than copying a fixed single-task trajectory.

Benefits:

  • higher data reuse efficiency;
  • shared representations across tasks, often with better generalization;
  • a unified learning interface centered on reaching goals.
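
A minimal sketch of the goal-conditioned BC objective, assuming a hypothetical batch layout and a continuous-control MSE loss:

```python
import numpy as np

def gcbc_loss(policy, batch):
    # Goal-conditioned BC: the policy conditions on (s_t, g) by
    # simple concatenation and regresses the expert action.
    s, a, g = batch["states"], batch["actions"], batch["goals"]
    pred = policy(np.concatenate([s, g], axis=-1))
    return np.mean((pred - a) ** 2)
```

The same expert transition can be reused under many different goals, which is where the data-reuse benefit comes from.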

2) Beyond imitation: self-imitation / relabeling idea

The lecture also gives a direction beyond pure imitation:

  1. collect data with a random policy and random goals;
  2. relabel actually reached states as reachable goals;
  3. treat relabeled trajectories as demonstrations for continued training;
  4. iterate to improve the policy.

This is conceptually connected to later RL ideas such as offline relabeling and goal relabeling.
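
Step 2 of the loop above (relabeling reached states as goals) can be sketched as follows, with a hypothetical trajectory layout; `K` bounds how far ahead a reached state may be used as a goal:

```python
def hindsight_relabel(traj, K):
    # Treat states actually reached within the next K steps as goals
    # that the earlier action "succeeded" at reaching, producing
    # (state, goal, action) demonstrations for goal-conditioned BC.
    states, actions = traj["states"], traj["actions"]
    examples = []
    for t in range(len(actions)):
        for k in range(t + 1, min(t + 1 + K, len(states))):
            examples.append((states[t], states[k], actions[t]))
    return examples
```

Even data from a random policy yields valid goal-reaching demonstrations under this relabeling, which is what lets the iteration bootstrap itself.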

My takeaway

  • The main thread of L3 is not a single new algorithm, but a systematic upgrade path for BC.
  • On the model side, we must handle history dependence and action multimodality (Transformer + diffusion/flow/chunking).
  • On the data side, we should combine broad pre-training with high-quality post-training.
  • On the task side, we should move from single-task BC to goal-conditioned multi-task learning.

In one sentence: BC is only the starting point; practical imitation systems require coordinated upgrades in model, data, and task formulation.

Author: JackYFL
Posted on March 10, 2026