Introducing Real-Time Video Chat for Agents
PikaStream1.0 is our latest advancement in creating more human AI
April 02, 2026 • By Pika Fundamental Research Team
Today, we're making it possible to have a face-to-face, real-time conversation with any AI agent, because we believe most people would prefer to interact with AI the way they do with other humans.
You can try it with your Pika AI Self simply by inviting them to a Google Meet. For other agents, you can download the Skill on GitHub here.
Why real-time video?
At Pika, we're building AI Selves: persistent, agentic beings that act as living extensions of you. They can assist you with tasks, join your group chats, and carry on conversations with your personality and taste. But for an AI Self to truly feel alive, they need more than language. They need a face, a voice, and the ability to react in real time.
Today's video generation models produce stunning results, but they operate offline: a single clip can take seconds to minutes. This makes them fundamentally incompatible with live interaction like video calls, FaceTime, or real-time avatars, where generation must be continuous, identity-consistent, and instantly responsive to new speech, emotions, and contexts.
The PikaStream Solution
PikaStream1.0 is the real-time visual engine built to close this gap. It generates personalized video at 30 FPS and 480p on a single H100 GPU, with an end-to-end speech-to-video latency of ~1.5 seconds.
For an emotionally resonant and useful experience, we also ensured the model maintains each agent's memory and context and can carry out tasks during interaction, along with the not-so-simple addition of natural gestures and appropriate emotions.
Expressiveness
demo video coming soon!
Agentic Abilities
demo video coming soon!
Persistent Memory
demo video coming soon!
Three components make this possible:
- FlashVAE: a full Transformer-based VAE trained from scratch that provides its own latent space and reconstructs video in real time via streaming decoding at 441 FPS with just 1.1 GB memory on a single GPU.
- 9B Diffusion Transformer (DiT): a bidirectional teacher distilled into a causal autoregressive student via optimized self-forcing, enabling chunk-by-chunk streaming at real-time frame rates.
- Multi-reward RLHF: direct optimization for identity consistency, lip-sync accuracy, and motion naturalness across both teacher and student.
System Architecture

Figure 1. Overview of the PikaStream1.0 architecture. FlashVAE provides the latent space and handles real-time pixel-level reconstruction via streaming decoding. The 9B DiT generates audio-conditioned video in this latent space, and reference injection maintains identity consistency.
PikaStream1.0 consists of three model components and an efficient inference pipeline:
- FlashVAE for latent space encoding and real-time streaming decoding
- 9B Diffusion Transformer for audio-conditioned video generation
- Reference injection for identity consistency
- Inference engineering that fuses decoding, audio conditioning, and scheduling into a single-GPU pipeline delivering 24 FPS
FlashVAE
FlashVAE builds on our concurrent work, FlashDecoder, which showed that a Transformer-based streaming decoder can match conventional 3D convolutional decoders in reconstruction quality (PSNR & LPIPS) while running over 10x faster. For the full architectural details and optimization techniques, we refer readers to the FlashDecoder paper.
Video generation models operate in a compressed latent space: an encoder compresses raw video into latents, the diffusion model generates in that space, and a decoder reconstructs the final pixels. In most existing systems, this decoder is a 3D convolutional network. These decoders reconstruct well, but they are slow and memory-intensive, too slow for real-time streaming where frames must be decoded as fast as they are generated.
FlashVAE extends FlashDecoder into a full VAE trained from scratch, with both encoder and decoder built entirely on Transformer architectures. This gives us our own latent space, independent of any external model. The encoder defines the latent space, but the decoder is where real-time performance matters most: instead of decoding all frames at once, it processes one frame at a time, using only the last few frames as context. As soon as the DiT produces a chunk of latents, FlashVAE decodes them into pixels immediately, frame by frame. Because the context window is short and fixed, the per-frame cost is small and constant regardless of video length.
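The fixed-context streaming decode described above can be sketched as a small loop. This is a schematic only: `decode_frame` is a hypothetical stand-in for the FlashVAE decoder, and the real system operates on latent tensors rather than Python objects.

```python
from collections import deque

def stream_decode(latent_chunks, decode_frame, context_len=4):
    """Decode latent frames one at a time, keeping only a short,
    fixed-size window of past frames as context. `decode_frame` is a
    placeholder for the actual decoder; `context_len` is illustrative."""
    context = deque(maxlen=context_len)  # constant memory, any video length
    for chunk in latent_chunks:          # chunks arrive as the DiT emits them
        for latent in chunk:
            pixels = decode_frame(latent, list(context))
            context.append(latent)       # oldest frame drops off automatically
            yield pixels                 # frame is playable immediately
```

Because the deque caps the context, per-frame cost stays constant no matter how long the stream runs, which is what makes the 441 FPS throughput sustainable.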
With efficient kernels and compilation, FlashVAE achieves 441 FPS decoding throughput at 480p with just 1.1 GB peak memory on a single H100.
9B Diffusion Transformer (DiT)
The backbone of PikaStream1.0 is a 9-billion parameter Diffusion Transformer (DiT) that generates video from text and audio. We first train a bidirectional DiT that attends to all frames at once for maximum quality. This bidirectional model is later distilled into a causal autoregressive student for real-time streaming (described in the Autoregressive Distillation section). Here we describe the architecture shared by both. The model uses full spatio-temporal self-attention across frames, with two separate cross-attention mechanisms for multimodal conditioning:
Text cross-attention. Each chunk attends to text prompts for semantic guidance. Because conditioning is per-chunk, prompts can be swapped on the fly, enabling interactive control over motion, expressions, and actions during live generation.
Frame-wise audio cross-attention. Each video frame attends only to its temporally aligned audio tokens, not the full audio sequence. This tight frame-level alignment enables accurate lip-sync.
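The frame-wise alignment can be pictured as a block-diagonal cross-attention mask: frame i may only attend to its own audio tokens. A minimal sketch, assuming a fixed number of audio tokens per frame (the real model's tokenization details are not public):

```python
import numpy as np

def frame_audio_mask(n_frames, audio_tokens_per_frame):
    """Boolean cross-attention mask where video frame i attends only to
    its temporally aligned audio tokens. Illustrative; production models
    typically apply this as an additive mask on attention logits."""
    n_audio = n_frames * audio_tokens_per_frame
    mask = np.zeros((n_frames, n_audio), dtype=bool)
    for i in range(n_frames):
        start = i * audio_tokens_per_frame
        mask[i, start:start + audio_tokens_per_frame] = True  # own tokens only
    return mask
```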
Reference Injection
To keep the generated face consistent with the target identity, we feed a reference image of the person into the model alongside the video frames. The model learns to distinguish the reference from the generated sequence using positional encodings (RoPE), so the identity signal persists throughout generation. This mechanism is shared by both teacher and student without modification.
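One way to realize this separation is to give reference tokens a temporal position outside the generated timeline, so RoPE encodes them distinctly from every video frame. The offset scheme below is purely illustrative; PikaStream1.0's actual positional layout is not public.

```python
import numpy as np

def assign_positions(n_ref_tokens, n_video_frames, ref_offset=-1):
    """Assign RoPE temporal positions: reference tokens sit at a
    position 'before' frame 0, video frames advance normally.
    The offset value is a hypothetical choice for illustration."""
    ref_pos = np.full(n_ref_tokens, ref_offset)  # identity signal, off-timeline
    video_pos = np.arange(n_video_frames)        # generated frames: 0, 1, 2, ...
    return np.concatenate([ref_pos, video_pos])
```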
Autoregressive Distillation
The bidirectional teacher achieves the best quality by looking at all frames at once, but it cannot stream: it needs the full sequence before producing any output. For real-time use, we need a model that generates frames left-to-right, one chunk at a time. Training such a model from scratch produces lower quality, so we instead start from the strong bidirectional model and teach it to generate causally in two stages: (1) adapting its attention pattern so it only looks backward, and (2) distilling it into a fast autoregressive generator via self-forcing.
Adapting Bidirectional to Causal Transformer
We first teach the bidirectional model to generate left-to-right. We compared two approaches, diffusion forcing and ODE regression, and found ODE regression to be more stable, which is critical for smooth, temporally coherent streaming.
We partition each training sequence into chunks and restrict each chunk to only attend to past chunks, not future ones. Following MotionStream, chunk sizes are randomized to prevent overfitting to a fixed pattern. We also pre-compute the denoising trajectory that the student will follow during inference and train the model to denoise from exactly those intermediate points. This adaptation requires only ~100k video samples.
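The chunk-restricted attention pattern amounts to a block-causal mask: tokens within a chunk attend to each other and to all earlier chunks, never to future ones. A minimal sketch (chunk sizes would be drawn randomly per training sequence, following MotionStream):

```python
import numpy as np

def chunk_causal_mask(chunk_sizes):
    """Block-causal attention mask over a sequence partitioned into
    chunks: each chunk sees itself and the past, not the future.
    Schematic; the real mask operates on spatio-temporal tokens."""
    n = sum(chunk_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in chunk_sizes:
        end = start + size
        mask[start:end, :end] = True  # current chunk + everything before it
        start = end
    return mask
```

Randomizing `chunk_sizes` during training (e.g. via `np.random.default_rng().integers(...)`) prevents the model from overfitting to one fixed chunking pattern.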
Optimized Self-Forcing
Next, we distill the adapted model into a fast autoregressive generator using an optimized self-forcing procedure. The student generates a chunk, then uses its own output (not ground truth) as context for the next chunk, learning to recover from its own imperfections. This is essential for stable long-form streaming. A distribution matching distillation (DMD) loss aligns the student's outputs with the teacher.
Three design choices make this practical at scale:
- Constant cost per chunk. A fixed-size context window keeps memory and compute the same regardless of video length, enabling infinite-length generation.
- Memory-efficient training. Only one denoising step per chunk carries gradients, keeping GPU memory manageable.
- No extra forward pass. Standard self-forcing runs a separate pass on clean frames to build context for the next chunk. We instead cache the KV states directly from the last denoising step. This saves one full forward pass per chunk and avoids a distribution mismatch, since the model was never trained on fully clean inputs.
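The three choices above combine into a rollout loop like the following sketch. `denoise_chunk` is a stand-in for the student DiT; in the real system it returns latents plus the KV states of its last denoising step, which are cached directly instead of recomputed from clean frames.

```python
def self_forcing_rollout(n_chunks, denoise_chunk, max_cache_chunks=2):
    """Autoregressive rollout where each chunk's KV states (from its
    final denoising step) become context for the next chunk, with a
    fixed-size cache window for constant per-chunk cost. Schematic;
    `denoise_chunk` and the window size are illustrative."""
    kv_cache, output = [], []
    for i in range(n_chunks):
        frames, kv = denoise_chunk(i, kv_cache)   # conditions on own past output
        output.append(frames)
        kv_cache.append(kv)                       # reuse KV: no second forward pass
        kv_cache = kv_cache[-max_cache_chunks:]   # constant memory, any length
    return output
```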
Multi-Reward RLHF

Figure 2. Multi-reward RLHF training pipeline. Generated frames are evaluated by multiple reward models, and the scores are back-propagated through a differentiable denoising window to update the diffusion model.
Diffusion models are trained to denoise, learning to predict clean outputs from noisy inputs. This objective produces coherent video, but it does not directly optimize for the qualities that matter most in a visual engine: does the face match the reference? Are the lips in sync with the audio? Is the motion natural? Are there visual artifacts? We address this with multi-reward RLHF.
We train a suite of proprietary reward models that score generated videos on these dimensions. Using Reward Feedback Learning (ReFL), we back-propagate these scores through a differentiable denoising window to update the diffusion model, steering it toward outputs that score higher on all reward axes simultaneously.
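The "differentiable denoising window" amounts to detaching gradients on early denoising steps and back-propagating reward only through the final ones. The sketch below shows just this control flow; `denoise`, `reward_fn`, and `detach` stand in for the real models and autograd ops, and the window size is illustrative.

```python
def refl_step(denoise, reward_fn, x_noisy, n_steps, grad_window=1):
    """One ReFL-style pass: denoise for n_steps, detaching all but the
    last `grad_window` steps, then score the result with a reward model.
    Framework-agnostic schematic, not a full training loop."""
    x = x_noisy
    for step in range(n_steps):
        x = denoise(x, step)
        if step < n_steps - grad_window:
            x = x.detach()        # early steps carry no gradient
    return reward_fn(x)           # gradient flows only through the window
```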
For both teacher and student, we denoise, decode to pixels, and compute rewards on the generated frames. For the student, this is done via autoregressive rollout, chunk by chunk, so the model receives reward feedback in its native streaming mode.
Training
Figure 3. Data pipeline overview. Raw videos are processed through content normalization, metadata enrichment, and structured filtering to produce pre-training (~10M clips) and SFT (~2M clips) corpora.
We built a multi-stage data pipeline that transforms raw video into two training corpora:
Pre-training corpus (~10M clips). Raw videos go through scene detection, deduplication, quality filtering, and artifact removal (letterboxing, watermarks, subtitles). Each clip is then enriched with audio-visual sync scores, aesthetic ratings, camera motion labels, human action annotations, and dense captions generated by VLMs. The result is a diverse, semantically rich foundation for large-scale training.
SFT corpus (~2M clips). From the pre-training pool, we select the highest-quality clips using stricter aesthetic and sync thresholds, plus human verification. This curated subset drives the quality improvements in fine-tuning.
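The automated part of this selection is a threshold pass over the enriched metadata, before human verification. A minimal sketch with hypothetical field names and cutoff values (Pika's actual thresholds are not public):

```python
def select_sft_clips(clips, min_aesthetic=0.8, min_sync=0.9):
    """Filter the pre-training pool down to SFT candidates using stricter
    aesthetic and audio-visual sync thresholds. Field names and cutoffs
    are illustrative; human verification follows this pass."""
    return [c for c in clips
            if c["aesthetic"] >= min_aesthetic and c["sync"] >= min_sync]
```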
Training is split into two independent tracks. FlashVAE is trained first in three progressive stages (low-resolution pre-training, high-resolution streaming, adversarial fine-tuning) to establish the latent space. The 9B DiT is then trained within that space in five stages: pre-training, supervised fine-tuning, RLHF on the bidirectional teacher, autoregressive distillation into the causal student, and a final round of RLHF on the student.
Inference
With the model trained, the final challenge is serving it at conversation speed: a real-time visual engine is more than a fast model. The user speaks, and within ~1.5 seconds, an AI Self responds with synchronized video. Making this work requires careful orchestration of every stage in the pipeline to minimize end-to-end latency.
End-to-End Latency
Figure 4. End-to-end inference latency breakdown. All stages run concurrently in a streaming pipeline, achieving ~1.5 seconds from speech input to video output.
Our integrated speech-to-video pipeline delivers an end-to-end response in approximately 1.5 seconds. The pipeline streams every stage in parallel: speech recognition, LLM reasoning, and text-to-speech all run concurrently, feeding audio chunks directly into the video generator as they become available. Instead of waiting for a full response, the system begins generating video the moment the first audio chunk is ready. Each chunk triggers diffusion-based synthesis on the GPU, and the resulting frames are packaged into media segments for immediate playback. The user sees a video response begin almost as soon as they stop speaking.
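The key idea, stages running concurrently and video starting on the first audio chunk, can be sketched with a producer-consumer queue. This is a toy illustration of the concurrency pattern only; `generate_video` stands in for the GPU diffusion step, and the real pipeline has more stages (ASR, LLM, TTS) chained the same way.

```python
import queue
import threading

def run_streaming_pipeline(audio_chunks, generate_video, out):
    """Stream audio chunks into the video generator as they arrive,
    rather than waiting for the full utterance. Schematic of the
    concurrency pattern; names are illustrative."""
    q = queue.Queue()

    def producer():                      # e.g. TTS emitting audio chunks
        for chunk in audio_chunks:
            q.put(chunk)
        q.put(None)                      # end-of-stream sentinel

    def consumer():                      # video starts on the first chunk
        while (chunk := q.get()) is not None:
            out.append(generate_video(chunk))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```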
Reference-Based Video Editing
Figure 5. Reference-based video editing via asynchronous KV cache injection. The new reference is encoded in the background and swapped into the active cache without interrupting generation.
When the user changes a reference image or prompt mid-stream, the system doesn't stop generating. It processes the new reference in the background, encoding and transforming the updated identity, while the main generation loop continues outputting frames without interruption.
Once the new reference is ready, the system swaps the relevant tokens directly in the active KV cache. There is no need to restart the sequence. The generator immediately begins producing frames that reflect the updated identity, while maintaining smooth temporal continuity with what came before.
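The swap itself is an in-place replacement of the reference tokens' slice inside the cache, once the background encoding finishes. A minimal sketch, assuming a flat `[tokens, dim]` cache layout with the reference occupying a known slice (the actual cache layout is not public):

```python
import numpy as np

def swap_reference_tokens(kv_cache, new_ref_kv, ref_slice):
    """Replace the reference-image tokens inside an active KV cache
    without restarting generation. Works on a copy, so the swap can be
    published atomically to the generation loop. Layout is illustrative."""
    updated = kv_cache.copy()            # prepared off the hot path
    updated[ref_slice] = new_ref_kv      # only the identity tokens change
    return updated
```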
Results
The sections above describe the architecture, training, and inference design of PikaStream1.0. Two natural questions follow: How much did it improve over our previous generation, and how does a real-time visual engine compare to the ways AI assistants interact today?
From Nova01 to PikaStream1.0
Pika Nova01, our first-generation ultra-fast video generation model, required 8 GPUs and took 4.5 seconds per response: fast for video generation, but too slow for real-time conversation. PikaStream1.0 runs on a single GPU with ~1.5 seconds end-to-end latency and streams video to the user at 24 FPS in deployment. At 4.5 seconds, every exchange feels like leaving a voicemail. At 1.5 seconds with continuous streaming, it feels like FaceTime.
[Nova01 vs PikaStream1.0 comparison video]
Conclusion
With PikaStream1.0, we gave AI Selves a face. 24 FPS, 1.5 seconds of latency, one GPU. For the first time, a real-time visual engine can power live, identity-consistent video at conversation speed.
Real-time video generation is moving fast, but building an agentic visual engine that speaks, expresses, and persists as a digital self is a new frontier. We're excited to share this first step, and there's a lot more coming.
References
- FlashDecoder - Kang and Kwak, "FlashDecoder: A Pure-Transformer Streaming Video Decoder for Real-Time Latent Decoding", CVPR 2026.
- Wan - Wan Team et al., "Wan: Open and Advanced Large-Scale Video Generative Models", 2025. arXiv
- DiT - Peebles and Xie, "Scalable Diffusion Models with Transformers", ICCV 2023. arXiv
- FlashAttention - Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", NeurIPS 2022. Paper
- RoPE - Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", Neurocomputing 2024. arXiv
- RLHF - Ouyang et al., "Training language models to follow instructions with human feedback", NeurIPS 2022. arXiv
- Diffusion Forcing - Chen et al., "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion", NeurIPS 2024. arXiv
- Self-Forcing - Huang et al., "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion", NeurIPS 2025 Spotlight. arXiv
- DMD2 - Yin et al., "Improved Distribution Matching Distillation for Fast Image Synthesis", NeurIPS 2024. arXiv
- ReFL - Xu et al., "ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation", NeurIPS 2023. arXiv
- MotionStream - Lee et al., "MotionStream: Real-Time Video Generation with Interactive Motion Controls", 2025. arXiv
- Muon - Jordan et al., "Muon: An optimizer for hidden layers in neural networks", 2024. GitHub
Contributors
Fundamental Research Team (sorted alphabetically):
Cade Li, Fan Yang, Minguk Kang, Richard Fan, Xiaotong Bao, Xiaojian Wang, Yifeng Wang, Zhicheng Sun.
Acknowledgement
We'd like to thank the Fundamental Research Team for their essential contributions. We are especially grateful for the leadership of Demi Guo and Chenlin Meng. We would also like to thank Siqi He and Jiajun Chen for leading the product effort; Brian Ton, Saarang Srinivasan, and Shiva Oswal for their engineering support; and Lindsay Brillson, Matan Cohen-Grumi, Jessie Ma, and their teams for their marketing effort throughout the launch. Finally, we'd like to thank everyone who provided feedback and support throughout the project.