Introducing Real-Time Video Chat for Agents
PikaStream1.0 is our latest advancement in creating more human AI
April 02, 2026 • By Pika Fundamental Research Team
Today, we're making it possible to have a face-to-face, real-time conversation with any AI agent, because we believe most people would prefer to interact with AI the way they do with other humans.
You can try it with your Pika AI Self simply by inviting them to a Google Meet. For other agents, you can download the Skill on GitHub here.
Why real-time video?
At Pika, we're building AI Selves: persistent, agentic beings that act as living extensions of you. They can assist you with tasks, join your group chats, and carry on conversations with your personality and taste. But for an AI Self to truly feel alive, they need more than language. They need a face, a voice, and the ability to react in real time.
Today's video generation models produce stunning results, but they operate offline: a single clip can take seconds to minutes. This makes them fundamentally incompatible with live interaction like video calls, FaceTime, or real-time avatars, where generation must be continuous, identity-consistent, and instantly responsive to new speech, emotions, and contexts.
The PikaStream Solution
PikaStream1.0 is the real-time visual engine built to close this gap. It generates personalized video at 30 FPS and 480p on a single H100 GPU, with an end-to-end speech-to-video latency of ~1.5 seconds.
For an emotionally resonant and useful experience, we also ensured the model maintains each agent's memory and context and can carry out tasks during interaction, along with the not-so-simple addition of natural gestures and appropriate emotions.
Expressiveness
demo video coming soon!
Agentic Abilities
demo video coming soon!
Persistent Memory
demo video coming soon!
Three components make this possible:
- FlashVAE: a full Transformer-based VAE trained from scratch that provides its own latent space and reconstructs video in real time via streaming decoding at 441 FPS with just 1.1 GB memory on a single GPU.
- 9B Diffusion Transformer (DiT): a bidirectional teacher distilled into a causal autoregressive student via optimized self-forcing, enabling chunk-by-chunk streaming at real-time frame rates.
- Multi-reward RLHF: direct optimization for identity consistency, lip-sync accuracy, and motion naturalness across both teacher and student.
System Architecture

Figure 1. Overview of the PikaStream1.0 architecture. FlashVAE provides the latent space and handles real-time pixel-level reconstruction via streaming decoding. The 9B DiT generates audio-conditioned video in this latent space, and reference injection maintains identity consistency.
PikaStream1.0 consists of three model components and an efficient inference pipeline:
- FlashVAE for latent space encoding and real-time streaming decoding
- 9B Diffusion Transformer for audio-conditioned video generation
- Reference injection for identity consistency
- Inference engineering that fuses decoding, audio conditioning, and scheduling into a single-GPU pipeline delivering 24 FPS
FlashVAE
FlashVAE builds on our concurrent work, FlashDecoder, which showed that a Transformer-based streaming decoder can match conventional 3D convolutional decoders in reconstruction quality (PSNR & LPIPS) while running over 10x faster. For the full architectural details and optimization techniques, we refer readers to the FlashDecoder paper.
Video generation models operate in a compressed latent space: an encoder compresses raw video into latents, the diffusion model generates in that space, and a decoder reconstructs the final pixels. In most existing systems, this decoder is a 3D convolutional network. These decoders reconstruct well, but they are slow and memory-intensive, too slow for real-time streaming where frames must be decoded as fast as they are generated.
FlashVAE extends FlashDecoder into a full VAE trained from scratch, with both encoder and decoder built entirely on Transformer architectures. This gives us our own latent space, independent of any external model. The encoder defines the latent space, but the decoder is where real-time performance matters most: instead of decoding all frames at once, it processes one frame at a time, using only the last few frames as context. As soon as the DiT produces a chunk of latents, FlashVAE decodes them into pixels immediately, frame by frame. Because the context window is short and fixed, the per-frame cost is small and constant regardless of video length.
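The fixed-context streaming decode described above can be sketched as a small loop. This is a schematic only: `decode_frame` is a hypothetical stand-in for the FlashVAE decoder, and the real system operates on latent tensors rather than Python objects.

```python
from collections import deque

def stream_decode(latent_chunks, decode_frame, context_len=4):
    """Decode latent frames one at a time, keeping only a short,
    fixed-size window of past frames as context. `decode_frame` is a
    placeholder for the actual decoder; `context_len` is illustrative."""
    context = deque(maxlen=context_len)  # constant memory, any video length
    for chunk in latent_chunks:          # chunks arrive as the DiT emits them
        for latent in chunk:
            pixels = decode_frame(latent, list(context))
            context.append(latent)       # oldest frame drops off automatically
            yield pixels                 # frame is playable immediately
```

Because the deque caps the context, per-frame cost stays constant no matter how long the stream runs, which is what makes the 441 FPS throughput sustainable.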
With efficient kernels and compilation, FlashVAE achieves 441 FPS decoding throughput at 480p with just 1.1 GB peak memory on a single H100.
9B Diffusion Transformer (DiT)
The backbone of PikaStream1.0 is a 9-billion parameter Diffusion Transformer (DiT) that generates video from text and audio. We first train a bidirectional DiT that attends to all frames at once for maximum quality. This bidirectional model is later distilled into a causal autoregressive student for real-time streaming (described in the Autoregressive Distillation section). Here we describe the architecture shared by both. The model uses full spatio-temporal self-attention across frames, with two separate cross-attention mechanisms for multimodal conditioning:
Text cross-attention. Each chunk attends to text prompts for semantic guidance. Because conditioning is per-chunk, prompts can be swapped on the fly, enabling interactive control over motion, expressions, and actions during live generation.
Frame-wise audio cross-attention. Each video frame attends only to its temporally aligned audio tokens, not the full audio sequence. This tight frame-level alignment enables accurate lip-sync.
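The frame-wise alignment can be pictured as a block-diagonal cross-attention mask: frame i may only attend to its own audio tokens. A minimal sketch, assuming a fixed number of audio tokens per frame (the real model's tokenization details are not public):

```python
import numpy as np

def frame_audio_mask(n_frames, audio_tokens_per_frame):
    """Boolean cross-attention mask where video frame i attends only to
    its temporally aligned audio tokens. Illustrative; production models
    typically apply this as an additive mask on attention logits."""
    n_audio = n_frames * audio_tokens_per_frame
    mask = np.zeros((n_frames, n_audio), dtype=bool)
    for i in range(n_frames):
        start = i * audio_tokens_per_frame
        mask[i, start:start + audio_tokens_per_frame] = True  # own tokens only
    return mask
```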
Reference Injection
To keep the generated face consistent with the target identity, we feed a reference image of the person into the model alongside the video frames. The model learns to distinguish the reference from the generated sequence using positional encodings (RoPE), so the identity signal persists throughout generation. This mechanism is shared by both teacher and student without modification.
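One way to realize this separation is to give reference tokens a temporal position outside the generated timeline, so RoPE encodes them distinctly from every video frame. The offset scheme below is purely illustrative; PikaStream1.0's actual positional layout is not public.

```python
import numpy as np

def assign_positions(n_ref_tokens, n_video_frames, ref_offset=-1):
    """Assign RoPE temporal positions: reference tokens sit at a
    position 'before' frame 0, video frames advance normally.
    The offset value is a hypothetical choice for illustration."""
    ref_pos = np.full(n_ref_tokens, ref_offset)  # identity signal, off-timeline
    video_pos = np.arange(n_video_frames)        # generated frames: 0, 1, 2, ...
    return np.concatenate([ref_pos, video_pos])
```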
Autoregressive Distillation
The bidirectional teacher achieves the best quality by looking at all frames at once, but it cannot stream: it needs the full sequence before producing any output. For real-time use, we need a model that generates frames left-to-right, one chunk at a time. Training such a model from scratch produces lower quality, so we instead start from the strong bidirectional model and teach it to generate causally in two stages: (1) adapting its attention pattern so it only looks backward, and (2) distilling it into a fast autoregressive generator via self-forcing.
Adapting Bidirectional to Causal Transformer
We first teach the bidirectional model to generate left-to-right. We compared two approaches, diffusion forcing and ODE regression, and found ODE regression to be more stable, which is critical for smooth, temporally coherent streaming.
We partition each training sequence into chunks and restrict each chunk to only attend to past chunks, not future ones. Following MotionStream, chunk sizes are randomized to prevent overfitting to a fixed pattern. We also pre-compute the denoising trajectory that the student will follow during inference and train the model to denoise from exactly those intermediate points. This adaptation requires only ~100k video samples.
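The chunk-restricted attention pattern amounts to a block-causal mask: tokens within a chunk attend to each other and to all earlier chunks, never to future ones. A minimal sketch (chunk sizes would be drawn randomly per training sequence, following MotionStream):

```python
import numpy as np

def chunk_causal_mask(chunk_sizes):
    """Block-causal attention mask over a sequence partitioned into
    chunks: each chunk sees itself and the past, not the future.
    Schematic; the real mask operates on spatio-temporal tokens."""
    n = sum(chunk_sizes)
    mask = np.zeros((n, n), dtype=bool)
    start = 0
    for size in chunk_sizes:
        end = start + size
        mask[start:end, :end] = True  # current chunk + everything before it
        start = end
    return mask
```

Randomizing `chunk_sizes` during training (e.g. via `np.random.default_rng().integers(...)`) prevents the model from overfitting to one fixed chunking pattern.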
Optimized Self-Forcing
Next, we distill the adapted model into a fast autoregressive generator using an optimized self-forcing procedure. The student generates a chunk, then uses its own output (not ground truth) as context for the next chunk, learning to recover from its own imperfections. This is essential for stable long-form streaming. A distribution matching distillation (DMD) loss aligns the student's outputs with the teacher.
Three design choices make this practical at scale:
- Constant cost per chunk. A fixed-size context window keeps memory and compute the same regardless of video length, enabling infinite-length generation.
- Memory-efficient training. Only one denoising step per chunk carries gradients, keeping GPU memory manageable.
- No extra forward pass. Standard self-forcing runs a separate pass on clean frames to build context for the next chunk. We instead cache the KV states directly from the last denoising step. This saves one full forward pass per chunk and avoids a distribution mismatch, since the model was never trained on fully clean inputs.
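The three choices above combine into a rollout loop like the following sketch. `denoise_chunk` is a stand-in for the student DiT; in the real system it returns latents plus the KV states of its last denoising step, which are cached directly instead of recomputed from clean frames.

```python
def self_forcing_rollout(n_chunks, denoise_chunk, max_cache_chunks=2):
    """Autoregressive rollout where each chunk's KV states (from its
    final denoising step) become context for the next chunk, with a
    fixed-size cache window for constant per-chunk cost. Schematic;
    `denoise_chunk` and the window size are illustrative."""
    kv_cache, output = [], []
    for i in range(n_chunks):
        frames, kv = denoise_chunk(i, kv_cache)   # conditions on own past output
        output.append(frames)
        kv_cache.append(kv)                       # reuse KV: no second forward pass
        kv_cache = kv_cache[-max_cache_chunks:]   # constant memory, any length
    return output
```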
Multi-Reward RLHF

Figure 2. Multi-reward RLHF training pipeline. Generated frames are evaluated by multiple reward models, and the scores are back-propagated through a differentiable denoising window to update the diffusion model.
Diffusion models are trained to denoise, learning to predict clean outputs from noisy inputs. This objective produces coherent video, but it does not directly optimize for the qualities that matter most in a visual engine: does the face match the reference? Are the lips in sync with the audio? Is the motion natural? Are there visual artifacts? We address this with multi-reward RLHF.
We train a suite of proprietary reward models that score generated videos on these dimensions. Using Reward Feedback Learning (ReFL), we back-propagate these scores through a differentiable denoising window to update the diffusion model, steering it toward outputs that score higher on all reward axes simultaneously.
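The "differentiable denoising window" amounts to detaching gradients on early denoising steps and back-propagating reward only through the final ones. The sketch below shows just this control flow; `denoise`, `reward_fn`, and `detach` stand in for the real models and autograd ops, and the window size is illustrative.

```python
def refl_step(denoise, reward_fn, x_noisy, n_steps, grad_window=1):
    """One ReFL-style pass: denoise for n_steps, detaching all but the
    last `grad_window` steps, then score the result with a reward model.
    Framework-agnostic schematic, not a full training loop."""
    x = x_noisy
    for step in range(n_steps):
        x = denoise(x, step)
        if step < n_steps - grad_window:
            x = x.detach()        # early steps carry no gradient
    return reward_fn(x)           # gradient flows only through the window
```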
For both teacher and student, we denoise, decode to pixels, and compute rewards on the generated frames. For the student, this is done via autoregressive rollout, chunk by chunk, so the model receives reward feedback in its native streaming mode.
Training
Figure 3. Data pipeline overview. Raw videos are processed through content normalization, metadata enrichment, and structured filtering to produce pre-training (~10M clips) and SFT (~2M clips) corpora.
We built a multi-stage data pipeline that transforms raw video into two training corpora:
Pre-training corpus (~10M clips). Raw videos go through scene detection, deduplication, quality filtering, and artifact removal (letterboxing, watermarks, subtitles). Each clip is then enriched with audio-visual sync scores, aesthetic ratings, camera motion labels, human action annotations, and dense captions generated by VLMs. The result is a diverse, semantically rich foundation for large-scale training.
SFT corpus (~2M clips). From the pre-training pool, we select the highest-quality clips using stricter aesthetic and sync thresholds, plus human verification. This curated subset drives the quality improvements in fine-tuning.
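The automated part of this selection is a threshold pass over the enriched metadata, before human verification. A minimal sketch with hypothetical field names and cutoff values (Pika's actual thresholds are not public):

```python
def select_sft_clips(clips, min_aesthetic=0.8, min_sync=0.9):
    """Filter the pre-training pool down to SFT candidates using stricter
    aesthetic and audio-visual sync thresholds. Field names and cutoffs
    are illustrative; human verification follows this pass."""
    return [c for c in clips
            if c["aesthetic"] >= min_aesthetic and c["sync"] >= min_sync]
```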
Training is split into two independent tracks. FlashVAE is trained first in three progressive stages (low-resolution pre-training, high-resolution streaming, adversarial fine-tuning) to establish the latent space. The 9B DiT is then trained within that space in five stages: pre-training, supervised fine-tuning, RLHF on the bidirectional teacher, autoregressive distillation into the causal student, and a final round of RLHF on the student.
Inference
With the model trained, the final challenge is serving it at conversation speed: a real-time visual engine is more than a fast model. The user speaks, and within ~1.5 seconds, an AI Self responds with synchronized video. Making this work requires careful orchestration of every stage in the pipeline to minimize end-to-end latency.
End-to-End Latency
Figure 4. End-to-end inference latency breakdown. All stages run concurrently in a streaming pipeline, achieving ~1.5 seconds from speech input to video output.
Our integrated speech-to-video pipeline delivers an end-to-end response in approximately 1.5 seconds. The pipeline streams every stage in parallel: speech recognition, LLM reasoning, and text-to-speech all run concurrently, feeding audio chunks directly into the video generator as they become available. Instead of waiting for a full response, the system begins generating video the moment the first audio chunk is ready. Each chunk triggers diffusion-based synthesis on the GPU, and the resulting frames are packaged into media segments for immediate playback. The user sees a video response begin almost as soon as they stop speaking.
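The key idea, stages running concurrently and video starting on the first audio chunk, can be sketched with a producer-consumer queue. This is a toy illustration of the concurrency pattern only; `generate_video` stands in for the GPU diffusion step, and the real pipeline has more stages (ASR, LLM, TTS) chained the same way.

```python
import queue
import threading

def run_streaming_pipeline(audio_chunks, generate_video, out):
    """Stream audio chunks into the video generator as they arrive,
    rather than waiting for the full utterance. Schematic of the
    concurrency pattern; names are illustrative."""
    q = queue.Queue()

    def producer():                      # e.g. TTS emitting audio chunks
        for chunk in audio_chunks:
            q.put(chunk)
        q.put(None)                      # end-of-stream sentinel

    def consumer():                      # video starts on the first chunk
        while (chunk := q.get()) is not None:
            out.append(generate_video(chunk))

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```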
Reference-Based Video Editing
Figure 5. Reference-based video editing via asynchronous KV cache injection. The new reference is encoded in the background and swapped into the active cache without interrupting generation.
When the user changes a reference image or prompt mid-stream, the system doesn't stop generating. It processes the new reference in the background, encoding and transforming the updated identity, while the main generation loop continues outputting frames without interruption.
Once the new reference is ready, the system swaps the relevant tokens directly in the active KV cache. There is no need to restart the sequence. The generator immediately begins producing frames that reflect the updated identity, while maintaining smooth temporal continuity with what came before.
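The swap itself is an in-place replacement of the reference tokens' slice inside the cache, once the background encoding finishes. A minimal sketch, assuming a flat `[tokens, dim]` cache layout with the reference occupying a known slice (the actual cache layout is not public):

```python
import numpy as np

def swap_reference_tokens(kv_cache, new_ref_kv, ref_slice):
    """Replace the reference-image tokens inside an active KV cache
    without restarting generation. Works on a copy, so the swap can be
    published atomically to the generation loop. Layout is illustrative."""
    updated = kv_cache.copy()            # prepared off the hot path
    updated[ref_slice] = new_ref_kv      # only the identity tokens change
    return updated
```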
Results
The sections above describe the architecture, training, and inference design of PikaStream1.0. Two natural questions follow: How much did it improve over our previous generation, and how does a real-time visual engine compare to the ways AI assistants interact today?
From Nova01 to PikaStream1.0
Pika Nova01, our first-generation ultra-fast video generation model, required 8 GPUs and took 4.5 seconds per response: fast for video generation, but too slow for real-time conversation. PikaStream1.0 runs on a single GPU with ~1.5 seconds end-to-end latency and streams video to the user at 24 FPS in deployment. At 4.5 seconds, every exchange feels like leaving a voicemail. At 1.5 seconds with continuous streaming, it feels like FaceTime.
[Nova01 vs PikaStream1.0 comparison video]
Conclusion
With PikaStream1.0, we gave AI Selves a face. 24 FPS, 1.5 seconds of latency, one GPU. For the first time, a real-time visual engine can power live, identity-consistent video at conversation speed.
Real-time video generation is moving fast, but building an agentic visual engine that speaks, expresses, and persists as a digital self is a new frontier. We're excited to share this first step, and there's a lot more coming.
References
- FlashDecoder - Kang and Kwak, "FlashDecoder: A Pure-Transformer Streaming Video Decoder for Real-Time Latent Decoding", CVPR 2026.
- Wan - Wan Team et al., "Wan: Open and Advanced Large-Scale Video Generative Models", 2025. arXiv
- DiT - Peebles and Xie, "Scalable Diffusion Models with Transformers", ICCV 2023. arXiv
- FlashAttention - Dao et al., "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness", NeurIPS 2022. Paper
- RoPE - Su et al., "RoFormer: Enhanced Transformer with Rotary Position Embedding", Neurocomputing 2024. arXiv
- RLHF - Ouyang et al., "Training language models to follow instructions with human feedback", NeurIPS 2022. arXiv
- Diffusion Forcing - Chen et al., "Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion", NeurIPS 2024. arXiv
- Self-Forcing - Huang et al., "Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion", NeurIPS 2025 Spotlight. arXiv
- DMD2 - Yin et al., "Improved Distribution Matching Distillation for Fast Image Synthesis", NeurIPS 2024. arXiv
- ReFL - Xu et al., "ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation", NeurIPS 2023. arXiv
- MotionStream - Lee et al., "MotionStream: Real-Time Video Generation with Interactive Motion Controls", 2025. arXiv
- Muon - Jordan et al., "Muon: An optimizer for hidden layers in neural networks", 2024. GitHub
Contributors
Fundamental Research Team (sorted alphabetically):
Cade Li, Fan Yang, Minguk Kang, Richard Fan, Xiaotong Bao, Xiaojian Wang, Yifeng Wang, Zhicheng Sun.
Acknowledgement
We'd like to thank the Fundamental Research Team for their essential contributions. We are especially grateful for the leadership of Demi Guo and Chenlin Meng. We would also like to thank Siqi He and Jiajun Chen for leading the product effort; Brian Ton, Saarang Srinivasan, and Shiva Oswal for their engineering support; and Lindsay Brillson, Matan Cohen-Grumi, Jessie Ma, and their teams for their marketing effort throughout the launch. Finally, we'd like to thank everyone who provided feedback and support throughout the project.