
Creating smooth, cinematic video transitions between images has long been a challenge in AI workflows. With the release of the WAN Video First-Last-Frame-to-Video (FLF2V) model, it is now possible to generate highly realistic short videos by simply providing a start and end frame.
In this guide, we’ll explore how to set up and use the WAN Video model inside ComfyUI, particularly when working alongside GGUF-based diffusion models like HiDream-I1. By combining powerful still-image generation with advanced frame interpolation, you can create natural motion sequences that retain fine details and cinematic consistency — all within a fully local and efficient ComfyUI workflow. Whether you’re building AI short films, visual storytelling projects, or stylish before-and-after animations, this method opens new creative possibilities.
Models
- GGUF Models: The 720p WAN2.1 FLF2V models are available here. I have a mobile RTX 4090 with 16GB VRAM and used wan2.1-flf2v-14b-720p-Q8_0.gguf. If you have less VRAM, use a smaller variant such as Q5 or Q6. The model goes in the ComfyUI\models\unet directory.
- Text Encoder: Download umt5_xxl_fp8_e4m3fn_scaled.safetensors and place it in ComfyUI\models\text_encoders.
- Clip Vision: Download clip_vision_h.safetensors and place it in ComfyUI\models\clip_vision.
- VAE: Download wan_2.1_vae.safetensors and place it in ComfyUI\models\vae. (If you prefer to script the downloads, see the sketch after this list.)
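If you would rather fetch the files from a script than click through the download pages, here is a minimal sketch using huggingface_hub. The repo IDs are placeholders, not the actual repositories linked above — swap them for the ones on the download pages — and the ComfyUI path assumes a default Windows install.

```python
# Sketch: download the four model files into the ComfyUI folders listed above.
# The repo IDs below are PLACEHOLDERS -- replace them with the repositories
# linked in this post before running.
import os
from huggingface_hub import hf_hub_download

COMFYUI = r"C:\ComfyUI"  # adjust to your install path

files = [
    # (repo_id placeholder, filename, ComfyUI models subfolder)
    ("REPLACE/wan2.1-flf2v-gguf", "wan2.1-flf2v-14b-720p-Q8_0.gguf", "unet"),
    ("REPLACE/wan2.1-text-encoder", "umt5_xxl_fp8_e4m3fn_scaled.safetensors", "text_encoders"),
    ("REPLACE/wan2.1-clip-vision", "clip_vision_h.safetensors", "clip_vision"),
    ("REPLACE/wan2.1-vae", "wan_2.1_vae.safetensors", "vae"),
]

for repo_id, filename, subfolder in files:
    # hf_hub_download caches the file and copies it into local_dir
    hf_hub_download(
        repo_id=repo_id,
        filename=filename,
        local_dir=os.path.join(COMFYUI, "models", subfolder),
    )
```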
Installation
- Update ComfyUI to the latest version. I tested this on v0.3.30. If your ComfyUI is older than this version, run update\update_comfyui.bat to update it.
- Drag this image to ComfyUI’s canvas.
- Use ComfyUI manager to install any missing nodes.
Nodes
Select the GGUF model you downloaded here.
Select the text encoder model.
Select the VAE model.
Select the clip vision model.
Select the first frame image.
Select the last frame image.
Enter the positive prompt and negative prompt.
Adjust the width and height to fit your images. Note that the model is optimized for 720p, but generation was taking too long on my GPU, so I dropped the resolution to 480p. The length is measured in frames; at the default 16 frames per second, 81 frames is about 5 seconds (see the quick calculation below).
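For reference, the clip duration is simply the frame count divided by the frame rate; here is a quick sanity check in Python (nothing model-specific, just the arithmetic):

```python
# Duration of the generated clip: frames / frames-per-second.
def clip_seconds(frames: int, fps: float = 16.0) -> float:
    return frames / fps

print(clip_seconds(81))   # 5.0625 -> about 5 seconds at the default 16 fps
print(clip_seconds(121))  # 7.5625 -> about 7.5 seconds
```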
Examples
Prompt: the girl transforms.
I wanted to see whether the girl in the first frame would transform into the girl in the last frame. However, the model simply pans the camera to the right and shows the girl from the last frame.
Prompt: the girl is doing photoshoot.
The transition is not as smooth as in the first video. I increased the length from 81 to 121 frames and the results were better.
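If you want to test several lengths without re-running the workflow by hand, one option (not covered in this post's workflow) is ComfyUI's HTTP API: export the workflow with "Save (API Format)" and queue it from a script. In the sketch below, the file name wan_flf2v_api.json, the node id "83", and the input name "length" are assumptions; look up the real node id and input name in your own exported JSON.

```python
# Sketch: queue the same API-format workflow at two different lengths.
# Assumes a local ComfyUI instance running on the default port 8188.
import json
import urllib.request

with open("wan_flf2v_api.json", "r", encoding="utf-8") as f:  # your exported workflow
    workflow = json.load(f)

for length in (81, 121):
    # "83" and "length" are placeholders for the WAN video node id and its
    # frame-count input in your exported JSON.
    workflow["83"]["inputs"]["length"] = length
    request = urllib.request.Request(
        "http://127.0.0.1:8188/prompt",
        data=json.dumps({"prompt": workflow}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        print(response.read().decode("utf-8"))  # ComfyUI replies with the queued prompt id
```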
Conclusion
The WAN Video model provides a simple yet powerful way to bring static AI-generated images to life with smooth, realistic motion. By integrating it with ComfyUI and GGUF models like HiDream-I1, you can achieve cinematic-quality videos without needing a heavy text-to-video system.
As WAN Video continues to improve, it offers an accessible entry point for AI video generation focused on detail preservation and natural motion. With the right setup in ComfyUI, you can easily expand your creative workflow from generating stunning photos to producing short, dynamic videos — all while keeping full control over your outputs.
Stay tuned for more updates as the community continues to push the limits of first-to-last frame animation!