
The world of AI-generated video is rapidly evolving, and with the release of Wan 2.1 and the VACE node system for ComfyUI, creators now have powerful tools for transforming existing footage into stunning stylized videos. Whether you’re working on fashion clips, cinematic scenes, or virtual influencer content, video-to-video workflows with GGUF models offer a new level of realism and control — all while being more efficient on consumer GPUs.
In this guide, we’ll walk through how to use Wan 2.1 VACE in ComfyUI, leveraging GGUF models for low VRAM usage and impressive quality. If you’re curious about turning short clips into stylized AI videos or creating consistent multi-frame animations from source footage, this post is for you.
Models
- GGUF Models: The 14B Wan 2.1 VACE models are available here. I have an RTX 3090 with 24GB of VRAM and used the Q6 variant. If you have less VRAM, use a smaller variant such as Q5 or Q4. The model goes in the ComfyUI\models\unet directory.
- Text Encoder: Download the GGUF text encoder here and place it in ComfyUI\models\text_encoders. Remember to download the one that matches your GGUF model. I use the Q6 model, so I downloaded the Q6 text encoder.
- LoRA: Download Wan21_CausVid_14B_T2V_lora_rank32.safetensors and place it in ComfyUI\models\loras.
- VAE: Download wan_2.1_vae.safetensors and place it in ComfyUI\models\vae. A quick file-placement check is sketched after this list.
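Before loading the workflow, you can double-check that every file landed in the right folder. The script below is only a sketch: the GGUF model and text-encoder file names are placeholders (they vary by quantization and have changed over time), so substitute the exact files you downloaded and your own ComfyUI path.

```python
from pathlib import Path

# Adjust this to your ComfyUI install location.
COMFYUI = Path(r"C:\ComfyUI")

# Folder -> expected file. The .gguf names below are placeholders;
# use the exact file names you downloaded.
expected = {
    "models/unet": "wan2.1-vace-14b-Q6_K.gguf",            # GGUF diffusion model
    "models/text_encoders": "umt5-xxl-encoder-Q6_K.gguf",  # matching GGUF text encoder
    "models/loras": "Wan21_CausVid_14B_T2V_lora_rank32.safetensors",
    "models/vae": "wan_2.1_vae.safetensors",
}

for folder, filename in expected.items():
    path = COMFYUI / folder / filename
    status = "OK" if path.exists() else "MISSING"
    print(f"[{status}] {path}")
```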
Installation
- Update your ComfyUI to the latest version, if you haven’t already.
- Drag the following full-size workflow image onto the ComfyUI canvas.
- Use ComfyUI Manager to install any missing nodes.
- Restart ComfyUI.
Nodes
Select the GGUF model you downloaded here. Note that the file name has changed recently, so yours may differ.
Select the LoRA model here. This LoRA reduces the required steps from 20 to 4 or 6 and greatly speeds up the generation.
Select the text encoder here. Remember to match it to your GGUF model. I use the Q6 model, so I select the Q6 text encoder.
Enter the positive and negative prompts here. You can also use the positive prompt to alter the generation: in one case, I added a second person to the prompt, and the workflow generated that second person even though there is only one person in the control video.
There are two Image Resize nodes: one for the reference image and one for the control video. Make sure their sizes are the same. I tested several dimensions, but only two worked for me: 576×1024 and 720×1280. If you get any other dimensions to work, please let us know in the comment section.
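If you prefer to pre-resize the reference image outside ComfyUI rather than rely on the Image Resize node, a minimal Pillow sketch looks like this (the file names are placeholders, and it assumes a simple stretch resize rather than crop-and-pad):

```python
from PIL import Image

# One of the two resolutions that worked for me (width, height).
TARGET_SIZE = (576, 1024)

# Placeholder file names -- point these at your own reference image.
img = Image.open("reference.png")
img = img.resize(TARGET_SIZE, Image.LANCZOS)  # simple stretch to the target size
img.save("reference_576x1024.png")
print(f"Saved reference image at {img.size}")
```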
Choose a reference image here.
Choose the control video here. I set force_rate to 16 (frames per second) and frame_load_cap to 81 (frames).
I use the DWPose Estimator to preprocess the control video. I also tried Canny and OpenPose: Canny never worked for me, and OpenPose took a long time to run.
Set the length (81) to match the frame_load_cap from earlier. Since I use the CausVid LoRA, I set the steps to 6; without the LoRA, you might need around 20 steps.
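To see how those numbers fit together, here is plain arithmetic rather than ComfyUI code: the loader samples the control clip at force_rate frames per second and keeps at most frame_load_cap frames, and the sampler's length must match that frame count.

```python
force_rate = 16          # fps set on the video loader
frame_load_cap = 81      # frames kept from the control video
length = frame_load_cap  # the sampler's length must match the loaded frame count

duration_s = frame_load_cap / force_rate
print(f"{frame_load_cap} frames at {force_rate} fps is about {duration_s:.2f} s of video")  # ~5.06 s

steps = 6  # with the CausVid LoRA; around 20 without it
```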
Examples
Example 1
Control video
Reference image
Output
Example 2
Control video
Reference image
Output
Conclusion
Combining Wan 2.1, VACE, and GGUF inside ComfyUI opens up powerful possibilities for video creators and AI artists alike. With just a few setup steps, you can transform everyday footage into high-quality, stylized outputs with impressive temporal consistency — even on mid-range GPUs. As ComfyUI continues to evolve, and GGUF models get more optimized, the barrier to entry for AI video creation keeps getting lower.
Whether you’re working on creative projects or looking to integrate this into your content pipeline, now’s the perfect time to explore what video-to-video AI generation can do. Try it, tweak it, and share your results — the future of video is here, and it’s AI-powered.