(Results are in the v1, v2, etc. galleries; click the tabs at the top)
Current leader:
"\FP8--T5-Scaled\Donut-Mochi-848x480-batch16-CFG7-T5scaled-v8"
WIP project by Kijai
Info/Setup/Install guide: https://civitai.com/articles/8313
Requires Torch 2.5.0 minimum, so update your Torch if you are behind.
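If you are not sure which version you have, here is a quick check from Python (just a convenience snippet, not part of the workflow):

```python
import torch

# The Mochi nodes need PyTorch 2.5.0 or newer; bail out early if we're behind.
version = tuple(int(p) for p in torch.__version__.split("+")[0].split(".")[:2])
print("PyTorch", torch.__version__)
assert version >= (2, 5), "Update PyTorch to 2.5.0 or newer"
```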
As with the CogVideo Workflows, they are provided for people who want to try the Preview :)
Even with a 4090 it can push the limits a little. I provide the workflows I used to research Tile Optimisation in V1:
- We're reducing tile sizes by roughly 20-40% from the defaults
- We're increasing the frame batch size to compensate
- We're maintaining the same overlap factors to prevent visible seams
Key principles (a rough sketch applying these follows below):
- Tile sizes should ideally be multiples of 32 for most efficient processing
- Keep the width:height ratio similar to the original tile sizes
- Frame batch size increases should be modest to avoid frame skipping
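As a rough illustration of those principles (just a helper sketched for reasoning, not part of any workflow; the 256x128 "default" tile used here is an assumption), shrinking the tile and snapping it to multiples of 32 looks like this:

```python
def snap32(x):
    # Keep tile dimensions on multiples of 32 for the most efficient processing.
    return max(32, int(round(x / 32)) * 32)

def scaled_tile(default_w=256, default_h=128, shrink=0.75):
    # Shrink the (assumed) default tile while keeping its width:height ratio,
    # then snap both sides to a multiple of 32.
    return snap32(default_w * shrink), snap32(default_h * shrink)

print(scaled_tile(shrink=0.75))  # (192, 96)  -> the batch16 config below
print(scaled_tile(shrink=0.50))  # (128, 64)  -> the batch24 config below
```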
Researcher's Tip!
If you work with a fixed seed, the sampler output remains in memory: the first generation took ~1700 seconds, but you can then change the decoder settings and the next video takes only ~23 seconds. All the heavy work is already done by the sampler, so unless you take a new seed it reuses the same samples over and over, and the VAE decode itself is very fast.
^ subsequent gens on the same seed are very fast, allowing tuning of the decoder settings ^
^ initial generation was taking ~1700 seconds with PyTorch 2.5.0 SDP ^
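In other words, the flow looks roughly like this (a minimal sketch of the idea only; run_sampler, vae_decode_tiled and save_video are hypothetical placeholders, not the actual node code):

```python
# Sample once with a fixed seed; ComfyUI keeps this result cached (~1700 s the first time).
latents = run_sampler(prompt, seed=1234)

# While the seed and sampler settings stay the same, only the decode re-runs (~23 s each),
# so decoder/tiling settings can be swept cheaply.
for tile_w, tile_h in [(192, 96), (160, 96), (144, 80)]:
    video = vae_decode_tiled(latents,
                             tile_sample_min_width=tile_w,
                             tile_sample_min_height=tile_h)
    save_video(video, f"tiles_{tile_w}x{tile_h}.mp4")
```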
Outputs are labelled and added to the V1 gallery; test prompt used:
"In a bustling spaceport, a diverse crowd of humans and aliens board a massive interstellar cruise ship. Robotic porters effortlessly handle exotic luggage, while holographic signs display departure times in multiple languages. A family of translucent, floating beings drift through the security checkpoint, their tendrils wrapping around their travel documents. In the sky above, smaller ships zip between towering structures, their ion trails creating an ever-changing tapestry of light."
Donut-Mochi-848x480-batch10-default-v5 = Author Default Settings
This version used the recommended config from the author.
Donut-Mochi-640x480-batch10-autotile-v5 = Reduced size, Auto Tiling
- This is my first run, which created the video in the gallery, simply using Auto Tile on the decoder and reducing the overall dimensions to 640x480. This reduction makes generation take less memory, but it is heavy-handed and will reduce the quality of outputs.
The remaining workflows all investigate the possible configs without using Auto Tiling, so we know exactly what was used. Videos will be labelled with the batch count and added to the V1 gallery. Community research is required!
Donut-Mochi-848x480-batch12-v5
frame_batch_size = 12
tile_sample_min_width = 256
tile_sample_min_height = 128
Donut-Mochi-848x480-batch14-v5
frame_batch_size = 14
tile_sample_min_width = 224
tile_sample_min_height = 112
Donut-Mochi-848x480-batch16-v5
frame_batch_size = 16
tile_sample_min_width = 192
tile_sample_min_height = 96
Donut-Mochi-848x480-batch20-v5
frame_batch_size = 20
tile_sample_min_width = 160
tile_sample_min_height = 96
Donut-Mochi-848x480-batch24-v5
frame_batch_size = 24
tile_sample_min_width = 128
tile_sample_min_height = 64
Donut-Mochi-848x480-batch32-v5
frame_batch_size = 32
tile_sample_min_width = 96
tile_sample_min_height = 48
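To get a feel for the trade-off in the sweep above, here is a back-of-the-envelope tile count per 848x480 frame. It assumes the decoder steps across the output with a stride of tile_size * (1 - overlap_factor), which mirrors typical tiled-VAE implementations but is not the exact node code:

```python
import math

def tiles_per_frame(tile_w, tile_h, width=848, height=480, overlap=0.25):
    # Assumed stride: the tile advances by its size minus the overlap region.
    stride_w = tile_w * (1 - overlap)
    stride_h = tile_h * (1 - overlap)
    return math.ceil(width / stride_w) * math.ceil(height / stride_h)

for batch, tile_w, tile_h in [(12, 256, 128), (14, 224, 112), (16, 192, 96),
                              (20, 160, 96), (24, 128, 64), (32, 96, 48)]:
    print(f"batch {batch:>2}: {tile_w}x{tile_h} tiles -> "
          f"{tiles_per_frame(tile_w, tile_h)} tiles per frame")
```

Smaller tiles mean less memory per decode pass but many more tiles per frame, which is why keeping the overlap factors matters for seams.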
The last workflow is a Hybrid Approach: the increased overlap factors (0.3 instead of 0.25) might help reduce visible seams when using very small tiles.
Donut-Mochi-848x480-batch16-v6
frame_batch_size = 16
tile_sample_min_width = 144
tile_sample_min_height = 80
tile_overlap_factor_height = 0.3
tile_overlap_factor_width = 0.3
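As a rough check (assuming the overlap region is simply overlap_factor x tile size in output pixels, which may not be the node's exact formula), the bump from 0.25 to 0.3 widens the blend region on these 144x80 tiles:

```python
for factor in (0.25, 0.30):
    print(f"factor {factor}: ~{144 * factor:.0f} px wide x ~{80 * factor:.0f} px tall blend region")
```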
Donut-Mochi-848x480-batch16-CFG7-v7
This used the Donut-Mochi-848x480-batch16-v6 workflow with 7.0 CFG.
This seems to be a good setting; generation time is 24 minutes with this setup.
(PyTorch SDP used)
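For reference, CFG 7.0 just means the usual classifier-free guidance mix is pushed 7x along the direction from the unconditional to the conditional prediction (generic formula, not Mochi-specific code):

```python
def apply_cfg(noise_uncond, noise_cond, cfg=7.0):
    # Standard classifier-free guidance: extrapolate away from the unconditional branch.
    return noise_uncond + cfg * (noise_cond - noise_uncond)
```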
Donut-Mochi-848x480-batch16-CFG7-T5scaled-v8
We decided to use the FP8_Scaled T5 CLIP model; this improved the outputs greatly across all prompts tested. Check the v3 gallery. This is the best so far! (until we beat it)
Donut-Mochi-848x480-b16-CFG7-T5scaled-Q8_0-v9
This did not yield the best results, probably because the scaled T5 CLIP was still in FP8 while we were testing GGUF Q8_0 as the main model.
Donut-Mochi-848x480-b16-CFG7-CPU_T5-FP16-v11
This used T5XXL in FP16 by forcing it onto the CPU. It seems to show the same artifacts as V3, where we used GGUF Q8_0 with T5XXL FP8.
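For anyone wanting to reproduce the idea outside the workflow, here is a minimal sketch of running T5-XXL in FP16 on the CPU (the checkpoint name is a placeholder; the workflow itself does this through the loader node's device option, and some CPU ops may need a float32 fallback depending on your PyTorch build):

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# Keep the text encoder on the CPU in FP16 so the GPU's VRAM stays free for the video model.
tokenizer = T5Tokenizer.from_pretrained("google/t5-v1_1-xxl")
encoder = T5EncoderModel.from_pretrained("google/t5-v1_1-xxl",
                                         torch_dtype=torch.float16).to("cpu")

tokens = tokenizer("a bustling spaceport at dusk", return_tensors="pt")
with torch.no_grad():
    text_embeds = encoder(**tokens).last_hidden_state  # computed on CPU, passed to the sampler
```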
Increasing steps to 100-200 increases quality at the expense of time: 200 steps takes 45 minutes. There will likely be no dedicated version for this, because anybody can add more steps to any of these workflows and simply wait a very long time for a 6-second video. This can be remedied with a cloud setup and a larger GPU / more VRAM.
Go ahead and upload yours!