PixArt-Sigma-1024px_512px-animetune

v1024px_v04
hjhf

12/01 512px model update! 512px_v0.3

Please check the details in the 512px_v0.3 tab.

Personally, I recommend the 512px model. I like the workflow of using the 512px model for trial-and-error inference to generate good images, then either upscaling them with i2i using the 1024px model or trying the same prompt with the 1024px model.

■This is an experimental fine-tune.

Attention: this fine-tuned model is very difficult to use!

The quality is not good!! Don't expect too much!

If this is your first time trying PixArt-Sigma, I recommend first checking out a workflow that runs inference with the original model... Even if my model isn't great, try other people's amazing fine-tuned models!

I think the "Suggested Resources workflow" can be used even by those who have never used ComfyUI before. There's no need for a difficult installation. Just download and try it out!

Merging can be done with ComfyUI. The "Suggested Resources merge tool" is also simple and good.

I haven't tested it, but I believe inference should also be possible with SDNext.

Forge also has the following extension available.

https://github.com/DenOfEquity/PixArt-Sigma-for-webUI

The 'anime sigma1024px' in Suggested Resources is a flexible and aesthetically pleasing anime model. Give it a try.

I would be happy if this sparks even a little interest in PixArt. PixArt has potential.

My hope is for more people to discover base models with potential and to see their possibilities grow even further. I would be happy if I could help make that happen.

■I trained using OneTrainer.

Fine-tuning was performed on a dataset of 70,000 or 400,000 images (depending on the version) that mainly contains anime images, but also some realistic and AI-generated images, captioned entirely with booru tags. The training resolution is 512px or 1024px. PixArt is high quality but has low requirements, making it well suited for training. Detailed information about the training is written at the bottom of the page, so please refer to it. I have also uploaded the OneTrainer configuration data.

■Please be careful, as sexual images can also be generated.

■Here are my recent favorite inference settings. This will be updated as needed.

This is not the optimal solution. Please try various things!

Both booru tags and natural language are available for use.

●Sampler: "3m_SDE" at CFG 2.5-5, 30-50 steps, or "Euler cfg_pp" / "Euler A cfg_pp" at CFG 1.5-2.5, 30-50 steps.

Scheduler: "GITS" or "simple"

●GITS provides rich textures, Simple ensures stable generation quality, SDE stays true to the dataset, Euler is sharp, and Euler A offers stability.

I generally prefer GITS + "Euler," "Euler cfg_pp," or "3m_SDE."

"GITS + Euler" or "Euler cfg_pp" is very sharp.

"GITS + 3m_SDE" is dynamic.

"simple + Euler A or 3m_SDE" feels stable and seems to improve fidelity, though it may have high contrast.

●GITS can produce amazing detail, but it sometimes seems prone to breakdowns or not following prompts. I prefer it when I want to focus on atmosphere using natural language. Simple, on the other hand, is stable and follows prompts well, making it more suited for character work.
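For those who prefer diffusers over ComfyUI, here is a rough sketch of the settings above using the base 1024px repo. My checkpoints are packaged for ComfyUI, and GITS and the cfg_pp samplers are ComfyUI-specific, so the closest diffusers analogues are used here instead; treat this as a starting point, not an exact reproduction.

```python
# Rough diffusers sketch of the suggested steps/CFG ranges with the base 1024px model.
import torch
from diffusers import PixArtSigmaPipeline, DPMSolverMultistepScheduler, EulerDiscreteScheduler

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")

# Closest analogue to ComfyUI's DPM++ SDE samplers (2M rather than 3M).
pipe.scheduler = DPMSolverMultistepScheduler.from_config(
    pipe.scheduler.config, algorithm_type="sde-dpmsolver++"
)
# For a sharper, Euler-like look instead:
# pipe.scheduler = EulerDiscreteScheduler.from_config(pipe.scheduler.config)

image = pipe(
    prompt="1girl, looking at viewer, upper body, shiny skin, cherry blossoms",
    negative_prompt="realistic, figure",
    num_inference_steps=40,   # 30-50 as suggested above
    guidance_scale=3.0,       # CFG 2.5-5 for the SDE sampler
    width=1024,
    height=1024,
).images[0]
image.save("sample.png")
```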

●Resolutions slightly outside of 512x512 and 1024x1024 are acceptable. Resolutions like 512x768 or 1024x1536 may have minor issues but remain practical. For more stability, it’s best to stick to resolutions like 832x1216 that are closer to standard.

I prefer larger resolutions over stability, so I tend to choose non-standard resolutions.

●If you can't come up with a prompt, try using the prompt auto-generation below.

https://huggingface.co/spaces/KBlueLeaf/TIPO-DEMO

●Negative prompts are not trained. Please try various prompts!

As described in the Dataset Notes further down this page, if you don't like realistic textures, you might want to include terms like "realistic, figure" in the negative prompt.

Adding 'anime screencap' to the negative prompt helps reduce flatness.

I don't like restrictions and prioritize diversity, so I keep the negative prompts to a minimum.

Lately, I've been favoring a workflow where I disable negative prompts in the early steps and only apply them starting from the later steps. This approach results in fewer compositional issues in the early stages, and since I can freely adjust the style in the later stages, the overall quality is improved.

However, my way of thinking is unconventional. You don't have to follow it! You might get better results with many negative prompts, so give it a try!

I feel that with fewer steps, the composition doesn't turn out as well.

●It might be better to have at least 20 steps. Recently, I've been sticking to 50 steps.

For previews, I stop around 15-25 steps to check the progress.

Once I find a good seed, I refine it with 50 or 100 steps, adjusting the CFG as needed.

Since there is little change in the later steps, I can predict the outcome. This way, I balance both efficiency and quality.

However, with a higher number of steps, breakdowns may decrease, but it might end up overcooked. A setting like 30 steps might provide a better balance in terms of contrast.

By the way, I haven't trained with tags for work titles, but sometimes character tags include the work title. This tendency is especially strong with mobile games. When I randomly added a work title, there was a change in the style, so it’s possible that it may have some effect.

If you find it troublesome to come up with prompts that produce stable quality, using prompts like the ones below might help stabilize the output. Ironically, tags like these end up becoming quality tags, lol.

" nikke, azur lane, blue archive, kancolle, virtual youtuber, arknights, girls' frontline"

The massive, chaotic negative prompt below might actually be effective, though I just copied it from other models without any guarantees. Still, it seems to have some effect.

If you feel that the composition or anatomy looks strange, try removing the negative prompt. I've noticed several times that it can have a negative impact.

●amputated,amputation,bad anatomy,bad proportions,blurry,cloned face,deformed,extra arms,extra fingers,extra legs,extra limbs,flawed,flaws,fused fingers,glitched,gross proportions,long neck,low detail,low quality,malformed,malformed limbs,missing arms,missing fingers,missing legs,morbid,mutated,mutated faces,mutated feet,mutated fingers,mutated hands,mutation,mutilated,mutilated faces,mutilated feet,mutilated fingers,mutilated hands,poorly drawn,poorly drawn face,poorly drawn hands,too many fingers,ugly

■512px model.

The standard size for this model is 512px.

A ratio like 512x768, as with SD1.5, is suitable.

768px and 1024px are not trained, so the results will be disastrous.

The base model is very high quality even at 512px!

Usually, models in the middle of pre-training or lite versions lack sufficient learning or aesthetic appeal, but this model is different. It is the most aesthetically pleasing I have seen so far.

Due to its low requirements for training and inference specs and its fast speed, I feel that it has the potential to become the successor to SD1.5 that I've been looking for. I love this model.

Honestly, for creating images focused on 2D characters, there’s little difference between 512px and 1024px. Unless it’s a concept that clearly requires high resolution, 512px should be sufficient.

■ 1024px_v03 updated!

Please check the changes in the description of the version on the right.

If you don’t want to waste time, it might be a good idea to use the 512px model first to practice which prompts are effective.

Merging might also be interesting.

Merging with a realistic model can sometimes improve anatomy.

An example of an interesting merging experiment:

Simply merge the 1024px and 512px models at a 0.5 ratio. This allows you to generate at a 768px scale. Try resolutions like 768x768, 576x960, or even 640x1024. 768x1024 may sometimes break down, but it can succeed occasionally.

If the preview shows no block noise or line noise, then it’s fine. If these appear and strange artifacts start to show in the generated image, that’s the resolution limit.

This approach balances speed and detail, but I’m not entirely confident the merge is stable—it may have some issues. Still, it’s worth trying for an interesting experiment.
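If you want to do the 0.5 merge outside ComfyUI, here is a minimal script sketch. The file names are placeholders, and it assumes both checkpoints are single-file safetensors with matching keys.

```python
# Minimal sketch of the 0.5 / 0.5 merge described above.
from safetensors.torch import load_file, save_file

a = load_file("animetune_1024px.safetensors")
b = load_file("animetune_512px.safetensors")

merged = {}
for key, tensor_a in a.items():
    if key in b and b[key].shape == tensor_a.shape:
        # Straight 0.5 / 0.5 weighted average of the two checkpoints.
        merged[key] = (tensor_a.float() * 0.5 + b[key].float() * 0.5).to(tensor_a.dtype)
    else:
        merged[key] = tensor_a  # keep keys that only exist in one model

save_file(merged, "animetune_768px_merge.safetensors")
```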

A second interesting merging experiment:

Perform a differential merge of my 512px model with a 512px base model to extract only the fine-tuning elements.

Then add about 0.1-0.25 of the extracted elements to a 1024px model. There are more failures, but it's fun because it emphasizes the style.
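The same add-difference idea as a script sketch, with placeholder file names and alpha; the keys are assumed to match across all three models.

```python
# Rough sketch of the add-difference merge: (finetune_512 - base_512) * alpha + model_1024.
from safetensors.torch import load_file, save_file

finetune_512 = load_file("animetune_512px.safetensors")
base_512 = load_file("PixArt-Sigma-XL-2-512-MS.safetensors")
model_1024 = load_file("animetune_1024px.safetensors")

alpha = 0.2  # roughly 0.1-0.25 as suggested above
out = {}
for key, target in model_1024.items():
    if key in finetune_512 and key in base_512 and finetune_512[key].shape == target.shape:
        # The delta is what the 512px fine-tune added on top of its base model.
        delta = finetune_512[key].float() - base_512[key].float()
        out[key] = (target.float() + alpha * delta).to(target.dtype)
    else:
        out[key] = target

save_file(out, "animetune_1024px_plus_512delta.safetensors")
```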

※By the way, I don't think the older versions are inferior.

As the training progresses, the model learns more concepts but gradually deviates from PixArt's aesthetics.

Therefore, earlier versions might have a better balance in some cases.

It's a matter of personal preference, so I think you should use the version you like best.

Personally, there are sample images from older versions that I really like. I'm not confident I could replicate them with the latest version, lol.

■I am training with Danbooru tags.

Only general tags such as "1girl" are trained; artist and anime work-title tags are not.

Using only a small number of tags will produce disastrous results.

Popular tags tend to be of higher quality.

Examples: looking at viewer, upper body, shiny skin, anime screencap, etc.

If the effect is too strong, it might be a good idea to lower the weight.

It would be interesting to generate various tags using something that can automatically generate tags.

This is an experiment to see how much the tags can learn.

My training quality is poor, but it's learning better than expected.

In some cases, it may be able to express things that are difficult to do with other models.

It seems possible to add some new concepts even without fine-tuning the T5.

The base model is not excessively censored; like Cascade, it can handle high-exposure outfits without issues and sometimes even generate nudity.

It's interesting because it feels different from other models.

Due to the small size of the dataset, the model cannot yet recognize all tags.

It seems that natural language still works as well. There might be an interesting aspect that is different from the base model.

It's quite fun. I give themes to ChatGPT to create natural language prompts.

■There are cases where the look of something realistic or AI comes out strongly.

It might be a good idea to add "realistic" to the negative prompt.

On the other hand, it might be fun to try something other than anime.

New discoveries are made in areas that were not originally intended.

It's okay not to expect perfection too much.

This model is still immature. The broken results are more interesting!

■There is no consistency in style. The quality is poor and there are no fixed settings or prompts.

It has no advantage over existing models and has a narrower dataset.

It's an incomplete and very difficult model, but if you're interested, please give it a try.

If the human body breaks down, it's not due to censorship but rather because my fine-tuning is poor, so please bear with me! lol

I will continue to refine it to make it better in the future!

I am considering expanding the dataset in the future. If there are any specific things you would like to see included, please let me know, and I will take them into consideration!

Merging is no problem. If you have any interesting results, please share!

I think the 512px model can be merged into the 1024px model using differential merging. If the proportion is too large, it might break down, but it could be useful for enhancing concepts and styles.

■Dataset Notes:

"realistic, figure, anime screencap"

These are the only three tags that I intentionally trained for style, and using them will enforce a particular style.

"anime screencap" will result in a TV anime style.

Putting "realistic, figure" in the negative prompts will enforce an anime style.

However, other 2D styles lack consistency and the style will change based on the keywords...

From what I understand, sexual content tends to adopt a visual novel game style, and natural language tends to lean towards AI or 2.5D.

Tags like "looking at viewer, upper body, shiny skin" are tagged in many images, so the quality might be higher. I feel they tend to be closer to the AI image style.

"blush" is also widely used and tends to be the flat style of visual novel games and Japanese 2D artists.

The contents of my dataset include visual novel games, real people, figures, 2.5D, anime screencaps, and AI images.

Because I trained on such a wide range, styles are linked to tags, which might make control a bit difficult...

■For reference, I will also share my simple ComfyUI workflow and OneTrainer training settings.

If you want to use ComfyUI for inference, you need to install the "ExtraModels" plugin. I will also share the URLs of the VAE and T5 that I use.

I don't know whether it can be used with other WebUIs.

Other people have shared their workflows, so it might be a good idea to refer to them.

■ExtraModels

https://github.com/city96/ComfyUI_ExtraModels?tab=readme-ov-file#installation

■vae

https://huggingface.co/madebyollin/sdxl-vae-fp16-fix/blob/main/diffusion_pytorch_model.safetensors

■T5

https://huggingface.co/theunlikely/t5-v1_1-xxl-fp16/tree/main

It's the same T5 that SD3 uses, so you can probably use the 8-bit T5 from SD3 as well. That should load faster.

■Base model: please download this when you want to try other resolutions.

https://huggingface.co/PixArt-alpha/PixArt-Sigma/tree/main

■The 1024px diffusers model is required during training. Specify this as the base model when training.

https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-1024-MS

■ 512px Model.

https://huggingface.co/PixArt-alpha/PixArt-Sigma-XL-2-512-MS

Compared to the 1024px model, it has lower hardware requirements and trains about 4 times faster, making it accessible for more people. Apart from the transformer, it uses the same components as the 1024px model, so copy them over from the URL above.

■If you have room in your GPU, loading T5 on the GPU will make inference faster and less stressful.

By converting T5 to 4-bit, inference is possible even with lower specifications.

A 12GB GPU should be fine. If you convert it to 4-bit, you might be able to load it on an 8GB GPU... If that doesn't work, don't worry, you can load it into your system RAM!
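For reference, here is a rough sketch of what 4-bit T5 loading looks like in a transformers/diffusers setup (ComfyUI's ExtraModels handles this through its own loader nodes); the exact flags may need adjusting for your library versions.

```python
# Rough sketch: load the T5-XXL text encoder in 4-bit with bitsandbytes to save VRAM.
import torch
from transformers import T5EncoderModel, BitsAndBytesConfig
from diffusers import PixArtSigmaPipeline

text_encoder = T5EncoderModel.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    subfolder="text_encoder",
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),
    device_map="auto",  # bitsandbytes places the quantized weights on the GPU
)

pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS",
    text_encoder=text_encoder,
    torch_dtype=torch.float16,
)
# Move only the non-quantized parts; the 4-bit text encoder is already placed.
pipe.transformer.to("cuda")
pipe.vae.to("cuda")

image = pipe("1girl, upper body, looking at viewer", num_inference_steps=30).images[0]
image.save("lowvram_sample.png")
```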

If an error occurs even after installing ExtraModels with ComfyUI Manager, follow the instructions in the ExtraModels URL: activate the venv and reinstall the requirements.

When I tried to convert T5 to 4-bit, an error occurred with bitsandbytes, but reinstalling the requirements solved the problem.

I don't know much about it either, so it may be difficult for me to provide support for installation...

■I'm new to civitai, so if you have any opinions, I'd appreciate it if you could let me know.

I'm not good at training, but I would be happy if I could share the potential of PixArt with as many people as possible.

PixArt-Sigma has potential.

My dream is to see more Pixart models. I'd love to see the models you've trained as well!

The training requirements are low, 12GB is fine!

The total number of downloads has exceeded 600. Thank you for your interest in my immature model! Thank you very much for your many likes. m(_ _)m

Thank you for the buzz as well!

This fine-tuning itself isn't particularly exceptional, but I hope the information about my training can help someone interested in Pixart!

■Below I will list the GPU and training time I used for my training. Please use it as a reference for your training!

If you want to know the exact settings, please download the onetrainer data.

GPU: RTX 4060 Ti 16GB

■512px

Batch size: 48

70,000 / 48 ≈ 1,458 steps per epoch

1 epoch: 5 hours

15 epochs: 75 hours

GPU usage: 13GB

With this batch size and epoch time, I think the speed isn't much different from SD1.5. It's fast.

I feel the 512px model is like a successor to SD1.5.

■1024px (testing)

Batch size: 12

70,000 / 12 ≈ 5,833 steps per epoch

1 epoch: 30 hours

5 epochs: 150 hours

GPU usage: 15GB

The reason it doesn't take exactly four times longer is due to the difference in batch size.

In my environment, I felt it was impossible to train a 1024px SDXL model, so I haven't tried it and don't know if it's fast or slow. But I think the batch size is good!

■Full fine-tuning: with 12GB, 1024px training is not a problem.

I have 16GB, so my batch size is slightly larger.

If you lower the batch size, the VRAM usage decreases significantly.

With a batch size of 1 or 2, it might be fine even with 8GB.

I use CAME as the optimizer, which slightly increases GPU usage. I liked it because the quality was good.

With Adafactor or AdamW8bit, VRAM usage is significantly reduced.

Since the text encoder is T5 and very large, it might be difficult for now because training requires a lot of VRAM...

With the advent of SD3, this discussion will progress and training methods will be established. Until then, a large amount of VRAM might be necessary...

If you want guidelines for full fine-tuning settings, you can use these as a reference.

However, it may sometimes lead to overfitting or be challenging due to your PC specifications.

While referring to these, try to find settings that work best for you.

I was able to achieve the same settings by switching to BF16 training to reduce GPU usage, so that's what I use.

https://github.com/PixArt-alpha/PixArt-sigma/blob/master/configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py

https://github.com/PixArt-alpha/PixArt-sigma/blob/master/configs/pixart_sigma_config/PixArt_sigma_xl2_img1024_internalms.py

Note!

■When training with Onetrainer, the number of tokens may be limited to 120.

For tag training, the impact should be minimal since tag shuffling is performed.

Honestly, I have never had any issues with 120 tokens for tags.

However, for natural language, the length of the caption is important, so unintended truncation might occur.

■Relevant part: "max_token_length=120" — this value is the token limit.

https://github.com/Nerogar/OneTrainer/blob/23006f0c2543e52a9376b0557e7a78016d489acc/modules/dataLoader/PixArtAlphaBaseDataLoader.py#L244

■In the case of xformers, errors occurred beyond 256 tokens. With sdp, there were no issues up to 300 tokens, but at 512 tokens, the generated images broke down.

It seems that more tokens do not necessarily mean better results.

Due to the increase in cache size, if the cost-effectiveness is not promising, 120 tokens might be sufficient.

There is no guarantee of quality improvement, but it might be worth investigating.

Since there is no certainty, please let me know if there are any mistakes!

If you have any questions, please feel free to ask!

Questions in Japanese are also fine, so please feel free to reach out!



Info

Base model: PixArt Σ

Latest version (1024px_v04): 3 Files


About this version: 1024px_v04

Pruned Model fp16 (1.15 GB): Inference checkpoint.

Pruned Model bf16 (9.12 GB): Fine-tuning diffusers model+OneTrainer config data.

Training Data (66.9 KB): ComfyUI workflow.

■Trained for 18 additional epochs, +210,000 steps.

■Training with the 400k dataset helped the model learn more concepts, but it also made the style more like average 2D illustrations, sometimes losing the aesthetic and anatomical stability of earlier versions. More precise prompts may now be needed. For easier, more aesthetic results, v02 or earlier (trained on 70k images) might be a better choice.

Compared to the 512px model, the 1024px model offers more compositional freedom due to its higher resolution. However, the 512px model is much more stable overall. It's best to choose based on what you prioritize.

I recommend using both models with SD1.5 or SDXL for img2img. It combines PixArt's prompt flexibility and creativity with SD's strong styling. My sample images were made with SD1.5 i2i. I've also shared my workflow; feel free to use it as a reference.
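For those not using my ComfyUI workflow, here is a minimal diffusers sketch of the PixArt-to-SD1.5 img2img hand-off. The SD1.5 checkpoint path is a placeholder, and both pipelines may not fit in VRAM at once, so the first is freed before loading the second.

```python
# Minimal sketch: generate with PixArt-Sigma, then restyle with SD1.5 img2img.
import torch
from diffusers import PixArtSigmaPipeline, StableDiffusionImg2ImgPipeline

prompt = "1girl, looking at viewer, upper body, cherry blossoms, night city"

pixart = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
).to("cuda")
base = pixart(prompt, num_inference_steps=40, guidance_scale=3.0).images[0]
del pixart
torch.cuda.empty_cache()  # both pipelines may not fit in VRAM at the same time

sd15 = StableDiffusionImg2ImgPipeline.from_single_file(
    "your_sd15_checkpoint.safetensors", torch_dtype=torch.float16
).to("cuda")
refined = sd15(
    prompt=prompt,
    image=base,
    strength=0.4,       # low strength keeps PixArt's composition while SD1.5 restyles it
    guidance_scale=7.0,
).images[0]
refined.save("refined.png")
```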

I also recommend using the TIPO tag completion extension. It makes it easy to create long prompts and improves the stability of the generation results.

●It is recommended to use tags or natural language prompts that are as specific as possible with this model.

Short prompts like "1girl,beach" may not produce very good images.

Tags that consistently have high-quality images on Danbooru might help increase image quality.

●If you don’t want to waste time, it might be a good idea to use the 512px model first to practice which prompts are effective.

●Also, the model has not been trained on quality tags or negative prompts.

It has not been trained on images that could hurt quality, such as sketches or monochrome images.

However, all 400,000 images are of high quality, so there is a possibility that any tag could improve quality. The more tags, the better.

●Natural language prompts can also be enjoyable.

When I want to create beautiful images, I use natural language.

When I want to use concepts that the base model doesn't know, I use tags.

You can also combine the strengths of both by adding tags into natural language prompts when appropriate.

●I'm using only general tags and character names. Tags like work titles, artist names, highres, official art, game CG, and other meta tags have been removed. As an exception, 'anime screenshot' has been trained individually. These meta tags were removed because they were commonly applied to many unrelated images and could overload the tokens, causing the model to learn unwanted concepts.

■Breakdown of the dataset:

Existing: 70,000

Concept enhancement: 200,000

Aesthetic enhancement: 130,000
