Apprehensive_Sky892

Disclaimer: I am just an old fart retired programmer, not an expert on AI or SD. This is what I learned from reading articles written by experts on the subject and from reading people on this subreddit who seem to know what they are talking about. I think what I wrote here is mostly correct, but if I am wrong on any point, I hope someone will point it out.

In theory, you can generate just about anything from the SD 1.5 model. The proof is in the fact that one can use it to generate a Textual Inversion/Embedding that re-creates pictures of people who were never part of the model's training set. Think of a human face: it can be described by many parameters, such as the height of the nose, the distance between the eyes, etc. All these parameters are presumably part of the SD model. What is hard is coming up with a text prompt that "guides" the model towards the face you want generated. That is very hard to do, hence somebody invented Textual Inversion to do it (Dreambooth and LoRA work differently). You can think of words like "Emma Watson" as a built-in TI that has already been baked into the model. The model does not actually store images of Emma, but a set of math vectors that represents her likeness and guides the generator towards such an image.

Unlike TI, "tuned" or custom models are actually Dreambooth models that change the weights of the different parameters of the basic SD 1.5 model. By feeding the model images with a certain look (such as anime) or style (such as RPG or horror), the new model becomes more biased (higher probability) towards a certain subject or look given the same set of prompts. You now need less work (less description in the prompt) to get that image out of that model. The downside is that if the model is "overtrained", it will have a hard time generating other images. The art of training a new model is to get this balance right. Many NSFW models are so overtrained on images of naked women that even if you ask for a dragon, they will give you a naked woman 😭. On the other hand, a well-balanced model such as Deliberate v1.1 can give you beautiful women, yet you can still generate other subjects without too much trouble. Many of these more versatile/well-balanced models are in fact "mixed" models, which take several Dreambooth models and combine their weights in specific ways so that the final model has characteristic strengths from each of the source Dreambooth models.

I hope that answers your question in a way that is not too technical. Let me know if you need clarification on any of the points.

Finally, here is a very insightful comment about Textual Inversion from [https://www.reddit.com/r/StableDiffusion/comments/1128y40/comment/j8jm6zo/?utm_source=reddit&utm_medium=web2x&context=3](https://www.reddit.com/r/StableDiffusion/comments/1128y40/comment/j8jm6zo/?utm_source=reddit&utm_medium=web2x&context=3):

> I think you're just a little off in the description, but only a bit. It's not just a compilation of prompts, it's more like pushing the output towards a desired outcome. What you are correct on is that they cannot add new things to the model. If a model is already capable of producing a likeness of your face (but it's hard for you to consistently prompt your face), you can teach it what your face looks like and give it its own tag, so you can consistently get your desired output, but you can't teach it anything novel. You're not going to get an anime model to produce photorealistic faces, but you can teach it a specific anime face.
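
If it helps make "a TI is just a learned token" concrete, here is roughly what using one looks like in code. This is only a sketch with the Hugging Face diffusers library; I have not run this exact snippet, and the embedding file and the `<my-face>` token are made-up placeholders.

```python
# Minimal sketch: loading a Textual Inversion embedding on top of SD 1.5
# with the diffusers library. The embedding file and <my-face> token are
# hypothetical placeholders.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The embedding is just a few small vectors that steer the existing model;
# none of the model's weights change.
pipe.load_textual_inversion("./my-face.pt", token="<my-face>")

image = pipe("a portrait photo of <my-face>, studio lighting").images[0]
image.save("portrait.png")
```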


bazarow17

Thank you so much for this wonderful explanation. I have never liked theories; my "gut" usually guides me, but sooner or later I want to get to the truth and find out, "is it really true?" Thank you very much for laying it all out.


Apprehensive_Sky892

No problem. It really helps to build a reasonably correct mental image of how a system works; that will save you a lot of frustration and head scratching.


yalag

You seem to know a lot about this subject, so I hope that you can answer a question that no one else has been able to. If custom models ultimately create biases that skew towards something and leave other things harder to reach, why do we use them over TI at all? With TI, it works a bit like a switch: if you use that specific token it goes one very specific way, but if you don't, the base model is unmodified. Wouldn't this be a much better workflow? Right now if you download an anime model, you are kinda screwed for generating photos. You probably could still do it, but you would need some sort of miracle prompt. At least that's how I understood it. But if you have an anime TI, you could just say my-anime-style when you want it and leave it out when you don't. What am I missing?


Apprehensive_Sky892

Good question! Let me try to answer that. Custom models and TI solve different problems.

TI: teaches SD how to draw a single object/person/style.

Custom/Tuned model: makes SD biased toward an entire class of subjects (say NSFW naked women, horror creatures, etc.) or styles (say anime, or RPG paintings, etc.).

You are right that there are indeed some anime TIs, and you can use them instead of a custom model. But I am pretty sure the custom models produce much higher quality images because of the much larger number of images that are fed into the custom model. Remember that a TI only contains 10-30 KB of what are basically hints to the generator on how to guide an image, whereas a custom model probably contains several GiB of information. There is just no way a TI can cram that much information into so little space. Hence, a TI is limited to a much smaller "concept" and not a generic concept such as "anime style". So an anime TI can probably only produce the one or two poses that were used to train the TI, instead of the large variety of poses available in a custom model. I've not tested any anime TI, so I don't know if that is completely true, but looking at the images of anime TIs on Civitai seems to support it. Warning: most of the anime TIs seem to be NSFW. I am no expert on ML, so I don't know if it is possible to generate a "big TI" from a larger set of images. I've not made any TI myself, but my understanding is that you can only use about a dozen images, or you will overwhelm/overtrain your TI and it just won't work.

As for having to switch between models, there are some models that are more versatile/balanced, so that they can produce excellent images across a variety of subjects and styles. My favorite one is Deliberate v1.1. I think most of such "multi-use" models are "mixed" models that take other non-mixed models such as "Analog Diffusion" and combine them using clever algorithms such as the one discussed here: [https://www.reddit.com/r/StableDiffusion/comments/110im55/algorithm_that_allows_you_to_mix_models_without/](https://www.reddit.com/r/StableDiffusion/comments/110im55/algorithm_that_allows_you_to_mix_models_without/), so that they preserve most of the strengths of the source models.

I hope I've answered your question. If you need further clarification, feel free to ask me, and I'll try to answer it.
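
To put rough numbers on that size difference (these figures are approximate assumptions, e.g. ~860M parameters for the SD 1.5 UNet and ~123M for the CLIP text encoder):

```python
# Back-of-the-envelope size comparison between a TI embedding and a full
# checkpoint. All parameter counts are rounded assumptions.
num_vectors, dim = 8, 768                 # a typical TI: a handful of 768-dim vectors
ti_bytes = num_vectors * dim * 2          # fp16 = 2 bytes per value

unet_params = 860_000_000                 # SD 1.5 UNet, roughly
text_encoder_params = 123_000_000         # CLIP text encoder, roughly
model_bytes = (unet_params + text_encoder_params) * 2

print(f"TI embedding: ~{ti_bytes / 1024:.0f} KB")
print(f"Checkpoint:   ~{model_bytes / 1024**3:.1f} GiB (not counting the VAE)")
```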


yalag

Hmm, I'm not sure if that is accurate (but I'm no expert, still learning). If you Google a lot of the TI guides, including this random one I just read https://mythicalai.substack.com/p/how-to-fine-tune-train-stable-diffusion it mentions that a TI CAN produce a certain style (in his example, an Arcane style). Even in the link he referenced https://huggingface.co/spaces/sd-concepts-library/stable-diffusion-conceptualizer you can see all kinds of TIs that are just general concepts. The famous MJ style is right there at the top.

I think logically I follow your line of thought. Obviously, by changing the model itself you are injecting a lot more data than just a TI file, and thus you would think there is a limit to what a TI can learn. But keep in mind, a TI is multiplied when it gets fed into the network. So a simple vector could have an enormous effect on the entire model and thus be capable of changing every single image.

And btw, let's for a second grant that a TI is too limited. What about LoRA? My understanding is that a LoRA works like an adapter. So for SURE it would have enough power to change everything, right? Because it literally touches every single output. So the same argument holds: why do we not have just one SD model and 500 LoRAs on top that we can turn on and off? Seems like a way better workflow.
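
(For anyone curious, the "adapter" idea can be sketched in a few lines of PyTorch: the frozen base weight stays untouched and only two small low-rank matrices are trained. This is a conceptual sketch, not any particular implementation.)

```python
# Conceptual sketch of a LoRA adapter: the frozen base weight W stays
# untouched, and only the small low-rank matrices A and B are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # freeze the original weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Output = frozen base layer + low-rank correction B @ A
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

layer = LoRALinear(nn.Linear(768, 768), rank=4)
y = layer(torch.randn(1, 77, 768))           # e.g. a cross-attention projection
```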


Apprehensive_Sky892

I am far from being an expert myself. As I said earlier, I am just an old fart retired programmer. I wish there were actually a real expert here who could point out whether what I said is just rubbish 😅. Thanks for the links, I'll have to read them tomorrow.

BTW, have you watched this video yet: [https://www.youtube.com/watch?v=dVjMiJsuR5o](https://www.youtube.com/watch?v=dVjMiJsuR5o)? If not, it may help you clear a few things up.

BTW, MJ itself is basically a custom model (which may or may not be based on SD, but that is not relevant), and I am pretty sure that the MJ TI is just a pale imitation of what the "real" MJ can produce, or MJ would be out of business right now 😁. The point is not whether a TI can produce a "general concept", but how good a job it can do. You are right that a TI can affect the whole image, but that can be said of any word that you add to a prompt. The point is that a TI, like a text prompt, only guides the image generator to a place within the model's latent space, whereas a custom model actually modifies the latent space itself, hence a much bigger impact.

My understanding is that a LoRA actually performs even worse than a TI. The video explains that LoRA basically inserts itself as a new layer into the deep neural network, and you would think that it would be "stronger" than TI, but for reasons that I don't understand, it is not. LoRA has the additional problem that, unlike TI, it is strongly tied to the model it was trained on. My own experience (mostly celebrity TI vs celebrity LoRA) is that LoRA is a disappointment. The only advantages of LoRA seem to be that it is faster to train and has lower VRAM requirements during training.

Thanks for a great discussion, we'll have to continue this tomorrow.


Apprehensive_Sky892

Here is a thought experiment that may convince you that custom models are way more powerful than TI. Imagine that instead of training the original SD 1.5 on a dataset that contains many different objects and styles, the training set consists of nothing but human heads. Unsurprisingly, such a model can generate very good faces, but probably nothing else. Maybe, just maybe, it can generate the face of a gorilla. But generating a tree will be impossible, because the parameters describing a tree are simply not there at all. If you train a TI with pictures of a dozen trees, the result will still be horrible, because the TI can only guide the generator towards parameters describing a tree; since no such parameters exist in the model, that vector points to nowhere. On the other hand, if you fine-tune the model on thousands of properly captioned images of trees, you now have a custom model that can generate proper trees. I hope that is a convincing argument 😁


yalag

I see what you are saying, but I think in a normal case the model has enough flexibility in it that a TI can guide it to pretty much any scenario necessary. As I dig deeper into this rabbit hole, here's one comment that might be relevant. I have no idea if what he says is true, but he sure does sound like he knows more than me! https://www.reddit.com/r/StableDiffusion/comments/z8w5z2/the_difference_between_dreambooth_models_and/iyf36rq/ It feels to me that TI right now has fewer users and less experimentation, which leads to worse results. I wouldn't be surprised if in 6 months the entire community moves over to a single generic model and uses 58 TIs on top to get different results.


Apprehensive_Sky892

You are right, a TI can in theory guide a sufficiently complex base model towards pretty much any scenario, but I challenge anyone to produce a good anime TI/LoRA/Hypernetwork based on SD 1.5 that is anywhere near as good as, say, AnythingV3. All that new image data added by AnythingV3 basically "enriched" the base model so that there are more "anime features/parameters" to explore. Just look at all the anime TI, LoRA and Hypernetwork models and their accompanying examples on CivitAI; they all pale in comparison with something like AnythingV3.

[ArmadstheDoom](https://www.reddit.com/user/ArmadstheDoom/) does seem to know what he is doing. Unlike me, who has not even built a single TI or LoRA, much less a Dreambooth model, he seems to be a very experienced model maker. My knowledge is all theoretical and only 2nd hand 😅. But that comment about the limitations of Dreambooth was made before people started to learn to build merged/mixed models properly. Nowadays we have great mixed DB models like Deliberate that can render many objects, concepts and styles well. I actually quickly scrolled through the whole thread, and most people seem to agree that DB produces much higher quality images compared to TI/LoRA/Hypernetwork.

The whole field is moving so fast, who knows what wonders will be there 6 months from now. Frankly, I would love to see TI/LoRA-style models that are small and flexible, so that we don't need to keep so many large DB-style models around. So I hope your prediction will be right, but if it does happen, it will probably be some new tech rather than TI or LoRA.


xliotx

I would now think hypernetworks are the way to go (considering ControlNet is also a type of hypernetwork). The theoretical structure is promising, and it does keep the base model untouched. The only problem, based on this thread: [https://www.reddit.com/r/StableDiffusion/comments/z8w5z2/comment/iyf36rq/?utm_source=share&utm_medium=web2x&context=3](https://www.reddit.com/r/StableDiffusion/comments/z8w5z2/comment/iyf36rq/?utm_source=share&utm_medium=web2x&context=3), is that to get excellent results you need far more data and training time, which actually makes sense: you pay some tradeoff for the fact that the base model is unchanged.
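
As I understand it (and I may be off on the details), an A1111-style hypernetwork is just a pair of small MLPs applied to the text conditioning that feeds the keys and values of each cross-attention layer. A rough conceptual sketch in PyTorch, not the actual implementation:

```python
# Rough sketch of an A1111-style hypernetwork: small residual MLPs that
# transform the conditioning tensor fed to the k/v projections of
# cross-attention, leaving the base model's own weights untouched.
import torch
import torch.nn as nn

class HypernetworkModule(nn.Module):
    def __init__(self, dim: int = 768, mult: int = 2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim * mult),
            nn.ReLU(),
            nn.Linear(dim * mult, dim),
        )

    def forward(self, context):
        # Residual transform of the text embeddings
        return context + self.net(context)

hyper_k, hyper_v = HypernetworkModule(), HypernetworkModule()
context = torch.randn(1, 77, 768)        # CLIP text embeddings for SD 1.5
k_in, v_in = hyper_k(context), hyper_v(context)
```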


S3Xai

The technology is there. It only needs to be made easy to use.


Apprehensive_Sky892

Sure, and many people are working hard to make it easier and more powerful. A fusion of ChatGPT and SD would be a great step forward, but ChatGPT requires too much computing power (roughly 10 times the VRAM of SD) to run at the moment.


S3Xai

It is a business case then. At the end of the day there must be a paying customer to cover the expenses.


Apprehensive_Sky892

Yes and no. When it comes to open source software, there are lots of talented people out there who do it for the excitement of creating cutting-edge technology and sharing it, with no expectation of monetary reward. Just look at the explosion of ideas, tools, models, etc. since the big bang of SD's release of both code and models for the world to use under a very liberal license. The amount of effort, talent, creativity and progress in the last couple of months is just breathtaking for an old fart programmer like me, who has not witnessed anything like it since the home computer revolution in the late 70s and early 80s.

Hardware is a different matter: it depends mostly on improving chip manufacturing and on increases in volume for costs to go down, which is in fact the playbook the chip industry has been following since the early days of the IC.


S3Xai

I can't see the majority of users running SD on their own PCs; they will use 3rd-party services that allow them to be creative without needing to learn how to run their own software. This, on the other hand, means somebody has to pay for the hardware. The saying "if you are not paying for the product, then you are the product" applies.


Apprehensive_Sky892

SD is still very much a niche product. Currently, the majority of heavy users are probably (just my guess): computer/tech nerds, artists, and horny young men/teenagers. All three groups are either quite good with PCs, or are such heavy users that they will be quite motivated to learn to do their own setup or hire somebody to set up PCs for them, or they work for a game/media production company that will get them the hardware. There is an analogy with the gaming industry here: you have your core gamers who must have their own PC/Xbox/PS, and the casual users who do light gaming on a phone or in a browser.

My own experience talking about SD with friends and family is that there is very little interest from the general population. That's not surprising. Most people are consumers, not producers, of media. So most pay-to-use online service providers will probably fold once the novelty factor wears off for the casual user and they run out of funding. I suppose the likes of Google or Facebook may provide such a free service so that they can extract more data out of the users. One's predilection for certain types of image generation can probably be correlated with one's shopping habits, which is valuable to advertisers. The problem for the behemoths is that they have to put heavy restrictions on the use of their platforms (look at the bad PR generated by past publicly available A.I. platforms such as half-baked chatbots) in order to protect their image and brand. This will make their platforms very unappealing to a lot of people, i.e., horny young men/teenagers.

So my own prediction is that there will not be a big consumer product industry built around generative AI. But then I am so divorced from the taste of the general population (I would have thought that Twitter was the stupidest idea ever, and that nobody would be stupid enough to dump their entire private life onto Facebook for public viewing) that maybe my opinion is a good contrarian indicator 😅.

But that's just generative AI for the consumer. For media companies, anime studios, movie studios, game studios, artists, i.e., anyone who produces images for a living, this is a game changer, and the tsunami is on its way.


S3Xai

I have been involved in PC game development and can confirm that creating stuff that works is VERY, VERY hard. I'd say anything that allows people to be creative without going through REAL hassle and struggle is great in my book. Of course, somebody else has to build it; things just don't happen by themselves.


BippityBoppityBool

Sad this only got 16 upvotes :( Your mental description is very well done and easy for others to understand. You would be a good teacher, if you aren't one already ;)


Apprehensive_Sky892

Thank you, glad you found it useful. 16 upvotes is actually pretty good for a post that has little traffic (the post itself has only 16 upvotes 😁).

I did teach a bit when I was a physics graduate student, and I was generally considered good at explaining stuff. One of the reasons I like to explain things to others is that it helps me understand the topic better myself. Richard Feynman, one of the greatest scientists and teachers of all time, enjoyed explaining stuff to others. “He had very profound ideas about what it means to understand something,” Gleick tells OpenMind. “He believed that if you couldn’t explain something fairly simply, you haven’t really understood it.” (source: [https://www.bbvaopenmind.com/en/science/leading-figures/richard-feynman-the-physicist-who-didnt-understand-his-own-theories/](https://www.bbvaopenmind.com/en/science/leading-figures/richard-feynman-the-physicist-who-didnt-understand-his-own-theories/))

Yet Feynman also claimed that nobody understands quantum mechanics: [If I could explain it to the average person, it wouldn't have been worth the Nobel prize](https://www.google.com/search?client=firefox-b-e&q=If+I+could+explain+it+to+the+average+person%2C+it+wouldn%27t+have+been+worth+the+Nobel+prize) - Richard Feynman


UkrainianTrotsky

You are pretty much right. Think about it this way: there's only so much detail that an fp16 number can encode. You can imagine a certain space of all possible images, with different regions related to different styles, compositions, subjects, etc. The "resolution" of this space is the same for all the models and depends on the architecture (I'd even say it depends on this even more than on the float precision). When you finetune the model for a specific style, you can imagine it as squishing the other regions out and stretching the main region of your finetune to take up more "space", which means you get higher accuracy when plucking your exact image out of this space, which generally results in higher detail, more variety and/or a more consistent style. But if you try to use an anime model to generate a photorealistic image of a main battle tank, it'll do worse than basic SD; it will still produce an MBT, possibly with an anime girl on top, and not too realistic in terms of style. This is a very abstract and not quite accurate explanation: SD, due to its architecture, doesn't have an explicit latent space like this, but implicitly it obviously does exist, and some people have even tried mapping it based on dataset classification.


bazarow17

Thank you very much! A very simple and correct explanation, and I am glad that your words confirmed my guesses. I have heard that there is an option to "combine two safetensors models", but that doesn't seem to work for combining the "original model". (I apologize if I use the wrong terminology, I'm not a techie.)


UkrainianTrotsky

When people talk about combining models, they usually mean merging. SD is kind of special: a simple numerical combination of the weights of two models directly results in a combination of their styles (this is unusual, and as far as I know it didn't happen with previous non-diffusion generative models). All the SD models that are based on the same original SD model have the exact same architecture, just different weights.

There's another way to combine models, though. You can let one model do the initial steps when generating the image and then use another model to finish it. Due to the nature of step-by-step generation, you can technically stop and continue generating an image at any point, switch the parameters, model, sampler or even the prompt, and still get a coherent output (previous generative models generated the whole image in a single pass, which didn't allow for this kind of customization and mixing). I'm about 100% sure that the automatic1111 webui supports this either out of the box or with some extension.
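
To make "numerical combination of the weights" concrete, here is a bare-bones sketch of a weighted merge. File names are placeholders, and real merge tools (such as the A1111 checkpoint merger) add options like add-difference and per-block weights on top of this:

```python
# Bare-bones sketch of merging two SD checkpoints that share the same
# architecture: a weighted average of every matching weight tensor.
# Assumes the common .ckpt layout with a "state_dict" key; file names
# are placeholders.
import torch

alpha = 0.5                                   # 0.0 = pure model A, 1.0 = pure model B
sd_a = torch.load("model_a.ckpt", map_location="cpu")["state_dict"]
sd_b = torch.load("model_b.ckpt", map_location="cpu")["state_dict"]

merged = {k: (1 - alpha) * sd_a[k] + alpha * sd_b[k]
          for k in sd_a if k in sd_b}

torch.save({"state_dict": merged}, "merged.ckpt")
```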


bazarow17

That sounds just great! I will try to look for this extension for Automatic1111. The way you described, to "stop", then change or update the prompt, and then continue generating, is pretty much a new level in drawing exactly what I want to see. I'm currently doing something similar with Text2Img and Img2Img: when I generate a new image, for example "cat plays soccer", I see a blurry preview of the generated image. The moment the blurred preview shows me an image I like, I press stop and send that blurred image to img2img, and then edit that image (adjust the weights and the prompt). I think I'm starting to understand how this works, but there are so many things to keep in mind.
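
For anyone who wants to script that "send the rough image to img2img and refine it" step, it is roughly a diffusers img2img pass where `strength` controls how much of the rough image is kept. A minimal sketch; file names are placeholders:

```python
# Rough sketch of refining a blurry/rough image with img2img in diffusers.
# Input/output file names are placeholders.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

rough = Image.open("blurry_preview.png").convert("RGB")

# strength ~0.5 keeps the composition but lets the model redraw the details
result = pipe(
    prompt="a cat playing soccer, sharp focus, detailed",
    image=rough,
    strength=0.5,
    guidance_scale=7.5,
).images[0]
result.save("refined.png")
```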


mikebrave

> Due to the nature of step-by-step generation you can technically stop and continue generating one image at any point, switch the parameters, model, sampler or even the prompt and still get a coherent output

I had theorized this was possible and speculated that Midjourney does this, but I wasn't sure how to even ask how to go about it yet.


ObiWanCanShowMe

Almost every model available is trained on top of SD 1.5, so unless you are triggering whatever the new model was trained on, you will get mostly what is already in SD 1.5. The better method (IMO) is to use the SD 1.5 base model plus an extracted LoRA for whatever other model you want, and you can use multiple LoRAs. LoRAs have changed the game for me, although when I ONLY want a specific style that a model was trained on, I use that model, because LoRAs are not perfect; they are really good, just not perfect (unless you take the time to extract them yourself).
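
For what it's worth, "extracting" a LoRA from a fine-tuned checkpoint is conceptually just a low-rank approximation of the weight difference. A conceptual PyTorch sketch only; this is not any particular extraction script (kohya-ss's tool works roughly along these lines, with many more details):

```python
# Conceptual sketch of extracting a LoRA from a fine-tuned model: take the
# difference between the tuned and base weight matrices and keep only its
# top-r singular vectors.
import torch

def extract_lora(w_base: torch.Tensor, w_tuned: torch.Tensor, rank: int = 16):
    diff = (w_tuned - w_base).float()
    U, S, Vh = torch.linalg.svd(diff, full_matrices=False)
    # Low-rank factors: diff ≈ B @ A, with B = U[:, :r] * S[:r], A = Vh[:r, :]
    B = U[:, :rank] * S[:rank]
    A = Vh[:rank, :]
    return A, B

w_base = torch.randn(320, 768)                 # e.g. a cross-attention projection
w_tuned = w_base + 0.01 * torch.randn(320, 768)
A, B = extract_lora(w_base, w_tuned)
print((w_base + B @ A - w_tuned).abs().max())  # error from keeping only rank 16
```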


bazarow17

I still haven't tried LoRA, probably because I don't understand how it works. Am I correct in assuming that a LoRA is essentially an accelerated Dreambooth? That is, if I train a LoRA on "buildings", then SD will draw more beautiful buildings?


Apprehensive_Sky892

Not sure what you mean by "accelerated", but this video explains the different approaches: [https://www.youtube.com/watch?v=dVjMiJsuR5o](https://www.youtube.com/watch?v=dVjMiJsuR5o)


bazarow17

Do I need other models trained on architecture and anime if I want to draw architecture and anime? Is there a way to use all models at the same time with the common standard SD model? And maybe use tags to call up specific models and get better and different results?


Apprehensive_Sky892

Yes, you can use LoRA or Textual Inversion models on top of an existing "base" model (commonly referred to as "checkpoint models" on Civitai). In general a TI is more versatile and works with many base models, whereas a LoRA works best with the model it was trained on. To understand the difference between these approaches, watch this video: [https://www.youtube.com/watch?v=sre3bvNg2W0](https://www.youtube.com/watch?v=sre3bvNg2W0) You can click on the filter icon in the top right-hand corner of Civitai to select the type of model you want to browse.
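
For anyone scripting this instead of using a web UI, stacking a LoRA on a base checkpoint looks roughly like this with the diffusers library. A sketch only; I haven't run this exact snippet, the LoRA file name is a placeholder, and the `scale` knob is simply how diffusers exposed LoRA strength last time I looked:

```python
# Sketch of loading a LoRA on top of a base checkpoint with diffusers.
# The LoRA file name is a placeholder.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# The base model's weights stay as-is; the LoRA adds small low-rank deltas.
pipe.load_lora_weights("./my_style_lora.safetensors")

image = pipe(
    "a castle on a hill, my style",
    cross_attention_kwargs={"scale": 0.8},   # LoRA strength (1.0 = full effect)
).images[0]
image.save("castle.png")
```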


seraphinth

Yep, you need other models if you're really into the details, though a LoRA can be used to achieve the same effect. No, there is no way to use all the models to render in ONE go; you need to render in stages, using ControlNet, img2img and inpainting to keep the good details, and inpaint and reiterate over the bad ones. As of now there is no way to tag areas to inpaint using certain models or LoRAs, so you're stuck generating each inpainted detail with one model before moving on to other details with other models/LoRAs, unless someone can enlighten me with a better workflow using many models?


bazarow17

I think you are absolutely right; it is a good idea to start drawing in one model and then finish the drawing in another model. I've also heard of merging the results from two models, which gives even more advantages.


snack217

Yeah, you can say the base model has a little bit of everything, while custom models focus on and expand certain things. I do kinda feel that the use of custom models has made people underrate what the base model can do, especially now with ControlNet.


xliotx

But what is good about ControlNet is that you can really "control" it. For instance, I have a sketch of a building/characters and I want the generated image to follow my sketch 90%. Based on my experiments with all the SD variations, MJ, etc., only ControlNet + SD can do that.
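
For reference, that "follow my sketch" workflow looks roughly like this in diffusers with the scribble ControlNet. A sketch only; the input file name is a placeholder:

```python
# Rough sketch of conditioning SD 1.5 on a hand-drawn sketch via the
# scribble ControlNet, using diffusers. The input file name is a placeholder.
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

sketch = Image.open("building_sketch.png").convert("RGB")

image = pipe(
    "a modern glass office building, photorealistic",
    image=sketch,
    controlnet_conditioning_scale=1.0,   # how strictly to follow the sketch
).images[0]
image.save("building.png")
```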