>it is the distribution of copied information that copyright protects against, the information has been uploaded in vectors
The "vectors" are not "uploaded" or "copied" into a file or neural network, they're transformed. In the context of stable diffusion, They're transformed, progressively "noised" or "corrupted", "diffused" with random gaussian noise, and in the context of stable diffusion, it's "trained" and "learns" how to "denoise" various "noised" stages of images represented as a vector of pixel data into their original form.
Then, to generate images consistent with an "annotation" or "prompt", the model is "conditioned towards" or "biased towards" that annotation with more training: an input image is noised, and that vector of pixels is combined with a vector encoding the image's annotation. The model then learns to denoise with that conditioning information, the annotation, available to it.
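Again, just a toy sketch of the conditioning idea, not the real architecture (actual Stable Diffusion feeds CLIP text embeddings into the U-Net through cross-attention rather than a flat concatenation); the dimensions and names here are invented:

```python
import torch
import torch.nn as nn

# Hypothetical toy dimensions; real Stable Diffusion conditions a U-Net on CLIP
# text embeddings via cross-attention, not a plain concatenation like this.
IMG_DIM, TXT_DIM, T = 3 * 64 * 64, 77 * 16, 1000

class TinyConditionedDenoiser(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + TXT_DIM + 1, 512), nn.ReLU(), nn.Linear(512, IMG_DIM)
        )

    def forward(self, x_t, t, text_emb):
        # Combine the noised image, the timestep, and the annotation embedding,
        # so the predicted noise is "biased" toward images matching that annotation.
        h = torch.cat([x_t.flatten(1), text_emb.flatten(1), t.float().unsqueeze(-1) / T], dim=-1)
        return self.net(h).view_as(x_t)

betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

model = TinyConditionedDenoiser()
x0 = torch.rand(4, 3, 64, 64)                      # clean training images
text_emb = torch.randn(4, 77, 16)                  # stand-in for encoded annotations
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
a = alpha_bars[t].view(-1, 1, 1, 1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps         # noise the image, as before
loss = nn.functional.mse_loss(model(x_t, t, text_emb), eps)
loss.backward()
```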
Then you can take the trained model and do the same thing with just a text prompt as the conditioning vector and a vector of random Gaussian noise, with no input images at all.
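And here's a hand-wavy sketch of generation: start from pure Gaussian noise and repeatedly denoise it, with only the prompt embedding to guide it; no training image is ever read back in. This is a textbook DDPM-style loop, not Stable Diffusion's actual sampler:

```python
import torch

@torch.no_grad()
def sample(model, text_emb, shape=(1, 3, 64, 64), T=1000):
    """Reverse diffusion: start from pure Gaussian noise and denoise step by step,
    guided only by the text embedding. No training image is involved here."""
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                          # pure random noise, the starting point
    for i in reversed(range(T)):
        t = torch.full((shape[0],), i)
        eps_hat = model(x, t, text_emb)             # predicted noise, conditioned on the prompt
        x = (x - betas[i] / (1 - alpha_bars[i]).sqrt() * eps_hat) / alphas[i].sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x
```

With the toy conditioned model above, that would be something like `img = sample(model, torch.randn(1, 77, 16))`; notice the only inputs are the model's weights, the prompt embedding, and random noise.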
That's, very simplistically, how it works.
The output is not a substantial reproduction of the input images + annotations it was trained on. The model takes random noise and "tries" to denoise it into something consistent with the prompt, with the conditioning there to guide it.
Your attempt at a cover would be a substantially similar reproduction; your goal is to make a reproduction. Whereas the model "learns" to generate images consistent with an annotation/prompt by conditioning it with that "goal" on top of how it "learned" to denoise images.
The "vectors" are not "uploaded" or "copied" into a file or neural network, they're transformed. In the context of stable diffusion, They're transformed, progressively "noised" or "corrupted", "diffused" with random gaussian noise, and in the context of stable diffusion, it's "trained" and "learns" how to "denoise" various "noised" stages of images represented as a vector of pixel data into their original form.
Then, when it comes to generation of images consistent with an "annotation" or "prompt", it is "conditioned towards" or "biased towards", with more "training", by noising an input image, and concating or combining that vector of pixels with a vector of the annotation of the image. It then "learns" to denoise with that conditioning information, the annotation.
Then, you can take the trained model, and do the same thing, with just a text prompt as a vector concated to a vector of random gaussian noise, and no input images.
That's basically and very simplistically how it works.
The output is not a substantial reproduction from the input images + annotations when trained. It takes the random noise, and "tries" to denoise it into something consistent with the prompt with conditioning to guide it.
Your attempt at covering would be a substantially similar reproduction. Your goal is to do a reproduction. Whereas, the model "learns" to generate images consistent with an annotation/prompt, by conditioning it with that "goal" on top of how it "learned" to denoise the images.