
VideoStory

This is an experiment with Stable Diffusion (1.5/SDXL), Llama (2/3), Phi, and others that allows for the generation of "video" (a sequence of images) with a narrated story. This program is not really intended for practical use.

Why does this project suck?

Context Limitation

AIs are limited by the size of their context. Too much, and the AI goes crazy; not enough, and the output is worse than the usual trash. It is impossible to create a "real" long story fully with AI because:

  • If you generate it in one go, as in V1, the AI will create a pretty short story, and the further it goes, the more it becomes incoherent and repetitive.
  • If you generate it in multiple parts, as in V2 (see the sketch after this list), the story might be more coherent and better in the long term, but the overall quality will be lower, because when you rebuild the whole story into one big text, there are a lot of "artifacts."
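
As a rough illustration of the V2 idea, here is a minimal sketch of chunked generation where each part only sees a short summary of what came before. The helper llm_generate is a hypothetical stand-in for whatever backend actually produces the text; it is not the project's real API.

```python
# Minimal sketch of V2-style chunked generation (hypothetical helper).
# llm_generate(system, prompt) stands in for a real LLM call.

def generate_story_in_parts(premise, n_parts, llm_generate):
    parts = []
    summary = ""
    for i in range(n_parts):
        prompt = (
            f"Premise: {premise}\n"
            f"Summary of the story so far: {summary or '(nothing yet)'}\n"
            f"Write part {i + 1} of {n_parts} of the story."
        )
        part = llm_generate("You are a storyteller.", prompt)
        parts.append(part)
        # Only the summary is carried forward, so the context stays small,
        # but stitching the parts back together leaves the "artifacts"
        # mentioned above.
        summary = llm_generate("You summarize text.", f"Summarize briefly:\n{part}")
    return "\n\n".join(parts)
```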

Self-Biasing Limitation

AIs bias themselves all the time because of their context. If there were no context, there would be no bias, but also no output. AI self-biasing is the same thing as human bias but on a much larger scale. Everything biases an AI toward its final output. The proof is that if you prompt the AI to generate a story about a cat, it will generate a story about a cat. However, this is also an issue, because every word in its context is taken into account to generate the final output, along with all the "artifacts" it created along the way. For one artifact, ten more are generated, and the output rapidly becomes garbage. This is because AIs are probabilistic machines, i.e., useless for tasks that require more than just probabilities.

This self-bias is really visible in V2 because, at each pass, the AI's context is cut and modified. This means that instead of having one AI with one context and one bias, we have multiple versions of the AI with different biases. This creates a LOT of artifacts, as they all have different "states of mind" and "goals." You could visualize the AI's bias as a vector made of all the tokens/n-grams in its context. While V1 only uses one context, with one vector pointing in one direction, V2 uses multiple contexts with multiple vectors all pointing in "kind of the same direction" but still diverging.
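
To make the "bias as a vector" picture concrete, the toy snippet below builds n-gram count vectors for two context chunks and measures how far apart they point with cosine similarity; the lower the score, the more the two "versions" of the AI diverge. This is only an illustration of the idea, not something the project actually computes.

```python
from collections import Counter
from math import sqrt

def ngram_vector(text, n=2):
    """Count word n-grams: a crude 'direction' for a context."""
    words = text.lower().split()
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

def cosine_similarity(a, b):
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# V1 keeps one context (one vector); V2 cuts the context into several chunks,
# each pointing in a slightly different direction.
part1 = "the cat walks through the empty city looking for food"
part2 = "the spaceship lands and the crew looks for the missing cat"
print(cosine_similarity(ngram_vector(part1), ngram_vector(part2)))
```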

Conclusion

To correct the issue, you would need to write the text yourself multiple times with various small wording variations and then train the AI with them. Then you would have a well-written and longer story, and V2's bias would probably be better (i.e., pointing more in the same direction). So yeah, shocker: writing your own story is better than using an AI to generate one, even with the most sophisticated methods. The same goes for image and audio generation.

Output example

https://uwo.nya.pub/forge/Joachim/VideoStory/src/branch/main/out.mp4

Flow charts

V1

flowchart TD;
sd{{"Stable Diffusion"}}
img1["Image 1"]
img2["Image 2"]
img3["Image 3"]
p1["Paragraphe 1"]
p2["Paragraphe 2 + (1)"]
p3["Paragraphe 3 + (1 + 2)"]
fa["Fichier Audio"]
vd{"Vidéo"}
prt{"Prompt"}
llm{{"Llama"}}
llm1{{"Llama"}}
llm2{{"Llama"}}
llm3{{"Llama"}}
tts{{"TTS"}}
prt --> llm;
llm --> Text;
Text --> p1;
Text --> p2;
Text --> p3;
Text --> tts;
tts --> fa;
p1 --> llm1;
p2 --> llm2;
p3 --> llm3;
llm1 --> sd;
llm2 --> sd;
llm3 --> sd;
sd --> img1;
sd --> img2;
sd --> img3;
fa --> vd;
img1 --> vd;
img2 --> vd;
img3 --> vd;
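
In code, the V1 flow boils down to one Llama pass for the full story, one Stable Diffusion image per paragraph, one TTS pass for the narration, and a final assembly step. The sketch below uses hypothetical helpers (llm_generate, sd_generate, tts_generate, assemble_video); the real implementations live in gen.py, main.py, and video.py and may differ.

```python
# Sketch of the V1 pipeline (hypothetical helpers, not the actual gen.py API).

def run_v1(prompt, llm_generate, sd_generate, tts_generate, assemble_video):
    # One shot: the whole story comes from a single Llama call.
    text = llm_generate(prompt)
    paragraphs = [p for p in text.split("\n\n") if p.strip()]

    # Each paragraph is turned into a Stable Diffusion prompt, then an image.
    images = []
    for paragraph in paragraphs:
        sd_prompt = llm_generate(f"Describe this scene for an image model:\n{paragraph}")
        images.append(sd_generate(sd_prompt))

    # The full text is narrated once, then everything is muxed into a video.
    audio = tts_generate(text)
    return assemble_video(images, audio)
```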

V2 (Unpublished)

stateDiagram-v2
state "Part 1" as p1
state "Part 2" as p2
state "Part N" as pN
state "Gen Story p1" as Gp1
state "Gen Story p2" as Gp2
state "Gen Story pN" as GpN
state "Summary 1" as S1
state "Summary 2" as S2
state "Summary N" as SN
state "Prompt 1" as pt1
state "Prompt 2" as pt2
state "Prompt N" as ptN
state "Gen illustration 1" as it1
state "Gen illustration 2" as it2
state "Gen illustration N" as itN
state "Gen TTS 1" as tt1
state "Gen TTS 2" as tt2
state "Gen TTS N" as ttN
state "Subtitle 1" as sub1
state "Subtitle 2" as sub2
state "Subtitle N" as subN
state "Video part 1" as v1
state "Video part 2" as v2
state "Video part N" as vN
state "Video Final" as vf
World --> Base
Description --> Base
Name --> Base
Base --> Master
Master --> Player : Until the max number of iterations (x) is reached
Player --> Master

Logs --> p1
Logs --> p2
Logs --> pN

p1 --> Gp1
p2 --> Gp2
pN --> GpN

Master --> Logs
Player --> Logs

Gp1 --> S1
Gp2 --> S2
GpN --> SN

S1 --> pt1
S2 --> pt2
SN --> ptN

pt1 --> it1
pt2 --> it2
ptN --> itN

Gp1 --> tt1
Gp2 --> tt2
GpN --> ttN

Gp1 --> sub1
Gp2 --> sub2
GpN --> subN

it1 --> v1
tt1 --> v1
sub1 --> v1
it2 --> v2
tt2 --> v2
sub2 --> v2
itN --> vN
ttN --> vN
subN --> vN

v1 --> vf
v2 --> vf
vN --> vf

World: World name
Description: World description/rules
Name: Main actor's name
Logs: Roleplay's logs
Master: AI leading the game
Player: AI choosing next state, with only current state context
p1: Part 1 of logs
p2: Part 2 of logs
pN: Part N of logs
Gp1: Story generated with Part 1
Gp2: Story generated with Part 2
GpN: Story generated with Part N
Base: Base prompt for leading AI
S1: Story summary
S2: Story summary
SN: Story summary
sub1: Video's subtitles
sub2: Video's subtitles
subN: Video's subtitles
pt1: Gen SD prompt with simplified story
pt2: Gen SD prompt with simplified story
ptN: Gen SD prompt with simplified story
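
The part of V2 that differs most from V1 is the roleplay stage: a "Master" model leads the game while a "Player" model only ever sees the current state, and their exchange becomes the logs that are later cut into parts. A rough sketch, with a hypothetical llm_generate helper and a made-up iteration limit:

```python
# Sketch of the V2 roleplay stage (hypothetical helper and parameters).

def run_roleplay(world, description, name, llm_generate, max_iterations=10):
    base = f"World: {world}\nRules: {description}\nMain actor: {name}"
    logs = []
    for _ in range(max_iterations):
        # The Master leads the game with the base prompt plus all logs so far.
        master_turn = llm_generate(f"{base}\n\nLogs:\n" + "\n".join(logs) + "\n\nMaster:")
        logs.append(f"Master: {master_turn}")
        # The Player only sees the current state, not the whole history,
        # which is one source of the diverging biases described above.
        player_turn = llm_generate(f"Current state:\n{master_turn}\n\nPlayer:")
        logs.append(f"Player: {player_turn}")
    # The logs are then split into Part 1..N, summarized, illustrated,
    # narrated, and assembled into video parts.
    return logs
```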

Libraries

Here are the dependencies:

re
llama_cpp
outetts
diffusers
torch
os
moviepy

Usage

In the main.py file, add the prompt in the call to main(). SYSTEMPROMPTT is the system prompt for Llama, SDBAD is the negative prompt for Stable Diffusion, and SYSTEMPROMPTI is the system prompt Llama uses to write the Stable Diffusion prompts.
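
For reference, these constants might look something like this in main.py (the variable names come from the project; the values below are made-up examples):

```python
# Example values only; the names come from main.py, the contents are invented.
SYSTEMPROMPTT = "You are a storyteller. Write a short, coherent story from the user's prompt."
SDBAD = "blurry, low quality, deformed, extra limbs, watermark, text"
SYSTEMPROMPTI = "Turn the given paragraph into a short Stable Diffusion prompt: comma-separated visual keywords only."
```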

promptTtoI.txt and promptUtoT.txt are the system prompts for Stable Diffusion and Llama, respectively.

In the gen.py file, in the functions loadllama(), loadtts(), and loadsdxl(), you need to add your models (local files).
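
If you are unsure what those functions should contain, the snippet below shows the general shape, assuming llama-cpp-python for the LLM and diffusers for Stable Diffusion; the paths are placeholders, the exact arguments depend on your models, and the TTS part is left to the outetts documentation.

```python
# Sketch of possible load functions (placeholder paths, assumed libraries).
import torch
from llama_cpp import Llama
from diffusers import StableDiffusionPipeline

def loadllama():
    # Point model_path at your local GGUF file.
    return Llama(model_path="models/your-llama-model.gguf", n_ctx=4096)

def loadsdxl():
    # from_pretrained also accepts a local directory with the model weights;
    # use StableDiffusionXLPipeline instead if you load an SDXL checkpoint.
    pipe = StableDiffusionPipeline.from_pretrained(
        "models/your-sd-checkpoint", torch_dtype=torch.float16
    )
    return pipe.to("cuda")

def loadtts():
    # Load your local TTS model here (see the outetts documentation for the
    # interface matching your model version).
    raise NotImplementedError
```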

The program is launched with main.py.