How I built an AI Text-to-Art Generator
  • A detailed, step-by-step write-up on how I built Text2Art.com
  • Overview
  • This article is a write-up on how I built Text2Art.com in a week. Text2Art is an AI-powered art generator based on VQGAN+CLIP that can generate all kinds of art such as pixel art, drawing, and painting from just text input. The article follows my thought process from experimenting with VQGAN+CLIP, building a simple UI with Gradio, switching to FastAPI to serve the models, and finally to using Firebase as a queue system. Feel free to skip to the parts that you are interested in.
  • If you like the project, you can vote for it here.
  • Outline
  • Introduction
  • How It Works
  • Generating Art with VQGAN+CLIP with Code
  • Making UI with Gradio
  • Serving ML with FastAPI
  • Queue System with Firebase
  • Introduction
  • Not long ago, generative art and NFTs took the world by storm. This was made possible by OpenAI's significant progress in text-to-image generation. Earlier this year, OpenAI announced DALL-E, a powerful text-to-image generator that works extremely well. To illustrate how well DALL-E works, these are DALL-E generated images for the text prompt "a professional high quality illustration of a giraffe dragon chimera. a giraffe imitating a dragon. a giraffe made of dragon".
  • Unfortunately, DALL-E was not released to the public. But luckily, the model behind DALL-E's magic, CLIP, was published instead. CLIP, or Contrastive Language-Image Pretraining, is a multimodal network that combines text and images. In short, CLIP is able to score how well an image matches a caption, and vice versa. This is extremely useful for steering the generator to produce an image that closely matches the text input. In DALL-E, CLIP is used to rank the generated images and output the image with the highest score (most similar to the text prompt).
  • A few months after the announcement of DALL-E, a new transformer-based image generator called VQGAN (Vector Quantized GAN) was published. Combining VQGAN with CLIP gives results of similar quality to DALL-E's. Many amazing artworks have been created by the community since the pre-trained VQGAN model was made public.
  • I was really amazed at the results and wanted to share this with my friends. But since not many people are willing to dive into the code to generate the art, I decided to make Text2Art.com, a website where anyone can simply type a prompt and quickly generate the image they want without touching any code.
  • How It Works
  • So how does VQGAN+CLIP work? In short, the generator produces an image and CLIP measures how well that image matches the text prompt. The generator then uses the feedback from the CLIP model to generate more "accurate" images. This iteration is repeated many times until the CLIP score becomes high enough and the generated image matches the text.
  • I won’t discuss the inner workings of VQGAN or CLIP here, as that's not the focus of this article. But if you want a deeper explanation of VQGAN, CLIP, or DALL-E, you can refer to these amazing resources that I found.
  • The Illustrated VQGAN by LJ Miranda: An explanation of VQGAN with great illustrations.
  • DALL-E Explained by Charlie Snell: A great explanation of DALL-E from the basics.
  • CLIP Paper Explanation Video by Yannic Kilcher: A video walkthrough of the CLIP paper.
  • X + CLIP
  • VQGAN+CLIP is simply one example of what combining an image generator with CLIP can do. You can replace VQGAN with any kind of generator, and the combination can still work really well depending on the generator. Many variants of X + CLIP have come up, such as StyleCLIP (StyleGAN + CLIP), CLIPDraw (which uses a vector art generator), BigGAN + CLIP, and many more. There is even AudioCLIP, which uses audio instead of images.
  • Generating Art with VQGAN+CLIP with Code
  • I’ve been using the code from the clipit repository by dribnet, which reduces generating art with VQGAN+CLIP to just a few lines of code (UPDATE: clipit has been migrated to pixray).
  • It is recommended to run this on Google Colab, as VQGAN+CLIP requires quite a lot of GPU memory. Here is a Colab notebook that you can follow along with.
  • First of all, if you are running on Colab, make sure you change the runtime type to use GPU.
  • Next, we need to set up the codebase and install the dependencies.
  • (NOTE: "!" is a special prefix in Google Colab that runs the command in bash instead of Python.)
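A rough sketch of the setup is shown below. The exact repository list and packages are from memory of the clipit README at the time and may have changed since the migration to pixray, so treat this as an outline rather than a definitive recipe.

```python
# Clone CLIP, the VQGAN code (taming-transformers), and clipit itself.
# (Repository list is an assumption based on the clipit README at the time.)
!git clone https://github.com/openai/CLIP
!git clone https://github.com/CompVis/taming-transformers.git
!git clone https://github.com/dribnet/clipit

# Install the Python dependencies used by the generators.
!pip install ftfy regex tqdm omegaconf pytorch-lightning kornia einops
```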
  • Once we have installed the libraries, we can just import clipit and run a few lines of code to generate art with VQGAN+CLIP. Simply change the text prompt to whatever you want. Additionally, you can pass clipit options such as the number of iterations, width, height, generator model, whether to generate a video, and many more. You can read the source code for more information on the available options.
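A minimal sketch of the generation code, following clipit's settings-based API (the prompt here is just an example; option names are best double-checked against the source):

```python
import clipit

# Start from the default settings, then describe what to generate.
clipit.reset_settings()
clipit.add_settings(prompts="underwater city", aspect="square")

# Apply the settings, initialize the models, and run the optimization loop.
settings = clipit.apply_settings()
clipit.do_init(settings)
clipit.do_run(settings)
```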
  • Once you run the code, it will generate an image. For each iteration, the generated image will be closer to the text prompt.
  • Generating Video
  • Since we need to generate an image for each iteration anyway, we can save these images and create an animation of how the AI builds up the final image. To do this, simply add video=True before applying the settings, as in the sketch below.
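Continuing the hedged sketch from above, with the extra flag:

```python
clipit.reset_settings()
# video=True keeps the per-iteration frames and renders them as an animation.
clipit.add_settings(prompts="underwater city", video=True)
settings = clipit.apply_settings()
clipit.do_init(settings)
clipit.do_run(settings)
```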
  • It will generate the following video.
  • Customizing Image Size
  • You can also modify the image size by adding the size=(width, height) option. For example, we will generate a banner image at 800x200 resolution, as sketched below. Note that a higher resolution will require more GPU memory.
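For instance, in the same sketch style (the prompt is an arbitrary example):

```python
clipit.reset_settings()
# size=(width, height): an 800x200 banner; larger sizes need more GPU memory.
clipit.add_settings(prompts="fantasy kingdom", size=(800, 200))
settings = clipit.apply_settings()
clipit.do_init(settings)
clipit.do_run(settings)
```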
  • Generating Pixel Arts
  • There is also an option to generate pixel art in clipit. It uses the CLIPDraw renderer behind the scenes with some engineering to force a pixel art style, such as limiting the palette colors, pixelization, etc. To use the pixel art option, simply enable use_pixeldraw=True.
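Again as a hedged sketch (the prompt is an arbitrary example):

```python
clipit.reset_settings()
# use_pixeldraw=True switches to the CLIPDraw-based pixel art renderer.
clipit.add_settings(prompts="knight in shining armor", use_pixeldraw=True)
settings = clipit.apply_settings()
clipit.do_init(settings)
clipit.do_run(settings)
```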
  • VQGAN+CLIP Keyword Modifiers
  • Due to the bias in CLIP, adding certain keywords to the prompt can give a certain effect to the generated image. For example, adding "unreal engine" to the text prompt tends to generate a realistic or HD style. Adding certain site names such as "deviantart", "artstation", or "flickr" usually makes the results more aesthetic. My favorite is the "artstation" keyword, as I find it generates the best art.
  • Additionally, you can use keywords to condition the art style: for example, "pencil sketch", "low poly", or even an artist's name such as "Thomas Kinkade" or "James Gurney". A small example follows.
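Keyword modifiers are simply part of the prompt text, so no special option is needed (hypothetical prompt):

```python
# The style keyword is appended directly to the prompt string.
clipit.add_settings(prompts="a castle on a hill. artstation")
```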
  • To explore the effect of various keywords further, you can check out the full experiment results by kingdomakrillic, which show the results of 200+ keywords applied to the same 4 subjects.
  • Building UI with Gradio
  • My first plan for deploying an ML model was to use Gradio. Gradio is a Python library that simplifies building ML demos into just a few lines of code. With Gradio, you can build a demo in less than 10 minutes. Additionally, you can run Gradio in Colab and it will generate a shareable link using the Gradio domain. You can instantly share this link with your friends or the public to let them try out your demo. Gradio still has some limitations, but I find it the most suitable library to use when you just want to demonstrate a single function.
  • So here is the code I wrote to build a simple UI for the Text2Art app. I think the code is quite self-explanatory, but if you need more explanation, you can read the Gradio documentation.
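A minimal sketch of such a UI, assuming a hypothetical generate_image(prompt) helper that wraps the clipit calls shown earlier:

```python
import gradio as gr

def generate_image(prompt):
    # Hypothetical wrapper around the clipit calls shown earlier;
    # it should return the generated image (e.g. a PIL image).
    ...

# One text input, one image output; Gradio builds the whole page for us.
iface = gr.Interface(fn=generate_image, inputs="text", outputs="image", title="Text2Art")

# share=True creates a public link on the Gradio domain.
iface.launch(share=True)
```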
  • Once you run this in Google Colab or locally, it will generate a shareable link that makes your demo publicly accessible. I find this extremely useful, as I don't need to set up SSH tunneling with something like Ngrok on my own to share my demo. Additionally, Gradio offers a hosting service where you can permanently host your demo for only $7/month.
  • However, Gradio only works well for demoing a single function. Creating a custom site with additional features like a gallery, login, or even just custom CSS is fairly limited or not possible at all.
  • One quick solution I could think of was to create my demo site separately from the Gradio UI and embed the Gradio UI on the site using an iframe element. I initially tried this method but then realized one important drawback: I could not personalize any parts that need to interact with the ML app itself. For example, things such as input validation and a custom progress bar are not possible within an iframe. This is when I decided to build an API instead.
  • Serving ML Model with FastAPI
  • I’ve been using FastAPI instead of Flask to quickly build my API. The main reason is that I find FastAPI faster to write (less code), and it also auto-generates documentation (using Swagger UI) that allows me to test the API with a basic UI. Additionally, FastAPI supports asynchronous functions and is said to be faster than Flask.
  • Here is the code I wrote to serve my ML function as a FastAPI server.
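A minimal sketch of the server, reusing the hypothetical generate_image helper from the Gradio section (the endpoint path and response type are assumptions):

```python
from fastapi import FastAPI
from fastapi.responses import FileResponse

app = FastAPI()

@app.get("/generate")
def generate(prompt: str):
    # generate_image is the hypothetical clipit wrapper from earlier;
    # here we assume it writes the result to disk and returns the path.
    image_path = generate_image(prompt)
    return FileResponse(image_path, media_type="image/png")
```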
  • Once we have defined the server, we can run it using uvicorn. Additionally, because Google Colab only allows access to its servers through the Colab interface, we have to use Ngrok to expose the FastAPI server to the public.
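A common pattern for this, assuming the pyngrok and nest_asyncio packages (nest_asyncio lets uvicorn run inside Colab's already-running event loop):

```python
import nest_asyncio
import uvicorn
from pyngrok import ngrok

# Open a public tunnel to the local port and print the public URL.
public_url = ngrok.connect(8000)
print("Public URL:", public_url)

# Colab already runs an asyncio event loop, so patch it before starting uvicorn.
nest_asyncio.apply()
uvicorn.run(app, host="0.0.0.0", port=8000)
```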
  • Once we run the server, we can head to the Swagger UI (by adding /docs to the generated ngrok URL) and test out the API.
  • While testing the API, I realized that inference can take about 3–20 minutes depending on the quality/iterations. Even 3 minutes is already considered very long for an HTTP request, and users may not want to wait that long on the site. Because of the long inference time, I decided that running the inference as a background task and emailing the user once the result is done would be more suitable.
  • Now that we have decided on the plan, we will first write the function to send the email. I initially used the SendGrid email API to do this, but after running out of the free usage quota (100 emails/day), I switched to the Mailgun API, since it is part of the GitHub Student Developer Pack and allows students 20,000 emails/month.
  • So here is the code to send an email with an image attachment using the Mailgun API.
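A sketch using the requests library against Mailgun's HTTP API; MAILGUN_DOMAIN and MAILGUN_API_KEY are placeholders for your own credentials, and the sender/subject strings are illustrative:

```python
import requests

MAILGUN_DOMAIN = "mg.example.com"  # placeholder: your Mailgun domain
MAILGUN_API_KEY = "key-..."        # placeholder: your Mailgun API key

def send_email_with_image(to_email, image_path):
    # Mailgun accepts attachments as multipart file uploads.
    with open(image_path, "rb") as f:
        return requests.post(
            f"https://api.mailgun.net/v3/{MAILGUN_DOMAIN}/messages",
            auth=("api", MAILGUN_API_KEY),
            files=[("attachment", ("result.png", f.read()))],
            data={
                "from": f"Text2Art <noreply@{MAILGUN_DOMAIN}>",
                "to": to_email,
                "subject": "Your Text2Art image is ready!",
                "text": "Here is the art generated from your prompt.",
            },
        )
```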
  • Next, we will modify our server code to use FastAPI's background tasks and send the result through email in the background.
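A sketch using FastAPI's built-in BackgroundTasks, combining the helpers above:

```python
from fastapi import BackgroundTasks, FastAPI

app = FastAPI()

def generate_and_send(prompt: str, email: str):
    # Run the (slow) generation, then email the result.
    image_path = generate_image(prompt)       # hypothetical clipit wrapper
    send_email_with_image(email, image_path)  # Mailgun helper from above

@app.post("/generate")
def generate(prompt: str, email: str, background_tasks: BackgroundTasks):
    # Schedule the heavy work to run after the response is sent.
    background_tasks.add_task(generate_and_send, prompt, email)
    return {"message": "Task is processed in the background"}
```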
  • With the code above, the server will quickly reply to the request with the “Task is processed in the background” message instead of waiting for the generation process to finish and replying with the image.
  • Once the process is finished, the server will send the result by emailing the user.
  • Now that everything seemed to be working, I built the front end and shared the site with my friends. However, I found a concurrency problem when testing it out with multiple users.
  • When a second user made a request to the server while the first task was still processing, the second task would somehow terminate the current process instead of creating a parallel process or queueing. I was not sure what caused this; maybe it was the use of global variables in the clipit code, maybe not. I did not spend too much time debugging it, as I realized that I needed to implement a message queue system instead.
  • After a few Google searches on message queue systems, most results recommended RabbitMQ or Redis. However, I was not sure whether RabbitMQ or Redis could be installed on Google Colab, as they seem to require sudo permission. In the end, I decided to use Google Firebase as a queue system instead, as I wanted to finish the project ASAP and Firebase is the one I'm most familiar with.
  • Basically, when the user tries to generate art on the frontend, it adds an entry to a collection named queue describing the task (prompt, image type, size, etc.). Meanwhile, we run a script on Google Colab that continuously listens for new entries in the queue collection and processes the tasks one by one. A sketch of the worker loop follows.
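A sketch of such a worker using the google-cloud-firestore client; the created_at field and the simple polling loop are assumptions about how the queue could be processed in order:

```python
import time
from google.cloud import firestore

db = firestore.Client()  # assumes Firebase credentials are already configured

while True:
    # Fetch the oldest pending task (assumes each entry stores a created_at timestamp).
    tasks = db.collection("queue").order_by("created_at").limit(1).get()
    if tasks:
        doc = tasks[0]
        task = doc.to_dict()
        image_path = generate_image(task["prompt"])       # hypothetical clipit wrapper
        send_email_with_image(task["email"], image_path)  # Mailgun helper from earlier
        doc.reference.delete()                            # remove the finished task
    else:
        time.sleep(5)  # idle until new work arrives
```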
  • On the front end, we only have to add a new task to the queue, as sketched below. But make sure you have done a proper Firebase setup on your front end.
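The actual front end would use the Firebase JavaScript SDK; for illustration, here is the equivalent write with the Python client (the field names match the worker sketch above and are assumptions):

```python
# Enqueue a task; the Colab worker picks it up and emails the result.
db.collection("queue").add({
    "prompt": "underwater city",
    "email": "user@example.com",
    "created_at": firestore.SERVER_TIMESTAMP,  # lets the worker process tasks in order
})
```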
  • And it’s done! Now, when a user tries to generate art in the frontend, it will add a new task in the queue. The worker script in the Colab server will then process the tasks in the queue one by one. You can check out the GitHub repo to see the full code (feel free to star the repo).
  • If you enjoyed my writing, check out my other articles!
  • Feel free to connect with me on LinkedIn as well.
  • References
  • [1] https://openai.com/blog/dall-e/
  • [2] https://openai.com/blog/clip/
  • [3] https://ljvmiranda921.github.io/notebook/2021/08/08/clip-vqgan/
  • [4] https://github.com/orpatashnik/StyleCLIP
  • [5] https://towardsdatascience.com/understanding-flask-vs-fastapi-web-framework-fe12bb58ee75
