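Text-to-image model

A text-to-image model is a machine learning model that generally takes a natural language description as input and produces an image matching that description. Development of such models began in the mid-2010s and advanced alongside deep neural network techniques. In 2022, the output of state-of-the-art text-to-image models, such as OpenAI's DALL-E 2, Google Brain's Imagen, and StabilityAI's Stable Diffusion, began to approach the quality of real photographs and human-drawn art.

Text-to-image models generally combine a language model, which transforms the input text into a machine-readable (latent) representation, with a generative image model, which produces the image. The most effective models have generally been trained on massive amounts of image and text data scraped from the web.^[1]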

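History

Before the rise of deep learning, attempts to build text-to-image models were limited to collage-like images formed by arranging existing component images, such as material from a clip art database.^[2]^[3]

The inverse task, image captioning, was more tractable, and a number of captioning models appeared before the first text-to-image models.^[4]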

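The first modern text-to-image model, alignDRAW, was introduced in 2015 by researchers from the University of Toronto. It extended the earlier DRAW architecture (which used a recurrent variational autoencoder with an attention mechanism) so that it could take text sequences as input.^[4] Images generated by alignDRAW were blurry and not photorealistic, but the model was able to generalize to objects not present in the training data (such as a red school bus) and handled novel prompts such as "a stop sign is flying in blue skies" appropriately, showing that it was not merely "replaying" data from the training set.^[4]^[5]

Eight images generated by alignDRAW (2015) from the text prompt "A stop sign is flying in blue skies" (enlarged to show detail).^[6]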

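In 2016, Reed, Akata, Yan et al. were the first to apply generative adversarial networks to the text-to-image task.^[5]^[7] With models trained on narrow, domain-specific datasets, they were able to generate "visually plausible" images of birds and flowers from text captions such as "an all black bird with a distinct thick, rounded bill". A model trained on the more diverse COCO dataset produced images that were "from a distance... encouraging" but lacked coherence in their details.^[5] Later systems include VQGAN+CLIP,^[8] XMC-GAN, and GauGAN2.^[9]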

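One of the first text-to-image models to capture widespread public attention was OpenAI's DALL-E, a Transformer system announced in January 2021.^[10] DALL-E 2, capable of generating more complex and realistic images, followed in April 2022,^[11] and Stable Diffusion was publicly released in August 2022.^[12]

Following other text-to-image models, text-to-video platforms driven by language models began to emerge, such as Runway, Make-A-Video,^[13] Imagen Video,^[14] Midjourney,^[15] and Phenaki,^[16] which can generate video from text and/or text-and-image prompts.^[17]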

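Architecture and training

High-level architecture diagram of machine learning models for AI art, showing the more influential models and applications in the field and the relationships and dependencies between them.

Text-to-image models have been built using a variety of architectures. The text encoding step may be performed with a recurrent neural network such as a long short-term memory (LSTM) network, though Transformer models have since become more popular. For the image generation step, conditional generative adversarial networks have commonly been used, with diffusion models also becoming popular in recent years. Rather than directly training a model that takes text as input and outputs a high-resolution image, a common technique is to first train a model to generate low-resolution images and then use one or more auxiliary deep learning models to upscale them and fill in finer details.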
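The staged data flow described above (text encoding, low-resolution generation, then one or more upscaling stages) can be sketched structurally as follows. This is only a toy illustration: `encode_text`, `generate_low_res`, `upsample`, and `text_to_image` are hypothetical names, and each function is a trivial stand-in (a hash, a seeded random grid, nearest-neighbour scaling) for what would be a trained neural network in a real system. In particular, a real upscaling model adds detail, which nearest-neighbour scaling does not.

```python
import hashlib
import random
from typing import List

Image = List[List[float]]  # grayscale grid with values in [0, 1)


def encode_text(prompt: str, dim: int = 8) -> List[float]:
    """Toy stand-in for a text encoder (an LSTM or Transformer in real systems)."""
    digest = hashlib.sha256(prompt.lower().encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dim]]


def generate_low_res(embedding: List[float], size: int = 4) -> Image:
    """Toy stand-in for a conditional generator (a GAN or diffusion model)."""
    # Seed from the embedding so the output depends only on the text.
    rng = random.Random(repr(embedding))
    return [[rng.random() for _ in range(size)] for _ in range(size)]


def upsample(image: Image, factor: int = 2) -> Image:
    """Toy stand-in for an upscaling model: repeat each pixel factor x factor times."""
    return [
        [pixel for pixel in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


def text_to_image(prompt: str) -> Image:
    """Chain the stages: text -> embedding -> 4x4 image -> 8x8 -> 16x16."""
    embedding = encode_text(prompt)
    low_res = generate_low_res(embedding)
    return upsample(upsample(low_res))


image = text_to_image("a stop sign is flying in blue skies")
print(len(image), len(image[0]))
```

The point of the sketch is the interface between stages: the generator never sees raw text, only the encoder's output, and each upscaling stage sees only the image produced by the stage before it.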

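Text-to-image models are trained on large datasets of (text, image) pairs, often scraped from the web. With its 2022 Imagen model, Google Brain reported positive results from using a large language model trained only on text data (with its weights subsequently frozen), a departure from the previously standard approach.^[18]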

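Datasets

Examples of text-image pairs from three different public datasets. Data used to train text-to-image models is commonly drawn from such datasets.

Training a text-to-image model requires a dataset of images paired with text captions. One dataset commonly used for this purpose is COCO (Common Objects in Context), released by Microsoft in 2014. It consists of around 123,000 images depicting a variety of objects, with five captions per image written by human annotators. Oxford-120 Flowers and CUB-200 Birds are smaller datasets of around 10,000 images each, restricted to flowers and birds, respectively. Because their subject matter is narrow, training a high-quality, in-domain text-to-image model on them is less difficult.^[7]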
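As a rough illustration of what such a captioned dataset looks like in memory, the sketch below flattens images that each carry several captions into individual (caption, image) training pairs. The `CaptionedImage` class, the `to_training_pairs` helper, the file paths, and the captions are all hypothetical and are not the actual COCO file format. With the figures above, about 123,000 images with five captions each would give roughly 615,000 pairs.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CaptionedImage:
    image_path: str      # hypothetical path to the image file
    captions: List[str]  # human-written captions describing the image


def to_training_pairs(dataset: List[CaptionedImage]) -> List[Tuple[str, str]]:
    """Flatten multi-caption records into (caption, image_path) pairs."""
    return [
        (caption, item.image_path)
        for item in dataset
        for caption in item.captions
    ]


# Invented example records with five captions per image.
toy_dataset = [
    CaptionedImage(
        "images/0001.jpg",
        [
            "a red school bus parked on a street",
            "a bus waiting at the curb",
            "a red bus next to a sidewalk",
            "a school bus on a sunny day",
            "a large red vehicle on a road",
        ],
    ),
    CaptionedImage(
        "images/0002.jpg",
        [
            "a black bird sitting on a branch",
            "a bird with a thick rounded bill",
            "a dark bird perched in a tree",
            "a small bird on a twig",
            "a bird looking to the left",
        ],
    ),
]

pairs = to_training_pairs(toy_dataset)
print(len(pairs))
```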

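Evaluation

Evaluating the quality of text-to-image models is challenging and involves assessing several different properties. As with any generative image model, the generated images should be realistic (appearing as if they could plausibly have come from the training set) and diverse in style. A requirement specific to text-to-image models is that generated images align semantically with the text captions used to generate them. A number of schemes have been devised for assessing this alignment, some automated and others based on human judgment.^[7]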

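A common algorithmic metric for assessing image quality and diversity is the Inception score (IS), which is based on the distribution of labels predicted by a pretrained Inception v3 image classification model when applied to a sample of images generated by the text-to-image model. The score increases when the classifier assigns a single label a high probability, a scheme intended to favour "distinct" generated images. Another well-known, related metric is the Fréchet inception distance (FID), which compares the distributions of generated images and real training images according to features extracted by one of the final layers of a pretrained image classification model.^[7]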
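To make the Inception score concrete, the sketch below computes it from a list of per-image label distributions using the usual definition: the exponential of the average KL divergence between each image's predicted distribution p(y|x) and the marginal distribution p(y) over all images. The probabilities here are invented for illustration; in practice they would come from running Inception v3 on generated images, which this sketch does not do.

```python
import math
from typing import List


def inception_score(label_probs: List[List[float]]) -> float:
    """Return exp(mean over images of KL(p(y|x) || p(y)))."""
    n = len(label_probs)
    num_classes = len(label_probs[0])
    # Marginal label distribution p(y): average of the per-image predictions.
    marginal = [sum(p[k] for p in label_probs) / n for k in range(num_classes)]
    total_kl = 0.0
    for p in label_probs:
        # Labels with zero predicted probability contribute nothing to the KL sum.
        total_kl += sum(
            pk * math.log(pk / mk) for pk, mk in zip(p, marginal) if pk > 0
        )
    return math.exp(total_kl / n)


# Hypothetical classifier outputs over three labels.
confident_and_varied = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
uncertain = [[1 / 3, 1 / 3, 1 / 3]] * 3

print(inception_score(confident_and_varied))
print(inception_score(uncertain))
```

The first sample set is both confident (each image gets a single label) and varied (the labels differ across images), so it attains the maximum possible score, which equals the number of labels. The second set, where the classifier is maximally unsure about every image, gets the minimum score of 1.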

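Impact and applications

The exhibition "Thinking Machines: Art and Design in the Computer Age, 1959–1989" at the Museum of Modern Art in New York provided an overview of AI applications in art, architecture, and design. Exhibitions showcasing the use of AI to produce art include the 2016 Google-sponsored benefit and auction at the Gray Area Foundation in San Francisco and the 2017 exhibition "Unhuman: Art in the Age of AI", held in Los Angeles and Frankfurt, where artists experimented with the DeepDream algorithm. In spring 2018, the Association for Computing Machinery dedicated a magazine issue to the subject of computers and art. In June 2018, "Duet for Human and Machine", an artwork that allows viewers to interact with an AI, premiered at the Beall Center for Art + Technology. The Austrian Ars Electronica and the Museum of Applied Arts, Vienna, opened exhibitions on AI in 2019. Ars Electronica's 2019 festival theme, "Out of the box", explored the role of the arts in a sustainable societal transformation.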

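Applications of image generation are flourishing online and have begun to extend into traditional fields outside cutting-edge technology. For example, 小秋子繪本^[19] is a case of using text-to-image generation to assist speech therapy.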

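In September 2022, one expert concluded that "AI art is everywhere right now", with even experts not knowing what it will mean.^[20] One news outlet established that "AI art is booming" and reported on copyright and automation concerns among professional artists,^[21] another examined how online communities reacted when flooded with such works,^[22] and others raised concerns about deepfakes.^[23] One magazine highlighted the possibility of enabling "new forms of artistic expression",^[24] and an editorial noted that it may be seen as a welcome "augmentation of human abilities" (Vincent, James. Anyone can use this AI art generator — that's the risk. The Verge. 2022-09-15. Retrieved 2022-11-09. Archived from the original on 2023-02-14).^[25]^[26]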

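Image generated with Midjourney from the prompt "swimming pool filled with a galaxy on a moonlit night".

Examples of such augmentation may include enabling amateurs to expand noncommercial niche genres (commonly cyberpunk derivatives such as solarpunk).

Synthetic media, which includes AI art, was described in 2022 as a major technology-driven trend that may affect business in the coming years.^[26]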

See also

References

[1] Vincent, James. All these images were generated by Google's latest text-to-image AI. The Verge (Vox Media). May 24, 2022. Retrieved 2022-05-28. Archived from the original on 2023-02-15.
[2] Agnese, Jorge; Herrera, Jonathan; Tao, Haicheng; Zhu, Xingquan. A Survey and Taxonomy of Adversarial Neural Networks for Text-to-Image Synthesis (PDF). 2019-10. Retrieved 2023-01-14. arXiv:1910.09399. Archived (PDF) from the original on 2023-03-16.
[3] Zhu, Xiaojin; Goldberg, Andrew B.; Eldawy, Mohamed; Dyer, Charles R.; Strock, Bradley. A text-to-picture synthesis system for augmenting communication (PDF). AAAI. 2007, 7: 1590–1595. Retrieved 2023-01-14. Archived (PDF) from the original on 2022-09-07.
[4] Mansimov, Elman; Parisotto, Emilio; Lei Ba, Jimmy; Salakhutdinov, Ruslan. Generating Images from Captions with Attention. ICLR. 2015-11. Retrieved 2023-01-14. arXiv:1511.02793. Archived from the original on 2023-04-14.
[5] Reed, Scott; Akata, Zeynep; Logeswaran, Lajanugen; Schiele, Bernt; Lee, Honglak. Generative Adversarial Text to Image Synthesis (PDF). International Conference on Machine Learning. 2016-06. Retrieved 2023-01-14. Archived (PDF) from the original on 2023-03-16.
[6] Mansimov, Elman; Parisotto, Emilio; Ba, Jimmy Lei; Salakhutdinov, Ruslan. Generating Images from Captions with Attention (PDF). International Conference on Learning Representations. 2016-02-29. Retrieved 2023-01-14. arXiv:1511.02793. Archived (PDF) from the original on 2023-02-03.
[7] Frolov, Stanislav; Hinz, Tobias; Raue, Federico; Hees, Jörn; Dengel, Andreas. Adversarial text-to-image synthesis: A review. Neural Networks. 2021-12, 144: 187–209. PMID 34500257. S2CID 231698782. doi:10.1016/j.neunet.2021.07.019.
[8] Rodriguez, Jesus. 🌅 Edge#229: VQGAN + CLIP. thesequence.substack.com. Retrieved 2022-10-10. Archived from the original on 2022-12-04 (in English).
[9] Rodriguez, Jesus. 🎆🌆 Edge#231: Text-to-Image Synthesis with GANs. thesequence.substack.com. Retrieved 2022-10-10. Archived from the original on 2022-12-04 (in English).
[10] Coldewey, Devin. OpenAI's DALL-E creates plausible images of literally anything you ask it to. TechCrunch. 2021-01-05. Retrieved 2023-01-14. Archived from the original on 2021-01-06.
[11] Coldewey, Devin. OpenAI's new DALL-E model draws anything — but bigger, better and faster than before. TechCrunch. 2022-04-06. Retrieved 2023-01-14. Archived from the original on 2023-05-06.
[12] Stable Diffusion Public Release. Stability.Ai. Retrieved 2022-10-27. Archived from the original on 2022-08-30 (in British English).
[13] Kumar, Ashish. Meta AI Introduces 'Make-A-Video': An Artificial Intelligence System That Generates Videos From Text. MarkTechPost. 2022-10-03. Retrieved 2022-10-03. Archived from the original on 2022-12-01 (in American English).
[14] Edwards, Benj. Google's newest AI generator creates HD video from text prompts. Ars Technica. 2022-10-05. Retrieved 2022-10-25. Archived from the original on 2023-02-07 (in American English).
[15] Rodriguez, Jesus. 🎨 Edge#237: What is Midjourney?. thesequence.substack.com. Retrieved 2022-10-26. Archived from the original on 2022-12-04 (in English).
[16] Phenaki. phenaki.video. Retrieved 2022-10-03. Archived from the original on 2022-10-07.
[17] Edwards, Benj. Runway teases AI-powered text-to-video editing using written prompts. Ars Technica. 2022-09-09. Retrieved 2022-09-12. Archived from the original on 2023-01-27.
[18] Saharia, Chitwan; Chan, William; Saxena, Saurabh; Li, Lala; Whang, Jay; Denton, Emily; Kamyar Seyed Ghasemipour, Seyed; Karagol Ayan, Burcu; Sara Mahdavi, S.; Gontijo Lopes, Rapha; Salimans, Tim; Ho, Jonathan; J Fleet, David; Norouzi, Mohammad. Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 2022-05-23. Retrieved 2023-01-14. arXiv:2205.11487. Archived from the original on 2023-03-25.
[19] 小秋子繪本. 小秋子繪本. 小秋子繪本.
[20] Ocampo, Rodolfo. AI art is everywhere right now. Even experts don't know what it will mean. techxplore.com. Retrieved 2022-09-15. Archived from the original on 2023-01-19 (in English).
[21] As AI-generated art takes off - who really owns it?. Thomson Reuters Foundation. Retrieved 2022-09-15. Archived from the original on 2022-09-23.
[22] Edwards, Benj. Flooded with AI-generated images, some art communities ban them completely. Ars Technica. 2022-09-12. Retrieved 2022-09-15. Archived from the original on 2023-01-31 (in American English).
[23] Wiggers, Kyle. Deepfakes: Uncensored AI art model prompts ethics questions. TechCrunch. 2022-08-24. Retrieved 2022-09-15. Archived from the original on 2022-08-31.
[24] AI is reshaping creativity, and maybe that's a good thing. Dazed. 2022-08-18. Retrieved 2022-09-15. Archived from the original on 2023-01-23 (in English).
[25] AI-generated art illustrates another problem with computers | John Naughton. The Guardian. 2022-08-20. Retrieved 2022-09-15. Archived from the original on 2023-02-06 (in English).
[26] Elgan, Mike. How 'synthetic media' will transform business forever. Computerworld. 2022-11-01. Retrieved 2022-11-09. Archived from the original on 2023-02-10 (in English).

