Since Sora made its debut earlier this year, companies at home and abroad have been racing to use AI to disrupt Hollywood. The AI video scene has been lively recently, with products released one after another, all claiming to surpass Sora.

Two foreign AI video startups took the lead. Luma, a San Francisco artificial intelligence company, launched the Dream Machine video generation model along with a movie-grade promotional video, and opened the product to users for a free trial. Runway, another startup with a name in the AI video field, announced that it will open its Gen-3 Alpha model to some users for testing, claiming it can render details such as light and shadow.

Not to be outdone, Kuaishou launched the Keling web app, which lets users generate videos up to 10 seconds long and offers control over the first and last frames as well as camera movement. Its original AI fantasy short drama "The Mirror of Mountains and Seas: Cutting the Waves" is airing on Kuaishou, with all footage generated by AI. The AI science fiction short drama "Sanxingdui: Apocalypse of the Future", produced with ByteDance's AI video product Jimeng, was also broadcast recently.

AI video is updating so quickly that many netizens exclaimed, "Hollywood may be on strike again." Today the AI video track includes domestic and foreign technology and internet giants such as Google, Microsoft, Meta, Alibaba, ByteDance, and Meitu, as well as emerging companies such as Runway and Aishi Technology. According to incomplete statistics from "Dingjiao", in China alone about 20 companies have launched self-developed AI video products or models. According to data from TouBao Research Institute, the market size of China's AI video generation industry was 8 million yuan in 2021, and it is expected to reach 9.279 billion yuan by 2026.
Many industry insiders believe the video generation track will have its "Midjourney moment" in 2024. What stage have the Soras of the world reached? Who is the strongest? Can AI take down Hollywood?

1. Besieging Sora: many products, few usable ones

Many products and models have been launched on the AI video track, but only a few can actually be used by the public. The most prominent representative abroad is Sora, which is still in internal testing half a year after its announcement and is open only to security teams and some visual artists, designers, and filmmakers. The situation in China is similar: Alibaba Damo Academy's AI video product "Xunguang" and Baidu's AI video model UniVG are both in internal testing. Even the currently popular Kuaishou Keling requires users to queue up and apply. That alone rules out more than half of the products.

Some of the remaining AI video products set usage thresholds, requiring payment or a certain amount of technical knowledge. Open-Sora from Luchen Technology, for example, is hard to approach for users without at least some knowledge of code.

"Dingjiao" surveyed the AI video products released at home and abroad and found that their operation and functions are similar: users first write a text instruction, then select options such as frame size, image clarity, generation style, and duration in seconds, and finally click a button to generate.

The technical difficulty behind these functions varies. The hardest are the clarity and duration of the generated video, which is also the focus of competition, and of publicity, among AI video companies. The reason is closely tied to the quality of the training material and the scale of computing power.
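The "fill in options, click generate" flow described above can be sketched as a small request object. This is a hypothetical illustration only: none of these parameter names come from any real product's API; they simply mirror the options the article lists (frame size, clarity, style, seconds).

```python
# Hypothetical sketch of a video-generation request, mirroring the options
# the article describes. The field names are assumptions, not a real API.
from dataclasses import dataclass, asdict

@dataclass
class VideoRequest:
    prompt: str                 # the user's text instruction
    aspect_ratio: str = "16:9"  # frame size
    resolution: str = "720p"    # image clarity
    style: str = "realistic"    # generation style
    duration_s: int = 4         # generation length in seconds

def build_request(prompt: str, **options) -> dict:
    """Collect the user's choices into a single generation request."""
    return asdict(VideoRequest(prompt=prompt, **options))

req = build_request(
    "A little girl in a red skirt feeding a white rabbit carrots in the park",
    duration_s=5,
)
print(req["resolution"], req["duration_s"])  # unchanged default plus override
```

The point of the sketch is that the user-facing surface of these products is nearly identical; the competition happens in the model behind the button.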
AI researcher Cyrus told "Dingjiao" that most AI video products at home and abroad currently generate at 480p/720p, and a small number support 1080p high-definition video. The more high-quality material and the more computing power available, he said, the higher the quality of video the trained model can generate; but high-quality material and computing power alone do not guarantee high-quality output. If a model trained on low-resolution material is forced to generate high-resolution video, it will collapse or repeat, producing artifacts such as extra hands and feet. Such problems can be mitigated by upscaling, repairing, and redrawing, but the results and details are mediocre.

Many companies also use long durations as a selling point. Most domestic AI video products support 2-3 seconds, and those that reach 5-10 seconds are considered relatively powerful. A few very popular products, such as Jimeng, can reach up to 12 seconds. None of them, however, can match Sora, which has claimed a maximum of 60 seconds of video; since it has not yet been opened for use, that claim cannot be verified.

Length alone is not enough; the content of the generated video must also hold together. Zhang Heng, chief researcher at Shiliu AI, told "Dingjiao": technically, AI can be made to output continuously. It is no exaggeration to say that even generating an hour of video is not a problem, but most of the time what we want is not surveillance footage or a looping landscape animation, but a short film with beautiful pictures and a story.
"Dingjiao" tested five popular free Chinese AI video products, namely ByteDance's Jimeng, Morph AI's Morph Studio, Aishi Technology's PixVerse, MewXAI's Yiying AI, and Right Brain Technology's Vega AI, giving each the same text instruction: "A little girl in a red skirt is feeding a white rabbit with carrots in the park."

The generation speed of the products is similar, taking only 2-3 minutes, but clarity and duration vary greatly, and accuracy is even more uneven. The strengths and weaknesses of each are obvious. Jimeng wins on duration, but its image quality is not high, and the main character, the little girl, becomes deformed partway through. Vega AI has the same problem. PixVerse's image quality is poor. By comparison, the content generated by Morph is very accurate, but only 2 seconds long. Yiying's image quality is also good, but it does not understand the text well enough and drops the key element, the rabbit; its output is not realistic enough and tends toward the cartoonish. In short, no product has yet delivered a video that meets the requirements.

2. AI video challenges: accuracy, consistency, and richness

"Dingjiao"'s experience was very different from the promotional videos released by these companies. If AI video is to be truly commercialized, there is still a long way to go. Zhang Heng told "Dingjiao" that from a technical perspective, they mainly evaluate AI video models along three dimensions: accuracy, consistency, and richness. He gave an example to illustrate the three: suppose we generate a video of "two girls watching a basketball game on the playground".
Accuracy shows up in three places. First, accurate understanding of content structure: the objects in the video must be girls, and there must be two of them. Second, accurate process control: after a shot is made, the basketball must gradually fall from the net. Third, accurate static modeling: when the camera view is partly occluded, the basketball cannot turn into a rugby ball.

Consistency refers to AI's modeling capability across time and space, which includes subject attention and long-term attention. Subject attention means that, while watching the basketball game, the two girls must stay in the picture and cannot wander off; long-term attention means that during motion, the elements in the video cannot be lost, deformed, or otherwise corrupted.

Richness means the AI has its own logic and can generate reasonable details even without text prompts.

Essentially none of the AI video tools on the market fully achieve all of these dimensions, and companies keep proposing workarounds. On character consistency, for example, Jimeng and Keling hit on the idea of replacing text-to-video with image-to-video: users first generate pictures from text and then generate video from the pictures, or directly supply one or two pictures for the AI to connect into a moving video. "But this is not a new technological breakthrough, and image-to-video is easier than text-to-video," Zhang Heng told "Dingjiao".

The principle of text-to-video is that the AI first analyzes the user's text, breaks it into a group of shot descriptions, converts each description into a picture, and thereby obtains the key frames of the video. Connecting these pictures yields a continuous video with motion.
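The text-to-video pipeline just described (text → shot descriptions → keyframes → connected motion) can be sketched as a toy program. This is only a structural illustration under stated assumptions: the "models" below are stand-in stubs, not real generators, and every function name is invented for the sketch.

```python
# Toy sketch of the text-to-video pipeline described above:
# text -> shot descriptions -> keyframes -> in-between frames.
# All three "model" functions are stubs standing in for real networks.

def split_into_shots(prompt: str) -> list[str]:
    """Stand-in for the step that breaks a prompt into shot descriptions."""
    return [f"shot {i}: {prompt}" for i in range(2)]

def generate_keyframe(shot: str) -> dict:
    """Stand-in for a text-to-image step: one keyframe per shot."""
    return {"desc": shot, "pixels": None}

def interpolate(a: dict, b: dict, n: int) -> list[dict]:
    """Stand-in for the step that connects two keyframes into motion."""
    return [{"between": (a["desc"], b["desc"]), "t": i / n} for i in range(n)]

def text_to_video(prompt: str, frames_per_gap: int = 8) -> list:
    keyframes = [generate_keyframe(s) for s in split_into_shots(prompt)]
    frames = []
    for a, b in zip(keyframes, keyframes[1:]):
        frames.extend(interpolate(a, b, frames_per_gap))
    frames.append(keyframes[-1])
    return frames

clip = text_to_video("two girls watching a basketball game")
print(len(clip))  # 8 in-between frames plus the final keyframe = 9
```

The structure makes the article's point visible: consistency problems live in the `interpolate` step, where elements must survive the journey between keyframes without being lost or deformed.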
Image-to-video is equivalent to giving the AI a specific picture to imitate; the generated video carries over the facial features in the picture, achieving consistency for the protagonist. He also said that in real scenarios, image-to-video output better matches user expectations, because text has limited power to express picture details and a reference picture helps generation, but it is not yet commercially ready. Intuitively, 5 seconds is about the upper limit for image-to-video, and going beyond 10 seconds is rarely meaningful: either the content repeats, or the structure distorts and quality drops. Many short films and dramas that currently claim to be produced entirely with AI mostly rely on image-to-video or video-to-video.

"Dingjiao" deliberately tried the last-frame function of Jimeng's image-to-video; in the stitched result, the characters appeared deformed and distorted. Cyrus also noted that video requires continuity, and many image-to-video tools infer subsequent motion from a single frame; whether the inference is right is largely a matter of luck.

It is understood that for protagonist consistency, companies do not rely on data generation alone. Zhang Heng said most models take the underlying DiT large model and layer on additional techniques, such as ControlVideo (a controllable text-to-video generation method proposed by Harbin Institute of Technology and Huawei Cloud), to deepen the AI's memory of the protagonist's facial features so the face does not drift during motion. But this is still at the trial stage; even with these techniques, the problem of character consistency has not been completely solved.
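The "first and last frame" idea above, where the user supplies two endpoint images and the model fills in the motion, can be illustrated with a toy in-betweening sketch. Here a real generative model is replaced by plain linear interpolation, which is an assumption made purely for illustration; real products infer motion rather than blend pixels.

```python
# Toy sketch of first/last-frame in-betweening. Linear interpolation stands
# in for the generative model; a frame is just a flat list of intensities.

Frame = list[float]

def inbetween(first: Frame, last: Frame, steps: int) -> list[Frame]:
    """Produce `steps` intermediate frames between the two endpoints."""
    frames = []
    for i in range(1, steps + 1):
        t = i / (steps + 1)  # 0 < t < 1, evenly spaced
        frames.append([(1 - t) * a + t * b for a, b in zip(first, last)])
    return frames

first = [0.0, 0.0, 0.0]
last = [1.0, 1.0, 1.0]
video = [first] + inbetween(first, last, 3) + [last]
print(len(video), video[2])  # 5 frames; the middle frame is halfway
```

Linear blending can only morph, never invent, which is why real systems must guess the motion in between, and why, as Cyrus says, whether that guess is right is partly luck.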
3. Why is AI video evolving so slowly?

In the AI circle, the United States and China are currently the most competitive. The report "The World's Most Influential Artificial Intelligence Scholars in 2023" (the "AI 2000 Scholars" list) shows that among the 1,071 global "AI 2000 Institutions" across the four years from 2020 to 2023, the United States has 443, followed by China with 137. By country, the 2023 "AI 2000 Scholars" list includes 1,079 people from the United States, 54.0% of the global total, followed by China with 280.

In the past two years, besides great progress in visual images and music, AI has also made some breakthroughs in the hardest area of all: AI video. At the recent World Artificial Intelligence Conference, Yitian Capital partner Le Yuan said publicly that video generation technology has progressed far beyond expectations in the past two to three years. Liu Ziwei, assistant professor at Nanyang Technological University in Singapore, believes video generation is currently in its GPT-3 era and will take about half a year to mature. However, Le Yuan also stressed that the technology is still not ready to support large-scale commercialization. The methodology and challenges of building applications on language models also apply to video.

At the beginning of the year, Sora's debut shocked the world. Built on DiT, a new diffusion model with a transformer architecture, it made technical breakthroughs in diffusion and generation that improved the quality and realism of image generation, marking a major step for AI video. Cyrus said most text-to-video products at home and abroad currently use similar technology.
At this point, everyone's underlying technology is basically the same. While each company seeks breakthroughs on top of it, richer product functions mostly require more training data. When using ByteDance's Jimeng and Morph AI's Morph Studio, users can choose how the camera moves; the difference behind this lies in the datasets. "In the past, the images companies used in training were relatively simple. They mostly labeled the elements in the images without describing what kind of shot was used to film them. Many companies spotted this gap and used 3D-rendered video datasets to fill in the camera features," Zhang Heng said, adding that these data currently come from renderings made by the film, television, and game industries. "Dingjiao" tried this function too, but the camera changes were not very obvious.

The reason Sora and its peers are developing more slowly than GPT and Midjourney is that video has a time dimension, and training video models is harder than training text or image models. "All the video training data that can be used has been mined, and we are also thinking of new ways to create data usable for training," said Zhang Heng. Moreover, each AI video model has its own style. Kuaishou Keling, for example, does better on food videos because it is backed by a large amount of such data.

Shen Renkui, founder of Shiliu AI, believes AI video technologies include text-to-video, image-to-video, video-to-video, and avatar-to-video. Digital humans with customized looks and voices are already used in marketing and have reached commercial grade, while avatar video still needs to solve the problems of accuracy and controllability.
For now, whether it is the AI sci-fi short drama "Sanxingdui: Apocalypse of the Future" co-produced by Douyin and Bona, or Kuaishou's original AI fantasy short drama "The Mirror of Mountains and Seas: Cutting the Waves", these are largely cases of large-model companies actively seeking out film and television production teams in order to promote their own technology, and the works have not broken out of the circle. In short-form video, AI still has a long way to go, and it is premature to say it has taken over Hollywood.

Author: Wang Lu Source: WeChat public account: "Dingjiaoone (ID: dingjiaoone)"