8 AI video generation products tested, which one will become China's Sora?

With the release of Sora pushing video generation models into the spotlight, who can become the "Chinese version of Sora"? The author explores the question from three perspectives: product design, hands-on test results, and industry analysis. Recommended for anyone interested in AI video generation products.

At the beginning of 2024, nothing in the tech world was more exciting than the emergence of Sora.

Just like the LLM entrepreneurial boom brought by ChatGPT in early 2023, the release of Sora also pushed the video generation model to the forefront.

Tech giants are pushing their products aggressively, and startups are riding the wave.

On March 13, AI video model company Aishi Technology completed an A1 financing round of RMB 100 million; on March 12, Shengshu Technology completed an A round of RMB 100 million; on March 1, AI video generation SaaS provider Boolean Vector completed nearly RMB 10 million in financing...

Sora applied the DiT architecture to video generation at scale, fusing the previously independent diffusion model and large model approaches and opening a new chapter in the history of video generation models.

There is no doubt that a new technological storm is coming. Overnight, domestic video generation models large and small began vying for the label of "Chinese version of Sora".

To explore this question, Zi Quadrant conducted hands-on tests of existing domestic video generation products and, drawing on public information and data from third-party testing agencies, carried out a comprehensive evaluation of the current mainstream video generation models.

We will explore who can become the "Chinese version of Sora" from three perspectives: product design, actual test results, and industry analysis.

1. Who can replicate DiT's innovation?

Although the Sora trend has just reached China from across the ocean, video generation is not a new topic.

Prior to this, the track had already gone through several waves of revolution, including Runway's Gen-2, Pika 1.0, and Google's VideoPoet, finally arriving at the "Sora moment": better generation quality, longer duration, stronger logic, and greater stability.

Zi Quadrant has sorted out the basic landscape of domestic video large-model companies and products.

▲Figure: List of domestic and foreign video generation large model companies, with visits calculated as of February 2024

Overseas, "Silicon Valley old money" such as Google and Microsoft have long invested in the research of multimodal video generation. Last year, Google released the multimodal large model Gemini and the VideoPoet video large model, which allowed people to see the possibility of multimodal video generation from an intuitive effect level.

In China, we see more possibilities along the multimodal technology path: there is Baidu, a large company with deep technical accumulation; Zhipu, a large-model unicorn; and startups such as Shengshu Technology and Zhixiang Future that target multimodal large models.

The diffusion model route is the mainstream route for text-to-video and plays an important role in guaranteeing generation quality. That is why even the stunning Sora is only a transformation of the underlying architecture, not a complete subversion.

This road is the most crowded, both at home and abroad. First came Stability AI, which built and open-sourced Stable Diffusion; then Runway and Pika, rushing forward; and then giants such as OpenAI, Meta, and Nvidia.

Back in China, Tencent, Alibaba, and ByteDance have almost monopolized the research in the field of video generation in the early stage, and occasionally throw out a demo to surprise people. But when it comes to landing products, startups are obviously one step ahead. For example, Aishi Technology, Morph Studio, Right Brain Technology and other companies have already opened their doors to users.

DiT, also known as the "Sora route", stands for Diffusion Transformer. Its essence is to bring the training methods and mechanisms of large models into the diffusion model. Judging from the results presented in the Sora technical report, at sufficient scale it may approach the effect of a world physics simulator.
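To make the architecture concrete, here is a deliberately tiny, purely illustrative sketch of the DiT idea. It is entirely our own construction, not Sora's actual code: a latent "frame" is cut into patch tokens, and a self-attention pass mixes them globally, which is the transformer ingredient DiT substitutes for the diffusion U-Net's convolutions.

```python
# A deliberately tiny, pure-Python illustration of the DiT idea (our own
# construction, NOT Sora's actual code): the diffusion denoiser operates on
# patch tokens with transformer-style self-attention instead of a U-Net.
import math
import random

def patchify(frame, patch):
    """Split a 2D latent 'frame' (list of lists) into flat patch tokens."""
    h, w = len(frame), len(frame[0])
    tokens = []
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            tokens.append([frame[i + di][j + dj]
                           for di in range(patch) for dj in range(patch)])
    return tokens

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """One attention pass with identity Q/K/V projections: every patch token
    becomes a similarity-weighted mix of all tokens, giving the denoiser the
    global receptive field that DiT swaps in for convolutions."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

random.seed(0)
frame = [[random.random() for _ in range(4)] for _ in range(4)]
tokens = patchify(frame, 2)        # a 4x4 latent frame -> 4 patch tokens
denoised = self_attention(tokens)  # one 'denoising' mixing step
print(len(tokens), len(denoised[0]))  # -> 4 4
```

A real DiT stacks many such blocks with learned projections, adds timestep and text conditioning (adaLN in the DiT paper), and runs inside an iterative denoising loop; the sketch only shows the structural swap from a convolutional grid to attention over patches.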

Today, Sora's underlying architecture has been thoroughly analyzed, and its training components and techniques are on the road to being open-sourced, but that does not mean everyone will have a Sora soon. Technology, data, computing power, and training scale are all hurdles.

Recently, the head of the Sora core team revealed in an interview: "Sora is still in the feedback acquisition stage, it is not yet a product, and will not be open to the public in the short term."

On the technology side, Aishi Technology is one of the few companies in China that has adhered to the DiT route from the beginning. Its founder Wang Changhu said in a public interview that the emergence of Sora verified the correctness of Aishi's direction for video generation large models. For this reason, Aishi Technology has set the goal of "surpassing Sora in 3-6 months" and seized the opportunity to catch up.

2. Product testing and user “running scores”

In the field of video generation models, domestic startups can be roughly divided into two categories.

One category comprises self-developed foundation large models, represented by PixVerse, PixWeaver, Morph Studio, and Pixeling, which focus on video generation tools for general scenarios.

The other category includes Vega AI, Li Bai AI Lab (promeai), 6PenArt, boolv.video, and MewXAI. This category is larger in number and more product-oriented, focusing on solving problems in specific scenarios, more like online AIGC editing platforms.

Our testing and evaluation consists of three parts: usage threshold, basic product functions and content generation capabilities.

The first is the usage threshold. The eight products we tested all support using the product through the website (many startup products can only be used through Discord), and they can all be tried for free.

However, only PixVerse from Aishi Technology has no limit on free trials. Other products allow three to five free tries; beyond that, you need to purchase a membership or top up credits, at prices ranging from a few yuan to several hundred yuan.

Except for PixVerse, the other products generally have functional limitations before payment. For example, Yiying AI and Pixeling can only generate 2s and 4s videos respectively; longer videos require payment.

Therefore, considering the usage threshold comprehensively, PixVerse is more user-friendly and has a relatively greater advantage in this area. The usage thresholds of other products are relatively average.

The specific situation is as follows:

The second is the basic functions of the product.

Of the 8 products we tested, all except PromeAI and 6PenArt can generate videos from both text and images. PromeAI and 6PenArt can only generate videos from images, not directly from text.

Apart from these two, other manufacturers are relatively mature, but the differences in product functions are quite large.

Among them, AiShi Technology's PixVerse has added rich auxiliary functions on top of the basic functions. For example, in addition to positive prompt words, users can also enter negative prompt words to require that certain elements should not appear in the generated picture.

When converting pictures into videos, users can also write prompt words to control the output effect, choose the video style, adjust the aspect ratio, etc.

Among similar products, only Pixeling has negative prompts, image-generated video prompts and video ratio adjustment, and only Yiying AI can adjust video style and picture ratio.

The technical level of the big model determines the quality of video generation, while the product capability determines whether the big model can be well utilized and combined with the application scenario.

For video generation products, the richness of functions determines how easy it is for users to get started, their ability to control video generation, and ultimately affects the output results and user experience.

Therefore, in terms of product perfection and functional completeness, PixVerse leads the pack, followed by Pixeling from Zhixiang Future, Yiying AI, and Vega AI. Boolean Vector is an exception. As a video generation tool focused on cross-border e-commerce, it is more complete and easy to use in specific scenarios, but it lacks competitiveness in video generation.

Of course, in addition to the basic functions, the core is the video generation effect. So the third part is the video content generation ability test.

The first is the duration of the generated video. Sora can currently generate 60s clips, while the large video generation models of domestic startups are mostly concentrated around 2s to 4s; among the domestic products themselves, the gap is not particularly large.

The second is the ability to express based on the content of the prompt words.

When Sora was released, one of its demo videos used this prompt: Beautiful, snowy Tokyo streets are bustling. Several people are enjoying the beautiful snowy day and shopping at a nearby stall. Beautiful cherry blossom petals and snowflakes are flying in the wind.

The first is PixVerse from AiShi Technology.

The content of 4s basically restores all the keywords mentioned in the prompt words, and also reflects the atmosphere of "bustling" and "stalls". The camera moves slowly along the picture, and the overall style of the video remains consistent. The buildings, lights, trees on the roadside, and pedestrians are relatively realistic. There is no obvious freeze in the picture. Except for the slightly unnatural walking of the characters, there is no distortion of the elements.

The second is VegaAI from RightBrain Technology.

Also a 4s clip with a single shot, slowly moving along a crowded street. But unlike PixVerse, which sets the scene in the evening as the lights come on, VegaAI chooses daytime.

Compared with Aishi Technology's PixVerse, VegaAI's character footsteps are more chaotic. Some characters go from two feet to three while walking, then disappear. In addition, some characters are rendered very blurrily, reduced to a constantly shifting silhouette.

Then there is Yiying AI.

Unlike PixVerse and VegaAI, which have certain lens movements, the video lens generated by Yiying AI is fixed, and it is the only one among these videos that chooses a frontal perspective.

However, the frontal perspective also brings Yiying AI a problem: it cannot handle the characters' facial expressions well. The faces of the two people walking toward each other never stabilize. Yiying AI also has the character-movement problem, but because the generated video is only 2s, it is not obvious.

The fourth is Pixeling from Zhixiang Future.

The 4s video uses a fixed shot with the characters moving forward. A similar scene shows the same character-generation and movement problems, but Pixeling's understanding of the semantics is clearly shallower.

For example, the word "bustling" in the prompt is shown in the previous videos through lights, street shops, and people, but Pixeling chose a rainy alley with few people. The whole picture looks deserted. In addition, the word "shopping" in the prompt is not shown in this video.

Finally, there is Morph Studio.

Its official website has not yet opened for public testing, so Zi Quadrant tested it through Discord.

One interesting thing about Morph Studio is that English prompts generate far better results than Chinese ones. Zi Quadrant first generated a video with a Chinese prompt, and the result was completely unrelated to it; after switching the prompt to English, the output improved dramatically.

▲Image: Discord screenshot

In terms of video content, Morph Studio's video generation is only 3 seconds, which is shorter than other products, and the clarity is lower than other products, but the overall picture content is more realistic. In terms of details, the videos generated by Morph Studio still have problems such as blurred and distorted details, "drifting" of characters, and appearing and disappearing.

In addition to the text-to-video players, there are two others that only support image-to-video: Shencai PromeAI and 6PenArt. But neither performs well at image-to-video either.

Among them, Shencai PromeAI only supports generating "dynamic images" from a single image and has no prompt function. As a result, the characters in its generated videos are distorted, and the output has no practical value.

In contrast, 6PenArt is more like an AIGC content community, where image generation and video generation are just one of its capabilities. However, 6PenArt does not support the generation of videos directly through prompt words, but requires the platform to generate images through prompt words first, and then convert the images into videos.

Zi Quadrant generated four pictures with the prompt "A corgi taking a walk with a flower in its mouth".

▲Image: 6PenArt screenshot

Then, based on these four pictures, a video was generated with the prompt "A puppy running in spring".

As you can see, this video is still in the "dynamic picture" state, which is far from a video.

In addition, Boolean Vector was not included in this comparison.

Because, judging from the product experience, Boolean Vector's boolv.video is closer in concept to an AI editor. When we input a prompt, the system automatically breaks it into multiple scripts and storyboards, writes the copy, and outputs multiple videos. After generation, the user can edit each storyboard, replace clips, change the narration and voice, and so on.

▲Image: Screenshot from boolv.video

However, the video generation capability of boolv.video is actually very limited. It can neither understand deep semantics nor generate video content that accurately corresponds to the prompt words.
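The editor-style workflow described above can be sketched roughly as follows. This is a hypothetical illustration of that kind of prompt-to-storyboard pipeline; the function names and logic are our assumptions for illustration, not boolv.video's actual implementation, and a real system would call generation models where the stubs below return placeholder text.

```python
# Hypothetical sketch of a prompt-to-storyboard pipeline like the one
# described in the product experience. All names and logic are assumptions
# for illustration, not boolv.video's actual code.

def split_into_scenes(prompt: str, n_scenes: int = 3) -> list[str]:
    """Break one marketing prompt into per-scene scripts (stubbed)."""
    return [f"Scene {i + 1}: {prompt}" for i in range(n_scenes)]

def write_copy(scene_script: str) -> str:
    """Stand-in for the step that writes narration copy for a scene."""
    return f"Narration for [{scene_script}]"

def build_storyboard(prompt: str) -> list[dict]:
    """Assemble editable storyboard entries: a script, its copy, and a video
    slot the user can later fill or replace, matching the editing flow."""
    return [
        {"script": script, "copy": write_copy(script), "video_clip": None}
        for script in split_into_scenes(prompt)
    ]

board = build_storyboard("A sneaker ad for cross-border e-commerce")
for entry in board:
    print(entry["script"])
```

The design point is that the unit of editing is the storyboard entry, not the whole video, which is why such products feel like editors rather than generators.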

Among the products we tested above, strictly speaking, only PixVerse and Morph Studio are large models focused on video generation. The other products evolved from early text-to-image and image-to-image AIGC applications.

▲Image: Testing whether the product is focused on video generation

Looking back, we have ranked the above products based on multiple tests, covering the ability to understand prompts, the logical expression of the picture, the rendering of picture details, the quality of video generation, frame consistency, stability, and fluency.
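As one way to make such a multi-dimensional comparison reproducible, a weighted aggregate score could be computed per product. The dimensions below follow the list above, but the weights and example scores are invented for illustration; they are not Zi Quadrant's actual methodology or numbers.

```python
# Hypothetical weighted-score aggregator over the evaluation dimensions.
# Weights and scores are invented for illustration only.

WEIGHTS = {
    "prompt_understanding": 0.20,
    "logical_expression":   0.15,
    "detail_rendering":     0.15,
    "generation_quality":   0.20,
    "frame_consistency":    0.10,
    "stability":            0.10,
    "fluency":              0.10,
}

def overall_score(scores: dict) -> float:
    """Weighted sum over the dimensions, each scored on a 0-10 scale."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9  # weights must total 1
    return sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)

# Example with made-up scores for a single hypothetical product:
demo = {k: 7.0 for k in WEIGHTS}
print(round(overall_score(demo), 2))  # -> 7.0
```

A uniform 7.0 across all dimensions yields 7.0 overall, which is a quick sanity check that the weights are normalized.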

After testing the 8 products and comparing them comprehensively, Aishi Technology's PixVerse and Morph Studio show relatively strong overall capability, Right Brain Technology's VegaAI ranks second, Yiying AI third, and Pixeling fourth.

Finally, across the entire evaluation, from usage threshold to product functions to content generation capability, the products of Chinese startups each have their own advantages. But overall, Aishi Technology's PixVerse has a slight edge in overall capability and is the most Sora-like product in China. Second is Morph Studio. These two constitute the first echelon of China's video generation models.

Next are VegaAI, Yiying AI, and Pixeling, which are in the second tier (Shengshu Technology was not included in the evaluation because its product was suspended) , and finally Promeai, 6PenArt, and boolv.video are in the third tier.

The following is a summary of the Zi Quadrant evaluation:

3. Use productivity tools to create a data flywheel

In fact, by comparing the product launches of domestic technology giants and startups so far, we will find that large companies are slower, while startups’ products and user scale are growing faster.

Robin Li once mentioned: Big companies make small innovations, and small companies can make big changes.

If you want to truly break through the fierce competition, at present, in addition to the choice of technical route and the capabilities of the product itself, comprehensive dimensions such as product usage scenarios, user experience, industry applications, etc. are still the key to the competition of video generation models.

In terms of product usage scenarios, as mentioned earlier, one type of company is focusing on developing new tools while the other type of company is embedding technology into certain mature products. These are two completely different routes.

For tool-type products, a core manifestation of product strength lies in whether it can become a productivity tool.

A brief look at Midjourney's development history shows that Midjourney V5 was the critical tipping point in the history of text-to-image. In effect, accuracy, speed, and other respects, V5 officially transformed the product from a "toy" into a productivity tool. That breakthrough in product capability brought a large-scale influx of users, the data flywheel began to turn, and the results improved by the day.

▲Figure: A comparison of the generation effects of V1-V6 made by netizens, source: X

Comparing with the "V5 moment", we find that the video generation model is also about to reach a singularity.

Through real-world evaluations, we found that the videos generated by PixVerse are more valuable in terms of subject consistency, motion smoothness, motion amplitude, and clarity.

Under the premise of being a productivity tool, there are again two product routes. One is the professional-tool route practiced by Adobe, which makes professionals more professional. The other is the Word-like route, which makes ordinary people productive.

On this issue, Pika founder Guo Wenjing said in an interview that Pika is not a movie-making tool, but a product for daily consumption. PixVerse's idea is clearer. Compared with Pika's tiered subscription business model, PixVerse continues to be open to the world for free while its user volume and video effects are in the first echelon in the world. This is difficult for other video generation products to achieve.

It is precisely because of its user-friendly attitude and leading video generation effects that PixVerse's flywheel started to turn. According to third-party data monitoring platforms, PixVerse's user scale is currently at the same level as Pika, and its visit volume far exceeds other mainstream video generation products in China. (Data source: similarweb.com)

▲Comparison of PixVerse, Pika, and Runway product pages in February 2024

▲Comparison of data for major domestic text-to-video products in February

▲Data trends of major domestic text-to-video products

Through research, we found that Aishi Technology is also actively sponsoring and hosting AI competitions at home and abroad, promoting the rapid implementation of the technology while making it more broadly accessible. In the process, more and more users are experiencing the advantages of its product PixVerse.

In addition, Aishi Technology has done a very good job in user ecology. Every day, a large number of video contents created with PixVerse appear on X, covering English, Chinese, Japanese, Spanish and other regions. This is an advantage that other domestic brands do not have at all, and it also reflects the choice of the market to a certain extent.

"The first advantage of PixVerse is that it's free, free, and free; the second is that it is easy to operate and effective. I just drop the picture in without writing any prompt, let PixVerse decide the motion by itself, and often get a satisfying result. I hope PixVerse can achieve larger movements and longer, more stable videos." This feedback came from a Best Film nominee at the 2024 MIT AI Film Hackathon.

Zi Quadrant believes that free does not mean giving up commercialization. In the early product-polishing stage, this approach secures real user feedback and high-quality user-generated video data, which is fed back into the video generation model, speeding up iteration and forming a data-training flywheel.

4. Conclusion

Overall, China's video generation model technology is still following foreign leads, but startups led by Aishi Technology have found their own development rhythm and model, catching up through comprehensive strengths in product design, user scale, and operational strategy.

In contrast, Sora is not yet open to the public, and it is unknown whether it can withstand a large number of users online at the same time. Whether it can generate accurate and consistent 1-minute videos every time remains to be tested.

Therefore, there is no need to find a Chinese version of Sora. Chinese video large-model companies, represented by Aishi Technology, have already embarked on a new and independent upward curve.

Author: Luo Ji, Su Yi

Source: WeChat public account: Zixiangxian (ID: zixiangxian)
