Is being a “nanny” to AI the way out for people like Tianya?

Tianya, a long-established Chinese online community, has been in the "ICU" for a whole year, and bankruptcy seems inevitable. However, recent news that an American counterpart has caught the AI express has brought Tianya a glimmer of hope.

In April last year, Tianya Community was “disconnected” from the Internet due to overdue payment of data center fees.

The crux of the problem is lack of money. Tianya Community said that the crisis came from the intensified liquidity difficulties in recent years and the arrears of telecom IDC fees, which led to the suspension of access to Tianya Community.

The next news about Tianya came at the end of February this year, when the National Enterprise Bankruptcy and Reorganization Case Information Network published a notice that "Tianya Community Network Technology Co., Ltd. was subject to bankruptcy review."

Although Tianya has denied the rumors of its impending bankruptcy, its chances of seeing the light of day again have grown even slimmer.

In the United States, meanwhile, a declining content platform has found a lucrative sideline by riding the AI wave. The third-party photo hosting platform Photobucket once had 70 million users and held nearly half of the US online photo market. Today its glory has faded, and only about 2 million people still use it.

As the saying goes, "a lean camel is bigger than a horse." Photobucket, which has been forgotten by most people, still has tens of billions of photos and videos after years of accumulation. This is exactly what AI companies suffering from "data hunger" need most.

Amid the AI boom, more and more companies are approaching Photobucket. Judging by the ongoing negotiations, the content in Photobucket's hands may be worth billions of dollars.

It would be strange if an AI company that has money but lacks data does not make a deal with an old community that has no money but has accumulated massive amounts of content.

The news that Photobucket is negotiating a deal with an AI company was revealed by Reuters.

Interestingly, there was a sentence in the report that was deleted in the subsequent editing: "The company expects its first-quarter operating income to grow 10 times to nearly $4.9 billion."

What does 4.9 billion US dollars mean?

Photobucket, as a photo hosting website, was free at first. Around the millennium, Internet users surged, and people were happy to upload photos to a dedicated website to record their lives or share them. Moreover, after uploading pictures to Photobucket, people can also embed them directly on other websites such as MySpace, saving the trouble of uploading them repeatedly. Some sellers also use pictures hosted by Photobucket on eBay or Amazon.

In this way, at its peak, Photobucket accounted for 2% of US Internet traffic.

It seemed that Photobucket had to go from free to paid. However, it was a little too impatient. In 2017, Photobucket suddenly turned third-party display into a paid subscription service costing $399 per year. The move came without sufficient warning, and many users found that the Photobucket pictures they had embedded on other websites could no longer be displayed; a "pay to unlock" prompt appeared instead.

You should know that Photobucket already had 100 million registered users at the time, and about 60 million pictures from third-party websites could not be displayed properly under this "upgrade". Amid the controversy, Photobucket changed its annual subscription model to a monthly payment model the following year, and it has been used ever since.

Photobucket went downhill from then on. In the following years, it suffered data center power outages, service interruptions, privacy leaks and other "accidents", and gradually turned from a popular photo website into an Internet relic. The company's headcount also shrank from 120 employees at its peak to 40.

The most expensive paid plan offered by Photobucket currently costs $8 per month. According to the latest report, 2 million users still use Photobucket, and even if all of them paid $8 per month, they would contribute only about $190 million per year. And that is only revenue, before deducting the costs of storage, maintenance, operations and so on.
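The ceiling quoted above is simple multiplication, and a short sketch makes it easy to check. It assumes, as the article does, that every one of the 2 million users is on the $8 monthly plan; the function and variable names are illustrative only.

```python
# Upper bound on yearly subscription revenue, using the article's figures.
# Assumption: all 2 million users pay for the most expensive $8/month plan.
USERS = 2_000_000
TOP_PLAN_USD_PER_MONTH = 8
MONTHS_PER_YEAR = 12


def max_annual_subscription_revenue(users: int, price_per_month: int) -> int:
    """Return the revenue ceiling in US dollars for one year."""
    return users * price_per_month * MONTHS_PER_YEAR


ceiling = max_annual_subscription_revenue(USERS, TOP_PLAN_USD_PER_MONTH)
print(f"${ceiling / 1e6:.0f} million per year")  # prints "$192 million per year"
```

Even this best case, roughly $190 million, is a small fraction of the $4.9 billion figure mentioned in the report.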

Although Photobucket has lost a lot of users in the past 20 years, it has always kept users' pictures unless they delete their accounts. Even after it stopped supporting free accounts, Photobucket clearly notified users: your photos are still there, you just need to start paying now to see them again.

Users who have abandoned Photobucket have been complaining on social media that they keep receiving emails from Photobucket begging them to come back, and that they can no longer stand it.

Since the pictures are there and the massive amount of content is sitting on the servers, why not use them to make money? Earning $4.9 billion by licensing platform content to AI companies would be a huge sum for Photobucket.

Why would an AI company choose Photobucket, a “faded product”?

The answer is simple: they lack data. Take OpenAI's GPT series of models as an example. GPT-3 used 300 billion tokens, and GPT-4 used 12 trillion tokens. The number of tokens required for GPT-5, which is already on the way, is between 60 trillion and 100 trillion.

“Scale is everything” has become a battle cry for AI. In 2020, Johns Hopkins physicist Jared Kaplan published a seminal paper on AI showing that large language models perform better with more training data, just as students learn more by reading more books.

The publicly available data on the Internet is not inexhaustible in the face of large models. According to the artificial intelligence research organization Epoch, all available high-quality data may be exhausted by 2026, and the speed at which the Internet produces data may not keep up with the speed at which ever-expanding large models consume it.

The paths that “data-hungry” AI companies take to obtain data can be summarized as follows: if it’s free, use it directly; if it’s their own, use it directly and don’t let others use it; if it can be paid for, pay for it; if it can’t be bought at any price, find a way to get it anyway.

Recently, The New York Times reported that OpenAI used content from Google's YouTube when training GPT-4. Direct access is definitely not an option, as Google won't allow it. So OpenAI came up with an idea and created a speech recognition tool called Whisper, which transcribed more than 1 million hours of YouTube videos and then fed it to the model.

Previously, the text-to-video tool Sora, which has not yet been opened to the public, has also aroused suspicion from the outside world. In an interview, Mira Murati, the chief technology officer of OpenAI, did not directly respond to the question of "whether to use content from platforms such as YouTube, Instagram, and Facebook to train Sora". When she heard the question, her complicated expression even became an Internet meme.

YouTube CEO Neal Mohan responded on April 5, saying there is no evidence that OpenAI used YouTube videos to train Sora, but if OpenAI did so, it would be a "clear violation" of YouTube's terms of use.

It might be naive to think that YouTube is trying to protect its users (or creators). Mohan also mentioned in the interview that Google did use some content from YouTube to train its large model Gemini.

On the other hand, Mark Zuckerberg of the giant Meta also regards platform data as his competitive advantage. Zuckerberg once said bluntly: "The next key part of our strategy is to learn from unique data." "On Facebook and Instagram, there are hundreds of billions of publicly shared pictures and tens of billions of public videos."

Elon Musk, who last year criticized Microsoft and threatened to sue it for using X's data to train AI, also quietly updated X's privacy policy, saying that it would use social media data to train machine learning and AI models. When questioned by netizens, Musk simply admitted: "I will only use public information (for training), not private messages or any private data."

Companies that have a large amount of UGC (user-generated content) and also do AI themselves do not sell the data, but only use it for themselves. Other AI companies either take risks and use it secretly, or look for companies that have content but are willing to sell it.

Shutterstock and Reddit are both "big sellers" active in the data trading market.

The image website Shutterstock has struck deals with almost all the big AI companies, including but not limited to OpenAI, Meta, Google, and Amazon, allowing them to use its images to train AI. The initial price of each deal ranged from US$20 million to US$50 million, and the deals were subsequently expanded.

As the AI wave surges, Reddit, the "American forum", has realized that its data is vital and valuable to AI companies. Last year, Reddit began negotiating paid data use with a series of leading AIGC companies. To put it bluntly, if you don't pay for authorization, don't even think about using the content of this leading American forum to feed AI. The negotiations have made progress. In February this year, Reddit reached an agreement with Google to license data to it for AI training, with a contract value of about $60 million per year.

Under such circumstances, it is only a matter of time before established communities like Photobucket are targeted.

Photobucket CEO Ted Leonard said he is in talks with several tech companies to license 13 billion pieces of content (photos and videos), with each photo costing between 5 cents and $1, and videos costing more than $1.
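A rough sketch shows how the quoted per-item prices scale up to "billions of dollars". It rests on simplifying assumptions that are not in the report: all 13 billion items are priced as photos (videos, which cost more, are ignored), and every item is licensed exactly once. The names are illustrative only.

```python
# Back-of-the-envelope range for licensing the whole library once.
# Assumptions (not from the report): every item is priced as a photo,
# and each item is licensed exactly one time.
ITEMS = 13_000_000_000          # photos and videos, per the CEO
PHOTO_PRICE_LOW_CENTS = 5       # 5 cents per photo
PHOTO_PRICE_HIGH_CENTS = 100    # $1 per photo


def library_value_usd(items: int, price_cents: int) -> int:
    """Return the total value in whole US dollars (integer arithmetic)."""
    return items * price_cents // 100


low = library_value_usd(ITEMS, PHOTO_PRICE_LOW_CENTS)
high = library_value_usd(ITEMS, PHOTO_PRICE_HIGH_CENTS)
print(f"${low / 1e9:.2f} billion to ${high / 1e9:.0f} billion")
# prints "$0.65 billion to $13 billion"
```

Under these assumptions the library is worth somewhere between several hundred million and about 13 billion dollars, which is consistent with the "billions of dollars" estimate.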

One buyer told Leonard they wanted more than a billion videos, more than Photobucket had. At the time of the negotiations, Photobucket was sitting on billions of dollars worth of content.

Photobucket, for its part, updated its user terms last October to grant the platform the “unrestricted right” to sell any uploaded content for use in training AI systems.

Leonard even said that data licensing could replace the company's advertising sales business.

The busy data trading market may provide a "side job" for declining or even dead UGC platforms.

It is unknown how much content Tianya has accumulated, but a few statistics can give us a glimpse of its scale. At its peak, Tianya's daily visits reached 20 million.

In the golden age of Chinese communities, the saying "Everyone's topic is created by Tianya" was popular. Many first-generation Internet celebrities were born here, such as Sister Furong, Jipin Xiaoyueyue, and Xilige. Many best-selling books were hatched here, such as "Ghost Blowing Lights", "Those Things in the Ming Dynasty", "Northeast Past: Twenty Years of Underworld", "Forensic Qin Ming", etc.

The usefulness of Chinese forums for AI training is also gaining attention.

A study shows that Ruozhiba (literally "Retarded Bar"), a forum on Baidu Tieba, performs remarkably well as training data.

This study was jointly completed by the Shenzhen Institutes of Advanced Technology and the Institute of Automation of the Chinese Academy of Sciences, the University of Waterloo, and many other universities and research institutions. It proposed a high-quality Chinese instruction-tuning dataset, which was used to train models of different types and sizes in order to explore the impact of various data sources on model performance. In the tests, Ruozhiba scored quite high.

Ruozhiba has 300 members, who have nothing to do with intellectual disability but "pretend to be retarded" and post mind-bending remarks. For example, "If the high school enrollment rate is not high, why not directly recruit college students?" or "Why didn't my parents invite me to their wedding?" Researchers speculate that the questions in Ruozhiba may have enhanced the logical reasoning ability of AI.

This is the spark that flies when content created by the masses collides with AI. Community content can sometimes deliver unexpected surprises.

However, standing between community content and AI are users.

Just as Photobucket has been busy updating its user terms, "content rights confirmation" has always been a problem on the Chinese Internet.

On the one hand, Chinese Internet platforms have long formed the habit of embedding authorization clauses in user terms. The "Privacy and Copyright" agreement of Tianya in 2017 that can be found so far states: "For any content published and uploaded to this website by users, this community enjoys permanent, irrevocable, free, non-exclusive rights of use and sublicense in any form and carrier worldwide, including but not limited to modification, reproduction, distribution, exhibition, adaptation, compilation, publication, translation, information network dissemination, broadcasting, performance and other rights determined by laws and regulations such as creation and copyright law."

After Tianya was “powered off”, selling “Tianya God Post Collections” online became a popular business. In its restart announcement, Tianya said it had noticed the great popularity of Tianya God Posts on various platforms and “planned to develop a group of senior members from now on and open a Tianya God Post paid area on the Tianya community platform after access is restored.”

At the end of the restart announcement, Tianya said, "Whether it is pre-ordering a '99 yuan Tianya God Post service' or a '299 yuan one-to-one data download service', it is a very important boost to Tianya's restart." It also attached a QR code for purchase.

On the other hand, whether the platform has the right to license user content to other companies for training AI remains to be discussed.

Users are quite wary of this.

Last year, Xiaohongshu updated its user terms and conditions, stating in the "User Content and Information Authorization" that "you grant xxx company a free, irrevocable, non-exclusive license to use the content without geographical restrictions," and that "the above license includes the right and permission to use, copy and display protected personal images, portraits, names, trademarks, brands, logos and other marketing and promotional materials and materials in user content." In addition, at that time, some illustrators questioned the suspected plagiarism of AI tools, which aroused the illustrators' concerns that the platform would use their uploaded works to train AI. Many illustrators publicly boycotted and announced that they would stop updating on the platform.

Today, Photobucket’s CEO has admitted in an interview that the platform has licensing agreements with AI companies, but not every AI company is confident in its content.

Daniela Braga, CEO of Defined.ai, said she avoids acquiring content from platforms like Photobucket, preferring instead to license it from the original creators of the photos: “I think that’s very dangerous,” she said. “If there’s something AI-generated that resembles a photo of someone who never gave their permission, that’s a problem.”

