Author | JAMES VINCENT
Translator | Zhu Xianzhong
Generative artificial intelligence has surged in popularity over the past year. Companies like Microsoft, Adobe, and GitHub are integrating the technology into their products; startups are raising hundreds of millions of dollars to compete; and the software is even having a cultural impact, with text-to-image AI models spawning countless memes. But listen in on any industry discussion of generative AI and you'll hear, in the background, a question whispered by advocates and critics alike in increasingly concerned tones: is any of this actually legal?
The question arises because of the way generative AI systems are trained. Like most machine learning software, they work by identifying and replicating patterns in data. But because these programs are used to generate code, text, music, and art, that training data is itself created by humans, scraped from the web, and much of it is copyrighted in some way.
For AI researchers living in the distant past (also known as the 2010s), this wasn't much of an issue. Back then, state-of-the-art models could only generate blurry, fingernail-sized black-and-white images of faces; they posed no obvious threat to anyone. But in 2022, when a lone amateur can use software like Stable Diffusion to copy an artist's style in a matter of hours, or when companies sell AI-generated prints and social media filters that are knockoffs of living designers, questions of legality and ethics have become far more pressing.
Take Disney illustrator Hollie Mengert, whose artistic style was cloned in an AI experiment by a mechanical engineering student in Canada. The student downloaded 32 of Mengert's works and spent several hours training a machine learning model that could reproduce her style. As Mengert told technologist Andy Baio, who covered the case: "Personally, it feels like someone is taking work that I've done and things that I've learned — I've been a working artist since I graduated from art school in 2011 — and using it to make art that I didn't consent to and didn't give permission for."
But is that fair? And is there anything Mengert can do about it?
To answer these questions and understand the legal landscape of generative AI, The Verge spoke to a range of experts, including lawyers, analysts, and employees at AI startups. Some say with confidence that these systems have a high risk of copyright infringement and may face serious legal challenges in the near future. Others argue the opposite: everything currently happening in the field of generative AI is legal, and any lawsuits are bound to fail.
"I see people on both sides of this extremely confident in their positions, but the reality is nobody knows," Baio, who has been closely following the generative AI scene, told The Verge. "And anyone who says they know confidently how this will play out in court is wrong." Andres Guadamuz, an academic specializing in AI and intellectual property law, said that although there are many unknowns, there are a few key questions from which the topic's many uncertainties unfold. First, can you copyright the output of a generative AI model? And if so, who owns it? Second, if you own the copyright to the input used to train an AI, does that give you any legal claim over the model or the content it creates? Once those questions are answered, a larger one emerges: how do you deal with the fallout of this technology? What legal restrictions can or should be placed on data collection? And can there be peace between the people building these systems and those whose data is needed to create them?
Let’s answer these questions one by one.
## 3. The output question: Can you copyright something created by an AI model?
In September of this year, the US Copyright Office registered, for the first time, a comic book created with the help of the text-to-image AI Midjourney (an AI art generator similar to OpenAI's DALL·E). The comic is a complete work: 18 pages of narrative, with characters, dialogue, and a traditional comic layout. And although it has since been reported that the Copyright Office is reviewing its decision, the comic's registration has not actually been revoked. It appears that one factor in the review will be the degree of human involvement in producing the comic. Kristina Kashtanova, the artist who created the work, told IPWatchdog that the Copyright Office asked her to "provide details of my process to show that there was substantial human involvement in the creation of this graphic novel." (The Copyright Office itself does not comment on specific cases.)
According to Guadamuz, granting copyright to AI-generated works will be an ongoing issue. "If you just type 'cat by van Gogh,' I don't think that's enough to get copyright in the US," he said. "But if you start experimenting with prompts and produce several images, start fine-tuning your images, start using seeds, and start engineering a little more, I can totally see that being protected by copyright."
By that standard, the vast majority of the output of generative AI models most likely cannot be protected by copyright: it is typically churned out en masse from a few keywords used as a prompt. But more involved processes make for better cases. These might include controversial works like the AI-generated print that won a state art fair competition; in that instance, the creator said he spent weeks honing his prompts and manually editing the finished work, suggesting a considerable degree of intellectual investment.
Computer scientist Giorgio Franceschelli, who has written about copyright issues in artificial intelligence, says measuring human input will be "particularly true" for deciding cases in the European Union. And in the UK — another major jurisdiction of interest to Western AI startups — the law is different again. Unusually, the UK is one of the few countries to offer copyright for works generated solely by a computer, but it considers the author to be "the person who made the arrangements necessary for the creation of the work." Again, there is room for multiple readings (is that "person" the model's developer or its operator?), but it offers a precedent for granting some kind of copyright protection.
Ultimately, though, Guadamuz warns that registering a copyright is only the first step. "The US Copyright Office is not a court," he said. "If you want to sue someone for copyright infringement, you need to register, but it's ultimately the courts that decide whether it's legally enforceable."
## 5. The input question: Can you use copyrighted data to train an AI model?

For most experts, the biggest questions about AI and copyright concern the data used to train these models. Most systems are trained on huge amounts of content scraped from the web, whether text, code, or imagery. The training dataset for Stable Diffusion, for example — one of the largest and most influential text-to-image systems — contains billions of images pulled from hundreds of domains: everything from personal blogs hosted on WordPress and Blogspot to art platforms like DeviantArt and stock image sites like Shutterstock and Getty Images. Indeed, the training datasets for generative AI are so vast that there's a good chance your own work is already in one of them.

The justification offered by AI researchers, startups, and multibillion-dollar tech companies is that the use of these images is protected (in the US, at least) by the fair use doctrine, which aims to encourage the use of copyrighted works in order to promote freedom of expression. Daniel Gervais, a professor at Vanderbilt Law School who specializes in intellectual property law and has written extensively about how it intersects with AI, explains that many factors must be weighed when deciding whether something is fair use, but that two "stand out much more": "What is the purpose or nature of the use, and what is the effect on the market." In other words: does the use case change the nature of the material in some way (usually described as a "transformative" use), and does it threaten the livelihood of the original creator by competing with their work?

## 6. Training generative AI on copyrighted data may be legal, but you may still use the model in illegal ways
Weighing those factors, Gervais said, there is a "high likelihood" that training systems on copyrighted data will be covered by fair use. But the same is not necessarily true of generating content with them. In other words: you may be able to train an AI model on someone else's data, but what you do with that model might be infringing. Think of it as the difference between printing counterfeit money as a movie prop and trying to buy a car with it.
Consider the same text-to-image AI model deployed in different scenarios. If the model is trained on many millions of images and used to generate novel pictures, it's extremely unlikely that this constitutes infringement: the training data has been transformed in the process, and the output poses no threat to the market for the original art. But if you fine-tune that model on 100 pictures by a particular artist and generate images in her style, an unhappy artist would have a much stronger case against you.
"If you give an AI 10 Stephen King novels and say, 'Produce a Stephen King novel,' then you're competing directly with Stephen King," Gervais said. "Would that be fair use? Probably not."
Crucially, though, between those poles of fair and unfair use lie countless scenarios in which input, purpose, and output are all balanced differently, any of which could tip a legal ruling one way or the other.
Ryan Khurana, chief of staff at generative AI company Wombo, said most companies selling these services are aware of those distinctions. "Intentionally using prompts that draw on copyrighted works to generate an output ... violates every major player's terms of service," he told The Verge by email. But, he added, "enforcement is difficult," and companies are more interested "in coming up with ways to prevent models from being used in copyright-violating ways ... than limiting training data." That's especially true of open-source text-to-image models like Stable Diffusion, which can be trained and used with zero oversight or filtering. The company may have covered itself, but it may also be enabling uses that infringe copyright.
Another variable in determining fair use is whether the training data and model were created by academic researchers and nonprofits. This generally strengthens a fair use defense, and startups know it. Stability AI, for example, didn't directly collect the training data or train the models behind its software. Instead, it funded and coordinated that work by academics, and the Stable Diffusion model is licensed from a German university. That lets Stability AI turn the model into a commercial service (DreamStudio) while keeping a legal distance from its creators.
Baio calls this approach "AI data laundering." He notes that the method has been used before to build facial recognition AI software, pointing to the example of MegaFace, a dataset compiled by University of Washington researchers by scraping photos from Flickr. "Academic researchers take the data, launder it, and it gets used by commercial companies," Baio said. That data — including millions of personal photos — ended up, he said, in the hands of "facial recognition company Clearview AI, law enforcement, and the Chinese government." Such a tried-and-tested laundering process may help shield the creators of generative AI models from liability.
There's one final wrinkle in all this, though: as Gervais points out, the prevailing interpretation of fair use may change in the coming months because of a pending Supreme Court case involving Andy Warhol and Prince. The case concerns Warhol's use of photographs of the musician Prince to create artwork. Was that fair use, or copyright infringement?
"The Supreme Court doesn't take fair use cases very often, and when it does, it usually does something significant. I think they'll do the same here," Gervais said. "And while a change in the law from the Supreme Court is pending, there's a risk in saying anything is settled law."
Even if the training of generative AI models is ultimately found to be covered by fair use, that will hardly settle matters. It won't appease artists angry that their work has been used to build commercial products, and it won't necessarily hold for other fields of generative AI, like code and music. With that in mind, the question becomes: what remedies, technical or otherwise, could be introduced to allow generative AI to flourish while giving credit or compensation to the creators whose work makes the field possible?
The most obvious suggestion is to license the data and pay its creators. For some, though, that would kill the industry. Bryan Casey and Mark Lemley, authors of "Fair Learning," argue that training datasets are so large that "there is no plausible option simply to license all of the underlying photographs, videos, audio files, or texts for the new use." Allowing any copyright claim, they argue, is tantamount to saying not that copyright owners will get paid, but that the use won't be permitted at all. Permitting "fair learning," as they call it, not only encourages innovation but also allows better AI systems to be developed.
Others counter, however, that we have already solved copyright issues of similar scale and complexity, and we can do so again. Several experts The Verge spoke to drew a comparison to the era of music piracy, when file-sharing programs built on massive copyright infringement thrived only until legal challenges led to new agreements that respected copyright.
"In the early 2000s you had Napster, and everybody loved it, but it was completely illegal. And today we have things like Spotify and iTunes," Matthew Butterick, a lawyer currently suing companies that scrape data to train AI models, told The Verge earlier this month. "And how did those systems arise? By companies making licensing deals and bringing in content legitimately. All the stakeholders came to the table and made it work. To me, the idea that something like that can't happen with artificial intelligence is a disaster."

## 8. Companies and researchers are already experimenting with ways to compensate creators
Wombo's Ryan Khurana predicts a similar outcome. "Music has by far the most complicated copyright rules because of the different types of licensing, the variety of rights holders, and the various intermediaries involved," he told The Verge. "Given the nuances of the legal issues surrounding generative AI, I think the whole field will evolve into a licensing regime similar to music's."

Other alternatives are also being trialled. Shutterstock, for example, says it plans to set up a fund to compensate individuals whose work it has sold to AI companies to train their models, while DeviantArt has created a metadata tag for images shared on the web that warns AI researchers not to scrape their content. (At least one small social network, Cohost, has already adopted the tag across its site and says that if researchers are found to scrape its images anyway, it "won't rule out legal action.") These approaches, though, have met with criticism from different camps in the art community. Will one-off license fees make up for lost livelihoods? And what use is a no-scraping tag deployed now to artists whose work has already been used to train commercial AI systems? For many creators, it seems, the damage has already been done.

But AI startups are at least proposing new approaches for the future. One obvious advance is for AI researchers to build datasets where there is no possibility of copyright infringement — either because the material has been properly licensed or because it was created for the specific purpose of AI training. One such example is The Stack, a training dataset designed specifically to avoid accusations of copyright infringement. It includes only code with the most permissive open-source licenses and offers developers an easy way to have their data removed on request. Its creators say their model could be adopted across the industry.
"The Stack's approach can absolutely be adapted to other media," Yacine Jernite, who leads machine learning and society at Hugging Face, which created The Stack in partnership with ServiceNow, told The Verge. "It's an important first step in exploring the broad range of consent mechanisms that work best, given the rules of the platforms the AI training data is extracted from." Jernite said Hugging Face wants to help bring about a "fundamental shift" in how AI researchers treat creators. But so far, the company's approach remains unusual.

## 9. What happens next?

Wherever we land on these legal questions, the various players in generative AI are already gearing up for something. Companies making millions from the technology are entrenching themselves, repeatedly asserting that everything they do is legal (while presumably hoping nobody actually tests that claim). Across the no man's land, copyright holders are staking out tentative positions without fully committing to action. Getty Images recently banned AI-generated content, citing the potential legal risk to its customers. ("I don't think it's responsible. I think it could be illegal," CEO Craig Peters told The Verge last month.) And the music industry trade group RIAA has declared that AI-powered music mixers and extractors infringe its members' copyright, though it hasn't mounted any actual legal challenge.
With the recent filing of a class action lawsuit against Microsoft, GitHub, and OpenAI, however, the first shot in the coming AI copyright war has been fired. The suit accuses the three companies of knowingly reproducing open-source code through the AI coding assistant Copilot without the proper licenses. In an interview with The Verge, the lawyers behind the suit said it could set a precedent for the entire field of generative AI.
Meanwhile, both Guadamuz and Baio said they were surprised there haven't been more legal challenges already. "Honestly, I'm stunned," Guadamuz said. "But I think that's partly because these industries are afraid of being the first to sue and lose. Once someone breaks cover, though, I think the lawsuits will start flying everywhere."
One difficulty, Baio said, is that many of the people most affected by this technology — artists and the like — are simply not in a good position to mount a legal challenge. "They don't have the resources," he said. "This kind of litigation is extremely expensive and time-consuming, and you only pursue it if you know you're going to win. That's why I've thought for a while that the first lawsuits around AI art would come from the stock image sites. They stand to lose the most from this technology, they can clearly prove that a large corpus of their images was used to train these models, and they have the funds to take it to court."
Guadamuz agreed. "Everyone knows how expensive it will be," he said. "Whoever sues will get a ruling in a lower court, then the losing side will appeal, then appeal again, and eventually this could go all the way to the Supreme Court."
Original link: https://www.theverge.com/23444685/generative-ai-copyright-infringement-legal-fair-use-training-data