
Data privacy and intellectual property: the hidden risks of AI

The responsible use of artificial intelligence in research is a challenge for scientists and publishers alike in the light of the advent of chatbots

Since ChatGPT’s arrival in November 2022, it seems that there is no part of the research process that chatbots haven’t touched. Generative artificial intelligence (genAI) can now perform literature searches and write manuscripts, grant applications and peer-review comments. Yet because these tools are trained on huge data sets, which are often not made public, they can also clash with ownership, plagiarism and privacy standards in unexpected ways that current legal frameworks struggle to address. As genAI becomes more widespread, the onus is increasingly on users to make sure they are applying the tools responsibly.

Shayne Longpre, who leads the Data Provenance Initiative, sees publishers’ attempts to source data ethically as admirable, but he thinks an opt-in standard will be a hard sell. Under such a regime, he says, you are either going to pay a lot or be data-starved. “It could be that only a few players, large tech companies, can afford to license all that data.”

Publishers, it seems, are acting in step with scientists. Daniel Weld, chief scientist at the AI search engine Semantic Scholar, based at the University of Washington in Seattle, has noticed that more publishers and individuals are reaching out to retroactively request that papers in the Semantic Scholar corpus not be used to train AI models.

International policy is only now catching up with the rise of artificial intelligence, and clear answers to many questions, such as who owns what rights when it comes to data, are likely to be years away. “We are now in this period where there are very fast technological developments, but the legislation is lagging,” says Christophe Geiger, a legal scholar at Luiss Guido Carli University in Rome. “The challenge is how we establish a legal framework that will not disincentivize progress, but still take care of our human rights.”

Tudorache sees the European Union’s AI Act as an acknowledgement of the new reality that AI brings. “We’ve had many other industrial revolutions in the history of mankind, and they all profoundly affected different sectors of the economy and society at large, but I think none of them have had the deep transformative effect that I think AI is going to have,” he says.

Academics often sign their intellectual property (IP) over to institutions or publishers, which can leave them with little leverage over how their data are used. But Christopher Cornelison, the director of IP development at Kennesaw State University in Georgia, says it’s worth starting a conversation with your institution or publisher if you have concerns. These entities could be in a better position to broker a licensing agreement with an AI company. “We certainly don’t want an adversarial relationship with our faculty, and the expectation is that we’re working towards a common goal,” he says.

Scientists can now detect whether visual content, such as images or graphics, has been included in a training set, and they have developed tools that can ‘poison’ data so that AI models trained on them break in unpredictable ways. “We basically teach the models that a cow is something with four wheels and a nice fender,” says Ben Zhao, a computer-security researcher at the University of Chicago in Illinois. Zhao worked on one such tool, called Nightshade, which manipulates the individual pixels of an image so that an AI model associates the corrupted pattern with a different type of image (a dog instead of a cat, for example). No similar tools yet exist for poisoning writing.
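The general idea can be sketched in a few lines of code. The toy example below is only an illustration of bounded pixel perturbation: the file names and the pixel budget are hypothetical, and it works on raw pixel values rather than on a model’s internal features, so it would not by itself fool a model the way Nightshade does. It simply nudges each pixel of a source image a small, capped amount toward a second ‘anchor’ image while keeping the change visually subtle.

```python
import numpy as np
from PIL import Image

def poison(src_path: str, anchor_path: str, out_path: str, budget: float = 8.0) -> None:
    """Shift each pixel of the source image a bounded amount toward the anchor image."""
    size = (512, 512)                                # resize so the two arrays line up
    src = np.asarray(Image.open(src_path).convert("RGB").resize(size), dtype=np.float32)
    anchor = np.asarray(Image.open(anchor_path).convert("RGB").resize(size), dtype=np.float32)
    delta = np.clip(anchor - src, -budget, budget)   # cap the per-channel change
    out = np.clip(src + delta, 0, 255).astype(np.uint8)
    Image.fromarray(out).save(out_path)

# Hypothetical file names: a 'cow' photo nudged slightly toward a 'car' photo.
poison("cow.jpg", "car.jpg", "cow_poisoned.jpg")
```

Nightshade itself reportedly chooses the perturbation using a text-to-image model’s own feature representations rather than raw pixels, which is what makes the corrupted association stick during training.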

Experts broadly agree that it’s nearly impossible to completely shield your data from web scrapers, the tools that extract data from the Internet. But some steps can add an extra layer of oversight, such as making resources available only on request or hosting data locally on a private server. Several companies, including OpenAI, Microsoft and IBM, allow customers to create their own chatbots, trained on their own data, that can be isolated in this way.
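One widely used signal, not named above but consistent with that advice, is a robots.txt file that asks known AI crawlers not to fetch a site’s pages; compliance is voluntary, so it deters rather than prevents scraping. A minimal sketch, assuming the publicly documented user-agent tokens listed in the code:

```python
# Write a robots.txt asking documented AI crawlers not to fetch any pages.
# "GPTBot" is OpenAI's crawler, "CCBot" is Common Crawl's, and "Google-Extended"
# is Google's token for controlling whether content is used for AI training.
AI_USER_AGENTS = ["GPTBot", "CCBot", "Google-Extended"]

rules = "\n\n".join(f"User-agent: {agent}\nDisallow: /" for agent in AI_USER_AGENTS)

with open("robots.txt", "w") as fh:
    fh.write(rules + "\n")
```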

Abstaining from genAI might feel like missing out on a golden opportunity. But for certain disciplines, particularly those that involve sensitive data, such as medical diagnoses, giving it a miss could be the more ethical option. “Right now, we don’t really have a good way of making AI forget, so there are still a lot of constraints on using these models in health-care settings,” says Uri Gal, an informatician at the University of Sydney in Australia, who studies the ethics of digital technologies.

Other publishers, including Wiley, Oxford University Press and Taylor & Francis, have brokered deals with AI companies; Taylor & Francis, for example, has a US$10-million agreement with Microsoft. Cambridge University Press (CUP) has not yet entered any partnerships, but is developing policies that will offer authors an ‘opt-in’ agreement, with remuneration. In a statement to The Bookseller magazine discussing future plans for the CUP (which oversees 45,000 print titles, more than 24,000 e-books and more than 300 research journals), Mandy Hill, the company’s managing director of academic publishing, who is based in Oxford, UK, said that it “will put authors’ interests and desires first, before allowing their work to be licensed for GenAI”.

Representatives of the publishers Springer Nature, the American Association for the Advancement of Science (which publishes the Science family of journals), PLOS and Elsevier say they have not entered such licensing agreements — although some, including those for the Science journals, Springer Nature and PLOS, noted that the journals do disclose the use of AI in editing and peer review and to check for plagiarism. (Springer Nature publishes Nature, but the journal is editorially independent from its publisher.)

In fields where attribution is linked to professional success and prestige, going uncredited for your research carries real costs. For Evan Spotte-Smith, the issue goes beyond ethics or personal distaste: he objects to the way AI strips people’s names from their work for other reasons, too. Research has shown that members of groups that are marginalized in science have their work published and cited less frequently than average (ref. 5), and overall have access to fewer opportunities for advancement. AI stands to exacerbate these challenges further, Spotte-Smith says: failing to attribute someone’s work to them “creates a new form of ‘digital colonialism’, where we’re able to get access to what colleagues are producing without needing to actually engage with them”.

When contacted for comment, an OpenAI spokesperson said that the company is looking into ways to improve the opt-out process and that it believes AI offers huge benefits to society. “We respect that some content owners, including academics, may not want their publicly available works used to help teach our AI, which is why we offer ways for them to opt out. We are also examining what other tools might be useful.”

The technology underlying genAI, first developed at public institutions in the 1960s, has now largely been taken over by private companies, which usually have no incentive to prioritize transparency or open access. The inner workings of genAI chatbots are a black box, often not fully understood even by the people who created them, and it is virtually impossible to know exactly what went into a model’s answer to a prompt. Organizations such as OpenAI have so far asked users to ensure that outputs used in other work do not violate laws, including intellectual-property and copyright regulations, or divulge sensitive information, such as a person’s location, gender, age, ethnicity or contact information. Studies have shown that the tools can do both.

“There’s an expectation that the research and synthesis is being done transparently, but if we start outsourcing those processes to an AI, there’s no way to know who did what and where the information is coming from and who should be credited,” he says.

Poisot, who is based at the University of Montreal in Canada, is known for his research on the world’s biodiversity. He hopes that his work will join other studies being considered at the 16th Conference of the Parties (COP16) to the United Nations Convention on Biological Diversity in Cali, Colombia, which is scheduled for later this year. Knowing that there are real stakes to every piece of science he produces, and that policymakers and stakeholders will be looking at it, is both exciting and frightening, he says.