Chances are, if you’re breathing, you’ve also come across the internet’s latest obsession: OpenAI’s highly capable chatbot, ChatGPT. Just two months after its release, it set the record for the fastest-growing user base, amassing some 100 million active users. To put this into perspective, it took TikTok about nine months after its global launch to reach 100 million users, and Instagram two and a half years.
Many will argue that the commotion surrounding the AI is justified. After all, it can pass academic exams, generate code, write poetry and do just about everything in between.
But whilst users and industries alike rush to work out just how they can use the technology to their advantage, there has been little discussion on the ethical and legal concerns, primarily in terms of data privacy.
How it Works
To comprehend the challenges posed by large AI-based models such as ChatGPT, we must understand how the OpenAI system works and what type of data it uses.
ChatGPT is not an online bot that searches the web for information. It has instead been pre-trained on a large amount of text, such as novels, articles, and websites. This means it does not gather any new live information from the internet when users interact with it. Instead, it generates responses based on the data it already has.
Because ChatGPT is trained on a large dataset of text, including but not limited to content from the internet, personal information about individuals may be included. If you’ve ever left a bad review or posted an article, there’s a strong chance ChatGPT has read it and used it for training, all without your consent.
Lawfulness of Data Scraping
Scraping publicly available information is not in itself a prohibited act under the General Data Protection Regulation (“GDPR”) framework. However, the GDPR applies in full irrespective of whether the data was made public prior to its collection.
This means that where companies like OpenAI collect personal data from publicly accessible sources, they are required, at a minimum, to establish a lawful basis under Article 6 of the GDPR.
Given that consent was not obtained, and noting the commercial nature and intent of the use of personal data, it is unclear whether there is a lawful basis on which OpenAI can rely.
According to Alexander Hanff, member of the European Data Protection Board’s (EDPB) support pool of experts, OpenAI could potentially rely on legitimate interests as set out under Article 6(1)(f).
However, he added that to do this “They would need to show that this legitimate interest overrides the right for the individuals to privacy – and they won’t win this case.”
Legitimate interest is indeed the most flexible lawful basis for processing under the GDPR. It is certainly easy to see its appeal to businesses that are constantly looking for ways to generate profit from your data.
But if there are any lessons learnt from the Clearview AI case, it’s that in the absence of consent, national data protection authorities don’t take kindly to the collection of personal data used to feed AI-powered services. This is especially true where there is a lack of awareness around the data collection, which appears to be the case with OpenAI, as it makes no mention of the collection of personal data from publicly available sources in its privacy notice.
Whilst examining the lawfulness of data scraping, we must also consider that the internet is filled with sensitive data, including special category data. This means that in addition to being noncompliant with Article 6 of the GDPR, OpenAI is also likely to be infringing the provisions set out under Article 9.
Right To Be Forgotten
When AI expert and social entrepreneur Miguel Luengo-Oroz asked ChatGPT “can neural networks forget?”, he received the following response:
“Artificial intelligence systems, like neural networks, do not have the ability to forget in the same way that humans do. The network may adjust its weights to better fit the new data, which could result in it making different predictions for the same input. However, this is not the same as forgetting in the sense that the network still has access to all of the information it has learned previously, it is just prioritizing the new information more highly.” 
This is particularly concerning considering that Article 17 of the GDPR grants individuals the right to be forgotten, otherwise known as the right to erasure.
The underlying reason why it is difficult for AI systems to forget is their sheer complexity. AI models like ChatGPT are stochastic, meaning there is a degree of uncertainty or randomness involved in the hypotheses and outcomes they generate.
This makes it difficult to correlate data points used to train the AI model with the decision-making rules they establish.
Without the means of tracing certain data points and carefully extracting data from an AI model, data in a sense becomes immortalised, and individuals are prevented from exercising their rights concerning the use of their personal data.
Currently, OpenAI has not provided any clear pathways to determine whether your data was collected, or more worryingly, whether it can be extracted from the model.
In theory, companies like OpenAI can remove data from an AI model by completely retraining it without the data point in question. However, this is not a practical solution, as it requires excessive effort and substantial resources.
Beyond being impracticable, AI experts also point out that by removing data from machine learning models, their accuracy may be compromised. Essentially, they may not be able to perform in the manner we have come to expect them to. This is not necessarily true where only one individual exercises their right to be forgotten, but it certainly is where hundreds of thousands look to do the same.
Rethinking AI Models
To allow for the erasure of certain data points or user information, some researchers and businesses are working on various techniques. These are still in the early phases of development, so it is unclear how practical or effective they will be.
One promising approach, proposed in 2019 by researchers from the universities of Toronto and Wisconsin-Madison, involves segregating training data into multiple pieces, or shards. Each shard is processed separately before the results are integrated into a final model. This means that when a data point has to be extracted from an AI model, you don’t have to retrain the entire model, but rather only the part of it that was trained on that point.
While the concept is promising, the researchers who developed the technique say it is not without its limitations. For instance, decreasing the amount of training data per shard could lead to lower-quality outcomes and accuracy. In addition, the technique doesn’t address the central issue, which is how to remove from the model all traces of a selected data point without having to retrain it.
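To make the sharding idea concrete, here is a minimal, hypothetical sketch in Python. It is not the researchers’ actual implementation: the per-shard “model” here is just an average standing in for a real trained model, and all function names are my own. The point it illustrates is the structure of the approach: split the data into shards, train one model per shard, aggregate their outputs, and when a record must be erased, retrain only the shard that held it.

```python
from statistics import mean

def train_shard(shard):
    # Stand-in for real training: the shard "model" is its mean value.
    return mean(shard) if shard else 0.0

def train(data, num_shards=3):
    # Split the data into disjoint shards and train one model per shard.
    shards = [data[i::num_shards] for i in range(num_shards)]
    models = [train_shard(s) for s in shards]
    return shards, models

def predict(models):
    # Aggregate the per-shard models (here, by averaging their outputs).
    return mean(models)

def forget(shards, models, value):
    # Erase a data point: remove it and retrain only the affected shard,
    # leaving the other shard models untouched.
    for i, shard in enumerate(shards):
        if value in shard:
            shard.remove(value)
            models[i] = train_shard(shard)
            break
    return shards, models

shards, models = train([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(predict(models))              # prediction before erasure: 3.5
shards, models = forget(shards, models, 6.0)
print(predict(models))              # prediction after erasing 6.0: 3.0
```

The trade-off the researchers describe is visible even in this toy version: each shard model sees only a fraction of the data, and erasure changes the ensemble’s output.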
Nevertheless, it’s crucial that these efforts are underway and that research continues to develop in the field often known as ‘machine unlearning’. The right to be forgotten is in fact only one of the many rights guaranteed under the GDPR that is likely to be affected by the restricted functionality of ChatGPT and other similar tools.
The simple reason is that it’s not feasible to uphold rights including the right to rectification, the right to object and the right to restriction where there is little understanding of where data is held within an AI model.
Similarly, the stochastic nature of ChatGPT and other tools will likely make it difficult for individuals to receive clear and concise information about how their personal data is used, as required under the right to be informed.
Google has recently unveiled its own conversational AI called Bard, and if one thing is certain, it’s that others will soon follow.
The development of elaborate models capable of generating human-like responses is not in itself an issue. In fact, large language models may truly transform how we use technology and automate some tasks.
However, the unchecked development of the technology will surely come at the cost of the rights and freedoms of the individuals whose data is used to fuel it.
Consequently, legislators need to be at the forefront of this battle, ensuring that technology always develops in accordance with the principle of privacy by design.
Further, they must work hand in hand with the AI community to address open issues, including the traceability and extraction of personal data used to train specific AI models.
As Luengo-Oroz has said: “Maybe the future of AI is not just about learning it all (the bigger the data set and the bigger the AI model, the better) but about building AI systems that can learn and forget as humanity wants and needs”.