Federated Learning: A Novelty to Break the Quagmire of Isolated Data Islands?

A persistent challenge for heavily regulated industries, from healthcare to finance, when training machine learning algorithms is what is commonly referred to as “data islands”. Faced with increasingly strict privacy restrictions, organisations in these industries regularly struggle to share their data for machine learning development and consequently become data islands, secluded from one another. Because of its limited size, a single organisation’s data island is rarely enough on its own to train a model that requires a very large amount of data.

The concept of Federated Learning

To tackle this predicament, federated learning (FL) is gaining global attention. FL is a new approach to collaboratively training a machine learning model: each participating party downloads a shared model, usually a pre-trained foundation model, trains it on its own data and shares only the resulting local model updates, while the original data are stored locally and never transferred. FL therefore appears to be a promising solution for preserving privacy through data decentralisation and minimisation. Because the datasets stay distributed, the original data do not leave the participating parties, which avoids gathering and processing data in a central location.
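The round-based process described above can be sketched in a few lines of Python. This is a minimal illustration rather than any vendor's implementation: it assumes a one-parameter linear model, three hypothetical parties whose private data are generated synthetically, and a server that averages the returned weights in proportion to each party's dataset size (the rule usually called federated averaging). The names `local_update`, `federated_round` and `make_party` are invented for this example.

```python
import random

def local_update(w, data, lr=0.1, epochs=5):
    """Run a few gradient-descent steps on one party's private (x, y) pairs."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w

def federated_round(global_w, parties):
    """One round: every party trains locally, the server averages the results."""
    results = [(local_update(global_w, data), len(data)) for data in parties]
    total = sum(n for _, n in results)
    # Weighted average by local dataset size; raw data never reach the server.
    return sum(w * n for w, n in results) / total

def make_party(n, rng, true_w=3.0):
    """Hypothetical private dataset drawn from y = true_w * x + noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    return [(x, true_w * x + rng.gauss(0, 0.1)) for x in xs]

rng = random.Random(0)
parties = [make_party(n, rng) for n in (20, 50, 30)]

global_w = 0.0
for _ in range(30):
    global_w = federated_round(global_w, parties)

print(f"learned weight: {global_w:.2f}")
```

The only values that cross the boundary between a party and the server are the model weights, which is the property the rest of this article examines.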

A typical application of FL is prediction services for mobile users, such as Apple’s emoji suggestions and Google Keyboard.[1] By leveraging the local model updates computed on and sent from each mobile device, technology companies can improve the accuracy of a prediction model without accessing users’ original data. Mobile users in turn benefit from accurate emoji or next-word predictions without sharing their input history with Apple or Google. Another potential example is training an AI model for disease diagnosis without sharing the original medical data, which are subject to stricter regulation as special categories of data under data protection law. Within the FL framework, each hospital shares local model updates that are used to build and refresh a powerful model on the central server for detecting disease, for example screening for COVID-19 from chest X-ray images.[2]

Privacy Challenges

Whilst it might appear that FL could become a one-size-fits-all approach to training machine learning algorithms, certain privacy issues need to be resolved before FL can be widely implemented.

The first and foremost issue, and our biggest concern, is indirect data leakage. Although the original data that could identify a data subject are not disclosed to the central server, the shared model updates could still indirectly leak potentially identifiable data, because the updates preserve features and correlations learned from those data. By observing model updates and how they change over time, a malicious attacker could extract these features and correlations and go on to deduce whether a given data subject is present in the original data. This form of unlawful processing is called an “inference attack”.[3] For example, by monitoring recurrent model updates from a prediction service like Google Keyboard, which is trained on users’ text data, an attacker could extract sensitive text patterns and potentially deduce the password to a bank account. The consequences could be more devastating still if FL is used to train disease-detecting algorithms on medical data. An attacker could potentially infer whether a data subject has been diagnosed with a particular disease, such as Alzheimer’s, from changes in model updates derived from the treatment data of Alzheimer’s patients.
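A deliberately simplified illustration of why an update can leak text patterns follows. It is not the membership inference attack described in [3]; it assumes a toy “unigram” keyboard model with one score per word, a hypothetical five-word vocabulary, and an observer who sees a single raw gradient and knows the current global model. Under those assumptions the gradient equals the model's predicted probabilities minus the user's word frequencies, so the frequencies can be read straight back out.

```python
import math

VOCAB = ["hello", "meeting", "password", "hunter2", "thanks"]  # hypothetical

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def local_gradient(logits, typed_words):
    """Average cross-entropy gradient of the toy model on one user's text."""
    probs = softmax(logits)
    grad = [0.0] * len(VOCAB)
    for word in typed_words:
        target = VOCAB.index(word)
        for i, p in enumerate(probs):
            grad[i] += (p - (1.0 if i == target else 0.0)) / len(typed_words)
    return grad

def recover_frequencies(logits, grad):
    """Observer side: gradient = probs - frequencies, so subtract it back out."""
    probs = softmax(logits)
    return {w: round(p - g, 6) for w, p, g in zip(VOCAB, probs, grad)}

global_logits = [0.0] * len(VOCAB)                  # known to every participant
private_text = ["hello", "password", "hunter2", "password"]  # never transmitted
update = local_gradient(global_logits, private_text)
leaked = recover_frequencies(global_logits, update)
print(leaked)
```

Real models are far larger and updates are usually aggregated across many users, which makes extraction harder, but the sketch shows that “not sending the data” is not the same as “not revealing the data”.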

To mitigate this risk of indirect data leakage, FL developers, who determine the purposes and means of the processing of original data and thus act as data controllers, could implement privacy-enhancing technologies (PETs) to ensure the security of processing required under data protection law. Two of these technologies, differential privacy and homomorphic encryption, could potentially prevent data leakage in the FL framework. Under differential privacy, random noise is added to the original data or to the model updates derived from them, so that an attacker cannot reliably tell what any single individual contributed.[4] Yet adding noise may significantly compromise model accuracy. Depending on the circumstances, this trade-off between privacy and accuracy could make the use of this technology questionable. Homomorphic encryption could arguably serve as a pragmatic alternative: each participating party encrypts its local model updates before transferring them to the central server, the updates are aggregated while still encrypted, and only once a sufficient number of local updates have been aggregated can the result be decrypted.[5]
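Both safeguards can be sketched together, under clearly stated assumptions. The first function clips an update and adds Gaussian noise, the usual pattern behind differentially private training; the `clip_norm` and `noise_std` values are arbitrary here, and a real deployment would calibrate them to a formal privacy budget, which this sketch does not do. The second part is a toy Paillier cryptosystem, a well-known additively homomorphic scheme, using two small Mersenne primes purely for readability; it is not secure at this key size, and the question of who holds the decryption key is left out. All function names are invented for the example.

```python
import math
import random
import secrets

# --- Differential privacy: clip the update, then add Gaussian noise ---------
def clip_and_noise(update, clip_norm, noise_std, rng):
    """Bound one party's influence (clipping), then blur it with noise."""
    norm = math.sqrt(sum(v * v for v in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [v * scale + rng.gauss(0.0, noise_std) for v in update]

# --- Additively homomorphic encryption: toy Paillier ------------------------
# ASSUMPTION: small well-known primes for illustration only; NOT secure.
P, Q = 2**31 - 1, 2**61 - 1
N, N2 = P * Q, (P * Q) ** 2
LAM = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)   # private key part
MU = pow(LAM, -1, N)                                 # private key part
SCALE = 10**6                                        # fixed-point precision

def encrypt(value):
    m = round(value * SCALE) % N          # negatives wrap around modulo N
    r = secrets.randbelow(N - 1) + 1      # fresh randomness per ciphertext
    return (1 + m * N) * pow(r, N, N2) % N2

def add_encrypted(c1, c2):
    return c1 * c2 % N2                   # multiplying ciphertexts adds plaintexts

def decrypt(cipher):
    m = (pow(cipher, LAM, N2) - 1) // N * MU % N
    if m > N // 2:                        # undo the wrap-around for negatives
        m -= N
    return m / SCALE

rng = random.Random(0)
updates = [[0.12, -0.40], [0.30, 0.05], [-0.02, 0.25]]   # hypothetical local updates
noisy = [clip_and_noise(u, clip_norm=1.0, noise_std=0.01, rng=rng) for u in updates]

# The aggregator combines ciphertexts coordinate-wise without reading any update.
encrypted_sum = [encrypt(0.0), encrypt(0.0)]
for u in noisy:
    encrypted_sum = [add_encrypted(acc, encrypt(v)) for acc, v in zip(encrypted_sum, u)]

aggregate = [decrypt(c) / len(noisy) for c in encrypted_sum]
print(aggregate)
```

Raising `noise_std` makes individual contributions harder to distinguish but pushes the aggregate further from the true average, which is the privacy and accuracy trade-off noted above; the encryption step, by contrast, changes only who can see the updates, not their values.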

The second issue concerns data deletion where a user who initially consented to their data being processed for the purpose of training machine learning models later withdraws that consent. Under data protection law, a data subject is entitled to ask the controller to delete their data if consent is withdrawn. In the case of FL, effective compliance with the right to withdraw consent would require FL developers not only to delete the departing party’s data, but also to delete the associated model updates from the central server. However, by the time a user withdraws consent, their data have already shaped the central model, and that influence is difficult to erase. To avoid retraining the model from scratch, a method proposed by IBM in 2022, which unwinds the model only to the point at which the now-erased data were added, appears to be an appropriate solution.[6] It could satisfy data protection requirements while allowing FL training to continue. However, given that this method is still at the experimental stage, its progress needs to be tracked to fully understand its implications for data privacy.
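The idea of unwinding to the point at which a party's data were added can be sketched as a rollback and replay. This is not the method in [6]; it is a simplified illustration that assumes the server stored a checkpoint before every round, that rounds are deterministic, and that the remaining parties are willing to repeat the later rounds. The party names, values and the toy `train_round` rule are all hypothetical.

```python
LOCAL_VALUE = {"hospital_a": 1.0, "hospital_b": 2.0, "hospital_c": 9.0}  # hypothetical
JOIN_ROUND = {"hospital_a": 0, "hospital_b": 0, "hospital_c": 3}
ROUNDS = 6

def train_round(model, active):
    """Toy stand-in for a federated round: step halfway toward the parties' mean."""
    if not active:
        return model
    target = sum(LOCAL_VALUE[p] for p in active) / len(active)
    return model + 0.5 * (target - model)

def active_in(parties, r):
    """Parties that have already joined by round r."""
    return [p for p in parties if JOIN_ROUND[p] <= r]

def train_with_history(parties, rounds):
    """Train and keep a checkpoint of the global model before every round."""
    model, checkpoints = 0.0, []
    for r in range(rounds):
        checkpoints.append(model)
        model = train_round(model, active_in(parties, r))
    return model, checkpoints

def erase_party(party, parties, rounds, checkpoints):
    """Rewind to the checkpoint taken just before `party` joined, then replay."""
    start = JOIN_ROUND[party]
    model = checkpoints[start]            # last state untouched by `party`
    remaining = [p for p in parties if p != party]
    for r in range(start, rounds):
        model = train_round(model, active_in(remaining, r))
    return model

parties = list(LOCAL_VALUE)
model, checkpoints = train_with_history(parties, ROUNDS)
erased = erase_party("hospital_c", parties, ROUNDS, checkpoints)
print(model, erased)
```

In this toy setting the rewound model matches what training without the departing hospital would have produced, while only the rounds after it joined are repeated. A party that joined in the first round would still force a full retrain, which is one reason more refined unlearning methods are being researched.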

Conclusion

As a state-of-the-art technology, FL offers an efficient way to mitigate the isolation of data islands. It not only enables new applications in heavily regulated industries while preserving data privacy, but may also facilitate the training of machine learning models with data on a global scale. Although FL is still under development, we may soon witness its large-scale implementation for the purpose of training machine learning models.


[1] Q. Li, Z. Wen, B. He et al. A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. https://arxiv.org/abs/1907.09693. Accessed 03 March 2023.

[2] I. Feki, S. Ammar, Y. Kessentini, K. Muhammad. Federated Learning for COVID-19 Screening from Chest X-ray Images. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7979273/. Accessed 03 March 2023.

[3] Information Commissioner’s Office. Privacy Attacks on AI Models. https://ico.org.uk/about-the-ico/media-centre/ai-blog-privacy-attacks-on-ai-models/. Accessed 03 March 2023.

[4] IBM. What is Federated Learning? https://research.ibm.com/blog/what-is-federated-learning. Accessed 03 March 2023.

[5] Information Commissioner’s Office. Chapter 5 of draft anonymisation, pseudonymisation and privacy enhancing technologies guidance: Privacy-enhancing technologies (PETs). https://ico.org.uk/media/about-the-ico/consultations/4021464/chapter-5-anonymisation-pets.pdf. Accessed 03 March 2023.

[6] IBM. Federated Unlearning: How to Efficiently Erase a Client in FL? https://research.ibm.com/publications/federated-unlearning-how-to-efficiently-erase-a-client-in-fl. Accessed 03 March 2023.
