Active vs passive digital data collection: ethical risks of an ambiguous distinction

Alex McKeown
November 17, 2023

Introduction

Digital data collection offers enormous possibilities for richer health research and more detailed insights than can be achieved by standard methods such as questionnaires and interviews. It can be carried out in a less contrived way than these methods, as it can be done while individuals are engaged in normal daily activities, rather than requiring them to set time aside specifically to, for example, answer survey questions.

Given their ubiquity, smartphones are ideal devices for data collection, as they can harvest vast quantities of data relevant to numerous aspects of health. However, the collection of personal health data should, ideally, only be done if the person whose data is to be collected has consented to this in advance. Consent in the digital data collection era is not straightforward: it is complex and has ethical implications which must be negotiated.

Understanding the ethical risks that follow from this complexity is vital if you or your organisation collects, handles, uses, links, or shares personal data – especially (but not only) where this relates to health – and want to ensure that you can design and implement ethically robust data handling and governance processes.

In this article I focus in particular on an important aspect of digital data collection which contributes to the complexity of the context: the nature of the distinction between active and passive data collection. These two kinds of collection appear at first sight to be distinct, and each has its own ethical ramifications. However, despite this apparent difference, there are reasons to question how real the distinction is.

A central issue in digital health data governance is the ethical permissibility of how data is collected and used. Ethical permissibility follows from being able to uphold sufficiently the privacy rights of individuals while simultaneously ensuring that the state can protect the health, wellbeing, and safety of individuals and the wider population.

Establishing this balance is complex, as individual and societal interests can sometimes conflict. Because both have a legitimate claim to protection, some negotiation must be achieved. So, judgements of permissibility depend on whether modes of collection, linkage, sharing, and use are proportionate. The demand that decisions are proportionate is uncontroversial; but what counts as proportionate requires analysis, and it is in this matter that the article culminates.

To lay the ground for this analysis, I first outline some key issues relating to active and passive data.

Distinguishing Active and Passive Data Collection

Active data is the apparently more straightforward of the two. It broadly resembles a digital version of how you might ordinarily, for example, carry out research using traditional methods, insofar as when data is actively collected, consent is explicitly sought and given, and the person providing the data does so in response to specific questions or well-defined parameters.

The drawback of this, however, is that there may be some artificiality in the mode of collection, as the participant must take time out from normal activities to answer specific questions, and so on. In this respect it is a less naturalistic form of data and, to that extent, it might lack richness in providing insights into choices, decisions, wellbeing, and health.

Passive data collection, while more naturalistic, is also more complex. Passive data is collected automatically, in the background of daily activity rather than through overt engagement with the collection tool. For example, activity tracking apps collect real-time data without the need for ‘active’ user interaction, although the user will have given consent for the app to collect the data.

However, data might also be collected passively where consent is more ambiguous. For example, metadata about one’s location, choices, and activities is constantly being logged, and although consent might technically have been given for each kind of data point individually, it is not obvious whether consent has been given for the aggregation of these (Mittelstadt and Floridi, 2016). This is significant because the more data points available to be linked for a given individual, the more that can be known about that individual.
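The gap between consenting to each data point and consenting to their aggregation can be made concrete with a small sketch. Everything in it is hypothetical and for illustration only: the ConsentRecord structure, the allows_linkage flag, and the data type names are my assumptions, not a description of how any real app or platform records consent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConsentRecord:
    """One permission a user has granted (hypothetical structure)."""
    data_type: str        # e.g. "location"
    allows_linkage: bool  # did the user also agree to linkage with other data?


def linkage_is_covered(consents, data_types_to_link):
    """Return True only if every data type was consented to AND each
    consent explicitly extends to linkage with other data."""
    by_type = {c.data_type: c for c in consents}
    for data_type in data_types_to_link:
        record = by_type.get(data_type)
        if record is None or not record.allows_linkage:
            return False
    return True


# Made-up permissions for one user.
consents = [
    ConsentRecord("location", allows_linkage=False),
    ConsentRecord("step_count", allows_linkage=True),
    ConsentRecord("purchases", allows_linkage=False),
]
wanted = ["location", "step_count", "purchases"]

# Each data type was consented to on its own...
collected_individually = all(
    any(c.data_type == t for c in consents) for t in wanted
)
# ...but the aggregation of all three is not covered by those consents.
aggregation_ok = linkage_is_covered(consents, wanted)
```

In this toy case every individual permission exists, yet the check on the aggregation fails. The point of the sketch is only that the two questions are separate, and that a record of consent which cannot distinguish them leaves the second unanswered.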

So, the distinction between active and passive data has the appearance of clarity and coherence, but it might in fact be ambiguous, and several important ethical issues follow from this.

The first is that insofar as consent has been sought and granted prior to collection, there is a sense in which all data can be said to have been collected actively. Although someone might not be aware at the moment of collection that data is being collected, if consent to collect the data has been sought and granted, the granting undermines the apparent clarity of the distinction between ‘active’ and ‘passive’, to the extent that to be passively collected means that the individual is unaware of its collection. If someone knows that certain data will be collected at some point subsequent to having agreed to it, it is arguable that a component of the data collection process has been active.

The implication here is that someone who consents to naturalistic data collection compromises their defence if they are subsequently unhappy with what occurs, given that they could have known that the data would be collected. After all, the person could have refused to agree to collection in the first place. However, the credibility of this response depends on being certain that the individual was fully aware of all possible consequences of the subsequent data collection, and it is not obvious that this is always necessarily the case.

(Big) Digital Data Linkage

Part of the value of big data derives from linking data that previously might not have been useful and so not deemed significant (Metcalf and Crawford, 2016). In the current era, users invariably leave digital traces of all their digital interactions, however innocuous each trace might be in isolation. So, irrespective of whether data is collected actively or passively, it is important to understand the implications of being able to use data previously considered innocuous for a better understanding of the causal interactions relevant to, for example, making predictions about future retail preferences or health states.

Because passive data collection is so naturalistic, it is unobtrusive (Harari et al, 2016), which makes it easy to forget about, thus making the status of consent from moment to moment unclear. This matters, because it creates a trade-off between the value of naturalistic data for generating insights that might be useful for commercial organisations or healthcare providers, and how this naturalism might undermine robust and unambiguous standards of consent.

The ethical concern here is not that an app collects data passively as such. Rather, what matters is how these data sets might be linked in ways that are hard to predict or anticipate, and whether meaningful consent has been given to this being done. Because passive data collection is less intrusive, consent requires a higher level of understanding by the participant. Moreover, a probable consequence of the volume and complexity of data and how it is produced is that, eventually, data about any given individual will be collected and used in a way that the individual did not intend, anticipate, or consent to (Jardine et al, 2015). This is not only a risk for individuals, but also a challenge for the public acceptability of data-driven health research or commercial analysis, if there is widespread scepticism about the security of one’s data once one provides it.

Consent

Although active data might appear to guarantee consent more robustly than passive data, it might not in fact ensure that someone would be happy with all uses of the data. Information posted on X / Twitter, for example, is publicly available unless the user has set their account to private, so the data can be used by anybody, for any lawful purpose, without necessarily seeking permission.

An obvious reply here is that it is an individual’s responsibility whether or not to consent and nobody is compelled to use X / Twitter. Nevertheless, this illustrates that just because consent is active, it may not be exhaustive (Rivers et al, 2014). With this in mind, we might question whether responsibility should be placed purely on individuals, given the huge variety of uses to which data can be put, or whether those organisations collecting the data should give greater consideration to the ethical implications of how users generate data via their chosen platforms.

There are instances in which the X / Twitter case is relevant for healthcare in particular, as the question of responsibility is pertinent to both. We are all, no doubt, aware of a growing emphasis on the purported need for individuals to take more responsibility for maintaining their health, rather than only relying on clinicians to restore their health once they become ill (Swan, 2012), given pressures on scarce resources and increasing understanding of lifestyle determinants of health and disease.

While laudable in certain respects, the risk of increasingly shifting responsibility onto individuals is that too much responsibility is placed on them. Although the rhetoric of responsibility for health is often framed in terms of empowerment, it can mask the reality that some proportion of the population is likely to be stranded, less able to take full responsibility for living healthily, and, in a digital context, less aware of the extent to which their health might be subject to continual monitoring.

Ubiquity

In the contemporary big data era, data collection and production are ubiquitous (Chen et al, 2012). This has significant ethical implications. As I mentioned earlier, smartphones are ideal data collection devices because of the ease with which they unobtrusively accompany daily activities. However, most breaches of medical data involve mobile phones, and there are many ways this can occur: someone might leave their phone somewhere or they may leave their phone unlocked, in both cases making the content vulnerable; they might be targeted by hackers; or they might use apps with security vulnerabilities.

Whatever the source of the risk, all are relevant to the public acceptability of digital data collection, given that this depends on certain standards of privacy and security being met.

The ubiquity of smartphones is also associated with environments geared towards their use. For example, smartphones can be used to pay for travel, they can help with navigation by using GPS, they can search for amenities desired by the user, and so on. Although active permission will have to be given by the user for the phone to collect their data and use it for a given function without revealing their identity, increasing the number of data points compromises the user’s anonymity in an environment arranged for the benefit of the user. As cities become increasingly ‘smart’, and not only users but their environments become characterised by ubiquitous data collection and production, the risk of over-collection and inadvertent identification through triangulation of numerous data points is high (Li et al, 2020).

This matters if personal information is collected from which the user has a strong interest in remaining unidentifiable.
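To illustrate how triangulation works, the following sketch counts how many people in a tiny, entirely made-up dataset share the same combination of attributes as more attributes are linked. The size of the smallest such group is the 'k' in what is usually called k-anonymity. The records, attribute choices, and function name are my own assumptions for the purpose of illustration.

```python
from collections import Counter

# Entirely made-up records: (home_district, usual_commute_time, gym_member)
records = [
    ("north", "08:00", True),
    ("north", "08:00", False),
    ("north", "09:00", True),
    ("north", "09:00", True),
    ("south", "08:00", False),
    ("south", "08:00", False),
    ("south", "09:00", True),
    ("south", "09:00", False),
]


def smallest_group(records, n_attributes):
    """Size of the smallest group of people who share the same values on
    the first n_attributes attributes."""
    groups = Counter(r[:n_attributes] for r in records)
    return min(groups.values())


# Link one, then two, then three attributes per person.
k_by_attributes = [smallest_group(records, n) for n in (1, 2, 3)]
print(k_by_attributes)  # [4, 2, 1]
```

With one attribute, everyone is hidden in a group of at least four; with three, at least one person is unique. None of the attributes identifies anyone on its own, which is why consenting to each in isolation says little about the risk of their combination.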

A consequence of ubiquitous data collection and production is that individuals might not grasp the implications of granting permission to numerous providers of digital data services that emerge as those permissions aggregate. Of course, it is always possible to secure consent from most people for both active and passive data collection, and the importance of the giving of consent cannot be overlooked. However, this is still vulnerable to the reply that some individuals might be unaware of what follows from granting an increasing number of permissions, in terms of the risks to their anonymity that this entails.

The value of big data follows from the increased predictive accuracy available from increasing and linking the number of data points for a given individual (Bauer et al, 2017). This holds true in commercial and healthcare contexts, and for those collecting the data there is, therefore, an interest in maximising the amount of data that can be harvested.

However, the risk of ineffective data governance increases as the amount of data proliferates. For example, the aggregation of data will mean that copies are made and digital footprints remain beyond the boundaries of the tool that was first used to collect a particular data point. And the risk here is that it will not be obvious to all users what the implications are of allowing their personal data to be increasingly linked.

The risks associated with linked personal data and the traces that are left by their linkage have been baldly summarised by Lyon (2014), who states simply that ‘metadata is surveillance’. Data linkage is done to achieve increasingly fine-grained insights into choices, motivations, health states, and so on. Even if one fastidiously reads all the small print associated with digital data collection tools before agreeing to one’s data being harvested, the risk of identification remains, simply by being in a hyperconnected environment. The exponential growth in surveillance – even if one resists the negative connotations of the word – increasingly makes one transparent, irrespective of one’s wishes. Simultaneously, however, because of uncertainty about attribution of accountability in data linkage, the identity of who or what organisation should be held responsible for surveillance becomes increasingly opaque.

This is highly ethically relevant in healthcare, where some of the data that one might provide is of a personal nature and in which one is likely to be acutely interested in retaining control of one’s identity and anonymity.

Error

Important concerns follow from collecting more data than can be governed effectively, and these relate to risks of error either in predictions about an individual, or in wrongly identifying a given individual. The more sources of data that are being used, the more uncertainty and ‘noise’ there will be in the eventual aggregation of the data (Zhu et al, 2015).
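A rough way to see why more sources mean more room for error is the following sketch. It rests on a deliberately simple assumption of my own, not taken from the literature cited here: that each source is linked to an individual with the same small, independent chance of a wrong match. Real linkage errors are unlikely to be independent or uniform, so the figures are indicative only.

```python
def chance_of_any_wrong_link(per_source_error, n_sources):
    """Probability that at least one of n_sources linkage steps is wrong,
    assuming each step fails independently with the same probability."""
    return 1 - (1 - per_source_error) ** n_sources


# Assumed 2% chance of a wrong match per source (an illustrative figure).
for n in (1, 5, 20):
    print(n, round(chance_of_any_wrong_link(0.02, n), 3))
```

Under this assumption, a 2% error rate per source becomes roughly a 10% chance of at least one wrong link across five sources, and roughly a one in three chance across twenty.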

Moreover, as data – whether collected actively or passively – become more complex, how algorithms arrive at predictions based on those data becomes increasingly opaque, because the purported effectiveness of the big data approach derives from the ability of algorithms to make predictions that go beyond human capacity.

In circumstances of increasing data complexity, therefore, the risk increases that the algorithm might make erroneous predictions which human mediators will not be able to identify as incorrect.

Risks of algorithmic error do not follow from algorithms alone. We have all become aware of risks caused by the algorithm having been programmed by a human. For example, if biases or assumptions held by programmers are unacknowledged, they can influence the predictions that the algorithm makes, and in these cases the algorithm’s outputs will be inaccurate (Monteith and Glenn, 2016).

Proportionality, Individual and Public Goods

The justification for collecting, storing, and linking digital data is grounded in its value for achieving some individual or public goal. Individual and public goals, however, might not always coincide and can be in tension (McKeown et al, 2019).

For example, it might be in the public’s interest that individuals identified as ‘likely’ to engage in criminal behaviour are subjected to surveillance and tracking, but this is not necessarily in the interest of the tracked individual, given the threat to their liberty. Alternatively, those interests might coincide, for example in infection transmission tracking during a pandemic.

In both cases, we can broadly say that tracking and surveillance are proportionate to the risk. However, in the pandemic case, arguments for surveillance and tracking that otherwise might be considered unduly intrusive are predicated on the public value of doing so, given that tracking is not restricted to specific individuals on the basis that their behaviour is judged as ‘likely’ to be criminal. This is to say that the purported public good in this case overrides the importance of a minority of individuals’ right to privacy, despite their not having acted in ways which would usually warrant such an intrusion.

Implicit in judgements like this is that the number of beneficiaries is used to determine moral permissibility. As such, nothing, at first sight at least, is off the table with respect to what degree of individual privacy could be justifiably sacrificed in the name of public good. However, while the demand that data usage decisions should be proportionate might often appear straightforward, determining proportionality is complex.

Möller (2013) suggests the principle of proportionality ‘is best understood as providing guidance in the structured resolution of a conflict between a (prima facie) right and another right or a public interest’. The standard approach to determining proportionality has four tests, which Möller (Ibid.) summarises as follows:

‘…first, the interference must serve a legitimate goal; second, it must be suitable for the achievement of that goal (suitability, rational connection); third, there must not be a less restrictive but equally effective alternative (necessity); and fourth and most importantly, the interference must not be disproportionate to the achievement of the goal (balancing, proportionality in the strict sense)’

Key among these components is the final test – proportionality proper, or ‘balancing’. Whether it is proportionate to override the interests of individuals in the name of public good fundamentally requires a weighing of these concerns.

In our case, the terms of acceptability for collecting, linking, and otherwise using active or passive digital data require a balance to be struck between a public and a private good, which may be in tension. Although each test is challenging, the final one is the least empirically verifiable and so is, as Schlink (2012) writes, ‘the most contested step’, because ‘balancing involves not facts but values and value judgements’. Proportionality, then, culminates in a value judgement which finds an equilibrium between competing moral considerations, finding in favour of one or other via a reasoned justification.
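For an organisation wanting to document such an assessment, the four tests quoted above can be treated as an ordered checklist. The sketch below is one hypothetical way of recording a reviewer's answers. The class, field, and function names are my own, and the code does not compute the balancing: it only records a human judgement and insists that a rationale accompanies it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProportionalityAssessment:
    """A reviewer's recorded answers to the four tests (hypothetical form)."""
    legitimate_goal: bool
    suitable: bool
    necessary: bool   # no less restrictive but equally effective alternative
    balanced: bool    # the reviewer's value judgement, not a computed fact
    balancing_rationale: str = ""


def first_failed_test(a: ProportionalityAssessment) -> Optional[str]:
    """Return the name of the first test that fails, or None if all pass.

    The balancing test also fails if no written rationale is given, since
    a value judgement needs a reasoned justification.
    """
    if not a.legitimate_goal:
        return "legitimate goal"
    if not a.suitable:
        return "suitability"
    if not a.necessary:
        return "necessity"
    if not a.balanced or not a.balancing_rationale.strip():
        return "balancing"
    return None


# Made-up example: a less intrusive alternative exists, so necessity fails.
assessment = ProportionalityAssessment(
    legitimate_goal=True,
    suitable=True,
    necessary=False,
    balanced=True,
    balancing_rationale="Public health benefit judged to outweigh intrusion.",
)
result = first_failed_test(assessment)
print(result)  # necessity
```

The tests are checked in order, so a proposal that fails an earlier test never reaches balancing. Requiring a rationale is a design choice on my part, intended to keep the value judgement visible rather than hidden behind a ticked box.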

The unavoidable subjectivity of this balancing underlines the need for the care and the seriousness with which proportionality testing should be carried out, not least because where individual and state or other institutional or organisational interests are in tension, there is likely to be an imbalance in power in favour of the latter, and against those of the individual.

This is not to say, of course, that state or organisation and individual interests necessarily conflict. For instance, it might well be in an individual’s interest that services such as healthcare, which operate in the name of the state, monitor them and protect them from harm. Indeed, Möller (Ibid.) goes on to underline how the two sets of interests can be complementary, using the example of a newspaper’s publication of military plans. Here, it might be true in one regard that there is a tension between a state’s interest in national safety and the newspaper’s freedom of speech. However:

‘That a marketplace of ideas exists—a marketplace in which the press follows, monitors, and criticizes the state’s and also the military’s actions—is also in the state’s own interest; the state has an interest in its citizens’ use and enjoyment of freedom of speech’

Although it is reasonable to track and intervene in certain kinds of behaviour, proportionality demands that this is done with due consideration for the freedoms of individuals, as far as possible. It is vital for a proportionate response that individuals are able to live according to what they value, since, as Möller (Ibid.) makes clear:

‘…what is protected by constitutional rights is not an entitlement to live one’s life in accordance with some objectively valuable way of life, but rather one to live it in accordance with the agent’s (‘subjective’) self-conception: the agent is prima facie entitled to live his life in accordance with his views on who he is and who he would like to be’

The importance of this is constituted by the absence of an objective standard of value in ethics and its role in just and fair governance. Since proportionate decision-making is always a negotiation between legitimate interests (this is to say, apart from instances where something that someone wishes to do is categorically ethically forbidden, such as torture), individuals are entitled to have their interests taken seriously in terms of what is done with their personal information.

However, as the analysis I’ve carried out here shows, what is in fact proportionate depends on the facts of particular individuals in particular circumstances, as overt tracking and monitoring might be unreasonably intrusive in instances where individuals pose no risk to themselves or others.

So, while the demand for proportionality in data governance – encompassing the modes of data collection and the uses to which they are put – is uncontroversial, the question of what is proportionate in a particular instance requires careful case-by-case consideration. This in turn is especially important when considering the ethical implications of active and passive forms of data collection, given the number of variables – capacity, age, ubiquity, consent, socio-economic factors, and so on – which come to bear on the legitimacy of the two forms of data collection, and all the more so when health is at stake.

Conclusion: why does all this mean you should hire IGS?

What I’ve tried to highlight here is the complexity of the data-ubiquitous world in which we live, even if this complexity is not always obvious, partly because data communication infrastructures are often designed to, in some sense, make life more convenient (even if the systems involved are often in practice imperfect or have flaws which inhibit their efficiency).

Any organisation which collects, handles, analyses, uses, or links personal data for commercial or health-related purposes has an ethical obligation to ensure that it does so in a way that promotes trust in those data governance processes. Trustworthiness cannot and should not be assumed: the onus is on the organisation, as the custodian of the personal data, to give appropriate consideration to the interests of those people whose data it holds.

I have used the example of the distinction between active and passive data to show something of the extent to which an organisation should consider the implications of its data governance processes, and the care that it should take when thinking about possible consequences further down the line for the individuals whose data it holds, given the likelihood of data linkage occurring in ways that were unpredictable or opaque to those individuals.

Categories such as active and passive data might appear distinct, but appearances can be deceptive, not least in an environment in which none of us, if we are honest, is entirely sure exactly what we have and have not consented to being known about us, or open to inference about us from data points to whose collection we have consented.

Given that apparently clear distinctions can be revealed as unstable when subjected to detailed critical scrutiny; given that this instability can have potentially troubling consequences which might pose a threat to privacy or liberty; and given that it is the ethical responsibility of organisations which are custodians of personal data to ensure that they have engaged as far as possible with those people whose data they wish to collect about what the consequences – both intended and unintended – might be, you and your organisation should take steps to do this.

If you find the analysis here useful for thinking in a more detailed way about the (data) ethical implications, limitations, or risks of your organisation’s data governance processes, and if it helps you to identify areas where expert advice could help you to mitigate these risks, our data ethics consultants at IGS are ideally placed and available to provide that advice.