Privacy is complicated. If you’ve spent more than a few minutes seriously thinking about it, you have already realized this. Like any other value, it comes with trade-offs, limits, and downsides. A truly private society cannot exist, and would probably suck pretty hard if it did – forget the complications of law enforcement, simple things like me telling friends about my day would become fraught because they might share an aspect of an anecdote with someone else without my explicit knowledge or consent. The moment someone else knows a piece of information, you immediately lose control of that information to some degree. Now, in the case of me telling a funny story about how I accidentally grabbed someone else’s coffee because our names are the same, that loss of control is probably harmless, but again, I’m talking about perfect privacy, and the moment you start talking in absolutes things have a way of falling apart. 1
Moral value tends to work the same way. A lot of consequentialist-type systems 2 allow trade-offs between values to some extent – we generally place a high value on honesty, but we might agree that life is more important in at least some cases, so if we’re going to allow dishonesty we need to be satisfied that the gain is worth the downside. Kant’s murderer at the door is a good illustration – most people would agree that lying in that case is at least somewhat permissible. Now let’s treat some value (privacy) as having infinite weight 3. Any compromise of that value is, by definition, doing more harm than is gained by the compromise, which makes it morally impermissible, even in the face of what we would consider atrocities – killing a person is more permissible than searching someone’s house against their will when their neighbor heard screams. Torturing a child is preferable to surveilling a known druglord’s factory. Flooding a hospital with poison gas is moral in comparison to running the licence plate of a car seen fleeing a crime scene. It’s the Repugnant Conclusion, only worse. As John Henry Newman said:
The Church aims, not at making a show, but at doing a work. She regards this world, and all that is in it, as a mere shadow, as dust and ashes, compared with the value of one single soul. She holds that, unless she can, in her own way, do good to souls, it is no use her doing anything; she holds that it were better for sun and moon to drop from heaven, for the earth to fail, and for all the many millions who are upon it to die of starvation in extremest agony, so far as temporal affliction goes, than that one soul, I will not say, should be lost, but should commit one single venial sin, should tell one wilful untruth, though it harmed no one, or steal one poor farthing without excuse. She considers the action of this world and the action of the soul simply incommensurate, viewed in their respective spheres; she would rather save the soul of one single wild bandit of Calabria, or whining beggar of Palermo, than draw a hundred lines of railroad through the length and breadth of Italy, or carry out a sanitary reform, in its fullest details, in every city of Sicily, except so far as these great national works tended to some spiritual good beyond them.
I think most people who aren’t moral absolutists would say that that conclusion offends even our most basic moral intuitions and reasoning.
So the question about privacy is not ‘how do we become more private?’, but ‘is this particular increase in privacy worth the trade-offs?’ As the head of Australia’s spy agency recently argued: “privacy is important but not absolute”. While I have the strongest confidence he meant “and therefore we have the absolute right to invade whatever privacy you think you do have for any or no reason whatsoever”, his statement is not in and of itself wrong. No right is absolute, because the moment you treat something as absolute you make it impossible to trade off against, which as we’ve seen results in absurd, repugnant conclusions.
I don’t want this post to be misunderstood. I believe privacy is important. I believe society is massively out-of-balance in regard to privacy, and all of the political and economic powers are incentivised, and actively trying, to worsen that imbalance for their own agendas, which are not the same as the interests of average people. The surveillance state and surveillance capitalism are both extremely real and extremely harmful, in both direct and indirect ways, and although we’re already seeing abuses of the status quo by government, corporate, and individual actors, I truly believe things are going to get much, much worse.
But the ideal society is not the maximally private one. I’m not sure exactly what it looks like, but it’s transparently obvious it’s not that. In addition, by treating privacy as a sacred value, you lose the ability to properly understand it and how it works on a conceptual level, and thus how to better protect it. In the same way that someone who thinks about privacy as the ability to control the revelation or concealment of information will protect their privacy better than someone who thinks privacy and secrecy are synonymous, if you understand privacy as existing in a network of values, navigated to achieve some material end, you’re better able to understand the tensions that exist within it.
In order to illustrate this, I’m going to largely use an experience I’ve had as a case study. I’m doing this in part to justify why I’ve been absent for months (doing a study takes a lot of time and energy at the best of times), but also to highlight aspects I hadn’t thought about, in the hope of helping others think about these things.
I want to be clear about my goal here. It is not to argue for or against privacy, or even for or against any particular solution to this particular problem. It is to acknowledge some of the complexity of respecting participants’ privacy to the maximum possible degree – there’s real, serious risk in doing that. I’m in a genuinely difficult position now specifically because I tried to do what I sincerely believe is the right thing.
I’m talking about this in the context of research, but the principles generalise pretty neatly, I think.
Context
So over the last few months I’ve been working on completing a study. The details of what the study is about aren’t important, so I’ll either exclude them or, if something must be said, tell an obvious lie. But the process of doing research has really informed how I view privacy, especially in a research context. For various theoretical and logistical reasons, I was focusing on a specific population (let’s pretend Zimbabwean oil sheikhs), and for practical reasons I was mostly – though not solely – gathering the data online.
Out of a sense of research ethics and concern about privacy, I wanted to gather as little information about my participants as possible. There’s a rule that ‘you can’t leak what you don’t know’, and a general ethical obligation to protect your participants as much as possible. So while I needed to gather certain demographic information so I could describe my sample and possibly discern any odd patterns in the data, and I needed to record the data being gathered itself, I took steps to limit what I could gather – I actively prevented my system from gathering IP addresses, I did not require participants to show their face if interviewed online, and I required no identifying information from the participants beyond very broad demographic information. I didn’t need it, there was no value to having it, so I didn’t gather it. In general, I think most privacy-conscious people would approve.
Recruitment was extremely poor for months, despite participants being well paid in anonymous gift cards. I chased organisations to promote my study, I posted on social media, I put up flyers, I argued with the ethics committee about letting me hand people flyers 4. I worked really, really hard and got basically nowhere, until one of my supervisors (who is basically a rockstar in a closely related area to mine) posted about it on their social media. Suddenly, I was inundated with participants! More than I knew what to do with!
Problem
Then I started to notice trends. I won’t go into detail, but suffice it to say I had what I believed – and still believe – to be strong evidence that at least some of these participants were not Zimbabwean oil sheikhs, but were merely pretending to be in order to get the money I was paying participants. This was a disaster – not only was I losing funds I was already limited on (which meant losing legitimate participants), but if I used data I was not confident came from my target demographic I risked invalidating not only my study but any work that built on it. Yet if I were to exclude someone who is part of my population based on suspicion alone, I’d be running the risk of presenting a sample driven not by rigorous adherence to criteria, but by my own biases.
So I went to my supervisors, because helping me with these issues is part of their job. They asked some good questions, then suggested some things I could check, but those were either things which had also been present in the confirmed-legitimate sample, or things I had prevented myself from checking, like IP addresses. In addition, for various reasons I was on a tight time limit – I couldn’t slam on the brakes, sort out the issues, and then proceed with more confidence.
Hence the conundrum: do I accept these people at face value and risk introducing inappropriate data and building bad theory on top of it, or do I cut participants (who might be legitimate, which would mean making the mistake described above) and have less data to build theory on, lessening the credibility of my work and sabotaging my own long-term career in the process? Or do I avoid the situation in the first place by gathering more information about my participants than I was planning on analysing? And if we choose the last option, how far do we go? IP addresses are pretty trivial to fake with a VPN or strategic use of Tor, so do I require geolocation data? Verified addresses? Formal ID?
Analysis
Eligibility fraud is a broad term which basically boils down to pretending to be eligible for some benefit (e.g., welfare, some program, participation in research) which you wouldn’t normally be eligible for. As you can imagine, estimates of prevalence in the case of welfare and government programs are very politicised and vary widely depending on who’s doing the estimating and the specific context, but one attempt to quantify it in a research context suggested over half of the people who completed their survey were fraudulent, and they actually had pretty decent controls in place (e.g., participants had to show paperwork and ID); another found about 7% were, while yet a third (paywalled) estimated 80%. This article talks about cases where participants attempted to pretend to be trained medical professionals yet were apparently unable to answer even basic questions about the job 5. Part of the reason for that range, I think, is that it can be extremely difficult to establish whether any given participant is in fact fraudulent – sometimes it’s reasonably clear, but it’s often not. Anecdotally, many researchers I’ve spoken to who do online research have said they avoid giving cash incentives precisely because of this issue – it’s literally not worth the hassle.
Fraudulent participants can take multiple forms. The most common is people pretending to be something they aren’t (e.g., Zimbabwean oil sheikhs, nurses) in order to gain monetarily, but other forms include the same participant doing the study multiple times, or setting up bots to perform tasks repeatedly, in order to get paid multiple times or increase the chance of being rewarded in a lottery incentive. It is speculated that many of these individuals live in poorer countries where money goes a lot further, but that is just speculation and I can’t find any rigorous evidence for it.
Privacy question
In research, there’s a common policy of not asking questions, or doing screening, beyond what you actually need. This is partly driven by pragmatic reasons – the more screening you do, or the more questions you ask, the less recruitment you’ll get. However, as I mentioned previously, it’s also partly an ethical concern – the more information you gather, the greater the potential risk to participants should your information leak, and the greater the risk of biasing the researcher with irrelevant information, especially if you’re working with potentially vulnerable populations. For example, one of the reasons I think a large proportion of the people in my sample are not eligible is because almost all of the influx had a specific accent which is relatively uncommon in my target population – I was meant to be talking to Zimbabwean oil sheikhs, but over half had strong Scottish accents. Is that evidence of something problematic going on, or do I have a prejudice against Scottish people? To the best of my googling, about 3% of the target population should have had that accent, and over half of my sample did. I’m obviously using silly examples – the Scottish accent is objectively awesome – but we can all imagine situations where it’s a lot more ambiguous. This concern is part of why we have clear criteria set up in advance.
So it seems at least plausible that there’s some fraud going on on the part of at least some of the participants. Let’s skip over the question of how I handle that and focus on the question of how we prevent it in the first place – leaving aside how common it is, it seems pretty clear that fraudulent participation does happen at least sometimes. There is some low-hanging fruit, but even that comes with drawbacks. In-person interviews are probably easier to control, but that limits your pool geographically. Screening questionnaires probably help, but answers that rely on specific knowledge (e.g., triage procedures for nurses) can be googled, can be fairly easily bypassed if done at the “participant’s” leisure, and aren’t an option at all if you’re not targeting a specific-knowledge group. If doing online interviews, requiring participants to have the camera on seems tempting, but it doesn’t actually address most problems, and if you’re capturing their face you now have to control that information, especially if they are potentially vulnerable.
And, of course, there are the privacy concerns. If I require all my participants to show me formal ID, I’m requiring them to disclose potentially sensitive information to me. In some cases that’s justified, but in others it really isn’t. And even when it is, how is that different to an online business requiring my physical address when they have no plausible need for it?
I said at the start that the purpose of this piece is to explore the tensions within the concept of privacy. It’s not just a simple matter of ‘more privacy = good’, or even a matter of ‘only gather the information you need’, because how do you define what you need? I sincerely, legitimately thought that I didn’t need participants’ IP addresses, but as it turns out, that information could have been very useful in weeding out invalid participants.
So I guess the next time you see a company or person asking for more information than they appear to need, maybe think about why they might be asking for it. Yes, some reasons are illegitimate, but sometimes there are needs that aren’t immediately obvious to outsiders.
As always, Scottish people are why we can’t have nice things.
There’s an idea which I’ve had floating about in my brain for a few years which I call ‘the problem of infinity’. Basically, it’s the observation that as soon as you inject infinity into a system, the system has a tendency to break. Imagine an economy, and one person has infinite money – not ‘more money than they could ever realistically spend’, literally infinite. To that person, money would have zero value, so they would treat it as we treat breathing. So a huge amount of money would be constantly flowing into the economy far faster than it could leave it, which would result in hyperinflation, which would crash the economy, thus breaking the system.
I’m personally very leery of consequentialism and its reasoning, as it inevitably ends up viewing people as instruments to achieve a goal rather than morally relevant agents in themselves, but it’s a popular framework, and even I’d agree it has its uses.
What the rationalists call a ‘sacred value’, or one which cannot ever be traded off against because its value is always massively greater than any benefit you would gain from it.
I won’t talk about it here, but if you ever want me to go into an angry rant ask me about my view of research ethics committees. Let’s just say negotiating the meaning of the word ‘non-obnoxious’ is an ongoing argument after several weeks, and they used a rule about xenotransplantation (that is, the process of implanting tissue from one species to another) to force me to keep participant information beyond any reasonably necessary time in a system they definitely knew was insecure and had been recently breached. No, I don’t understand how they reached that conclusion either, but they were very explicit that this was non-negotiable, and when I asked them to explain the connection of a section about xenotransplantation to retaining information they refused. This was pretty typical.
If you can’t see the text, try a different browser. I don’t know why that happens, but it worked for me.
Interesting post, here's my take on it:
"until one of my supervisors (who is basically a rockstar in a closely related area to mine) posted about it on their social media. Suddenly, I was inundated with participants! More than I knew what to do with!"
This is the first part I would look into: is my largest participant pool possibly of Scottish origin? Your supervisor might have a lot of Scottish followers without realizing. This eliminates the silly answer (though I'm pretty sure you've already thought of this).
If that's not the case (and even if it is) I would look more into privacy-respecting ways to identify unique user sessions.
I don't have any idea of how the study is conducted, of how any of the information is collected, of whether you're using a server, a text file, a carved rock, an excel sheet or a notes app to keep track of everything, however these are the initial thoughts that came to mind upon reading the post. Let me know if you have any further questions, objections or need any help with anything :)
I definitely agree that with any prize incentive there should be some layer of authentication, and I definitely advocate for that layer to be privacy-respecting. However I also know from experience that this is way harder than it looks. Here are some thoughts I had while reading:
First thing that came to mind was ephemerality. You can get the best of both worlds (or close enough) by collecting some kind of data for a very limited amount of time. Let's say you collect an IP address and other attributes you choose, and keep these for a specific period of time that you deem fit, for example 7-14 days, just to make sure there isn't duplication in user submissions. You ensure the purging process goes smoothly and is up to your standards. You also ensure all PII is deleted after the study is done.
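To make that concrete, here's a rough TypeScript sketch of what a purge step could look like. The record shape, the 14-day window, and the idea that everything sits in a simple list are all assumptions for illustration, not claims about how the study actually stores its data:

```typescript
// Minimal sketch: strip identifying fields (e.g. the IP address) once a
// retention window has passed, while keeping the substantive responses.
// Record shape and 14-day window are illustrative assumptions.

interface SubmissionRecord {
  id: string;
  submittedAt: Date;
  ip?: string;          // identifying attribute, kept only briefly
  responses: unknown;   // the actual study data, kept
}

const RETENTION_DAYS = 14;

function purgeExpiredIdentifiers(
  records: SubmissionRecord[],
  now: Date = new Date()
): SubmissionRecord[] {
  const cutoff = now.getTime() - RETENTION_DAYS * 24 * 60 * 60 * 1000;
  return records.map(record =>
    record.submittedAt.getTime() < cutoff
      ? { ...record, ip: undefined } // drop the identifier, keep the responses
      : record
  );
}
```

Something like this would need to run on a schedule (a daily job or similar) so identifiers never outlive the window.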
Second thing that came to mind was some kind of privacy-respecting non-fingerprinting cookie. While it can be circumvented by either hitting "decline" on the cookie banner or clearing browser cookies, I think this approach gets rid of some of the low-hanging fruit. The cookie left on the participant's computer doesn't have to include anything else except a binary value of whether the participant has completed the study or not.
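Assuming the study runs as a web page, that cookie could be as simple as the sketch below – the cookie name and one-year lifetime are arbitrary choices on my part:

```typescript
// Minimal sketch: mark completion with a single binary-valued cookie.
// No fingerprinting, no identifier -- just "has this browser already
// completed the study?". Cookie name and lifetime are illustrative.

function markStudyCompleted(): void {
  // SameSite=Strict keeps the cookie from being sent on cross-site requests.
  document.cookie = "study_completed=1; max-age=31536000; path=/; SameSite=Strict";
}

function hasCompletedStudy(): boolean {
  return document.cookie
    .split("; ")
    .some(entry => entry === "study_completed=1");
}
```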
Third thing that came to mind was generating some kind of cryptographic hash from a uniquely selected attribute from the user's machine. This could be an IP address, MAC address or any identifiable attribute that wouldn't even get collected in the first place as the hashing can happen on the participant's machine. I think this is your best bet, but I don't know how hard this is to pull off, as it's not an easy process. This is kind of close to what MullvadVPN currently does (although they definitely implement more sophisticated techniques which I won't go into). You can also use things like cryptocurrency as the reward because it's a little more anonymous, or something like manually mailing rewards which definitely is a gigantic and expensive operation with a ton of drawbacks, but I'll still throw it out there as an option if the targeted population happens to be close enough :)
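For what that client-side hashing could look like: the sketch below uses the browser's built-in Web Crypto API to digest an attribute before it's ever sent anywhere. Note that browser JavaScript can't read things like MAC addresses, so the attribute here is just something the participant types in (an email address, say) – that choice, the salt, and the function name are all assumptions for illustration:

```typescript
// Minimal sketch: hash an identifying attribute in the participant's
// browser so only the digest ever reaches the researcher. The attribute
// being something the participant types in is an assumption; browser JS
// cannot read MAC addresses or the public IP directly.

async function deduplicationToken(attribute: string, studySalt: string): Promise<string> {
  // A per-study salt stops the digest being looked up in a precomputed table.
  const data = new TextEncoder().encode(`${studySalt}:${attribute.trim().toLowerCase()}`);
  const digest = await crypto.subtle.digest("SHA-256", data);
  return Array.from(new Uint8Array(digest))
    .map(byte => byte.toString(16).padStart(2, "0"))
    .join("");
}

// Usage: submit only the token alongside the responses; the raw attribute
// never leaves the participant's machine.
// deduplicationToken("sheikh@example.com", "my-study-2024").then(console.log);
```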
You can use at-rest encryption to mitigate the effect of the insecure database, provided you have full access to it. If you don't have access to it you can encrypt everything using your own key and upload the encrypted results, provided the database supports it. I would need more info about the behind the scenes to know if this is feasible.
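For the encrypt-before-upload idea, here's a minimal sketch using Node's built-in crypto module and AES-256-GCM. Key management (where the 32-byte key lives, how it's backed up and rotated) is the genuinely hard part and isn't shown:

```typescript
// Minimal sketch: encrypt a result blob with your own key before it is
// uploaded, so the insecure database only ever holds ciphertext.

import { randomBytes, createCipheriv, createDecipheriv } from "node:crypto";

function encryptRecord(plaintext: string, key: Buffer): string {
  const iv = randomBytes(12); // unique nonce per record
  const cipher = createCipheriv("aes-256-gcm", key, iv);
  const ciphertext = Buffer.concat([cipher.update(plaintext, "utf8"), cipher.final()]);
  const tag = cipher.getAuthTag();
  // Store iv + auth tag + ciphertext together; none of these are secret.
  return Buffer.concat([iv, tag, ciphertext]).toString("base64");
}

function decryptRecord(encoded: string, key: Buffer): string {
  const raw = Buffer.from(encoded, "base64");
  const iv = raw.subarray(0, 12);
  const tag = raw.subarray(12, 28);
  const ciphertext = raw.subarray(28);
  const decipher = createDecipheriv("aes-256-gcm", key, iv);
  decipher.setAuthTag(tag);
  return Buffer.concat([decipher.update(ciphertext), decipher.final()]).toString("utf8");
}
```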
All of these solutions are technical and might be challenging to implement, especially if you lack control over the database or data, or if you're unfamiliar with their implementation. However, given the information provided, these are the ones I would recommend.