"until one of my supervisors (who is basically a rockstar in a closely related area to mine) posted about it on their social media. Suddenly, I was inundated with participants! More than I knew what to do with!"
This is the first part I would look into: is my largest participant pool possibly of Scottish origin? Your supervisor might have a lot of Scottish followers without realizing. This eliminates the silly answer (though I'm pretty sure you've already thought of this).
If that's not the case (and even if it is) I would look more into privacy-respecting ways to identify unique user sessions.
I don't have any idea of how the study is conducted, of how any of the information is collected, of whether you're using a server, a text file, a carved rock, an excel sheet or a notes app to keep track of everything, however these are the initial thoughts that came to mind upon reading the post. Let me know if you have any further questions, objections or need any help with anything :)
I definitely agree that with any prize incentive there should be some layer of authentication, and I definitely advocate for that layer to be privacy-respecting. However I also know from experience that this is way harder than it looks. Here are some thoughts I had while reading:
First thing that came to mind was ephemerality. You can get the best of both words (or close enough to that) by collecting some kind of data for a very limited amount of time. Let's say you collect an IP address and other attributes you choose, keep these for a specific period of time that you deem fit, for example 7-14 days, just to make sure there isn't duplication in user submissions. You ensure the purging process goes smoothly and is up to your standards. You also ensure all PII data is deleted after the study is done.
Second thing that came to mind was some kind of privacy-respecting non-fingerprinting cookie. While it can be circumvented by either hitting "decline" on the cookie banner or clearing browser cookies, I think this approach gets rid of some of the low-hanging fruit. The cookie left on the participant's computer doesn't have to include anything else except a binary value of whether the participant has completed the study or not.
Third thing that came to mind was generating some kind of cryptographic hash from a uniquely selected attribute from the user's machine. This could be an IP address, MAC address or any identifiable attribute that wouldn't even get collected in the first place as the hashing can happen on the participant's machine. I think this is your best bet, but I don't know how hard this is to pull off, as it's not an easy process. This is kind of close to what MullvadVPN currently does (although they definitely implement more sophisticated techniques which I won't go into). You can also use things like cryptocurrency as the reward because it's a little more anonymous, or something like manually mailing rewards which definitely is a gigantic and expensive operation with a ton of drawbacks, but I'll still throw it out there as an option if the targeted population happens to be close enough :)
You can use at-rest encryption to mitigate the effect of the insecure database, provided you have full access to it. If you don't have access to it you can encrypt everything using your own key and upload the encrypted results, provided the database supports it. I would need more info about the behind the scenes to know if this is feasible.
All of these solutions are technical and might be challenging to implement, especially if you lack control over the database or data, or if you're unfamiliar with their implementation. However, given the information provided, these are the ones I would recommend.
Thanks for the response! Yes, I had thought of a lot of that, and maybe I should have gotten more into that side of things in the post. In short, a lot of the reasons I didn’t implement those practices is "yes, that would be better, but I'm not allowed to do that because the ethics committee says it's unethical."
For example, take your point about gathering the information I "needed", but only keeping it for a short time. That sounds like a great compromise, because it is! But there's a rule (apparently, personally I think they're misinterpreting it pretty radically) where if I gather *any* information, I need to keep it on an unsecure system for at least 5 years (and potentially up to 14). I wanted to delete information a lot more potentially sensitive and less necessary, and it was a *major* argument, which I ended up losing because it was made clear I would not be allowed to proceed unless I kept this information.
My compromise for the information I did end up gathering was to encrypt it before it gets anywhere near the system I have to use, and anything I anticipate wanting to delete is encrypted seperately. And when I'm done with it, whoops, lost the key (which is very long and basically impossible to brute force for anyone but the NSA). The information is still technically there - oddly enough, the rule was it has to be kept, but not necessarily readable. That tells you more than how well these rules are written and enforced than any number of anecdotes I could share.
Either way, the point was less 'these problems cannot be technically solved', but more exploring the idea of 'more privacy/less information gathered is necessarily always a simple matter'. The idea of "you can't leak what you don't have" is a good one, so you shouldn't gather more information than you need, all else being equal. But figuring out exactly what information you specifically *actually* need, turns out that's a lot harder than people think!
Hearing about how terribly inconsiderate and poorly written these rules are only makes me disappointed and frustrated. Rules should be clear, fair, and considerate of everyone involved, especially in large academic contexts such as these.
Your plan with "losing" the key sounds good, and the requirement of the data merely being present on the system no matter the format is laughable at best but honestly just sad.
This post helped me think a little bit differently about the balance between usability and data protection, especially on larger scales like these, so thanks for sharing your experience!
Best of luck with the rest of the study and whatever plan you come up with next. Don't hesitate to hit me up if you need any help :)
Interesting post, here's my take on it:
"until one of my supervisors (who is basically a rockstar in a closely related area to mine) posted about it on their social media. Suddenly, I was inundated with participants! More than I knew what to do with!"
This is the first thing I would look into: is your largest participant pool possibly of Scottish origin? Your supervisor might have a lot of Scottish followers without realizing it. Checking would rule out the silly answer (though I'm pretty sure you've already thought of this).
If that's not the case (and even if it is) I would look more into privacy-respecting ways to identify unique user sessions.
I have no idea how the study is conducted, how the information is collected, or whether you're using a server, a text file, a carved rock, an Excel sheet or a notes app to keep track of everything, but these are the initial thoughts that came to mind upon reading the post. Let me know if you have any further questions or objections, or if you need help with anything :)
I definitely agree that with any prize incentive there should be some layer of authentication, and I'd advocate for that layer to be privacy-respecting. However, I also know from experience that this is way harder than it looks. Here are some thoughts I had while reading:
The first thing that came to mind was ephemerality. You can get the best of both worlds (or close enough) by keeping some kind of data for a very limited amount of time. Say you collect an IP address and whatever other attributes you choose, and keep them only for a period you deem fit, for example 7-14 days, just long enough to make sure there's no duplication in user submissions. You make sure the purging process runs reliably and is up to your standards, and that all PII is deleted once the study is done.
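A rough sketch of what I mean, assuming a small SQLite table you control (file name, salt and 14-day window are all placeholders, not anything from your setup):

```python
# Minimal sketch of short-lived dedup records: store a salted hash of the IP
# (never the raw address) with a timestamp, and purge anything older than the
# retention window. Table and file names are placeholders.
import sqlite3
import hashlib
import time

RETENTION_SECONDS = 14 * 24 * 3600  # e.g. 14 days
SALT = b"per-study-random-salt"     # generate once per study, keep out of the DB

db = sqlite3.connect("dedup.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (fingerprint TEXT PRIMARY KEY, ts REAL)")

def already_submitted(ip: str) -> bool:
    fp = hashlib.sha256(SALT + ip.encode()).hexdigest()
    if db.execute("SELECT 1 FROM seen WHERE fingerprint = ?", (fp,)).fetchone():
        return True
    db.execute("INSERT INTO seen VALUES (?, ?)", (fp, time.time()))
    db.commit()
    return False

def purge_expired() -> None:
    # run on a schedule (cron or similar) so nothing outlives the retention window
    db.execute("DELETE FROM seen WHERE ts < ?", (time.time() - RETENTION_SECONDS,))
    db.commit()
```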
The second thing that came to mind was some kind of privacy-respecting, non-fingerprinting cookie. It can be circumvented by hitting "decline" on the cookie banner or by clearing browser cookies, but I think it takes care of the low-hanging fruit. The cookie left on the participant's machine doesn't need to contain anything except a binary flag for whether the participant has completed the study.
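Something like this, assuming a Flask endpoint (purely my assumption, since I don't know what your study runs on; any framework would look similar):

```python
# Minimal sketch: the cookie carries only a "done" flag, no identifier and
# nothing fingerprintable. Flask is an assumption, not necessarily your stack.
from flask import Flask, request, make_response

app = Flask(__name__)

@app.route("/submit", methods=["POST"])
def submit():
    if request.cookies.get("study_completed") == "1":
        return "Looks like you've already taken part - thanks!", 409
    # ... record the anonymous responses here ...
    resp = make_response("Thanks for participating!")
    # flag only, ~90-day lifetime, not readable by client-side scripts
    resp.set_cookie("study_completed", "1", max_age=90 * 24 * 3600,
                    httponly=True, samesite="Lax")
    return resp
```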
The third thing that came to mind was generating some kind of cryptographic hash from a uniquely selected attribute of the participant's machine. This could be an IP address, a MAC address or any other identifying attribute, and the raw value never has to be collected at all, because the hashing can happen on the participant's machine. I think this is your best bet, but I don't know how hard it would be for you to pull off, as it's not an easy process. It's roughly in the spirit of what MullvadVPN does (although they use more sophisticated techniques which I won't go into). You could also use something like cryptocurrency as the reward, since it's a little more anonymous, or manually mail rewards, which is a gigantic and expensive operation with a ton of drawbacks, but I'll still throw it out there as an option if the target population happens to be geographically close enough :)
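The core idea, very roughly (in a browser study this would really be JavaScript, e.g. SubtleCrypto; Python here just to keep the examples consistent, and the MAC-based attribute is only an illustration):

```python
# Sketch of client-side hashing: the raw attribute never leaves the machine,
# only a salted one-way hash does.
import hashlib
import uuid

STUDY_SALT = "my-study-2024"   # public, study-specific; prevents cross-study linking

def participation_token() -> str:
    mac = uuid.getnode()                       # local machine attribute (example only)
    material = f"{STUDY_SALT}:{mac}".encode()
    return hashlib.sha256(material).hexdigest()

# Only this opaque token is sent with the submission; the server can spot
# duplicates by comparing tokens without ever seeing the MAC address itself.
print(participation_token())
```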
You can use at-rest encryption to mitigate the insecure database, provided you have full control over it. If you don't, you can encrypt everything with your own key and upload only the encrypted results, provided the database supports storing them. I'd need more info about what's going on behind the scenes to know whether this is feasible.
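For the second option, a tiny sketch using the `cryptography` package (my choice of library, not something from your setup; the record contents are made up):

```python
# Records are encrypted with your own key before they ever reach the untrusted
# database, which only ever sees ciphertext.
from cryptography.fernet import Fernet

key = Fernet.generate_key()       # store this somewhere you control, never in the DB
cipher = Fernet(key)

record = b'{"participant": "token-abc123", "answers": [3, 1, 4]}'
ciphertext = cipher.encrypt(record)      # upload this blob to the insecure database
plaintext = cipher.decrypt(ciphertext)   # only possible for whoever holds the key
```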
All of these solutions are technical and might be challenging to implement, especially if you don't control the database or the data, or if you're unfamiliar with the techniques involved. But given the information in the post, these are the ones I would recommend.
Thanks for the response! Yes, I had thought of a lot of that, and maybe I should have gone into that side of things more in the post. In short, the reason I didn't implement a lot of those practices is "yes, that would be better, but I'm not allowed to do it because the ethics committee says it's unethical."
For example, take your point about gathering the information I "needed" but only keeping it for a short time. That sounds like a great compromise, because it is! But there's a rule (apparently; personally I think they're misinterpreting it pretty radically) that if I gather *any* information, I need to keep it on an insecure system for at least 5 years (and potentially up to 14). I wanted to delete information that was a lot more sensitive and a lot less necessary, and it turned into a *major* argument, which I ended up losing because it was made clear I would not be allowed to proceed unless I kept that information.
My compromise for the information I did end up gathering was to encrypt it before it gets anywhere near the system I have to use, with anything I anticipate wanting to delete encrypted separately. And when I'm done with it, whoops, lost the key (which is very long and basically impossible to brute-force for anyone but the NSA). The information is still technically there; oddly enough, the rule is that it has to be kept, but not necessarily kept readable. That tells you more about how well these rules are written and enforced than any number of anecdotes I could share.
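For anyone curious what that looks like in practice, here's a rough sketch of the "separate keys" idea, again with the `cryptography` package and made-up data (I'm guessing at the shape of it, not describing the actual setup):

```python
# The data you expect to retire gets its own key, so discarding that one key
# makes only that subset unreadable while everything else stays accessible
# and the ciphertext itself stays on the mandated system.
from cryptography.fernet import Fernet

keep_key = Fernet.generate_key()      # retained for the life of the study
purge_key = Fernet.generate_key()     # destroyed once this data is no longer needed

retained = Fernet(keep_key).encrypt(b"data you must keep readable")
deletable = Fernet(purge_key).encrypt(b"more sensitive data you plan to retire")

# When the time comes: securely delete purge_key. `deletable` still sits on the
# system as required, but without the key it is effectively gone.
```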
Either way, the point was less 'these problems cannot be technically solved' and more an exploration of the idea that 'more privacy / gathering less information' isn't always a simple matter. "You can't leak what you don't have" is a good principle, so you shouldn't gather more information than you need, all else being equal. But figuring out exactly what information you *actually* need turns out to be a lot harder than people think!
Five to fourteen years sounds absolutely insane.
Hearing about how terribly inconsiderate and poorly written these rules are only makes me disappointed and frustrated. Rules should be clear, fair, and considerate of everyone involved, especially in large academic contexts such as these.
Your plan with "losing" the key sounds good, and the requirement of the data merely being present on the system no matter the format is laughable at best but honestly just sad.
This post helped me think a little bit differently about the balance between usability and data protection, especially on larger scales like these, so thanks for sharing your experience!
Best of luck with the rest of the study and whatever plan you come up with next. Don't hesitate to hit me up if you need any help :)