When research data includes personal and sensitive data, privacy issues can emerge, as regulated by the EU’s GDPR. Researchers using sensitive data must comply with legal and ethical obligations and should therefore eliminate any traceable reference to, or presence of, personal identifiers. Anonymisation and pseudonymisation are two different methods that mitigate privacy risks when processing sensitive personal data.
Anonymisation
It involves modifying or removing personal data so completely that it becomes impossible to re-establish a link between the data and the individuals from whom they were collected. This process is irreversible and is often used for data sharing and publication, as well as for long-term data preservation.
Anonymised data is thus no longer considered personal data according to the GDPR.
In a research context, a hospital discussing statistical insights on patient outcomes with external analysts might aggregate or generalise certain details – such as grouping exact ages into broader age ranges – to ensure that no one can reconstruct identities, even when combining multiple records and external sources.
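The age-banding approach described above can be sketched in a few lines of Python. This is an illustrative example with hypothetical data and function names, not a prescribed implementation: exact ages are replaced with coarse ranges so that individual records become harder to single out.

```python
# Illustrative sketch of generalisation for anonymisation (hypothetical data).
# Exact ages are replaced with broader age bands so that no single record
# can be matched back to an individual.

def generalise_age(age: int, band_width: int = 10) -> str:
    """Map an exact age to a coarse age range, e.g. 34 -> '30-39'."""
    low = (age // band_width) * band_width
    return f"{low}-{low + band_width - 1}"

records = [
    {"age": 34, "outcome": "recovered"},
    {"age": 37, "outcome": "recovered"},
    {"age": 61, "outcome": "readmitted"},
]

# Replace the identifying attribute with its generalised form.
anonymised = [
    {"age_range": generalise_age(r["age"]), "outcome": r["outcome"]}
    for r in records
]
```

Note that generalisation alone may not be sufficient: as the example in the text suggests, the chosen bands must be wide enough that no combination with external sources allows re-identification.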
Pseudonymisation
Pseudonymisation, unlike anonymisation, replaces identifiable information (such as names or social security numbers) with a code or pseudonym rather than removing it. Pseudonymisation thus reduces direct identifiability (for example, names are replaced with codes); yet anyone with access to the file or files linking names to codes can still re-identify individuals.
For this reason, pseudonymised data remains personal data and is fully subject to GDPR obligations.
Researchers and organisations often rely on this technique when data must be linked back to individuals for follow-up or longitudinal studies: in clinical trials, for example, participants may need to be contacted for later stages of the research.
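The coding step described above can be sketched as follows. This is a minimal, hypothetical example: direct identifiers are replaced with random codes, and the name-to-code key table is kept separately from the research data, to be stored in a secure location.

```python
import uuid

# Illustrative sketch of pseudonymisation (hypothetical data): direct
# identifiers are replaced with random codes, and the name-to-code key
# is kept apart from the dataset and must be stored securely.

def pseudonymise(records, id_field="name"):
    key_table = {}   # name -> code; keep this separate from the data
    pseudonymised = []
    for record in records:
        name = record[id_field]
        if name not in key_table:
            key_table[name] = str(uuid.uuid4())
        new_record = dict(record)
        new_record[id_field] = key_table[name]
        pseudonymised.append(new_record)
    return pseudonymised, key_table

data = [
    {"name": "Alice Rossi", "blood_pressure": 120},
    {"name": "Alice Rossi", "blood_pressure": 118},
]
coded, key = pseudonymise(data)
# The same participant keeps the same code, so longitudinal linkage works,
# but re-identification requires the separately stored key table.
```

Because the key table allows the link to be restored, data processed this way remains personal data under the GDPR.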

How?
The choice between anonymisation and pseudonymisation often depends on whether a study needs to maintain the ability to follow up participants.
Anonymisation is particularly suitable for releasing datasets to the public or archiving data over long periods, such as in open access repositories, because it completely removes the identifiability of individuals.
Pseudonymisation is generally used when ongoing interaction with participants is expected, or when future data collection is anticipated. A pseudonymised dataset may still expose individuals to the risk of re-identification if unauthorised parties obtain the key, or if external information sources make it possible to match coded entries with real identities.
When publishing datasets on repositories such as UNIMI’s Dataverse, careful consideration needs to be given to whether the data is anonymised or pseudonymised, as this distinction has significant implications for GDPR compliance. If the dataset is strictly anonymised – meaning there is no realistic way to link the information back to individuals – it will generally fall outside the scope of GDPR and can be published openly on a repository without additional restrictions. Importantly, the procedures and tools used to anonymise the data should always be documented in both the published dataset and the DMP.
In contrast, a pseudonymised dataset that still contains indirect identifiers or coded links to the original data remains personal data and is therefore subject to GDPR. Researchers who choose to publish such pseudonymised information must apply the same safeguards they would use for any personal data, such as restricting access to sensitive files, encrypting identifiers, and storing the key that links codes to real identities in a secure location. By clearly documenting whether the data is anonymised or pseudonymised, and specifying any necessary conditions for access, researchers can responsibly share their findings in an online repository while maintaining privacy standards.
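One of the safeguards mentioned above, encrypting identifiers, can be sketched with a keyed hash (HMAC) from Python’s standard library. This is one possible approach, not a mandated one, and the key shown is a hypothetical placeholder: the real key must be generated securely and stored apart from the dataset.

```python
import hmac
import hashlib

# Sketch of deriving pseudonyms with a keyed hash (HMAC): without the
# secret key, the codes cannot be recomputed from the original names.
# The key is a hypothetical placeholder; a real key must be generated
# securely and stored separately from the dataset.

SECRET_KEY = b"replace-with-a-securely-stored-key"

def coded_id(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Return a short, stable code derived from an identifier and a key."""
    return hmac.new(key, identifier.encode(), hashlib.sha256).hexdigest()[:12]
```

Unlike a plain hash of the name, the HMAC cannot be brute-forced from a list of known names unless the attacker also obtains the key, which is why the key itself requires the same protection as any personal data.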
Here is a useful presentation about the anonymisation of qualitative data.
In many research projects, software tools such as Amnesia facilitate both anonymisation and pseudonymisation by combining techniques such as generalisation, suppression, and approaches such as k-anonymity. Generalisation replaces specific values (for example, an exact age of 34 becomes an age range of 30 to 35), and suppression removes particularly revealing information that might uniquely identify someone, such as a rare occupation or a small geographical area. Researchers often determine the appropriate level of risk reduction by applying criteria such as k-anonymity, which seeks to ensure that each record is indistinguishable from at least a certain number of others in the dataset.
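Setting tools like Amnesia aside, the k-anonymity criterion itself can be sketched in a few lines. This is a minimal check on hypothetical data: a dataset satisfies k-anonymity when every combination of quasi-identifiers (such as age range and region) appears in at least k records.

```python
from collections import Counter

# Minimal sketch of a k-anonymity check (hypothetical data): every
# combination of quasi-identifier values must occur in at least k records.

def is_k_anonymous(records, quasi_identifiers, k):
    groups = Counter(
        tuple(r[q] for q in quasi_identifiers) for r in records
    )
    return all(count >= k for count in groups.values())

dataset = [
    {"age_range": "30-39", "region": "North", "outcome": "recovered"},
    {"age_range": "30-39", "region": "North", "outcome": "readmitted"},
    {"age_range": "60-69", "region": "North", "outcome": "recovered"},
]

is_k_anonymous(dataset, ["age_range", "region"], k=2)
# -> False: the ("60-69", "North") combination appears only once,
# so that record would need further generalisation or suppression.
```

In practice, records that break the criterion are handled exactly as the text describes: their values are generalised further (wider age bands, larger regions) or suppressed entirely.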
For any questions regarding the protection of personal data, please contact the UNIMI Privacy and DPO Support Office.
