When research data includes personal or sensitive data, privacy issues can emerge. Researchers using this type of data must fulfil their legal and ethical obligations by removing personal identifiers or any traceable reference to them. Anonymisation and pseudonymisation are two different methods for mitigating privacy risks in data processing.
Anonymisation
Anonymisation involves modifying or removing personal data so thoroughly that it is no longer possible to re-establish a link between the data and the individuals it relates to. This process is irreversible by nature and is often used when data needs to be shared widely or stored for long periods of time without compromising the privacy rights of the individuals concerned. Once information has been successfully anonymised, it is no longer considered personal data under the GDPR and therefore falls outside its regulatory scope. In a practical research context, a hospital sharing statistical insights on patient outcomes with external analysts might aggregate or generalise certain details - such as grouping exact ages into broader bands - to ensure that no one can reconstruct identities, even by combining multiple records and external sources.
Pseudonymisation
Pseudonymisation, by contrast, replaces identifiable information (such as names or social security numbers) with a code or pseudonym. This reduces direct identifiability, but re-identification remains possible for anyone who gains access to the link between the pseudonym and the original data. Because a particular dataset can still potentially be linked to a specific individual, pseudonymised data must be treated under the same GDPR obligations as any other form of personal data. Researchers and organisations often rely on this technique when it is useful, or necessary, to retain the ability to link data back to individuals for follow-up or longitudinal studies. A clinical trial might assign codes such as 'ID-001' or 'ID-002' to participants, while maintaining a separate secure file of their names. Those analysing the results of the trial will only see the coded records, but participants can still be re-identified if they need to be contacted for further stages of the research.
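The clinical trial example above can be sketched in a few lines of Python. This is only an illustrative sketch, not a recommended tool: the function name `pseudonymise`, the field names, and the participant names are all hypothetical, and in a real project the key would have to be written to separate, access-controlled storage rather than kept in memory next to the coded records.

```python
def pseudonymise(records, id_field="name", prefix="ID-"):
    """Replace the direct identifier in each record with a sequential code.

    Returns (coded_records, key), where key maps each code back to the
    original identifier. The key must be stored separately and securely.
    """
    code_for = {}  # original identifier -> assigned code
    coded_records = []
    for record in records:
        original = record[id_field]
        # Reuse the same code when a participant appears more than once,
        # so that longitudinal records stay linked to each other.
        if original not in code_for:
            code_for[original] = f"{prefix}{len(code_for) + 1:03d}"
        remaining = {k: v for k, v in record.items() if k != id_field}
        coded_records.append({"participant_id": code_for[original], **remaining})
    key = {code: original for original, code in code_for.items()}
    return coded_records, key


# Fictional participants, invented for illustration only.
participants = [
    {"name": "Alice Rossi", "age": 34, "outcome": "improved"},
    {"name": "Bruno Bianchi", "age": 51, "outcome": "stable"},
    {"name": "Alice Rossi", "age": 34, "outcome": "stable"},
]

coded_records, key = pseudonymise(participants)
# coded_records is what analysts would see; key is the re-identification file.
```

Note that the coded records still contain the exact age, an indirect identifier, which is one reason why a dataset like this remains personal data.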

How to?
The choice between anonymisation and pseudonymisation often depends on whether a study needs to maintain the ability to follow up participants. Anonymisation is particularly suitable for releasing datasets to the public or archiving data over long periods, such as in open access repositories, because it completely removes the identifiability of individuals. Pseudonymisation is generally used when ongoing interaction with participants is expected, or when future data collection is anticipated. However, a pseudonymised dataset may still expose individuals to the risk of re-identification if unauthorised parties obtain the key, or if external information sources make it possible to match coded entries with real identities.
When publishing datasets on repositories such as UNIMI's Dataverse, careful consideration needs to be given to whether the data is anonymised or pseudonymised, as this distinction has significant implications for GDPR compliance. If the dataset is strictly anonymised - meaning that there is no realistic way to link the information back to individuals - it will generally fall outside the scope of GDPR and can be published openly on a repository without additional restrictions. Importantly, the procedures and tools used to anonymise the data should always be documented in both published datasets and DMPs. In contrast, a pseudonymised dataset that still contains indirect identifiers or coded links to the original data is personal data and therefore remains subject to GDPR. Researchers who choose to publish such pseudonymised information must apply the same safeguards they would use for any personal data, such as restricting access to sensitive files, encrypting identifiers, and storing the key that links codes to real identities in a secure location. By clearly documenting whether the data is anonymised or pseudonymised, and specifying any necessary conditions for access, researchers can responsibly share their findings in an online repository while maintaining privacy standards.
Here is a useful presentation about the anonymisation of qualitative data.
In many research projects, software tools such as Amnesia facilitate both anonymisation and pseudonymisation by combining techniques such as generalisation and suppression with criteria such as k-anonymity. Generalisation replaces specific values with broader ones (for example, an exact age of 34 becomes an age range of 30 to 35), and suppression removes particularly revealing information that might uniquely identify someone, such as a rare occupation or a small geographical area. Researchers often determine the appropriate level of risk reduction by applying criteria such as k-anonymity, which seeks to ensure that each record is indistinguishable from at least k-1 others in the dataset with respect to its indirect identifiers.
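To make these three ideas concrete, here is a minimal Python sketch. It does not reproduce how Amnesia or any other tool works internally. The function names, the sample records, and the choice of non-overlapping five-year bands (so that 34 falls in "30-34") are assumptions made for illustration, and suppression is implemented here in its simplest form, by dropping whole records that belong to groups smaller than k.

```python
from collections import Counter


def generalise_age(age, width=5):
    """Replace an exact age with a non-overlapping band, e.g. 34 -> '30-34'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


def group_sizes(records, quasi_identifiers):
    """Count how many records share each combination of quasi-identifier values."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in records)


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of indistinguishable records."""
    return min(group_sizes(records, quasi_identifiers).values(), default=0)


def suppress_small_groups(records, quasi_identifiers, k):
    """Drop records whose quasi-identifier combination occurs fewer than k times."""
    sizes = group_sizes(records, quasi_identifiers)
    return [
        r for r in records
        if sizes[tuple(r[q] for q in quasi_identifiers)] >= k
    ]


# Fictional records, invented for illustration only.
records = [
    {"age": 34, "occupation": "nurse", "outcome": "improved"},
    {"age": 31, "occupation": "nurse", "outcome": "stable"},
    {"age": 33, "occupation": "nurse", "outcome": "improved"},
    {"age": 52, "occupation": "teacher", "outcome": "stable"},
    {"age": 54, "occupation": "teacher", "outcome": "improved"},
    {"age": 47, "occupation": "lighthouse keeper", "outcome": "stable"},
]
quasi_identifiers = ["age", "occupation"]

# Step 1: generalise exact ages into bands.
generalised = [{**r, "age": generalise_age(r["age"])} for r in records]
k_before = k_anonymity(generalised, quasi_identifiers)

# Step 2: suppress records that are still unique (here, the rare occupation).
released = suppress_small_groups(generalised, quasi_identifiers, k=2)
k_after = k_anonymity(released, quasi_identifiers)
```

In this toy dataset, generalisation alone leaves the record with the rare occupation unique, so `k_before` is 1; after suppressing it, every remaining record shares its age band and occupation with at least one other record. A k-anonymity check of this kind only measures risk with respect to the chosen quasi-identifiers, so it does not by itself prove that a dataset is anonymised in the GDPR sense.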
For any questions regarding the protection of personal data, please contact the UNIMI Privacy and DPO Support Office.