R is a free, open-source, multi-platform statistical environment that has become a de facto standard for quantitative analysis in the social sciences and in many other research fields.
Because every step, from data import to the final plot, is expressed in shareable script files, R supports reproducibility and transparency, in line with FAIR principles and with the Research Data Management policy and practices of the University of Milan.

Beyond being a programming language, R is a fast-evolving ecosystem whose global community is quick to fill methodological gaps. For FAIR data management this means:
- End-to-end transparency – scripts and notebooks document the entire workflow.
- Zero cost, open licences – free to install on any OS; integrates with Git, Docker and Singularity.
- Rapid innovation – hundreds of new packages appear every year; e.g. 2024-25 saw dsld (discrimination analysis), keras3 (deep learning), extended survey functions for ultra-complex samples, and causal-inference tools such as causalweight and causalQual.
How R fits into the research data management cycle
- Plan - In the data management plan (DMP), declare open formats (CSV, Parquet) and version-controlled scripts.
- Collect - Use httr2 (web APIs) or rvest (web scraping) to fetch data, and generate metadata automatically.
- Analyse - Organise code in clear folders (R/ data/ output/); document with Quarto.
- Share - Deposit datasets and .Rmd or .qmd files in Data@UNIMI – Dataverse, obtain a DOI, and link it in the paper.
- Preserve - Save an renv.lock file so that others can restore the same package versions and rerun the project years later. The lockfile records the R version, the package repositories in use, and each package's name, exact version, source and hash.
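The Plan, Analyse and Preserve steps above can be sketched in base R. This is a minimal illustration under stated assumptions, not a prescribed layout: the helper create_project(), the temporary project location and the toy data are hypothetical. The renv calls are left commented out because they require the renv package and modify the project they run in.

```r
# Hypothetical helper: create the R/, data/ and output/ folders under `root`.
create_project <- function(root) {
  dirs <- file.path(root, c("R", "data", "output"))
  for (d in dirs) dir.create(d, recursive = TRUE, showWarnings = FALSE)
  invisible(dirs)
}

# Placeholder location; a real project would live in a version-controlled folder.
project_root <- file.path(tempdir(), "demo-project")
create_project(project_root)

# Store data in an open format (CSV). The values are invented.
toy <- data.frame(id = 1:3, trust = c(4, 7, 5))
csv_path <- file.path(project_root, "data", "toy.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Keep a plain-text record of the R version and loaded packages.
session_path <- file.path(project_root, "output", "session-info.txt")
writeLines(capture.output(sessionInfo()), session_path)

# With renv installed, package versions could be pinned and restored:
# renv::init()      # set up a project-local library
# renv::snapshot()  # write renv.lock
# renv::restore()   # reinstall the versions recorded in renv.lock
```

The session-info file is only a lightweight complement to renv.lock: it documents the environment but cannot rebuild it.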
The advantages of R
- Free and open source: the code is public and anyone can improve it, so there are no licence fees and no vendor lock-in.
- Script-based: every operation lives in .R or notebook files, so readers can inspect and replicate each step.
- Wide community: thousands of packages, plus forums and conferences, so someone has likely solved a problem similar to yours.
- Integration with RDM: works well with Git, Dataverse, Docker and Quarto, which makes FAIR compliance more straightforward.
Glossary
Term | Meaning |
---|---|
FAIR | Findable, Accessible, Interoperable, Reusable |
Pipeline | Automated sequence of steps (e.g. import → clean → analyse → plot) |
Matching | Technique to pair treated and control cases with similar features |
Diff-in-diff | Compares two groups before/after an intervention to estimate causal effects |
Topic modelling | Algorithm that groups words into recurring “topics” |
Shapefile / geodata | File format that stores geographic features such as boundaries |
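As a worked illustration of the diff-in-diff entry, the sketch below estimates a two-group, two-period effect in base R. The data frame, its column names and all outcome values are invented for the example. A real analysis would typically rely on a dedicated package and would need to argue that both groups would have followed parallel trends without the intervention.

```r
# Toy data: two observations per cell (treated/control x before/after).
arrears <- data.frame(
  treated = rep(c(0, 0, 1, 1), each = 2),
  post    = rep(c(0, 1, 0, 1), each = 2),
  y       = c(9, 11, 11, 13, 10, 12, 15, 17)
)

# Manual estimate: change in the treated group minus change in the control group.
cell_means <- with(arrears, tapply(y, list(treated = treated, post = post), mean))
did_manual <- (cell_means["1", "1"] - cell_means["1", "0"]) -
  (cell_means["0", "1"] - cell_means["0", "0"])

# Regression estimate: the interaction coefficient is the same quantity.
fit <- lm(y ~ treated * post, data = arrears)
did_model <- coef(fit)[["treated:post"]]

print(c(manual = unname(did_manual), model = did_model))  # both equal 3 by construction
```

Here the treated group moves from a mean of 11 to 16 and the control group from 10 to 12, so the estimate is (16 - 11) - (12 - 10) = 3.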
Key packages: what they do in simple terms
Package | Aim | Example |
---|---|---|
tidyverse | Read, tidy, transform & visualise data | Excel, but automated and row-limit-free |
survey + srvyr | Handle complex samples (weights, strata) | Re-weight a survey to match the real population |
lme4 / brms | Mixed-effects & Bayesian models | Evaluate students within classes within schools without mixing levels |
MatchIt, did, causalweight | Causal inference with observational data | Build “twin” groups (treated vs control) to see if a programme truly matters |
quanteda, tidytext, stm | Text mining & topic modelling | Summarise thousands of tweets as if they were survey answers |
sf, tmap, leaflet | Geodata & mapping | Colour municipalities by vote share or unemployment with a click |
targets / drake | Reproducible pipelines | One command to rebuild the entire project, rerunning only the steps that changed |
Quarto, R Markdown, Shiny | Reporting & dashboards | Word + PowerPoint + Excel, but fully synced to your code |
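The idea of a pipeline (import → clean → analyse) can be sketched with one plain function per step. This uses base R only and is not the targets API; a pipeline tool would additionally track which steps are out of date. The file, column names and values are invented, and the plot step is omitted.

```r
# Write a small toy file so the example is self-contained.
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(group = c("a", "a", "a", "b", "b"), score = c(1, 3, NA, 4, 6)),
  csv_path,
  row.names = FALSE
)

# One function per step keeps each stage easy to inspect and test.
import_data  <- function(path) read.csv(path, stringsAsFactors = FALSE)
clean_data   <- function(df) df[!is.na(df$score), , drop = FALSE]
analyse_data <- function(df) aggregate(score ~ group, data = df, FUN = mean)

# The pipeline is the composition of the steps, in order.
run_pipeline <- function(path) analyse_data(clean_data(import_data(path)))

result <- run_pipeline(csv_path)
print(result)
```

Because every step is a function of the previous step's output, rerunning run_pipeline() on the same file gives the same result.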
Resources for getting started (basic to advanced level)
- “R for Data Science” (2nd ed.) – a complete guide to using R for data analysis, free to read online.
- TidyTuesday – a learning community built around weekly datasets, with a podcast and an official GitHub repo.
- RStudio cheat sheet – a two-page PDF with the most common commands.
First steps
- Install R from https://cran.r-project.org, then install RStudio.
- Open RStudio and run: install.packages(c("tidyverse", "survey", "sf")). You’re now ready to replicate the first examples and start your FAIR, reproducible-analysis journey. Happy coding!
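To confirm that the installation worked, a short check like the following may help. It is a sketch, not an official procedure: the helper missing_packages() is a hypothetical name, and the install call is commented out so that nothing is downloaded unintentionally.

```r
# Packages named in the first steps above.
wanted <- c("tidyverse", "survey", "sf")

# Hypothetical helper: return the packages that are not yet installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(wanted)

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install the missing packages
} else {
  message("All packages are installed.")
}
```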

Use case walk-throughs
Research question | Workflow and core packages |
---|---|
“How has trust in the media changed in Europe since 2002?” | 1. Import SPSS files (via haven) 2. Apply weights (via survey) 3. Compute means/variances (via srvyr) 4. Plot trends (via ggplot2) |
“What was the polling-station swing in Lombardy, 2018-2023?” | 1. Fetch tweets (via rtweet) 2. Clean text (via quanteda) 3. Build user-retweet graph (via igraph) 4. Visualise clusters (via ggraph) |
“Does the rent-bonus reduce arrears?” | 1. Combine pre- and post-policy administrative registers 2. Create matched pairs (via MatchIt) 3. Check balance (via cobalt) 4. Estimate diff-in-diff (via did) |
“Which themes recur in parliamentary speeches, 1948-2024?” | 1. Import transcripts (via readtext) 2. Tokenise & drop stop-words (via tidytext) 3. Fit STM topic model (via stm) 4. Plot topic trends (via ggplot2) |
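The weighting step in the first walk-through can be illustrated on toy data. The sketch below uses base R's weighted.mean() as a stand-in for survey and srvyr, which would also provide design-based standard errors. The data frame, its column names and all values are invented.

```r
# Toy survey: a trust score per respondent, with a survey weight.
survey_toy <- data.frame(
  year   = c(2002, 2002, 2002, 2022, 2022, 2022),
  trust  = c(6, 4, 5, 3, 5, 4),
  weight = c(1, 1, 2, 2, 1, 1)
)

# Weighted mean of trust within each survey year.
by_year <- split(survey_toy, survey_toy$year)
trend <- data.frame(
  year = as.numeric(names(by_year)),
  mean_trust = unname(vapply(
    by_year,
    function(d) weighted.mean(d$trust, d$weight),
    numeric(1)
  ))
)

print(trend)  # 2002: 5.00, 2022: 3.75
```

The resulting table has one row per year and could be passed directly to a plotting function to draw the trend.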