R for Data Science starter - Research data management @UNIMI

R is a free, open-source, multi-platform statistical environment that has become a de facto standard for quantitative analysis in the social sciences and in many other research fields.

Because every step – from data import to the final plot – is expressed in shareable script files, R supports reproducibility and transparency, in line with the FAIR principles and with the Research Data Management policy and practices of the University of Milan.

More than a programming language, R is a fast-evolving ecosystem whose global community quickly fills most methodological gaps. For FAIR data this means:

End-to-end transparency – scripts and notebooks document the entire workflow.
Zero cost, open licences – free to install on any OS; integrates with Git, Docker and Singularity.
Rapid innovation – hundreds of new packages appear every year; recent examples from 2024-25 include dsld (discrimination analysis) and keras3 (deep learning), alongside extended survey functions for complex sample designs and causal-inference tools such as causalweight and causalQual.

How R fits into the research data management cycle

Plan – In the DMP declare open formats (CSV, Parquet) and version-controlled scripts.
Collect – Use httr2 or rvest to fetch data; auto-generate metadata.
Analyse – Organise code in clear folders (R/ data/ output/); document with Quarto.
Share – Deposit datasets and .Rmd files in Data@UNIMI – Dataverse, obtain a DOI, and link it in the paper.
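Preserve – Save an renv.lock file so that others can restore the same package versions and rerun the project years later. The lockfile records the R version, the package repositories (e.g. the CRAN mirror) and, for each package, its name, exact version, source and a hash. It does not capture the operating system or system libraries, so pair it with a container (Docker, Singularity) for long-term preservation.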
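
The Plan, Analyse and Preserve steps can be sketched in a few lines of base R. This is only an illustration under assumptions: the project lives in a temporary directory, the file name survey_clean.csv and the toy data frame are hypothetical, and the renv calls are left commented out because they require the renv package and are meant to be run from a real project root.

```r
# Hypothetical project root; in practice this would be your project folder.
project <- file.path(tempdir(), "my_project")

# Folder layout from the "Analyse" step: R/ for scripts, data/ and output/.
for (d in c("R", "data", "output")) {
  dir.create(file.path(project, d), recursive = TRUE, showWarnings = FALSE)
}

# Toy data (illustrative only), saved in an open format (CSV).
toy <- data.frame(id = 1:3, trust = c(4, 7, 5))
csv_path <- file.path(project, "data", "survey_clean.csv")
write.csv(toy, csv_path, row.names = FALSE)

# Record the R version and attached packages alongside the outputs.
writeLines(capture.output(sessionInfo()),
           file.path(project, "output", "session_info.txt"))

# With renv installed, renv::init() sets up the project library and
# renv::snapshot() writes or updates renv.lock:
# renv::init()
# renv::snapshot()
```

Keeping the session information next to the outputs is a lightweight complement to the lockfile, not a replacement for it.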

The advantages of R

Free and open source: the code is public and anyone can improve it, so there are no licence fees and no vendor lock-in.
Script-based: every operation lives in .R or notebook files, so readers can inspect and replicate each step.
Wide community: thousands of packages, forums and conferences, so someone has likely solved a problem similar to yours.
Integration with RDM: works well with Git, Dataverse, Docker and Quarto, which makes FAIR compliance more straightforward.

Glossary

Term	Meaning
FAIR	Findable, Accessible, Interoperable, Reusable
Pipeline	Automated sequence of steps (e.g. import → clean → analyse → plot)
Matching	Technique to pair treated and control cases with similar features
Diff-in-diff	Compares two groups before/after an intervention to estimate causal effects
Topic modelling	Algorithm that groups words into recurring “topics”
Shapefile / geodata	File format that stores geographic boundaries
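
To make the diff-in-diff entry concrete, here is a minimal base-R sketch on toy numbers (invented for illustration, not real data). It computes the estimate by hand as (treated after − treated before) − (control after − control before) and then recovers the same number as the interaction coefficient of a linear model. The column names treated, post and y are assumptions of this example.

```r
# Toy outcomes (illustrative numbers, not real data): two observations per cell.
toy <- data.frame(
  treated = c(0, 0, 0, 0, 1, 1, 1, 1),  # 1 = received the intervention
  post    = c(0, 0, 1, 1, 0, 0, 1, 1),  # 1 = measured after the intervention
  y       = c(9, 11, 11, 13, 10, 12, 15, 17)
)

# Cell means: rows = treated (0/1), columns = post (0/1).
m <- tapply(toy$y, list(toy$treated, toy$post), mean)

# Change in the treated group minus change in the control group.
did_by_hand <- (m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])

# The same number is the interaction coefficient of a linear model.
fit <- lm(y ~ treated * post, data = toy)
did_by_lm <- unname(coef(fit)["treated:post"])

print(c(by_hand = did_by_hand, by_lm = did_by_lm))
```

The estimate is only causal if the two groups would have followed parallel trends without the intervention; designs with several periods or staggered treatment call for dedicated tools such as the did package.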

Key packages: what they do in simple terms

Package	Aim	Example
tidyverse	Read, tidy, transform & visualise data	Like Excel, but automated and without row limits
survey + srvyr	Handle complex samples (weights, strata)	Re-weight a survey to match the real population
lme4 / brms	Mixed-effects & Bayesian models	Evaluate students within classes within schools without mixing levels
MatchIt, did, causalweight	Causal inference with observational data	Build “twin” groups (treated vs control) to see if a programme truly matters
quanteda, tidytext, stm	Text mining & topic modelling	Summarise thousands of tweets as if they were survey answers
sf, tmap, leaflet	Geodata & mapping	Colour municipalities by vote share or unemployment with a click
targets (supersedes drake)	Reproducible pipelines	One button to rebuild the entire project with automatic logs
Quarto, R Markdown, Shiny	Reporting & dashboards	Word + PowerPoint + Excel, but fully synced to your code
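
The pipeline idea behind the targets row can be sketched in plain R: each step is a small function, and the project is rebuilt by calling them in order. The function names and the toy data below are hypothetical; targets builds on functions like these and adds dependency tracking, so that only outdated steps are rerun.

```r
# Each step is a plain function, so it can be tested and reused on its own.
import_data <- function() {
  # Stand-in for reading a file; a real project might call read.csv() here.
  data.frame(group = c("a", "a", "b", "b", "b"),
             score = c(10, NA, 20, 30, 40))
}

clean_data <- function(raw) {
  raw[!is.na(raw$score), ]  # drop rows with a missing score
}

analyse_data <- function(clean) {
  aggregate(score ~ group, data = clean, FUN = mean)  # mean score per group
}

# Run the steps in order: import -> clean -> analyse.
result <- analyse_data(clean_data(import_data()))
print(result)
```

Writing steps as functions first makes a later move to a pipeline tool a matter of wiring, not of rewriting.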

Resources for getting started (basic to advanced level)

“R for Data Science” (2nd ed.) – Complete, free online guide to using R for the social sciences.
TidyTuesday – A podcast and a data learning community (see the podcast and the official GitHub repo).
RStudio cheat sheet – two-page PDF with the most common commands.

First steps

Install R from https://cran.r-project.org, then install RStudio.
Open RStudio and run: install.packages(c("tidyverse", "survey", "sf")). You’re ready to replicate the first examples and embark on your FAIR, reproducible-analysis journey. Happy coding!
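
If some of these packages are already present, a small check avoids reinstalling them. The sketch below is one possible approach, not an official recipe: missing_packages() is a hypothetical helper defined here, and installation is only attempted in an interactive session.

```r
# Packages named in the install step above.
wanted <- c("tidyverse", "survey", "sf")

# Hypothetical helper: returns the packages that cannot be loaded.
missing_packages <- function(pkgs) {
  ok <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!ok]
}

to_install <- missing_packages(wanted)

# Install only in an interactive session, so scripts never download silently.
if (interactive() && length(to_install) > 0) {
  install.packages(to_install)
}
```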

Use case walk-throughs

Research question	Workflow and core packages
“How has trust in the media changed in Europe since 2002?”	1. Import SPSS files (via haven) 2. Apply weights (via survey) 3. Compute means/variances (via srvyr) 4. Plot trends (via ggplot2)
“What was the polling-station swing in Lombardy, 2018-2023?”	1. Fetch tweets (via rtweet) 2. Clean text (via quanteda) 3. Build user-retweet graph (via igraph) 4. Visualise clusters (via ggraph)
“Does the rent-bonus reduce arrears?”	1. Combine administrative registers pre/post policy 2. Create matched pairs (via MatchIt) 3. Estimate diff-in-diff (via did) 4. Check balance (via cobalt)
“Which themes recur in parliamentary speeches, 1948-2024?”	1. Import transcripts (via readtext) 2. Tokenise & drop stop-words (via tidytext) 3. Fit STM topic model (via stm) 4. Plot topic trends (via ggplot2)
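
The core of the first walk-through (apply weights, then compute means per year) can be illustrated in base R. The data below are simulated for illustration only, and the column names year, trust and w are assumptions. In the full workflow, survey and srvyr would replace this step and also provide design-based standard errors, and ggplot2 would draw the trend.

```r
# Simulated respondents (illustrative only): trust on a 0-10 scale,
# a survey year, and a hypothetical weight column `w`.
toy <- data.frame(
  year  = c(2002, 2002, 2002, 2022, 2022, 2022),
  trust = c(4, 6, 8, 3, 5, 7),
  w     = c(1, 1, 2, 2, 1, 1)
)

# Weighted mean of trust within each year.
by_year <- split(toy, toy$year)
trend <- data.frame(
  year  = as.numeric(names(by_year)),
  trust = vapply(by_year, function(d) weighted.mean(d$trust, d$w), numeric(1))
)
print(trend)
```

Comparing this with the unweighted mean (6 in 2002 and 5 in 2022 for these toy numbers) shows how much the weights move the estimate.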