Files are usually named as follows: prefix.suffix. The prefix identifies the file, and the suffix indicates its type. For example, the file datamanagementplan.docx is a data management plan saved in a Microsoft Word format, which is a proprietary format.
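As a small illustration of the prefix.suffix convention, the sketch below splits a file name into its two parts with Python's standard pathlib module. The function name split_file_name is hypothetical and chosen only for this example.

```python
from pathlib import PurePath
from typing import Tuple


def split_file_name(file_name: str) -> Tuple[str, str]:
    """Return (prefix, suffix) for a file name such as 'report.docx'."""
    path = PurePath(file_name)
    # .stem is everything before the last dot; .suffix is the last dot
    # plus what follows it, so the leading dot is stripped here.
    return path.stem, path.suffix.lstrip(".")


prefix, suffix = split_file_name("datamanagementplan.docx")
print(prefix)  # datamanagementplan
print(suffix)  # docx
```

Only the last dot counts as the separator, so a name with several dots keeps the earlier ones in the prefix.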
What is the difference between a proprietary and an open format and which formats exist for different types of files?
It is important to know whether research file formats are open or proprietary, because this influences the re-use and the interoperability of research data, which are two key FAIR principles.
A proprietary file format is produced for profit by a business, such as Microsoft. This means that the software associated with a proprietary format, such as Microsoft Word, can only be used after purchase. To open files saved in proprietary formats, it is necessary to have the proprietary software. This limits the interoperability of research data: not all researchers have the proprietary software installed, so they might not be able to open research data saved in proprietary formats or re-use it in future research.
An open file format is supported by freely available software which can be accessed, downloaded and used by anyone. This means that files saved in open formats have fewer restrictions than files saved in proprietary formats, which enhances both the interoperability and the re-use of research data.
To enhance interoperability, researchers can either work with open formats from the very beginning of their research or convert files saved in proprietary formats into open formats.
Below you can consult some examples of proprietary and open file formats for different types of files.
An important note on tabular data files when using Data@UNIMI
As data files are uploaded to a dataset by the user, the Dataverse application tries to process and convert them into an archival format: this process is called ‘ingestion’. The goal of the ingest process is to extract the data content from the user’s files and archive it in an application-neutral, easily readable format. Ingestion can fail for several reasons, especially when it comes to tabular data.
If you upload an Excel file, the Dataverse application will automatically try to ingest it and convert it into an open-format tabular file. For this operation to succeed, please check the following points:
- If an Excel file has multiple sheets, only the first sheet of the file will be ingested (i.e. the other sheets will not be available for preview). The other sheets will be available when a user downloads the original Excel file. To have all sheets of an Excel file ingested and searchable at the variable level, upload each sheet as an individual file in your dataset.
- You may encounter ingest errors after uploading an Excel file if the file is formatted in a way that can’t be ingested by the Dataverse software. Ingest errors can be caused by a variety of formatting inconsistencies, including: line breaks in a cell, blank cells, single cells that span multiple rows, missing headers. Check here how to automatically remove carriage returns and formatting.
- If the file contains tables but also images, graphs, captions or other elements, only the table will be ingested and visible. Keep your Excel file as simple as possible!
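Some of the formatting problems listed above can be spotted before upload. The following is a minimal sketch, assuming the sheet has first been exported to CSV text; it does not reproduce the actual Dataverse ingest logic, and the function name find_ingest_risks is hypothetical. It flags missing headers, blank cells, line breaks inside cells and rows whose length differs from the header. Merged cells cannot be detected this way, because the merge information is lost in a CSV export.

```python
import csv
import io
from typing import List


def find_ingest_risks(csv_text: str) -> List[str]:
    """Return human-readable warnings for a table given as CSV text."""
    # newline="" leaves line endings untouched so the csv module can
    # handle line breaks inside quoted cells itself.
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        return ["file is empty"]

    warnings = []
    header = rows[0]
    for col, name in enumerate(header, start=1):
        if not name.strip():
            warnings.append(f"missing header in column {col}")

    # Records are counted from 1, with the header as record 1.
    for record, row in enumerate(rows[1:], start=2):
        if len(row) != len(header):
            warnings.append(
                f"record {record}: expected {len(header)} cells, found {len(row)}"
            )
        for col, cell in enumerate(row, start=1):
            if "\n" in cell or "\r" in cell:
                warnings.append(f"record {record}, column {col}: line break in cell")
            elif not cell.strip():
                warnings.append(f"record {record}, column {col}: blank cell")
    return warnings


sample = 'id,name,\n1,"Anna\nMaria",x\n2,,y\n'
for warning in find_ingest_risks(sample):
    print(warning)
```

An empty list of warnings only means that none of these specific patterns were found; the file may still fail ingestion for other reasons.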
More information about tabular data file ingest is available in the Dataverse User Guide, whilst an example of an Excel file that was successfully ingested is available in the Dataverse Sample Data GitHub repository.