GitHub - jacob-gg/structuring-and-sharing-data: Guidance for structuring and sharing (tabular) data with a biostatistician.

Guidance document: Structuring and sharing tabular data with a biostatistician

Jacob Goldstein-Greenwood
Senior Biostatistician
University of Michigan Kellogg Eye Center
Last updated: 2025-12-11

Cleaning and restructuring messy data often occupies a large portion of a biostatistician's billable time working on a project. Adhering to a few principles for structuring and recording data is one of the most effective ways for researchers to save money and ensure that analyses can be delivered on their ideal timelines.

This guidance document applies to tabular data: Data that can be structured as rows and columns.

At the bottom of the document are two example data sets. One is structured per the principles below; the other is not.

File types

.csv is an ideal file format for researchers and statisticians working with tabular data. Straightforward user interfaces (e.g., Excel) let researchers add to and update .csv data sets as needed, and .csv files are easy to read and work with in a statistician's programming language of choice (R, Python, etc.).

.txt (with a consistent delimiter), .tsv, and similar delimited file formats are fine as well.

.xlsx is likewise fine, but Excel's native file format, by mere allowance, can subtly encourage unhelpful data-recording and data-structuring practices (mixing raw data with analyses/visualizations; using formatting—colors, fonts, etc.—to communicate information, and so on).

Basic structure

Each column should be a variable; each row should be an observation. The first row should contain variable names.

The file should contain only rows (observations) and columns (variables). We must religiously separate raw data, analysis code, and results. For example:

Don't include notes that aren't neatly contained in a designated notes variable(s)
Don't mix summary statistics, cross tabulations, or other analyses with raw data
- E.g., don't compute mean logMAR in row 21 underneath individual logMAR values in rows 2–20
Don't include visualizations
- No Excel bar graphs living underneath the data on which they're based

Naming conventions

Use only alphanumeric characters and underscores for variable names: No spaces or special characters (beyond _). Don't use a number as the first character of a variable name.

Good variable names:
height_cm_visit_1
logMAR_OD
household_income_usd

Bad variable names:
1st_visit_height_cm
logMAR (OD)
Household income $

Keep variable names concise. Use a data dictionary to provide additional details on variables—how or when values were measured; cautionary notes for analysis/interpretation; etc. height_cm_visit_1 is far preferable to height_in_centimeters_at_first_visit_as_measured_by_RA_at_check_in, even though the latter follows the guidance to use only alphanumeric characters and underscores. Instead, provide the "extra" information ("measured by RA at check-in") in a data dictionary, if relevant.

Missing data

Leave missing data as missing; let absence communicate absence. There's rarely a need to use a placeholder value to represent missing data (e.g., n/a, -99, missing).

If there's a need to explain or record why a value is missing, use an additional variable to capture that information. For example, if it's important to know why values of a height variable are missing, include a variable called height_missing_reason.

Each cell contains one (and only one) piece of data

Data sets in the wild commonly include entries like $200, 72 in, 6.25 secs, etc.

These squash together multiple pieces of information: Values and units. Instead, include only values of numeric variables in the cells (or equivalent) of the data file, and include information like units elsewhere—in variable names (income_usd, height_in, time_secs), data dictionaries, or both.

Likewise, consider a data set with a variable recording the car that each participant drives. An entry like 2018 Toyota Camry compresses three pieces of information into one value: Year, make, and model. Instead, use a separate variable to capture each distinct piece of data: year, make, model.

Don't communicate information with aesthetics

Cell shading, text colors, font choices, and the like should not communicate information. If the control cases are shaded blue and the experimental cases are shaded red, drop the shading and add a variable called condition (control/treatment). If bolded subject IDs reflect cases that need to be excluded from analysis, drop the bolding and add a variable called exclude (TRUE/FALSE).

Consistency

Consistency is everything. When a data-structuring principle must be sacrificed or ignored, try to do so consistently.

For example, when recording dates, regardless of whether you use yyyy-mm-dd, mm/dd/yyyy, or some other standard, ensure that dates are recorded in the same way across every observation (and ideally, across every date variable).

Likewise, mixing hyphens (-) and en dashes (–) to indicate negative values is risky: Unless such inconsistency is communicated early and clearly, (some) negative values may lose their sign during analysis.

Example data sets

Good:

id	height_in	income_usd_k	logMAR_OD	logMAR_OS	condition	car_year	car_make	car_model	income_missing_reason
a	68	65	.2	0	control	2018	Toyota	Camry
b	70	110	0	.1	control	2015	Toyota	Corolla
c	65	80	.1	.2	trt	2016	Honda	Accord
d	64		-0.1	-0.1	trt	2023	Volkswagen	Jetta	refused
e	72	45	0	0	trt	2021	Toyota	Yaris

Bad:

id/condition	height	income ($)	logMAR (OD)	logMAR (OS)	car?
a / control	68 in	65,000	.2	0	2018 Toyota Camry
b / control	70 in	110,000	0	.1	2015 Toyota Corolla
c / trt	65 in	80,000	.1	.2	2016 Honda Accord
d / trt	64 in	N/A, refused	-0.1	–0.1	2023 Volkswagen Jetta
e / trt	72 in	45,000	0	0	2021 Toyota Yaris

In an actual data file (a .csv, for example), we could imagine the "bad" data set being further worsened by the presence of extraneous notes (an additional unstructured row reading: "Data recorded by RA between January and April 2025"); an embedded visualization (a dot plot of income by condition sitting to the right of the tabular data); and so on.

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Guidance document: Structuring and sharing tabular data with a biostatistician

Table of contents

File types

Basic structure

Naming conventions

Missing data

Each cell contains one (and only one) piece of data

Don't communicate information with aesthetics

Consistency

Example data sets

Good:

Bad:

About

Uh oh!

Releases

Packages

jacob-gg/structuring-and-sharing-data

Folders and files

Latest commit

History

Repository files navigation

Guidance document: Structuring and sharing tabular data with a biostatistician

Table of contents

File types

Basic structure

Naming conventions

Missing data

Each cell contains one (and only one) piece of data

Don't communicate information with aesthetics

Consistency

Example data sets

Good:

Bad:

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Packages