Skip to content

Guidance for structuring and sharing (tabular) data with a biostatistician.

Notifications You must be signed in to change notification settings

jacob-gg/structuring-and-sharing-data

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 

Repository files navigation

Guidance document: Structuring and sharing tabular data with a biostatistician

Jacob Goldstein-Greenwood
Senior Biostatistician
University of Michigan Kellogg Eye Center

Last updated: 2025-12-11


Cleaning and restructuring messy data often occupies a large portion of a biostatistician's billable time working on a project. Adhering to a few principles for structuring and recording data is one of the most effective ways for researchers to save money and ensure that analyses can be delivered on their ideal timelines.

This guidance document applies to tabular data: Data that can be structured as rows and columns.

At the bottom of the document are two example data sets. One is structured per the principles below; the other is not.


Table of contents

File types
Basic structure
Naming conventions
Missing data
Each cell contains one (and only one) piece of data
Don't communicate information with aesthetics
Consistency
Example data sets


File types

.csv is an ideal file format for researchers and statisticians working with tabular data. Straightforward user interfaces (e.g., Excel) let researchers add to and update .csv data sets as needed, and .csv files are easy to read and work with in a statistician's programming language of choice (R, Python, etc.).

.txt (with a consistent delimiter), .tsv, and similar delimited file formats are fine as well.

.xlsx is likewise fine, but Excel's native file format, by mere allowance, can subtly encourage unhelpful data-recording and data-structuring practices (mixing raw data with analyses/visualizations; using formatting—colors, fonts, etc.—to communicate information, and so on).

Basic structure

Each column should be a variable; each row should be an observation. The first row should contain variable names.

The file should contain only rows (observations) and columns (variables). We must religiously separate raw data, analysis code, and results. For example:

  • Don't include notes that aren't neatly contained in a designated notes variable(s)
  • Don't mix summary statistics, cross tabulations, or other analyses with raw data
    • E.g., don't compute mean logMAR in row 21 underneath individual logMAR values in rows 2–20
  • Don't include visualizations
    • No Excel bar graphs living underneath the data on which they're based

Naming conventions

Use only alphanumeric characters and underscores for variable names: No spaces or special characters (beyond _). Don't use a number as the first character of a variable name.

Good variable names:
height_cm_visit_1
logMAR_OD
household_income_usd

Bad variable names:
1st_visit_height_cm
logMAR (OD)
Household income $

Keep variable names concise. Use a data dictionary to provide additional details on variables—how or when values were measured; cautionary notes for analysis/interpretation; etc. height_cm_visit_1 is far preferable to height_in_centimeters_at_first_visit_as_measured_by_RA_at_check_in, even though the latter follows the guidance to use only alphanumeric characters and underscores. Instead, provide the "extra" information ("measured by RA at check-in") in a data dictionary, if relevant.

Missing data

Leave missing data as missing; let absence communicate absence. There's rarely a need to use a placeholder value to represent missing data (e.g., n/a, -99, missing).

If there's a need to explain or record why a value is missing, use an additional variable to capture that information. For example, if it's important to know why values of a height variable are missing, include a variable called height_missing_reason.

Each cell contains one (and only one) piece of data

Data sets in the wild commonly include entries like $200, 72 in, 6.25 secs, etc.

These squash together multiple pieces of information: Values and units. Instead, include only values of numeric variables in the cells (or equivalent) of the data file, and include information like units elsewhere—in variable names (income_usd, height_in, time_secs), data dictionaries, or both.

Likewise, consider a data set with a variable recording the car that each participant drives. An entry like 2018 Toyota Camry compresses three pieces of information into one value: Year, make, and model. Instead, use a separate variable to capture each distinct piece of data: year, make, model.

Don't communicate information with aesthetics

Cell shading, text colors, font choices, and the like should not communicate information. If the control cases are shaded blue and the experimental cases are shaded red, drop the shading and add a variable called condition (control/treatment). If bolded subject IDs reflect cases that need to be excluded from analysis, drop the bolding and add a variable called exclude (TRUE/FALSE).

Consistency

Consistency is everything. When a data-structuring principle must be sacrificed or ignored, try to do so consistently.

For example, when recording dates, regardless of whether you use yyyy-mm-dd, mm/dd/yyyy, or some other standard, ensure that dates are recorded in the same way across every observation (and ideally, across every date variable).

Likewise, mixing hyphens (-) and en dashes (–) to indicate negative values is risky: Unless such inconsistency is communicated early and clearly, (some) negative values may lose their sign during analysis.


Example data sets

Good:

id height_in income_usd_k logMAR_OD logMAR_OS condition car_year car_make car_model income_missing_reason
a 68 65 .2 0 control 2018 Toyota Camry
b 70 110 0 .1 control 2015 Toyota Corolla
c 65 80 .1 .2 trt 2016 Honda Accord
d 64 -0.1 -0.1 trt 2023 Volkswagen Jetta refused
e 72 45 0 0 trt 2021 Toyota Yaris

Bad:

id/condition height income ($) logMAR (OD) logMAR (OS) car?
a / control 68 in 65,000 .2 0 2018 Toyota Camry
b / control 70 in 110,000 0 .1 2015 Toyota Corolla
c / trt 65 in 80,000 .1 .2 2016 Honda Accord
d / trt 64 in N/A, refused -0.1 –0.1 2023 Volkswagen Jetta
e / trt 72 in 45,000 0 0 2021 Toyota Yaris

In an actual data file (a .csv, for example), we could imagine the "bad" data set being further worsened by the presence of extraneous notes (an additional unstructured row reading: "Data recorded by RA between January and April 2025"); an embedded visualization (a dot plot of income by condition sitting to the right of the tabular data); and so on.

About

Guidance for structuring and sharing (tabular) data with a biostatistician.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published