Tidy Data, Archives, Metadata

Week 3 of Data Science for NGA LTER REU Students

Liz Dobbins

NGA LTER / Axiom Data Science

2023-08-03

Last Week

  1. Intro to Programming (Python)
  2. Programming Best Practices
  3. Practiced Best Practices
  4. Best Practices Solution

Data Life Cycle

Data Life Cycle. DataONE Best Practices

Signature Data Example

  • Plan: LTER has extensive data management
  • Collect: Multiple PIs, many years
  • Assure: Best quality. Nicely formatted.
  • Describe: There is some metadata on the NGA website
  • Preserve: Website includes links to archive
  • Discover: Informal
  • Integrate: Future possibilities
  • Analyze: Python/pandas

Julie Lowndes and Allison Horst

Definition of “Tidy Data”

  1. Each variable forms a column.
  2. Each observation forms a row.
  3. Each type of observational unit forms a table.

Wickham, Hadley. 2014. “Tidy Data”. Journal of Statistical Software 59 (10):1-23. https://doi.org/10.18637/jss.v059.i10.

Example of Messy Data

A table of weights:

Plot  SpeciesA  SpeciesB
1     3.5       1.2
2     2.8       4.2


  • the variable weight is spread across multiple columns
  • SpeciesA and SpeciesB are values of a single variable (species), not variables themselves
    • values should not be used as column headers

Same Data, Now Tidy

Plot  Species  Weight
1     A        3.5
1     B        1.2
2     A        2.8
2     B        4.2


  • each row is an observation
  • queries are easier
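
The reshaping from the messy table to the tidy one can be sketched in pandas. This is a minimal illustration rather than a prescribed workflow: the variable names (messy, tidy, mean_weight) are chosen here for the example, and melt is one of several ways to do the reshape.

```python
import pandas as pd

# The messy table from the slide: one weight column per species.
messy = pd.DataFrame({
    "Plot": [1, 2],
    "SpeciesA": [3.5, 2.8],
    "SpeciesB": [1.2, 4.2],
})

# melt() stacks the species columns into Species/Weight pairs,
# so each row becomes a single observation.
tidy = messy.melt(id_vars="Plot", var_name="Species", value_name="Weight")

# Drop the "Species" prefix so the column holds just A or B.
tidy["Species"] = tidy["Species"].str.replace("Species", "", regex=False)

# Sort to match the row order shown on the slide.
tidy = tidy.sort_values(["Plot", "Species"]).reset_index(drop=True)
print(tidy)

# With tidy data a query is a one-liner, e.g. mean weight per species.
mean_weight = tidy.groupby("Species")["Weight"].mean()
print(mean_weight)
```

The same groupby on the messy table would first require knowing which columns hold weights, which is the practical cost of storing values in column headers.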

Other Qualities of Tidy Data

  • Units not included in cell with data
  • Visual indicators (colors, fonts, italics) not used
  • Consistent names
  • Consistent date formats
  • Short, descriptive language (avoid abstract codes)
  • Use a consistent value for missing data (pandas reads NaN and blank as missing by default; a sentinel such as -9999 must be declared when loading)
  • Data uniquely assigned to a single table
  • Saved as plain text format (CSV)
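
A few of these qualities can be sketched in pandas. The small table below is invented for illustration (the column names, the "g" unit suffix, and the date layout are all assumptions), and the cleaning steps are one possible approach, not the only one.

```python
import io
import pandas as pd

# Hypothetical messy export: a unit inside a cell, a -9999 sentinel,
# a blank, and month/day/year dates.
raw = io.StringIO(
    "plot,date,weight\n"
    "1,07/01/2023,3.5g\n"
    "2,07/02/2023,-9999\n"
    "3,07/03/2023,\n"
)

# Blank cells are read as missing by default; -9999 has to be declared.
df = pd.read_csv(raw, dtype={"weight": str}, na_values=["-9999"])

# Move the unit out of the cells and into the column name.
df["weight_g"] = pd.to_numeric(df["weight"].str.replace("g", "", regex=False))
df = df.drop(columns="weight")

# Rewrite dates in one consistent format (ISO 8601: year-month-day).
df["date"] = pd.to_datetime(df["date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")

# Save as plain text; missing values are written as blanks.
csv_text = df.to_csv(index=False)
print(csv_text)
```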

Data Carpentry Ecology Lesson Exercise

  1. Work with a partner
  2. Open survey_data_spreadsheet_messy.xlsx in the Google Drive
  3. Identify what is wrong with the spreadsheet
  4. Discuss how you might fix it

After you go through this exercise, we will discuss as a group

Where to Discover Data

How to Discover Data


Scientific Data Discovery                                      Streaming Video
Informally between researchers                                 your mom’s emails
Via project or institutional website                           a link at nbc.com
Referenced in a journal article                                via a blog review
Discoverable within a specialized archive or repository        AppleTV or Netflix
Discoverable in a network of repositories (Data.gov, DataONE)  IMDB

LTER Data Management Requirements

  • Sites must have an integrated Information Management System
  • Data available online within two years of data collection
  • Sites should submit data to repositories
  • Long-term (>20 years) usability of data
  • Metadata

Where is NGA LTER Data?

  • NGA Data Catalog
    • Portal supplied by DataONE
  • Data and metadata are stored in the Research Workspace member node

Data Portals are Powered by Metadata
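
As a toy illustration of the idea, a portal search can be thought of as a query over metadata records rather than over the data files themselves. The records, field names, and search function below are all hypothetical and made up for this sketch; real portals such as DataONE index far richer metadata and are not implemented this way.

```python
# Hypothetical metadata records; identifiers, titles, and keywords are invented.
catalog = [
    {"id": "example-1", "title": "Example zooplankton survey",
     "keywords": ["zooplankton", "abundance"]},
    {"id": "example-2", "title": "Example mooring temperature record",
     "keywords": ["temperature", "mooring"]},
    {"id": "example-3", "title": "Example CTD profiles",
     "keywords": ["temperature", "salinity"]},
]


def search(records, term):
    """Return records whose title or keywords mention the term."""
    term = term.lower()
    return [
        r for r in records
        if term in r["title"].lower()
        or term in (k.lower() for k in r["keywords"])
    ]


# Only the metadata is searched; the data files are never opened.
hits = search(catalog, "temperature")
print([r["id"] for r in hits])
```

A dataset without a title or keywords would never appear in the results, which is why good metadata matters for discovery.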



Data Discovery Using DataONE Activity

Exploring the DataONE Data Catalog