The main goal of tidy data is making it easy for a computer to work with the data. Let’s start by looking at some messy data and thinking about what makes it messy and what we could do to improve it.
Do the exercise on Improving Messy Data.
- Put the data up on the screen
- Ask the class for things they would improve and how to fix them
- Start a list of the rules they come up with on the board
- Talk through anything that can be improved about their answers
- At the end, add any rules from the list below that the class missed
General rules
- Be consistent
- Make it a rectangle
- One row for each data point
- One column per type of information
- Every cell contains one value
- Minimize redundancy by splitting the data across multiple tables
- Don’t use colors, fonts, or anything purely visual as data
- Use good null values (never a numeric code like -999; blanks work well, and some prefer NA or similar, but the best choice is language specific)
- Save data in plain text files
- Use good names
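
To show several of these rules working together, here is a minimal Python sketch using only the standard library. The table, the column names (`site`, `year`, `count`), and the `tidy` function are hypothetical, made up for illustration; they are not taken from the exercise data. The messy input has one column per year, uses -999 as a null, and packs a number and a unit into one cell.

```python
import csv
import io

# Hypothetical messy table (not the exercise data): one row per site,
# one column per year, -999 as a null, and "count unit" packed into one cell.
messy_csv = """site,2019,2020
A,12 birds,-999
B,7 birds,9 birds
"""


def tidy(text):
    """Return one dict per observation: site, year, count (None if missing)."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        site = record.pop("site")
        # Every remaining column is a year, so each cell becomes its own row.
        for year, cell in record.items():
            # Treat -999 and blank cells as "no data" and store them as None.
            if cell.strip() in ("", "-999"):
                count = None
            else:
                count = int(cell.split()[0])  # keep the number, drop the unit
            rows.append({"site": site, "year": int(year), "count": count})
    return rows


tidy_rows = tidy(messy_csv)

# Write the result as plain-text CSV; the csv module writes None as a blank cell.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["site", "year", "count"], lineterminator="\n")
writer.writeheader()
writer.writerows(tidy_rows)
print(out.getvalue())
```

The output has one row for each site and year, one column per type of information, a single value in every cell, and a blank where the value is missing. Whether a blank or a marker such as NA is the better null depends on the tools the class will use afterwards.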