Transposing data

There are so many different ways to enter data. Most people concentrate on getting their data into a spreadsheet format before considering what the proper format for their analysis may be.  It may become necessary for some analyses to switch the orientation of your dataset from a columnar format to a wider format or the reverse, from a wide format to that of a long column.  In this blog post I will show how to use PROC Transpose and the DATA step to go back and forth between these 2 formats.

A wide format is often used when we enter data for a repeated measures type of design.  In this case each line of data will represent one experimental unit.  As an example, when I collect data for a repeated measures trial, I will  have one line of data for each unit (whether it be a person, plot, animal, etc…) and I will add a new column for each time I take a measurement.  If I collect data once a month for 6 months I will have a column for each month I collect data.

Wide format:

ID TREATMENT HT1 HT2 HT3 HT4 HT5 HT6
01 A 12 18 19 26 34 55
02 A 10 15 19 24 30 45
03 B 11 16 20 25 32 50
04 B 9 11 14 22 38 42

A column type format would consist of a line of data for each measurement taken on an individual.  If we have 6 measurements taken on one individual, we would have 6 lines of data for each individual.

Columnar format:

ID TREATMENT WEEK HEIGHT
01 A 1 12
01 A 2 18
01 A 3 19
01 A 4 26
01 A 5 34
01 A 6 55

Which format you will use will depend on your data entry preferences and the analysis you will be performing.

Reading in the datasets

Data wide;
input id trmt$ ht1 ht2 ht3 ht4 ht5 ht6;
datalines;
1 a 12 18 19 26 34 55
2 a 10 15 19 24 30 45
3 b 11 16 20 25 32 50
4 b  9 11 14 22 38 42
;
Run;

Data columnar;
input id trmt$ week ht;
datalines;
1 a 1 12
1 a 2 18
1 a 3 19
1 a 4 26
1 a 5 34
1 a 6 55
2 a 1 10
2 a 2 15
2 a 3 19
2 a 4 24
2 a 5 30
2 a 6 45
3 b 1 11
3 b 2 16
3 b 3 20
3 b 4 25
3 b 5 32
3 b 6 50
4 b 1 9
4 b 2 11
4 b 3 14
4 b 4 22
4 b 5 38
4 b 6 42
;
Run;

Creating a wide file from a columnar file

Proc transpose data=columnar out=wide_out;   <-  out=new dataset name for transposed                                                                                     data
Run;

Proc print data=wide_out;
Run;

Not quite what we want, we need to tell SAS that we would like the file transposed by ID and trmt.  So let’s add a BY statement:

Proc transpose data=columnar out=wide_out;
by id trmt;
Run;

Proc print data=wide_out;
Run;

Closer, but now we have a row for something called week and ht.  We do not need the rows that are labelled “week”.  To remove these we will tell SAS that we are only interested in the “variable” called ht by adding a VAR statement to our PROC:

Proc transpose data=columnar out=wide_out;
by id trmt;
var ht;
Run;

Proc print data=wide_out;
Run;

Much closer…  now we would like to change the column headings from COL1 to Week1 by adding a PREFIX= option in the PROC statement:

Proc transpose data=columnar out=wide_out prefix=week;
by id trmt;
var ht;
Run;

Proc print data=wide_out;
Run;

For all intents and purposes we now have a dataset we can use for a repeated measures ANOVA using PROC GLM.  The heading _NAME_ may be distracting, so let’s change its name to something more meaningful by adding a NAME= option in the PROC statement:

Proc transpose data=columnar out=wide_out prefix=week name=measure;
by id trmt;
var ht;
Run;

Proc print data=wide_out;
Run;

Creating a columnar file from a wide file

Moving in this direction, you have a couple of options, using the DATA Step or using PROC Transpose.

Using the DATA Step

Data columnar (keep=id trmt week ht);   <- in the new dataset called columnar only keep                                                                         the variables called id, trmt, week and ht

that means for this dataset the variables, ht1,                                                                          ht2, ht3, ht4, ht5, ht6 will be deleted from the                                                                          new dataset called columnar
set wide;
week = 1; ht = ht1; output columnar;   <-create a new variable called week with a value                                                                      of “1” for the first week’s measurement

.                                                                  create a new variable called ht with the value of                                                                      ht1

.                                                                 output = save the data into the new dataset                                                                             called columnar
week = 2; ht = ht2; output columnar;
week = 3; ht = ht3; output columnar;
week = 4; ht = ht4; output columnar;
week = 5; ht = ht5; output columnar;
week = 6; ht = ht6; output columnar;
Run;

Nice and straightforward and takes advantage of the power of the DATA Step.

Using PROC Transpose

Let’s take the same approach as above.

Proc transpose data=wide out=column_out;
Run;

Proc print data=column_out;
Run;

This provides us with a dataset with a row for each “variable” – id and the 6 ht measures – not quite what we are looking for.  We know that we need a file that will have the 6 measures for each ID.  So let’s try adding a BY statement for ID.

Proc transpose data=wide out=column_out;
by id;
Run;

Proc print data=column_out;
Run;

Close, notice how there is nothing for TRMT.  So let’s try adding it to the BY statement:

Proc transpose data=wide out=column_out;
by id trmt;
Run;

Proc print data=column_out;
Run;

Much better and if you look carefully at the dataset we have the correct format, but let’s clean it up a bit. Let’s add a NAME= option in the PROC statement to change _NAME_ to week:

Proc transpose data=wide out=column_out name=week;
by id trmt;
Run;

Proc print data=column_out;
Run;

The last think that we can change is the COL1 to HT1.  We will not be able to change this variable name to HT only without using an additional DATA Step:

Proc transpose data=wide out=column_out name=week prefix=ht;
by id trmt;
Run;

Proc print data=column_out;
Run;

To clean up this dataset to match the dataset we created with the DATA step, we would need to change the variable name HT1 to HT, and the values of WEEK from ht1, ht2, ht3, etc… to 1, 2, 3, etc…

Data column_out2 (drop=ht1);
set column_out;
ht = ht1;
if week = “ht1” then week = 1;
if week = “ht2” then week = 2;
if week = “ht3” then week = 3;
if week = “ht4” then week = 4;
if week = “ht5” then week = 5;
if week = “ht6” then week = 6;
Run;

Proc print data=column_out2;
Run;