Introducing SAS
SAS is a programming language. That is, you type sets of instructions and then tell SAS to run them to clean or analyze your research data, one dataset at a time. SAS programs can be as simple as a single line, like this example, which sets the page numbers in the output to begin at page one:
options pageno = 1;
To get SAS to reset the page number, you will type (or copy and paste) the instructions into the SAS Editor window and then push the run button.
SAS requires only a few lines to create a new dataset. The details will be explained later, but below I am making a temporary dataset called older, which only includes the people in a permanent dataset called rhinitis who are older than 80:
data work.older;
    set source.rhinitis;
    where age > 80;
run;
Real projects can use hundreds or even many thousands of lines to do complex tasks. SAS does not pay attention to capitalization or spacing (unless the code is inside quotes or a space would split one word into two), but you will see code with white space added using the space key or tab to make the programs easier to read. You will also see comments added to SAS programs. These notes appear as green text in the editor, either between * and ; or between /* and */.
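As a small illustration of those points, the sketch below shows both comment styles and the same data step written twice, once crammed onto one line and once spread out in capitals. It assumes the source library has already been assigned with a libname statement (explained later in this chapter); the two steps should produce the same work.older dataset.

```sas
* This is a statement-style comment. It ends at the semicolon;

/* This is a block-style comment.
   It can span several lines. */

/* Version 1: everything on one line, lowercase. */
data work.older; set source.rhinitis; where age > 80; run;

/* Version 2: same instructions with capitals and extra white space. */
DATA work.older;
    SET source.rhinitis;
    WHERE age > 80;
RUN;
```

The second version is longer but much easier to read, which is why most of the code you will see is laid out that way.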
Datasets look like a grid of data in a Microsoft Excel workbook. That is, you typically will have one line per subject (or one line per patient visit) and one column for each thing you are assessing. The features you are measuring are called variables. The columns holding these variables can be either numeric or character types and should have names that are reasonably short (SAS will let you use 32 characters but 5 to 15 characters are usually enough to be descriptive). While SAS can be flexible about the names you use for your variables, begin variable names with a letter of the alphabet and include nothing other than letters or numbers in the names. Because SAS does not want you to use spaces in variable names, I suggest you make the first letter of each word after the first uppercase. This “camel” case style makes it relatively easy to decode the meaning of variables like dateOfRadiation.
When you load data into SAS it will guess whether each variable is character or numeric based on what it sees. A missing value in a numeric variable is shown as a single period (.) and a missing value in a character variable is displayed as an empty cell in the grid. SAS cannot do math on character variables.
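To see what missing values look like, here is a minimal sketch that types in a tiny dataset by hand. The dataset name missingDemo and the variables name and age are made up for this illustration. In the typed data, a lone period stands for a missing value in either kind of variable.

```sas
/* Hypothetical three-row dataset: name is character, age is numeric. */
data work.missingDemo;
    input name $ age;
    datalines;
Ann 34
Bob .
. 50
;
run;

/* Show only the rows where something is missing. */
proc print data = work.missingDemo;
    where age = . or name = "";
run;
```

In the printed output, the missing age should appear as a period and the missing name as a blank.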
To examine or change values of a character variable you will use quotes. Every statement in SAS ends in a semicolon. This line sets the character variable sex to the letters Male:
sex = "Male";
SAS knows that the sex variable is character because of the quotes. Because it is a character variable, SAS knows you can't calculate the average of sex. That is the practical difference between numeric and character variables: SAS will not do math on character data.
This statement sets the age at diagnosis variable, which is actually named ageAtDiag, to the number 21:
ageAtDiag = 21;
SAS knows that the ageAtDiag variable is numeric because of the lack of quotes. SAS will be able to use the ageAtDiag variable in formulas.
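Assignment statements like these live inside a data step. The sketch below is only an illustration: the dataset name example and the variable ageInTenYears are invented here, and the length statement simply reserves room for six characters in sex.

```sas
data work.example;
    length sex $ 6;                    /* character variable, up to 6 characters */
    sex = "Male";                      /* quotes: a character value */
    ageAtDiag = 21;                    /* no quotes: a numeric value */
    ageInTenYears = ageAtDiag + 10;    /* math works on numeric variables */
run;

proc print data = work.example;
run;
```

The dataset has a single row, with ageInTenYears holding 31.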
How Libraries Work
SAS predates modern operating systems like Windows or Mac OS. So, instead of referring to folders full of datasets, SAS uses the word library for a place on your hard drive that has datasets. Instead of having to type C:\Projects\Books\Presenting\data\sasData\asthma.sas7bdat you can create a library reference to that folder like this:
libname source "C:\Projects\Books\Presenting\data\sasData\";
and then refer to the dataset as source.asthma. Notice that the file suffix .sas7bdat is automatically handled by SAS. That line of code calls the library source but you could use another name if you prefer. Library names need to be short (8 or fewer characters) and they should begin with a letter of the alphabet. To use the datasets associated with this book without accidentally modifying them, download the data into a folder and then set the location using a libname statement like this:
libname source "C:\Projects\Books\Presenting\data\sasData" access = readonly;
While you can save your data anywhere on your hard drive, it may be easiest to create the folders in the path above and download the data into that folder because the sample code used with the book includes that file path.
It is critical that you remember that you will need to run the libname statement once every time you restart SAS but after that is done you can use any dataset in that library/folder.
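Once the library is assigned, a quick way to confirm it worked is to look inside one of its datasets. The sketch below assumes the asthma dataset mentioned above has been downloaded into that folder.

```sas
/* Run once per SAS session; the path is the one used in the text. */
libname source "C:\Projects\Books\Presenting\data\sasData" access = readonly;

/* List the variables stored in source.asthma. */
proc contents data = source.asthma;
run;

/* Print the first five rows. */
proc print data = source.asthma (obs = 5);
run;
```

If the libname statement has not been run in the current session, SAS will complain in the log that the library source does not exist.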
Formats
This section has a lot of details that you will want to know to make your data look good. Don't panic over the number of details. You can download a program that will format the datasets used in the book here. The first time you read this section, worry about the gist, not the details.
Programs like Microsoft Excel allow you to display dates as January 18, 1967 or 01/18/1967 or 18/01/67. This is possible because Excel stores the date as the number of days that have elapsed since the start of 1900 and simply tweaks the display format. SAS works similarly. It stores dates as the number of days since January 1, 1960. To have the dates display sensibly, you will want to apply a date format to the values. For example, to change the display format for date of diagnosis to look like 01/18/1967, you will do something like this in your code:
format dateOfDiag mmddyy10.;
or if you prefer 18JAN1967 you would include a line like this:
format dateOfDiag date9.;
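A short sketch can make the stored number visible. The step below is only an illustration: it creates no dataset (that is what data _null_ means) and writes the same date to the SAS log three ways. Note that a four-digit year in the 18JAN1967 style needs a width of 9, which is why date9. is used.

```sas
data _null_;
    dateOfDiag = '18JAN1967'd;                    /* a SAS date literal */
    put "Stored value: " dateOfDiag;              /* 2574 days after January 1, 1960 */
    put "mmddyy10.:    " dateOfDiag mmddyy10.;    /* 01/18/1967 */
    put "date9.:       " dateOfDiag date9.;       /* 18JAN1967 */
run;
```

The value held in dateOfDiag is the same on all three lines; only the format applied when it is written changes.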
A display format does not change the values of the dates; it simply changes how they will appear. SAS also lets you create your own formats so variables can hold “code numbers” and the values can be displayed intelligently. If you have two variables, one holding the sex of a patient and another holding the sex of the patient's partner, you can create a display format called sex like this:
proc format library = work;
    value sex
        0 = "Male"
        1 = "Female";
run;
And apply it to a dataset like this:
data work.formatted;
    set work.raw;
    format patientIsFemale sex.;
    format partnerIsFemale sex.;
    format dateOfBirth mmddyy10.;
run;
While the variables will display with words like Male and Female, the values themselves will remain 0 and 1. While this is very efficient, it will cause you headaches at first because you will need to remember that 1 means Female when you check values with code.
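Here is a minimal sketch of that headache, using a tiny hand-typed dataset. The dataset name sexCodes is made up for this illustration. The where statement must compare against the stored number, even though the printed output shows the word.

```sas
proc format library = work;
    value sex
        0 = "Male"
        1 = "Female";
run;

/* Hypothetical three-row dataset holding the code numbers. */
data work.sexCodes;
    input patientIsFemale;
    format patientIsFemale sex.;
    datalines;
0
1
1
;
run;

/* Compare against the stored number 1, not the displayed word Female. */
proc print data = work.sexCodes;
    where patientIsFemale = 1;
run;
```

The printout should list the two rows coded 1, each displayed as Female.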
The example above created and used a numeric format which changed the appearance of a numeric variable. SAS will also let you change the appearance of a character variable. The syntax is very similar but it uses a $ before the name of the format and it quotes the values whose appearance will change. The code below creates a character format called sexMF which can be used to cause M to appear as Male and F as Female. The data step code which follows proc format tells SAS to make a dataset called aNewDataSet which is identical to someData but, when it is displayed, the variable called theSex will show the values Male and Female instead of M and F.
proc format library = work;
    value $sexMF
        "M" = "Male"
        "F" = "Female";
run;

data work.aNewDataSet;
    set work.someData;
    format theSex $sexMF.;
run;
Many SAS analysis procedures display variables in alphabetical order. SAS considers the space character " " as the first letter of the alphabet. You will see many examples in code where there are spaces at the beginning of formats to set the order in which things appear. For example, a format can be created, with a space as the first letter, to tell SAS to display Male before Female. This works because the " " comes before "F" in alphabetical order.
proc format library = work;
    value $sexMF
        "M" = " Male"
        "F" = "Female";
run;
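To watch the leading space do its work, here is a hedged sketch with a hand-typed dataset. The dataset name sexLetters is invented for this illustration, and I am assuming proc freq with the order = formatted option, which asks explicitly for the rows to be sorted by their formatted values.

```sas
proc format library = work;
    value $sexMF
        "M" = " Male"      /* leading space sorts ahead of any letter */
        "F" = "Female";
run;

/* Hypothetical three-row dataset holding the letters M and F. */
data work.sexLetters;
    input theSex $;
    datalines;
F
M
F
;
run;

/* Sort the table rows by the formatted labels. */
proc freq data = work.sexLetters order = formatted;
    tables theSex;
    format theSex $sexMF.;
run;
```

With the space in place, Male should be listed before Female; remove the space from the format and the two rows should swap.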
Don’t worry if that is confusing right now. You will see many examples throughout the book. The important thing to remember is that formats change the appearance of values.