There is also a posting on Ellis SAS tips for Experienced SAS programmers
It focuses on issues when using large datasets.
Randy’s SAS hints for New SAS programmers, updated Feb 21, 2015
-
ALWAYS
begin and intermix your programs with internal documentation. (Note how I combined six forms of emphasis in ALWAYS: color, larger font, caps, bold, italics, underline.) Normally I recommend only one, but documenting your programs is really important. (Using only one form of emphasis is also important, just not really important.)
A simple example to start your program in SAS is
******************
* Program = test1, Randy Ellis, first version: March 8, 2013 – test program on sas features
***************;
Any comment starting with an asterisk and ending in a semicolon is ignored;
- Most common errors/causes of wasted time while programming in SAS.
a. Forgetting semicolons at the end of a line
b. Omitting a RUN statement, and then waiting for the program to run.
c. Unbalanced single or double quotes.
d. Unintentionally commenting out more code than you intend to.
e. Foolishly running a long program on a large dataset that has not first been tested on a tiny one.
f. Trying to print out a large dataset which will overflow memory or hard drive space.
g. Creating an infinite loop in a datastep; Here is one silly one. Usually they can be much harder to identify.
data infinite_loop;
x=1;
nevertrue=0;
do while x=1;
if nevertrue =1 then x=0;
end;
run;
h. There are many other common errors and causes of wasted time. I am sure you will find your own
- With big datasets, 99 % of the time it pays to use the following system OPTIONS:
options compress =yes nocenter;
or
options compress =binary nocenter;
binary compression works particularly well with many binary dummy variables and sometimes is spectacular in saving 95%+ on storage space and hence speed.
/* mostly use */
options nocenter /* SAS sometimes spends many seconds figuring out how to center large print outs of
data or results. */
ps=9999 /* avoid unneeded headers and page breaks that split up long tables in output */
ls=200; /* some procs like PROC MEANS give less output if a narrow line size is used */
*other key options to consider;
Options obs = max /* or obs=100, Max= no limit on maximum number of obs processed */
Nodate nonumber /* useful if you don’t want SAS to embed headers at top of each page in listing */
Macrogen /* show the SAS code generated after running the Macros. */
Mprint /* show how macro code and macro variables resolve */
nosource /* suppress source code from long log */
nonotes /* be careful, but can be used to suppress notes from log for long macro loops */
; *remember to always end with a semicolon!;
- Use these three key procedures regularly
Proc contents data=test; run; /* shows a summary of the file similar to Stata’s DESCRIBE */
Proc means data = test (obs=100000); run; /* set a max obs if you don’t want this to take too long */
Proc print data = test (obs=10); run;
I recommend you create and use regularly a macro that does all three easily:
%macro cmp(data=test);
Proc Contents data=&data; Proc means data = &data (obs=1000); Proc print data = &data (obs=10); run;
%end;
Then do all three (contents, means, print ten obs) with just
%cmp(data = mydata);
- Understand temporary versus permanent files;
Data one; creates a work.one temporary dataset that disappears when SAS terminates;
Data out.one; creates a permanent dataset in the out directory that remains even if SAS terminates;
Define libraries (or directories):
Libname out “c:/data/marketscan/output”;
Libname in “c:/data/marketscan/MSdata”;
Output or data can be written into external files:
Filename textdata “c:/data/marketscan/textdata.txt”;
- Run tests on small samples to develop programs and then Toogle between tiny and large samples when debugged.
A simple way is
Options obs =10;
*options obs = max; *only use this when you are sure your programs run.
OR, some procedures and data steps using End= dataset option do not work well on partial samples. For those I often toggle between two different input libraries. Create a subset image of all of your data in a separate directory and then toggle using the libname commands;
*Libname in ‘c:/data/projectdata/fulldata’;
Libname in ‘c:/data/projectdata/testsample’;
Time spent creating a test data set is time well spent.
You could even write a macro to make it easy. (I leave it as an exercise!)
- Use arrays abundantly. You can use different array names to reference the same set of variables. This is very convenient;
%let rhs=x1 x2 y1 y2 count more;
Data _null_;
Array X {100} X001-X100; *usual form;
Array y {100} ; * creates y1-y100;
Array xmat {10,10} X001-X100; *matrix notation allows two dimensional indexes;
Array XandY {*} X001-X100 y1-y100 index ; *useful when you don’t know the count of variables in advance;
Array allvar &rhs. ; *implicit arrays can use implicit indexes;
*see various ways of initializing the array elements to zero;
Do i = 1 to 100; x{i} = 0; end;
Do i = 1 to dim(XandY); XandY{i} = 0; end;
Do over allvar; allvar = 0; end; *sometimes this is very convenient;
Do i=1 to 100 while (y(i) = . );
y{i} = 0; *do while and do until are sometimes useful;
end;
run;
- For some purposes naming variables in arrays using leading zeros improves sort order of variables
Use:
Array x {100} X001-X100;
not
Array x {100} X1-X100;
With the second, the alphabetically sorted variables are x1,x10,x100, x11, x12,..,x19, x2,x20 , etc.
- Learn Set versus Merge command (Update is for rare, specialized use)
Data three; *information on the same person combined into a single record;
Merge ONE TWO;
BY IDNO;
Run;
- Learn key dataset options like
Obs=
Keep=
Drop=
In=
Firstobs=
Rename=(oldname=newname)
End=
- Keep files being sorted “skinny” by using drop or keep statements
Proc sort data = IN.BIG(keep=IDNO STATE COUNTY FROMDATE) out=out.bigsorted;
BY STATE COUNTY IDNO FROMDATE;
Run;
Also consider NODUP and NODUPKEY options to sort while dropping duplicate records, on all or on BY variables, respectively.
- Take advantage of BY group processing
Use FIRST.var and LAST.var abundantly.
USE special variables
_N_ = current observation counter
_ALL_ set of all variables such as Put _all_. Or when used with PROC CONTENTS, set of all datasets.
Also valuable is
PROC CONTENTS data = in._all_; run;
- Use lots of comments
* this is a standard SAS comment that ends with a semicolon;
/* a PL1 style comment can comment out multiple lines including ordinary SAS comments;
* Like this; */
%macro junk; Macros can even comment out other macros or other pl1 style comments;
/*such as this; */ * O Boy!; %macro ignoreme; mend; *very powerful;
%mend; * end macro junk;
- Use meaningful file names!
Data ONE TWO THREE can be useful.
- Put internal documentation about what the program does, who did it and when.
- Learn basic macro language; See SAS program demo for examples. Know the difference between executable and declarative statements used in DATA step
17. EXECUTABLE COMMANDS USED IN DATA STEP (Actually DO something, once for every record)
Y=y+x (assignment. In STATA you would use GEN y=x or REPLACE Y=X)
Do I = 1 to 10;
End; (always paired with DO, can be nested nearly unlimited deepness)
INFile in ‘c:/data/MSDATA/claimsdata.txt’; define where input statements read from;
File out ‘c:/data/MSDATA/mergeddata.txt’; define where put statements write to;
Goto johnny; * always avoid. Use do groups instead;
IF a=b THEN y=0 ;
ELSE y=x; * be careful when multiple if statements;
CALL subroutine(); (Subroutines are OK, Macros are better)
INPUT X ; (read in one line of X as text data from INFILE)
PUT x y= / z date.; (Write out results to current LOG or FILE file)
MERGE IN.A IN.B ;
BY IDNO; * Match up with BY variable IDNO as you simultaneously read in A&B;
Both files must already be sorted by IDNO.
SET A B; * read in order, first all of A, and then all of B;
UPDATE A B; *replace variables with new values from B only if non missing in B;
OUTPUT out.A; *Write out one obs to out.A SAS dataset;
OUTPUT; *Writes out one obs of every output file being created;
DELETE; * do not output this record, and return to the top of the datastep;
STOP; * ends the current SAS datastep;
18. Assignment commands for DATA Step are
only done once at the start of the data step
DATA ONE TWO IN.THREE;
*This would create three data sets, named ONE TWO and IN.THREE
Only the third one will be kept once SAS terminates.;
Array x {10} x01-x10;
ATTRIB x length =16 Abc length=$8;
RETAIN COUNT 0;
BY state county IDNO;
Also consider
BY DESCENDING IDNO; or BY IDNO UNSORTED; if grouped but not sorted by IDNO;
DROP i; * do not keep i in final data set, although it can still be used while the data step is running
KEEP IDNO AGE SEX; *this will drop all variables from output file except these three;
FORMAT x date.; *permanently link the format DATE. To the variable link;
INFORMAT ABC $4.;
LABEL AGE2010 = “Age on December 31 2010”;
LENGTH x 8; *must be assigned the first time you reference the variable;
RENAME AGE = AGE2010; After this point you must use the newname (AGE2010);
OPTIONS NOBS=100; One of many options. Note done only once.
19. Key Systems language commands
LIBNAME to define libraries
FILENAME to define specific files, such as for text data to input or output text
TITLE THIS TITLE WILL APPEAR ON ALL OUTPUT IN LISTING until a new title line is given;
%INCLUDE
%LET year=2011;
%LET ABC = “Randy Ellis”;
20. Major procs you will want to master
DATA step !!!!! Counts as a procedure;
PROC CONTENTS
PROC PRINT
PROC MEANS
PROC SORT
PROC FREQ frequencies
PROC SUMMARY (Can be done using MEANS, but easier)
PROC CORR (Can be done using Means or Summary)
PROC REG OLS or GLS
PROC GLM General Linear Models with automatically created fixed effects
PROC FORMAT /INFORMAT
PROC UNIVARIATE
PROC GENMOD nonlinear models
PROG SURVEYREG clustered errors
None of the above will execute unless a new PROC is started OR you include a RUN; statement.
21. Formats are very powerful. Here is an example from the MarketScan data. One use is to simply recode variables so that richer labels are possible.
Another use is to look up or merge on other information in large files.
Proc format;
value $region
1=’1-Northeast Region ‘
2=’2-North Central Region ‘
3=’3-South Region ‘
4=’4-West Region ‘
5=’5-Unknown Region ‘
;
value $sex
1=‘1-Male ‘
2=‘2-Female ‘
other=‘ Missing/Unknown’
;
*Three different uses of formats;
Data one ;
sex=’1’;
region=1;
Label sex = ‘patient sex =1 if male’;
label region = census region;
run;
Proc print data = one;
Run;
data two;
set one;
Format sex $sex.; * permanently assigns sex format to this variable and stores format with the dataset;
Run;
Proc print data = two;
Run;
Proc contents data = two;
Run;
*be careful if the format is very long!;
Data three;
Set one;
Charsex=put(sex,$sex.);
Run;
*maps sex into the label, and saves a new variable as the text strings. Be careful can be very long;
Proc print data =three;
Run;
Proc print data = one;
Format sex $sex.;
*this is almost always the best way to use formats: Only on your results of procs, not saved as part of the datasets;
Run;
If you are trying to learn SAS on your own, then I recommend you buy:
The Little SAS Book: A Primer, Fifth Edition (or an earlier one)
Nov 7, 2012
by Lora Delwiche and Susan Slaughter
Beginners introduction to SAS. Probably the best single book to buy when learning SAS.