Introduction to S-PLUS is a two-part computer workshop taught by the
Research Computing Support Group of Information Technology and Communication,
University of Virginia. The workshop is an overview of layout and procedures of
S-PLUS for Windows including file operations, data definition procedures,
running basic descriptive statistics, data transformation procedures, and basic
analytical procedures. Procedures are learned through hands-on work related to
an actual research question.
For a schedule of the next S-PLUS workshop, please check the courses page of the ITC Research Computing Support Group (http://www.itc.Virginia.EDU/research/courses.html).
This document is the second part of the Introduction to S-PLUS workshop; the first part is also available online.
Prerequisites: This document presumes that you have familiarity with Windows 98/2000/XP and its commands. It will not review any DOS or Windows concepts such as filenames, paths, booting up, erasing files, using the mouse, scrolling, et cetera. If you are not yet comfortable with Windows and must use S-PLUS, please see the pointers in the appendix of Introduction to S-PLUS: Part I.
1. Reading Other Types of Data
First, we’ll download the same set of files that were used in Part I. To download the files for the class to your local machine, perform these steps for each file.
1. Right-click your mouse on the file name below and select the option to "Save Target As" (using Windows Internet Explorer).
2. Navigate the “Save As” window to the local directory C:\temp and save the file there.
These files are currently at: http://www.itc.virginia.edu/research/splus/training/splus6
Once you have started S-PLUS, remember to attach C:\temp as a Chapter with label temp at Position 1 by using the menu selections File, Chapters, Attach/Create Chapter.
Next, we’ll re-read the bank.sdd data to make sure that we are all using identical data. Use the menu selections File, OPEN, and browse to C:\temp\bank.sdd. If you are prompted to overwrite the bank data set, click OK.
Next we’ll learn how to import data that is not in an S-PLUS data file. The following are some types of formats which can be read or imported into S-PLUS or into which you can save or export your S-PLUS data file:
· Microsoft Access (*.mdb) file.
· Many other formats.
1.1 Reading in a Microsoft Excel File
Here is the bank data in an Excel spreadsheet. This is the file that we will import into S-PLUS. For spreadsheet and tab-delimited files, S-PLUS can automatically read variable names contained within the first row of the data file.

Reading in an Excel file is straightforward. Under the menu choice FILE, choose Import Data then From File…

Browse to the location of the file, in this case, C:\temp.

Select the file bank.xls from C:\temp and click OK. S-PLUS will then display the following dialog box:

Remember that there already is a data set called bank. So give this new dataset a name other than bank, or the bank data will be overwritten. For our class, let’s all call this new dataset bank1 as shown above.
1.2 Reading in a Tab-Delimited Data File
Now we will import the file bank.dat into S-PLUS. This is how bank.dat appears in a text editor. Notice that the variable names are in the first row and that the values for each observation are separated by tabs.

To read in plain ASCII text data that is delimited by tabs (a common raw data format), go to the File menu, choose Import Data, then From File…, and browse to the file to be imported. S-PLUS recognizes that bank.dat is a file whose columns are separated by white space (spaces or tabs), and fills in the File Format automatically as shown below.

To keep from overwriting our other bank datasets, let’s call this one: bank3.
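The same kind of import can also be typed at the command line. The sketch below is only an illustration and is not necessarily what the Import Data dialog runs internally. It assumes the standard S function read.table (the code is written in base S syntax and should also run in R), builds a tiny made-up tab-delimited file in a temporary location, and reads it back with the first row treated as variable names. The ID column, the object name bank.demo, and all values are hypothetical.

```r
# Build a small tab-delimited file to stand in for bank.dat (values are made up).
tmp <- tempfile()
cat("ID\tSALARY\tGENDER\n",
    "1\t57000\t1\n",
    "2\t40200\t1\n",
    "3\t21450\t2\n",
    file = tmp, sep = "")

# header = T: the first row holds the variable names; sep = "\t": tab-delimited.
bank.demo <- read.table(tmp, header = T, sep = "\t")
print(bank.demo)

# Remove the temporary file when finished.
unlink(tmp)
```

To read the class file instead, the temporary file name would be replaced by the path to bank.dat, with each backslash doubled inside the quoted string.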
S-PLUS automatically keeps track of every command that you submit in what is called a history log. While you have not directly entered any commands up to this point, text versions of the commands you have submitted with the mouse are appended to the log as well.
You can save the history as a script file or include parts of it in a script file. This is a useful feature for either replicating procedures or as a reference for what you actually did (e.g. if you forgot the precise values you used to recode a variable).

To view the History log, select from the menu bar: Window, History and Display, then click OK on the dialog box as shown below.

Note: displaying in Reverse Order is useful when many commands have been submitted and you are interested in viewing the last 2 or 3 without scrolling through a very long list. When you are working with the Commands window, to repeat or save individual commands, select Window, History and Commands. This window displays only those commands that you have explicitly entered in the current S-PLUS session, while the History Log displays both mouse-generated commands and commands typed in the Commands window. An example of the Commands History window is below.

2. The Commands Window and Basic S-PLUS Syntax
S-PLUS has a Commands Window. The Commands Window is much like the Scripts Window which we will use later except that it is interactive. In conjunction with menu options it is useful in exploring data. The Commands Window can be invoked by clicking on the icon in the standard toolbar. If the window is in the background, it can be brought to the foreground by clicking on the icon twice or by selecting it from the Window choice on the Menu bar.

S-PLUS is a case-sensitive language. Case matters for both the S-PLUS commands and the variable names. Elementary commands are either expressions or assignments.
S-PLUS commands are separated by either a semicolon or a new line.
S-PLUS ignores most spaces; however, do not put spaces in the middle of numbers or names.
The action of storing a value in a variable in S-PLUS is called assignment. You should always put spaces around the two-character assignment operator, which consists of the less-than symbol followed by a dash (<-); otherwise, you may perform a comparison instead of an assignment.
Assignments remain in effect until they are removed or overwritten, so the value of a variable can be changed. All assigned variables are written to disk in the current working directory. S-PLUS saves data automatically, and when you restart, all data from previous sessions are ready for use. Therefore, if you do not want your data files overwritten, you must back up your data file(s) BEFORE you start S-PLUS or before you read them into S-PLUS. This is why we have you create the “temp” chapter at the start of our session today.
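The syntax rules above can be tried out with a few short commands. This is a minimal sketch in base S syntax (it should also run in R); the object names Salary, salary, x, y, and is.less are made up for illustration.

```r
# Names are case-sensitive: Salary and salary are two different objects.
Salary <- 41000
salary <- 26000

# Two commands on one line, separated by a semicolon.
x <- 5; y <- 10

# With spaces around <- this is an assignment: x now holds -3.
x <- -3

# With a space between < and - this is a comparison: "is y less than -3?"
is.less <- y < -3
print(is.less)
```

The last two commands differ only in where the spaces fall, which is why consistent spacing around the assignment operator is recommended.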
Here is an excerpt from a commands session:

Commands can perform functions and operate on the objects in S-PLUS. In the sequence of commands below, the first command copies the bank data to another data set called BANK2. The second command splits BANK2 into two new data sets: BANK2.1, consisting of the data on males, and BANK2.2, consisting of the data on females. The third command merges the males and females back together into a data set called SDF1. The fourth command sorts SDF1 by GENDER. The final command performs a t-test.
Copy each of the five commands into the Commands window and press Enter.
BANK2 <- bank
menuSplit(data = BANK2, split.col = "GENDER", replicate.cols = "<ALL>", result.type = "Separate", max.numeric.levels = 10, nbins = 6, show.p = T)
menuMergeDataFrame(x = BANK2.1, y = BANK2.2, method = "Column Names", all.x = T, all.y = T, suffixes.x = ".1", suffixes.y = ".2", show.p = T)
SDF1 <- gui.sort.col(target = SDF1, target.col.spec = list("<ALL>"), source = SDF1, source.col.spec = list("<ALL>"), sort.by.spec = list("GENDER"), descending = F)
menuTTest2(data = SDF1, x = SALARY, y = GENDER, groups.p = T, mu = 0, alternative = "two.sided", t.paired = "Two-sample t", var.equal = T, conf.level = 0.95, print.object.p = T)
Here are the results of the t-test. The output shows results we have seen before -- such as that the mean salary for women is about $26,000, while the mean salary for men is about $41,000.
Standard Two-Sample t-Test
data: x: SALARY with GENDER = 1 , and y: SALARY with GENDER = 2
t = 10.8123, df = 468, p-value = 0
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
12516.98 18077.23
sample estimates:
mean of x mean of y
41342.67 26045.56
The large value of t and the small p-value (reported as zero, indicating a low risk of being wrong in rejecting the null hypothesis) confirm our impressions from our look at the data in Part I and support our claim that there is a statistically significant difference between male and female salaries. But why is there a difference? What about those other factors?
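A two-sample t-test like the one above can also be run with the standard S function t.test rather than the menu-generated menuTTest2 call. The sketch below assumes a small made-up data frame named pay, which is not the workshop data, and uses the same coding as the bank data (GENDER 1 = male, 2 = female). It is written in base S syntax and should also run in R.

```r
# Hypothetical data frame using the workshop's coding: GENDER 1 = male, 2 = female.
pay <- data.frame(
    SALARY = c(41000, 39500, 45200, 38800, 43100,
               26000, 24800, 27500, 25100, 26900),
    GENDER = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2))

# Split salary by gender and run a two-sample t-test assuming equal variances.
male   <- pay$SALARY[pay$GENDER == 1]
female <- pay$SALARY[pay$GENDER == 2]
result <- t.test(male, female, alternative = "two.sided",
                 mu = 0, var.equal = T, conf.level = 0.95)
print(result)

# Individual pieces of the result can be extracted by name.
result$statistic   # the t value
result$p.value     # the p-value
result$conf.int    # confidence interval for the difference in means
```

Storing the result in an object makes it possible to reuse the t value, p-value, or confidence interval in later commands instead of copying them from the printed output.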
The graphics and plotting capabilities of S-PLUS are extremely rich and flexible. In this class we barely scratch the surface of their potential. Chapter 3 of the S-PLUS Users’ Guide for Windows is an excellent resource for learning how to use S-PLUS for one-, two- and multi-dimensional plots as well as trellis graphics.
We've already considered the role of education by grouping cases in cross tabulations. We could also produce a plot to compare individual cases. Let's compare two such plots -- one looking at the role of education (EDUC, not EDUC2), and a second looking at the role of tenure at the bank (JOBTIME) -- in order to compare the role of several factors, controlling for gender.
We’ll first sort the data, and remove the four observations that have missing gender. From the main menu, select Data, then Restructure, and then Sort… You should get a dialog box that looks like the one below. Make sure that the dataset listed in the From Data Set box is “bank” and not bank2 or bank3. In the Sort by Columns:, click on the down pointing triangle, and select GENDER. Then in the To box, make sure the Data Set is “bank” as well. If your Sort Columns window looks like the one below, click on OK to run it.

Again, from the main menu, choose Data, then Subset... We’ll retain the cases where GENDER is equal to either 1 or 2. Do this by making your Subset window look like the one below. In the Data Set box, select bank if it’s not already displayed. In the Columns in Subset: box, select <ALL>. In the Subset Rows with: box of the Subset Expression area, type in GENDER <=2. REMEMBER: upper and lowercase matter to S-PLUS; it won’t find the variable if you type in gender instead of GENDER. In the Results area, type in BANKG in the Save In: box. If your window looks like the one below, you can click on OK.
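The sort and subset steps can also be sketched with ordinary S subscripting. This is an illustration under stated assumptions, not the command the dialogs generate: the data frame bank.demo and its values are made up, and the code 9 is an assumed stand-in for a missing gender, since the actual missing-value code in the bank data is not given here. The code uses base S syntax and should also run in R.

```r
# Hypothetical stand-in for the bank data; 9 is an assumed code for a
# missing gender, and NA is a true missing value.
bank.demo <- data.frame(
    SALARY = c(30000, 52000, 27000, 41000, 35000),
    GENDER = c(2, 1, 9, 1, NA))

# Sort the rows by GENDER.
sorted <- bank.demo[order(bank.demo$GENDER), ]

# Keep only rows where GENDER is 1 or 2; !is.na() drops true missing values,
# which would otherwise come through as rows of NAs.
keep <- !is.na(sorted$GENDER) & sorted$GENDER <= 2
BANKG.demo <- sorted[keep, ]
print(BANKG.demo)
```

The condition inside the square brackets plays the same role as the expression typed into the Subset Rows with: box.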

Now from the main menu, choose Graph, and then 2D Plot… You should see a window that looks like the one below. In the Axes Type: window, Linear should be highlighted; in the Plot Type: window, Scatter Plot (x, y1, y2, …) should be highlighted; and the Graph Sheet: should say GS1. If so, click on OK.

This should make the window below appear. Using the down pointing triangle next to the variable select box, set the x Columns, y Columns, and z Columns as shown below. Make sure the Data Set: is set to BANKG.

Now click on Vary Symbols tab at the top of the window and set Vary Color By: box to show z Column and make sure all the rest of the settings are on their defaults as shown below.

Click OK to view the graph of SALARY by EDUC. It should look like this:

To add a legend to the graph, click on the autolegend button on the toolbar. This will put a legend in your graph. You can drag it to a location of your choice by clicking on it.

Then with the graph sheet as the active window (in focus), from the main menu, choose Insert, then Titles, and then Main… The text @AUTO appears in a text box at the top of your graph. In place of this, type: Salary by Education by Gender. Numerous features of graphs can be added to or modified by using the Insert and Format menus.
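A similar scatter plot can be sketched from the command line with the traditional S graphics functions plot and points. This is not the code behind the 2D Plot dialog. The values below are made up, and the two groups are distinguished by plotting symbol rather than by color as in the dialog. The code uses base S syntax and should also run in R.

```r
# Hypothetical values; GENDER 1 = male, 2 = female as in the workshop data.
EDUC   <- c(12, 16, 19, 8, 12, 15)
SALARY <- c(30000, 45000, 60000, 21000, 25000, 33000)
GENDER <- c(1, 1, 1, 2, 2, 2)
is.male <- GENDER == 1

# Set up empty axes with a main title, then add each group with its own symbol.
plot(EDUC, SALARY, type = "n", xlab = "EDUC", ylab = "SALARY",
     main = "Salary by Education by Gender")
points(EDUC[is.male], SALARY[is.male], pch = 1)    # males
points(EDUC[!is.male], SALARY[!is.male], pch = 2)  # females
```

Drawing the empty axes first ensures that the axis ranges cover both groups before either set of points is added.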
4. Using the Script Window – More Plots and Regression
Using the Script Window allows you to write programs to automate data importing and exporting, data modification, data analysis, and graph creation. The Script Window has a program pane for typing in the commands that make up the script. The Script Window also has an output pane that displays the output when the script is run.
When commands are typed into the Commands Window, they are evaluated immediately, while in the Scripts Window a long sequence of commands and related functions may be typed without any evaluation by S-PLUS. You could type a letter to your friend in the Scripts Window because what you type isn’t evaluated by S-PLUS as valid commands until you click on the Run button on the toolbar (a right pointing solid triangle).
We will open and run a script to create plots of SALARY by JOBTIME and SALARY by AGE. Script1.ssc automates a number of fairly complex actions.
From the main menu, open the file by clicking on File then Open... Browse to the C:\temp directory and select Script1.ssc. To run the script, click on the Run icon (a right pointing solid triangle) in the script menu bar. As the script runs, a number of graphics and data frame windows will be opened within S-PLUS. Wait until you see READY at the lower left of the screen before performing any new action.

The difference is apparent: In the first plot (looking at the relationship between education and salary), there is an apparent relationship across the entire sample, although we can see that few women are above $40K or 16 years of education. The second plot shows a less clear relationship between job tenure and salary -- and the few women that earn above $40K are not the ones with the longest job tenure. So, education and gender seem to be related to salary, but length of job tenure does not.

This third plot shows something very different: Older employees tend to earn less, with the highest salaries going to men in their 40s and the lowest going to women older than 50.
If you go to Windows Explorer (click on the Start bar, then choose Windows Explorer), you can browse to C:\temp directory in order to see the new files created by this script. There should be several new files:

S-PLUS can save graphics in many graphical formats. The script specifies the BMP format, which can be opened with Microsoft Paint.
The results for males in Report1.txt are:
Call:
lm(formula = BYGEN.1$SALARY ~ BYGEN.1$EDUC + BYGEN.1$JOBTIME + BYGEN.1$PREVEXP + BYGEN.1$AGE, na.action = na.exclude)
Coefficients:
 (Intercept) BYGEN.1$EDUC BYGEN.1$JOBTIME BYGEN.1$PREVEXP BYGEN.1$AGE
   -57548.93     4250.413        119.6168       -51.62697    747.8231
Degrees of freedom: 254 total; 249 residual
1 observations deleted due to missing values
Residual standard error: 14718.02
And for females in Report2.txt:
Call:
lm(formula = BYGEN.2$SALARY ~ BYGEN.2$EDUC + BYGEN.2$JOBTIME + BYGEN.2$PREVEXP + BYGEN.2$AGE, na.action = na.exclude)
Coefficients:
 (Intercept) BYGEN.2$EDUC BYGEN.2$JOBTIME BYGEN.2$PREVEXP BYGEN.2$AGE
    9332.781     1573.559        27.99023        1.962787   -111.1421
Degrees of freedom: 215 total; 210 residual
Residual standard error: 6313.819
Use the notes from Part I to verify these results and find out more on their interpretations.
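The regressions in the reports are fitted with lm. A minimal sketch of such a call is shown below, assuming a small made-up data frame named staff whose salaries are constructed to follow an exact linear rule, so that the fitted coefficients are easy to check. It passes the data frame through the data argument instead of writing BYGEN.1$ in front of every variable. This illustrates the function only and is not the contents of Script1.ssc. The code uses base S syntax and should also run in R.

```r
# Hypothetical data, constructed so that
# SALARY = 1000 + 2000 * EDUC + 100 * JOBTIME exactly.
staff <- data.frame(
    EDUC    = c(8, 12, 12, 15, 16, 19),
    JOBTIME = c(70, 80, 65, 90, 75, 85))
staff$SALARY <- 1000 + 2000 * staff$EDUC + 100 * staff$JOBTIME

# Passing data = lets the formula use bare column names instead of staff$EDUC.
fit <- lm(SALARY ~ EDUC + JOBTIME, data = staff, na.action = na.exclude)

# Extract the fitted intercept and slopes.
print(coef(fit))
```

Each slope is read as the change in SALARY associated with a one-unit change in that predictor while the other predictors are held fixed, which is how the EDUC, JOBTIME, PREVEXP, and AGE coefficients in the two reports are interpreted.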
5. Data File Documentation
It's best to leave a trail of comments and information behind you as you plunge through the data. Using variable labels and value labels, comments, file labels, and document commands all make the data analysis process that much smoother and easier to re-create or explain to someone else should the need arise.
Comments can be put anywhere in an S-PLUS file. They are highly recommended as a way to annotate your command program and output listing for future reference. A comment begins with a pound sign (#), either at the start of a line or partway through it, and ends at the end of that line; that is, comments are ended by a new line. If you want to enter a multi-line comment, you must put a pound sign (#) on each line of your comment. You can also use the pound sign to ‘comment out’ an S-PLUS command in your script: when you put a pound sign in front of a command, S-PLUS treats it as a comment, not a command, when you run your script.
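The three uses of the pound sign described above can be sketched as follows; the object name total is made up for illustration.

```r
# This whole line is a comment and is ignored when the script runs.
# A multi-line comment needs a pound sign
# at the start of every line.

total <- 2 + 3    # a comment can also follow a command on the same line

# The next command is "commented out", so total is not changed:
# total <- total * 100

print(total)
```

Removing the pound sign in front of the commented-out line would turn it back into a command the next time the script is run.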
S-PLUS includes a number of function libraries that extend basic functionality or provide instructive examples of S-PLUS programming. Some of these libraries, such as the cluster library and the GUI library, are loaded (or attached) automatically when you start S-PLUS. You can attach other libraries as needed using the Load Library dialog.
The Hmisc library contains many functions useful for data analysis, high-level graphics, and utility operations, as well as functions for computing sample size and power, translating SAS datasets into S-PLUS, imputing missing values, advanced table making, variable clustering, character string manipulation, recoding variables, and bootstrap repeated measures analysis. The Hmisc library was written by Dr. Frank Harrell, a professor here at U.Va., and can also be found at its official website: http://hesweb1.med.virginia.edu/biostat/s/Hmisc.html
To load an S-PLUS library:
From the main menu, choose File and then Load Library… In the Load Library window, highlight the MASS library and click on OK.
Next, open a Commands window and issue the following commands:
x <- c(2.1,3.4)
fractions(x, cycles = 10, max.denominator = 2000)
BANKNEW <- insert.col(target = BANKNEW, target.column = list("@END"), count = 1, column.type = "factor", column.names = list("GENDER1"), fill.expression = GENDER)
attach( BANKNEW)
plot(GENDER1, SALARY, main ="Salary by Gender")
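The idea behind the GENDER1 column, a factor version of the numeric GENDER codes, can also be sketched with the standard S function factor. This is an alternative illustration rather than what insert.col does internally. The vectors and values below are made up, and the labels follow the coding seen in the t-test output (1 = male, 2 = female). The code uses base S syntax and should also run in R.

```r
# Hypothetical values; in the workshop data GENDER 1 = male and 2 = female.
GENDER <- c(1, 2, 2, 1, 2)
SALARY <- c(41000, 26000, 25500, 43000, 27000)

# Convert the numeric codes to a factor with descriptive labels.
GENDER1 <- factor(GENDER, levels = c(1, 2), labels = c("Male", "Female"))

# Mean salary within each level of the factor.
group.means <- tapply(SALARY, GENDER1, mean)
print(group.means)
```

Because GENDER1 is a factor rather than a number, functions such as tapply and plot treat it as a set of groups instead of a numeric scale.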

7. Getting Help
Statistical Consulting hours: For the hours when a statistical computing consultant is available, please contact the Research Computing Support Center by telephoning 243-8800 or e-mailing res-consult@virginia.edu
Helpful Web Pages:
Arithmetic: +, -, *, /, ^, %/%, %%
Comparison: >, <, >=, <=, ==, !=, compare
Logical: !, &, |
Some Sample S-PLUS Commands
Each entry below shows the command, then any output it produces, then an indented description.

objects()
    Returns the names of all variables currently being used in the Data directory.

rm(foo, bar)
    Removes the objects named foo and bar.

x<-c(2.0, 0, 4.1)
y<-c(2, 10, 7, 5, 9)
    Assignment statements using the function c( ). This function concatenates an ordered sequence of vector arguments.

x
[1] 2.0 0.0 4.1
    Display x.

x[1]
[1] 2
    Display the first element of x.

2*x
[1] 4.0 0.0 8.2
    Multiply x by 2.

4.7/x
[1] 2.350000 Inf 1.146341
    Divide 4.7 by x.

mean(x)
[1] 2.033333
    Mean of the x values.

min(x)
[1] 0
    Minimum of the x values.

max(y)
[1] 10
    Maximum of the y values.

sum(y)
[1] 33
    Sum of the y values.

sort(x)
[1] 0.0 2.0 4.1
    Sort the x values.

length(x)
[1] 3
    Length of the x vector.

prod(x)
[1] 0
    Product of the x values.

median(x)
[1] 2
    Median of the x values.

1:5
[1] 1 2 3 4 5
    Sequence from 1 to 5.

x2<-rep(x,times=4)
[1] 2.0 0.0 4.1 2.0 0.0 4.1 2.0 0.0 4.1 2.0 0.0 4.1
    Repeat x four times and store the result in x2 (the values shown are the contents of x2).

seq(-1,1, by=0.25)
[1] -1.00 -0.75 -0.50 -0.25 0.00 0.25 0.50 0.75 1.00
    General sequence: seq(start, end, increment size).

help()
    Start the help facility.

sink("c:\\temp\\my.lis")
    Route output to the file my.lis in C:\temp.

sink( )
    Output is sent to the window again.

date( )
[1] "Sun Sep 22 10:32:34 EDT 2002"
    Display the current date and time.
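
# Trellis density plots of SALARY for each gender group
p2 <- densityplot(~BYGEN.2$SALARY,data=BYGEN.2)
p1 <- densityplot(~BYGEN.1$SALARY,data=BYGEN.1)
# Print both plots on one page: p1 in the first cell and p2 in the second cell of a 1-column by 2-row layout
print(p1,split=c(1,1,1,2), more=T)
print(p2,split=c(1,2,1,2))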