**********************************************************
**  This document is out of date and has been retained  **
**  because of the examples given for optimization of   **
**  fortran programsthat use long arrays.               **
**                                                      **
**  All references to PSC (Pittsburgh Supercomputing    **
**  Center) WWW pages are likely not to be functional.  **
**********************************************************
TITLE: Introduction to Vectorization DOCUMENT NO: U-028 DATE: March 26, 1996 VERSION NO: 7 INTRODUCTION This document provides information on the use of computers that attain rapid floating point calculations through their vector archi- tecture. The Pittsburgh Supercomputing Center (PSC), of which the University of Virginia is an Academic Affiliate, offers computing resources on the Cray C90 vector machine, the Cray T3D massively parallel processing system, and two workstation clusters called Super- Clusters. This document focuses on the use of the Cray C90 and the front-end Ultrix machines at the PSC. If you have questions beyond the scope of this document, please contact the Unix consultant at the Information Technology and Communication (ITC) Help Desk at 924-3731 for referral or send electronic mail to consult@virginia.edu. You may also access the PSC World Wide Web (WWW) Home Page at http://www.psc.edu/ ADDITIONAL DOCUMENTATION The following local documents may be helpful to you in using Unix machines and their screen editors: TIB-167 "TCP/IP - A Primer of User Commands" U-002 "Introduction to the Unix Operating System" U-003 "JOVE Text Editor on the RS/6000 or Sun Computer" U-004 "Vi Text Editor" These are available from Information Technology and Communication (ITC) at the Help Desk in Wilson Hall. GETTING STARTED AT THE PSC To get a grant on a PSC computer and a corresponding initial ac- count, you must be a faculty member or postdoctoral affiliate studying a problem which can demonstrate the need for vectorized or parallel- ized computation. You can pick up an application from Chris Roberts in ITC-West (982-5201). Successful completion of the Grant applica- 2 tion process results in a 10 Service Unit (SU) ``Starter Grant'' and an account for the Principle Investigator (PI). Also included in the information received from PSC will be the ``PSC Users Guide'' referred to elsewhere in this document. Expect processing time of just under a month for an application for a ``Starter Grant.'' The ``Starter Grant'' is to be used to optimize your code for the particular machine and determine memory and time requirements. The grant can then be expanded to a ``Production Grant'' which will be from 11 to 99 SUs. With this grant, you are expected to run the actu- al experiment. Although a faculty member or postdoctoral affiliate must be the Principle Investigator (PI) for the grant, graduate students may have an account which uses that grant time. PITTSBURGH SUPERCOMPUTING CENTER ITC provides high-speed access through the Internet (TCP/IP) to the Pittsburgh Supercomputing Center, one of four centers funded by NSF for engineering and scientific research. The centerpiece at the PSC is a Cray C90, with two J90's at its side. Both interactive and batch access are available on the C90 and J90's. They differ in al- lowing 10 CPU-minute and, respectively, 16 Megaword and 4 Megaword memory limits on interactive jobs. Massively parallel computing can be done on the Cray T3D which consists of 512 DEC Alpha CPUs (Process- ing Elements). Two Ultrix machines serve as front-ends to the C90, all using the Andrew File System (AFS) for storage. Users can also submit batch jobs to the C90 through two clustered VMS front-ends, although these do not use AFS. The T3D is tightly coupled to the C90 and jobs are submitted to it much the same way as they are to the C90. As an Affiliate of the PSC, ITC offers basic consulting and help with information included in grant proposals to users. Grants of computer time are available for research and for class use and instructional purposes. Principal Investigators must be faculty or staff at their institutions. To receive a grant of comput- er time for research, the Principal Investigator must submit a propo- sal to the PSC along with a ``Request for Advanced Computing Resources at the PSC'' form. For class accounts, it is assumed that the pro- jects are numerically-intensive or are studies of computer architec- ture that require the resources of a supercomputer. For a class ac- count, the instructor must submit the request form, a course syllabus, his/her curriculum vitae, and a roster of enrolled students. Forms for grant applications and additional information on the use of the Cray may be obtained from ITC, or through the World Wide Web at the URL http://www.psc.edu/grants/applicationform/form.html. There is a SuperCluster of DECstations and Dec Alphas which are for applications best suited to a single high-performance scalar machine or to a loosely coupled set of such machines. The SuperClus- ters are capable of executing a variety of single-threaded codes and are available in a parallel mode with the Parallel Virtual Machine (PVM) software library. A summary of commands common to the machines at the PSC is pro- vided at the end of this document. Also included is a list of URLs through which you can access PSC documentation on the World Wide Web. 3 NETWORK ACCESS Connections to the PSC machines are through the Internet. The command telnet establishes an interactive connection, and the command ftp is used for file transfer. To access a machine, you need to know its IP address. The chart below lists addresses for the machines cited in this document. Machine Internet Address VMS front-ends cpwsca.psc.edu cpwscb.psc.edu Ultrix front-ends pscuxa.psc.edu pscuxb.psc.edu Alpha Cluster front-ends axpfea.psc.edu axpfeb.psc.edu File ARchiver far.psc.edu (ftp only) Cray C90 mario.psc.edu Occasionally, you may need to know the numeric address for a machine, perhaps with X Windows applications. To obtain this number, type the command nslookup hostname on a local Unix machine and the ad- dress will be returned. For example, the address for the Cray C90 is 128.182.83.3. These numbers can change, so it is best to use the machine name where possible. For more detailed information on telnet and ftp, refer to TIB- 167, "TCP/IP - A Primer of User Commands." The syntax of the commands is given below. The address refers to one of those listed in the chart above. TCP/IP command Function telnet address establishes an interactive session ftp address establishes a connection for file transfer binary mode to transfer a non-ascii file or to use wildcards put file_old file_new transfers file_old, renaming it file_new get file_old file_new retrieves file_old, renaming it file_new mput or mget used with wildcards * or ?, requiring user verification CHANGING PASSWORDS Your password, along with your user ID, is your key to accessing a computer. Please do not publicize your password or allow others to use your accounts. This is extremely important. Please refer to the Introduction in the PSC User's Guide for their specific policy on com- puter abuse and passwords. AFS and Ultrix The C90, Ultrix and SuperCluster machines use the Andrew File System. Your login directory is accessible to each machine through 4 AFS, and your password is the same throughout. Your password may be changed at any time by typing the command kpasswd. Prompts will ap- pear to ask for your current password, the new password, and a verifi- cation. The change takes effect the next time you log in. AFS generates a token when you log in. This token is valid for a little under three days. You can get the expiration time by typing the command tokens. If you run a program and the token expires, your results are lost. To prevent loss of results, give the command klog to revalidate the token before submitting a program. Detailed information on AFS is available on the World Wide Web through the URL: http://www.psc.edu/general/filesys/afs/afs.html PSC backs up your AFS directory periodically. In your home directory, OldFiles is a copy of your account as of the last backup. In many instances you can avoid the need to request file restores by copying a file from the OldFiles directory. Vax (VMS) The first time you log in to the VMS front-ends, you must change your Vax password to an eight character (or longer) string of your choice. Failure to change your password will prevent you from using the system a second time. In addition to this initial change, your Vax password must be modified every sixty days. Several days before the expiration date, you will be notified when you log in. To change your password, type the command set password and you will be prompt- ed for input to verify your change. You may change your password at any time using this command. The new password must be different from your previous 100 passwords. Cray (UNICOS) The first time you use the C90 or J90, you must change your Cray password to an eight (or more) character string of your choice. As for the Vax, failure to do so will prevent you from using the computer a second time. Any Cray password requires at least one numeric or spe- cial character. There is no expiration period. Your password may be changed at any time by issuing the command passwd. There will be prompts to assist you. A new password must differ from an old pass- word by at least three characters. THE CRAY C90 Machine Specifications The Cray C90 at the PSC is a sixteen-processor machine with 512 million 64-bit words of main memory (2 Gigabytes). There is a 512- million word solid-state storage device (SSD) which is currently used as a fast cache to memory; user-control of it is not allowed. The machine can process up to 64 vector results per clock cycle of 4 nanoseconds for a peak performance of 16 Gflops (10**9 floating-point operations per second). This figure results from the dual vector pipe architecture and two functional units: 4 operations/clockcycle x 16 CPU's. 5 The Crays run under UNICOS with two Vax (VMS) and two DEC-Station 5000 (Ultrix) computers that act as front-ends. Batch jobs can be submitted from any of these machines to the C90 or J90. It is also possible to login directly and run interactive jobs, although you are charged for connect time and there is an interactive memory limit of, respectively, 16MW and 8MW, and an interactive CPU limit of 10 minutes. Mass storage uses the Cray Data Migration Facility (DMF) running on far.psc.edu, a Cray Y-MP/EL running UNICOS. Files saved on far.psc.edu are initially stored on disk, then migrated to tape by DMF as needed. Users connect through the Internet using TCP/IP telnet to the front-ends and the C90, and ftp to these machines and the File AR- chive (FAR). The commands rsh and rcp are accepted by the Cray from local Unix machines, provided the appropriate .rhosts files have been created on both machines (see p.8 for an example). The Andrew File System (AFS) is accessible from the Cray through the environment vari- ables $AFS and $AFSTMP. The Cray runs vectorizing FORTRAN 77, FORTRAN 90, C, and C++. Numerous software packages are available. These may be broadly categorized as graphics, applications (engineering and chemistry), mathematical, and statistical. The system supports multitasking (mac- ro), microtasking, and autotasking. Detailed information is available through the URL: http://www.psc.edu/machines/cray/c90/c90desc.html The Environment Users of the Pittsburgh Supercomputing Center receive accounts on the front-end machines as well as the Cray C90, J90, and, if applica- ble, the T3D. The Vax running VMS as an operating system offer the editors Emacs (full-screen) and EDT (line mode or full-screen). The VAXTPU has a full-screen interface, EVE, which requires a VT100, VT200, or compatible terminal. Emacs is also available on the Vax. The Cray and the other front-ends running versions of Unix have Vi and Emacs as editors. Note in the charging algorithm below that users are charged for time used in interactive connection to the Cray. It is more cost-effective to transfer a file to a local machine or a front- end to make major changes, than to edit that file on the Cray. The PSC also charges for disk use on the Cray. Files may be stored, without charge, on the front-ends and in FAR. Examples of the basic commands for storage and retrieval of files on each of these machines are given in the sample batch files on the next page. More detailed information is also available in the User's Manual issued by the PSC or in the on-line man pages. The UNICOS charging algorithm for Service Units (SU) is subject to change to assure the best use of the machine. The current algo- rithm for non-dedicated work is available through the URL: http://www.psc.edu/machines/cray/c90/policies/charges/nonded.html As of the date of this document the algorithm is as follows: SU = 0.75xCPU + 0.015625xMem + 0.01xCon SU = service units used CPU = cpu time in hours (user + system) 6 Mem = memory*time in Megaword hours Con = hours of interactive connect time Users may arrange to have all sixteen processors and all avail- able memory (about 244 MW) of the C90 dedicated to one of their appli- cations. Typically, a user might want a dedicated machine for accu- rate tests of a new parallel algorithm under the multitasking capabil- ity of UNICOS. The service unit charge for dedicated use of the C90 is 12 SU's per wall clock hour. Currently, dedicated time is avail- able once per week as specified in the document with the URL: http://www.psc.edu/machines/cray/c90/policies/downtime/dedicatedqueue.html Dedicated time must be arranged by sending electronic mail to remarks@psc.edu or by calling the hotline (1-800-221-1641). The Cray runs UNICOS with the C shell as the default login shell. If you prefer to use the Korn shell or Bourne shell, you may use the chsh command to change to /bin/ksh or /bin/sh, respectively. Compiling and Running Your Code It is possible to use the Cray interactively, transferring files, compiling, and running jobs. Since users are charged for connect time to the Cray, use of batch jobs is preferable. The sections of this do- cument that focus on submitting jobs assume batch mode. You may submit jobs from either the Vax or Ultrix front-ends, or directly on the Cray, and you may store and retrieve files on the front-ends or through DMF on the Cray. Jobs may be queued for submis- sion from the front-ends at any time, even during Cray test and maintenance times, in which case the programs are run as soon as the Cray becomes available. Mass Storage Archival storage for the C90 is available through a mass storage system using Cray's Data Migration Facility. DMF runs on the PSC machine named far.psc.edu. This machine is strictly a file archive where files are initially stored on disk, then migrated to tape by DMF. There are two ways to access archived files. The first is through the File ARchiver (FAR) program, an interface to far.psc.edu, which runs on the C90. The second is through the commonly used file transfer program FTP. The FAR interface provides both a single-line mode, used in batch job files, and a command mode, suitable for interactive work. Many FAR commands are standard Unix commands. Further information on FAR and a list of commands is given in the AFS document accessable through the URL: /http://www.psc.edu/general/filesys/far/far.html Submitting Jobs 1. Submitting on the Cray Your program may be submitted and run on the Cray in batch mode using the Network Queuing System (NQS). Jobs are queued to be run by 7 submitting a Unix script file, which contains the commands you would have entered interactively. You submit a file named file.nqs similar to the one below, by typing qsub file.nqs in response to the mario% prompt. Note that, although the job may be submitted from your small login directory, we have changed to the large temporary directory, $TMP, before any commands related to compiling or running the job occur. All work takes place in this temporary space. The Cray produces files named file.e##### and file.o#####, where ##### is the assigned job identifier number. They contain, respectively, the error messages, if any, and the results normally written to the stan- dard output device. These files will appear in the directory from which the job was submitted. They may be combined into one file using the -eo option cited below. In the following example, the files test1.f and test1.inp are retrieved from FAR, while test1.res is stored there. The file below is called test1.nqs. # USER= cray_i.d. PW= cray_pw# If job is submitted from a front-end # QSUB -r testprb # Give name testprb to report file # QSUB -s /bin/sh # Specify Bourne shell, if desired # QSUB -eo # Combine reports into one file # QSUB -lT 10 # Mandatory qsub options specifying Memory # QSUB -lM 3mw # and CPU-time limits for the entire job In the sample below, the files have the following roles re- lating to the job: test1.f program to be compiled and run test1.inp input data file, as stored in FAR infile input data file, as retrieved from FAR test1.nqs batch-submit file listed above outfile redirected results as printed by the program test1.res results, as stored in FAR testprb.o##### contains the job accounting report, any error messages, and results written by the program if not sent to an output file # To compile and run test1.f set -x # Echo back commands in Bourne shell ja my_acctfile # Turn on usage monitoring and name # # accounting file, if desired cd $TMP # Move to working directory far get test1.f test1.f # Retrieve files test1.f and far get test1.inp infile # test1.inp from FAR. You can # # also retrieve from other locations. cf77 -o test1 test1.f Compile the Fortran program test1 < infile > outfile # Run compiled program with input # # and output files far store outfile test1.res # Save results in FAR or # # other locations. # Follow-up rm *.* # Clean up $TMP ja -chlst my_acctfile # End accounting; generate reports 8 Users can obtain a listing of the program with vectorization messages by issuing the following command to the compiler: cf77 -o test1 -Wf"-em -l test1.list" test1.f <CR> where test1 is the executable version, and the list file produced is named test1.list. Any options used with the cft77 compiler may be specified for the cf77 compilation using -Wf"options_list" extension. In the above example, the option Wf"-em -l" enables loopmarks for vectorization analysis (m) and a named listing file (l) to hold marked code. Other options are listed in the man page for cf77. The -Zv option to cf77 gives enhanced vectoriza- tion. Job status may be checked on the C90 using the command qstat. The following are typical examples of its use: qstat -a shows information on job status and provides job number ##### qstat -f ##### shows more detailed information qdel -k #####.mario kills the job number ##### 2. Submitting from Ultrix pscsub is part of a collection of commands available on the Ultrix front-ends that allows users to submit jobs directly to the NQS batch queue on the Cray C90. pscsub copies jobs from the Ultrix front-ends to the Cray, runs them through NQS as though they were submitted directly from the Cray, and then re- turns the results to the Ultrix user. Thus, the user need not make any changes to existing Cray job files. Commands called pscfetch and pscdispose allow data files and programs to be transferred between the Ultrix front-ends and the Cray. Users can access AFS from the Cray as well, so a file could be copied with a command similar to cp $AFS/file_name new_name. Two additional job control commands exist from the Ul- trix front-ends: pscstat emulates the UNICOS qstat command, and pscdel, which emulates the UNICOS qdel command. Because pscsub is based on the remote commands, one must create .rhosts files on both the Ultrix front-ends and the Cray C90. The following chart shows the lines needed in the respec- tive files. Contents of .rhosts files on the Cray pscuxa.psc.edu login_id pscuxb.psc.edu login_id on Ultrix mario.psc.edu login_id In addition, allow your Ultrix home directory to be readable by all with the command: fs sa ~ system: anyuser rl. Warning: The Unix remote commands can fail during a remote login if the remote .cshrc or .kshrc file contains commands that try to print to the terminal, such as an echo statement, which will cause the remote login to fail, and hence pscsub will fail, 9 with no helpful error message. Be sure to remove such output statements from your .cshrc or .kshrc files, or be sure that they are not executed in noninteractive logins. 3. Submitting from UVa Hosts Since the remote commands rsh and rcp are currently accepted by the Cray from other Unix hosts, they may be used from local computers. You are reminded that there are possible security loopholes in using these commands. Telnet and ftp are more secure. We provide the following information because, in the ab- sence of security considerations, use of the remote commands can be more efficient. You are spared the time required for telnet logins, and you can use local computers to store your programs and results. For interactive work, we still recommend that you use a telnet connection to the Cray. This section focuses on the use of the remote shell commands in the context of submitting jobs. First, ensure that the commands will work by inserting the necessary lines in your .rhosts files on the Cray and the local machine. As shown in the previous section, you need a line in this file on the Cray that identifies the local machine and your local login_id. Similarly, on the local machine, there must be a line in its .rhosts file identifying the Cray with your Cray login_id. You can increase security by only giving yourself read and write permissions to your .rhosts file. The following examples illustrate the syntax for issuing a command from a local host to the Cray. They are used below in the context of a batch job submitted from the local machine. Note that the Cray requires the use of the equivalent remsh com- mand. rsh mario.psc.edu -l cray_id command remsh local.domain.virginia.edu -l local_id command The file rcp.nqs can be remotely copied to the Cray from a local host, and then submitted remotely to the batch queue on the Cray with this sequence of commands: rcp rcp.nqs cray_id@mario.psc.edu:/usr/users/#/cray_id rsh mario.psc.edu -l cray_id qsub rcp.nqs This is the file rcp.nqs: # QSUB -r rcpprb # Give name rcpprb to report file # QSUB -s /bin/sh # Specify Bourne shell, if desired # QSUB -eo # Combine reports into one file 10 # QSUB -lT 10 # Set 10 second time limit on the job # QSUB -lM 3mw # Set 3 Megaword limit on memory required # To compile and run test1.f set -x # Echo back commands in Bourne shell cd $TMP # Move to working directory ja # Turn on usage monitoring # # accounting file, if desired rcp local_id@local.domain.virginia.edu:/home/local_id/test1.f . # Copy from local cf77 test1.f # Compile a.out > test1.res # Run compiled program rcp test1.res local_id@local.domain.virginia.edu:/home/local_id # Copy to local # Follow-up rm *.* # Clean up $TMP ja -st # End accounting; generate reports Timing Your Code You can time blocks of code on the Cray with the second com- mand, which reports the elapsed CPU time (in floating-point seconds) since the start of the job. The function call does not require an argument. It returns the (real) number of CPU seconds elapsed since the job began. The second command is used as a subroutine call in the example given in the section "Computing Megaflop Rates and Speedups." If your job is swapped out of core for any reason, the timing function stops counting. Second may be used as shown in the following example: before = second() . . code to be timed . . after = second() time = after - before The rtc timing command is useful for timing short segments of code and can be more accurate than second. Rtc returns the number of machine cycles where each cycle is 4.0 nanoseconds (ns) on the Cray C90. Unlike second, rtc "keeps on ticking" if your job is swapped out of core, so using it to time longer sections of code can be deceptive. Use rtc in the following way: 11 ticks = 4.0E-9 before = rtc()*ticks . . code to be timed . . after = rtc()*ticks time = after - before COMPUTING MEGAFLOP RATES AND SPEEDUPS One megaflop stands for one million floating-point opera- tions per second and is used to measure the rate of computation attained by a given computer. It has become the standard of measure for vector processors, since they are able to do opera- tions of this order of magnitude. Since mathematical methods depend on real-valued functions, and the algorithms are pro- grammed in floating point operations, the sum of these operations is invariant for a particular simulation. The quantity that varies, from machine to machine, or between levels of optimiza- tion, is the total execution time for these operations. Most simply, the calculation of this metric is made using a known quantity, the number of vector operations performed in the section of timed code, and an observed quantity, the time, as follows: Mflops = (operations) / (time x 1000000) When a program is run in vector mode, the amount of CPU time required should decrease from that used to run the program in scalar mode. This speedup is one feature that makes vectoriza- tion and code optimization worthwhile. The simplest way to meas- ure this factor is by the following ratio: speedup = (time-in-scalar-mode) / (time-in-vector-mode) To obtain the time-in-scalar-mode, you will need to compile and run the code without the vectorization options. On the Cray C90, use the command cft77 -o novector test1.f, or cf77 -Wf"-o novector" test1.f. A more accurate example of computing the megaflop rate is presented below. It allows for several levels of computation, with the number of operations calculated as a function of the DO loop indices. A correction factor is used to obtain a more accu- rate figure that does not include overhead for calls to the ti- mer. Assume a section of code is to be repeated nreps times and within that loop, the code takes another loop through nloops iterations, where nrops operations are performed in the section of code to be timed. The clock overhead is calculated as tover and is used in the computation of the megaflop rate as a correc- tion factor for the call to the subroutine. The scale factor converts time to microseconds. The final megaflop rate is con- 12 tained in the variable rmflops. Note that here, second is used as a subroutine through which the time is passed as a parameter. scale = 1000000.0 do 30 i = 1, nreps call second(before) do 20 k = 1, nloops . code to be timed . 20 continue call second(after) timetot = timetot + after - before 30 continue do 40 i = 1, 100000 call second(before) call second(after) tover = tover + after - before 40 continue tover = tover/100000.0 rtime = (timetot - nreps*tover) * scale mops = nrops * nloops * nreps rmflops = float(mops) / rtime CODE OPTIMIZATION ON VECTOR PROCESSORS Vector compilers allow users to increase greatly the execu- tion speed of their programs. Vector processors can handle large amounts of data rapidly when the data is organized in subscripted arrays. By using vector registers, the computer can do the same operation on each element of the input vectors simultaneously. On the Cray, these registers have 64 elements; a program operat- ing in vector mode (the default) can do 64 operations nearly con- currently, whereas in scalar mode, only one operation at a time is performed. The goal in code optimization on a vector computer is to have as many computations as possible take place in vector mode, i.e., within vectorized DO loops. In the following subsections, we consider coding practices that can degrade performance. Topics include vectorization inhi- bitors, recurrence, order dependency, and memory bank conflicts. Examples illustrate ways to circumvent the problems created by these programming structures. A vectorizing compiler transforms a sequence of operations in an iterative loop, or DO loop, from scalar to vector form. A vector is an ordered list of scalar values stored in an array. Some basic guidelines for writing code that will vectorize are as follows: vector operations must occur within a DO loop, must act on a vector, two vectors, or a vector and a scalar, and must, in most instances, produce a vector of results. A compiler will vectorize only when it can determine that the meaning of the operation will not change. The DO loop should do a "fair" amount of work over "many" steps, should have an assignment statement with an array reference on the left, and should not contain any vectorization inhibitors. A vectorization inhibitor is a programming construction that prevents the compiler from producing vectorized object code. 13 Remember, vectorization only happens to DO loops. For the most part, it only happens to the innermost DO loop. Some things that keep a loop from vectorizing are under the control of the pro- grammer, and rewriting code may fix the problem and improve speed. The following are typical vectorization inhibitors with most compilers: Character assignments I/O statements IF statements Subroutine calls Branches into a loop from outside Multiple exits or backward within a loop Recursive expressions Too small an iteration count ( < 10 ) External function references Computed GOTO: GOTO (820,830), K(I) Conditional branch: IF (D(I)) 720,730 By checking the listing produced when the program is com- piled, you may see what loops have been vectorized, since they will be marked similarly to the following: V---< DO 400 J=1,L(N) V C(J)=A(J)+B(J) V---> 400 CONTINUE The Cray examines the innermost DO loop level for inhibi- tors. Since the compiler upgrade in January 1992 it has become possible to vectorize at an outer level. This is a special case, in which the inner nested loops are unrolled by the compiler, as discussed in a later section. Coding Around Vectorization Inhibitors In this first example, there are two IF statements within the innermost DO loop. Analysis of the code shows that there are really three levels of computation. One computation is required for K = 2, two computations are performed for 3 < K < N, and three computations are made when K = N. Original Code DO 120 K=2,N DO 110 J = 2,3 DO 100 I = 2,N A(I,J) = (1.0-PX) * B(I,J,K) IF ( K .LT. 3 ) GO TO 11 IF ( K .LT. N ) GO TO 10 B(I,J,K) = A(I,J) 10 B(I,J,K-1) = C(I,J) 11 C(I,J) = A(I,J) 100 CONTINUE 110 CONTINUE 120 CONTINUE Restructured Code DO 160 K = 2,N-1 DO 130 J = 2,3 14 DO 130 I = 2,N A(I,J) = (1.0-PX) * B(I,J,K) 130 CONTINUE IF ( K .EQ. 2 ) THEN DO 140 J = 2,3 DO 140 I = 2,N C(I,J) = A(I,J) 140 CONTINUE ELSE DO 150 J = 2,3 DO 150 I = 2,N B(I,J,K-1) = C(I,J) C(I,J) = A(I,J) 150 CONTINUE ENDIF 160 CONTINUE K = N DO 170 J = 2,3 DO 170 I = 2,N A(I,J) = (1.0-PX) * B(I,J,K) B(I,J,K) = A(I,J) B(I,J,K-1) = C(I,J) C(I,J) = A(I,J) 170 CONTINUE The following example is a trivial case of the compiler's inability to recognize vectorization. The simple correction al- lows the loop to vectorize. J = 1 DO 20 I=1,N DO 10 I=1,N D(2*I) = 1.0 J = J * 2 20 CONTINUE D(J) = 1.0 10 CONTINUE may not vectorize does vectorize The next example illustrates several changes that enable more vectorization to occur. The subroutine call in the original code prevents vectorization in the DO 10 loop, because the com- piler has no way of knowing what will occur once it is in the subroutine. The simplest change, since this is a trivial subrou- tine, is to expand in-line. The next change splits apart the subroutine call but allows the computation of A(I) to be vector- ized. The third solution computes the entire array A(I) at the subroutine level. Here both the DO 10 and the DO 100 loops vec- torize. Original (Unvectorized) Code DO 10 I = 1,1000 R(I) = (X(I)**2 + Y(I)**2) ** 0.5 CALL SAREA (R(I),A(I)) 10 CONTINUE SUBROUTINE SAREA (R,A) A = 3.14159 * R ** 2 RETURN 15 Solution 1: Expand in line DO 10 I = 1,1000 R(I) = (X(I)**2 + Y(I)**2) ** 0.5 A(I) = 3.14159 * R(I) ** 2 10 CONTINUE Solution 2: Split off CALL DO 10 I = 1,1000 10 R(I) = (X(I)**2 + Y(I)**2) ** 0.5 DO 20 I = 1,1000 20 CALL SAREA (R(I),A(I)) SUBROUTINE SAREA (R,A) A = 3.14159 * R ** 2 RETURN Best Solution: Vectorize within subroutine DO 10 I = 1,1000 10 R(I) = (X(I)**2 + Y(I)**2) ** 0.5 CALL VAREA (1000,R,A) SUBROUTINE VAREA (N,R,A) DIMENSION R(N), A(N) DO 100 I = 1,N 100 A(I) = 3.14159 * R(I) ** 2 RETURN Recurrence in Vectorization A recurrence has been formed whenever there is an expression within a loop whose computation requires the value of the expres- sion from a previous iteration. The following code segment il- lustrates a recurrence in the array B: DO 10 I = 0,100 10 B(I) = I A = 35.8 C = SIN(.89456) DO 20 I = 1,100 A = A + 1 B(I) = B(I-1) * 2.0 T = C C = T + 1 20 CONTINUE Why does this matter? We have initialized B(I) = 1, 2, 3, 4, 5, 6, . . . , and can trace the following results: Scalar Mode Vector Mode 2.0 <- 1.0 x 2.0 2.0 <- 1.0 x 2.0 4.0 <- 2.0 x 2.0 4.0 <- 2.0 x 2.0 8.0 <- 4.0 x 2.0 6.0 <- 3.0 x 2.0 16 16.0 <- 8.0 x 2.0 8.0 <- 4.0 x 2.0 The reason for the difference is that in Scalar Mode, the statements are executed sequentially in each trip through the loop: A, B, T, and C computations. In Vector Mode, all 100 A cal- culations are done, before the 100 B calculations, before the 100 T calculations, and finally the 100 C calculations. The following two examples show that some recurrences vec- torize: DO 10 I = 1,100 DO 20 I = 65,1000 R = R + A(I) A(I) = A(I-64) + 1.0 10 CONTINUE 20 CONTINUE Vector Reduction Large Threshold The threshold is the number of iterations before a value is reused. The magnitude of "large" is machine dependent. On the C90, 64 is the key since that is the vector register's length. Several techniques may be used to increase the speed of cal- culations that otherwise would be hampered by recursion. In the next example, the part of the computation that will vectorize is split away from the portion of the loop in which the recursion occurs. Some compilers allow this partial vectorization. Original Code DO 10 I = 1,100 10 Y(I+1) = Y(I) + W(I)*X(I)**3 Modified Code DO 10 I = 1,100 10 TEMP(I) = W(I)*X(I)**3 DO 20 I = 1,100 20 Y(I+1) = Y(I) + TEMP(I) In the example below, the loop indices are switched to avoid the recursion in the reference to Z(I,J) on the index I, but a temporary vector array VDIFF needs to be used to carry the differences that would have otherwise been calculated one-per- iteration. In the original code, the inner loop is processing a single column of the array Z, with each element feeding directly into the next loop iteration. In the modified code with the loops switched, the inner loop on J is generating the Ith row of Z from the (I-1)st row. The recurrence has been pushed into the outer loop, and the inner loop can fully vectorize. Original Code DO 10 J = 1,N DIFF = X(J) - X(J-1) DO 10 I = 2,N Z(I,J) = DIFF * Z(I-1,J) + Y(I,J) 10 CONTINUE Modified Code 17 DO 10 J = 1,N 10 VDIFF(J) = X(J) - X(J-1) DO 20 I = 2,N DO 20 J = 1,N Z(I,J) = VDIFF(J) * Z(I-1,J) + Y(I,J) 20 CONTINUE The last example appears to be a recursion in both dimen- sions, but it is not recursive within the vectorizable inner loop. The I loop is computing the Jth column of X from an en- tirely separate vector in the (J-1)st column. DO 10 J = 1,N DO 10 I = 1,N X(I,J) = 3.0 * X(I-1,J-1) + Y(I,J) 10 CONTINUE Order Dependency Order dependency is similar to recursion, but it can occur in referencing distinct arrays. In the example below, recall that in vector mode all calculations for the array A occur before the execution of the statement that determines the values in ar- ray B. However, the values in array A are dependent on new values of array B, so the order in which the statements are exe- cuted is critical. Compare the results traced in vector and scalar modes. DO 10 I = 1,N A(I) = B(I) * 2.0 B(I+1) = C(I) / D(I) 10 CONTINUE Scalar Mode Vector Mode A(1) <- B(1) x 2.0 A(1) <- B(1) x 2.0 B(2) <- C(1) / D(1) A(2) <- B(2) x 2.0 A(2) <- B(2) x 2.0 . . . B(3) <- C(2) / D(2) B(2) <- C(1) / D(1) A(3) <- B(3) x 2.0 B(3) <- C(2) / D(2) Vectorization will not occur since results would not be the same for the array A. Reordering the statements allows for vectoriza- tion since equivalent results are obtained: DO 20 I = 1,N B(I+1) = C(I) / D(I) A(I) = B(I) * 2.0 20 CONTINUE Scalar Mode Vector Mode B(2) <- C(1) / D(1) B(2) <- C(1) / D(1) A(1) <- B(1) x 2.0 B(3) <- C(2) / D(2) B(3) <- C(2) / D(2) . . . A(2) <- B(2) x 2.0 A(1) <- B(1) x 2.0 B(4) <- C(3) / D(3) A(2) <- B(2) x 2.0 18 B(4) <- C(3) / D(3) . . . Loop-unrolling Loop-unrolling is used to give the compiler more work to do on each iteration of the loop. This is done by explicitly coding into the equation terms that otherwise would have been handled by a reference to the loop index. In the example below, the four cases referenced by the index K are written out in the restruc- tured code. The number of iterations to unroll is machine depen- dent and usually must be determined experimentally. This example also illustrates another basic guideline in optimization: vec- torize on the largest dimension (assuming N>4). Original Code DO 10 I = 1,N DO 10 J = 1,4 A(I,J) = 0. DO 10 K = 1,4 A(I,J) = A(I,J) + B(I,K)*C(K,J) 10 CONTINUE Restructured Code DO 20 I = 1,N A(I,1) = B(I,1)*C(1,1) + B(I,2)*C(2,1) 1 + B(I,3)*C(3,1) + B(I,4)*C(4,1) A(I,2) = B(I,1)*C(1,2) + B(I,2)*C(2,2) 1 + B(I,3)*C(3,2) + B(I,4)*C(4,2) A(I,3) = B(I,1)*C(1,3) + B(I,2)*C(2,3) 1 + B(I,3)*C(3,3) + B(I,4)*C(4,3) A(I,4) = B(I,1)*C(1,4) + B(I,2)*C(2,4) 1 + B(I,3)*C(3,4) + B(I,4)*C(4,4) 20 CONTINUE In most instances, performance on an unavoidable recursion may be improved by unrolling the loop, because more work is done with each iteration. There may be a need for a cleanup loop if the depth of unrolling is not a perfect factor of the number of iterations. This is illustrated in the following example in the DO 30 loop. Original Recursive Loop DO 10 I = 2,N X(I) = Y(I)*X(I-1) + Z(I) 10 CONTINUE Recursive Loop Unrolled DO 20 I = 2,N,4 X(I) = Y(I)*X(I-1) + Z(I) X(I+1) = Y(I+1)*X(I) + Z(I+1) X(I+2) = Y(I+2)*X(I+1) + Z(I+2) X(I+3) = Y(I+3)*X(I+2) + Z(I+3) 20 CONTINUE DO 30 J = I,N X(J) = Y(J)*X(J-1) + Z(J) 19 30 CONTINUE Memory Bank Conflicts Typically, the memory system on a computer includes many un- its or "banks." Consecutive memory locations are assigned to the banks in a round-robin manner. The time necessary for execution of a loop would be greatly increased if the same bank were suc- cessively accessed, or accessed before the bank cycle time had elapsed. The bank cycle time is the time required by the comput- er before it can access a single bank again. We need to find a way around this problem to decrease the time for execution. A stride is the interval between memory locations for suc- cessive elements in a vector. The stride has an effect on the ef- ficiency of memory accesses. The performance of your vectorized programs can be degraded by memory contention, which is the con- flict between successive accesses to the same memory bank. Each access to a location in a bank causes that bank to be unavailable for some number of periods. Any attempts to access a bank that is currently unavailable are held until the bank is free. A stride of 1 is besA.stride of 16 is baA.stride of 64 is worst. The best way to ensure efficient memory access is to write loops so that elements are accessed down columns (incrementing the first subscript on consecutive elements) and vectors to be loaded or stored have odd strides. An odd stride guarantees that successive accesses are not to the same memory bank. In the following example, on the innermost loop controlled by I, we must access elements B(J,1), B(J,2), ..., B(J,8) in se- quence. In the schematic illustrating the storage of elements in memory, the storage pattern is spread across eight banks to create a "worst-case" scenario. Since the leading dimension specified in the code for the array B is even, bank conflicts oc- cur in accessing these elements. Array C has odd leading dimen- sion and references to it do not cause bank conflicts. Sample Code DIMENSION A(8,8),B(8,8),C(9,8) DO 10 J = 1,8 DO 10 I = 1,8 A(I,J) = B(J,I) SUM = SUM + C(J,I) 10 CONTINUE Bank: 1 2 . . . 8 A(1,1) A(2,1) . . . A(8,1) A(1,2) A(2,2) . . . A(8,2) . . . A(1,8) A(2,8) . . . A(8,8) B(1,1) B(2,1) . . . B(8,1) . . . B(1,8) B(2,8) . . . B(8,8) 20 C(1,1) C(2,1) . . . C(8,1) C(9,1) C(1,2) . . . C(7,2) . . . C(3,7) C(3,8) . . . C(1,8) C(2,8) C(3,8) . . . C(9,8) Changing the leading dimension to an odd integer means that the computer will not experience the delay required to repeatedly access the same bank. Compare the new storage in memory with that above to see that there is no longer a conflict on any of the arrays. An additional row across the banks is required to accomodate the increase in dimension for B. This extra memory requirement is a small price to pay on a supercomputer for the increase in speed. DIMENSION A(8,8),B(9,8),C(9,8) DO 10 J = 1,8 DO 10 I = 1,8 A(I,J) = B(J,I) SUM = SUM + C(J,I) 10 CONTINUE Bank: 1 2 . . . 8 A(1,1) A(2,1) . . . A(8,1) A(1,2) A(2,2) . . . A(8,2) . . . A(1,8) A(2,8) . . . A(8,8) B(1,1) B(2,1) . . . B(8,1) B(9,1) B(1,2) . . . B(7,2) B(8,2) B(9,2) . . . B(6,3) B(7,3) B(8,3) . . . B(5,4) . . . B(3,7) B(4,7) . . . B(1,8) B(2,8) B(3,8) . . . B(9,8) C(1,1) C(2,1) . . . C(8,1) C(9,1) C(1,2) . . . C(7,2) . . . C(3,7) C(3,8) . . . C(1,8) C(2,8) C(3,8) . . . C(9,8) Compiler Directives At the first level of compilation, we may specify options that govern how a file will be changed into object code. To force a program to run in scalar mode, use the novector option. Most compilers that allow different modes of execution also allow the user to fine-tune the type of compilation performed. This is done within the code at the DO loop level by compiler directives. Compiler directives are lines inserted in the source code of a program to specify actions to be performed by the compiler. They begin with a "trigger string" that looks like a comment line to any other compiler. On the Cray C90, the recognized string is CDIR$. The following code excerpt shows how a compiler directive is used within a program. Sample Usage DO 10 J = 2,8 N1 = J 21 N2 = J-1 CDIR$ IVDEP SAFEVL=6 DO 10 I = 7,N X(I,N1) = X(I-6,N2)*Y(I)+Z(I) 10 CONTINUE Directives are used to override or influence the compiler's decisions about how it will execute a DO loop. Compiler direc- tives should be used with caution. Please refer to the appropri- ate manual for details of their usage. Recall that the above example does not present a recursion, although the compiler may judge it to be so. We are looking at different columns of the array X. To override the compiler's de- cision not to vectorize the loop, we can use the compiler direc- tive IVDEP to cause the compiler to ignore what appear to be vec- tor dependencies. The directives may interact in ways that are not immediately apparent. Some directives act globally until turned off; other directives act only on the next DO loop. Some compiler direc- tives only act when certain options are specified at compile time. This is the case with inlining on the Cray, since this feature is disabled by default. To enable the directives INLINE and NOINLINE in the source code, one must compile with the ex- tended cf77 -Wf "-o inline" command. The following table lists the most commonly-used directives on the Cray C90. The list is not exhaustive and as compilers are improved, there may be loops that would not vectorize under one version that will be recognized by an upgraded version. There may be loops that vectorize on one machine and not on another. The only way to know what the compiler is doing with your code is to obtain a vectorization listing, and to compare scalar and vec- tor results to ensure that the program is making correct computa- tions. The man pages for cf77 and cft77 list many other compiler directives that may help improve performance. These documents are also available in AFS on the Ultrix front-ends in the direc- tory /usr/local/doc in the files named cf77_60.big and cft77.big. Cray Compiler Directives IVDEP ignore vector dependencies. Use with caution. SAFEVL = constant safe vector length at which no dependency occurs NEXTSCALAR forces disabling of vectorization for a loop SHORTLOOP placed before loops known to have 1 to 128 iterations. SPARSE indicates the percentage of true cases within a loop is low DENSE indicates the percentage of true cases within 22 a loop is high ------------------------------------------------------------------------ End of Document U-028, Introduction to Vectorization ------------------------------------------------------------------------

© 2008 by the Rector and Visitors of the University of Virginia.

The information contained on the University of Virginia’s Department of Information Technology and Communication (ITC) website is provided as a public service with the understanding that ITC makes no representations or warranties, either expressed or implied, concerning the accuracy, completeness, reliability or suitability of the information, including warrantees of title, non-infringement of copyright or patent rights of others. These pages are expected to represent the University of Virginia community and the State of Virginia in a professional manner in accordance with the University of Virginia’s Computing Policies.