When making calculations, SPSS essentially loops through every case in the dataset sequentially. So although calculations in syntax appear vectorized (the exception being explicit loops in MATRIX commands), that is, compute y = x - 5. works over the entire x vector without a loop being specified, it really is just looping through all of the records in the data set and calculating the value of y one row at a time.
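As a minimal sketch of what that means in practice, the toy example below (made-up data, separate from the later examples) computes y for each of the three cases in turn:

data list free /x.
begin data
5 10 15
end data.
dataset name toy.
compute y = x - 5.
execute.
*y is calculated row by row: 0, 5, 10.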
We can use this to our advantage in a variety of data management tasks, in conjunction with lagged values in the data matrix. Let's consider making a counter variable within a set of ids. Consider the example dataset below:
data list free /id value.
begin data
1 10
1 11
1 14
1 13
2 12
2 90
2 16
2 14
3 12
3 8
3 17
3 22
end data.
dataset name seq.
To make a counter for the entire dataset, it would be as simple as using the system variable $casenum, but what about a counter variable within each unique id value? Well, we can use SPSS's sequential case processing and LAG to do that for us. For example (note that this assumes the cases are already sorted so the ids are in sequential order in the dataset):
DO IF id <> LAG(id) or MISSING(LAG(id)) = 1.
COMPUTE counter_id = 1.
ELSE IF id = LAG(id).
COMPUTE counter_id = lag(counter_id) + 1.
END IF.
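A quick check on the example data (LIST runs the pending transformations and prints the cases):

list.
*counter_id runs 1, 2, 3, 4 within each of the three ids.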
The first branch evaluates whether the previous id value is different (or missing, which covers the first row in the dataset), and if so starts the counter at 1. If the lagged id value is the same, it increases the counter by 1. It should be clear how this can be used to identify duplicate values as well. Although the MATCH FILES command can frequently be more economical, it is pretty easy using a sort. For instance, let's say in the previous example I wanted to have only one row per id in the dataset (i.e. eliminate duplicate ids), but keep the highest value within each id. This can be done just by sorting the dataset in a particular way (so the row with the highest value is always at the top of each sequential set of ids).
SORT CASES BY id (A) value (D).
COMPUTE dup = 0.
IF id = lag(id) dup = 1.
SELECT IF dup = 0.
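Listing out the remaining cases shows one row per id, with the highest value retained:

list.
*what remains is id 1 value 14, id 2 value 90, id 3 value 22.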
The equivalent expression using match files would be (note the reversal of dup in the two expressions; with match files I want to select the 1 value):
SORT CASES BY id (A) value (D).
MATCH FILES file = *
/first = dup
/by id.
SELECT IF dup = 1.
The match files approach scales better to more variables. If I had two id variables I would need to write IF id1 = lag(id1) and id2 = lag(id2) dup = 1. with the lag approach, but only need to write /by id1 id2. with match files.
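For instance, a minimal sketch of the two variable version (id1 and id2 are hypothetical variables, not in the example dataset):

SORT CASES BY id1 (A) id2 (A) value (D).
MATCH FILES file = *
/first = dup
/by id1 id2.
SELECT IF dup = 1.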
Again, this particular example can be trivially done with another command (AGGREGATE in this instance), but the main difference is that the two approaches above keep all of the variables in the current data set, whereas keeping extra variables needs to be explicitly written on the AGGREGATE command.
DATASET ACTIVATE seq.
DATASET DECLARE agg_seq.
AGGREGATE
/OUTFILE=agg_seq
/BREAK=id
/value=MAX(value).
dataset close seq.
dataset activate agg_seq.
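As an aside, if you want to keep all of the original variables, one alternative sketch (run against the original seq dataset in place of the commands above, and assuming the maximum value is unique within each id) is AGGREGATE with MODE=ADDVARIABLES, which appends the group maximum as a new column and lets you select on it:

AGGREGATE
  /OUTFILE=* MODE=ADDVARIABLES
  /BREAK=id
  /max_value=MAX(value).
SELECT IF value = max_value.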
One may think that the sequential case processing is not that helpful, as I've shown some alternative ways to do the same thing. But consider a case where you want to propagate down values; this can't be done directly via match files or aggregate. For instance, I've done some text munging of tables exported from PDF files that look approximately like this when read into an SPSS data file (where I use periods to symbolize missing data):
data list free /table_ID (F1.0) row_name col1 col2 (3A5).
begin data
1 . Col1 Col2
. Row1 4 6
. Row2 8 20
2 . Col1 Col2
. Row1 5 10
. Row2 15 20
end data.
dataset name tables.
Any more useful representation of the data would need to associate each row with the table it came from. Here is sequential case processing to the rescue:
if MISSING(table_ID) = 1 table_ID = lag(table_ID).
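A listing confirms the fill-down worked, after which the header rows can be dropped (here the header rows are the ones with a literal period in row_name):

list.
*drop the header rows now that table_ID is filled in.
select if row_name <> ".".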
Very simple fix, but perhaps not intuitive without munging around in SPSS for a while. For another simple application of this, see this NABBLE discussion where I give an example of propagating down and concatenating multiple string values (a minimal sketch of that idea follows below). Another (more elaborate) example of this can be seen when merging and sorting a database of ranges to a number within the range.
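Here is a minimal sketch of that string idea (txt is a hypothetical string variable; note LAG sees the already updated value of the prior case, which is what makes the running concatenation work):

string txt_all (a50).
compute txt_all = rtrim(txt).
if $casenum > 1 txt_all = concat(rtrim(lag(txt_all))," ",rtrim(txt)).
*combine with the id = lag(id) trick above to restart the concatenation within each id.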
This is what I would consider an advanced data management tool, and one that I use on a regular basis.
