S4 Migration
This document attempts to list places where conventions that have emerged in the optmatch codebase could be solidified as formal classes, along with opportunities to use classes and methods to make optmatch easier for end users. See the s4 branch for work in progress.
The first step of bipartite matching is creating a representation of the distances between treatment and control units. The canonical representation in the literature is a control by treatment matrix of entries, some of which may have infinite value representing unmatchable pairs. We might replace that matrix by a sparse representation where only finite entries are stored. If we consider the problem as a network with edges weighted by distance, we might break the problem into connected components, each of which is one of the matrix representations. The question arises how to handle all of these representations with an eye towards efficiency of space and computation.
The matrix class already exists and is used throughout optmatch. Finite entries indicate distance; infinite entries indicate non-matchable pairs. For small or dense problems (i.e. most entries are finite), it is hard to improve upon the basic matrix class.
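As a minimal sketch of the dense representation, the base R snippet below builds a small distance matrix by hand. The unit names and distances are invented for illustration, and the orientation (treated units in rows, controls in columns) is an assumption of this sketch.

```r
# Illustrative distances only; Inf marks a pair that may not be matched.
# Assumption of this sketch: rows are treated units, columns are controls.
d <- matrix(c(0.5, Inf, 2.0,
              Inf, 1.5, 0.3),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2"), c("c1", "c2", "c3")))

matchable <- is.finite(d)      # logical matrix of allowed pairs
n_allowed <- sum(matchable)    # number of allowed pairs
avg_dist  <- mean(d[matchable]) # average distance among allowed pairs
```

When most entries are finite, as here, storing the full matrix costs little extra space.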
For sparser problems (for example, stratifying on one or more blocking factors), a sparser representation is called for. The new S4 class InfinitySparseMatrix (ISM) represents these problems, and is the default return type from functions that create sparse descriptions (e.g. caliper). The class does not support the matrix algebra found in the SparseM or Matrix packages, but we suspect this will not be readily needed by people creating distance matrices. It does support methods for manipulation such as subset, cbind, rbind, t, and element-wise arithmetic. Coercion from and to matrices is also supported via as.InfinitySparseMatrix and as.matrix, respectively. These matrices support row and column names. Joining operations give precedence to names, such that two ISMs with the same row names in different order will create the appropriate cbinded matrix. The class inherits from numeric, so scalar operations work on the finite entries, as do functions like mean and sum that process vectors.
While subject to change, the internal representation of the ISM class is composed of the following slots:
- `.Data` holds the finite entries. This is accessed by default for scalar operations, e.g. `2 * x`.
- `cols` is a vector of column ids the same length as `.Data`.
- `rows` likewise for row ids. Between `cols` and `rows`, we know the location of each item in `.Data`.
- `rownames` holds the names of the rows (optional).
- `colnames` holds the names of the columns (optional).
- `dimension` is a two element list of the number of rows and columns. Usually set automatically during creation.
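The slot layout above can be imitated with a plain list in base R. This is only a sketch of the idea (store finite entries plus their positions), not the InfinitySparseMatrix implementation; the list, the field name `data`, and the helper `to_dense` are hypothetical.

```r
# Dense input with Inf marking unmatchable pairs (values are illustrative).
dense <- matrix(c(0.5, Inf, Inf,
                  Inf, 1.5, 0.3),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("t1", "t2"), c("c1", "c2", "c3")))

# Keep only the finite entries together with their row/column positions.
idx <- which(is.finite(dense), arr.ind = TRUE)
sparse <- list(
  data      = dense[idx],             # plays the role of .Data
  rows      = unname(idx[, "row"]),
  cols      = unname(idx[, "col"]),
  rownames  = rownames(dense),
  colnames  = colnames(dense),
  dimension = dim(dense)
)

# Hypothetical helper: rebuild the dense form by starting from an all-Inf
# matrix and filling in the stored entries.
to_dense <- function(s) {
  m <- matrix(Inf, nrow = s$dimension[1], ncol = s$dimension[2],
              dimnames = list(s$rownames, s$colnames))
  m[cbind(s$rows, s$cols)] <- s$data
  m
}

restored <- to_dense(sparse)
doubled  <- 2 * sparse$data   # scalar arithmetic touches only finite entries
```

Only three of the six entries are stored here; the savings grow with the share of infinite entries.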
For sparse representations we could in principle use classes from the SparseM or Matrix packages; however, our running impression is that their orientations are different enough from ours that it would be better to create our own class or classes. (We are not doing very heavy matrix algebra. It does not appear that SparseM supports row/column names, while Matrix does. These names are nice to have for later joining of the matching to the original data.)
The S3 optmatch.dlist class has been deprecated.
DistanceSpecification is a "class union" (see Chambers (2008) chapter 9 for more details). This union formalizes the fact that either a matrix or InfinitySparseMatrix can serve as a distance specification. This class union acts like a normal class and can serve as the indicator for dispatch for methods or as a slot in an S4 object. If a class is part of the union, it must also support the following operations.
- `prepareMatching(x)` turns a distance specification into a "canonical matching form." While not a formal class, the canonical form is a `data.frame` with 3 columns: `control`, `treatment`, `distance`.
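The following sketch shows how a class union and a generic with one method per member class fit together, using only the methods package. All names prefixed with `Toy` or `toy` are hypothetical stand-ins, not optmatch classes or functions, and the sketch assumes treated units are in rows and that row and column names are present.

```r
library(methods)

# Toy stand-in for a sparse distance class (hypothetical).
setClass("ToySparse",
         representation(rows = "integer", cols = "integer",
                        rownames = "character", colnames = "character"),
         contains = "numeric")

# Either a dense matrix or the toy sparse class can act as a distance spec.
setClassUnion("ToyDistanceSpec", c("matrix", "ToySparse"))

# Every member of the union supplies a method for this generic.
setGeneric("toyPrepare", function(x) standardGeneric("toyPrepare"))

setMethod("toyPrepare", "matrix", function(x) {
  keep <- which(is.finite(x), arr.ind = TRUE)   # drop unmatchable pairs
  data.frame(control   = colnames(x)[keep[, "col"]],
             treatment = rownames(x)[keep[, "row"]],
             distance  = x[keep],
             stringsAsFactors = FALSE)
})

setMethod("toyPrepare", "ToySparse", function(x) {
  data.frame(control   = x@colnames[x@cols],
             treatment = x@rownames[x@rows],
             distance  = x@.Data,
             stringsAsFactors = FALSE)
})

# The same two allowed pairs, once dense and once sparse.
d <- matrix(c(0.5, Inf,
              Inf, 1.5),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2"), c("c1", "c2")))
s <- new("ToySparse", c(0.5, 1.5),
         rows = c(1L, 2L), cols = c(1L, 2L),
         rownames = c("t1", "t2"), colnames = c("c1", "c2"))

from_dense  <- toyPrepare(d)
from_sparse <- toyPrepare(s)
```

Both calls yield a three-column data frame with one row per allowed pair, so downstream code need not know which representation it was given.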
exactMatch is a generic function for producing InfinitySparseMatrix objects representing stratified or exact matches. There are currently two methods.
- `exactMatch(B, Z)` where `B` is a factor and `Z` is a two level treatment indicator. `B` and `Z` must be the same length. Treatment-control pairs that have the same level in `B` receive a zero in the resulting matrix, otherwise the pair gets an infinite entry.
- `exactMatch(Z ~ B, [data = a.data.frame])`. Like the previous example, except that it uses a formula specification and an optional `data.frame` that contains the vectors `Z` and `B`. Formulas of the form `Z ~ B1 + B2 + B3 ...` stratify on the interaction of all the blocking factors.
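The rule described above (zero for pairs in the same stratum, infinity otherwise) can be sketched on dense matrices in base R. The helper `toy_exact` is hypothetical and returns an ordinary matrix rather than an InfinitySparseMatrix; it assumes `Z` is coded 0/1 and places treated units in rows.

```r
# Hypothetical helper, not optmatch's exactMatch: builds a dense 0/Inf matrix
# from a blocking factor B and a 0/1 treatment indicator Z.
toy_exact <- function(B, Z, ids = as.character(seq_along(Z))) {
  stopifnot(length(B) == length(Z), all(Z %in% c(0, 1)))
  treated <- Z == 1
  # TRUE where a treated unit and a control share the same level of B.
  same <- outer(as.character(B[treated]), as.character(B[!treated]), "==")
  m <- ifelse(same, 0, Inf)
  dimnames(m) <- list(ids[treated], ids[!treated])
  m
}

ids <- c("u1", "u2", "u3", "u4", "u5")
Z   <- c(1, 0, 0, 1, 0)
B1  <- factor(c("a", "a", "b", "b", "b"))
B2  <- factor(c("x", "x", "x", "y", "y"))

one_factor  <- toy_exact(B1, Z, ids)
# Several blocking factors: stratify on their interaction.
two_factors <- toy_exact(interaction(B1, B2), Z, ids)
```

Adding a second blocking factor can only remove allowed pairs, since units must then agree on both factors.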
The results of exactMatch can be added to an existing distance specification or used as the excludes argument to mdist (see below for more details).
The mdist function formerly took an argument structure.fmla that allowed creating distances stratified by blocking factors. This argument has been replaced by a new format, mdist(..., excludes = aInfinitySparseMatrix). The new format allows more flexible limits on which matched differences are permitted, for example mdist(..., excludes = caliper(...)). The old behavior of mdist(..., data = my.data, structure.fmla = Z ~ B) is equivalent to mdist(..., excludes = exactMatch(Z ~ B, data = my.data)).
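To see why a 0/Inf matrix works as an exclusion, consider element-wise addition on plain matrices. This sketch does not call mdist; the matrices and their values are invented, and it only mimics the effect of adding an exclusion to a distance specification.

```r
nms <- list(c("t1", "t2"), c("c1", "c2", "c3"))

# Illustrative distances between treated units (rows) and controls (columns).
dist <- matrix(c(0.4, 1.2, 0.9,
                 2.0, 0.1, 0.7),
               nrow = 2, byrow = TRUE, dimnames = nms)

# Illustrative exclusion: 0 for allowed pairs, Inf for forbidden ones.
excl <- matrix(c(0,   Inf, Inf,
                 Inf, 0,   0),
               nrow = 2, byrow = TRUE, dimnames = nms)

# Adding 0 leaves a distance unchanged; adding Inf makes the pair unmatchable.
combined <- dist + excl
```

Because the exclusion is just another distance-like object, several exclusions can be stacked by adding them together before combining with the distances.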
The optmatch.dlist class has been deprecated in favor of the InfinitySparseMatrix class. Users should not notice a difference if they do not manipulate the objects directly. mdist and caliper, as well as the matching functions, have been updated to work with the new objects directly.
Use the glm method and formula methods for mdist instead.
Logical statements on InfinitySparseMatrix objects have a slightly different interpretation than on optmatch.dlist objects. Since the InfinitySparseMatrix class descends from a numeric vector, logical operators result in a logical vector (not a list of matrices of 1s and 0s, as was the case with optmatch.dlist objects). Here is an example illustrating the change, from tests/fullmatch.R. The old code:

```r
data(nuclearplants)
mhd2 <- mahal.dist(~date+cum.n, nuclearplants, pr~pt)
fullmatch(mhd2 < 1)
```

The equivalent with the new class:

```r
data(nuclearplants)
mhd2 <- mdist(pr ~ date + cum.n, data = nuclearplants,
              exclusions = exactMatch(pr ~ pt, data = nuclearplants))
# Compute the mask once, before modifying mhd2; otherwise the second
# assignment would also overwrite the entries just set to 1.
below1 <- mhd2 < 1
mhd2[below1] <- 1
mhd2[!below1] <- 0
fullmatch(mhd2)
```

If you are using logical operators, you are strongly encouraged to consider the caliper function instead.
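The difference between a logical comparison and a caliper-style result can be sketched on a plain matrix. The helper `toy_caliper` is hypothetical and rests on an assumption: that a caliper marks pairs within the width as allowed (0) and all other pairs as unmatchable (Inf). It is not the optmatch caliper function.

```r
# Hypothetical helper: 0 where the distance is within the width, Inf elsewhere.
toy_caliper <- function(d, width) {
  out <- ifelse(d <= width, 0, Inf)
  dimnames(out) <- dimnames(d)
  out
}

# Illustrative distances only.
d <- matrix(c(0.4, 1.2,
              0.9, 2.0),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("t1", "t2"), c("c1", "c2")))

mask <- d < 1               # logical result: TRUE/FALSE, distances are lost
cal  <- toy_caliper(d, 1)   # 0/Inf result: usable as an exclusion
kept <- d + cal             # original distances survive for allowed pairs
```

Unlike the logical mask, the 0/Inf form keeps forbidden pairs out of the problem while preserving the distances of the pairs that remain.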
The subclass.indices argument, formerly marked deprecated, has been removed.