1 Introduction
The advent of deep learning has led the field of computer science to a difficult crossroads. On the one hand, in pursuit of performance, state-of-the-art methods require ever-larger quantities of data. On the other, users have become more conscious and protective of their data. Much of this concern stems from fear of what could be done with the data: most users favor more relevant search results and recommendations, but many worry about malicious use of their personal data.
Most current work takes approaches related to federated learning or differential privacy (Dwork, 2011; Bonawitz et al., 2019; Geyer et al., 2017). In this work, we take an alternative approach, most similar to Schmidt et al. (2018), by reducing a dataset to a much smaller subset that is most applicable to the task at hand and least applicable to other tasks.
Our framework operates as follows. First, we train one or more models on the entire dataset; we treat these models as approximations of the target function, constrained to lie within the model family. Second, we pose an optimization problem to select a coreset that supports reconstruction of the approximate target while simultaneously optimizing for desirable secondary criteria. In this abstract, we focus on the secondary criterion of preserving privacy of specified target attributes, but the framework applies more broadly.
The benefits of a coreset approach to privacy are threefold. First, by storing much smaller amounts of data, the potential downside of the training data being compromised is significantly lessened. (Incidentally, this also has many auxiliary advantages, such as lowered costs and footprint for data storage, and more rapid training.) Second, it is also possible to construct coresets that increase the difficulty of solving problems other than the specific problem of interest, lessening privacy concerns. Third, it is simple and practical, because all it requires is deletion of certain data elements.
2 Background
Coresets are popular in classical machine learning and data mining, typically for increasing the speed of algorithms (Har-Peled & Kushal, 2007; Tsang et al., 2005). However, they remain relatively unexplored both for neural networks and for other applications.

Given a training set of examples $X = \{x_1, \dots, x_n\}$ and corresponding labels $Y = \{y_1, \dots, y_n\}$, we train a model $f$ to predict $y$ on unseen $x$. We are provided a model evaluator $E$ (typically based on a testing set) that maps a model to a scalar performance. The coreset problem is to find indices $i_1, \dots, i_m$ inducing a subset $X' \subseteq X$ of the examples and $Y' \subseteq Y$ of the labels that maximize $E(f_{X', Y'})$, where $f_{X', Y'}$ denotes the model trained on $(X', Y')$. Analogously, we define the $m$-coreset problem as the same problem with $|X'| = m$. Throughout the paper, $m$ will denote the coreset size.
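As a concrete illustration of this definition, the following brute-force sketch enumerates every $m$-subset and keeps the one the evaluator scores highest. The helper names (`best_coreset`, `fit`, `evaluate`) are ours, not the paper's, and exhaustive search is only feasible at toy sizes:

```python
from itertools import combinations

import numpy as np

def best_coreset(X, y, m, fit, evaluate):
    """Exhaustively search all m-subsets for the one maximizing E(model)."""
    best_idx, best_score = None, -np.inf
    for idx in combinations(range(len(X)), m):
        idx = list(idx)
        model = fit(X[idx], y[idx])  # train on the candidate subset
        score = evaluate(model)      # scalar performance E(model)
        if score > best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score
```

For example, with `fit` returning the subset mean and `evaluate` rewarding closeness to the full-sample mean, the search returns the $m$ points whose mean best matches the full data.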
3 Methods
One might reasonably wonder whether coresets could ever perform significantly better than random samples. As a trivial example, consider the problem of estimating the mean and variance of a normal distribution $\mathcal{N}(\mu, \sigma^2)$ given $n$ samples. As the samples are unbiased, our best estimate of the mean is the sample mean, an unbiased estimator with expected squared error $\sigma^2 / n$. As $n$ grows very large, we obtain an excellent estimate of $\mu$. Likewise, standard techniques allow us to estimate the variance of the underlying distribution. Given $n$ samples, we may for example select a coreset of $m$ elements that preserves the sample mean, outperforming a random sample of size $m$. We may also carefully select for this coreset points with greater dispersion than a random sample, obscuring the variance of the distribution. Such a coreset would be appropriate to publish if the goal is to allow learning of the mean while protecting the variance.

Thus, by condensing the greater information of the full sample into a small number of carefully chosen examples, we may produce a (non-representative) sample that nonetheless achieves superior performance on our evaluation metric. Furthermore, given this great freedom, we may also select unrepresentative samples that disguise other information about the true distribution.
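The mean-preserving, variance-inflating selection described above can be sketched in a few lines of NumPy. The pairing heuristic here is our own illustration rather than a method from the paper: we pair the most extreme points above and below the sample mean, so the deviations roughly cancel in the mean while inflating the spread.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 10_000, 100
samples = rng.normal(loc=2.0, scale=1.0, size=n)
sample_mean = samples.mean()

# Take the m/2 most extreme points on each side of the sample mean:
# paired extremes roughly cancel in the mean but contribute large spread.
above = np.sort(samples[samples >= sample_mean])[::-1]  # descending
below = np.sort(samples[samples < sample_mean])         # ascending
coreset = np.concatenate([above[: m // 2], below[: m // 2]])
```

The coreset mean tracks the full-sample mean closely, while its variance is several times larger than the sample variance, obscuring the true $\sigma^2$.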
What enabled us to successfully condense and disguise our dataset was our knowledge of the learning method. Our algorithm did not require the sample variance; therefore we did not sacrifice performance by altering it. One could imagine alternative approaches that make use of the sample variance, such that obscuring it might damage performance on the task of mean estimation. While we cannot cover this in more detail here, as in other areas of privacy, privacy-preserving coresets require striking a balance between accuracy and privacy.
4 A synthetic case study using linear regression
Suppose we have a dataset $X$, target values $y$ which we would like to predict, and targets $z$ which we would like to hide. We propose to build a loss that encourages fidelity in reconstructing $y$ while obscuring $z$. Specifically, we use the following scheme for coreset construction, following our framework. First, train on the entire dataset, and record the learned weights $w$ and intercept $b$. Then, define a loss for a point $(x_i, y_i, z_i)$ as $L(i) = (\hat{y}_i - y_i)^2 + \lambda L_z(i)$, where $\hat{y}_i = w^\top x_i + b$. The first term of the loss favors points that align with the model for $y$ learned from the full data. The second term $L_z$, not yet specified, should make it hard to recover $z$. To construct a coreset, simply choose the $m$ points with lowest loss.
For the term $L_z$, many approaches are possible, but we consider two simple ones:

Hide the secret value. In this approach, we take $L_z(i) = (z_i - \bar{z})^2$, where $\bar{z}$ is the mean of $z$ over the full dataset. This is a simple way to make $z$ difficult to predict, by choosing samples whose secret values are all roughly the same. However, this method might select a coreset that is artificially uniform in $z$, perhaps selecting almost all points from a majority class.

Plant a synthetic secret value. In this approach, we take $L_z(i) = (z_i - (v^\top x_i + c))^2$ for some random vector $v$ and intercept $c$. As the coreset is a subset of data points with their labels, we cannot modify the labels of points. Instead, we favor selection of points whose noise in the direction of $v$ will mislead a learner. This encourages simple learners to discover the planted relationship $v$ rather than the original one.
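The scheme with the planted-secret term can be sketched end-to-end on synthetic data. All names below are ours; intercepts are omitted for brevity, and `np.linalg.lstsq` stands in for both the full-data training step and the downstream learner:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, lam = 5000, 5, 200, 1.0
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)  # relation to predict: y ~ Xw
u_true = rng.normal(size=d)  # secret relation to hide: z ~ Xu
y = X @ w_true + 0.5 * rng.normal(size=n)
z = X @ u_true + 0.5 * rng.normal(size=n)

# Step 1: train on the entire dataset and record the weights.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: per-point loss = fidelity to the learned y-model,
# plus lambda times the planted-secret term with a random direction v.
v = rng.normal(size=d)
loss = (X @ w_hat - y) ** 2 + lam * (X @ v - z) ** 2

# Step 3: the coreset is simply the m lowest-loss points.
idx = np.argsort(loss)[:m]
X_c, y_c, z_c = X[idx], y[idx], z[idx]

# A learner trained on the coreset should recover w well
# but be misled about u, pulled toward the planted direction v.
w_c, *_ = np.linalg.lstsq(X_c, y_c, rcond=None)
u_c, *_ = np.linalg.lstsq(X_c, z_c, rcond=None)
```

In runs of this sketch we would expect `w_c` to stay close to `w_true`, while `u_c` lands far from `u_true`, near the planted `v`.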
In this short paper, we do not argue that these two simple approaches are safe against targeted attacks; in fact, we expect they are not. However, we can analyze their effect on standard training algorithms. Let $X = \{x_i\}_{i=1}^n$ with $x_i \sim \mathcal{N}(0, I_d)$ and labels $y_i = w^\top x_i + \epsilon_i$ and $z_i = u^\top x_i + \epsilon'_i$, where each $\epsilon_i, \epsilon'_i \sim \mathcal{N}(0, \sigma^2)$. The task is to recover the line of best fit. As seen in Table 1, for a full dataset of size $n$ and coreset size $m$, our method works well, preserving prediction of $y$ while masking the true relationship between $X$ and $z$.
Table 1: Estimated coefficients for predicting $y$ and for the hidden target $z$ (standard deviations in parentheses), comparing the true values, the fit on the full dataset, and the fit on the coreset.
5 Coresets for CIFAR-100
The synthetic example above shows that coresets can hide properties of the original dataset, but the example is stylized, and the $y$ and $z$ values are chosen independently. We now consider a real dataset, and attempt to obscure one set of labels that is highly correlated with another.
CIFAR-100 comes labeled with both coarse-grained and fine-grained labels. So, we will try to select a subset of data such that performance on identifying coarse-grained classes is maximized but performance on fine-grained classification is minimized. Additionally, to ensure that classes are not simply dropped from the dataset, we also constrain the dataset to be class-balanced under the fine-grained labels.
Again following our framework, we train a model on the full dataset. For each example, we collect the average gradient magnitude over training as a proxy for its difficulty, the idea being that difficult, unrepresentative, or uncommon examples will yield lower performance for small $m$ than easy examples. (Evidence for this method is presented in Figure 2.) We therefore choose examples that produce good coresets for coarse-grained classification but generalize poorly on the individual fine-grained classes.
We trained a Wide ResNet 16-8 on CIFAR-100 and calculated the average gradient norm of each example on each task. We then constructed several coresets for evaluation. The fine-grained-masking coreset is constructed by sorting and class-balancing data by the quotient of an example's average fine-grained gradient norm and its coarse-grained norm; the coarse-grained-masking coreset uses the inverse of that quotient. For completeness, we also report results for a random coreset and for a high-quality coreset for each task, built from the minimum-gradient-norm examples. Final accuracies were obtained using a small CNN due to compute limitations.
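The selection step can be sketched as follows. The helper name and the exact direction of the sort are our reading of the procedure above, not code from the paper: per fine-grained class, keep the examples with the largest fine-to-coarse gradient-norm quotient, which should be hard for fine-grained learning yet easy for coarse-grained learning.

```python
import numpy as np

def masking_coreset(grad_fine, grad_coarse, fine_labels, per_class):
    """Class-balanced fine-grained-masking selection: per fine class,
    keep the examples with the largest fine/coarse gradient-norm ratio."""
    ratio = grad_fine / (grad_coarse + 1e-12)  # avoid division by zero
    keep = []
    for c in np.unique(fine_labels):
        members = np.flatnonzero(fine_labels == c)
        order = members[np.argsort(ratio[members])[::-1]]  # descending ratio
        keep.extend(order[:per_class])
    return np.array(sorted(keep))
```

Swapping the ratio for its reciprocal yields the coarse-grained-masking coreset.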
As seen in Table 2, we were successful in constructing coresets which maximize performance on one task while minimizing it on the other.
Coreset                    Coarse-grained        Fine-grained
                           classification (%)    classification (%)
Random                     36.35                 22.83
Fine-grained min-norm      42.86                 30.79
Coarse-grained min-norm    43.10                 27.15
Fine-grained-masking       39.55                 16.40
Coarse-grained-masking     33.53                 26.32

Table 2: Performance of a CNN classifier on various coresets. The bottom two rows demonstrate that, even on these closely related tasks, our method can construct coresets which significantly favor one task over the other.
6 Discussion
The main implication of this work is that one need not alter or synthesize data, nor devise alternate embeddings for the data as in Zemel et al. (2013) and Wang et al. (2018). Rather, by carefully selecting an appropriate subset of data for release, one can sufficiently alter the statistics of relevant properties so as to obscure protected values. In this extended abstract, we introduce these ideas without discussing specifics of attacks and counterattacks, as the start of a conversation about coreset-based privacy.
In the linear regression domain, our method successfully hides protected functions from direct likelihood-maximizing approaches on synthetic data. We would like to point out some additional properties of our formulation. First, as the dimensionality of the input data increases, so does the probability that the random direction $v$ is approximately orthogonal to the true relationship. Second, due to the presence of the parameter $\lambda$ in the loss, one may easily tune the trade-off between performance and privacy. Third, the factor of dataset reduction required to construct a coreset depends on the magnitude of the unexplained variance: for very low unexplained variance, one needs many samples just to find a few unrepresentative examples. In more difficult, real-world problems, however, this should be easier. One potential vulnerability is that an adversary with good priors on the distribution of input features may be able to gain information from the difference in feature distribution in the coreset. In the future, we would like to address these problems and experiment with regressions on real-world data.

Additionally, it should be noted that the CIFAR-100 task we chose is an especially difficult problem for this approach. First, the size of the dataset is only several times larger than the amount of data required to reasonably consistently train a CNN from scratch, which limits the choice of unrepresentative samples. Second, the tasks of predicting class versus superclass are very similar, which makes it surprising that it is even possible to find coresets which are significantly more predictive on one task than the other. It will be interesting to try these ideas on other datasets such as ImageNet.
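The first property above, that a random direction becomes nearly orthogonal to any fixed relationship as dimensionality grows, can be verified numerically: the mean absolute cosine between independent random vectors shrinks roughly as $1/\sqrt{d}$. A quick sketch (the function name is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_abs_cosine(d, trials=2000):
    """Average |cos angle| between pairs of independent random vectors."""
    a = rng.normal(size=(trials, d))
    b = rng.normal(size=(trials, d))
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return np.abs(dots / norms).mean()
```

In low dimension the typical |cosine| is large (about 0.64 for $d = 2$), while for $d$ in the hundreds it falls below 0.1, so a planted direction is unlikely to interfere with the true one.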
7 Conclusions
In this work we have presented simple and practical methods for constructing fair coresets which better preserve privacy. We demonstrate their utility on both synthetic and realworld datasets, with both linear and neural models. We hope that this work will pave the way for smaller and more private datasets in the future.
Acknowledgments
We’d like to thank Tushar Chandra for productive conversation related to this work, and Asher Spector for advice, comments, and revisions on this paper.
References
 Bonawitz et al. (2019) Keith Bonawitz, Hubert Eichner, Wolfgang Grieskamp, Dzmitry Huba, Alex Ingerman, Vladimir Ivanov, Chloe Kiddon, Jakub Konecny, Stefano Mazzocchi, H Brendan McMahan, et al. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046, 2019.
 Dwork (2011) Cynthia Dwork. Differential privacy. Encyclopedia of Cryptography and Security, pp. 338–340, 2011.
 Geyer et al. (2017) Robin C Geyer, Tassilo Klein, and Moin Nabi. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557, 2017.

 Har-Peled & Kushal (2007) Sariel Har-Peled and Akash Kushal. Smaller coresets for k-median and k-means clustering. Discrete & Computational Geometry, 37(1):3–19, 2007.
 Schmidt et al. (2018) Melanie Schmidt, Chris Schwiegelshohn, and Christian Sohler. Fair coresets and streaming algorithms for fair k-means clustering. arXiv preprint arXiv:1812.10854, 2018.
 Tsang et al. (2005) Ivor W Tsang, James T Kwok, and Pak-Ming Cheung. Core vector machines: Fast SVM training on very large data sets. Journal of Machine Learning Research, 6(Apr):363–392, 2005.
 Wang et al. (2018) Tongzhou Wang, Jun-Yan Zhu, Antonio Torralba, and Alexei A Efros. Dataset distillation. arXiv preprint arXiv:1811.10959, 2018.
 Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International Conference on Machine Learning, pp. 325–333, 2013.