Techniques for Warehousing of Sample Data
Atlanta, Georgia April 03-April 07
DOI Bookmark: http://doi.ieeecomputersociety.org/10.1109/ICDE.2006.157
22nd International Conference on Data ...
Paul G. Brown, IBM Almaden Research Center
Peter J. Haas, IBM Almaden Research Center
We consider the problem of maintaining a warehouse of sampled data that "shadows" a full-scale data warehouse, in order to support quick approximate analytics and metadata discovery. The full-scale warehouse comprises many "data sets," where a data set is a bag of values; the data sets can vary enormously in size. The values constituting a data set can arrive in batch or stream form. We provide and compare several new algorithms for independent and parallel uniform random sampling of data-set partitions, where the partitions are created by dividing the batch or splitting the stream. We also provide novel methods for merging samples to create a uniform sample from an arbitrary union of data-set partitions. Our sampling/merge methods are the first to simultaneously support statistical uniformity, a priori bounds on the sample footprint, and concise sample storage. As partitions are rolled in and out of the warehouse, the corresponding samples are rolled in and out of the sample warehouse. In this manner our sampling methods approximate the behavior of more sophisticated stream-sampling methods, while also supporting parallel processing. Experiments indicate that our methods are efficient and scalable, and provide guidance for their application.
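To make the sample-then-merge idea concrete, here is a minimal Python sketch. It is not the authors' algorithm, which the abstract does not spell out; it combines two standard techniques, assuming each partition is sampled with classic reservoir sampling and that two samples are merged by drawing a hypergeometric split. The names `reservoir_sample`, `hypergeometric_draw`, and `merge_samples` are hypothetical, and the sketch ignores concise sample storage.

```python
import random


def reservoir_sample(values, k, rng):
    """Uniform sample (without replacement) of at most k items from one partition.

    Returns the sample and the partition size n, which the merge step needs.
    """
    sample, n = [], 0
    for v in values:
        n += 1
        if len(sample) < k:
            sample.append(v)
        else:
            j = rng.randrange(n)  # uniform integer in [0, n)
            if j < k:
                sample[j] = v     # replace a random slot with probability k/n
    return sample, n


def hypergeometric_draw(n1, n2, k, rng):
    """How many of k items drawn without replacement from n1 + n2 come from the first n1."""
    r1, r2, hits = n1, n2, 0
    for _ in range(k):
        if rng.randrange(r1 + r2) < r1:
            hits += 1
            r1 -= 1
        else:
            r2 -= 1
    return hits


def merge_samples(s1, n1, s2, n2, k, rng):
    """Merge uniform samples of two disjoint partitions into a uniform sample of their union.

    Assumes s1 and s2 are simple random samples of partitions of sizes n1 and n2,
    each holding min(bound, partition size) items for some bound >= k.
    """
    k = min(k, n1 + n2)
    from_first = hypergeometric_draw(n1, n2, k, rng)
    if from_first > len(s1) or k - from_first > len(s2):
        raise ValueError("input samples are too small for the requested size")
    merged = rng.sample(s1, from_first) + rng.sample(s2, k - from_first)
    return merged, n1 + n2


if __name__ == "__main__":
    rng = random.Random(0)
    s1, n1 = reservoir_sample(range(0, 1000), 50, rng)
    s2, n2 = reservoir_sample(range(1000, 1300), 50, rng)
    merged, n = merge_samples(s1, n1, s2, n2, 50, rng)
    print(len(merged), n)
```

The sample size is capped at `k` throughout, which is one simple way to get an a priori bound on the footprint. Because each partition is sampled independently, the per-partition calls could run in parallel before merging.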
Citation:
Paul G. Brown, Peter J. Haas, "Techniques for Warehousing of Sample Data," 22nd International Conference on Data Engineering (ICDE'06), p. 6, 2006