Scalable Correlated Sampling for Join Query Estimations on Big Data

10 pages • Published: September 26, 2019

Abstract

Estimating query results within limited time constraints is a challenging problem in big data management research. Query estimation based on simple random samples performs well for simple selection queries; however, it returns results with extremely high relative errors for complex join queries. Existing methods only work well with foreign key joins, and the sample size can grow dramatically as the dataset gets larger. This research implements a scalable sampling scheme in a big data environment, namely correlated sampling in map-reduce, that can speed up query result estimation, give precise join query estimations, and minimize storage costs when presented with big data. Extensive experiments with large TPC-H datasets in Apache Hive show that our sampling method produces fast and accurate query estimations on big data.
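
The abstract names correlated sampling without spelling out the mechanics, so the following is a minimal single-process sketch of one common formulation of the idea, not the paper's map-reduce or Hive implementation. It assumes that both tables are sampled with the same hash function on the join key, so a key value is kept in both samples or in neither, and that the sample join size is scaled by 1/p. The function names, the `seed` parameter, and the toy tables are hypothetical.

```python
import hashlib
from collections import Counter


def unit_hash(value, seed=0):
    """Map a join-key value to a deterministic pseudo-random float in [0, 1)."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    # Keep the top 53 bits so the quotient is an exact float strictly below 1.
    return (int.from_bytes(digest[:8], "big") >> 11) / 2**53


def correlated_sample(rows, key, p, seed=0):
    """Keep every row whose join-key hash falls below the sampling rate p.

    Using the same seed for both tables makes the samples correlated:
    a key value is retained in both tables or in neither.
    """
    return [row for row in rows if unit_hash(row[key], seed) < p]


def join_size(rows_r, rows_s, key_r, key_s):
    """Count the tuples of the equi-join of rows_r and rows_s."""
    counts_s = Counter(row[key_s] for row in rows_s)
    return sum(counts_s[row[key_r]] for row in rows_r)


def estimate_join_size(sample_r, sample_s, key_r, key_s, p):
    """Scale the sample join size by 1/p (assumed estimator for this sketch)."""
    return join_size(sample_r, sample_s, key_r, key_s) / p


# Toy tables (hypothetical data): keys 0..49 appear in R, keys 0..69 in S.
table_r = [{"k": i % 50} for i in range(1000)]
table_s = [{"k": i % 70} for i in range(700)]

p = 0.2
sample_r = correlated_sample(table_r, "k", p)
sample_s = correlated_sample(table_s, "k", p)
print("exact:   ", join_size(table_r, table_s, "k", "k"))
print("estimate:", estimate_join_size(sample_r, sample_s, "k", "k", p))
```

Because each key value is kept with probability p and, when kept, contributes all of its matching pairs to the sample join, the expected sample join size is p times the true join size, which is why the sketch divides by p.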

Keyphrases: query approximation, query size estimation, sampling

In: Frederick Harris, Sergiu Dascalu, Sharad Sharma and Rui Wu (editors). Proceedings of 28th International Conference on Software Engineering and Data Engineering, vol 64, pages 41-50.

Links:	https://easychair.org/publications/paper/RB13
	https://doi.org/10.29007/87vt

BibTeX entry

@inproceedings{SEDE2019:Scalable_Correlated_Sampling_Join,
  author    = {David Wilson and Wen-Chi Hou and Feng Yu},
  title     = {Scalable Correlated Sampling for Join Query Estimations on Big Data},
  booktitle = {Proceedings of 28th International Conference on Software Engineering and Data Engineering},
  editor    = {Frederick Harris and Sergiu Dascalu and Sharad Sharma and Rui Wu},
  series    = {EPiC Series in Computing},
  volume    = {64},
  publisher = {EasyChair},
  bibsource = {EasyChair, https://easychair.org},
  issn      = {2398-7340},
  url       = {https://easychair.org/publications/paper/RB13},
  doi       = {10.29007/87vt},
  pages     = {41-50},
  year      = {2019}}
