Coarse-refinement dilemma: On generalization bounds for data clustering

2021
The data clustering problem is of central importance for the area of machine learning, given its usefulness to represent data structural similarities from input spaces. Although, data clustering counts on scarse literature of a theoretical framework with generalization guarantees. In this context, this manuscript introduces a new concept, based on multidimensional persistent homology, to analyze the conditions on which a clustering model is capable of generalizing data. As a first step, we propose a more general definition of DC problem by relying on topological spaces, instead of metric ones as typically approached in the literature. From that, we show that the data clustering problem presents an analogous dilemma to the bias-variance one, which is here referred to as the coarse-refinement dilemma, from which we conclude that: (i) highly-refined partitions and the clustering instability (overfitting); and (ii) over-coarse partitions and the lack of representativeness (underfitting). The coarse-refinement dilemma suggests the need of a relaxation of Kleinberg's richness axiom, as such axiom allows the production of unstable or unrepresentative partitions. Experimental exploration considering different clustering refinements can, then, depict such partitions.
EXPERT SYSTEMS WITH APPLICATIONS
卷号:184
ISSN:0957-4174
来源机构
Universidade de Sao Paulo
收录类型
SSCI
发表日期
2021
学科领域
循证管理学
国家
巴西
语种
英语
DOI
10.1016/j.eswa.2021.115399
其他关键词
TOPOLOGICAL PERSISTENCE; STABILITY
EISSN
1873-6793
资助机构
FAPESP (Sao Paulo Research Foundation), BrazilFundacao de Amparo a Pesquisa do Estado de Sao Paulo (FAPESP) [2017/16548-6, PROEX-5881819/D, 302077/2017-0]; CAPES (Coordination for the Improvement of Higher Education Personnel), BrazilCoordenacao de Aperfeicoamento de Pessoal de Nivel Superior (CAPES) [2017/16548-6, PROEX-5881819/D, 302077/2017-0]; CNPq (National Counsel of Technological and Scientific Development), BrazilConselho Nacional de Desenvolvimento Cientifico e Tecnologico (CNPQ) [2017/16548-6, PROEX-5881819/D, 302077/2017-0]
资助信息
This paper is based upon projects sponsored by FAPESP (Sao Paulo Research Foundation) , CAPES (Coordination for the Improvement of Higher Education Personnel) and CNPq (National Counsel of Technological and Scientific Development) , Brazil, under grants 2017/16548-6, PROEX-5881819/D, 302077/2017-0. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and/do not necessarily reflect the views of FAPESP, CAPES, and CNPq.
被引频次(WOS)
0
被引更新日期
2022-01
关键词
Data clustering Topology Persistent homology Multidimensional persistence Algorithm stability