Flexible Similarity Search of Semantic Vectors Using Fulltext Search Engines

This publication belongs to the Faculty of Informatics, not the Faculty of Arts. The official publication page is on the muni.cz website.

Authors

RŮŽIČKA Michal, NOVOTNÝ Vít, SOJKA Petr, POMIKÁLEK Jan, ŘEHŮŘEK Radim

Year of publication 2017
Type Article in proceedings
Conference CEUR Workshop Proceedings, Vol. 1923
MU Faculty or unit

Faculty of Informatics

Citation
www
Field Informatics
Keywords vector space modelling; semantic vectors encodings; inverted-index; systems performance; document representations; Latent Semantic Analysis; doc2vec; GloVe; Elasticsearch; evaluation; performance optimization
Description Vector representations and vector space modeling (VSM) play a central role in modern machine learning. In our recent research we proposed a novel approach to ‘vector similarity searching’ over dense semantic vector representations. This approach can be deployed on top of traditional inverted-index-based fulltext engines, taking advantage of their robustness, stability, scalability and ubiquity. In this paper we validate our method using varied datasets ranging from text representations and embeddings (LSA, doc2vec, GloVe) to SIFT descriptors of image data. We show how our approach handles the indexing and querying in these domains, building a fast and scalable vector database with a tunable trade-off between vector search performance and quality, backed by a standard fulltext engine such as Elasticsearch.
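
The description does not spell out how dense vectors are made searchable by an inverted index, so the following Python sketch is only one plausible illustration, not the authors' method. It assumes each vector dimension is rounded to a fixed precision and emitted as a string token, that candidates are shortlisted by token overlap, and that the shortlist is re-ranked by exact cosine similarity. The names `encode`, `TokenIndex`, `precision`, and `keep` are hypothetical, and a plain in-memory dictionary stands in for a fulltext engine such as Elasticsearch.

```python
import math
from collections import defaultdict


def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector)) or 1.0
    return [x / norm for x in vector]


def encode(vector, precision=1, keep=None):
    """Turn a dense vector into a set of string tokens, one per kept dimension.

    Assumption: each value is rounded into an integer bucket, so nearby
    values in the same dimension map to the same token.
    """
    dims = list(range(len(vector)))
    if keep is not None:
        # Keep only the `keep` dimensions with the largest magnitude.
        dims = sorted(dims, key=lambda i: -abs(vector[i]))[:keep]
    scale = 10 ** precision
    return {f"f{i}_{round(vector[i] * scale)}" for i in dims}


class TokenIndex:
    """Toy inverted index: token -> set of document ids."""

    def __init__(self, precision=1):
        self.precision = precision
        self.postings = defaultdict(set)
        self.vectors = {}

    def add(self, doc_id, vector):
        unit = normalize(vector)
        self.vectors[doc_id] = unit
        for token in encode(unit, self.precision):
            self.postings[token].add(doc_id)

    def search(self, vector, keep=None, candidates=10, top_k=3):
        unit = normalize(vector)
        # Phase 1: shortlist documents by the number of shared tokens.
        overlap = defaultdict(int)
        for token in encode(unit, self.precision, keep):
            for doc_id in self.postings.get(token, ()):
                overlap[doc_id] += 1
        shortlist = sorted(overlap, key=lambda d: -overlap[d])[:candidates]
        # Phase 2: re-rank the shortlist by exact cosine similarity
        # (a dot product, since all stored vectors are unit length).
        scored = [
            (d, sum(a * b for a, b in zip(unit, self.vectors[d])))
            for d in shortlist
        ]
        return sorted(scored, key=lambda pair: -pair[1])[:top_k]


index = TokenIndex(precision=1)
index.add("a", [1.0, 0.0, 0.0])
index.add("b", [0.0, 1.0, 0.0])
index.add("c", [0.9, 0.1, 0.0])
results = index.search([1.0, 0.0, 0.0])
```

In this sketch, a lower `precision` or a smaller `keep` produces fewer distinct query tokens, which makes the shortlist phase cheaper but coarser. That is one way to picture the tunable trade-off between search performance and quality mentioned in the description.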