Overview
Faiss: a library for efficient similarity search and clustering of dense vectors.
Faiss contains several methods for similarity search. It assumes that the instances are represented as vectors and are identified by an integer, and that the vectors can be compared with L2 (Euclidean) distances or dot products. Vectors that are similar to a query vector are those that have the lowest L2 distance or the highest dot product with the query vector. It also supports cosine similarity, since this is a dot product on normalized vectors. In short, Faiss supports several similarity metrics: L2 distance, dot product, and cosine similarity.
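For example, cosine-similarity search can be built by L2-normalizing the vectors and using an inner-product index. A minimal sketch (not part of the tutorial code below; IndexFlatIP and normalize_L2 are standard Faiss APIs):
import numpy as np
import faiss
d = 64
x = np.random.random((1000, d)).astype('float32')
faiss.normalize_L2(x)              # normalize rows in place to unit length
index = faiss.IndexFlatIP(d)       # inner product == cosine on unit vectors
index.add(x)
D, I = index.search(x[:5], 4)      # D holds cosine similarities, descending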
Installation
# update conda
conda update conda
# install mkl first
conda install mkl
# GPU version -- install the faiss build that matches your installed CUDA version, otherwise it will fail at runtime
conda install faiss-gpu cudatoolkit=10.0 -c pytorch # For CUDA10
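A CPU-only build is also available from the same channel (assuming no GPU is needed):
# cpu version
conda install faiss-cpu -c pytorch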
Demo
import numpy as np
import faiss
d = 64 # dimension
nb = 100000 # database size
nq = 10000 # nb of queries
np.random.seed(1234) # make reproducible
xb = np.random.random((nb, d)).astype('float32')
xq = np.random.random((nq, d)).astype('float32')
xb[:, 0] += np.arange(nb) / 1000.  # shift the first component so vectors spread out along axis 0
xq[:, 0] += np.arange(nq) / 1000.  # apply the same index-dependent shift to the queries
'''
Faiss is built around the Index object.
It encapsulates the set of database vectors, and optionally preprocesses them to make searching efficient.
There are many types of indexes,
we are going to use the simplest version that just performs brute-force L2 distance search on them: IndexFlatL2.
All indexes need to know, when they are built, the dimensionality of the vectors they operate on, d in our case.
Then, most of the indexes also require a training phase, to analyze the distribution of the vectors. For IndexFlatL2, we can skip this operation.
When the index is built and trained, two operations can be performed on the index: add and search.
To add elements to the index, we call add on xb.
We can also display the two state variables of the index: is_trained, a boolean that indicates whether training is required, and ntotal, the number of indexed vectors.
Some indexes can also store integer IDs corresponding to each of the vectors (but not IndexFlatL2).
If no IDs are provided, add just uses the vector ordinal as the ID, i.e. the first vector gets 0, the second 1, etc.
'''
index = faiss.IndexFlatL2(d) # build the index
print(index.is_trained) # log: True
index.add(xb) # add vectors to the index
print(index.ntotal) # log: 100000
'''
The basic search operation that can be performed on an index is the k-nearest-neighbor search,
i.e. for each query vector, find its k nearest neighbors in the database.
The result of this operation can be conveniently stored in an integer matrix of size nq-by-k,
where row i contains the IDs of the neighbors of query vector i, sorted by increasing distance.
In addition to this matrix, the search operation returns a nq-by-k floating-point matrix with the corresponding squared distances.
As a sanity check, we can first search a few database vectors, to make sure the nearest neighbor is indeed the vector itself.
'''
k = 4 # we want to see 4 nearest neighbors
D, I = index.search(xb[:5], k) # sanity check
print(I)
print(D)
'''
[[ 0 393 363 78]
[ 1 555 277 364]
[ 2 304 101 13]
[ 3 173 18 182]
[ 4 288 370 531]]
[[0. 7.1751738 7.20763 7.2511625]
[0. 6.3235645 6.684581 6.799946 ]
[0. 5.7964087 6.391736 7.2815123]
[0. 7.2779055 7.527987 7.6628466]
[0. 6.7638035 7.2951202 7.3688145]]
The sanity check works as expected: the nearest neighbor of each database vector is the vector itself,
with a corresponding distance of 0. Within a row, distances are increasing.
'''
D, I = index.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
'''
[[ 381 207 210 477]
[ 526 911 142 72]
[ 838 527 1290 425]
[ 196 184 164 359]
[ 526 377 120 425]]
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
Because of the value added to the first component of the vectors,
the dataset is smeared along the first axis in d-dim space.
So the neighbors of the first few vectors are around the beginning of the dataset,
and the ones of the vectors around ~10000 are also around index 10000 in the dataset.
'''
'''
To speed up the search, it is possible to segment the dataset into pieces.
We define Voronoi cells in the d-dimensional space, and each database vector falls in one of the cells.
At search time, only the database vectors y contained in the cell the query x falls in and a few neighboring ones are compared against the query vector.
This is done via the IndexIVFFlat index.
This type of index requires a training stage, that can be performed on any collection of vectors that has the same distribution as the database vectors.
In this case we just use the database vectors themselves.
The IndexIVFFlat also requires another index, the quantizer, that assigns vectors to Voronoi cells.
Each cell is defined by a centroid, and finding the Voronoi cell a vector falls in consists in finding the nearest neighbor of the vector in the set of centroids.
This is the task of the other index, which is typically an IndexFlatL2.
There are two parameters to the search method: nlist, the number of cells, and nprobe, the number of cells (out of nlist) that are visited to perform a search.
The search time roughly increases linearly with the number of probes plus some constant due to the quantization.
'''
nlist = 100 # number of Voronoi cells (cluster centroids)
k = 4
quantizer = faiss.IndexFlatL2(d) # the other index
index = faiss.IndexIVFFlat(quantizer, d, nlist)
assert not index.is_trained
index.train(xb)
assert index.is_trained
index.add(xb) # add may be a bit slower as well
D, I = index.search(xq, k) # actual search
print(I[-5:]) # neighbors of the 5 last queries
'''
[[ 9900 9309 9810 10048]
[11055 10895 10812 11321]
[11353 10164 9787 10719]
[10571 10664 10632 10203]
[ 9628 9554 9582 10304]]
The values are similar, but not exactly the same as for the brute-force search (see above).
This is because some of the results were not in the exact same Voronoi cell.
Therefore, visiting a few more cells may prove useful.
'''
index.nprobe = 10 # default nprobe is 1; visit 10 cells for better recall
D, I = index.search(xq, k)
print(I[-5:]) # neighbors of the 5 last queries
'''
[[ 9900 10500 9309 9831]
[11055 10895 10812 11321]
[11353 11103 10164 9787]
[10571 10664 10632 9638]
[ 9628 9554 10036 9582]]
which is the correct result.
Note that getting a perfect result in this case is merely an artifact of the data distribution,
as it has a strong component on the x-axis which makes it easier to handle.
The nprobe parameter is always a way of adjusting the tradeoff between speed and accuracy of the result.
Setting nprobe = nlist gives the same result as the brute-force search (but slower).
'''
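To see the tradeoff concretely, one can sweep nprobe and compare against the brute-force result. A minimal sketch (not from the original tutorial): it reuses d, xb, xq, k, nlist and the IVF index from above, and measures recall as the fraction of returned IDs matching the exact search, which is a rough proxy.
import time
flat = faiss.IndexFlatL2(d)
flat.add(xb)
_, I_gt = flat.search(xq, k)                    # exact neighbors as ground truth
for nprobe in [1, 5, 10, 50, nlist]:            # nprobe = nlist == brute force
    index.nprobe = nprobe
    t0 = time.time()
    _, I_ivf = index.search(xq, k)
    recall = (I_ivf == I_gt).sum() / I_gt.size  # rough recall proxy
    print('nprobe=%3d recall=%.3f time=%.3fs' % (nprobe, recall, time.time() - t0))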
ngpus = faiss.get_num_gpus()
print('number of GPUS:', ngpus)
res = faiss.StandardGpuResources() # allocate resources (e.g. scratch memory) for a single GPU
# build a flat (CPU) index
index_flat = faiss.IndexFlatL2(d)
# make it into a gpu index
gpu_index_flat = faiss.index_cpu_to_gpu(res, 0, index_flat)
gpu_index_flat.add(xb)
print(gpu_index_flat.ntotal)
D, I = gpu_index_flat.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
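When more than one GPU is available, faiss.index_cpu_to_all_gpus clones a CPU index onto every visible GPU. A minimal sketch, assuming the same data as above:
# replicate the flat index across all available GPUs
cpu_index = faiss.IndexFlatL2(d)
gpu_index = faiss.index_cpu_to_all_gpus(cpu_index)
gpu_index.add(xb)
D, I = gpu_index.search(xq, k)
print(I[-5:])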
# build an index with custom integer IDs via IndexIDMap
index = faiss.IndexFlatL2(xb.shape[1])
ids = np.arange(xb.shape[0]) + 231
index2 = faiss.IndexIDMap(index)
index2.add_with_ids(xb, ids)
k = 4 # we want to see 4 nearest neighbors
D, I = index2.search(xq, k) # actual search
print(I[:5]) # neighbors of the 5 first queries
print(I[-5:]) # neighbors of the 5 last queries
Please credit the source when reposting: goldandrabbit.github.io