CN104376051A - Random structure conformal Hash information retrieval method - Google Patents


Info

Publication number: CN104376051A
Application number: CN201410604395.6A
Authority: CN (China)
Legal status: Pending
Prior art keywords: hash, data, formula, sigma, matrix
Other languages: Chinese (zh)
Inventors: 邵岭, 蔡子贇, 刘力, 余孟洋
Current assignee: Nanjing University of Information Science and Technology
Original assignee: Nanjing University of Information Science and Technology
Application filed by Nanjing University of Information Science and Technology
Priority to CN201410604395.6A

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
        • G06F16/30 Unstructured textual data → G06F16/31 Indexing; Data structures therefor; Storage structures → G06F16/316 Indexing structures → G06F16/325 Hash tables
        • G06F16/20 Structured data, e.g. relational data → G06F16/22 Indexing; Data structures therefor; Storage structures → G06F16/2228 Indexing structures → G06F16/2255 Hash tables
        • G06F16/90 Details of database functions independent of the retrieved data types → G06F16/901 Indexing; Data structures therefor; Storage structures → G06F16/9014 Hash tables

Abstract

The invention relates to a random structure conformal Hash information retrieval method, characterized by the following steps: (1) preserving the important structures of the high dimensional data, reducing the dimensionality of the original high dimensional data with the proposed objective function, and thereby obtaining low dimensional data; (2) computing the basis matrix and the low dimensional matrix of the original high dimensional data with the derived update rules for the basis operator U and the low dimensional data V; (3) setting a threshold to convert the low dimensional real-valued representations of the training set into binary codes, and computing the Hash codes of a test sample with logistic regression, a probabilistic classification model; (4) computing the Hamming distance, i.e. the XOR operation, between the training data and the test sample to obtain the final results. By preserving the distribution of the data and the local and global structures of the high dimensional data, the method obtains a Hash function through multivariate logistic regression, achieves out-of-sample extension, and is suitable for computer vision, data mining, machine learning and similar search fields.

Description

Random structure conformal Hash information retrieval method
Technical field
The invention belongs to the field of computer information and data processing technology, and in particular relates to a random structure conformal Hash information retrieval method for computer vision, data mining, machine learning and similar search applications.
Background technology
In information retrieval, machine learning, pattern recognition and data mining, similarity search is a problem that must be solved. In general, an effective similarity search method builds an index structure in a metric space; early research on similarity search can be traced back to the 1970s. Specifically, when the dimensionality is low (≤ 20), data-structure-based methods such as KD-trees, VP-trees and R+ trees can solve the similarity search problem. However, as the data dimensionality grows, effective similarity search in the information data processing field becomes increasingly difficult. Existing methods adopt the concept of an "approximate value" to solve the similarity search problem: to improve retrieval efficiency, a hash algorithm learns a hash function from Euclidean space to Hamming space. A binary-coded hash algorithm has two advantages: first, binary Hash codes save storage space; second, the Hamming distance (an XOR operation) between the training data and a test sample can be computed efficiently, so the time complexity of a Hash table lookup in similarity search is approximately O(1).
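The O(1)-style lookup argument rests on the XOR trick: the Hamming distance between two binary codes is the population count of their bitwise XOR. A minimal illustrative sketch (not part of the patent) in Python:

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary hash codes: popcount of the XOR."""
    return bin(a ^ b).count("1")

# Two 6-bit toy codes that differ only in the third-lowest bit.
print(hamming_distance(0b101101, 0b101001))  # 1
# Complementary codes differ in every bit.
print(hamming_distance(0b101101, 0b010010))  # 6
```

On modern Python (3.10+), `(a ^ b).bit_count()` computes the same popcount directly.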
Existing hash algorithms can be broadly divided into two kinds: those based on random projection and those based on learning. Locality-sensitive hashing (LSH) is a widely used hash algorithm based on random linear projection that can effectively map data points from a high dimensional space to a low dimensional Hamming space; kernelized locality-sensitive hashing (KLSH) and boosted multi-kernel locality-sensitive hashing (BMKLSH) exploit the kernel space to capture more similarity and achieve better retrieval efficiency. To find the approximate nearest neighbors of a test point in a high dimensional space, Panigrahy proposed an entropy-based hash algorithm. Dong proposed multi-probe locality-sensitive hashing based on a statistical characteristic model, which is currently the best variant of locality-sensitive hashing. In addition, Raginsky and Lazebnik guaranteed, through a distribution-free encoding scheme based on random maps, the relation between the value of a shift-invariant kernel on two vectors and the Hamming distance of their binary codes.
A hash function based on random projection is effective only when the binary Hash code is sufficiently long. Therefore, to obtain more compact and accurate codes, many learning-based hash algorithms have been proposed. By mining the structure of the data, expressing it in an objective function, and solving the associated optimization problem, a learning-based hash algorithm can obtain a hash function. Spectral hashing (SpH) is a typical unsupervised hash algorithm; by enforcing balanced and uncorrelated constraints on the codes, spectral hashing can learn compact binary codes and preserve the similarity of the data. Principal component analysis hashing (PCAH) achieves better quantization than random-map hashing. In addition, semantic hashing (SH) based on restricted Boltzmann machines has been proposed. Liu et al. proposed a graph-based hash algorithm that can automatically discover the neighborhood structure inherent in the data and learn correspondingly compact codes, while anchor graphs can accelerate the spectral analysis. Recently, spherical hashing, a hypersphere-based binary embedding algorithm, has been proposed; this algorithm provides compact data representations and scalable nearest neighbor search.
However, all the above hash methods have certain defects. Although hash methods based on random maps can produce compact codes, a simple linear hash function cannot capture the latent relations between data points. Meanwhile, because the linear formula is computed from high dimensional matrices, it brings very high computational complexity. In addition, when the code word is very long, learning-based hash algorithms are not very effective. Moreover, hash methods that first reduce the dimensionality of the raw data cannot obtain low dimensional data that preserves fine structure.
In recent years, as a matrix decomposition algorithm that can learn non-negative part-based representations of objects, non-negative matrix factorization (NMF) has played an important role in information retrieval and data mining. A non-negative matrix X ∈ R^{M×N} whose N columns are M-dimensional data vectors can be decomposed by NMF into two non-negative matrices U = [u_id] ∈ R^{M×D} and V = [v_dj] ∈ R^{D×N}, whose product approximates the original matrix well: X ≈ UV. Lee and Seung also proposed two objective functions to assess the distance between the two non-negative matrices X and UV; the difference-based objective function can be expressed as:

O_F = \|X - UV\|^2 = \sum_{i,j} \left( x_{ij} - \sum_{d=1}^{D} u_{id} v_{dj} \right)^2   (1)

where || · || in formula (1) is the Frobenius norm.
To optimize the above objective function, iterative update steps can be used to obtain a local minimum of O_F:

u_{id}^{(t+1)} = u_{id}^{(t)} \frac{(X V^T)_{id}}{(U V V^T)_{id}}, \qquad v_{dj}^{(t+1)} = v_{dj}^{(t)} \frac{(U^T X)_{dj}}{(U^T U V)_{dj}}   (2)

It has been proved that the iterative update algorithm of formula (2) can effectively find a local minimum of O_F; the matrix V obtained from NMF is the low dimensional representation of X, while U is the basis matrix.
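The multiplicative updates of formula (2) can be sketched in a few lines of NumPy; a generic illustration (dimensions, iteration count and random data are arbitrary choices, not from the patent):

```python
import numpy as np

def nmf(X, D, iters=300, eps=1e-9, seed=0):
    """Factor a non-negative X (M x N) into U (M x D) @ V (D x N)
    with the multiplicative updates of formula (2)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, D)) + eps
    V = rng.random((D, N)) + eps
    for _ in range(iters):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update of u_id
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update of v_dj
    return U, V

X = np.random.default_rng(1).random((20, 30))
U, V = nmf(X, D=5)
rel_err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
```

Entries stay non-negative by construction, and the reconstruction error is non-increasing under these updates.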
At present there are many more NMF-based algorithms. Local non-negative matrix factorization (LNMF) can better capture local features and learn visual patterns from part-based, spatially localized representations. To improve the LNMF algorithm, Cai et al. proposed locality-preserving non-negative matrix factorization (LPNMF), which can analyze the similarity between two hidden data points. Based on these methods, an effective landmark-based method that can compress data, accelerated locality-preserving non-negative matrix factorization (A-LPNMF), was proposed to resolve the computational complexity of LPNMF. To discover the underlying manifold structure, Cai et al. proposed graph-regularized non-negative matrix factorization (GNMF), which combines matrix decomposition with the graph structure. Constrained non-negative matrix factorization (CNMF) takes label information as an additional constraint, so that data points of the same class merge in the new representation domain. Influenced by sparse coding, non-negative local coordinate factorization (NLCF) adds a local coordinate constraint to guarantee the sparseness of the obtained representation.
In summary, the deficiencies of the prior art are as follows: first, because existing NMF algorithms do not solve the problem of preserving the local and global structures of the original high dimensional data, the obtained low dimensional data cannot inherit the features of the high dimensional data to the largest extent; second, existing hash algorithms based on random projection must produce many Hash tables to obtain acceptable retrieval results, and a simple linear hash function cannot capture the latent relations between data points; third, when the code word is very long, learning-based hash algorithms cannot obtain effective results.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a random structure conformal Hash information retrieval method (SSPH). On the basis of preserving the distribution of the data and the local and global structures of the high dimensional data, the invention obtains a hash function through multivariate logistic regression and can realize out-of-sample extension.
The random structure conformal Hash information retrieval method proposed by the invention is characterized by the following concrete steps:
Step 1: preserve the important structures of the high dimensional data by reducing the dimensionality of the original high dimensional data with the proposed objective function, thereby obtaining low dimensional data. To preserve the important structures of the high dimensional data as much as possible, minimize the KL divergence between the joint probability distribution of the high dimensional space and the heavy-tailed joint probability distribution of the low dimensional space:

C = \lambda KL(P \| Q)   (3)

In formula (3), P is the joint probability distribution of the high dimensional space, with entries p_ij; Q is the joint probability distribution of the low dimensional space, with entries q_ij. The concrete steps are as follows:
Step 1.1: the conditional probability p_ij expresses the similarity between data points x_i and x_j, proportional to their probability density. Only pairwise similarities matter, so p_ii and q_ii are set to 0; at the same time the pairs have the symmetry properties p_ij = p_ji and q_ij = q_ji. The pairwise similarity in the high dimensional space can be expressed as:

p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma_k^2\right)}   (4)
Step 1.2: σ_i is the variance of the Gaussian distribution centered at data point x_i, and each data point x_i has a corresponding perplexity; the low dimensional map uses a heavy-tailed probability distribution, and the joint probability q_ij is defined as:

q_{ij} = \frac{\left(1 + \|v_i - v_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|v_k - v_l\|^2\right)^{-1}}   (5)
Formula (5) defines an infinite mixture of Gaussians; because there is no exponential term, it can evaluate the density of a point faster than a single Gaussian. The cost function based on the KL divergence, formula (6), can effectively assess the key points of the data distribution.
Step 1.3: q_ij and p_ij yield:

G = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}   (6)
The gradient of the KL divergence between P and Q in formula (6) can be expressed as:

\frac{\partial G}{\partial v_i} = 4 \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)\left(1 + \|v_i - v_j\|^2\right)^{-1}   (7)
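Formulas (4), (5) and (7) can be written out directly in NumPy; a small sketch under the simplifying assumption of one shared bandwidth σ for all points (the patent uses a per-point σ_i):

```python
import numpy as np

def joint_P(X, sigma=1.0):
    """High-dimensional joint probabilities p_ij, formula (4), one shared sigma."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # p_ii = 0
    return W / W.sum()

def joint_Q(V):
    """Low-dimensional heavy-tailed joint probabilities q_ij, formula (5)."""
    d2 = np.square(V[:, None, :] - V[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0.0)            # q_ii = 0
    return W / W.sum()

def kl_grad(P, Q, V):
    """Gradient of KL(P||Q) w.r.t. every low-dimensional point v_i, formula (7)."""
    diff = V[:, None, :] - V[None, :, :]                 # v_i - v_j
    w = 1.0 / (1.0 + np.square(diff).sum(-1))            # (1 + ||v_i - v_j||^2)^-1
    return 4.0 * (((P - Q) * w)[:, :, None] * diff).sum(axis=1)

X = np.random.default_rng(0).random((10, 5))
V = np.random.default_rng(1).random((10, 2))
P, Q = joint_P(X), joint_Q(V)
grad = kl_grad(P, Q, V)
```

Both matrices are symmetric with zero diagonal and normalize over all pairs, matching the k ≠ l sums in formulas (4) and (5); a small step of V against `grad` reduces the divergence.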
Step 1.4: combining the structure-preserving term of formula (3) with NMF gives the new objective function:

O_f = \|X - UV\|^2 + \lambda KL(P \| Q)   (8)

where V ∈ {0,1}^{D×N}, X, U, V ≥ 0, U ∈ R^{M×D}, X ∈ R^{M×N}, and λ controls the smoothness of the new representation.
In most cases, using the low dimensional data of NMF alone is neither effective nor meaningful for practical applications; to obtain better results in information retrieval, the term λKL(P||Q) is introduced to preserve the structure of the raw data.
Step 2: use the derived update rules for the basis operator U and the low dimensional data V to compute the basis matrix and the low dimensional matrix of the original high dimensional data. The optimization comprises the following concrete steps:
Step 2.1: the discrete condition V ∈ {0,1}^{D×N} of formula (8) cannot be computed directly in the optimization; to obtain real values, the data V ∈ {0,1}^{D×N} is first relaxed onto the domain V ∈ R^{D×N}.
Step 2.2: the Lagrangian of the problem is then set to:

L = \|X - UV\|^2 + \lambda KL(P \| Q) + \mathrm{Tr}(\Phi U^T) + \mathrm{Tr}(\Psi V^T)   (9)

where the matrices Φ and Ψ in formula (9) are two Lagrange multiplier matrices; writing g = λKL(P||Q), its gradient is:

\frac{\partial g}{\partial v_i} = 4\lambda \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)\left(1 + \|v_i - v_j\|^2\right)^{-1}   (10)
Step 2.3: set the gradients to 0 to minimize O_f:

\frac{\partial L}{\partial V} = 2\left(-U^T X + U^T U V\right) + G + \Psi = 0   (11)

\frac{\partial L}{\partial U} = 2\left(-X V^T + U V V^T\right) + \Phi = 0   (12)

where:

G = \frac{\partial g}{\partial V} = \left[ \frac{\partial g}{\partial v_1}, \ldots, \frac{\partial g}{\partial v_N} \right]
Step 2.4: in addition, the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0 hold; multiplying both sides of formula (11) and formula (12) by V_ij and U_ij at the corresponding positions gives:

\left( 2\left(-U^T X + U^T U V\right) + G \right)_{ij} V_{ij} = 0   (13)

2\left(-X V^T + U V V^T\right)_{ij} U_{ij} = 0   (14)

where:

G_{ij} = \left( \frac{\partial g}{\partial v_j} \right)_i = \left( 4\lambda \sum_{k=1}^{N} (p_{jk} - q_{jk})(v_j - v_k)\left(1 + \|v_j - v_k\|^2\right)^{-1} \right)_i = 4\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} - q_{jk} V_{ij} - p_{jk} V_{ik} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}
Step 2.5: for any i and j this yields the following update rules:

V_{ij} \leftarrow \frac{(U^T X)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1 + \|v_j - v_k\|^2}}{(U^T U V)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}} V_{ij}   (15)

U_{ij} \leftarrow \frac{(X V^T)_{ij}}{(U V V^T)_{ij}} U_{ij}   (16)

where all elements of U and V are positive, and each update of U or V does not increase the objective function.
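Update rules (15) and (16) admit a vectorized reading; an illustrative sketch in which P and Q are the joint probabilities of formulas (4) and (5), the columns of V are the points v_j, and the small eps guarding the division is our addition:

```python
import numpy as np

def update_V(X, U, V, P, Q, lam=1.0, eps=1e-12):
    """One multiplicative V-update of formula (15); columns of V are the v_j."""
    cols = V.T
    d2 = np.square(cols[:, None, :] - cols[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + d2)              # W_jk = (1 + ||v_j - v_k||^2)^{-1}
    PW, QW = P * W, Q * W             # elementwise; symmetric since P, Q, W are
    num = U.T @ X + 2 * lam * (V @ PW + V * QW.sum(axis=0))
    den = U.T @ U @ V + 2 * lam * (V * PW.sum(axis=0) + V @ QW)
    return V * num / (den + eps)

def update_U(X, U, V, eps=1e-12):
    """Multiplicative U-update of formula (16)."""
    return U * (X @ V.T) / (U @ V @ V.T + eps)

rng = np.random.default_rng(0)
X = rng.random((8, 12))
U = rng.random((8, 3)) + 0.1
V = rng.random((3, 12)) + 0.1
# P, Q: pairwise joint probabilities of formulas (4) and (5) over the 12 points
def _norm(W):
    np.fill_diagonal(W, 0.0)
    return W / W.sum()
P = _norm(np.exp(-np.square(X.T[:, None, :] - X.T[None, :, :]).sum(-1) / 2.0))
Q = _norm(1.0 / (1.0 + np.square(V.T[:, None, :] - V.T[None, :, :]).sum(-1)))
V1 = update_V(X, U, V, P, Q, lam=0.5)
U1 = update_U(X, U, V1)
```

With λ = 0 the V-update degenerates to the plain NMF rule of formula (2), which is a quick sanity check on the vectorization.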
Step 3: set a threshold to convert the low dimensional real-valued representations of the training set into binary codes, and compute the Hash codes of the test samples with logistic regression, a probabilistic classification model, thereby forming the hash function by the following concrete steps:
Step 3.1: the basis U = [u_id] ∈ R^{M×D} and the low dimensional matrix V = [v_dj] ∈ R^{D×N}, where d << D, are obtained from formulas (15) and (16); a threshold is then set to convert the low dimensional real-valued representation V = [v_1, ..., v_N] into binary codes: if the f-th element of vector v_n is larger than the threshold, the real value is set to 1, otherwise 0, where f = 1, ..., d and n = 1, ..., N.
Step 3.2: by the principles of information theory, a source with a uniform probability distribution attains maximum entropy; in particular, if the entropy of the codes over the data is very small, the whole file is mapped onto a small fraction of the codes. To guarantee the efficiency of semantic hashing and satisfy the maximum-entropy principle, the median of v_p is used as the threshold for the elements of v_p, so that half of the values are set to 1 and the other half to 0; by this method the real values are quantized into binary codes.
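Step 3.2's median thresholding, read per dimension so that each bit is balanced, can be sketched as follows (an illustration; the patent leaves the exact orientation of v_p implicit):

```python
import numpy as np

def binarize(V):
    """Threshold each dimension (row of the D x N matrix V) at its median,
    so half the bits in every dimension are 1 and half are 0."""
    t = np.median(V, axis=1, keepdims=True)
    return (V > t).astype(np.uint8)

V = np.array([[0.1, 0.9, 0.4, 0.7],
              [0.8, 0.2, 0.6, 0.3]])
B = binarize(V)   # [[0, 1, 0, 1], [1, 0, 1, 0]]
```

Balancing each bit at its median realizes the maximum-entropy argument of step 3.2: every bit carries close to one bit of information.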
Step 3.3: from the above process, the binary codes of the training data can be obtained; to encode a new sample directly, a hash function is needed. Because of the binary-code setting, the Hash codes of the test samples are computed with a probabilistic classification model, logistic regression. Before obtaining the logistic regression function, the binary codes are written as \hat{v}_n \in \{0,1\}^d, n = 1, \ldots, N, the training samples as V = [v_1, \ldots, v_N], and the associated d × d regression matrix as Θ. The logistic regression function is the sigmoid:

h_\Theta(v) = \frac{1}{1 + e^{-\Theta^T v}}   (17)

whose output y is 1 or 0. The associated regression objective is defined as the regularized cross-entropy:

J(\Theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{f=1}^{d} \left[ \hat{v}_{fn} \log h_\Theta(v_n)_f + \left(1 - \hat{v}_{fn}\right) \log\left(1 - h_\Theta(v_n)_f\right) \right] + \delta \|\Theta\|^2   (18)

where δ‖Θ‖² is the regularization term that avoids over-fitting in the logistic regression.
Step 3.4: to find the parameter Θ that minimizes J(Θ), gradient descent repeatedly updates each parameter with the gradient:

\frac{\partial J}{\partial \Theta} = \frac{1}{N} \sum_{n=1}^{N} v_n \left( h_\Theta(v_n) - \hat{v}_n \right)^T + 2\delta\Theta   (19)

via the update formula:

\Theta^{(t+1)} = \Theta^{(t)} - \alpha \frac{\partial J}{\partial \Theta}   (20)

The updates are run until the difference between J(Θ^{(t+1)}) and J(Θ^{(t)}) reaches convergence, which yields the regression matrix Θ.
Step 3.5: finally, the linear mapping Θ^T v yields the real-valued low dimensional representation, since h_Θ is a sigmoid function; the Hash code of a new sample is expressed as:

\hat{h} = \lfloor h_\Theta(v) \rceil   (21)

where ⌊·⌉ rounds each input to the nearest integer, so the binarization threshold is 0.5: if a bit of h_Θ(v) is greater than 0.5 it is expressed as 1, otherwise 0. The SSPH codes of the training samples and the test samples are thus obtained. The retrieval flow of SSPH is expressed as follows:
Random structure conformal Hash retrieval method (SSPH).
Input:
a group of training data X = \{x_i \in R^d\}_{i=1}^{n};
d, the target dimension of the Hash codes;
α, the learning rate of the logistic regression;
the regularization parameters {δ, λ}.
Output: the basis matrix U and the regression matrix Θ.
1. Compute the basis matrix U and the low dimensional matrix V with formulas (15) and (16),
2. repeating until convergence;
3. Obtain the regression matrix Θ from formula (20); the SSPH code of a sample is defined in formula (21).
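Steps 3.3 to 3.5 above amount to fitting the d × d matrix Θ by gradient descent on a regularized logistic loss and rounding the sigmoid outputs at 0.5. A toy sketch (the loss form, learning rate, iteration count and data are illustrative assumptions, not the patent's exact formulas):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_theta(V, B, alpha=0.5, delta=1e-3, iters=2000):
    """Gradient descent for the regression matrix Theta (d x d) mapping
    real codes V (d x N) onto binary codes B (d x N)."""
    d, N = V.shape
    Theta = np.zeros((d, d))
    for _ in range(iters):
        H = sigmoid(Theta @ V)                           # predicted bit probabilities
        Theta -= alpha * ((H - B) @ V.T / N + delta * Theta)
    return Theta

def hash_code(Theta, v):
    """Round the sigmoid output at the 0.5 threshold, as in formula (21)."""
    return (sigmoid(Theta @ v) > 0.5).astype(np.uint8)

V = np.array([[0.1, 0.9, 0.2, 0.8],
              [0.9, 0.1, 0.8, 0.2]])
B = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=np.uint8)
Theta = fit_theta(V, B)
```

Once Θ is learned, hashing a new sample is a single matrix-vector product plus a threshold, which is what enables the out-of-sample extension.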
Step 4: compute the Hamming distance, an XOR operation, between the training data and the test samples to obtain the final results. The computational complexity is analyzed as follows:
The computational complexity of the random structure conformal Hash retrieval method (SSPH) comprises three parts. The first part computes the NMF, with complexity O(NMKd), where N is the size of the database, M and d are the dimensions of the high dimensional and low dimensional data respectively, and K is the number of classes in the database. The second part computes the cost function of the objective (formula (6)), with complexity O(N²d). The third part is the logistic regression process, with complexity O(Nd²). Therefore, the overall computational complexity of SSPH is O(tNMKd + N²d + tNd²), where t is the number of iterations.
Compared with the prior art, the remarkable advantages of the invention are: first, the invention solves the difficult problem that the unsupervised learning algorithm (NMF) cannot find the essential geometry of the data space; the proposed objective function is solved with efficient non-negative matrix factorization and logistic regression, and the local structure of the high dimensional data is preserved in the low dimensional map. Second, the invention proposes an optimization framework for the objective function and gives the update rules of the framework on two benchmark databases, SIFT1M and GIST1M. Third, the optimization result obtained by the invention places the training samples in a real-valued domain, so that the real-valued results can be transformed into binary codes. The invention is applicable to fields such as computer vision, data mining, machine learning and similar search, and produces a significant effect on the nearest neighbor retrieval problem for large-scale high dimensional data.
Accompanying drawing explanation
Fig. 1 is the flow block diagram of the random structure conformal Hash information retrieval method (SSPH) of the invention.
Fig. 2 is the implementation step block diagram of the random structure conformal Hash information retrieval method (SSPH) of the invention.
Fig. 3 comprises Fig. 3a, Fig. 3b, Fig. 3c and Fig. 3d, comparing the invention with 10 popular methods by mean average precision and precision-recall curves; Fig. 3a shows the precision-recall comparison on database SIFT1M for a code length of 48 bits; Fig. 3b shows the precision-recall comparison on database GIST1M for a code length of 48 bits; Fig. 3c shows the mean average precision comparison on database SIFT1M; Fig. 3d shows the mean average precision comparison on database GIST1M.
Embodiment
The specific embodiments of the invention are described in further detail below with reference to the drawings and examples.
The flow of the random structure conformal Hash information retrieval method (SSPH) proposed by the invention is shown in Fig. 1: visual descriptors are extracted from the training database; the proposed objective function and the derived update rules for the basis operator U and the low dimensional data V reduce the dimensionality of the original high dimensional data; and the Hash codes of the test samples are computed with the probabilistic classification model, logistic regression, which yields the hash function. During testing, the visual descriptors of the obtained test image are substituted into the derived hash function to produce the Hash code of the test sample, which is then XORed with the Hash codes of the training samples to obtain the final result.
The implementation steps of the random structure conformal Hash information retrieval method (SSPH) proposed by the invention are shown in Fig. 2 and comprise the following concrete steps:
Step 1: preserve the important structures of the high dimensional data, reducing the dimensionality of the original high dimensional data with the proposed objective function, thereby obtaining low dimensional data;
Step 2: use the derived update rules for the basis operator U and the low dimensional data V to compute the basis matrix and the low dimensional matrix of the original high dimensional data;
Step 3: set a threshold to convert the low dimensional real-valued representations of the training set into binary codes, and compute the Hash codes of the test samples with logistic regression, a probabilistic classification model;
Step 4: compute the Hamming distance, an XOR operation, between the training data and the test samples to obtain the final results.
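Step 4's retrieval reduces to ranking the database by Hamming distance to the query; with codes kept as 0/1 arrays, XOR is elementwise inequality and the distance is a row sum. A minimal sketch with toy codes (not from the patent):

```python
import numpy as np

def hamming_rank(db_codes, query):
    """Rank database items by Hamming distance (XOR + popcount) to the query."""
    dists = (db_codes ^ query).sum(axis=1)       # XOR of 0/1 codes, then popcount
    return np.argsort(dists, kind="stable"), dists

db = np.array([[0, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 1]], dtype=np.uint8)
q = np.array([0, 1, 1, 0], dtype=np.uint8)
order, dists = hamming_rank(db, q)   # dists = [0, 1, 3]
```

At scale, codes would be packed into machine words so that the XOR and popcount run per word rather than per bit.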
An application example of the random structure conformal Hash information retrieval method of the invention is further illustrated below.
Embodiment 1: the random structure conformal Hash information retrieval method of the invention solves problems in similarity search. Two large-scale databases are provided: one is SIFT1M, based on SIFT descriptors; the other is GIST1M, based on GIST descriptors. The SIFT database contains 1,000,000 data points of dimension 128, and the GIST database contains 1,000,000 data points of dimension 960. The basic parameters of the two databases in similarity search are listed in Table 1.

Table 1: basic parameters of the two databases in similarity search

Database                    SIFT (dim=128)    GIST (dim=960)
Size of database            1,000,000         1,000,000
Size of test samples        10,000            10,000
Size of training samples    990,000           990,000
To preserve the important structures of the high dimensional data as much as possible, the invention minimizes the KL divergence between the joint probability distribution of the high dimensional space and the heavy-tailed joint probability distribution of the low dimensional space:

C = \lambda KL(P \| Q)

Combining this structure-preserving term with NMF gives the new objective function:

O_f = \|X - UV\|^2 + \lambda KL(P \| Q)

In this objective function, V ∈ {0,1}^{D×N}, X, U, V ≥ 0, U ∈ R^{M×D}, X ∈ R^{M×N}, and λ controls the smoothness of the new representation.
To obtain real values, the data V ∈ {0,1}^{D×N} is first relaxed onto the domain V ∈ R^{D×N}, and the Lagrangian of the problem is set up with two Lagrange multiplier matrices Φ and Ψ; in addition, the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0 hold.
For any i, j the following update rules are adopted:

V_{ij} \leftarrow \frac{(U^T X)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1 + \|v_j - v_k\|^2}}{(U^T U V)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}} V_{ij}

U_{ij} \leftarrow \frac{(X V^T)_{ij}}{(U V V^T)_{ij}} U_{ij}

All elements of U and V in the above formulas are positive; that each update of U or V does not increase the objective function has been demonstrated in "Algorithms for Non-Negative Matrix Factorization".
Then, threshold value is set and low-dimensional real number is showed V=[v 1..., v n] convert binary code to: if in vector v nin f element larger than threshold value, this real number value is set to 1, otherwise is 0, wherein f=1 ..., d and n=1 ..., N;
In the above process, the binary code of training intensive data can only be obtained.Therefore, a new sample, directly cannot obtain hash function.In the present invention, due to the environment of binary code, in test sample book, Hash codes can be calculated by probability of use statistical classification model-logistic regression, namely before obtaining logistic regression function, binary code representation be become wherein and n=1 ..., N; Therefore training sample can be expressed as correlation regression matrix θ based on d × d can be expressed as
Wherein 1 is the matrix of N × 1, uses as the regularization term avoiding over-fitting in logistic regression;
Again by linear mapping matrix the low-dimensional obtaining real number represents, because it is sigmoid function; Hash codes for new sample can be expressed as:
Wherein illustrate each input all get nearest integer function; Defining binary threshold value is 0.5, if therefore from bit be greater than 0.5, can 1 be expressed as, otherwise be 0, thus obtain the SSPH code of training sample and test sample book.
In the above application of the invention, 10K randomly drawn data points serve as test samples, while the rest of the database serves as the image database. During training, if a data point lies within the top two percent of nearest points, it is labeled 1, otherwise 0. During testing, if a returned point is within the top two percent closest, it is considered a true neighbor. Because ranking by Hamming distance is very fast in Hash code applications, Hamming ranking is used to measure the retrieval task. The application results are judged by mean average precision and precision-recall curves. The results show that the random structure conformal Hash information retrieval method (SSPH) of the invention is consistently more accurate than the others at different code lengths.
Ten popular Hash retrieval methods are further compared, comprising LSH, BSSC, RBM, SpH, STH, AGH, KLSH, PCAH, KSH and CH; all 10 methods are compared at the different code lengths 32, 48, 64 and 80. On each database, the random structure conformal Hash information retrieval method (SSPH) of the invention chooses the learning rate by cross validation from the values 0.01, 0.02, 0.03, ..., 0.10, and the regularization parameter is set to 0.35.
As can be seen from Fig. 3, the random structure conformal Hash information retrieval method (SSPH) of the invention has the best effect on the two large databases compared with the other prior art methods. Meanwhile, on the SIFT1M and GIST1M databases, the mean average precision at code lengths of 32 and 48 bits, together with the comparison of training time and test time, is shown in Table 2; in training time, the SSPH of the invention is more efficient than STH, KSH and BSSC, so SSPH is a highly effective method for large-scale data retrieval.
Table 2: mean average precision at code lengths of 32 and 48 bits on the SIFT1M and GIST1M databases, with comparison of training time and test time
Matters not addressed in the specific embodiments of the invention belong to techniques well known in the art and can be implemented with reference to known techniques.
The invention has been verified through repeated application and achieves satisfactory results.

Claims (5)

1. A random structure conformal Hash information retrieval method, characterized in that it comprises the following concrete steps:
Step 1: to preserve the important structure of the high-dimensional data, reduce the dimensionality of the original high-dimensional data with the proposed objective function, thereby obtaining low-dimensional data;
Step 2: using the update rules for the basis operator U and the derived low-dimensional data V, compute the basis and the low-dimensional matrix of the original high-dimensional data;
Step 3: set a threshold to convert the low-dimensional real-valued representation of the training set into binary codes, and compute the hash codes of test samples with the probabilistic classification model logistic regression;
Step 4: compute the Hamming distance between the training data and the test samples by an XOR operation, and obtain the final retrieval result.
2. The random structure conformal Hash information retrieval method according to claim 1, characterized in that preserving the important structure of the high-dimensional data described in step 1, namely reducing the dimensionality of the original high-dimensional data with the proposed objective function to obtain low-dimensional data, refers to minimizing the KL divergence between the joint probability distribution of the high-dimensional space and the heavy-tailed joint probability distribution of the low-dimensional space:

C = \lambda \, KL(P \| Q) \qquad (3),

in formula (3), P is the joint probability distribution of the high-dimensional space, with entries p_{ij}; Q is the joint probability distribution of the low-dimensional space, with entries q_{ij}; the concrete steps comprise:
Step 1.1: the conditional probability p_{ij} expresses the similarity between data points x_i and x_j, where the similarity of x_j to x_i is proportional to the probability density under a Gaussian centred at x_i; only pairwise similarities need to be modelled, so p_{ii} and q_{ii} are set to 0; the pairs also satisfy p_{ij} = p_{ji} and q_{ij} = q_{ji}; the pairwise similarity in the high-dimensional space is expressed as:
p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne l} \exp(-\|x_k - x_l\|^2 / 2\sigma_k^2)} \qquad (4),
Step 1.2: here \sigma_i denotes the bandwidth of the Gaussian distribution centred at data point x_i, so each data point x_i has its own local scale; the low-dimensional map uses a heavy-tailed probability distribution, and the joint probability q_{ij} is defined as:
q_{ij} = \frac{(1 + \|v_i - v_j\|^2)^{-1}}{\sum_{k \ne l} (1 + \|v_k - v_l\|^2)^{-1}} \qquad (5),
formula (5) defines an infinite mixture of Gaussians; because it contains no exponential term, the density of a point can be evaluated faster than under a single Gaussian; the cost function of formula (6), built on this KL divergence, effectively captures the important structure of the data distribution;
Step 1.3: q_{ij} and p_{ij} are compared through:
g = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (6),
the gradient of the KL divergence between P and Q in formula (6) is expressed as:
\frac{\partial g}{\partial v_i} = 4 \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)(1 + \|v_i - v_j\|^2)^{-1} \qquad (7);
Step 1.4: combining the structure-preserving part of formula (3) with NMF gives the new objective function below:
O_f = \|X - UV\|^2 + \lambda \, KL(P \| Q) \qquad (8),
herein V \in \{0,1\}^{D \times N}, X, U, V \ge 0, U \in R^{M \times D}, X \in R^{M \times N}, and \lambda controls the smoothness of the new representation;
in most cases the low-dimensional data produced by NMF alone is not effective and meaningful enough for practical applications; to obtain better results in information retrieval, the term \lambda KL(P \| Q) is introduced to preserve the structure of the raw data.
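The distributions of formulas (4) and (5) and the cost of formula (6) can be sketched in NumPy as below; this is an illustrative sketch that assumes a single global bandwidth \sigma instead of the per-point \sigma_i of Step 1.2, and all helper names are invented:

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between the columns of A."""
    sq = np.sum(A ** 2, axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * A.T @ A

def joint_p(X, sigma=1.0):
    """High-dimensional joint distribution p_ij of formula (4),
    simplified to one global bandwidth `sigma`."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)          # p_ii = 0
    return P / P.sum()

def joint_q(V):
    """Low-dimensional heavy-tailed distribution q_ij of formula (5)."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(V))
    np.fill_diagonal(Q, 0.0)          # q_ii = 0
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Cost g = KL(P || Q) of formula (6)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))
```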
3. The random structure conformal Hash information retrieval method according to claim 1, characterized in that using the update rules for the derived basis operator U and low-dimensional data V described in step 2 to compute the basis and the low-dimensional matrix of the original high-dimensional data refers to the following optimization steps:
Step 2.1: the discrete constraint V \in \{0,1\}^{D \times N} of formula (8) cannot be optimized directly; to obtain real values, the data V is first relaxed onto the domain V \in R^{D \times N};
Step 2.2: the Lagrangian of the problem is then set to:

L = \|X - UV\|^2 + \lambda \, KL(P \| Q) + \mathrm{tr}(\Phi U^T) + \mathrm{tr}(\Psi V^T) \qquad (9),

where the matrices \Phi and \Psi in formula (9) are two Lagrange multiplier matrices enforcing U \ge 0 and V \ge 0; the gradient of g is thus:
\frac{\partial g}{\partial v_i} = 4\lambda \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)(1 + \|v_i - v_j\|^2)^{-1} \qquad (10);
Step 2.3: setting the gradients of the Lagrangian to 0 to minimize O_f gives:

\frac{\partial L}{\partial U} = 2(-XV^T + UVV^T) + \Phi = 0 \qquad (11),

\frac{\partial L}{\partial V} = 2(-U^T X + U^T U V) + G + \Psi = 0 \qquad (12),

wherein:

G = \frac{\partial g}{\partial V} = \left[ \frac{\partial g}{\partial v_1}, \ldots, \frac{\partial g}{\partial v_N} \right];
Step 2.4: in addition, the KKT conditions \Phi_{ij} U_{ij} = 0 and \Psi_{ij} V_{ij} = 0 hold; multiplying the corresponding positions on both sides of formula (12) and formula (11) by V_{ij} and U_{ij} respectively gives:
\left( 2(-U^T X + U^T U V) + G \right)_{ij} V_{ij} = 0 \qquad (13),

2(-XV^T + UVV^T)_{ij} U_{ij} = 0 \qquad (14),
wherein:

G_{ij} = \left( \frac{\partial g}{\partial v_j} \right)_i = \left( 4\lambda \sum_{k=1}^{N} (p_{jk} - q_{jk})(v_j - v_k)(1 + \|v_j - v_k\|^2)^{-1} \right)_i = 4\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} - q_{jk} V_{ij} - p_{jk} V_{ik} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2};
Step 2.5: the following update rules then hold for arbitrary i and j:
V_{ij} \leftarrow \frac{(U^T X)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1 + \|v_j - v_k\|^2}}{(U^T U V)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}} \, V_{ij} \qquad (15),
U_{ij} \leftarrow \frac{(XV^T)_{ij}}{(UVV^T)_{ij}} \, U_{ij} \qquad (16),
wherein all elements of U and V are positive, and each application of the update rules does not increase the objective function (monotone non-increase in U and V).
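The multiplicative rules (15) and (16) can be vectorized as below; a minimal sketch under the stated nonnegativity constraints, with hypothetical function names and no convergence test:

```python
import numpy as np

def update_UV(X, U, V, P, Q, lam):
    """One round of the multiplicative updates (15) and (16).

    X: M x N data, U: M x D basis, V: D x N low-dimensional codes,
    P, Q: N x N joint distributions, lam: the lambda of formula (8)."""
    eps = 1e-12
    # Student-t weights w_jk = (1 + ||v_j - v_k||^2)^(-1)
    sq = np.sum(V ** 2, axis=0)
    W = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * V.T @ V)
    np.fill_diagonal(W, 0.0)            # k = j terms vanish (p_jj = q_jj = 0)
    A, B = P * W, Q * W
    # rule (15): numerator carries p_jk V_ik + q_jk V_ij, denominator the rest
    num = U.T @ X + 2.0 * lam * (V @ A.T + V * B.sum(axis=1)[None, :])
    den = U.T @ U @ V + 2.0 * lam * (V * A.sum(axis=1)[None, :] + V @ B.T)
    V = V * num / (den + eps)
    # rule (16): plain NMF multiplicative update for the basis
    U = U * (X @ V.T) / (U @ V @ V.T + eps)
    return U, V
```

With lam = 0 the two updates reduce to the standard NMF multiplicative rules, which is a useful sanity check for the monotone non-increase property stated above.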
4. The random structure conformal Hash information retrieval method according to claim 1, characterized in that setting a threshold described in step 3 to convert the low-dimensional real-valued representation of the training set into binary codes, and computing the hash codes of test samples with the probabilistic classification model logistic regression, refers to forming the hash function by the following concrete steps:
Step 3.1: the basis U = [u_{id}] \in R^{M \times D} and the low-dimensional matrix V = [v_{dn}] \in R^{D \times N}, where D \ll M, are obtained from formula (15) and formula (16); a threshold must then be set to convert the low-dimensional real-valued representation V = [v_1, \ldots, v_N] into binary codes: if the f-th element of vector v_n is larger than the threshold, that value is set to 1, otherwise to 0, where f = 1, \ldots, D and n = 1, \ldots, N;
Step 3.2: by the principles of information theory, a source reaches maximum entropy under a uniform probability distribution; in particular, if the entropy of the codes over the data is very small, the whole file set is mapped onto a small fraction of the codes; to guarantee the efficiency of semantic hashing, the semantic hashing algorithm should attain maximum entropy, so the median of the values in v_p is used as the threshold for the elements of v_p: half of the values are set to 1 and the other half to 0, and by this method the real-valued codes are converted into binary codes;
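The maximum-entropy thresholding of Step 3.2 can be sketched as follows; since the claim leaves the scope of the median ambiguous, this sketch assumes a per-dimension median over the training samples, so that each bit fires on about half the data:

```python
import numpy as np

def binarize_by_median(V):
    """Threshold each dimension of the D x N real-valued matrix V at its
    own median, yielding roughly balanced {0,1} bits as argued in Step 3.2."""
    thresholds = np.median(V, axis=1, keepdims=True)   # one threshold per bit
    return (V > thresholds).astype(np.uint8)
```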
Step 3.3: the above process yields the binary codes of the training data; to hash a new sample directly, a hash function is required; since the targets are binary codes, the probabilistic classification model logistic regression is used to compute the hash codes of test samples; before deriving the logistic regression function, the binary codes are written as \hat{V} = [\hat{v}_1, \ldots, \hat{v}_N], where \hat{v}_n \in \{0,1\}^D and n = 1, \ldots, N; the training samples are written as V = [v_1, \ldots, v_N], and the corresponding D \times D regression matrix is written as \Theta; the logistic regression cost function is expressed as:
J(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Cost}(h_\Theta(v_n), \hat{v}_n) \qquad (17),
\mathrm{Cost}(h_\Theta(v_n), \hat{v}_n) = \begin{cases} -\log(h_\Theta(v_n)) & \text{if } y = 1 \\ -\log(1 - h_\Theta(v_n)) & \text{if } y = 0 \end{cases} \qquad (18),
where y is either 1 or 0; the regularized cost function over the regression matrix is defined as:
J(\Theta) = -\frac{1}{N} \left\{ \sum_{n=1}^{N} \left[ \hat{v}_n \log(h_\Theta(v_n)) + (1 - \hat{v}_n) \log(1 - h_\Theta(v_n)) \right] + \delta \|\Theta\|^2 \right\} \qquad (19),
wherein \delta \|\Theta\|^2 is the regularization term that avoids over-fitting in the logistic regression;
Step 3.4: to find the parameter \Theta that minimizes J(\Theta), gradient descent is used to update each parameter repeatedly; the update formula is as follows:
\Theta_{j+1} = \Theta_j - \alpha \left( \frac{1}{N} \sum_{n=1}^{N} (h_\Theta(v_n) - \hat{v}_n) v_n^T \right) - \frac{\alpha \delta}{N} \Theta_j \qquad (20),
the updates run until the difference \|\Theta_{j+1} - \Theta_j\|^2 between \Theta_{j+1} and \Theta_j converges, at which point the regression matrix \Theta is obtained;
Step 3.5: finally the real-valued low-dimensional representation of a new sample is obtained through the linear mapping matrix; since h_\Theta is a sigmoid function, the hash code for a new sample is expressed as:

\hat{Y} = \lfloor h_\Theta(QX) \rceil \qquad (21),

wherein \lfloor \cdot \rceil means that each output of h_\Theta is rounded to the nearest integer, which defines the binarization threshold as 0.5: if a bit of h_\Theta(QX) is greater than 0.5 it is expressed as 1, otherwise as 0; the SSPH codes of the training samples and test samples are thereby obtained, and the retrieval procedure of SSPH is expressed as follows:
Random structure conformal Hash retrieval method (SSPH)
Input:
a set of training data: X = \{x_i \in R^d\}_{i=1}^{n};
D, the target dimension of the hash codes;
\alpha, the learning rate of the logistic regression;
the regularization parameters \{\delta, \lambda\}.
Output: the basis matrix U and the regression matrix \Theta.
1. compute the basis matrix U and the low-dimensional matrix V with formula (15) and formula (16);
2. repeat step 1 until convergence;
3. obtain the regression matrix \Theta from formula (20); the SSPH code of a sample is then given by the definition in formula (21).
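The training update of formula (20) and the rounding of formula (21) can be sketched as below, under the simplifying assumption h_\Theta(v) = sigmoid(\Theta v); the function names, the fixed iteration count in place of a convergence test, and the direct use of V instead of the mapping QX are all illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_regression(V, B, alpha=0.1, delta=0.35, iters=500):
    """Fit the D x D regression matrix Theta with the update of formula (20).
    V: D x N real codes, B: D x N binary targets."""
    D, N = V.shape
    Theta = np.zeros((D, D))
    for _ in range(iters):
        H = sigmoid(Theta @ V)                 # D x N predictions
        grad = (H - B) @ V.T / N               # data term of (20)
        Theta -= alpha * grad + (alpha * delta / N) * Theta
    return Theta

def ssph_code(Theta, v):
    """Formula (21): round the sigmoid outputs at 0.5 to get hash bits."""
    return (sigmoid(Theta @ v) > 0.5).astype(np.uint8)
```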
5. The random structure conformal Hash information retrieval method according to claim 1, characterized in that computing the Hamming distance between the training data and test samples by an XOR operation described in step 4 to obtain the final result refers to the following computational complexity analysis:
the computational complexity of the random structure conformal Hash retrieval method (SSPH) comprises 3 parts: the first part computes the NMF, with complexity O(NMKD), where N is the size of the database, M and D are the dimensions of the high-dimensional and low-dimensional data respectively, and K is the number of classes in the database; the second part computes the cost function of the objective (formula 6), with complexity O(N^2 D); the third part is the logistic regression process, with complexity O(N D^2); the overall computational complexity of SSPH is therefore O(tNMKD + N^2 D + tN D^2), where t is the number of iterations.
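The Hamming-ranking step of claim 1, step 4 can be sketched with XOR as follows; the helper names are invented and the codes are assumed to be stored one bit per array element:

```python
import numpy as np

def hamming_distance(a, b):
    """Hamming distance between two equal-length {0,1} vectors via XOR."""
    return int(np.count_nonzero(np.bitwise_xor(a, b)))

def rank_by_hamming(query, database):
    """Return database indices sorted by Hamming distance to `query`;
    `database` is an N x D array of {0,1} codes."""
    dists = np.count_nonzero(np.bitwise_xor(database, query), axis=1)
    return np.argsort(dists, kind="stable")
```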
CN201410604395.6A 2014-10-30 2014-10-30 Random structure conformal Hash information retrieval method Pending CN104376051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410604395.6A CN104376051A (en) 2014-10-30 2014-10-30 Random structure conformal Hash information retrieval method


Publications (1)

Publication Number Publication Date
CN104376051A true CN104376051A (en) 2015-02-25

Family

ID=52554958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410604395.6A Pending CN104376051A (en) 2014-10-30 2014-10-30 Random structure conformal Hash information retrieval method

Country Status (1)

Country Link
CN (1) CN104376051A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808723A (en) * 2016-03-07 2016-07-27 南京邮电大学 Image retrieval method based on image semantics and visual hashing
CN105843555A (en) * 2016-03-18 2016-08-10 南京邮电大学 Stochastic gradient descent based spectral hashing method in distributed storage
CN106484782A (en) * 2016-09-18 2017-03-08 重庆邮电大学 A kind of large-scale medical image retrieval based on the study of multinuclear Hash
CN106815349A (en) * 2017-01-19 2017-06-09 银联国际有限公司 The temporal filtering method and event filtering method matched based on hash algorithm and canonical
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN110188223A (en) * 2019-06-06 2019-08-30 腾讯科技(深圳)有限公司 Image processing method, device and computer equipment
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN117609488A (en) * 2024-01-22 2024-02-27 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034085A (en) * 2010-09-27 2011-04-27 山东大学 Video copy detection method based on local linear imbedding
US20110299721A1 (en) * 2010-06-02 2011-12-08 Dolby Laboratories Licensing Corporation Projection based hashing that balances robustness and sensitivity of media fingerprints
CN102819582A (en) * 2012-07-26 2012-12-12 华数传媒网络有限公司 Quick searching method for mass images


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI LIU et al.: "Latent Structure Preserving Hashing", International Journal of Computer Vision *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808723A (en) * 2016-03-07 2016-07-27 南京邮电大学 Image retrieval method based on image semantics and visual hashing
CN105808723B (en) * 2016-03-07 2019-06-28 南京邮电大学 The picture retrieval method hashed based on picture semantic and vision
CN105843555A (en) * 2016-03-18 2016-08-10 南京邮电大学 Stochastic gradient descent based spectral hashing method in distributed storage
CN105843555B (en) * 2016-03-18 2018-11-02 南京邮电大学 Spectrum hash method based on stochastic gradient descent in distributed storage
CN106484782B (en) * 2016-09-18 2019-11-12 重庆邮电大学 A kind of large-scale medical image retrieval based on the study of multicore Hash
CN106484782A (en) * 2016-09-18 2017-03-08 重庆邮电大学 A kind of large-scale medical image retrieval based on the study of multinuclear Hash
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN106815349A (en) * 2017-01-19 2017-06-09 银联国际有限公司 The temporal filtering method and event filtering method matched based on hash algorithm and canonical
CN110188223A (en) * 2019-06-06 2019-08-30 腾讯科技(深圳)有限公司 Image processing method, device and computer equipment
CN110188223B (en) * 2019-06-06 2022-10-04 腾讯科技(深圳)有限公司 Image processing method and device and computer equipment
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN116244483B (en) * 2023-05-12 2023-07-28 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN117609488A (en) * 2024-01-22 2024-02-27 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal
CN117609488B (en) * 2024-01-22 2024-03-26 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal

Similar Documents

Publication Publication Date Title
CN104376051A (en) Random structure conformal Hash information retrieval method
Izakian et al. Anomaly detection and characterization in spatial time series data: A cluster-centric approach
CN105045812B (en) The classification method and system of text subject
Yu et al. Short term wind power prediction for regional wind farms based on spatial-temporal characteristic distribution
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
Chen et al. HAPGN: Hierarchical attentive pooling graph network for point cloud segmentation
CN104462196B (en) Multiple features combining Hash information search method
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN104657350A (en) Hash learning method for short text integrated with implicit semantic features
CN105184298A (en) Image classification method through fast and locality-constrained low-rank coding process
CN109284411B (en) Discretization image binary coding method based on supervised hypergraph
CN111125411A (en) Large-scale image retrieval method for deep strong correlation hash learning
US11841839B1 (en) Preprocessing and imputing method for structural data
CN104850533A (en) Constrained nonnegative matrix decomposing method and solving method
Vincent-Cuaz et al. Template based graph neural network with optimal transport distances
CN112749752A (en) Hyperspectral image classification method based on depth transform
CN104318271A (en) Image classification method based on adaptability coding and geometrical smooth convergence
Nugraha et al. Particle Swarm Optimization–Support Vector Machine (PSO-SVM) Algorithm for Journal Rank Classification
CN117251754A (en) CNN-GRU energy consumption prediction method considering dynamic time packaging
Fan et al. Cadtransformer: Panoptic symbol spotting transformer for cad drawings
Klomsae et al. A string grammar fuzzy-possibilistic C-medians
He et al. Classification of metro facilities with deep neural networks
CN106033546A (en) Behavior classification method based on top-down learning
Laptin et al. Shape of basic clusters: using analogues of Hough transform in higher dimensions
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150225