CN104376051A - Random structure conformal Hash information retrieval method - Google Patents
- Publication number
- CN104376051A (application CN201410604395.6A)
- Authority
- CN
- China
- Prior art keywords
- hash
- data
- formula
- sigma
- matrix
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/325—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
- G06F16/2255—Hash tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9014—Indexing; Data structures therefor; Storage structures hash tables
Abstract
The invention relates to a random structure-preserving hashing information retrieval method. The method is characterized by the steps of: (1) protecting the important structures of the high-dimensional data by reducing the dimensionality of the original high-dimensional data with the proposed objective function, thereby obtaining low-dimensional data; (2) computing the basis matrix and the low-dimensional matrix of the original high-dimensional data through the derived update rules for the basis operator U and the low-dimensional data V; (3) setting a threshold to convert the low-dimensional real-valued representations of the training set into binary codes, and computing the hash codes of test samples with logistic regression, a probabilistic classification model; (4) computing the Hamming distance (an XOR operation) between the training data and the test samples to obtain the final results. On the basis of preserving the distribution of the data and the local and global structures of the high-dimensional data, the method successfully obtains a hash function through multivariate logistic regression and achieves out-of-sample extension; it is suitable for computer vision, data mining, machine learning, and similarity search.
Description
Technical field
The invention belongs to the field of computer information and data processing technology, and in particular relates to a random structure-preserving hashing information retrieval method for computer vision, data mining, machine learning, and similarity search.
Background technology
In information retrieval, machine learning, pattern recognition, and data mining, similarity search is a problem that must be solved. In general, an effective similarity search method builds an index structure in a metric space; early research on similarity search can be traced back to the 1970s. Specifically, when the dimensionality is low (≤ 20), data-structure-based methods such as KD-trees, VP-trees, and R+-trees can solve the similarity search problem. However, as the data dimensionality grows, realizing similarity search effectively becomes ever more difficult in the information processing field. Existing methods adopt the concept of "approximation" to solve the similarity search problem: to improve retrieval efficiency, a hashing algorithm learns hash functions from the Euclidean space to the Hamming space. Binary-code hashing algorithms have two advantages: first, binary hash codes save storage space; second, the Hamming distance between training data and test samples can be computed efficiently as an XOR operation, so the time complexity of a hash-table lookup during similarity search is approximately O(1).
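The two advantages above, compact storage and near-O(1) lookup via XOR, can be illustrated with a short sketch (illustrative only, with hypothetical toy codes; not part of the patent):

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary hash codes: XOR, then count set bits."""
    return bin(code_a ^ code_b).count("1")

# Toy database of 8-bit hash codes (hypothetical values).
database = [0b1011_0010, 0b0000_1111, 0b1011_0000]
query = 0b1011_0000

# Rank database entries by Hamming distance to the query; the exact match comes first.
ranked = sorted(range(len(database)), key=lambda i: hamming_distance(query, database[i]))
```

Because the distance is a single XOR plus a population count, ranking a database of binary codes is far cheaper than computing Euclidean distances in the original high-dimensional space.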
Existing hashing algorithms can be roughly divided into two kinds: those based on random projection and those based on learning. Locality-sensitive hashing (LSH) is a widely used hashing algorithm based on random linear projection that can effectively map data points from a high-dimensional space to a low-dimensional Hamming space; kernelized locality-sensitive hashing (KLSH) and boosted multi-kernel locality-sensitive hashing (BMKLSH) can mine more similarity in kernel space for better retrieval efficiency. To find approximate nearest neighbors of a test point in high-dimensional space, Panigrahy proposed an entropy-based hashing algorithm. Dong proposed multi-probe locality-sensitive hashing based on a statistical characteristic model, currently the best variant of locality-sensitive hashing. In addition, the random-projection-based encoding scheme of Raginsky and Lazebnik guarantees that the Hamming distance between the binary codes of two vectors is related to the value of a shift-invariant kernel between the vectors.
Hash functions based on random projection are effective only when the binary hash codes are sufficiently long. Therefore, to obtain more compact and accurate codes, many learning-based hashing algorithms have been proposed. By mining the structure of the data, expressing it in an objective function, and solving the related optimization problem, a learning-based hashing algorithm can obtain a hash function. Spectral hashing (SpH) is a typical unsupervised hashing algorithm: by learning codes under balance and independence constraints, spectral hashing can learn compact binary codes and protect the similarity in the data. Principal component analysis hashing (PCAH) achieves better quantization than random-projection hashing. In addition, semantic hashing (SH) based on restricted Boltzmann machines has been proposed. Liu et al. proposed a graph-based hashing algorithm that automatically discovers the neighborhood structure inherent in the data to learn correspondingly compact codes, with an anchor graph accelerating the spectral analysis. Recently, spherical hashing, a hypersphere-based binary embedding, has been proposed; this algorithm provides a compact data representation and extended nearest-neighbor search.
However, all of the above hashing methods have certain defects. Although hashing methods based on random projection can produce compact codes, simple linear hash functions cannot capture the latent relations between data points. Meanwhile, because the linear formulation is computed with high-dimensional matrices, it brings very high computational complexity. In addition, learning-based hashing algorithms are not very effective when the code words are long. Moreover, hashing methods that first reduce the dimensionality of the raw data cannot obtain low-dimensional data that preserves the fine structure.
In recent years, nonnegative matrix factorization (NMF), a matrix decomposition algorithm that can learn a parts-based nonnegative representation of objects, has played an important role in information retrieval and data mining. Given a nonnegative matrix X ∈ R^(M×N) holding N M-dimensional data vectors, NMF factorizes it into two nonnegative matrices U = [u_id] ∈ R^(M×D) and V = [v_dj] ∈ R^(D×N) whose product approximates the original matrix well, X ≈ UV. Lee and Seung also proposed two objective functions to assess the distance between the two nonnegative matrices X and UV; the objective based on the squared difference can be expressed as:

O_F = ||X - UV||^2 (1),

where || · || in formula (1) is the Frobenius norm. To optimize this objective, the iterative update steps

u_id ← u_id (XV^T)_id / (UVV^T)_id, v_dj ← v_dj (U^T X)_dj / (U^T U V)_dj (2)

can be used to obtain a local minimum of O_F. It has been proved that the iterative update algorithm of formula (2) effectively finds a local minimum of O_F. The matrix V obtained from NMF is the low-dimensional representation of X; meanwhile, U is the basis matrix.
At present there are also many algorithms based on NMF. Local nonnegative matrix factorization (LNMF) can better capture local features and learn visual patterns from parts-based, spatially localized representations. To improve the LNMF algorithm, Cai et al. proposed locality-preserving nonnegative matrix factorization (LPNMF), which can analyze the similarity between two hidden data points. Building on these methods, an effective landmark-based method that can compress data, accelerated locality-preserving nonnegative matrix factorization (A-LPNMF), was proposed to reduce the computational complexity of LPNMF. To discover the underlying manifold structure, Cai et al. proposed graph-regularized nonnegative matrix factorization (GNMF), which combines matrix factorization with a graph structure. Constrained nonnegative matrix factorization (CNMF) adds label information as an extra constraint, so that data points of the same class merge in the new representation space. Influenced by sparse coding, nonnegative local coordinate factorization (NLCF) adds a local coordinate constraint to guarantee the sparsity of the learned representation.
In summary, the deficiencies of the prior art can be summarized as follows: first, because existing NMF algorithms do not solve the problem of preserving the local and global structures of the original high-dimensional data, the low-dimensional data they obtain cannot inherit the features of the high-dimensional data to the greatest extent; second, existing hashing algorithms based on random projection have to generate many hash tables to obtain a certain retrieval performance, and simple linear hash functions cannot capture the latent relations between data points; third, when the code words are long, learning-based hashing algorithms cannot obtain effective results.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a random structure-preserving hashing information retrieval method (SSPH). On the basis of preserving the distribution of the data and the local and global structures of the high-dimensional data, the invention successfully obtains a hash function through multivariate logistic regression and can realize out-of-sample extension.
The random structure-preserving hashing information retrieval method proposed by the invention is characterized by comprising the following concrete steps:
Step 1: Protect the important structures of the high-dimensional data by reducing the dimensionality of the original high-dimensional data with the proposed objective function, thereby obtaining low-dimensional data. To protect the important structures of the high-dimensional data as much as possible, minimize the KL divergence between the joint probability distribution of the high-dimensional space and the heavy-tailed joint probability distribution of the low-dimensional space:

C = λKL(P||Q) (3),

where in formula (3) P is the joint probability distribution of the high-dimensional space, with entries p_ij, and Q is the joint probability distribution of the low-dimensional space, with entries q_ij. The concrete steps are as follows:
Step 1.1: The conditional probability p_ij expresses the similarity between data points x_i and x_j, proportional to their probability density. Only pairwise similarities are needed, so p_ii and q_ii are set to 0; at the same time, for all pairs i, j the symmetry properties p_ij = p_ji and q_ij = q_ji hold. The pairwise similarity in the high-dimensional space can be expressed as:

p_ij = exp(-||x_i - x_j||^2 / 2σ_i^2) / Σ_{k≠l} exp(-||x_k - x_l||^2 / 2σ_k^2) (4).
Step 1.2: Here σ_i is the variance of the Gaussian centered at data point x_i, and each data point x_i has a corresponding perplexity. The low-dimensional map uses a heavy-tailed probability distribution, and the joint probability q_ij may be defined as:

q_ij = (1 + ||v_i - v_j||^2)^(-1) / Σ_{k≠l} (1 + ||v_k - v_l||^2)^(-1) (5).

Formula (5) defines an infinite mixture of Gaussians; since it has no exponential term, the density of a point can be evaluated faster than with a single Gaussian. The cost function (6) built on the KL divergence can effectively assess the emphasis of the data distribution.
Step 1.3: With q_ij and p_ij, the cost can be written as:

C = λKL(P||Q) = λ Σ_i Σ_j p_ij log(p_ij / q_ij) (6).

The gradient of the KL divergence between P and Q in formula (6) can be expressed as:

∂C/∂v_i = 4λ Σ_j (p_ij - q_ij)(v_i - v_j)(1 + ||v_i - v_j||^2)^(-1) (7).
Step 1.4: Combining the structure-preserving term of formula (3) with NMF yields the new objective function:

O_F = ||X - UV||^2 + λKL(P||Q) (8),

where V ∈ {0,1}^(D×N), X, U, V ≥ 0, U ∈ R^(M×D), X ∈ R^(M×N), and λ controls the smoothness of the new representation. In most cases, the low-dimensional data produced by NMF alone is not effective and meaningful enough for practical applications; to obtain better results in information retrieval, the term λKL(P||Q) is introduced to protect the structure of the raw data.
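As an illustration of the structure-preserving term λKL(P||Q) of formulas (3)-(6), the following sketch computes the Gaussian high-dimensional affinities p_ij, the heavy-tailed low-dimensional affinities q_ij, and the cost C (simplified to a single global σ rather than the per-point bandwidth, and storing points as rows; the function names are this sketch's, not the patent's):

```python
import numpy as np

def joint_p(X, sigma=1.0):
    """High-dimensional joint probabilities p_ij (formula 4), Gaussian affinities.
    Rows of X are data points; a single global sigma is assumed for simplicity."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)   # pairwise ||x_i - x_j||^2
    P = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)                                # p_ii = 0
    return P / P.sum()

def joint_q(V):
    """Low-dimensional joint probabilities q_ij (formula 5), heavy-tailed."""
    d2 = np.square(V[:, None, :] - V[None, :, :]).sum(-1)
    Q = 1.0 / (1.0 + d2)
    np.fill_diagonal(Q, 0.0)                                # q_ii = 0
    return Q / Q.sum()

def kl_cost(P, Q, lam=1.0, eps=1e-12):
    """C = lambda * KL(P || Q), the structure-preserving cost of formula (6)."""
    return lam * float(np.sum(P * np.log((P + eps) / (Q + eps))))
```

Both P and Q are symmetric, zero-diagonal, and sum to one, so the KL cost is nonnegative and vanishes only when the low-dimensional affinities reproduce the high-dimensional ones.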
Step 2: Using the derived update rules for the basis operator U and the low-dimensional data V, compute the basis matrix and the low-dimensional matrix of the original high-dimensional data. The optimization comprises the following concrete steps:

Step 2.1: The discrete constraint V ∈ {0,1}^(D×N) of formula (8) cannot be computed directly during optimization; to obtain real values, the data V ∈ {0,1}^(D×N) is first relaxed to the domain V ∈ R^(D×N).

Step 2.2: The Lagrangian of the problem is then set to:

g = ||X - UV||^2 + λKL(P||Q) + tr(ΦU^T) + tr(ΨV^T) (9),

where the matrices Φ and Ψ in formula (9) are two Lagrange multiplier matrices. The gradients of g are thus:

∂g/∂U = -2XV^T + 2UVV^T + Φ, ∂g/∂V = -2U^T X + 2U^T UV + G + Ψ (10).

Step 2.3: Setting the gradients of g to 0 to minimize O_F gives:

2(-U^T X + U^T UV) + G + Ψ = 0 (11),
2(-XV^T + UVV^T) + Φ = 0 (12),

where G denotes the gradient of the λKL(P||Q) term with respect to V, assembled from formula (7).

Step 2.4: Besides the above, the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0 hold. Multiplying both sides of formula (11) and formula (12) elementwise by V_ij and U_ij respectively gives:

(2(-U^T X + U^T UV) + G)_ij V_ij = 0 (13),
2(-XV^T + UVV^T)_ij U_ij = 0 (14).

Step 2.5: For any i and j, this yields the update rules:

U_ij ← U_ij (XV^T)_ij / (UVV^T)_ij (15),
V_ij ← V_ij (2U^T X)_ij / (2U^T UV + G)_ij (16),

where all elements in U and V are positive, and each update of U or V does not increase the objective function.
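One step of the update rule (16) for V might be sketched as follows. Splitting G into positive and negative parts to keep the factors nonnegative is an assumption borrowed from graph-regularized NMF practice, not stated in the patent:

```python
import numpy as np

def update_V(U, V, X, G, eps=1e-10):
    """One multiplicative update of V in the spirit of formula (16):
    V_ij <- V_ij * (2 U^T X)_ij / (2 U^T U V + G)_ij,
    where G is the gradient of the lambda*KL(P||Q) term with respect to V.
    Because G can have negative entries, its negative part is moved to the
    numerator (an assumption, common in graph-regularized NMF) so that V
    stays nonnegative."""
    num = 2.0 * (U.T @ X) + np.maximum(-G, 0.0)
    den = 2.0 * (U.T @ U @ V) + np.maximum(G, 0.0) + eps
    return V * num / den
```

With G = 0 this reduces to the plain Lee-Seung multiplicative update for V.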
Step 3: Set a threshold to convert the low-dimensional real-valued representations of the training set into binary codes, and compute the hash codes of test samples with logistic regression, a probabilistic classification model; that is, form the hash function by the following concrete steps:

Step 3.1: The basis U = [u_id] ∈ R^(M×d) and the low-dimensional matrix V = [v_dj] ∈ R^(d×N), with code length d << M, are obtained from formula (15) and formula (16). A threshold is then set to convert the low-dimensional real representation V = [v_1, ..., v_N] into binary codes: if the f-th element of vector v_n is larger than the threshold, that real value is set to 1, otherwise to 0, where f = 1, ..., d and n = 1, ..., N.

Step 3.2: By the principles of information theory, an information source reaches maximum entropy under a uniform probability distribution; in particular, if the entropy of the codes over the data is very small, the whole corpus is mapped onto a small fraction of the codes. To guarantee the efficiency of semantic hashing and satisfy the maximum-entropy principle, the median of v_p (the values of the p-th code dimension over the training set) is used as the threshold for the elements of v_p, so that half the values are set to 1 and the other half to 0; by this method, the real-valued codes are converted into binary codes.
Step 3.3: The above process yields the binary codes of the training data. To obtain a hash function that can be applied directly to a new sample, and given the binary nature of the codes, logistic regression, a probabilistic classification model, is used to compute the hash codes of test samples. Before deriving the logistic regression function, the binary codes are written as v̂_n ∈ {0,1}^d, n = 1, ..., N; the training set is thus expressed as the pairs (v_n, v̂_n), and the associated d × d regression matrix is written Θ ∈ R^(d×d). The logistic regression function is expressed as:

σ(z) = 1 / (1 + e^(-z)) (17),

where the output y is either 1 or 0. The cost function of the regression matrix is defined as:

J(Θ) = -(1/N) Σ_{n=1}^{N} [v̂_n^T log σ(Θ^T v_n) + (1 - v̂_n)^T log(1 - σ(Θ^T v_n))] + δ||Θ||^2 (18),

where 1 is an N × 1 matrix of ones and δ||Θ||^2 is the regularization term that avoids overfitting in the logistic regression.

Step 3.4: To find the parameter Θ that minimizes J(Θ), gradient descent is used to repeatedly update each parameter; the update formulas are:

Θ ← Θ - α ∂J(Θ)/∂Θ, ∂J(Θ)/∂Θ = (1/N) Σ_n v_n (σ(Θ^T v_n) - v̂_n)^T + 2δΘ (19)-(20).

The updates are repeated until the difference between successive values of J(Θ) converges, which yields the regression matrix Θ.
Step 3.5: Finally, the real-valued low-dimensional representation is obtained through the linear mapping matrix Θ; since σ(·) is the sigmoid function, the hash code for a new sample v is expressed as:

v̂ = round(σ(Θ^T v)) (21),

where round(·) takes the nearest integer of each input, defining the binary threshold as 0.5: if a bit of σ(Θ^T v) is greater than 0.5 it is expressed as 1, otherwise as 0. The SSPH codes of the training samples and test samples are thus obtained, and the retrieval procedure of SSPH is summarized as follows:
Random structure-preserving hashing method (SSPH).
Input:
- a training matrix X;
- d, the target dimension of the hash codes;
- the learning rate α of the logistic regression;
- the regularization parameters {δ, λ}.
Output: the basis matrix U and the regression matrix Θ.
1. Compute the basis matrix U and the low-dimensional matrix V with formula (15) and formula (16), iterating until convergence.
2. Obtain the regression matrix Θ from formula (20).
3. The SSPH code of a sample is defined in formula (21).
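Steps 3.3-3.5 amount to fitting a multivariate logistic regression that maps real-valued codes to their binarized versions and then hashing a new sample with round(σ(Θ^T v)). A hedged sketch with plain gradient descent (the toy data and hyperparameters are assumptions, not the patent's settings):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_hash_regression(V, B, alpha=0.5, delta=1e-4, n_iter=2000):
    """Fit the d x d regression matrix Theta of steps 3.3-3.4 by gradient
    descent on the regularized logistic-regression cost, so that
    round(sigmoid(Theta^T v)) reproduces the training binary codes B."""
    d, N = V.shape
    Theta = np.zeros((d, d))
    for _ in range(n_iter):
        S = sigmoid(Theta.T @ V)                         # d x N predicted bit probabilities
        grad = (V @ (S - B).T) / N + 2 * delta * Theta   # cross-entropy + ridge gradient
        Theta -= alpha * grad
    return Theta

def hash_code(Theta, v):
    """SSPH code of a new sample (formula 21): threshold the sigmoid at 0.5."""
    return (sigmoid(Theta.T @ v) > 0.5).astype(np.uint8)

# Toy usage: binary targets generated by a random linear map (hypothetical data).
rng = np.random.default_rng(0)
V = rng.standard_normal((3, 40))                  # real-valued low-dimensional codes
B = (rng.standard_normal((3, 3)).T @ V > 0).astype(np.uint8)
Theta = fit_hash_regression(V, B)
```

Once Θ is learned, hashing a new sample is a single matrix-vector product followed by rounding, which is what makes out-of-sample extension cheap.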
Step 4: Compute the Hamming distance (XOR operation) between the training data and the test sample to obtain the final results. The computational complexity is analyzed as follows: the computational complexity of the random structure-preserving hashing method (SSPH) comprises three parts. The first part computes the NMF, with complexity O(NMKd), where N is the size of the database, M and d are the dimensions of the high-dimensional and low-dimensional data respectively, and K is the number of classes in the database. The second part computes the cost function of the objective (formula 6), with complexity O(N^2 d). The third part is the logistic regression process, with complexity O(Nd^2). Therefore, the overall computational complexity of SSPH is O(tNMKd + N^2 d + tNd^2), where t is the number of iterations.
Compared with the prior art, the remarkable advantages of the invention are: first, the invention solves the problem that the unsupervised learning algorithm (NMF) cannot discover the intrinsic geometry of the data space; the proposed objective function is solved with efficient nonnegative matrix factorization and logistic regression, and the local structure of the high-dimensional data is preserved in the low-dimensional map. Second, the invention proposes an optimization framework for the objective function and gives the update rules of the framework on the two benchmark databases SIFT1M and GIST1M. Third, the optimization result obtained by the invention places the training samples in a real-valued domain, so that the real-valued results can be transformed into binary codes. The invention is applicable to fields such as computer vision, data mining, machine learning, and similarity search, and produces a significant effect on the nearest-neighbor retrieval problem for large-scale high-dimensional data.
Accompanying drawing explanation
Fig. 1 is the flow diagram of the random structure-preserving hashing information retrieval method (SSPH) of the invention.
Fig. 2 is the block diagram of the implementation steps of the random structure-preserving hashing information retrieval method (SSPH) of the invention.
Fig. 3 comprises Fig. 3a, Fig. 3b, Fig. 3c and Fig. 3d, comparing the invention with 10 popular methods by mean average precision and by precision-recall curves; Fig. 3a shows the precision-recall comparison on database SIFT1M at code length 48 bits; Fig. 3b shows the precision-recall comparison on database GIST1M at code length 48 bits; Fig. 3c shows the mean-average-precision comparison on database SIFT1M; Fig. 3d shows the mean-average-precision comparison on database GIST1M.
Embodiment
The specific embodiments of the invention are described in further detail below in conjunction with the drawings and examples.
The flow of the proposed random structure-preserving hashing information retrieval method (SSPH) is shown in detail in Fig. 1: visual descriptors are extracted from the training database; the proposed objective function and the derived update rules for the basis operator U and the low-dimensional data V reduce the dimensionality of the original high-dimensional data; and the probabilistic classification model, logistic regression, computes the hash codes of the test samples and yields the hash function. During testing, the visual descriptors of a test image are substituted into the derived hash function to obtain the test sample's hash code, which is then XORed with the hash codes of the training samples to obtain the final result.
The implementation steps of the proposed random structure-preserving hashing information retrieval method (SSPH), shown in detail in Fig. 2, comprise the following concrete steps:
Step 1: protect the important structures of the high-dimensional data by reducing the dimensionality of the original high-dimensional data with the proposed objective function, thereby obtaining low-dimensional data;
Step 2: using the derived update rules for the basis operator U and the low-dimensional data V, compute the basis matrix and the low-dimensional matrix of the original high-dimensional data;
Step 3: set a threshold to convert the low-dimensional real-valued representations of the training set into binary codes, and compute the hash codes of test samples with logistic regression, a probabilistic classification model;
Step 4: compute the Hamming distance (XOR operation) between the training data and the test samples to obtain the final results.
An application example of the random structure-preserving hashing information retrieval method of the invention is further illustrated below.
Embodiment 1: the method of the invention solves the similarity search problem. Two large-scale databases are provided: one is SIFT1M, based on SIFT descriptors; the other is GIST1M, based on GIST descriptors. The SIFT database has 1,000,000 data points of dimension 128, and the GIST database has 1,000,000 data points of dimension 960; the basic parameters of the two databases for similarity search are given in Table 1.
Table 1: basic parameters of the two large databases in similarity search
Database | SIFT (dim = 128) | GIST (dim = 960) |
Size of database | 1,000,000 | 1,000,000 |
Size of test set | 10,000 | 10,000 |
Size of training set | 990,000 | 990,000 |
To protect the important structures of the high-dimensional data as much as possible, the invention minimizes the KL divergence between the joint probability distribution of the high-dimensional space and the heavy-tailed joint probability distribution of the low-dimensional space:

C = λKL(P||Q).

Combining this structure-preserving term with NMF gives the new objective function:

O_F = ||X - UV||^2 + λKL(P||Q),

in which V ∈ {0,1}^(D×N), X, U, V ≥ 0, U ∈ R^(M×D), X ∈ R^(M×N), and λ controls the smoothness of the new representation.

To obtain real values, the data V ∈ {0,1}^(D×N) is first relaxed to the domain V ∈ R^(D×N), and the Lagrangian of the problem is set to:

g = ||X - UV||^2 + λKL(P||Q) + tr(ΦU^T) + tr(ΨV^T),

where the matrices Φ and Ψ are two Lagrange multiplier matrices. Together with the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0, the following update rules are adopted for any i, j:

U_ij ← U_ij (XV^T)_ij / (UVV^T)_ij, V_ij ← V_ij (2U^T X)_ij / (2U^T UV + G)_ij,

where all elements of U and V are positive; the monotone non-increase of the objective function under each update of U or V was demonstrated in "Algorithms for Non-negative Matrix Factorization".
Then, a threshold is set to convert the low-dimensional real representation V = [v_1, ..., v_N] into binary codes: if the f-th element of vector v_n is larger than the threshold, that real value is set to 1, otherwise to 0, where f = 1, ..., d and n = 1, ..., N.

The above process yields only the binary codes of the training data; a hash function for a new sample cannot be obtained directly from it. In the invention, given the binary nature of the codes, the hash codes of test samples are computed with logistic regression, the probabilistic classification model. Before obtaining the logistic regression function, the binary codes are written as v̂_n ∈ {0,1}^d, n = 1, ..., N; the training set can thus be expressed as the pairs (v_n, v̂_n), and the associated d × d regression matrix Θ is expressed as the minimizer of the cost J(Θ), where 1 is an N × 1 matrix of ones and δ||Θ||^2 serves as the regularization term avoiding overfitting in the logistic regression.

The real-valued low-dimensional representation is then obtained through the linear mapping matrix Θ, where σ(·) is the sigmoid function; the hash code for a new sample can be expressed as v̂ = round(σ(Θ^T v)), where round(·) takes the nearest integer of each input. The binary threshold is defined as 0.5: if a bit of σ(Θ^T v) is greater than 0.5 it is expressed as 1, otherwise as 0, thus obtaining the SSPH codes of the training samples and test samples.
In the above application, 10K randomly drawn data points serve as test samples, while the rest of the database serves as the image database. During training, a data point is labeled 1 if it lies within the top 2 percent of nearest points, and 0 otherwise. During testing, a returned point is considered a true neighbor if it is within the top 2 percent closest. Because ranking by Hamming distance is very fast with hash codes, Hamming ranking is used to measure the retrieval task. The application results are judged by mean average precision and by precision-recall curves. The results show that the accuracy of the random structure-preserving hashing information retrieval method (SSPH) of the invention is consistently higher than that of the other methods across different code lengths.
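The Hamming-ranking evaluation described above can be sketched as follows (a generic precision/recall-at-k helper, assuming the ground-truth set holds the indices of the top-2% Euclidean neighbors; not the patent's code):

```python
import numpy as np

def precision_recall_at_k(query_code, db_codes, ground_truth, k):
    """Hamming-ranking evaluation: rank database codes by Hamming distance
    to the query (bitwise comparison, equivalent to XOR + popcount), keep
    the top k, and score them against the ground-truth neighbor set."""
    dists = np.count_nonzero(db_codes != query_code, axis=1)   # Hamming distances
    retrieved = set(np.argsort(dists, kind="stable")[:k].tolist())
    hits = len(retrieved & set(ground_truth))
    return hits / k, hits / max(len(ground_truth), 1)
```

Sweeping k from 1 to the database size and plotting the two returned values traces exactly the precision-recall curves used in Fig. 3.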
Ten popular hashing retrieval methods are further compared, including LSH, BSSC, RBM, SpH, STH, AGH, KLSH, PCAH, KSH and CH; all ten are compared at code lengths of 32, 48, 64 and 80. On each database, the random structure-preserving hashing information retrieval method (SSPH) of the invention chooses its learning rate by cross-validation from the values 0.01, 0.02, 0.03, ..., 0.10, and the regularization parameter is set to 0.35.
As can be seen from Fig. 3, on the two large databases the random structure-preserving hashing information retrieval method (SSPH) of the invention achieves the best results compared with the other prior-art methods. Meanwhile, Table 2 compares the mean average precision, training time and test time at code lengths of 32 and 48 bits on the SIFT1M and GIST1M databases; in training time, the SSPH of the invention is more efficient than STH, KSH and BSSC, so SSPH is a highly effective method for large-scale data retrieval.
Table 2: comparison of mean average precision, training time and test time at code lengths of 32 and 48 bits on the SIFT1M and GIST1M databases
Matters not described in the specific embodiments of the invention belong to techniques well known in the art and may be implemented with reference to known techniques.
The invention has been verified in repeated applications and achieves satisfactory results.
Claims (5)
1. A random structure-preserving hashing information retrieval method, characterized by comprising the following concrete steps:
Step 1: protect the important structures of the high-dimensional data by reducing the dimensionality of the original high-dimensional data with the proposed objective function, thereby obtaining low-dimensional data;
Step 2: using the derived update rules for the basis operator U and the low-dimensional data V, compute the basis matrix and the low-dimensional matrix of the original high-dimensional data;
Step 3: set a threshold to convert the low-dimensional real-valued representations of the training set into binary codes, and compute the hash codes of test samples with logistic regression, a probabilistic classification model;
Step 4: compute the Hamming distance (XOR operation) between the training data and the test samples to obtain the final results.
2. The random structure conformal Hash information retrieval method according to claim 1, wherein preserving the important structure of the high-dimensional data described in Step 1, applying the proposed objective function to reduce the dimensionality of the original high-dimensional data and thereby obtain low-dimensional data, refers to minimizing the KL divergence between the joint probability distribution of the high-dimensional space and the heavy-tailed joint probability distribution of the low-dimensional space:

C = λKL(P||Q) (3),

In formula (3), P is the joint probability distribution of the high-dimensional space, with entries p_ij; Q is the joint probability distribution of the low-dimensional space, with entries q_ij; the concrete steps comprise:

Step 1.1: the joint probability p_ij describes the similarity between data points x_i and x_j and is proportional to their probability density; only pairwise similarities are of interest, so p_ii and q_ii are set to 0, and the distributions are symmetric, i.e. p_ij = p_ji and q_ij = q_ji; the pairwise similarity in the high-dimensional space can be expressed as:

p_ij = exp(-||x_i - x_j||² / 2σ_i²) / Σ_{k≠l} exp(-||x_k - x_l||² / 2σ_l²) (4),

Step 1.2: here σ_i denotes the variance of the Gaussian distribution centered at data point x_i, so each data point x_i has its own bandwidth; the low-dimensional embedding uses a heavy-tailed probability distribution, and the joint probability q_ij may be defined as:

q_ij = (1 + ||v_i - v_j||²)⁻¹ / Σ_{k≠l} (1 + ||v_k - v_l||²)⁻¹ (5),

The distribution in formula (5) is an infinite mixture of Gaussians; since it has no exponential term, it evaluates the density of a point faster than a single Gaussian; the cost function built on the KL divergence, formula (6), can effectively capture where the data distribution places its emphasis;

Step 1.3: with q_ij and p_ij, the cost function can be written as:

C = λKL(P||Q) = λ Σ_i Σ_j p_ij log(p_ij / q_ij) (6),

The gradient of the KL divergence between P and Q in formula (6) can be expressed as:

∂C/∂v_i = 4λ Σ_j (p_ij - q_ij)(v_i - v_j)(1 + ||v_i - v_j||²)⁻¹ (7),

Step 1.4: combining the structure-preserving term of formula (3) with NMF yields the new objective function:

O_f = ||X - UV||² + λKL(P||Q) (8),

where V ∈ {0,1}^{d×N}, X, U, V ≥ 0, U ∈ R^{m×d}, X ∈ R^{m×N}, and λ controls the smoothness of the new representation;

In most cases the low-dimensional data produced by NMF alone is not effective and meaningful enough for practical applications; to obtain better results in information retrieval, the term λKL(P||Q) is introduced to preserve the structure of the raw data.
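The structure-preserving objective of formulas (3)-(8) can be sketched numerically. The following is a minimal illustration, not the patented implementation: it assumes small dense NumPy arrays whose columns are data points, a single fixed bandwidth σ shared by all points (the claim allows a per-point σ_i), and the Gaussian and heavy-tailed pairwise kernels of formulas (4) and (5):

```python
import numpy as np

def joint_probabilities(X, sigma=1.0):
    """High-dimensional joint distribution P (formula 4): symmetric
    Gaussian similarities over all pairs, diagonal forced to 0."""
    sq = np.sum((X[:, :, None] - X[:, None, :]) ** 2, axis=0)  # ||x_i - x_j||^2
    P = np.exp(-sq / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)          # p_ii = 0: only pairwise similarity matters
    return P / P.sum()

def low_dim_probabilities(V):
    """Low-dimensional heavy-tailed joint distribution Q (formula 5)."""
    sq = np.sum((V[:, :, None] - V[:, None, :]) ** 2, axis=0)
    Q = 1.0 / (1.0 + sq)
    np.fill_diagonal(Q, 0.0)          # q_ii = 0
    return Q / Q.sum()

def objective(X, U, V, lam):
    """O_f = ||X - UV||^2 + lambda * KL(P||Q)  (formula 8)."""
    P, Q = joint_probabilities(X), low_dim_probabilities(V)
    mask = P > 0                       # KL summed over off-diagonal pairs
    kl = np.sum(P[mask] * np.log(P[mask] / Q[mask]))
    return np.linalg.norm(X - U @ V) ** 2 + lam * kl
```

Minimizing this quantity over nonnegative U and V is what the update rules of claim 3 accomplish.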
3. The random structure conformal Hash information retrieval method according to claim 1, wherein using the derived update rules for the basis matrix U and the low-dimensional data V described in Step 2 to compute the basis and the low-dimensional matrix of the original high-dimensional data refers to the following optimization procedure:

Step 2.1: the discrete condition V ∈ {0,1}^{d×N} of formula (8) cannot be optimized directly; to obtain real values, V is first relaxed from {0,1}^{d×N} to the continuous domain V ∈ R^{d×N};

Step 2.2: the Lagrangian function of the problem is then set to:

g = ||X - UV||² + λKL(P||Q) + Tr(ΦUᵀ) + Tr(ΨVᵀ) (9),

The matrices Φ and Ψ in formula (9) are two Lagrange multiplier matrices; the gradients of g follow as:

∂g/∂U = 2(-XVᵀ + UVVᵀ) + Φ, ∂g/∂V = 2(-UᵀX + UᵀUV) + G + Ψ (10),

Step 2.3: setting the gradients to 0 to minimize O_f gives:

2(-UᵀX + UᵀUV) + G + Ψ = 0 (11),
2(-XVᵀ + UVVᵀ) + Φ = 0 (12),

where G = λ ∂KL(P||Q)/∂V is the gradient of the structure-preserving term, whose columns are given by formula (7);

Step 2.4: in addition, the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0 hold; multiplying both sides of formula (11) and formula (12) element-wise by V_ij and U_ij at the corresponding positions yields:

(2(-UᵀX + UᵀUV) + G)_ij V_ij = 0 (13),
2(-XVᵀ + UVVᵀ)_ij U_ij = 0 (14),

where G is split into its element-wise positive and negative parts, G = G⁺ - G⁻ with G⁺, G⁻ ≥ 0;

Step 2.5: for any i and j the following update rules hold:

U_ij ← U_ij (XVᵀ)_ij / (UVVᵀ)_ij (15),
V_ij ← V_ij (2(UᵀX)_ij + G⁻_ij) / (2(UᵀUV)_ij + G⁺_ij) (16),

where all elements of U and V remain nonnegative, and each update monotonically does not increase the objective function with respect to U or V.
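The multiplicative rules of formulas (15) and (16) can be sketched in NumPy. This is an illustrative version under the assumption that the KL gradient G is split as G = G⁺ - G⁻ into element-wise positive and negative parts, with a small constant added to each denominator for numerical safety (both the split and the constant are implementation choices, not stated in the claim):

```python
import numpy as np

EPS = 1e-10  # guards against division by zero in the multiplicative rules

def update_U(X, U, V):
    """Formula (15): U_ij <- U_ij * (X V^T)_ij / (U V V^T)_ij."""
    return U * (X @ V.T) / (U @ V @ V.T + EPS)

def update_V(X, U, V, G):
    """Formula (16): V_ij <- V_ij * (2(U^T X)_ij + G^-_ij)
                               / (2(U^T U V)_ij + G^+_ij),
    where G = G^+ - G^- is the element-wise positive/negative split
    of the structure-preserving gradient."""
    G_pos, G_neg = np.maximum(G, 0.0), np.maximum(-G, 0.0)
    num = 2.0 * (U.T @ X) + G_neg
    den = 2.0 * (U.T @ U @ V) + G_pos + EPS
    return V * num / den
```

With G = 0 the rules reduce to the standard Lee-Seung NMF updates, which is a useful sanity check: the reconstruction error ||X - UV|| should not increase across iterations.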
4. The random structure conformal Hash information retrieval method according to claim 1, wherein setting a threshold to convert the low-dimensional real-valued representations of the training set into binary codes and computing the hash codes of test samples with the probabilistic classification model logistic regression, as described in Step 3, refers to forming the hash function by the following concrete steps:

Step 3.1: the basis U = [u_id] ∈ R^{m×d} and the low-dimensional matrix V = [v_jd] ∈ R^{d×N}, where d << D, are obtained from formula (15) and formula (16); a threshold then needs to be set to convert the low-dimensional real-valued representation V = [v_1, ..., v_N] into binary codes: if the f-th element of a vector v_n is larger than the threshold, that real value is set to 1, otherwise to 0, where f = 1, ..., d and n = 1, ..., N;

Step 3.2: by the principles of information theory, an information source reaches maximum entropy under a uniform probability distribution; in particular, if the entropy of the codes over the data is very small, the whole file set is mapped onto a small fraction of the codes; to guarantee the efficiency of semantic hashing and satisfy the maximum-entropy principle, the median of v_p is used as the threshold for the elements of v_p, so that half of the values are set to 1 and the other half to 0; by this method the real-valued data is quantized into binary codes;
Step 3.3: from the above process the binary codes of the training data are obtained; to hash a new sample directly, a hash function is required, and since the codes are binary, the probabilistic classification model logistic regression is used to compute the hash codes of test samples; before deriving the logistic regression function, the binary codes are represented as:

ŷ_n ∈ {0,1}^d, n = 1, ..., N (17),

The training samples are thus expressed as the pairs {v_n, ŷ_n}, the associated d × d regression matrix is denoted Θ, and the logistic regression function is expressed as:

h_Θ(v) = 1 / (1 + e^{-Θᵀv}) (18),

whose outputs lie between 1 and 0; the associated regression cost function is defined as:

J(Θ) = -(1/N) Σ_{n=1}^{N} (ŷ_nᵀ log h_Θ(v_n) + (1 - ŷ_n)ᵀ log(1 - h_Θ(v_n))) + δ||Θ||² (19),

where 1 denotes an all-ones vector of matching dimension and δ||Θ||² is the regularization term that avoids over-fitting in the logistic regression;

Step 3.4: to find the parameter Θ that minimizes J(Θ), gradient descent is used to repeatedly update each parameter; the update formula is:

Θ_{j+1} = Θ_j - α ∂J(Θ_j)/∂Θ (20),

The update iterates until the difference ||Θ_{j+1} - Θ_j||² between Θ_{j+1} and Θ_j converges, which yields the regression matrix Θ;
Step 3.5: finally, through a linear mapping matrix Q, the low-dimensional real-valued representation of a new sample is obtained; since h_Θ is a sigmoid function, the hash code for a new sample X is expressed as:

ŷ = ⌊h_Θ(QX)⌉ (21),

where ⌊·⌉ indicates that each output of h_Θ is rounded to the nearest integer, which defines the binary threshold as 0.5: if a bit of h_Θ(QX) is greater than 0.5 it is set to 1, otherwise to 0; the SSPH codes of the training samples and the test samples are thus obtained, and the retrieval flow of SSPH is as follows:
Random structure conformal Hash retrieval method (SSPH)
Input:
a training matrix X;
d, the target dimension of the hash codes;
α, the learning rate of the logistic regression;
the regularization parameters {δ, λ};
Output: basis matrix U and regression matrix Θ;
Step one: compute the basis matrix U and the low-dimensional matrix V with formula (15) and formula (16);
Step two: repeat until convergence;
Step three: obtain the regression matrix Θ from formula (20); the SSPH code of a sample is defined in formula (21).
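The logistic-regression stage of Steps 3.3-3.5 can be sketched as below. This is a simplified illustration, not the patented implementation: it assumes full-batch gradient descent on a joint d × d matrix Θ, an L2 penalty δ||Θ||², and training targets Y already binarized; the linear mapping matrix Q of Step 3.5 is taken as given, so only the Θ stage is shown:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_theta(V, Y, alpha=0.5, delta=1e-3, iters=2000):
    """Gradient descent on the regularized logistic cost (formulas 19-20).
    V: d x N real-valued codes, Y: d x N binary target codes."""
    d, N = V.shape
    Theta = np.zeros((d, d))
    for _ in range(iters):
        H = sigmoid(Theta.T @ V)                       # per-bit predictions, d x N
        grad = (V @ (H - Y).T) / N + 2.0 * delta * Theta
        Theta -= alpha * grad
    return Theta

def hash_code(Theta, v):
    """Formula (21): sigmoid outputs rounded at the 0.5 threshold."""
    return (sigmoid(Theta.T @ v) > 0.5).astype(np.uint8)
```

Each column of Θ acts as an independent per-bit logistic regressor, so minimizing J(Θ) decomposes into d separable binary classification problems.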
5. The random structure conformal Hash information retrieval method according to claim 1, wherein computing the Hamming distance between the training data and the test samples by an XOR operation, as described in Step 4, and outputting the final result refers to the following computational complexity analysis:

The computational complexity of the random structure conformal Hash retrieval method (SSPH) comprises three parts. The first part computes the NMF, with complexity O(NMKD), where N is the size of the database, M and D are the dimensions of the high-dimensional and low-dimensional data respectively, and K is the number of classes in the database. The second part computes the cost function of the objective (formula 6), with complexity O(N²D). The third part is the logistic regression process, with complexity O(ND²). The whole computational complexity of SSPH is therefore O(tNMKD + N²D + tND²), where t is the number of iterations.
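The Hamming-distance ranking of Step 4 reduces to an XOR followed by a population count. A minimal sketch (packing each code into a Python integer is an illustrative choice, not part of the claim):

```python
def pack(bits):
    """Pack a sequence of 0/1 bits into a single integer code."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

def hamming(a, b):
    """Hamming distance between two packed codes: XOR, then popcount."""
    return bin(a ^ b).count("1")

def rank(query_bits, database_codes):
    """Return database indices sorted by Hamming distance to the query."""
    q = pack(query_bits)
    dists = [hamming(q, pack(bits)) for bits in database_codes]
    return sorted(range(len(dists)), key=dists.__getitem__)
```

Because XOR and popcount are single machine instructions on packed codes, ranking a database of N codes costs O(N) bit operations, which is why hashing-based retrieval scales to large databases.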
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410604395.6A CN104376051A (en) | 2014-10-30 | 2014-10-30 | Random structure conformal Hash information retrieval method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104376051A true CN104376051A (en) | 2015-02-25 |
Family
ID=52554958
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410604395.6A Pending CN104376051A (en) | 2014-10-30 | 2014-10-30 | Random structure conformal Hash information retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104376051A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105808723A (en) * | 2016-03-07 | 2016-07-27 | 南京邮电大学 | Image retrieval method based on image semantics and visual hashing |
CN105843555A (en) * | 2016-03-18 | 2016-08-10 | 南京邮电大学 | Stochastic gradient descent based spectral hashing method in distributed storage |
CN106484782A (en) * | 2016-09-18 | 2017-03-08 | 重庆邮电大学 | A kind of large-scale medical image retrieval based on the study of multinuclear Hash |
CN106815349A (en) * | 2017-01-19 | 2017-06-09 | 银联国际有限公司 | The temporal filtering method and event filtering method matched based on hash algorithm and canonical |
CN106873964A (en) * | 2016-12-23 | 2017-06-20 | 浙江工业大学 | A kind of improved SimHash detection method of code similarities |
CN110188223A (en) * | 2019-06-06 | 2019-08-30 | 腾讯科技(深圳)有限公司 | Image processing method, device and computer equipment |
CN116244483A (en) * | 2023-05-12 | 2023-06-09 | 山东建筑大学 | Large-scale zero sample data retrieval method and system based on data synthesis |
CN117609488A (en) * | 2024-01-22 | 2024-02-27 | 清华大学 | Method and device for searching small-weight code words, computer storage medium and terminal |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102034085A (en) * | 2010-09-27 | 2011-04-27 | 山东大学 | Video copy detection method based on local linear imbedding |
US20110299721A1 (en) * | 2010-06-02 | 2011-12-08 | Dolby Laboratories Licensing Corporation | Projection based hashing that balances robustness and sensitivity of media fingerprints |
CN102819582A (en) * | 2012-07-26 | 2012-12-12 | 华数传媒网络有限公司 | Quick searching method for mass images |
Non-Patent Citations (1)
Title |
---|
LI LIU et al.: "Latent Structure Preserving Hashing", International Journal of Computer Vision |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | |
Application publication date: 20150225 |