CN104376051A - Random structure conformal Hash information retrieval method - Google Patents


Info

Publication number: CN104376051A
Application number: CN201410604395.6A
Authority: CN (China)
Legal status: Pending
Prior art keywords: hash, data, formula, sigma, matrix
Other languages: Chinese (zh)
Inventors: 邵岭, 蔡子贇, 刘力, 余孟洋
Current assignee: Nanjing University of Information Science and Technology
Original assignee: Nanjing University of Information Science and Technology
Application filed by Nanjing University of Information Science and Technology
Priority to CN201410604395.6A

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
        • G06F16/30 Unstructured textual data → G06F16/31 Indexing; Data structures therefor; Storage structures → G06F16/316 Indexing structures → G06F16/325 Hash tables
        • G06F16/20 Structured data, e.g. relational data → G06F16/22 Indexing; Data structures therefor; Storage structures → G06F16/2228 Indexing structures → G06F16/2255 Hash tables
        • G06F16/90 Details of database functions independent of the retrieved data types → G06F16/901 Indexing; Data structures therefor; Storage structures → G06F16/9014 Hash tables

Abstract

The invention relates to a random structure conformal Hash information retrieval method, characterized by the following steps: (1) preserving the important structures of the high dimensional data, reducing the dimensionality of the original high dimensional data with the proposed objective function, and thereby obtaining low dimensional data; (2) computing the basis matrix and the low dimensional matrix of the original high dimensional data with the derived update rules for the basis operator U and the low dimensional data V; (3) setting a threshold to convert the low dimensional real-valued representations of the training set into binary codes, and computing the Hash codes of a test sample with logistic regression, a probabilistic classification model; (4) computing the Hamming distance, i.e. the XOR operation, between the training data and the test sample to obtain the final results. By preserving the distribution of the data and the local and global structures of the high dimensional data, the method obtains a Hash function through multivariate logistic regression, achieves out-of-sample extension, and is suitable for computer vision, data mining, machine learning and similar search fields.

Description

Random structure conformal Hash information retrieval method
Technical field
The invention belongs to the field of computer information and data processing technology, and in particular relates to a random structure conformal Hash information retrieval method for computer vision, data mining, machine learning and similar search applications.
Background technology
In information retrieval, machine learning, pattern recognition and data mining, similarity search is a problem that must be solved. In general, an effective similarity search method builds an index structure in a metric space; early research on similarity search can be traced back to the 1970s. Specifically, when the dimensionality is low (≤ 20), data-structure-based methods such as KD-trees, VP-trees and R+ trees can solve the similarity search problem. However, as the data dimensionality grows, effective similarity search in the information data processing field becomes increasingly difficult. Existing methods adopt the concept of an "approximate value" to solve the similarity search problem: to improve retrieval efficiency, a hash algorithm learns a hash function from Euclidean space to Hamming space. A binary-coded hash algorithm has two advantages: first, binary Hash codes save storage space; second, the Hamming distance (an XOR operation) between the training data and a test sample can be computed efficiently, so the time complexity of a Hash table lookup in similarity search is approximately O(1).
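The O(1)-style lookup argument rests on the XOR trick: the Hamming distance between two binary codes is the population count of their bitwise XOR. A minimal illustrative sketch (not part of the patent) in Python:

```python
def hamming_distance(a: int, b: int) -> int:
    """Hamming distance between two binary hash codes: popcount of the XOR."""
    return bin(a ^ b).count("1")

# Two 6-bit toy codes that differ only in the third-lowest bit.
print(hamming_distance(0b101101, 0b101001))  # 1
# Complementary codes differ in every bit.
print(hamming_distance(0b101101, 0b010010))  # 6
```

On modern Python (3.10+), `(a ^ b).bit_count()` computes the same popcount directly.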
Existing hash algorithms can be broadly divided into two kinds: those based on random projection and those based on learning. Locality-sensitive hashing (LSH) is a widely used hash algorithm based on random linear projection that can effectively map data points from a high dimensional space to a low dimensional Hamming space; kernelized locality-sensitive hashing (KLSH) and boosted multi-kernel locality-sensitive hashing (BMKLSH) exploit the kernel space to capture more similarity and achieve better retrieval efficiency. To find the approximate nearest neighbors of a test point in a high dimensional space, Panigrahy proposed an entropy-based hash algorithm. Dong proposed multi-probe locality-sensitive hashing based on a statistical characteristic model, which is currently the best variant of locality-sensitive hashing. In addition, Raginsky and Lazebnik guaranteed, through a distribution-free encoding scheme based on random maps, the relation between the value of a shift-invariant kernel on two vectors and the Hamming distance of their binary codes.
A hash function based on random projection is effective only when the binary Hash code is sufficiently long. Therefore, to obtain more compact and accurate codes, many learning-based hash algorithms have been proposed. By mining the structure of the data, expressing it in an objective function, and solving the associated optimization problem, a learning-based hash algorithm can obtain a hash function. Spectral hashing (SpH) is a typical unsupervised hash algorithm; by enforcing balanced and uncorrelated constraints on the codes, spectral hashing can learn compact binary codes and preserve the similarity of the data. Principal component analysis hashing (PCAH) achieves better quantization than random-map hashing. In addition, semantic hashing (SH) based on restricted Boltzmann machines has been proposed. Liu et al. proposed a graph-based hash algorithm that can automatically discover the neighborhood structure inherent in the data and learn correspondingly compact codes, while anchor graphs can accelerate the spectral analysis. Recently, spherical hashing, a hypersphere-based binary embedding algorithm, has been proposed; this algorithm provides compact data representations and scalable nearest neighbor search.
However, all the above hash methods have certain defects. Although hash methods based on random maps can produce compact codes, a simple linear hash function cannot capture the latent relations between data points. Meanwhile, because the linear formula is computed from high dimensional matrices, it brings very high computational complexity. In addition, when the code word is very long, learning-based hash algorithms are not very effective. Moreover, hash methods that first reduce the dimensionality of the raw data cannot obtain low dimensional data that preserves fine structure.
In recent years, as a matrix decomposition algorithm that can learn non-negative part-based representations of objects, non-negative matrix factorization (NMF) has played an important role in information retrieval and data mining. A non-negative matrix X ∈ R^{M×N} whose N columns are M-dimensional data vectors can be decomposed by NMF into two non-negative matrices U = [u_id] ∈ R^{M×D} and V = [v_dj] ∈ R^{D×N}, whose product approximates the original matrix well: X ≈ UV. Lee and Seung also proposed two objective functions to assess the distance between the two non-negative matrices X and UV; the difference-based objective function can be expressed as:

O_F = \|X - UV\|^2 = \sum_{i,j} \left( x_{ij} - \sum_{d=1}^{D} u_{id} v_{dj} \right)^2   (1)

where || · || in formula (1) is the Frobenius norm.
To optimize the above objective function, iterative update steps can be used to obtain a local minimum of O_F:

u_{id}^{(t+1)} = u_{id}^{(t)} \frac{(X V^T)_{id}}{(U V V^T)_{id}}, \qquad v_{dj}^{(t+1)} = v_{dj}^{(t)} \frac{(U^T X)_{dj}}{(U^T U V)_{dj}}   (2)

It has been proved that the iterative update algorithm of formula (2) can effectively find a local minimum of O_F; the matrix V obtained from NMF is the low dimensional representation of X, while U is the basis matrix.
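The multiplicative updates of formula (2) can be sketched in a few lines of NumPy; a generic illustration (dimensions, iteration count and random data are arbitrary choices, not from the patent):

```python
import numpy as np

def nmf(X, D, iters=300, eps=1e-9, seed=0):
    """Factor a non-negative X (M x N) into U (M x D) @ V (D x N)
    with the multiplicative updates of formula (2)."""
    rng = np.random.default_rng(seed)
    M, N = X.shape
    U = rng.random((M, D)) + eps
    V = rng.random((D, N)) + eps
    for _ in range(iters):
        U *= (X @ V.T) / (U @ V @ V.T + eps)   # update of u_id
        V *= (U.T @ X) / (U.T @ U @ V + eps)   # update of v_dj
    return U, V

X = np.random.default_rng(1).random((20, 30))
U, V = nmf(X, D=5)
rel_err = np.linalg.norm(X - U @ V) / np.linalg.norm(X)
```

Entries stay non-negative by construction, and the reconstruction error is non-increasing under these updates.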
At present there are many more NMF-based algorithms. Local non-negative matrix factorization (LNMF) can better capture local features and learn visual patterns from part-based, spatially localized representations. To improve the LNMF algorithm, Cai et al. proposed locality-preserving non-negative matrix factorization (LPNMF), which can analyze the similarity between two hidden data points. Based on these methods, an effective landmark-based method that can compress data, accelerated locality-preserving non-negative matrix factorization (A-LPNMF), was proposed to resolve the computational complexity of LPNMF. To discover the underlying manifold structure, Cai et al. proposed graph-regularized non-negative matrix factorization (GNMF), which combines matrix decomposition with the graph structure. Constrained non-negative matrix factorization (CNMF) takes label information as an additional constraint, so that data points of the same class merge in the new representation domain. Influenced by sparse coding, non-negative local coordinate factorization (NLCF) adds a local coordinate constraint to guarantee the sparseness of the obtained representation.
In summary, the deficiencies of the prior art are as follows: first, because existing NMF algorithms do not solve the problem of preserving the local and global structures of the original high dimensional data, the obtained low dimensional data cannot inherit the features of the high dimensional data to the largest extent; second, existing hash algorithms based on random projection must produce many Hash tables to obtain acceptable retrieval results, and a simple linear hash function cannot capture the latent relations between data points; third, when the code word is very long, learning-based hash algorithms cannot obtain effective results.
Summary of the invention
The object of the invention is to overcome the deficiencies of the prior art by providing a random structure conformal Hash information retrieval method (SSPH). On the basis of preserving the distribution of the data and the local and global structures of the high dimensional data, the invention obtains a hash function through multivariate logistic regression and can realize out-of-sample extension.
The random structure conformal Hash information retrieval method proposed by the invention is characterized by the following concrete steps:
Step 1: preserve the important structures of the high dimensional data by reducing the dimensionality of the original high dimensional data with the proposed objective function, thereby obtaining low dimensional data. To preserve the important structures of the high dimensional data as much as possible, minimize the KL divergence between the joint probability distribution of the high dimensional space and the heavy-tailed joint probability distribution of the low dimensional space:

C = \lambda KL(P \| Q)   (3)

In formula (3), P is the joint probability distribution of the high dimensional space, with entries p_ij; Q is the joint probability distribution of the low dimensional space, with entries q_ij. The concrete steps are as follows:
Step 1.1: the conditional probability p_ij expresses the similarity between data points x_i and x_j, proportional to their probability density. Only pairwise similarities matter, so p_ii and q_ii are set to 0; at the same time the pairs have the symmetry properties p_ij = p_ji and q_ij = q_ji. The pairwise similarity in the high dimensional space can be expressed as:

p_{ij} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq l} \exp\left(-\|x_k - x_l\|^2 / 2\sigma_k^2\right)}   (4)
Step 1.2: σ_i is the variance of the Gaussian distribution centered at data point x_i, and each data point x_i has a corresponding perplexity; the low dimensional map uses a heavy-tailed probability distribution, and the joint probability q_ij is defined as:

q_{ij} = \frac{\left(1 + \|v_i - v_j\|^2\right)^{-1}}{\sum_{k \neq l} \left(1 + \|v_k - v_l\|^2\right)^{-1}}   (5)
Formula (5) defines an infinite mixture of Gaussians; because there is no exponential term, it can evaluate the density of a point faster than a single Gaussian. The cost function based on the KL divergence, formula (6), can effectively assess the key points of the data distribution.
Step 1.3: q_ij and p_ij yield:

G = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}}   (6)
The gradient of the KL divergence between P and Q in formula (6) can be expressed as:

\frac{\partial G}{\partial v_i} = 4 \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)\left(1 + \|v_i - v_j\|^2\right)^{-1}   (7)
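Formulas (4), (5) and (7) can be written out directly in NumPy; a small sketch under the simplifying assumption of one shared bandwidth σ for all points (the patent uses a per-point σ_i):

```python
import numpy as np

def joint_P(X, sigma=1.0):
    """High-dimensional joint probabilities p_ij, formula (4), one shared sigma."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)            # p_ii = 0
    return W / W.sum()

def joint_Q(V):
    """Low-dimensional heavy-tailed joint probabilities q_ij, formula (5)."""
    d2 = np.square(V[:, None, :] - V[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + d2)
    np.fill_diagonal(W, 0.0)            # q_ii = 0
    return W / W.sum()

def kl_grad(P, Q, V):
    """Gradient of KL(P||Q) w.r.t. every low-dimensional point v_i, formula (7)."""
    diff = V[:, None, :] - V[None, :, :]                 # v_i - v_j
    w = 1.0 / (1.0 + np.square(diff).sum(-1))            # (1 + ||v_i - v_j||^2)^-1
    return 4.0 * (((P - Q) * w)[:, :, None] * diff).sum(axis=1)

X = np.random.default_rng(0).random((10, 5))
V = np.random.default_rng(1).random((10, 2))
P, Q = joint_P(X), joint_Q(V)
grad = kl_grad(P, Q, V)
```

Both matrices are symmetric with zero diagonal and normalize over all pairs, matching the k ≠ l sums in formulas (4) and (5); a small step of V against `grad` reduces the divergence.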
Step 1.4: combining the structure-preserving term of formula (3) with NMF gives the new objective function:

O_f = \|X - UV\|^2 + \lambda KL(P \| Q)   (8)

where V ∈ {0,1}^{D×N}, X, U, V ≥ 0, U ∈ R^{M×D}, X ∈ R^{M×N}, and λ controls the smoothness of the new representation.
In most cases, using the low dimensional data of NMF alone is neither effective nor meaningful for practical applications; to obtain better results in information retrieval, the term λKL(P||Q) is introduced to preserve the structure of the raw data.
Step 2: use the derived update rules for the basis operator U and the low dimensional data V to compute the basis matrix and the low dimensional matrix of the original high dimensional data. The optimization comprises the following concrete steps:
Step 2.1: the discrete condition V ∈ {0,1}^{D×N} of formula (8) cannot be computed directly in the optimization; to obtain real values, the data V ∈ {0,1}^{D×N} is first relaxed onto the domain V ∈ R^{D×N}.
Step 2.2: the Lagrangian of the problem is then set to:

L = \|X - UV\|^2 + \lambda KL(P \| Q) + \mathrm{Tr}(\Phi U^T) + \mathrm{Tr}(\Psi V^T)   (9)

where the matrices Φ and Ψ in formula (9) are two Lagrange multiplier matrices; writing g = λKL(P||Q), its gradient is:

\frac{\partial g}{\partial v_i} = 4\lambda \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)\left(1 + \|v_i - v_j\|^2\right)^{-1}   (10)
Step 2.3: set the gradients to 0 to minimize O_f:

\frac{\partial L}{\partial V} = 2\left(-U^T X + U^T U V\right) + G + \Psi = 0   (11)

\frac{\partial L}{\partial U} = 2\left(-X V^T + U V V^T\right) + \Phi = 0   (12)

where:

G = \frac{\partial g}{\partial V} = \left[ \frac{\partial g}{\partial v_1}, \ldots, \frac{\partial g}{\partial v_N} \right]
Step 2.4: in addition, the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0 hold; multiplying both sides of formula (11) and formula (12) by V_ij and U_ij at the corresponding positions gives:

\left( 2\left(-U^T X + U^T U V\right) + G \right)_{ij} V_{ij} = 0   (13)

2\left(-X V^T + U V V^T\right)_{ij} U_{ij} = 0   (14)

where:

G_{ij} = \left( \frac{\partial g}{\partial v_j} \right)_i = \left( 4\lambda \sum_{k=1}^{N} (p_{jk} - q_{jk})(v_j - v_k)\left(1 + \|v_j - v_k\|^2\right)^{-1} \right)_i = 4\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} - q_{jk} V_{ij} - p_{jk} V_{ik} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}
Step 2.5: for any i and j this yields the following update rules:

V_{ij} \leftarrow \frac{(U^T X)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1 + \|v_j - v_k\|^2}}{(U^T U V)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}} V_{ij}   (15)

U_{ij} \leftarrow \frac{(X V^T)_{ij}}{(U V V^T)_{ij}} U_{ij}   (16)

where all elements of U and V are positive, and each update of U or V does not increase the objective function.
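Update rules (15) and (16) admit a vectorized reading; an illustrative sketch in which P and Q are the joint probabilities of formulas (4) and (5), the columns of V are the points v_j, and the small eps guarding the division is our addition:

```python
import numpy as np

def update_V(X, U, V, P, Q, lam=1.0, eps=1e-12):
    """One multiplicative V-update of formula (15); columns of V are the v_j."""
    cols = V.T
    d2 = np.square(cols[:, None, :] - cols[None, :, :]).sum(-1)
    W = 1.0 / (1.0 + d2)              # W_jk = (1 + ||v_j - v_k||^2)^{-1}
    PW, QW = P * W, Q * W             # elementwise; symmetric since P, Q, W are
    num = U.T @ X + 2 * lam * (V @ PW + V * QW.sum(axis=0))
    den = U.T @ U @ V + 2 * lam * (V * PW.sum(axis=0) + V @ QW)
    return V * num / (den + eps)

def update_U(X, U, V, eps=1e-12):
    """Multiplicative U-update of formula (16)."""
    return U * (X @ V.T) / (U @ V @ V.T + eps)

rng = np.random.default_rng(0)
X = rng.random((8, 12))
U = rng.random((8, 3)) + 0.1
V = rng.random((3, 12)) + 0.1
# P, Q: pairwise joint probabilities of formulas (4) and (5) over the 12 points
def _norm(W):
    np.fill_diagonal(W, 0.0)
    return W / W.sum()
P = _norm(np.exp(-np.square(X.T[:, None, :] - X.T[None, :, :]).sum(-1) / 2.0))
Q = _norm(1.0 / (1.0 + np.square(V.T[:, None, :] - V.T[None, :, :]).sum(-1)))
V1 = update_V(X, U, V, P, Q, lam=0.5)
U1 = update_U(X, U, V1)
```

With λ = 0 the V-update degenerates to the plain NMF rule of formula (2), which is a quick sanity check on the vectorization.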
Step 3: set a threshold to convert the low dimensional real-valued representations of the training set into binary codes, and compute the Hash codes of the test samples with logistic regression, a probabilistic classification model, thereby forming the hash function by the following concrete steps:
Step 3.1: the basis U = [u_id] ∈ R^{M×D} and the low dimensional matrix V = [v_dj] ∈ R^{D×N}, where d << D, are obtained from formulas (15) and (16); a threshold is then set to convert the low dimensional real-valued representation V = [v_1, ..., v_N] into binary codes: if the f-th element of vector v_n is larger than the threshold, the real value is set to 1, otherwise 0, where f = 1, ..., d and n = 1, ..., N.
Step 3.2: by the principles of information theory, a source with a uniform probability distribution attains maximum entropy; in particular, if the entropy of the codes over the data is very small, the whole file is mapped onto a small fraction of the codes. To guarantee the efficiency of semantic hashing and satisfy the maximum-entropy principle, the median of v_p is used as the threshold for the elements of v_p, so that half of the values are set to 1 and the other half to 0; by this method the real values are quantized into binary codes.
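Step 3.2's median thresholding, read per dimension so that each bit is balanced, can be sketched as follows (an illustration; the patent leaves the exact orientation of v_p implicit):

```python
import numpy as np

def binarize(V):
    """Threshold each dimension (row of the D x N matrix V) at its median,
    so half the bits in every dimension are 1 and half are 0."""
    t = np.median(V, axis=1, keepdims=True)
    return (V > t).astype(np.uint8)

V = np.array([[0.1, 0.9, 0.4, 0.7],
              [0.8, 0.2, 0.6, 0.3]])
B = binarize(V)   # [[0, 1, 0, 1], [1, 0, 1, 0]]
```

Balancing each bit at its median realizes the maximum-entropy argument of step 3.2: every bit carries close to one bit of information.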
Step 3.3: from the above process, the binary codes of the training data can be obtained; to encode a new sample directly, a hash function is needed. Because of the binary-code setting, the Hash codes of the test samples are computed with a probabilistic classification model, logistic regression. Before obtaining the logistic regression function, the binary codes are written as \hat{v}_n \in \{0,1\}^d, n = 1, \ldots, N, the training samples as V = [v_1, \ldots, v_N], and the associated d × d regression matrix as Θ. The logistic regression function is the sigmoid:

h_\Theta(v) = \frac{1}{1 + e^{-\Theta^T v}}   (17)

whose output y is 1 or 0. The associated regression objective is defined as the regularized cross-entropy:

J(\Theta) = -\frac{1}{N} \sum_{n=1}^{N} \sum_{f=1}^{d} \left[ \hat{v}_{fn} \log h_\Theta(v_n)_f + \left(1 - \hat{v}_{fn}\right) \log\left(1 - h_\Theta(v_n)_f\right) \right] + \delta \|\Theta\|^2   (18)

where δ‖Θ‖² is the regularization term that avoids over-fitting in the logistic regression.
Step 3.4: to find the parameter Θ that minimizes J(Θ), gradient descent repeatedly updates each parameter with the gradient:

\frac{\partial J}{\partial \Theta} = \frac{1}{N} \sum_{n=1}^{N} v_n \left( h_\Theta(v_n) - \hat{v}_n \right)^T + 2\delta\Theta   (19)

via the update formula:

\Theta^{(t+1)} = \Theta^{(t)} - \alpha \frac{\partial J}{\partial \Theta}   (20)

The updates are run until the difference between J(Θ^{(t+1)}) and J(Θ^{(t)}) reaches convergence, which yields the regression matrix Θ.
Step 3.5: finally, the linear mapping Θ^T v yields the real-valued low dimensional representation, since h_Θ is a sigmoid function; the Hash code of a new sample is expressed as:

\hat{h} = \lfloor h_\Theta(v) \rceil   (21)

where ⌊·⌉ rounds each input to the nearest integer, so the binarization threshold is 0.5: if a bit of h_Θ(v) is greater than 0.5 it is expressed as 1, otherwise 0. The SSPH codes of the training samples and the test samples are thus obtained. The retrieval flow of SSPH is expressed as follows:
Random structure conformal Hash retrieval method (SSPH).
Input:
a group of training data X = \{x_i \in R^d\}_{i=1}^{n};
d, the target dimension of the Hash codes;
α, the learning rate of the logistic regression;
the regularization parameters {δ, λ}.
Output: the basis matrix U and the regression matrix Θ.
1. Compute the basis matrix U and the low dimensional matrix V with formulas (15) and (16),
2. repeating until convergence;
3. Obtain the regression matrix Θ from formula (20); the SSPH code of a sample is defined in formula (21).
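Steps 3.3 to 3.5 above amount to fitting the d × d matrix Θ by gradient descent on a regularized logistic loss and rounding the sigmoid outputs at 0.5. A toy sketch (the loss form, learning rate, iteration count and data are illustrative assumptions, not the patent's exact formulas):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_theta(V, B, alpha=0.5, delta=1e-3, iters=2000):
    """Gradient descent for the regression matrix Theta (d x d) mapping
    real codes V (d x N) onto binary codes B (d x N)."""
    d, N = V.shape
    Theta = np.zeros((d, d))
    for _ in range(iters):
        H = sigmoid(Theta @ V)                           # predicted bit probabilities
        Theta -= alpha * ((H - B) @ V.T / N + delta * Theta)
    return Theta

def hash_code(Theta, v):
    """Round the sigmoid output at the 0.5 threshold, as in formula (21)."""
    return (sigmoid(Theta @ v) > 0.5).astype(np.uint8)

V = np.array([[0.1, 0.9, 0.2, 0.8],
              [0.9, 0.1, 0.8, 0.2]])
B = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=np.uint8)
Theta = fit_theta(V, B)
```

Once Θ is learned, hashing a new sample is a single matrix-vector product plus a threshold, which is what enables the out-of-sample extension.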
Step 4: compute the Hamming distance, an XOR operation, between the training data and the test samples to obtain the final results. The computational complexity is analyzed as follows:
The computational complexity of the random structure conformal Hash retrieval method (SSPH) comprises three parts. The first part computes the NMF, with complexity O(NMKd), where N is the size of the database, M and d are the dimensions of the high dimensional and low dimensional data respectively, and K is the number of classes in the database. The second part computes the cost function of the objective (formula (6)), with complexity O(N²d). The third part is the logistic regression process, with complexity O(Nd²). Therefore, the overall computational complexity of SSPH is O(tNMKd + N²d + tNd²), where t is the number of iterations.
Compared with the prior art, the remarkable advantages of the invention are: first, the invention solves the difficult problem that the unsupervised learning algorithm (NMF) cannot find the essential geometry of the data space; the proposed objective function is solved with efficient non-negative matrix factorization and logistic regression, and the local structure of the high dimensional data is preserved in the low dimensional map. Second, the invention proposes an optimization framework for the objective function and gives the update rules of the framework on two benchmark databases, SIFT1M and GIST1M. Third, the optimization result obtained by the invention places the training samples in a real-valued domain, so that the real-valued results can be transformed into binary codes. The invention is applicable to fields such as computer vision, data mining, machine learning and similar search, and produces a significant effect on the nearest neighbor retrieval problem for large-scale high dimensional data.
Accompanying drawing explanation
Fig. 1 is the flow block diagram of the random structure conformal Hash information retrieval method (SSPH) of the invention.
Fig. 2 is the implementation step block diagram of the random structure conformal Hash information retrieval method (SSPH) of the invention.
Fig. 3 comprises Fig. 3a, Fig. 3b, Fig. 3c and Fig. 3d, comparing the invention with 10 popular methods by mean average precision and precision-recall curves; Fig. 3a shows the precision-recall comparison on database SIFT1M for a code length of 48 bits; Fig. 3b shows the precision-recall comparison on database GIST1M for a code length of 48 bits; Fig. 3c shows the mean average precision comparison on database SIFT1M; Fig. 3d shows the mean average precision comparison on database GIST1M.
Embodiment
The specific embodiments of the invention are described in further detail below with reference to the drawings and examples.
The flow of the random structure conformal Hash information retrieval method (SSPH) proposed by the invention is shown in Fig. 1: visual descriptors are extracted from the training database; the proposed objective function and the derived update rules for the basis operator U and the low dimensional data V reduce the dimensionality of the original high dimensional data; and the Hash codes of the test samples are computed with the probabilistic classification model, logistic regression, which yields the hash function. During testing, the visual descriptors of the obtained test image are substituted into the derived hash function to produce the Hash code of the test sample, which is then XORed with the Hash codes of the training samples to obtain the final result.
The implementation steps of the random structure conformal Hash information retrieval method (SSPH) proposed by the invention are shown in Fig. 2 and comprise the following concrete steps:
Step 1: preserve the important structures of the high dimensional data, reducing the dimensionality of the original high dimensional data with the proposed objective function, thereby obtaining low dimensional data;
Step 2: use the derived update rules for the basis operator U and the low dimensional data V to compute the basis matrix and the low dimensional matrix of the original high dimensional data;
Step 3: set a threshold to convert the low dimensional real-valued representations of the training set into binary codes, and compute the Hash codes of the test samples with logistic regression, a probabilistic classification model;
Step 4: compute the Hamming distance, an XOR operation, between the training data and the test samples to obtain the final results.
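Step 4's retrieval reduces to ranking the database by Hamming distance to the query; with codes kept as 0/1 arrays, XOR is elementwise inequality and the distance is a row sum. A minimal sketch with toy codes (not from the patent):

```python
import numpy as np

def hamming_rank(db_codes, query):
    """Rank database items by Hamming distance (XOR + popcount) to the query."""
    dists = (db_codes ^ query).sum(axis=1)       # XOR of 0/1 codes, then popcount
    return np.argsort(dists, kind="stable"), dists

db = np.array([[0, 1, 1, 0],
               [1, 1, 1, 0],
               [0, 0, 0, 1]], dtype=np.uint8)
q = np.array([0, 1, 1, 0], dtype=np.uint8)
order, dists = hamming_rank(db, q)   # dists = [0, 1, 3]
```

At scale, codes would be packed into machine words so that the XOR and popcount run per word rather than per bit.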
An application example of the random structure conformal Hash information retrieval method of the invention is further illustrated below.
Embodiment 1: the random structure conformal Hash information retrieval method of the invention solves problems in similarity search. Two large-scale databases are provided: one is SIFT1M, based on SIFT descriptors; the other is GIST1M, based on GIST descriptors. The SIFT database contains 1,000,000 data points of dimension 128, and the GIST database contains 1,000,000 data points of dimension 960. The basic parameters of the two databases in similarity search are listed in Table 1.

Table 1: basic parameters of the two databases in similarity search

Database                    SIFT (dim=128)    GIST (dim=960)
Size of database            1,000,000         1,000,000
Size of test samples        10,000            10,000
Size of training samples    990,000           990,000
To preserve the important structures of the high dimensional data as much as possible, the invention minimizes the KL divergence between the joint probability distribution of the high dimensional space and the heavy-tailed joint probability distribution of the low dimensional space:

C = \lambda KL(P \| Q)

Combining this structure-preserving term with NMF gives the new objective function:

O_f = \|X - UV\|^2 + \lambda KL(P \| Q)

In this objective function, V ∈ {0,1}^{D×N}, X, U, V ≥ 0, U ∈ R^{M×D}, X ∈ R^{M×N}, and λ controls the smoothness of the new representation.
To obtain real values, the data V ∈ {0,1}^{D×N} is first relaxed onto the domain V ∈ R^{D×N}, and the Lagrangian of the problem is set up with two Lagrange multiplier matrices Φ and Ψ; in addition, the KKT conditions Φ_ij U_ij = 0 and Ψ_ij V_ij = 0 hold.
For any i, j the following update rules are adopted:

V_{ij} \leftarrow \frac{(U^T X)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1 + \|v_j - v_k\|^2}}{(U^T U V)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}} V_{ij}

U_{ij} \leftarrow \frac{(X V^T)_{ij}}{(U V V^T)_{ij}} U_{ij}

All elements of U and V in the above formulas are positive; that each update of U or V does not increase the objective function has been demonstrated in "Algorithms for Non-Negative Matrix Factorization".
Then, threshold value is set and low-dimensional real number is showed V=[v 1..., v n] convert binary code to: if in vector v nin f element larger than threshold value, this real number value is set to 1, otherwise is 0, wherein f=1 ..., d and n=1 ..., N;
In the above process, the binary code of training intensive data can only be obtained.Therefore, a new sample, directly cannot obtain hash function.In the present invention, due to the environment of binary code, in test sample book, Hash codes can be calculated by probability of use statistical classification model-logistic regression, namely before obtaining logistic regression function, binary code representation be become wherein and n=1 ..., N; Therefore training sample can be expressed as correlation regression matrix θ based on d × d can be expressed as
Wherein 1 is the matrix of N × 1, uses as the regularization term avoiding over-fitting in logistic regression;
Again by linear mapping matrix the low-dimensional obtaining real number represents, because it is sigmoid function; Hash codes for new sample can be expressed as:
Wherein illustrate each input all get nearest integer function; Defining binary threshold value is 0.5, if therefore from bit be greater than 0.5, can 1 be expressed as, otherwise be 0, thus obtain the SSPH code of training sample and test sample book.
In the above application of the invention, 10K randomly drawn data points serve as test samples, while the rest of the database serves as the image database. During training, if a data point lies within the top two percent of nearest points, it is labeled 1, otherwise 0. During testing, if a returned point is within the top two percent closest, it is considered a true neighbor. Because ranking by Hamming distance is very fast in Hash code applications, Hamming ranking is used to measure the retrieval task. The application results are judged by mean average precision and precision-recall curves. The results show that the random structure conformal Hash information retrieval method (SSPH) of the invention is consistently more accurate than the others at different code lengths.
Ten popular Hash retrieval methods are further compared, comprising LSH, BSSC, RBM, SpH, STH, AGH, KLSH, PCAH, KSH and CH; all 10 methods are compared at the different code lengths 32, 48, 64 and 80. On each database, the random structure conformal Hash information retrieval method (SSPH) of the invention chooses the learning rate by cross validation from the values 0.01, 0.02, 0.03, ..., 0.10, and the regularization parameter is set to 0.35.
As can be seen from Fig. 3, the random structure conformal Hash information retrieval method (SSPH) of the invention has the best effect on the two large databases compared with the other prior art methods. Meanwhile, on the SIFT1M and GIST1M databases, the mean average precision at code lengths of 32 and 48 bits, together with the comparison of training time and test time, is shown in Table 2; in training time, the SSPH of the invention is more efficient than STH, KSH and BSSC, so SSPH is a highly effective method for large-scale data retrieval.
Table 2: mean average precision at code lengths of 32 and 48 bits on the SIFT1M and GIST1M databases, with comparison of training time and test time
Matters not addressed in the specific embodiments of the invention belong to techniques well known in the art and can be implemented with reference to known techniques.
The invention has been verified through repeated application and achieves satisfactory results.

Claims (5)

1. A random structure conformal Hash information retrieval method, characterized in that it comprises the following concrete steps:
Step 1: to preserve the important structure of the high-dimensional data, reduce the dimensionality of the original high-dimensional data with the proposed objective function, thereby obtaining low-dimensional data;
Step 2: using the update rules for the basis operator U and the derived low-dimensional data V, compute the basis and the low-dimensional matrix of the original high-dimensional data;
Step 3: set a threshold to convert the low-dimensional real-valued representation of the training set into binary codes, and compute the hash codes of test samples with the probabilistic classification model logistic regression;
Step 4: compute the Hamming distance between the training data and the test samples by an XOR operation, and obtain the final retrieval result.
2. The random structure conformal Hash information retrieval method according to claim 1, characterized in that preserving the important structure of the high-dimensional data described in step 1, namely reducing the dimensionality of the original high-dimensional data with the proposed objective function to obtain low-dimensional data, refers to minimizing the KL divergence between the joint probability distribution of the high-dimensional space and the heavy-tailed joint probability distribution of the low-dimensional space:

C = \lambda \, KL(P \| Q) \qquad (3),

in formula (3), P is the joint probability distribution of the high-dimensional space, with entries p_{ij}; Q is the joint probability distribution of the low-dimensional space, with entries q_{ij}; the concrete steps comprise:
Step 1.1: the conditional probability p_{ij} expresses the similarity between data points x_i and x_j, where the similarity of x_j to x_i is proportional to the probability density under a Gaussian centred at x_i; only pairwise similarities need to be modelled, so p_{ii} and q_{ii} are set to 0; the pairs also satisfy p_{ij} = p_{ji} and q_{ij} = q_{ji}; the pairwise similarity in the high-dimensional space is expressed as:
p_{ij} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \ne l} \exp(-\|x_k - x_l\|^2 / 2\sigma_k^2)} \qquad (4),
Step 1.2: here \sigma_i denotes the bandwidth of the Gaussian distribution centred at data point x_i, so each data point x_i has its own local scale; the low-dimensional map uses a heavy-tailed probability distribution, and the joint probability q_{ij} is defined as:
q_{ij} = \frac{(1 + \|v_i - v_j\|^2)^{-1}}{\sum_{k \ne l} (1 + \|v_k - v_l\|^2)^{-1}} \qquad (5),
formula (5) defines an infinite mixture of Gaussians; because it contains no exponential term, the density of a point can be evaluated faster than under a single Gaussian; the cost function of formula (6), built on this KL divergence, effectively captures the important structure of the data distribution;
Step 1.3: q_{ij} and p_{ij} are compared through:
g = KL(P \| Q) = \sum_i \sum_j p_{ij} \log \frac{p_{ij}}{q_{ij}} \qquad (6),
the gradient of the KL divergence between P and Q in formula (6) is expressed as:
\frac{\partial g}{\partial v_i} = 4 \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)(1 + \|v_i - v_j\|^2)^{-1} \qquad (7);
Step 1.4: combining the structure-preserving part of formula (3) with NMF gives the new objective function below:
O_f = \|X - UV\|^2 + \lambda \, KL(P \| Q) \qquad (8),
herein V \in \{0,1\}^{D \times N}, X, U, V \ge 0, U \in R^{M \times D}, X \in R^{M \times N}, and \lambda controls the smoothness of the new representation;
in most cases the low-dimensional data produced by NMF alone is not effective and meaningful enough for practical applications; to obtain better results in information retrieval, the term \lambda KL(P \| Q) is introduced to preserve the structure of the raw data.
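The distributions of formulas (4) and (5) and the cost of formula (6) can be sketched in NumPy as below; this is an illustrative sketch that assumes a single global bandwidth \sigma instead of the per-point \sigma_i of Step 1.2, and all helper names are invented:

```python
import numpy as np

def pairwise_sq_dists(A):
    """Squared Euclidean distances between the columns of A."""
    sq = np.sum(A ** 2, axis=0)
    return sq[:, None] + sq[None, :] - 2.0 * A.T @ A

def joint_p(X, sigma=1.0):
    """High-dimensional joint distribution p_ij of formula (4),
    simplified to one global bandwidth `sigma`."""
    P = np.exp(-pairwise_sq_dists(X) / (2.0 * sigma ** 2))
    np.fill_diagonal(P, 0.0)          # p_ii = 0
    return P / P.sum()

def joint_q(V):
    """Low-dimensional heavy-tailed distribution q_ij of formula (5)."""
    Q = 1.0 / (1.0 + pairwise_sq_dists(V))
    np.fill_diagonal(Q, 0.0)          # q_ii = 0
    return Q / Q.sum()

def kl_divergence(P, Q, eps=1e-12):
    """Cost g = KL(P || Q) of formula (6)."""
    mask = P > 0
    return float(np.sum(P[mask] * np.log((P[mask] + eps) / (Q[mask] + eps))))
```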
3. The random structure conformal Hash information retrieval method according to claim 1, characterized in that using the update rules for the derived basis operator U and low-dimensional data V described in step 2 to compute the basis and the low-dimensional matrix of the original high-dimensional data refers to the following optimization steps:
Step 2.1: the discrete constraint V \in \{0,1\}^{D \times N} of formula (8) cannot be optimized directly; to obtain real values, the data V is first relaxed onto the domain V \in R^{D \times N};
Step 2.2: the Lagrangian of the problem is then set to:

L = \|X - UV\|^2 + \lambda \, KL(P \| Q) + \mathrm{tr}(\Phi U^T) + \mathrm{tr}(\Psi V^T) \qquad (9),

where the matrices \Phi and \Psi in formula (9) are two Lagrange multiplier matrices enforcing U \ge 0 and V \ge 0; the gradient of g is thus:
\frac{\partial g}{\partial v_i} = 4\lambda \sum_{j=1}^{N} (p_{ij} - q_{ij})(v_i - v_j)(1 + \|v_i - v_j\|^2)^{-1} \qquad (10);
Step 2.3: setting the gradients of the Lagrangian to 0 to minimize O_f gives:

\frac{\partial L}{\partial U} = 2(-XV^T + UVV^T) + \Phi = 0 \qquad (11),

\frac{\partial L}{\partial V} = 2(-U^T X + U^T U V) + G + \Psi = 0 \qquad (12),

wherein:

G = \frac{\partial g}{\partial V} = \left[ \frac{\partial g}{\partial v_1}, \ldots, \frac{\partial g}{\partial v_N} \right];
Step 2.4: in addition, the KKT conditions \Phi_{ij} U_{ij} = 0 and \Psi_{ij} V_{ij} = 0 hold; multiplying the corresponding positions on both sides of formula (12) and formula (11) by V_{ij} and U_{ij} respectively gives:
\left( 2(-U^T X + U^T U V) + G \right)_{ij} V_{ij} = 0 \qquad (13),

2(-XV^T + UVV^T)_{ij} U_{ij} = 0 \qquad (14),
wherein:

G_{ij} = \left( \frac{\partial g}{\partial v_j} \right)_i = \left( 4\lambda \sum_{k=1}^{N} (p_{jk} - q_{jk})(v_j - v_k)(1 + \|v_j - v_k\|^2)^{-1} \right)_i = 4\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} - q_{jk} V_{ij} - p_{jk} V_{ik} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2};
Step 2.5: the following update rules then hold for arbitrary i and j:
V_{ij} \leftarrow \frac{(U^T X)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ik} + q_{jk} V_{ij}}{1 + \|v_j - v_k\|^2}}{(U^T U V)_{ij} + 2\lambda \sum_{k=1}^{N} \frac{p_{jk} V_{ij} + q_{jk} V_{ik}}{1 + \|v_j - v_k\|^2}} \, V_{ij} \qquad (15),
U_{ij} \leftarrow \frac{(XV^T)_{ij}}{(UVV^T)_{ij}} \, U_{ij} \qquad (16),
wherein all elements of U and V are positive, and each application of the update rules does not increase the objective function (monotone non-increase in U and V).
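The multiplicative rules (15) and (16) can be vectorized as below; a minimal sketch under the stated nonnegativity constraints, with hypothetical function names and no convergence test:

```python
import numpy as np

def update_UV(X, U, V, P, Q, lam):
    """One round of the multiplicative updates (15) and (16).

    X: M x N data, U: M x D basis, V: D x N low-dimensional codes,
    P, Q: N x N joint distributions, lam: the lambda of formula (8)."""
    eps = 1e-12
    # Student-t weights w_jk = (1 + ||v_j - v_k||^2)^(-1)
    sq = np.sum(V ** 2, axis=0)
    W = 1.0 / (1.0 + sq[:, None] + sq[None, :] - 2.0 * V.T @ V)
    np.fill_diagonal(W, 0.0)            # k = j terms vanish (p_jj = q_jj = 0)
    A, B = P * W, Q * W
    # rule (15): numerator carries p_jk V_ik + q_jk V_ij, denominator the rest
    num = U.T @ X + 2.0 * lam * (V @ A.T + V * B.sum(axis=1)[None, :])
    den = U.T @ U @ V + 2.0 * lam * (V * A.sum(axis=1)[None, :] + V @ B.T)
    V = V * num / (den + eps)
    # rule (16): plain NMF multiplicative update for the basis
    U = U * (X @ V.T) / (U @ V @ V.T + eps)
    return U, V
```

With lam = 0 the two updates reduce to the standard NMF multiplicative rules, which is a useful sanity check for the monotone non-increase property stated above.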
4. The random structure conformal Hash information retrieval method according to claim 1, characterized in that setting a threshold described in step 3 to convert the low-dimensional real-valued representation of the training set into binary codes, and computing the hash codes of test samples with the probabilistic classification model logistic regression, refers to forming the hash function by the following concrete steps:
Step 3.1: the basis U = [u_{id}] \in R^{M \times D} and the low-dimensional matrix V = [v_{dn}] \in R^{D \times N}, where D \ll M, are obtained from formula (15) and formula (16); a threshold must then be set to convert the low-dimensional real-valued representation V = [v_1, \ldots, v_N] into binary codes: if the f-th element of vector v_n is larger than the threshold, that value is set to 1, otherwise to 0, where f = 1, \ldots, D and n = 1, \ldots, N;
Step 3.2: by the principles of information theory, a source reaches maximum entropy under a uniform probability distribution; in particular, if the entropy of the codes over the data is very small, the whole file set is mapped onto a small fraction of the codes; to guarantee the efficiency of semantic hashing, the semantic hashing algorithm should attain maximum entropy, so the median of the values in v_p is used as the threshold for the elements of v_p: half of the values are set to 1 and the other half to 0, and by this method the real-valued codes are converted into binary codes;
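The maximum-entropy thresholding of Step 3.2 can be sketched as follows; since the claim leaves the scope of the median ambiguous, this sketch assumes a per-dimension median over the training samples, so that each bit fires on about half the data:

```python
import numpy as np

def binarize_by_median(V):
    """Threshold each dimension of the D x N real-valued matrix V at its
    own median, yielding roughly balanced {0,1} bits as argued in Step 3.2."""
    thresholds = np.median(V, axis=1, keepdims=True)   # one threshold per bit
    return (V > thresholds).astype(np.uint8)
```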
Step 3.3: the above process yields the binary codes of the training data; to hash a new sample directly, a hash function is required; since the targets are binary codes, the probabilistic classification model logistic regression is used to compute the hash codes of test samples; before deriving the logistic regression function, the binary codes are written as \hat{V} = [\hat{v}_1, \ldots, \hat{v}_N], where \hat{v}_n \in \{0,1\}^D and n = 1, \ldots, N; the training samples are written as V = [v_1, \ldots, v_N], and the corresponding D \times D regression matrix is written as \Theta; the logistic regression cost function is expressed as:
J(\Theta) = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Cost}(h_\Theta(v_n), \hat{v}_n) \qquad (17),
\mathrm{Cost}(h_\Theta(v_n), \hat{v}_n) = \begin{cases} -\log(h_\Theta(v_n)) & \text{if } y = 1 \\ -\log(1 - h_\Theta(v_n)) & \text{if } y = 0 \end{cases} \qquad (18),
where y is either 1 or 0; the regularized cost function over the regression matrix is defined as:
J(\Theta) = -\frac{1}{N} \left\{ \sum_{n=1}^{N} \left[ \hat{v}_n \log(h_\Theta(v_n)) + (1 - \hat{v}_n) \log(1 - h_\Theta(v_n)) \right] + \delta \|\Theta\|^2 \right\} \qquad (19),
wherein \delta \|\Theta\|^2 is the regularization term that avoids over-fitting in the logistic regression;
Step 3.4: to find the parameter \Theta that minimizes J(\Theta), gradient descent is used to update each parameter repeatedly; the update formula is as follows:
\Theta_{j+1} = \Theta_j - \alpha \left( \frac{1}{N} \sum_{n=1}^{N} (h_\Theta(v_n) - \hat{v}_n) v_n^T \right) - \frac{\alpha \delta}{N} \Theta_j \qquad (20),
the updates run until the difference \|\Theta_{j+1} - \Theta_j\|^2 between \Theta_{j+1} and \Theta_j converges, at which point the regression matrix \Theta is obtained;
Step 3.5: finally the real-valued low-dimensional representation of a new sample is obtained through the linear mapping matrix; since h_\Theta is a sigmoid function, the hash code for a new sample is expressed as:

\hat{Y} = \lfloor h_\Theta(QX) \rceil \qquad (21),

wherein \lfloor \cdot \rceil means that each output of h_\Theta is rounded to the nearest integer, which defines the binarization threshold as 0.5: if a bit of h_\Theta(QX) is greater than 0.5 it is expressed as 1, otherwise as 0; the SSPH codes of the training samples and test samples are thereby obtained, and the retrieval procedure of SSPH is expressed as follows:
Random structure conformal Hash retrieval method (SSPH)
Input:
a set of training data: X = \{x_i \in R^d\}_{i=1}^{n};
D, the target dimension of the hash codes;
\alpha, the learning rate of the logistic regression;
the regularization parameters \{\delta, \lambda\}.
Output: the basis matrix U and the regression matrix \Theta.
1. compute the basis matrix U and the low-dimensional matrix V with formula (15) and formula (16);
2. repeat step 1 until convergence;
3. obtain the regression matrix \Theta from formula (20); the SSPH code of a sample is then given by the definition in formula (21).
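The training update of formula (20) and the rounding of formula (21) can be sketched as below, under the simplifying assumption h_\Theta(v) = sigmoid(\Theta v); the function names, the fixed iteration count in place of a convergence test, and the direct use of V instead of the mapping QX are all illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def train_regression(V, B, alpha=0.1, delta=0.35, iters=500):
    """Fit the D x D regression matrix Theta with the update of formula (20).
    V: D x N real codes, B: D x N binary targets."""
    D, N = V.shape
    Theta = np.zeros((D, D))
    for _ in range(iters):
        H = sigmoid(Theta @ V)                 # D x N predictions
        grad = (H - B) @ V.T / N               # data term of (20)
        Theta -= alpha * grad + (alpha * delta / N) * Theta
    return Theta

def ssph_code(Theta, v):
    """Formula (21): round the sigmoid outputs at 0.5 to get hash bits."""
    return (sigmoid(Theta @ v) > 0.5).astype(np.uint8)
```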
5. The random structure conformal Hash information retrieval method according to claim 1, characterized in that computing the Hamming distance between the training data and test samples by an XOR operation described in step 4 to obtain the final result refers to the following computational complexity analysis:
the computational complexity of the random structure conformal Hash retrieval method (SSPH) comprises 3 parts: the first part computes the NMF, with complexity O(NMKD), where N is the size of the database, M and D are the dimensions of the high-dimensional and low-dimensional data respectively, and K is the number of classes in the database; the second part computes the cost function of the objective (formula 6), with complexity O(N^2 D); the third part is the logistic regression process, with complexity O(N D^2); the overall computational complexity of SSPH is therefore O(tNMKD + N^2 D + tN D^2), where t is the number of iterations.
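The Hamming-ranking step of claim 1, step 4 can be sketched with XOR as follows; the helper names are invented and the codes are assumed to be stored one bit per array element:

```python
import numpy as np

def hamming_distance(a, b):
    """Hamming distance between two equal-length {0,1} vectors via XOR."""
    return int(np.count_nonzero(np.bitwise_xor(a, b)))

def rank_by_hamming(query, database):
    """Return database indices sorted by Hamming distance to `query`;
    `database` is an N x D array of {0,1} codes."""
    dists = np.count_nonzero(np.bitwise_xor(database, query), axis=1)
    return np.argsort(dists, kind="stable")
```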
CN201410604395.6A 2014-10-30 2014-10-30 Random structure conformal Hash information retrieval method Pending CN104376051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410604395.6A CN104376051A (en) 2014-10-30 2014-10-30 Random structure conformal Hash information retrieval method


Publications (1)

Publication Number Publication Date
CN104376051A true CN104376051A (en) 2015-02-25

Family

ID=52554958

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410604395.6A Pending CN104376051A (en) 2014-10-30 2014-10-30 Random structure conformal Hash information retrieval method

Country Status (1)

Country Link
CN (1) CN104376051A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808723A (en) * 2016-03-07 2016-07-27 南京邮电大学 Image retrieval method based on image semantics and visual hashing
CN105843555A (en) * 2016-03-18 2016-08-10 南京邮电大学 Stochastic gradient descent based spectral hashing method in distributed storage
CN106484782A (en) * 2016-09-18 2017-03-08 重庆邮电大学 A kind of large-scale medical image retrieval based on the study of multinuclear Hash
CN106815349A (en) * 2017-01-19 2017-06-09 银联国际有限公司 The temporal filtering method and event filtering method matched based on hash algorithm and canonical
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN110188223A (en) * 2019-06-06 2019-08-30 腾讯科技(深圳)有限公司 Image processing method, device and computer equipment
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN117609488A (en) * 2024-01-22 2024-02-27 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102034085A (en) * 2010-09-27 2011-04-27 山东大学 Video copy detection method based on local linear imbedding
US20110299721A1 (en) * 2010-06-02 2011-12-08 Dolby Laboratories Licensing Corporation Projection based hashing that balances robustness and sensitivity of media fingerprints
CN102819582A (en) * 2012-07-26 2012-12-12 华数传媒网络有限公司 Quick searching method for mass images


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI LIU et al.: "Latent Structure Preserving Hashing", International Journal of Computer Vision *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105808723A (en) * 2016-03-07 2016-07-27 南京邮电大学 Image retrieval method based on image semantics and visual hashing
CN105808723B (en) * 2016-03-07 2019-06-28 南京邮电大学 The picture retrieval method hashed based on picture semantic and vision
CN105843555A (en) * 2016-03-18 2016-08-10 南京邮电大学 Stochastic gradient descent based spectral hashing method in distributed storage
CN105843555B (en) * 2016-03-18 2018-11-02 南京邮电大学 Spectrum hash method based on stochastic gradient descent in distributed storage
CN106484782B (en) * 2016-09-18 2019-11-12 重庆邮电大学 A kind of large-scale medical image retrieval based on the study of multicore Hash
CN106484782A (en) * 2016-09-18 2017-03-08 重庆邮电大学 A kind of large-scale medical image retrieval based on the study of multinuclear Hash
CN106873964A (en) * 2016-12-23 2017-06-20 浙江工业大学 A kind of improved SimHash detection method of code similarities
CN106815349A (en) * 2017-01-19 2017-06-09 银联国际有限公司 The temporal filtering method and event filtering method matched based on hash algorithm and canonical
CN110188223A (en) * 2019-06-06 2019-08-30 腾讯科技(深圳)有限公司 Image processing method, device and computer equipment
CN110188223B (en) * 2019-06-06 2022-10-04 腾讯科技(深圳)有限公司 Image processing method and device and computer equipment
CN116244483A (en) * 2023-05-12 2023-06-09 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN116244483B (en) * 2023-05-12 2023-07-28 山东建筑大学 Large-scale zero sample data retrieval method and system based on data synthesis
CN117609488A (en) * 2024-01-22 2024-02-27 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal
CN117609488B (en) * 2024-01-22 2024-03-26 清华大学 Method and device for searching small-weight code words, computer storage medium and terminal

Similar Documents

Publication Publication Date Title
CN104376051A (en) Random structure conformal Hash information retrieval method
Izakian et al. Anomaly detection and characterization in spatial time series data: A cluster-centric approach
CN105045812B (en) The classification method and system of text subject
Yu et al. Short term wind power prediction for regional wind farms based on spatial-temporal characteristic distribution
CN107871014A (en) A kind of big data cross-module state search method and system based on depth integration Hash
Chen et al. HAPGN: Hierarchical attentive pooling graph network for point cloud segmentation
CN104462196B (en) Multiple features combining Hash information search method
CN111291556B (en) Chinese entity relation extraction method based on character and word feature fusion of entity meaning item
CN104657350A (en) Hash learning method for short text integrated with implicit semantic features
CN105184298A (en) Image classification method through fast and locality-constrained low-rank coding process
CN109284411B (en) Discretization image binary coding method based on supervised hypergraph
CN111125411A (en) Large-scale image retrieval method for deep strong correlation hash learning
US11841839B1 (en) Preprocessing and imputing method for structural data
CN104850533A (en) Constrained nonnegative matrix decomposing method and solving method
Vincent-Cuaz et al. Template based graph neural network with optimal transport distances
CN112749752A (en) Hyperspectral image classification method based on depth transform
CN104318271A (en) Image classification method based on adaptability coding and geometrical smooth convergence
Nugraha et al. Particle Swarm Optimization–Support Vector Machine (PSO-SVM) Algorithm for Journal Rank Classification
CN117251754A (en) CNN-GRU energy consumption prediction method considering dynamic time packaging
Fan et al. Cadtransformer: Panoptic symbol spotting transformer for cad drawings
Klomsae et al. A string grammar fuzzy-possibilistic C-medians
He et al. Classification of metro facilities with deep neural networks
CN106033546A (en) Behavior classification method based on top-down learning
Laptin et al. Shape of basic clusters: using analogues of Hough transform in higher dimensions
CN113516019B (en) Hyperspectral image unmixing method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150225