Vishal M. Patel , Rama Chellappa , in Handbook of Statistics, 2013
2.1 Sparse representation-based classification
In object recognition, given a set of labeled training samples, the task is to identify the class to which a test sample belongs. Following Wright et al. (2009), we briefly describe the use of sparse representations for biometric recognition; however, this framework can be applied to general object recognition problems.
Suppose that we are given L distinct classes and a set of n_k training images for class k. One can extract an N-dimensional vector of features from each of these images. Let X_k = [x_k1, x_k2, …, x_kn_k] be an N × n_k matrix of features from the kth class, where x_kj denotes the feature vector from the jth training image of the kth class. Define a new matrix, or dictionary, X as the concatenation of training samples from all the classes:

X = [X_1, X_2, …, X_L],  of size N × n with n = n_1 + n_2 + ⋯ + n_L.

We consider an observation vector y of unknown class as a linear combination of the training vectors:

(5) y = Σ_k Σ_j α_kj x_kj,

with coefficients α_kj. The above equation can be written more compactly as

(6) y = Xα,

where

(7) α = [α_11, α_12, …, α_1n_1, α_21, …, α_Ln_L]ᵀ

and (·)ᵀ denotes the transposition operation. We assume that, given sufficient training samples X_k of the kth class, any new test image y that belongs to the same class will lie approximately in the linear span of the training samples from class k. This implies that most of the coefficients in (7) not associated with class k will be close to zero. Hence, α is a sparse vector.
In order to represent an observed vector y as a sparse vector α, one needs to solve the system of linear equations (6). Typically N < n, and hence the system (6) is underdetermined and has no unique solution. As mentioned earlier, if α is sparse enough and the dictionary X satisfies certain properties, then the sparsest α can be recovered by solving the following optimization problem:

(8) α̂ = arg min ‖α‖₁ subject to y = Xα.

When noisy observations are given, Basis Pursuit DeNoising (BPDN) can be used to approximate α:

(9) α̂ = arg min ‖α‖₁ subject to ‖y − Xα‖₂ ≤ ε,

where we have assumed that the observations are of the following form:

(10) y = Xα + η,

with ‖η‖₂ ≤ ε.
Given an observation vector y from one of the L classes in the training set, one can compute its coefficients α̂ by solving either (8) or (9). One can perform classification based on the fact that high values of the coefficients will be associated with the columns of X from a single class. This can be done by comparing how well the different parts of the estimated coefficient vector α̂ represent y. The minimum of the representation error, or residual error, can then be used to identify the correct class. The residual error of class k is calculated by keeping the coefficients associated with that class and setting the coefficients not associated with class k to zero. This can be done by introducing a characteristic function, Π_k, that selects the coefficients associated with the kth class as follows:

(11) Π_k(α) = δ_k ⊙ α.

Here ⊙ denotes elementwise multiplication and the vector δ_k has value one at locations corresponding to class k and zero for other entries. The class k̂ associated with an observed vector y is then declared as the one that produces the smallest approximation error:

(12) k̂ = arg min_k r_k(y),  where r_k(y) = ‖y − X Π_k(α̂)‖₂.
The sparse representation-based classification method is summarized in Algorithm 1.
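For concreteness, a minimal Python sketch of this classification rule is given below. It is not the chapter's Algorithm 1 verbatim: scikit-learn's Lasso is used here as a stand-in solver for the ℓ1 problems (8)-(9), and all function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso  # stand-in for an l1 / BPDN solver


def src_classify(X, labels, y, lam=0.01):
    """Sparse representation-based classification (sketch).

    X      : (N, n) dictionary whose columns are training feature vectors
    labels : (n,) array with the class label of each column of X
    y      : (N,) test feature vector
    """
    # Sparse coefficient estimate alpha_hat (approximating problem (9)).
    alpha_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(X, y).coef_

    residuals = {}
    for k in np.unique(labels):
        # Keep only the coefficients of class k, zero out the rest (Eq. (11)).
        alpha_k = np.where(labels == k, alpha_hat, 0.0)
        # Class-wise reconstruction residual r_k(y) (Eq. (12)).
        residuals[k] = np.linalg.norm(y - X @ alpha_k)

    k_hat = min(residuals, key=residuals.get)  # class with the smallest residual
    return k_hat, alpha_hat, residuals
```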
For classification, it is important to be able to detect and then reject test samples of poor quality. To decide whether a given test sample has good quality, one can use the notion of the Sparsity Concentration Index (SCI) proposed in Wright et al. (2009). The SCI of a coefficient vector α is defined as

(13) SCI(α) = (L · max_k ‖Π_k(α)‖₁ / ‖α‖₁ − 1) / (L − 1).

SCI takes values between 0 and 1. SCI values close to 1 correspond to the case where the test image can be approximately represented by using only images from a single class; such a test vector has enough discriminating features of its class and hence has high quality. If SCI(α̂) ≈ 0, the coefficients are spread evenly across all classes, so the test vector is not similar to any of the classes and is of poor quality. A threshold can be chosen to reject images of poor quality: for instance, a test image can be rejected if SCI(α̂) < τ and otherwise accepted as valid, where τ is some chosen threshold between 0 and 1.
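Continuing the sketch above, the SCI of Eq. (13) and the corresponding rejection rule can be computed as follows (τ is a user-chosen threshold, not a value prescribed by the chapter):

```python
import numpy as np


def sparsity_concentration_index(alpha_hat, labels):
    """SCI of Eq. (13): close to 1 when the l1 mass of alpha_hat sits in one class."""
    classes = np.unique(labels)
    L = len(classes)
    total = np.sum(np.abs(alpha_hat))
    largest = max(np.sum(np.abs(alpha_hat[labels == k])) for k in classes)
    return (L * largest / total - 1.0) / (L - 1.0)


def accept_test_sample(alpha_hat, labels, tau=0.5):
    """Reject the test sample when SCI(alpha_hat) < tau."""
    return sparsity_concentration_index(alpha_hat, labels) >= tau
```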
ASYMPTOTICALLY UNIMPROVABLE SOLUTION OF MULTIVARIATE PROBLEMS
VADIM I. SERDOBOLSKII , in Multiparametric Statistics, 2008
Problem Setting
Let x be an observation vector from an n-dimensional population with expectation Ex = 0, with finite fourth moments of all components, and with a nondegenerate covariance matrix Σ = cov(x, x). A sample {x_m} of size N is used to calculate the mean vector x̄ and the sample covariance matrix C.
We use the following asymptotic setting. Consider a hypothetical sequence of estimation problems indexed by the dimension n, in which, for each n, a population has the covariance matrix Σ = cov(x, x), a sample of size N is drawn from that population, and an estimator of Σ⁻¹ is calculated as a function of the matrix C (we do not write the index n for the arguments of this estimator). Our problem is to construct the best such statistics.
We begin by considering the simpler problem of improving estimators of Σ⁻¹ by introducing a scalar multiple of C⁻¹ (shrinkage estimation) for normal populations. Then, we consider a wide class of estimators for a wide class of populations.
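As a toy numerical illustration of why such a scalar correction can help (our own example, not the book's derivation), compare the plain inverse C⁻¹ with a shrunken version α·C⁻¹ as estimators of Σ⁻¹ = I when n/N is not small; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 100                                   # dimension and sample size, n/N = 0.5
Sigma_inv = np.eye(n)                            # true covariance is I, so Sigma^{-1} = I

X = rng.standard_normal((N, n))                  # sample from N(0, I)
C = np.cov(X, rowvar=False)                      # sample covariance matrix
C_inv = np.linalg.inv(C)


def frobenius_loss(G):
    """Squared Frobenius distance between an estimator G and the true Sigma^{-1}."""
    return np.linalg.norm(G - Sigma_inv) ** 2


for alpha in (1.0, 1.0 - n / N):                 # plain inverse vs. a scalar-shrunken inverse
    print(f"alpha = {alpha:.2f}:  loss = {frobenius_loss(alpha * C_inv):.1f}")
```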
Jaroslav Hájek , ... Pranab K. Sen , in Theory of Rank Tests (Second Edition), 1999
PROBLEMS AND COMPLEMENTS TO CHAPTER 10
Section 10.1.
1.
Define the aligned observation vectors, i = 1,…,n, as in Subsection 10.1.1, and denote their joint distributions under H0 and Kn by Pn and Qn, respectively. Use LeCam's lemmas to verify that {Qn} is contiguous to {Pn}.
2.
Use the contiguity result in the preceding problem along with LeCam's third lemma, and extend the joint asymptotic normality of the aligned rank statistics to contiguous alternatives in Kn.
3.
Derive the convergence in formula (10.1.1.14).
4.
Show that ℒN, defined by (10.1.1.9), has asymptotically, under Kn, a non-central χ2 distribution with p−1 degrees of freedom and non-centrality parameter (10.1.1.17).
5.
Prove that the (p − 1)-multiple of the classical ANOVA (variance-ratio) test statistic has asymptotically, under Kn, a non-central χ2 distribution with p − 1 degrees of freedom and non-centrality parameter (10.1.1.18).
6.
Prove the inequality (10.1.1.20), and show that the equality sign holds in it only when …, except on a set of null measure.
7.
Prove the asymptotic linearity result (10.1.2.9).
8.
Show that is asymptotically normal, as stated in (10.1.2.11).
9.
Show that the representation (10.1.2.14) holds, and that it implies the asymptotic normality in (10.1.2.11).
10.
Prove the asymptotic result (10.1.2.15).
11.
Prove the asymptotic result in relation (10.1.2.16).
12.
Using the projection method, show (10.1.2.17).
13.
Consider the setup of Subsection 10.1.2, and test H0 : β1 = 0 against H1 : β1 ≠ 0 by means of the statistic given by (10.1.2.21). If H0 does not hold, i.e., β1 ≠ 0, show that the statistic, divided by n, converges in probability to a positive constant, so that the statistic is Op(n).
14.
Verify that {qn}, the joint density of {Yi; 1 ≤ i ≤ n} under contiguous alternatives {Hn} given by (10.1.2.22), is contiguous to {pn}, the joint density under H0.
15.
Show that, under {Hn}, is asymptotically normal with parameters given before and in (10.1.2.24).
16.
Prove that, under {Hn}, the statistic has asymptotically the non-central χ2 distribution as given in (10.1.2.25) and (10.1.2.20).
17.
Having still the setup of Subsection 10.1.2, prove that the asymptotically optimal aligned rank test uses the score function ϕ0 given by (10.1.2.27).
18.
(Continuation) Verify the equality (10.1.2.29), i.e., …, for 0 ≤ u ≤ 1.
VADIM I. SERDOBOLSKII , in Multiparametric Statistics, 2008
Limit Spectra
We investigate here the limiting behavior of spectral functions of the matrices S and C under the increasing-dimension asymptotics. Consider a sequence of problems

(11) (𝔓_n), n = 1, 2, …,

in which spectral functions of the matrices C and S are investigated over samples of size N from populations with cov(x, x) = Σ (we do not write out the subscript n in the arguments of 𝔓_n). For each problem 𝔓_n, we consider the empirical spectral distribution functions

FnS(u) = n⁻¹ #{i : λ_i ≤ u},  FnC(u) = n⁻¹ #{i : μ_i ≤ u},  u ≥ 0,

together with hnS(z) = ∫ (1 − zu)⁻¹ dFnS(u) and hnC(z) = ∫ (1 − zu)⁻¹ dFnC(u), where λ_i and μ_i are the eigenvalues of S and C, respectively, i = 1, 2, …, n.
We restrict (11) by the following conditions.
A.
For each n, the observation vectors in 𝔓_n are such that Ex = 0 and the fourth moments of all components of x exist.
B.
The parameter M does not exceed a constant c0, where c0 does not depend on n. The parameter γ vanishes as n → ∞ in the sequence (11).
C.
In the sequence (11), n/N → λ > 0.
D.
In the sequence (11), for each n, the eigenvalues of the matrices Σ are located on a segment [c1, c2], where c1 > 0 and c2 does not depend on n, and FnΣ(u) → FΣ(u) as n → ∞ for almost all u ≥ 0.
Corollary (of Theorem 2.1). Under Assumptions A-D, for any z outside the half-axis z > 0, the limit h(z) of hnC(z) exists and is such that

(12) h(z) = ∫ (1 − z s(z) u)⁻¹ dFΣ(u),  where s(z) = 1 − λ + λ h(z),

and, for each such z, hnS(z) converges to the same limit h(z).
Let us investigate the analytical properties of solutions to (12).
Theorem 3.3. If h(z) satisfies (12), c1 > 0, λ > 0, and λ ≠ 1, then
1.
|h(z)| ≤ α(z), and h(z) is regular near any point z outside the half-axis z > 0;
2.
for any v = Re z > 0 such that v < v1 or v > v2, we have Im h(z) → 0 as Im z → +0;
3.
if v1 ≤ v ≤ v2, then (Im h(v + iε))² ≤ (c1vλ)⁻¹ + Ω, where Ω → 0 as ε → +0;
4.
if v = Re z > 0, then s(−v) ≥ (1 + c2λ|v|)⁻¹;
5.
if |z| → ∞ on the main sheet of the analytical function h(z), then we have
if 0 < λ < 1, then zh(z) = −(1 − λ)⁻¹Λ₋₁ + O(|z|⁻¹),
if λ = 1, then zh²(z) = −Λ₋₁ + O(|z|^(−1/2)),
if λ > 1, then zs(z) = −β0 + O(|z|⁻¹),
where β0 is a root of the equation

∫ (1 + β0u)⁻¹ dFΣ(u) = 1 − λ⁻¹.
Proof. The existence of the solution to (12) follows from Theorem 2.1. Suppose Im z > 0; then |h(z)| ≤ α = α(z). To be concise, denote h = h(z). For all u > 0 and z outside the beam z > 0, the quantity |1 − zsu|⁻¹ is bounded. Differentiating h(z) in (12), we prove the regularity of h(z). Define

b1 = ∫ u |1 − zs(z)u|⁻² dFΣ(u)

(and similarly the quantity b2 used below). Let us rewrite (12) in the form

(13)

It follows that

….

Dividing by b2, we use the inequality …. Fix some v = Re z > 0 and let ε = Im z → +0. It follows that the product ….

Suppose that Im h does not tend to 0 (v is fixed). Then, there exists a sequence {zk} such that, for zk = v + iεk, h = h(zk), s = s(zk), we have Im h → a, where a ≠ 0. For these zk, we obtain …. We apply the Cauchy-Bunyakovskii inequality to (5). It follows that …. We obtain that |h − 1|² ≤ λ⁻¹ + o(1) as εk → +0. It follows that |s − 1|² ≤ λ + o(1). So the values s are bounded for {zk}. On the other hand, it follows from (12) that Im h = b1 Im(zs). We find that … as Im z → 0. But Im h → a ≠ 0 for {zk}. It follows that …. Combining this with the inequality above, we find that …. Note that b1 is finite for {zk}, and …. Substituting the bounds for |s|, we obtain that … as εk → +0. We can conclude that Im h → 0 for any positive v outside the interval [v1, v2]. This proves the second statement of our theorem.

Now suppose v1 ≤ v ≤ v2. From (12), we obtain the inequality (Im h)² < (c1vλ)⁻¹. But h is bounded. It follows that the quantity (Im h)² ≤ (c1vλ)⁻¹. The third statement of our theorem is proved.

Further, let v = Re z > 0. Then, the functions h and s are real and non-negative. We multiply both parts of (12) by λ. It follows that …. We obtain s ≥ (1 + c2λ|z|)⁻¹.

Let us prove the fifth statement of the theorem. Let λ < 1. For real z → −∞, the real value of 1 − zsu in (12) tends to infinity. Consequently, h → 0 and s → 1 − λ. For sufficiently large |Re z|, we have … where …. We conclude that … for real z < 0 and for any z in the domain of regularity as |z| → ∞, in view of the properties of the Laurent series. Now let λ = 1. Then h = s. From (12), we obtain that h → 0 as z → −∞ and h² = Λ₋₁|z|⁻¹ + O(|z|⁻²). Now suppose that λ > 1, z = −t < 0, and t → ∞. Then, by Lemma 3.6, we have s ≥ 0, h ≥ 1 − 1/λ, and s → 0. Equation (12) implies ts → β0, as is stated in the theorem formulation. This completes the proof of Theorem 3.3.
Remark 7. Under Assumptions A-D, for each u ≥ 0, the limit exists

(14) F(u) = lim FnS(u) = lim FnC(u) as n → ∞.

Indeed, to prove the convergence, it suffices to cite Corollary 3.2.1 from [22], which states the convergence of {hnS(z)} and {hnC(z)} almost surely. By Lemma 3.5, both these sequences converge to the same limit h(z). To prove that the limits of FnS(u) and FnC(u) coincide, it suffices to prove the uniqueness of the solution to (12). This can be readily done by performing the inverse Stieltjes transformation.
Theorem 3.4. Under Assumptions A-D,
1.
if λ = 0, then F(u) = FΣ(u) almost everywhere for u ≥ 0;
2.
if λ > 0 and λ ≠ 1, then F(0) = F(u1 − 0), where u1 and u2 denote the endpoints of the continuous part of the limit spectrum of C, and c1 and c2 are the bounds of the limit spectrum of Σ;
3.
if λ > 0, λ ≠ 1, and u > 0, then the derivative F′(u) of the function F(u) exists.
Proof. Let λ = 0. Then s(z) = 1. In view of (12), we have

h(z) = ∫ (1 − zu)⁻¹ dFΣ(u).

At the continuity points of FΣ(u), the derivative

FΣ′(u) = lim π⁻¹ Im[z⁻¹ h(z⁻¹)] as ε → +0, where z = u − iε.

Let λ > 0. By Theorem 2.2, for u < u1 and for u > u2 (note that u1 > 0 if λ ≠ 1), the values Im[(u − iε)⁻¹ h((u − iε)⁻¹)] → 0 as ε → +0. But we have

(15) F′(u) = π⁻¹ lim Im[(u − iε)⁻¹ h((u − iε)⁻¹)] as ε → +0.

It follows that F′(u) exists and F′(u) = 0 for 0 < u < u1 and for u > u2. The points of increase of F(u) can be located only at the point u = 0 or on the segment [u1, u2]. If λ < 1 and |z| → ∞, we have … and, consequently, F(0) = 0. If λ > 1 and |z| → ∞, then … and F(0) = 1 − λ⁻¹. The second statement of our theorem is proved.

Now, let z = v + iε, where v > 0 is fixed and ε → +0. Then, using (12), we obtain that Im h = b1 Im(zs). Obviously,

….

If Im h remains finite, then b1 → (λv)⁻¹. Performing the limit transition in (15), we prove the last statement of the theorem.
Theorem 3.5. If Assumptions A-D hold and 0 < λ < 1, then for any complex z, z′ outside of the half-axis z > 0, we have

|h(z) − h(z′)| ≤ c3 |z − z′|^ζ,

where c3 and ζ > 0 do not depend on z and z′.
Proof. From (12), we obtain

….

By definition, the function h(z) is differentiable for each z outside the segment G = [v1, v2], v1 > 0. Denote the δ-neighborhood of the segment G by Gδ. If z is outside of Gδ, then the derivative h′(z) exists and is uniformly bounded. It suffices to prove our theorem for v ∈ G1, where G1 = Gδ − {z : Im z = 0}. Choose …. Then δ1 < |z| < δ2, where δ2 does not depend on z. We estimate the absolute value of the derivative h′(z). For Im z ≠ 0, from (15), by differentiation, we obtain

(16)

where X1 = (1 − zs(z)u) ≠ 0. Denote … and let α with subscripts denote constants not depending on z. The right-hand side of (16) is not greater than α1b1 for z ∈ G1, and therefore |h′(z)| < α2b1|ϕ(z)|⁻¹.

We consider two cases. Denote α3 = (2δ2c2)⁻¹.

At first, let Re s(z) = s0 < α3. Using the relation h1 = b1 Im(zs), we obtain that the quantity −Im ϕ(z) equals

….

In the integrand here, we have z0 > 0, 1 − z0s0 > ½, z1h1 < 0. From the Cauchy-Bunyakovskii inequality, it follows that

….

Hence |Im ϕ(z)| > b1h1 and |h′(z)| …. Let

….

Define p = λ⁻¹z0|z|⁻² − b1. We have

….

Here |h1| < α4, z0 ≥ δ1 ≥ 0, s0 ≥ α3 > 0, and we obtain that p > 0 if z1 < α6, where α6 = α3α5/(λα4). If z ∈ G1 and z1 > α6, then the Hölder inequality follows from the existence of a uniformly bounded derivative of the analytic function h(z) in a closed domain.

Now let z ∈ G1, z1 < α6, p > 0, and s0 > α3 > 0. Then |h′(z)| ≤ α7b1|Re ϕ(z)|⁻¹, where

….

Substituting b1/Im(zs(z)) = h1 and taking into account that s0 > 0, we obtain that …. Thus, for v ∈ Gδ and 0 < z1 < α6, for any s0, it follows that |h′(z)| < α8 max(…). Calculating the derivative along the vertical line, we obtain the inequality … whence … if Im z · Im z′ > 0. The Hölder inequality for h1 = Im h(z) with ζ = 1/3 follows. This completes the proof of Theorem 3.5.
Example. Consider the limit spectra of matrices Σ of a special form, the "ρ-model" considered first in [63]. It is of special interest since it admits an analytical solution of the dispersion equation (12). For this model, the limit spectrum of Σ is located on a segment [c1, c2] whose endpoints depend on σ² and ρ. Its limit spectral density is

….

The moments are

….

If ρ > 0, the integral η(z) = ∫ (1 − zu)⁻¹ dFΣ(u) can be evaluated in closed form in terms of k = σ²(1 − ρ). The function η = η(z) satisfies the equation ρη² + (kz − ρ − 1)η + 1 = 0. The equation h(z) = η(zs(z)) can be transformed to the equation (h − 1)(1 − ρh) = kzhs, which is quadratic with respect to h = h(z), s = 1 − λ + λh. If λ > 0, its solution is

….

The moments Mk = (k!)⁻¹ h^(k)(0) for k = 0, 1, 2, 3 are

….

Differentiating the functions of the inverse argument, we find, in particular, that Λ₋₁ = k⁻¹, Λ₋₂ = k⁻²(1 + ρ), M₋₁ = k⁻¹(1 − λ)⁻¹, and M₋₂ = k⁻²(ρ + λ(1 − ρ))(1 − λ)⁻³. The continuous limit spectrum of the matrix C is located on the segment [u1, u2], where

…,

and it has the density

….

If λ > 1, then the function F(u) has a jump of 1 − λ⁻¹ at the point u = 0. If λ = 0, then F(u) = FΣ(u) has the form of a unit step at the point u = σ². The density f(u) satisfies the Hölder condition with ζ = 1/2.

In the special case when Σ = I and ρ = 0, we obtain the limit spectral density

f(u) = (2πλu)⁻¹ √((u2 − u)(u − u1))  for u1 < u < u2, where u1,2 = (1 ∓ √λ)².

This "semicircle" law of spectral density was first found by Marchenko and Pastur [43].
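As a quick numerical check of this special case (our own illustration, assuming numpy), the eigenvalues of a sample covariance matrix built from N(0, I) data indeed concentrate on the limit segment [u1, u2]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 400                              # dimension and sample size, lambda = n/N = 0.5
lam = n / N

X = rng.standard_normal((N, n))              # sample from N(0, I): Sigma = I, rho = 0
C = X.T @ X / N                              # sample covariance matrix (known zero mean)
eigenvalues = np.linalg.eigvalsh(C)

u1, u2 = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2   # limit spectrum endpoints
print(f"empirical eigenvalue range: [{eigenvalues.min():.3f}, {eigenvalues.max():.3f}]")
print(f"limit spectrum segment:     [{u1:.3f}, {u2:.3f}]")
```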
4.02.4 Handling Future Observations with Missing Data
Missing measurements are a frequent occurrence in process industries. Therefore, the new observation vector xnew (Figure 4) may have a few elements missing. Latent variable methods that model the process space (PCA, PCR, and PLS) make it possible to infer the corresponding score values of xnew by using the available elements in the vector together with the model built from the training data set.
The fact that process variables are highly correlated and that there is redundancy in process data (i.e., many variables are affected by the same event) makes this possible. Redundancy is beneficial for handling missing data. More details on methods for treatment of missing data in regression can be found in Chapter 3.06.
A variety of algorithms have been suggested32,33 to handle missing data, with different degrees of complexity: the trimmed score method (TRI), single-component projection (SCP), projection to the model plane (PMP) using PLS or ordinary least squares (PMPPLS, PMPOLS), iterative imputation of missing data (II), a method based on the minimization of the squared prediction error (SPE), conditional mean replacement (CMR), trimmed score regression (TSR), and known data regression (KDR).
Suppose that the observation vector is partitioned as xᵀ = [x#ᵀ x*ᵀ], where without loss of generality we assume that x# is the vector of missing observations and x* the vector of known ones. (Following this convention, p* and P* are the loadings corresponding to the known x*.) The methods can be seen as different ways to impute values for the missing variables vector x#. By setting the missing values equal to their expected mean value (i.e., for mean-centered data x# = 0), we have the TRI method.33
SCP is the simplest but also the poorest performing approach: it calculates each of the scores independently and sequentially as t̂i = pi*ᵀz*/(pi*ᵀpi*), where z* is x* deflated by the first i − 1 components.
Nelson et al.32 showed that superior results can be obtained by calculating all of the scores at once by projecting onto the hyperplane formed by the P* vectors. In the PMP method, the known x* vector is regressed onto the matrix P*. Sometimes, depending on which measurements are missing, some of the columns of P* may become highly correlated and P*ᵀP* becomes ill-conditioned. It was suggested32 to use PLS, PCR, or regularized least squares regression for the projection.
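A minimal sketch of this projection-to-the-model-plane idea is shown below (our own code, with plain least squares standing in for the PLS/PCR/regularized regression recommended for ill-conditioned cases; numpy assumed):

```python
import numpy as np


def pmp_scores(x, missing_mask, P):
    """Estimate the scores of an observation with missing entries (PMP-style sketch).

    x            : (K,) preprocessed observation; entries at missing positions are ignored
    missing_mask : (K,) boolean array, True where the measurement is missing
    P            : (K, A) loading matrix of the fitted latent variable model
    """
    P_star = P[~missing_mask, :]            # loadings of the known variables
    x_star = x[~missing_mask]               # known measurements
    # Regress x* onto P*; lstsq tolerates an ill-conditioned P*^T P*
    t_hat, *_ = np.linalg.lstsq(P_star, x_star, rcond=None)
    x_hat = P @ t_hat                       # reconstruction, including the imputed x#
    return t_hat, x_hat
```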
CMR32 and TSR33 use the known score matrix T from the training data together with the loadings (P*) and the available measurements (x*) to estimate the score vector. A singularity problem that may arise in CMR may be solved by a procedure suggested by Nelson et al.,32 where the estimated score vector is calculated in two steps: first, a parameter β is computed using PLS from T = X*β, where T and X*, respectively, represent the score matrix and those columns from the training data set corresponding to known values; then β, along with the currently available data vector x*, is used to compute an estimate of the score vector.
In iterative imputation, one may use an initial estimate of the final scores (say, the one given by the SCP method) to forecast the missing values x̂# (using their corresponding loadings), then build the completed vector from x̂# and x*, recalculate a score estimate, and iterate until convergence.
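A correspondingly simple sketch of the iterative imputation loop, assuming orthonormal PCA loadings P and mean-centered data (function and variable names are ours):

```python
import numpy as np


def iterative_imputation_scores(x, missing_mask, P, n_iter=100, tol=1e-8):
    """Iterative imputation (II) sketch: impute x#, re-estimate the scores, repeat."""
    x_work = np.where(missing_mask, 0.0, x)            # start missing entries at the mean (0)
    t_hat = np.zeros(P.shape[1])
    for _ in range(n_iter):
        t_new = P.T @ x_work                           # scores of the completed vector
        x_work = np.where(missing_mask, P @ t_new, x)  # re-impute only the missing part
        if np.linalg.norm(t_new - t_hat) < tol:
            break
        t_hat = t_new
    return t_hat
```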
Arteaga and Ferrer33 presented an extensive study of the various methods. Iterative imputation and the SPE method are equivalent to PMP; KDR is equivalent to CMR. They concluded that, based on the best prediction of the missing values, KDR is statistically superior to the other methods. TSR is practically equivalent to KDR and has the advantage that a much smaller matrix needs to be inverted. Additionally, TSR is statistically superior to the PMP method.
Before the system is implemented online, there should be a plan for the operators as to how to respond if the values of several variables stop being recorded. For example, if there are three thermocouples in a reactor and one fails, common sense dictates that we can afford to continue the monitoring scheme. In contrast, if there is only one sensor for a variable uncorrelated with any other, the value of this variable cannot be assessed from the rest of the variables in the system; therefore, depending on the importance of this variable, one may not be able to rely on the monitoring scheme until the failed sensor is replaced. This idea was treated quantitatively by Nelson52 and Nelson et al.,53 who analyzed the uncertainty that missing measurements introduce into the predicted values of Hotelling's T2 and the SPE. Rather than representing an object with missing measurements by a single point, an estimate of the uncertainty regions in the score, Hotelling's T2, and SPE spaces arising from the missing measurements is provided. They suggested measures to distinguish between situations where model performance will continue to be acceptable and situations where it will be unacceptable, in which case, if the missing measurements cannot be recovered, the application must be shut down.
Missing data methods have found their way into industrial applications. In their industrial perspective on implementing online applications of multivariate statistics, Miletic et al.2 emphasize that missing data handling is a necessary feature of both the offline modeling and the online systems and report using the methods proposed by Nelson et al.32
Parameter Estimation of Chaotic Systems Using Density Estimation of Strange Attractors in the State Space
Yasser Shekofteh , ... Sajad Jafari , in Recent Advances in Chaotic Systems and Synchronization, 2019
5.1 The GMM of the Chaotic System
The chaotic system (1) has three variables in the state space, so the observation vector of its attractor is formed as v = [x, y, z], and we must select D = 3 as the state-space dimension in Eq. (2). To generate the attractor points of the chaotic system (1) as real data, its model was simulated with parameters a = 1.0 and b = 1.0 by a fourth-order Runge-Kutta method with a step size of 10 ms [27,28]. For the training data of the first phase, a set of sequential samples of system (1) comprising 100,000 samples (equal to a 1000 s time length) was recorded, with the initial conditions of system (1) set to (−0.10, −5.05, −6.00). Here, we assume that this recorded training data must lead us to estimate the unknown parameters a and b of the chaotic system (1) by minimizing the GMM-based cost function.
Using obtained training data from the chaotic system, we can learn a GMM in order to model the geometry of the attractor in the state space. In other words, the GMM computation fits a parametric model to the distribution of the attractor in the state space. Fig. 5 shows the attractor of the chaotic system (1) in a three-dimensional state space along with its GMM modeling using M = 64 Gaussian components. In this figure, every three-dimensional ellipsoid corresponds to one of the Gaussian components.
Fig. 5. Plot of the chaotic attractor of the system (1) and its GMM modeling with M = 64 components in the 3D state space. Here, the parameters of the system (1) are set to a = 1.0 and b = 1.0.
As can be seen from Fig. 5, the Gaussian components attempt to cover the attractor in the state space. To show the effect of the number of Gaussian mixtures, in Fig. 6, the attractor of the chaotic system (1) and its GMM models are shown for different values of M = 16, 32, 48, and 64.
Fig. 6. Plot of the chaotic attractor of the system (1) and its GMM modeling with M = 16, 32, 48, and 64 components in the 3D state space.
As can be seen from Fig. 6, when we increase the number of Gaussian components, more details of the trajectory of the chaotic attractor are covered by the added components. In these experiments, the best GMM modeling of the attractor is obtained with M = 64, which gives a precise model of the chaotic attractor. Thus, by increasing the number of Gaussian components, the GMM can capture more of the complexity of the given time series. A higher value of M can improve the performance of the cost function, but it also increases the computational cost and may lead to overfitting problems.
In Fig. 7, information criteria such as AIC, BIC, and the negative log-likelihood are considered for the GMM selection problem. The figure shows that M = 64 is a good choice for the number of GMM components because it minimizes these criteria.
Fig. 7. Plot of the information criteria values to select the best GMM.
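In outline, the GMM fits of Figs. 5-7 and the model-selection step can be reproduced with scikit-learn's GaussianMixture (a sketch with our own variable names; the attractor points would come from simulating system (1), here replaced by placeholder data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the (n_samples, 3) array of [x, y, z] attractor points;
# in the chapter these come from simulating system (1) with a = b = 1.0.
attractor_points = np.random.default_rng(0).standard_normal((20_000, 3))

criteria = {}
for M in (16, 32, 48, 64):
    gmm = GaussianMixture(n_components=M, covariance_type="full", random_state=0)
    gmm.fit(attractor_points)
    criteria[M] = {
        "AIC": gmm.aic(attractor_points),
        "BIC": gmm.bic(attractor_points),
        "neg_loglik": -gmm.score(attractor_points) * len(attractor_points),
    }

best_M = min(criteria, key=lambda m: criteria[m]["BIC"])  # component count minimizing BIC
print(best_M, criteria[best_M])
```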
G.J. McLachlan , in Comprehensive Chemometrics, 2009
2.30.18 Mixed Feature Data
We consider now the case where some of the feature variables are discrete. That is, the observation vector yj on the jth entity to be clustered consists of p1 discrete variables, represented by the subvector y1j, in addition to p2 continuous variables, represented by the subvector y2j (j = 1, …, n). The ith component density of the jth observation yj can then be written as

(84) fi(yj) = fi(y1j) fi(y2j | y1j).
The symbol fi is being used generically here to denote a density where, for discrete random variables, the density is really a probability function.
In discriminant and cluster analyses, it has been found that it is reasonable to proceed by treating the discrete variables as if they are independently distributed within a class or cluster. This is known as the NAIVE assumption.49,50 Under this assumption, the ith component-conditional density of the vector y1j of discrete features is given by
(85) fi(y1j) = ∏v=1..p1 fiv(y1vj),
where fiv (y1vj) denotes the ith component-conditional density of the vth discrete feature variable y1vj in y1j.
If y1v denotes one of the distinct values taken on by the discrete variable y1vj, then under Equation (85) the (k + 1)th update of fiv(y1v) is
(86) fiv(k+1)(y1v) = [ c1 + Σj τi(yj; Ψ(k)) δ[y1vj, y1v] ] / [ c2 + Σj τi(yj; Ψ(k)) ],

where τi(yj; Ψ(k)) denotes the current posterior probability of ith component membership for the jth observation, δ[y1vj, y1v] = 1 if y1vj = y1v and is zero otherwise, and Ψ(k) is the current estimate of the vector of all the unknown parameters, which now include the probabilities for the discrete variables. In Equation (86), the constants c1 and c2, which are both equal to zero for the maximum likelihood estimate, can be chosen to limit the effect of zero estimates of fiv(y1v) for rare values y1v. One choice is c2 = 1 and c1 = 1/dv, where dv is the number of distinct values in the support of y1vj.49
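A small sketch of the smoothed update in Equation (86) for a single discrete feature within one EM iteration is given below (function and argument names are ours; tau_i stands for the current posterior probabilities of ith component membership):

```python
import numpy as np


def update_discrete_density(tau_i, y1v_col, values, c1=None, c2=1.0):
    """Smoothed update of Equation (86) for the ith component and the vth discrete feature.

    tau_i   : (n,) current posterior probabilities of component i for each observation
    y1v_col : (n,) observed values of the vth discrete feature
    values  : the d_v distinct values the feature can take
    c1, c2  : smoothing constants; c1 = 1/d_v, c2 = 1 limits the effect of zero counts
    """
    d_v = len(values)
    if c1 is None:
        c1 = 1.0 / d_v
    denom = c2 + tau_i.sum()
    return {val: (c1 + tau_i[y1v_col == val].sum()) / denom for val in values}
```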
We can allow for some dependence between the vector y2j of continuous variables and the discrete-data vector y1j by adopting the location model as, for example, in Hunt and Jorgensen.51 With the location model, fi(y2j∣y1j) is taken to be multivariate normal with a mean that is allowed to be different for some or all of the different levels of y1j.
As an alternative to the use of the full mixture model, we may proceed conditionally on the realized values of the discrete feature vector y1j, as in McLachlan and Chang.52 This leads to the use of the conditional mixture model for the continuous feature vector y2j,
(87) f(y2j | y1j) = Σi πi(y1j) fi(y2j | y1j),
where πi(y1j) denotes the conditional probability of ith component membership of the mixture given the discrete data in y1j. A common model for πi(y1j) is the logistic model, under which

πi(y1j) = exp(βi0 + βiᵀy1j) / [1 + Σh exp(βh0 + βhᵀy1j)],  i = 1, …, g − 1,

with the sum over h = 1, …, g − 1, the gth component taken as baseline, and g denoting the number of components.
A.J. Ferrer-Riquelme , in Comprehensive Chemometrics, 2009
1.04.9.2.2 PCA-based MSPC: online process monitoring (Phase II)
Once the reference PCA model and the control limits for the multivariate control charts are obtained, new process observations can be monitored online. When a new observation vector zi is available, after preprocessing it is projected onto the PCA model, yielding the scores and the residuals, from which the value of Hotelling's T2 and the value of the SPE are calculated. This way, the information contained in the original K variables is summarized in these two indices, which are plotted in the corresponding multivariate T2 and SPE control charts. No matter what the number of original variables K is, only two points have to be plotted on the charts and checked against the control limits. The SPE chart should be checked first. If the points remain below the control limits in both charts, the process is considered to be in control. If a point is detected to be beyond the limits of one of the charts, then a diagnostic approach to isolate the original variables responsible for the out-of-control signal is needed. In PCA-based MSPC, contribution plots37 are commonly used for this purpose.
Contribution plots can be derived for abnormal points in both charts. If the SPE chart signals a new out-of-control observation, the contribution of each original kth variable to the SPE at this new abnormal observation is given by its corresponding squared residual:
(37) e²new,k = (xnew,k − x̂new,k)²,

where enew,k is the residual corresponding to the kth variable in the new observation and x̂new,k is the prediction of the kth variable xnew,k from the PCA model.
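In outline, the online computation of the two monitoring indices and of the SPE contributions of Equation (37) looks as follows (a sketch with our own names, assuming orthonormal PCA loadings and an already preprocessed observation):

```python
import numpy as np


def monitor_new_observation(x_new, P, score_var):
    """Project a preprocessed observation onto the PCA model and compute monitoring indices.

    x_new     : (K,) mean-centered and scaled observation
    P         : (K, A) loading matrix with orthonormal columns
    score_var : (A,) variances of the scores estimated from the training data
    """
    t_new = P.T @ x_new                          # scores
    x_hat = P @ t_new                            # reconstruction from the model
    e_new = x_new - x_hat                        # residuals
    spe = float(e_new @ e_new)                   # squared prediction error
    t2 = float(np.sum(t_new**2 / score_var))     # Hotelling's T2
    spe_contributions = e_new**2                 # per-variable contributions, Eq. (37)
    return t2, spe, spe_contributions
```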
In case of using the DModX statistic, the contribution of each original kth variable to the DModX is given by44
(38)
where wk is the square root of the explained sum of squares for the kth variable. Variables with high contributions in this plot should be investigated.
If the abnormal observation is detected by the T2 chart, the diagnosis procedure is carried out in two steps: (i) a bar plot of the normalized scores (tnew,a/λa)2 for that observation is plotted and the ath score with the highest normalized value is selected; (ii) the contribution of each original kth variable to this ath score at this new abnormal observation is given by
(39)
where pak is the loading of the kth variable at the ath component. A plot of these contributions is created. Variables on this plot with high contributions but with the same sign as the score should be investigated (contributions of the opposite sign will only make the score smaller). When there are some scores with high normalized values, an overall average contribution per variable can be calculated over all the selected scores.39
Contribution plots are a powerful tool for fault diagnosis. They provide a list of process variables that contribute numerically to the out-of-control condition (i.e., they are no longer consistent with NOCs), but they do not reveal the actual cause of the fault. Those variables and any variables highly correlated with them should be investigated. Incorporation of technical process knowledge is crucial to diagnose the problem and discover the root causes of the fault.
Apart from the T2 and SPE control charts, other charts such as univariate time-series plots of the scores or score scatter plots can be useful (both in Phase I and Phase II) for detecting and diagnosing out-of-control situations and also for improving process understanding.
Dag Tjøstheim , ... Bård Støve , in Statistical Modeling Using Local Gaussian Approximation, 2022
9.2.2 Estimation of the joint dependence function
Let ψ(·, θ) be a parametric family of p-variate density functions, where θ is the parameter vector. Below ψ is taken to be the multinormal. We recall from Chapter 4 that Hjort and Jones (1996) estimate the unknown density f from the sample by fitting ψ locally. The local parameter estimate θ̂ = θ̂(x) maximizes the local likelihood function

(9.3) Ln(θ, x) = n⁻¹ Σi K_B(Xi − x) log ψ(Xi, θ) − ∫ K_B(y − x) ψ(y, θ) dy,

where K is a kernel function that integrates to one and is symmetric about the origin, B is a positive definite matrix of bandwidths, and K_B(y) = |B|⁻¹ K(B⁻¹y), |B| being the determinant. For small bandwidths, the local estimate ψ(x, θ̂(x)) is close to f(x) in the limit, because if the bandwidth matrix B is held fixed and n → ∞, we have

(9.4) Ln(θ, x) → ∫ K_B(y − x) [ f(y) log ψ(y, θ) − ψ(y, θ) ] dy almost surely,

for some value θ0 = θ0(x) of the parameter toward which θ̂(x) converges in probability. However, for finite sample sizes, the curse of dimensionality comes into play. The number of coordinates in θ typically grows with the dimension of x, making the local estimates difficult to obtain at every point in the sample space. One solution might be to increase the bandwidths so that the estimation becomes almost parametric. However, here we propose a different path around the curse, directly exploiting the decomposition (9.2). The first step is to choose a standardized multivariate normal distribution as the parametric family in (9.3) for modeling the joint dependence function in (9.2) locally:
(9.5) ψ(z, R(z)) = (2π)^(−p/2) |R(z)|^(−1/2) exp(−½ zᵀ R(z)⁻¹ z),

where R = R(z) denotes the local correlation matrix with elements ρij(z). Using a univariate local fit, the local Gaussian expectations and variances in (9.5) are constant and equal to zero and one, respectively, reflecting our knowledge that the margins of the unknown density function are standard normal. However, in the p-variate case, as briefly described in Chapter 4.9, the local means μi and variances σi in general depend on z. In this chapter, we make the additional assumption in our p-dimensional local Gaussian approximation that μi(z) ≡ 0 and σi(z) ≡ 1. This is more restrictive than in Chapters 7 and 8, where it was assumed that μi(z) = μi(zi) and σi(z) = σi(zi) in the bivariate case, and this more general assumption was crucial in obtaining the local spectral results in Chapter 8.

With this more restrictive assumption that μi(z) ≡ 0 and σi(z) ≡ 1, we are left with the problem of estimating the pairwise local correlations ρij(z), 1 ≤ i < j ≤ p, in (9.5). Fitting the Gaussian distribution according to the scheme described above results in a local correlation matrix at each point. Specifically, the estimated local correlations are written as ρ̂ij(z) = ρ̂ij(z1, …, zp), i < j, indicating that each parameter depends on all variables. The dependence between the variables is captured in the variation of the parameter estimates in the p-dimensional Euclidean space, and each estimate maximizes the local likelihood function (9.3). However, as mentioned, the quality of the estimates deteriorates quickly with the dimension.
If the data were jointly normally distributed, there would be no dimensionality problem, since the entire distribution would be characterized by the global correlation coefficients between pairs of variables, and their empirical counterparts are easily computed from the data. A local Gaussian fit would then coincide with a global fit and result in estimates of the form ρ̂ij(Zi, Zj), where the arguments indicate which of the transformed observation variables were used to obtain the estimate. This points to a natural simplification, which we may use to estimate the density, analogous to the additive regression model in Chapter 2.7.1. We allow the local correlations to depend on their own variables only:

(9.6) ρij(z) = ρij(zi, zj),  1 ≤ i < j ≤ p.

We could also simplify the estimation problem by estimating the local means and variances as functions of "their own" coordinate only: μi(z) = μi(zi) and σi(z) = σi(zi), but, as mentioned before, here we have chosen the stricter approximation

(9.7) μi(z) ≡ 0,  σi(z) ≡ 1,  i = 1, …, p.
We refer to Section 9.7 for a further discussion of this point.
The resulting estimation is carried out in four steps:
1.
Estimate the marginal distributions using the logspline method (or the empirical distribution function) and transform each observation vector to pseudo-standard normality as described in the previous subsection (a code sketch of this step follows the list).
2.
Estimate the joint density of the transformed data using the Hjort and Jones (1996) local likelihood function (9.3), the standardized normal parametric family (9.5), and the simplifications (9.6) and (9.7). In practice, this means fitting the bivariate version of (9.5) to each pair (Zi, Zj) of the transformed variables. Put the estimated local correlations into the estimated local correlation matrix R̂(z) = {ρ̂ij(zi, zj)}.
3.
Let ẑ = (ẑ1, …, ẑp) with ẑi = Φ⁻¹(F̂i(xi)), and obtain the final estimate of f by replacing R with R̂, and the marginal distribution and density functions with their logspline estimates, in (9.2):

(9.8) f̂(x) = ψ(ẑ, R̂(ẑ)) ∏i=1..p f̂i(xi)/φ(ẑi),

where φ and Φ denote the standard normal density and distribution function, respectively.
4.
Normalize the density estimate so that it integrates to one.
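As an illustration of step 1 (the remaining steps require the local likelihood machinery of (9.3)), the following sketch performs the marginal transformation to pseudo-standard normality, using the rescaled empirical distribution function in place of the logspline estimate (numpy and scipy assumed; names are ours):

```python
import numpy as np
from scipy.stats import norm


def to_pseudo_standard_normal(X):
    """Transform each margin of the (n, p) data matrix X to pseudo-standard normality."""
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1   # ranks 1..n of the jth variable
        u = ranks / (n + 1.0)                         # rescaled ECDF, kept inside (0, 1)
        Z[:, j] = norm.ppf(u)                         # z_ij = Phi^{-1}(F_hat_j(x_ij))
    return Z
```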
The existence of population values corresponding to the estimated local correlations is discussed in the following section. It is clear that assumptions (9.6) and (9.7) represent an approximation for most multivariate distributions. The authors are aware of no distributions other than those possessing the Gaussian copula, or step functions thereof as in Tjøstheim and Hufthammer (2013) or Chapter 4.3, for which (9.6) and (9.7) are exact properties of the true local correlations. In that case the local correlations are constant or stepwise constant in all their variables. The quality of the LGDE thus depends to a large degree on how severe assumptions (9.6) and (9.7) are for the underlying density. The pairwise assumption is hard to interpret beyond general statements about "pairwise dependence structures", so we proceed in this chapter to explore the impact of (9.6) and (9.7) in practice in Section 9.6 and in the subsequent discussion in Section 9.7. Before we do that, we take a closer look at the theoretical foundations of the LGDE.
VADIM I. SERDOBOLSKII , in Multiparametric Statistics, 2008
The Kolmogorov Asymptotics
In 1967, Andrei Nikolaevich Kolmogorov was interested in the dependence of errors of discrimination on the sample size. He solved the following problem. Let x be a normal observation vector, and let x̄1 and x̄2 be sample averages calculated over samples from populations number ν = 1, 2. Suppose that the covariance matrix is the identity matrix. Consider a simplified discriminant function

w(x) = (x̄1 − x̄2)ᵀ (x − (x̄1 + x̄2)/2)

and the classification rule w(x) > 0 against w(x) ≤ 0. This function leads to a probability of errors expressed through two statistics G and D, which are quadratic functions of the sample averages having noncentral χ2 distributions. To isolate the principal parts of G and D, Kolmogorov proposed to consider not one statistical problem but a sequence of n-dimensional discriminant problems in which the dimension n increases along with the sample sizes Nν, so that n → ∞ and n/Nν → λν > 0, ν = 1, 2. Under these assumptions, he proved that the probability of error αn converges in probability,
(7) αn → Φ( −(J + λ2 − λ1) / (2 √(J + λ1 + λ2)) ),
where J is the square of the limiting Euclidean "Mahalanobis distance" between the centers of the populations. This expression is remarkable in that it explicitly shows the dependence of the error probability on the dimension and the sample sizes. This new asymptotic approach was called the "Kolmogorov asymptotics."
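The dependence on n and N is easy to see numerically. The following toy Monte Carlo (our own illustration, not from the chapter) classifies observations from the first population with the simplified discriminant function and, for equal sample sizes, compares the error rate with the limit value Φ(−J/(2√(J + λ1 + λ2))):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N1 = N2 = 50                       # sample sizes
J = 4.0                            # squared distance between the population centers


def error_rate(n, trials=4000):
    """Monte Carlo error rate of the simplified discriminant w(x) in dimension n."""
    mu1 = np.zeros(n)
    mu2 = np.full(n, np.sqrt(J / n))                        # ||mu1 - mu2||^2 = J
    errors = 0
    for _ in range(trials):
        xbar1 = mu1 + rng.standard_normal(n) / np.sqrt(N1)  # sample mean, population 1
        xbar2 = mu2 + rng.standard_normal(n) / np.sqrt(N2)  # sample mean, population 2
        x = mu1 + rng.standard_normal(n)                    # new observation from population 1
        w = (xbar1 - xbar2) @ (x - (xbar1 + xbar2) / 2)
        errors += (w <= 0)
    return errors / trials


for n in (5, 25, 50, 100):
    limit = norm.cdf(-J / (2 * np.sqrt(J + n / N1 + n / N2)))
    print(f"n = {n:3d}: simulated error = {error_rate(n):.3f}, limit formula = {limit:.3f}")
```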
Later, L. D. Meshalkin and the author of this book deduced formula (7) for a wide class of populations under the assumption that the variables are independent and populations approach each other in the parameter space (are contiguous) [45], [46].
In 1970, Yu. N. Blagoveshchenskii and A. D. Deev studied the probability of errors for the standard sample Fisher-Andersen-Wald discriminant function for two populations with unknown common covariance matrix. A. D. Deev used the fact that the probability of error coincides with the distribution function g(x). He obtained an exact asymptotic expansion for the limit of the error probability α. The leading term of this expansion proved to be especially interesting. The limit probability of an error (of the first kind) proved to be
where the factor …, with λ = λ1λ2/(λ1 + λ2), accounts for the accumulation of estimation inaccuracies in the process of the covariance matrix inversion. It was called "the Deev formula." This formula was thoroughly investigated numerically, and good agreement was demonstrated even for moderate n and N.
Note that starting from Deev's formulas, the discrimination errors can be reduced if the rule g(x) > θ against g(x) ≤ θ with θ = (λ1 − λ2)/2 ≠ 0 is used. A. D. Deev also noticed [18] that the half-sum of discrimination errors can be further decreased by weighting summands in the discriminant function.
After these investigations, it became obvious that by keeping terms of the order of n/N, one gains the possibility of using specifically multidimensional effects to construct improved discriminant and other procedures of multivariate analysis. The most important conclusion was that traditional consistent methods of multivariate statistical analysis should be improvable, and that new progress in theoretical statistics is possible, aiming at nearly optimal solutions for fixed samples.
The Kolmogorov asymptotics (increasing-dimension asymptotics [3]) may be considered as a calculation tool for isolating leading terms in the case of large dimension. But the principal role of the Kolmogorov asymptotics is that it reveals specific regularities produced by the estimation of a large number of parameters. In a series of further publications, this asymptotics was used as the main tool for the investigation of essentially many-dimensional phenomena characteristic of high-dimensional statistical analysis. The ratio n/N became an acknowledged characteristic in many-dimensional statistics.
In Section 5.1, the Kolmogorov asymptotics is applied to develop a theory that improves the discriminant analysis of high-dimensional vectors with independent components. The improvement is achieved by introducing appropriate weights for the contributions of the independent variables in the discriminant function. These weights are used to construct an asymptotically unimprovable discriminant procedure. Then, the problem of selecting variables for discrimination is solved, and the optimum selection threshold is found.
But the main success in the development of multiparametric solutions was achieved by combining the Kolmogorov asymptotics with the spectral theory of random matrices, developed independently at the end of the 20th century in another field.