Vishal M. Patel , Rama Chellappa , in Handbook of Statistics, 2013
2.1 Sparse representation-based classification
In object recognition, given a set of labeled training samples, the task is to identify the class to which a test sample belongs. Following Wright et al. (2009), we briefly describe the use of sparse representations for biometric recognition; however, this framework can be applied to general object recognition problems.
Suppose that we are given L distinct classes and a set of n_k training images for class k. One can extract an N-dimensional vector of features from each of these images. Let X_k = [x_k1, x_k2, …, x_kn_k] be an N × n_k matrix of features from the kth class, where x_kj denotes the feature vector from the jth training image of the kth class. Define a new matrix, or dictionary, X as the concatenation of training samples from all the classes:

X = [X_1, X_2, …, X_L],  of size N × n with n = n_1 + n_2 + ⋯ + n_L.

We consider an observation vector y of unknown class as a linear combination of the training vectors:

(5) y = Σ_k Σ_j α_kj x_kj,

with coefficients α_kj. The above equation can be written more compactly as

(6) y = Xα,

where

(7) α = [α_11, α_12, …, α_1n_1, α_21, …, α_Ln_L]ᵀ

and (·)ᵀ denotes the transposition operation. We assume that, given sufficient training samples X_k of the kth class, any new test image y that belongs to the same class will lie approximately in the linear span of the training samples from class k. This implies that most of the coefficients in (7) not associated with class k will be close to zero. Hence, α is a sparse vector.
In order to represent an observed vector y as a sparse vector α, one needs to solve the system of linear equations (6). Typically N < n, and hence the system (6) is underdetermined and has no unique solution. As mentioned earlier, if α is sparse enough and the dictionary X satisfies certain properties, then the sparsest α can be recovered by solving the following optimization problem:

(8) α̂ = arg min ‖α‖₁ subject to y = Xα.

When noisy observations are given, Basis Pursuit DeNoising (BPDN) can be used to approximate α:

(9) α̂ = arg min ‖α‖₁ subject to ‖y − Xα‖₂ ≤ ε,

where we have assumed that the observations are of the following form:

(10) y = Xα + η,

with ‖η‖₂ ≤ ε.
Given an observation vector y from one of the L classes in the training set, one can compute its coefficients α̂ by solving either (8) or (9). One can perform classification based on the fact that high values of the coefficients will be associated with the columns of X from a single class. This can be done by comparing how well the different parts of the estimated coefficient vector α̂ represent y. The minimum of the representation error, or residual error, can then be used to identify the correct class. The residual error of class k is calculated by keeping the coefficients associated with that class and setting the coefficients not associated with class k to zero. This can be done by introducing a characteristic function, Π_k, that selects the coefficients associated with the kth class as follows:

(11) Π_k(α) = δ_k ⊙ α.

Here ⊙ denotes elementwise multiplication and the vector δ_k has value one at locations corresponding to class k and zero for other entries. The class k̂ associated with an observed vector y is then declared as the one that produces the smallest approximation error:

(12) k̂ = arg min_k r_k(y),  where r_k(y) = ‖y − X Π_k(α̂)‖₂.
The sparse representation-based classification method is summarized in Algorithm 1.
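For concreteness, a minimal Python sketch of this classification rule is given below. It is not the chapter's Algorithm 1 verbatim: scikit-learn's Lasso is used here as a stand-in solver for the ℓ1 problems (8)-(9), and all function and variable names are ours.

```python
import numpy as np
from sklearn.linear_model import Lasso  # stand-in for an l1 / BPDN solver


def src_classify(X, labels, y, lam=0.01):
    """Sparse representation-based classification (sketch).

    X      : (N, n) dictionary whose columns are training feature vectors
    labels : (n,) array with the class label of each column of X
    y      : (N,) test feature vector
    """
    # Sparse coefficient estimate alpha_hat (approximating problem (9)).
    alpha_hat = Lasso(alpha=lam, fit_intercept=False, max_iter=10_000).fit(X, y).coef_

    residuals = {}
    for k in np.unique(labels):
        # Keep only the coefficients of class k, zero out the rest (Eq. (11)).
        alpha_k = np.where(labels == k, alpha_hat, 0.0)
        # Class-wise reconstruction residual r_k(y) (Eq. (12)).
        residuals[k] = np.linalg.norm(y - X @ alpha_k)

    k_hat = min(residuals, key=residuals.get)  # class with the smallest residual
    return k_hat, alpha_hat, residuals
```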
For classification, it is important to be able to detect and then reject test samples of poor quality. To decide whether a given test sample has good quality, one can use the notion of the Sparsity Concentration Index (SCI) proposed in Wright et al. (2009). The SCI of a coefficient vector α is defined as

(13) SCI(α) = (L · max_k ‖Π_k(α)‖₁ / ‖α‖₁ − 1) / (L − 1).

SCI takes values between 0 and 1. SCI values close to 1 correspond to the case where the test image can be approximately represented by using only images from a single class; such a test vector has enough discriminating features of its class and hence has high quality. If SCI(α̂) ≈ 0, the coefficients are spread evenly across all classes, so the test vector is not similar to any of the classes and is of poor quality. A threshold can be chosen to reject images of poor quality: for instance, a test image can be rejected if SCI(α̂) < τ and otherwise accepted as valid, where τ is some chosen threshold between 0 and 1.
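Continuing the sketch above, the SCI of Eq. (13) and the corresponding rejection rule can be computed as follows (τ is a user-chosen threshold, not a value prescribed by the chapter):

```python
import numpy as np


def sparsity_concentration_index(alpha_hat, labels):
    """SCI of Eq. (13): close to 1 when the l1 mass of alpha_hat sits in one class."""
    classes = np.unique(labels)
    L = len(classes)
    total = np.sum(np.abs(alpha_hat))
    largest = max(np.sum(np.abs(alpha_hat[labels == k])) for k in classes)
    return (L * largest / total - 1.0) / (L - 1.0)


def accept_test_sample(alpha_hat, labels, tau=0.5):
    """Reject the test sample when SCI(alpha_hat) < tau."""
    return sparsity_concentration_index(alpha_hat, labels) >= tau
```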
ASYMPTOTICALLY UNIMPROVABLE SOLUTION OF MULTIVARIATE PROBLEMS
VADIM I. SERDOBOLSKII , in Multiparametric Statistics, 2008
Problem Setting
Let x be an observation vector from an n-dimensional population with expectation Ex = 0, with finite fourth moments of all components, and with a nondegenerate covariance matrix Σ = cov(x, x). A sample {x_m} of size N is used to calculate the mean vector x̄ and the sample covariance matrix C.
We use the following asymptotic setting. Consider a hypothetical sequence of estimation problems indexed by the dimension n, in which, for each n, a population has the covariance matrix Σ = cov(x, x), a sample of size N is drawn from that population, and an estimator of Σ⁻¹ is calculated as a function of the matrix C (we do not write the index n for the arguments of this estimator). Our problem is to construct the best such statistics.
We begin by considering the simpler problem of improving estimators of Σ⁻¹ by introducing a scalar multiple of C⁻¹ (shrinkage estimation) for normal populations. Then, we consider a wide class of estimators for a wide class of populations.
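As a toy numerical illustration of why such a scalar correction can help (our own example, not the book's derivation), compare the plain inverse C⁻¹ with a shrunken version α·C⁻¹ as estimators of Σ⁻¹ = I when n/N is not small; numpy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 50, 100                                   # dimension and sample size, n/N = 0.5
Sigma_inv = np.eye(n)                            # true covariance is I, so Sigma^{-1} = I

X = rng.standard_normal((N, n))                  # sample from N(0, I)
C = np.cov(X, rowvar=False)                      # sample covariance matrix
C_inv = np.linalg.inv(C)


def frobenius_loss(G):
    """Squared Frobenius distance between an estimator G and the true Sigma^{-1}."""
    return np.linalg.norm(G - Sigma_inv) ** 2


for alpha in (1.0, 1.0 - n / N):                 # plain inverse vs. a scalar-shrunken inverse
    print(f"alpha = {alpha:.2f}:  loss = {frobenius_loss(alpha * C_inv):.1f}")
```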
Jaroslav Hájek , ... Pranab K. Sen , in Theory of Rank Tests (Second Edition), 1999
PROBLEMS AND COMPLEMENTS TO CHAPTER 10
Section 10.1.
1.
Define the aligned observation vectors, i = 1,…,n, as in Subsection 10.1.1, and denote their joint distributions under H0 and Kn by Pn and Qn, respectively. Use LeCam's lemmas to verify that {Qn} is contiguous to {Pn}.
2.
Use the contiguity result in the preceding problem along with LeCam's third lemma, and extend the joint asymptotic normality of the aligned rank statistics to contiguous alternatives in Kn.
3.
Derive the convergence in formula (10.1.1.14).
4.
Show that ℒN, defined by (10.1.1.9), has asymptotically, under Kn, a non-central χ2 distribution with p−1 degrees of freedom and non-centrality parameter (10.1.1.17).
5.
Prove that the (p − 1)-multiple of the classical ANOVA (variance-ratio) test statistic has asymptotically, under Kn, a non-central χ2 distribution with p − 1 degrees of freedom and non-centrality parameter (10.1.1.18).
6.
Prove the inequality (10.1.1.20), and show that the equality sign holds in it only when …, except on a set of null measure.
7.
Prove the asymptotic linearity result (10.1.2.9).
8.
Show that is asymptotically normal, as stated in (10.1.2.11).
9.
Show that the representation (10.1.2.14) holds, and that it implies the asymptotic normality in (10.1.2.11).
10.
Prove the asymptotic result (10.1.2.15).
11.
Prove the asymptotic result in relation (10.1.2.16).
12.
Using the projection method, show (10.1.2.17).
13.
Consider the setup of Subsection 10.1.2, and test H0 : β1 = 0 against H1 : β1 ≠ 0 by means of the statistic given by (10.1.2.21). If H0 does not hold, i.e., β1 ≠ 0, show that the statistic, divided by n, converges in probability to a positive constant, so that the statistic is Op(n).
14.
Verify that {qn}, the joint density of {Yi; 1 ≤ i ≤ n} under contiguous alternatives {Hn} given by (10.1.2.22), is contiguous to {pn}, the joint density under H0.
15.
Show that, under {Hn}, is asymptotically normal with parameters given before and in (10.1.2.24).
16.
Prove that, under {Hn}, the statistic has asymptotically the non-central χ2 distribution as given in (10.1.2.25) and (10.1.2.20).
17.
Having still the setup of Subsection 10.1.2, prove that the asymptotically optimal aligned rank test uses the score function ϕ0 given by (10.1.2.27).
18.
(Continuation) Verify the equality (10.1.2.29), i.e., …, for 0 ≤ u ≤ 1.
VADIM I. SERDOBOLSKII , in Multiparametric Statistics, 2008
Limit Spectra
We investigate here the limiting behavior of spectral functions of the matrices S and C under the increasing-dimension asymptotics. Consider a sequence of problems

(11) (𝔓_n), n = 1, 2, …,

in which spectral functions of the matrices C and S are investigated over samples of size N from populations with cov(x, x) = Σ (we do not write out the subscript n in the arguments of 𝔓_n). For each problem 𝔓_n, we consider the empirical spectral distribution functions

FnS(u) = n⁻¹ #{i : λ_i ≤ u},  FnC(u) = n⁻¹ #{i : μ_i ≤ u},  u ≥ 0,

together with hnS(z) = ∫ (1 − zu)⁻¹ dFnS(u) and hnC(z) = ∫ (1 − zu)⁻¹ dFnC(u), where λ_i and μ_i are the eigenvalues of S and C, respectively, i = 1, 2, …, n.
We restrict (11) by the following conditions.
A.
For each n, the observation vectors in 𝔓_n are such that Ex = 0 and the fourth moments of all components of x exist.
B.
The parameter M does not exceed a constant c0, where c0 does not depend on n. The parameter γ vanishes as n → ∞ in the sequence (11).
C.
In the sequence (11), n/N → λ > 0.
D.
In the sequence (11), for each n, the eigenvalues of the matrices Σ are located on a segment [c1, c2], where c1 > 0 and c2 does not depend on n, and FnΣ(u) → FΣ(u) as n → ∞ for almost all u ≥ 0.
Corollary (of Theorem 2.1). Under Assumptions A-D, for any z outside the half-axis z > 0, the limit h(z) of hnC(z) exists and is such that

(12) h(z) = ∫ (1 − z s(z) u)⁻¹ dFΣ(u),  where s(z) = 1 − λ + λ h(z),

and, for each such z, hnS(z) converges to the same limit h(z).
Let us investigate the analytical properties of solutions to (12).
Theorem 3.3. If h(z) satisfies (12), c1 > 0, λ > 0, and λ ≠ 1, then
1.
|h(z)| ≤ α(z), and h(z) is regular near any point z outside the half-axis z > 0;
2.
for any v = Re z > 0 such that v < v1 or v > v2, we have Im h(z) → 0 as Im z → +0;
3.
if v1 ≤ v ≤ v2, then (Im h(v + iε))² ≤ (c1vλ)⁻¹ + Ω, where Ω → 0 as ε → +0;
4.
if v = Re z > 0, then s(−v) ≥ (1 + c2λ|v|)⁻¹;
5.
if |z| → ∞ on the main sheet of the analytical function h(z), then we have
if 0 < λ < 1, then zh(z) = −(1 − λ)⁻¹Λ₋₁ + O(|z|⁻¹),
if λ = 1, then zh²(z) = −Λ₋₁ + O(|z|^(−1/2)),
if λ > 1, then zs(z) = −β0 + O(|z|⁻¹),
where β0 is a root of the equation

∫ (1 + β0u)⁻¹ dFΣ(u) = 1 − λ⁻¹.
Proof. The existence of the solution to (12) follows from Theorem 2.1. Suppose Im z > 0; then |h(z)| ≤ α = α(z). To be concise, denote h = h(z). For all u > 0 and z outside the beam z > 0, the quantity |1 − zsu|⁻¹ is bounded. Differentiating h(z) in (12), we prove the regularity of h(z). Define

b1 = ∫ u |1 − zs(z)u|⁻² dFΣ(u)

(and similarly the quantity b2 used below). Let us rewrite (12) in the form

(13)

It follows that

….

Dividing by b2, we use the inequality …. Fix some v = Re z > 0 and let ε = Im z → +0. It follows that the product ….

Suppose that Im h does not tend to 0 (v is fixed). Then, there exists a sequence {zk} such that, for zk = v + iεk, h = h(zk), s = s(zk), we have Im h → a, where a ≠ 0. For these zk, we obtain …. We apply the Cauchy-Bunyakovskii inequality to (5). It follows that …. We obtain that |h − 1|² ≤ λ⁻¹ + o(1) as εk → +0. It follows that |s − 1|² ≤ λ + o(1). So the values s are bounded for {zk}. On the other hand, it follows from (12) that Im h = b1 Im(zs). We find that … as Im z → 0. But Im h → a ≠ 0 for {zk}. It follows that …. Combining this with the inequality above, we find that …. Note that b1 is finite for {zk}, and …. Substituting the bounds for |s|, we obtain that … as εk → +0. We can conclude that Im h → 0 for any positive v outside the interval [v1, v2]. This proves the second statement of our theorem.

Now suppose v1 ≤ v ≤ v2. From (12), we obtain the inequality (Im h)² < (c1vλ)⁻¹. But h is bounded. It follows that the quantity (Im h)² ≤ (c1vλ)⁻¹. The third statement of our theorem is proved.

Further, let v = Re z > 0. Then, the functions h and s are real and non-negative. We multiply both parts of (12) by λ. It follows that …. We obtain s ≥ (1 + c2λ|z|)⁻¹.

Let us prove the fifth statement of the theorem. Let λ < 1. For real z → −∞, the real value of 1 − zsu in (12) tends to infinity. Consequently, h → 0 and s → 1 − λ. For sufficiently large |Re z|, we have … where …. We conclude that … for real z < 0 and for any z in the domain of regularity as |z| → ∞, in view of the properties of the Laurent series. Now let λ = 1. Then h = s. From (12), we obtain that h → 0 as z → −∞ and h² = Λ₋₁|z|⁻¹ + O(|z|⁻²). Now suppose that λ > 1, z = −t < 0, and t → ∞. Then, by Lemma 3.6, we have s ≥ 0, h ≥ 1 − 1/λ, and s → 0. Equation (12) implies ts → β0, as is stated in the theorem formulation. This completes the proof of Theorem 3.3.
Remark 7. Under Assumptions A-D, for each u ≥ 0, the limit exists

(14) F(u) = lim FnS(u) = lim FnC(u) as n → ∞.

Indeed, to prove the convergence, it suffices to cite Corollary 3.2.1 from [22], which states the convergence of {hnS(z)} and {hnC(z)} almost surely. By Lemma 3.5, both these sequences converge to the same limit h(z). To prove that the limits of FnS(u) and FnC(u) coincide, it suffices to prove the uniqueness of the solution to (12). This can be readily done by performing the inverse Stieltjes transformation.
Theorem 3.4. Under Assumptions A-D,
1.
if λ = 0, then F(u) = FΣ(u) almost everywhere for u ≥ 0;
2.
if λ > 0 and λ ≠ 1, then F(0) = F(u1 − 0), where u1 and u2 denote the endpoints of the continuous part of the limit spectrum of C, and c1 and c2 are the bounds of the limit spectrum of Σ;
3.
if λ > 0, λ ≠ 1, and u > 0, then the derivative F′(u) of the function F(u) exists.
Proof. Let λ = 0. Then s(z) = 1. In view of (12), we have

h(z) = ∫ (1 − zu)⁻¹ dFΣ(u).

At the continuity points of FΣ(u), the derivative

FΣ′(u) = lim π⁻¹ Im[z⁻¹ h(z⁻¹)] as ε → +0, where z = u − iε.

Let λ > 0. By Theorem 2.2, for u < u1 and for u > u2 (note that u1 > 0 if λ ≠ 1), the values Im[(u − iε)⁻¹ h((u − iε)⁻¹)] → 0 as ε → +0. But we have

(15) F′(u) = π⁻¹ lim Im[(u − iε)⁻¹ h((u − iε)⁻¹)] as ε → +0.

It follows that F′(u) exists and F′(u) = 0 for 0 < u < u1 and for u > u2. The points of increase of F(u) can be located only at the point u = 0 or on the segment [u1, u2]. If λ < 1 and |z| → ∞, we have … and, consequently, F(0) = 0. If λ > 1 and |z| → ∞, then … and F(0) = 1 − λ⁻¹. The second statement of our theorem is proved.

Now, let z = v + iε, where v > 0 is fixed and ε → +0. Then, using (12), we obtain that Im h = b1 Im(zs). Obviously,

….

If Im h remains finite, then b1 → (λv)⁻¹. Performing the limit transition in (15), we prove the last statement of the theorem.
Theorem 3.5. If Assumptions A-D hold and 0 < λ < 1, then for any complex z, z′ outside of the half-axis z > 0, we have

|h(z) − h(z′)| ≤ c3 |z − z′|^ζ,

where c3 and ζ > 0 do not depend on z and z′.
Proof. From (12), we obtain

….

By definition, the function h(z) is differentiable for each z outside the segment G = [v1, v2], v1 > 0. Denote the δ-neighborhood of the segment G by Gδ. If z is outside of Gδ, then the derivative h′(z) exists and is uniformly bounded. It suffices to prove our theorem for v ∈ G1, where G1 = Gδ − {z : Im z = 0}. Choose …. Then δ1 < |z| < δ2, where δ2 does not depend on z. We estimate the absolute value of the derivative h′(z). For Im z ≠ 0, from (15), by differentiation, we obtain

(16)

where X1 = (1 − zs(z)u) ≠ 0. Denote … and let α with subscripts denote constants not depending on z. The right-hand side of (16) is not greater than α1b1 for z ∈ G1, and therefore |h′(z)| < α2b1|ϕ(z)|⁻¹.

We consider two cases. Denote α3 = (2δ2c2)⁻¹.

At first, let Re s(z) = s0 < α3. Using the relation h1 = b1 Im(zs), we obtain that the quantity −Im ϕ(z) equals

….

In the integrand here, we have z0 > 0, 1 − z0s0 > ½, z1h1 < 0. From the Cauchy-Bunyakovskii inequality, it follows that

….

Hence |Im ϕ(z)| > b1h1 and |h′(z)| …. Let

….

Define p = λ⁻¹z0|z|⁻² − b1. We have

….

Here |h1| < α4, z0 ≥ δ1 ≥ 0, s0 ≥ α3 > 0, and we obtain that p > 0 if z1 < α6, where α6 = α3α5/(λα4). If z ∈ G1 and z1 > α6, then the Hölder inequality follows from the existence of a uniformly bounded derivative of the analytic function h(z) in a closed domain.

Now let z ∈ G1, z1 < α6, p > 0, and s0 > α3 > 0. Then |h′(z)| ≤ α7b1|Re ϕ(z)|⁻¹, where

….

Substituting b1/Im(zs(z)) = h1 and taking into account that s0 > 0, we obtain that …. Thus, for v ∈ Gδ and 0 < z1 < α6, for any s0, it follows that |h′(z)| < α8 max(…). Calculating the derivative along the vertical line, we obtain the inequality … whence … if Im z · Im z′ > 0. The Hölder inequality for h1 = Im h(z) with ζ = 1/3 follows. This completes the proof of Theorem 3.5.
Example. Consider the limit spectra of matrices Σ of a special form, the "ρ-model" considered first in [63]. It is of special interest since it admits an analytical solution of the dispersion equation (12). For this model, the limit spectrum of Σ is located on a segment [c1, c2] whose endpoints depend on σ² and ρ. Its limit spectral density is

….

The moments are

….

If ρ > 0, the integral η(z) = ∫ (1 − zu)⁻¹ dFΣ(u) can be evaluated in closed form in terms of k = σ²(1 − ρ). The function η = η(z) satisfies the equation ρη² + (kz − ρ − 1)η + 1 = 0. The equation h(z) = η(zs(z)) can be transformed to the equation (h − 1)(1 − ρh) = kzhs, which is quadratic with respect to h = h(z), s = 1 − λ + λh. If λ > 0, its solution is

….

The moments Mk = (k!)⁻¹ h^(k)(0) for k = 0, 1, 2, 3 are

….

Differentiating the functions of the inverse argument, we find, in particular, that Λ₋₁ = k⁻¹, Λ₋₂ = k⁻²(1 + ρ), M₋₁ = k⁻¹(1 − λ)⁻¹, and M₋₂ = k⁻²(ρ + λ(1 − ρ))(1 − λ)⁻³. The continuous limit spectrum of the matrix C is located on the segment [u1, u2], where

…,

and it has the density

….

If λ > 1, then the function F(u) has a jump of 1 − λ⁻¹ at the point u = 0. If λ = 0, then F(u) = FΣ(u) has the form of a unit step at the point u = σ². The density f(u) satisfies the Hölder condition with ζ = 1/2.

In the special case when Σ = I and ρ = 0, we obtain the limit spectral density

f(u) = (2πλu)⁻¹ √((u2 − u)(u − u1))  for u1 < u < u2, where u1,2 = (1 ∓ √λ)².

This "semicircle" law of spectral density was first found by Marchenko and Pastur [43].
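As a quick numerical check of this special case (our own illustration, assuming numpy), the eigenvalues of a sample covariance matrix built from N(0, I) data indeed concentrate on the limit segment [u1, u2]:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 400                              # dimension and sample size, lambda = n/N = 0.5
lam = n / N

X = rng.standard_normal((N, n))              # sample from N(0, I): Sigma = I, rho = 0
C = X.T @ X / N                              # sample covariance matrix (known zero mean)
eigenvalues = np.linalg.eigvalsh(C)

u1, u2 = (1 - np.sqrt(lam)) ** 2, (1 + np.sqrt(lam)) ** 2   # limit spectrum endpoints
print(f"empirical eigenvalue range: [{eigenvalues.min():.3f}, {eigenvalues.max():.3f}]")
print(f"limit spectrum segment:     [{u1:.3f}, {u2:.3f}]")
```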
4.02.4 Handling Future Observations with Missing Data
Missing measurements are a frequent occurrence in process industries. Therefore, the new observation vector xnew (Figure 4) may have a few elements missing. Latent variable methods that model the process space (PCA, PCR, and PLS) make it possible to infer the corresponding score values of xnew by using the available elements in the vector together with the model built from the training data set.
The fact that process variables are highly correlated and that there is redundancy in process data (i.e., many variables are affected by the same event) makes this possible. Redundancy is beneficial for handling missing data. More details on methods for treatment of missing data in regression can be found in Chapter 3.06.
A variety of algorithms have been suggested32,33 to handle missing data, with different degrees of complexity: the trimmed score method (TRI), single-component projection (SCP), projection to the model plane (PMP) using PLS or ordinary least squares (PMPPLS, PMPOLS), iterative imputation of missing data (II), a method based on the minimization of the squared prediction error (SPE), conditional mean replacement (CMR), trimmed score regression (TSR), and known data regression (KDR).
Suppose that the observation vector is partitioned as xᵀ = [x#ᵀ x*ᵀ], where without loss of generality we assume that x# is the vector of missing observations and x* the vector of known ones. (Following this convention, p* and P* are the loadings corresponding to the known x*.) The methods can be seen as different ways to impute values for the missing variables vector x#. By setting the missing values equal to their expected mean value (i.e., for mean-centered data x# = 0), we have the TRI method.33
SCP is the simplest but also the poorest performing approach: it calculates each of the scores independently and sequentially as t̂i = pi*ᵀz*/(pi*ᵀpi*), where z* is x* deflated by the first i − 1 components.
Nelson et al.32 showed that superior results can be obtained by calculating all of the scores at once by projecting onto the hyperplane formed by the P* vectors. In the PMP method, the known x* vector is regressed onto the matrix P*. Sometimes, depending on which measurements are missing, some of the columns of P* may become highly correlated and P*ᵀP* becomes ill-conditioned. It was suggested32 to use PLS, PCR, or regularized least squares regression for the projection.
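A minimal sketch of this projection-to-the-model-plane idea is shown below (our own code, with plain least squares standing in for the PLS/PCR/regularized regression recommended for ill-conditioned cases; numpy assumed):

```python
import numpy as np


def pmp_scores(x, missing_mask, P):
    """Estimate the scores of an observation with missing entries (PMP-style sketch).

    x            : (K,) preprocessed observation; entries at missing positions are ignored
    missing_mask : (K,) boolean array, True where the measurement is missing
    P            : (K, A) loading matrix of the fitted latent variable model
    """
    P_star = P[~missing_mask, :]            # loadings of the known variables
    x_star = x[~missing_mask]               # known measurements
    # Regress x* onto P*; lstsq tolerates an ill-conditioned P*^T P*
    t_hat, *_ = np.linalg.lstsq(P_star, x_star, rcond=None)
    x_hat = P @ t_hat                       # reconstruction, including the imputed x#
    return t_hat, x_hat
```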
CMR32 and TSR33 use the known score matrix T from the training data together with the loadings (P*) and the available measurements (x*) to estimate the score vector. A singularity problem that may arise in CMR may be solved by a procedure suggested by Nelson et al.,32 where the estimated score vector is calculated in two steps: first, a parameter β is computed using PLS from T = X*β, where T and X*, respectively, represent the score matrix and those columns from the training data set corresponding to known values; then β, along with the currently available data vector x*, is used to compute an estimate of the score vector.
In iterative imputation, one may use an initial estimate of the final scores (say, the one given by the SCP method) to forecast the missing values x̂# (using their corresponding loadings), then build the completed vector from x̂# and x*, recalculate a score estimate, and iterate until convergence.
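A correspondingly simple sketch of the iterative imputation loop, assuming orthonormal PCA loadings P and mean-centered data (function and variable names are ours):

```python
import numpy as np


def iterative_imputation_scores(x, missing_mask, P, n_iter=100, tol=1e-8):
    """Iterative imputation (II) sketch: impute x#, re-estimate the scores, repeat."""
    x_work = np.where(missing_mask, 0.0, x)            # start missing entries at the mean (0)
    t_hat = np.zeros(P.shape[1])
    for _ in range(n_iter):
        t_new = P.T @ x_work                           # scores of the completed vector
        x_work = np.where(missing_mask, P @ t_new, x)  # re-impute only the missing part
        if np.linalg.norm(t_new - t_hat) < tol:
            break
        t_hat = t_new
    return t_hat
```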
Arteaga and Ferrer33 presented an extensive study of the various methods. Iterative imputation and the SPE method are equivalent to PMP; KDR is equivalent to CMR. They concluded that, based on the best prediction of the missing values, KDR is statistically superior to the other methods. TSR is practically equivalent to KDR and has the advantage that a much smaller matrix needs to be inverted. Additionally, TSR is statistically superior to the PMP method.
Before the system is implemented online, there should be a plan for the operators as to how to respond if the values of several variables stop being recorded. For example, if there are three thermocouples in a reactor and one fails, common sense dictates that we can afford to continue the monitoring scheme. In contrast, if there is only one sensor for a variable uncorrelated with any other, the value of this variable cannot be assessed from the rest of the variables in the system; therefore, depending on the importance of this variable, one may not be able to rely on the monitoring scheme until the failed sensor is replaced. This idea was treated quantitatively by Nelson52 and Nelson et al.,53 who analyzed the uncertainty that missing measurements introduce into the predicted values of Hotelling's T2 and the SPE. Rather than representing an object with missing measurements by a single point, an estimate of the uncertainty regions in the score, Hotelling's T2, and SPE spaces arising from the missing measurements is provided. They suggested measures to distinguish between situations where model performance will continue to be acceptable and situations where it will be unacceptable, in which case, if the missing measurements cannot be recovered, the application must be shut down.
Missing data methods have found their way into industrial applications. In their industrial perspective on implementing online applications of multivariate statistics, Miletic et al.2 emphasize that missing data handling is a necessary feature of both the offline modeling and the online systems and report using the methods proposed by Nelson et al.32
Parameter Estimation of Chaotic Systems Using Density Estimation of Strange Attractors in the State Space
Yasser Shekofteh , ... Sajad Jafari , in Recent Advances in Chaotic Systems and Synchronization, 2019
5.1 The GMM of the Chaotic System
The chaotic system (1) has three variables in the state space, so the observation vector of its attractor is formed as v = [x, y, z], and we must select D = 3 as the state-space dimension in Eq. (2). To generate the attractor points of the chaotic system (1) as real data, its model was simulated with parameters a = 1.0 and b = 1.0 by a fourth-order Runge-Kutta method with a step size of 10 ms [27,28]. For the training data of the first phase, a set of sequential samples of system (1) comprising 100,000 samples (equal to a 1000 s time length) was recorded, with the initial conditions of system (1) set to (−0.10, −5.05, −6.00). Here, we assume that this recorded training data must lead us to estimate the unknown parameters a and b of the chaotic system (1) by minimizing the GMM-based cost function.
Using obtained training data from the chaotic system, we can learn a GMM in order to model the geometry of the attractor in the state space. In other words, the GMM computation fits a parametric model to the distribution of the attractor in the state space. Fig. 5 shows the attractor of the chaotic system (1) in a three-dimensional state space along with its GMM modeling using M = 64 Gaussian components. In this figure, every three-dimensional ellipsoid corresponds to one of the Gaussian components.
Fig. 5. Plot of the chaotic attractor of the system (1) and its GMM modeling with M = 64 components in the 3D state space. Here, the parameters of the system (1) are set to a = 1.0 and b = 1.0.
As can be seen from Fig. 5, the Gaussian components attempt to cover the attractor in the state space. To show the effect of the number of Gaussian mixtures, in Fig. 6, the attractor of the chaotic system (1) and its GMM models are shown for different values of M = 16, 32, 48, and 64.
Fig. 6. Plot of the chaotic attractor of the system (1) and its GMM modeling with M = 16, 32, 48, and 64 components in the 3D state space.
As can be seen from Fig. 6, when we increase the number of Gaussian components, more details of the trajectory of the chaotic attractor are covered by the added components. In these experiments, the best GMM modeling of the attractor is obtained with M = 64, which gives a precise model of the chaotic attractor. Thus, by increasing the number of Gaussian components, the GMM can capture more of the complexity of the given time series. A higher value of M can improve the performance of the cost function, but it also increases the computational cost and may lead to overfitting problems.
In Fig. 7, information criteria such as AIC, BIC, and the negative log-likelihood are considered for the GMM selection problem. The figure shows that M = 64 is a good choice for the number of GMM components because it minimizes these criteria.
Fig. 7. Plot of the information criteria values to select the best GMM.
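In outline, the GMM fits of Figs. 5-7 and the model-selection step can be reproduced with scikit-learn's GaussianMixture (a sketch with our own variable names; the attractor points would come from simulating system (1), here replaced by placeholder data):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder for the (n_samples, 3) array of [x, y, z] attractor points;
# in the chapter these come from simulating system (1) with a = b = 1.0.
attractor_points = np.random.default_rng(0).standard_normal((20_000, 3))

criteria = {}
for M in (16, 32, 48, 64):
    gmm = GaussianMixture(n_components=M, covariance_type="full", random_state=0)
    gmm.fit(attractor_points)
    criteria[M] = {
        "AIC": gmm.aic(attractor_points),
        "BIC": gmm.bic(attractor_points),
        "neg_loglik": -gmm.score(attractor_points) * len(attractor_points),
    }

best_M = min(criteria, key=lambda m: criteria[m]["BIC"])  # component count minimizing BIC
print(best_M, criteria[best_M])
```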
G.J. McLachlan , in Comprehensive Chemometrics, 2009
2.30.18 Mixed Feature Data
We consider now the case where some of the feature variables are discrete. That is, the observation vector yj on the jth entity to be clustered consists of p1 discrete variables, represented by the subvector y1j, in addition to p2 continuous variables, represented by the subvector y2j (j = 1, …, n). The ith component density of the jth observation yj can then be written as

(84) fi(yj) = fi(y1j) fi(y2j | y1j).
The symbol fi is being used generically here to denote a density where, for discrete random variables, the density is really a probability function.
In discriminant and cluster analyses, it has been found that it is reasonable to proceed by treating the discrete variables as if they are independently distributed within a class or cluster. This is known as the NAIVE assumption.49,50 Under this assumption, the ith component-conditional density of the vector y1j of discrete features is given by
(85) fi(y1j) = ∏v=1..p1 fiv(y1vj),
where fiv (y1vj) denotes the ith component-conditional density of the vth discrete feature variable y1vj in y1j.
If y1v denotes one of the distinct values taken on by the discrete variable y1vj, then under Equation (85) the (k + 1)th update of fiv(y1v) is
(86) fiv(k+1)(y1v) = [ c1 + Σj τi(yj; Ψ(k)) δ[y1vj, y1v] ] / [ c2 + Σj τi(yj; Ψ(k)) ],

where τi(yj; Ψ(k)) denotes the current posterior probability of ith component membership for the jth observation, δ[y1vj, y1v] = 1 if y1vj = y1v and is zero otherwise, and Ψ(k) is the current estimate of the vector of all the unknown parameters, which now include the probabilities for the discrete variables. In Equation (86), the constants c1 and c2, which are both equal to zero for the maximum likelihood estimate, can be chosen to limit the effect of zero estimates of fiv(y1v) for rare values y1v. One choice is c2 = 1 and c1 = 1/dv, where dv is the number of distinct values in the support of y1vj.49
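A small sketch of the smoothed update in Equation (86) for a single discrete feature within one EM iteration is given below (function and argument names are ours; tau_i stands for the current posterior probabilities of ith component membership):

```python
import numpy as np


def update_discrete_density(tau_i, y1v_col, values, c1=None, c2=1.0):
    """Smoothed update of Equation (86) for the ith component and the vth discrete feature.

    tau_i   : (n,) current posterior probabilities of component i for each observation
    y1v_col : (n,) observed values of the vth discrete feature
    values  : the d_v distinct values the feature can take
    c1, c2  : smoothing constants; c1 = 1/d_v, c2 = 1 limits the effect of zero counts
    """
    d_v = len(values)
    if c1 is None:
        c1 = 1.0 / d_v
    denom = c2 + tau_i.sum()
    return {val: (c1 + tau_i[y1v_col == val].sum()) / denom for val in values}
```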
We can allow for some dependence between the vector y2j of continuous variables and the discrete-data vector y1j by adopting the location model as, for example, in Hunt and Jorgensen.51 With the location model, fi(y2j∣y1j) is taken to be multivariate normal with a mean that is allowed to be different for some or all of the different levels of y1j.
As an alternative to the use of the full mixture model, we may proceed conditionally on the realized values of the discrete feature vector y1j, as in McLachlan and Chang.52 This leads to the use of the conditional mixture model for the continuous feature vector y2j,
(87) f(y2j | y1j) = Σi πi(y1j) fi(y2j | y1j),
where πi(y1j) denotes the conditional probability of ith component membership of the mixture given the discrete data in y1j. A common model for πi(y1j) is the logistic model, under which

πi(y1j) = exp(βi0 + βiᵀy1j) / [1 + Σh exp(βh0 + βhᵀy1j)],  i = 1, …, g − 1,

with the sum over h = 1, …, g − 1, the gth component taken as baseline, and g denoting the number of components.
A.J. Ferrer-Riquelme , in Comprehensive Chemometrics, 2009
1.04.9.2.2 PCA-based MSPC: online process monitoring (Phase II)
Once the reference PCA model and the control limits for the multivariate control charts are obtained, new process observations can be monitored online. When a new observation vector zi is available, after preprocessing it is projected onto the PCA model, yielding the scores and the residuals, from which the value of Hotelling's T2 and the value of the SPE are calculated. This way, the information contained in the original K variables is summarized in these two indices, which are plotted in the corresponding multivariate T2 and SPE control charts. No matter what the number of original variables K is, only two points have to be plotted on the charts and checked against the control limits. The SPE chart should be checked first. If the points remain below the control limits in both charts, the process is considered to be in control. If a point is detected to be beyond the limits of one of the charts, then a diagnostic approach to isolate the original variables responsible for the out-of-control signal is needed. In PCA-based MSPC, contribution plots37 are commonly used for this purpose.
Contribution plots can be derived for abnormal points in both charts. If the SPE chart signals a new out-of-control observation, the contribution of each original kth variable to the SPE at this new abnormal observation is given by its corresponding squared residual:
(37) e²new,k = (xnew,k − x̂new,k)²,

where enew,k is the residual corresponding to the kth variable in the new observation and x̂new,k is the prediction of the kth variable xnew,k from the PCA model.
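In outline, the online computation of the two monitoring indices and of the SPE contributions of Equation (37) looks as follows (a sketch with our own names, assuming orthonormal PCA loadings and an already preprocessed observation):

```python
import numpy as np


def monitor_new_observation(x_new, P, score_var):
    """Project a preprocessed observation onto the PCA model and compute monitoring indices.

    x_new     : (K,) mean-centered and scaled observation
    P         : (K, A) loading matrix with orthonormal columns
    score_var : (A,) variances of the scores estimated from the training data
    """
    t_new = P.T @ x_new                          # scores
    x_hat = P @ t_new                            # reconstruction from the model
    e_new = x_new - x_hat                        # residuals
    spe = float(e_new @ e_new)                   # squared prediction error
    t2 = float(np.sum(t_new**2 / score_var))     # Hotelling's T2
    spe_contributions = e_new**2                 # per-variable contributions, Eq. (37)
    return t2, spe, spe_contributions
```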
In case of using the DModX statistic, the contribution of each original kth variable to the DModX is given by44
(38)
where wk is the square root of the explained sum of squares for the kth variable. Variables with high contributions in this plot should be investigated.
If the abnormal observation is detected by the T2 chart, the diagnosis procedure is carried out in two steps: (i) a bar plot of the normalized scores (tnew,a/λa)2 for that observation is plotted and the ath score with the highest normalized value is selected; (ii) the contribution of each original kth variable to this ath score at this new abnormal observation is given by
(39)
where pak is the loading of the kth variable at the ath component. A plot of these contributions is created. Variables on this plot with high contributions but with the same sign as the score should be investigated (contributions of the opposite sign will only make the score smaller). When there are some scores with high normalized values, an overall average contribution per variable can be calculated over all the selected scores.39
Contribution plots are a powerful tool for fault diagnosis. They provide a list of process variables that contribute numerically to the out-of-control condition (i.e., they are no longer consistent with NOCs), but they do not reveal the actual cause of the fault. Those variables and any variables highly correlated with them should be investigated. Incorporation of technical process knowledge is crucial to diagnose the problem and discover the root causes of the fault.
Apart from the T2 and SPE control charts, other charts such as univariate time-series plots of the scores or score scatter plots can be useful (both in Phase I and Phase II) for detecting and diagnosing out-of-control situations and also for improving process understanding.
Dag Tjøstheim , ... Bård Støve , in Statistical Modeling Using Local Gaussian Approximation, 2022
9.2.2 Estimation of the joint dependence function
Let ψ(·, θ) be a parametric family of p-variate density functions, where θ is the parameter vector. Below ψ is taken to be the multinormal. We recall from Chapter 4 that Hjort and Jones (1996) estimate the unknown density f from the sample by fitting ψ locally. The local parameter estimate θ̂ = θ̂(x) maximizes the local likelihood function

(9.3) Ln(θ, x) = n⁻¹ Σi K_B(Xi − x) log ψ(Xi, θ) − ∫ K_B(y − x) ψ(y, θ) dy,

where K is a kernel function that integrates to one and is symmetric about the origin, B is a positive definite matrix of bandwidths, and K_B(y) = |B|⁻¹ K(B⁻¹y), |B| being the determinant. For small bandwidths, the local estimate ψ(x, θ̂(x)) is close to f(x) in the limit, because if the bandwidth matrix B is held fixed and n → ∞, we have

(9.4) Ln(θ, x) → ∫ K_B(y − x) [ f(y) log ψ(y, θ) − ψ(y, θ) ] dy almost surely,

for some value θ0 = θ0(x) of the parameter toward which θ̂(x) converges in probability. However, for finite sample sizes, the curse of dimensionality comes into play. The number of coordinates in θ typically grows with the dimension of x, making the local estimates difficult to obtain at every point in the sample space. One solution might be to increase the bandwidths so that the estimation becomes almost parametric. However, here we propose a different path around the curse, directly exploiting the decomposition (9.2). The first step is to choose a standardized multivariate normal distribution as the parametric family in (9.3) for modeling the joint dependence function in (9.2) locally:
(9.5) ψ(z, R(z)) = (2π)^(−p/2) |R(z)|^(−1/2) exp(−½ zᵀ R(z)⁻¹ z),

where R = R(z) denotes the local correlation matrix with elements ρij(z). Using a univariate local fit, the local Gaussian expectations and variances in (9.5) are constant and equal to zero and one, respectively, reflecting our knowledge that the margins of the unknown density function are standard normal. However, in the p-variate case, as briefly described in Chapter 4.9, the local means μi and variances σi in general depend on z. In this chapter, we make the additional assumption in our p-dimensional local Gaussian approximation that μi(z) ≡ 0 and σi(z) ≡ 1. This is more restrictive than in Chapters 7 and 8, where it was assumed that μi(z) = μi(zi) and σi(z) = σi(zi) in the bivariate case, and this more general assumption was crucial in obtaining the local spectral results in Chapter 8.

With this more restrictive assumption that μi(z) ≡ 0 and σi(z) ≡ 1, we are left with the problem of estimating the pairwise local correlations ρij(z), 1 ≤ i < j ≤ p, in (9.5). Fitting the Gaussian distribution according to the scheme described above results in a local correlation matrix at each point. Specifically, the estimated local correlations are written as ρ̂ij(z) = ρ̂ij(z1, …, zp), i < j, indicating that each parameter depends on all variables. The dependence between the variables is captured in the variation of the parameter estimates in the p-dimensional Euclidean space, and each estimate maximizes the local likelihood function (9.3). However, as mentioned, the quality of the estimates deteriorates quickly with the dimension.
If the data were jointly normally distributed, there would be no dimensionality problem, since the entire distribution would be characterized by the global correlation coefficients between pairs of variables, and their empirical counterparts are easily computed from the data. A local Gaussian fit would then coincide with a global fit and result in estimates of the form ρ̂ij(Zi, Zj), where the arguments indicate which of the transformed observation variables were used to obtain the estimate. This points to a natural simplification, which we may use to estimate the density, analogous to the additive regression model in Chapter 2.7.1. We allow the local correlations to depend on their own variables only:

(9.6) ρij(z) = ρij(zi, zj),  1 ≤ i < j ≤ p.

We could also simplify the estimation problem by estimating the local means and variances as functions of "their own" coordinate only: μi(z) = μi(zi) and σi(z) = σi(zi), but, as mentioned before, here we have chosen the stricter approximation

(9.7) μi(z) ≡ 0,  σi(z) ≡ 1,  i = 1, …, p.
We refer to Section 9.7 for a further discussion of this point.
The resulting estimation is carried out in four steps:
1.
Estimate the marginal distributions using the logspline method (or the empirical distribution function) and transform each observation vector to pseudo-standard normality as described in the previous subsection (a code sketch of this step follows the list).
2.
Estimate the joint density of the transformed data using the Hjort and Jones (1996) local likelihood function (9.3), the standardized normal parametric family (9.5), and the simplifications (9.6) and (9.7). In practice, this means fitting the bivariate version of (9.5) to each pair (Zi, Zj) of the transformed variables. Put the estimated local correlations into the estimated local correlation matrix R̂(z) = {ρ̂ij(zi, zj)}.
3.
Let ẑ = (ẑ1, …, ẑp) with ẑi = Φ⁻¹(F̂i(xi)), and obtain the final estimate of f by replacing R with R̂, and the marginal distribution and density functions with their logspline estimates, in (9.2):

(9.8) f̂(x) = ψ(ẑ, R̂(ẑ)) ∏i=1..p f̂i(xi)/φ(ẑi),

where φ and Φ denote the standard normal density and distribution function, respectively.
4.
Normalize the density estimate so that it integrates to one.
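As an illustration of step 1 (the remaining steps require the local likelihood machinery of (9.3)), the following sketch performs the marginal transformation to pseudo-standard normality, using the rescaled empirical distribution function in place of the logspline estimate (numpy and scipy assumed; names are ours):

```python
import numpy as np
from scipy.stats import norm


def to_pseudo_standard_normal(X):
    """Transform each margin of the (n, p) data matrix X to pseudo-standard normality."""
    n, p = X.shape
    Z = np.empty_like(X, dtype=float)
    for j in range(p):
        ranks = np.argsort(np.argsort(X[:, j])) + 1   # ranks 1..n of the jth variable
        u = ranks / (n + 1.0)                         # rescaled ECDF, kept inside (0, 1)
        Z[:, j] = norm.ppf(u)                         # z_ij = Phi^{-1}(F_hat_j(x_ij))
    return Z
```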
The existence of population values corresponding to the estimated local correlations is discussed in the following section. It is clear that assumptions (9.6) and (9.7) represent an approximation for most multivariate distributions. The authors are aware of no distributions other than those possessing the Gaussian copula, or step functions thereof as in Tjøstheim and Hufthammer (2013) or Chapter 4.3, for which (9.6) and (9.7) are exact properties of the true local correlations. In that case the local correlations are constant or stepwise constant in all their variables. The quality of the LGDE thus depends to a large degree on how severe assumptions (9.6) and (9.7) are for the underlying density. The pairwise assumption is hard to interpret beyond general statements about "pairwise dependence structures", so we proceed in this chapter to explore the impact of (9.6) and (9.7) in practice in Section 9.6 and in the subsequent discussion in Section 9.7. Before we do that, we take a closer look at the theoretical foundations of the LGDE.
VADIM I. SERDOBOLSKII , in Multiparametric Statistics, 2008
The Kolmogorov Asymptotics
In 1967, Andrei Nikolaevich Kolmogorov was interested in the dependence of errors of discrimination on the sample size. He solved the following problem. Let x be a normal observation vector, and let x̄1 and x̄2 be sample averages calculated over samples from populations number ν = 1, 2. Suppose that the covariance matrix is the identity matrix. Consider a simplified discriminant function

w(x) = (x̄1 − x̄2)ᵀ (x − (x̄1 + x̄2)/2)

and the classification rule w(x) > 0 against w(x) ≤ 0. This function leads to a probability of errors expressed through two statistics G and D, which are quadratic functions of the sample averages having noncentral χ2 distributions. To isolate the principal parts of G and D, Kolmogorov proposed to consider not one statistical problem but a sequence of n-dimensional discriminant problems in which the dimension n increases along with the sample sizes Nν, so that n → ∞ and n/Nν → λν > 0, ν = 1, 2. Under these assumptions, he proved that the probability of error αn converges in probability,
(7) αn → Φ( −(J + λ2 − λ1) / (2 √(J + λ1 + λ2)) ),
where J is the square of the limiting Euclidean "Mahalanobis distance" between the centers of the populations. This expression is remarkable in that it explicitly shows the dependence of the error probability on the dimension and the sample sizes. This new asymptotic approach was called the "Kolmogorov asymptotics."
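The dependence on n and N is easy to see numerically. The following toy Monte Carlo (our own illustration, not from the chapter) classifies observations from the first population with the simplified discriminant function and, for equal sample sizes, compares the error rate with the limit value Φ(−J/(2√(J + λ1 + λ2))):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N1 = N2 = 50                       # sample sizes
J = 4.0                            # squared distance between the population centers


def error_rate(n, trials=4000):
    """Monte Carlo error rate of the simplified discriminant w(x) in dimension n."""
    mu1 = np.zeros(n)
    mu2 = np.full(n, np.sqrt(J / n))                        # ||mu1 - mu2||^2 = J
    errors = 0
    for _ in range(trials):
        xbar1 = mu1 + rng.standard_normal(n) / np.sqrt(N1)  # sample mean, population 1
        xbar2 = mu2 + rng.standard_normal(n) / np.sqrt(N2)  # sample mean, population 2
        x = mu1 + rng.standard_normal(n)                    # new observation from population 1
        w = (xbar1 - xbar2) @ (x - (xbar1 + xbar2) / 2)
        errors += (w <= 0)
    return errors / trials


for n in (5, 25, 50, 100):
    limit = norm.cdf(-J / (2 * np.sqrt(J + n / N1 + n / N2)))
    print(f"n = {n:3d}: simulated error = {error_rate(n):.3f}, limit formula = {limit:.3f}")
```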
Later, L. D. Meshalkin and the author of this book deduced formula (7) for a wide class of populations under the assumption that the variables are independent and populations approach each other in the parameter space (are contiguous) [45], [46].
In 1970, Yu. N. Blagoveshchenskii and A. D. Deev studied the probability of errors for the standard sample Fisher-Andersen-Wald discriminant function for two populations with unknown common covariance matrix. A. D. Deev used the fact that the probability of error coincides with the distribution function g(x). He obtained an exact asymptotic expansion for the limit of the error probability α. The leading term of this expansion proved to be especially interesting. The limit probability of an error (of the first kind) proved to be
where the factor …, with λ = λ1λ2/(λ1 + λ2), accounts for the accumulation of estimation inaccuracies in the process of the covariance matrix inversion. It was called "the Deev formula." This formula was thoroughly investigated numerically, and good agreement was demonstrated even for moderate n and N.
Note that starting from Deev's formulas, the discrimination errors can be reduced if the rule g(x) > θ against g(x) ≤ θ with θ = (λ1 − λ2)/2 ≠ 0 is used. A. D. Deev also noticed [18] that the half-sum of discrimination errors can be further decreased by weighting summands in the discriminant function.
After these investigations, it became obvious that by keeping terms of the order of n/N, one gains the possibility of using specifically multidimensional effects to construct improved discriminant and other procedures of multivariate analysis. The most important conclusion was that traditional consistent methods of multivariate statistical analysis should be improvable, and that new progress in theoretical statistics is possible, aiming at nearly optimal solutions for fixed samples.
The Kolmogorov asymptotics (increasing-dimension asymptotics [3]) may be considered as a calculation tool for isolating leading terms in the case of large dimension. But the principal role of the Kolmogorov asymptotics is that it reveals specific regularities produced by the estimation of a large number of parameters. In a series of further publications, this asymptotics was used as the main tool for the investigation of essentially many-dimensional phenomena characteristic of high-dimensional statistical analysis. The ratio n/N became an acknowledged characteristic in many-dimensional statistics.
In Section 5.1, the Kolmogorov asymptotics is applied to develop a theory that improves the discriminant analysis of high-dimensional vectors with independent components. The improvement is achieved by introducing appropriate weights for the contributions of the independent variables in the discriminant function. These weights are used to construct an asymptotically unimprovable discriminant procedure. Then, the problem of selecting variables for discrimination is solved, and the optimum selection threshold is found.
But the main success in the development of multiparametric solutions was achieved by combining the Kolmogorov asymptotics with the spectral theory of random matrices, developed independently at the end of the 20th century in another field.