
Maximum spacing estimation

From Wikipedia, the free encyclopedia

In statistics, maximum spacing estimation (MSE or MSP), or maximum product of spacing estimation (MPS), is a method for fitting the parameters of a mathematical model to data (Cheng & Amin 1983). The method rests on the probability integral transform: applying the true cumulative distribution function to a set of independent random samples from a random variable yields values that are, on average, uniformly distributed between 0 and 1. The MPS method chooses the parameter values that make the observed data as uniform as possible, according to a specific quantitative measure of uniformity.

One of the most common methods for estimating the parameters of a distribution from data, the method of maximum likelihood (ML), can break down in various cases, such as certain mixtures of continuous distributions or heavy-tailed continuous distributions where the location or scale parameters are unknown. The maximum spacing estimation method is consistent with Kullback–Leibler information theory and can be used to estimate parameters in these situations (Ranneby 1984). The method was independently derived by Russell Cheng and Nik Amin, then at the University of Wales Institute of Science and Technology, and Bo Ranneby, then at the Swedish University of Agricultural Sciences (Ranneby 1984).

Apart from its use in pure mathematics and statistics, the method is applicable to applied sciences such as hydrology (Hall et al. 2004) and econometrics (Anatolyev & Kosenok 2005).

Definition

Theory

Several justifications have been given for maximum spacing methods. Ranneby (1984) justifies the method by demonstrating that it is an estimator of the Kullback–Leibler divergence, similar to maximum likelihood estimation, but with more robust properties for various classes of problems. Cheng and Amin (1983) explain that, by the probability integral transform, the values of the cumulative distribution function at the observations are uniformly distributed when evaluated at the true parameter. The differences between the values of the cumulative distribution function at consecutive observations (the "spacings") should therefore be roughly equal. Because the spacings sum to 1, equal spacings are exactly the case that maximizes their geometric mean, so the parameters that maximize the geometric mean achieve the "best" fit in this sense.
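
The claim that equal spacings maximize the geometric mean can be checked with a short worked argument. This is a sketch based on the inequality of arithmetic and geometric means, not a derivation taken from the cited papers. Write D_1, \ldots, D_{n+1} for the n + 1 spacings, which are non-negative and sum to 1 because they partition the interval from 0 to 1:

\begin{align}
 \left( \prod_{i=1}^{n+1} D_i \right)^{1/(n+1)} & \leq \frac{1}{n+1}\sum_{i=1}^{n+1} D_i = \frac{1}{n+1}
\end{align}

Equality holds only when D_1 = \cdots = D_{n+1} = 1/(n+1), so the geometric mean is largest exactly when the spacings are all equal.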

Formal definition

Given an independent and identically distributed (iid) random sample of size n from a statistical population with a univariate distribution, let the cumulative distribution function be:


F(x;\theta^0)\colon \theta^0\in\Theta, \Theta\subseteq\mathbb{R}^k (k\geq 1)

Let X_1,\ldots,X_n be the ordered sample, i.e. take the sample and place the observations in size order from smallest to largest.

Let \theta\in\Theta be an estimator for θ0. Then F(x;θ) is an estimator for F(x;θ0).

Pyke (1972)[note 1] defines the first-order spacing as:

 D_i(\theta) = F(X_i;\theta) - F( X_{i-1};\theta),\quad D_1(\theta) = F(X_1;\theta),\quad D_{n+1}(\theta) = 1 - F(X_n;\theta)

This may be thought of as the "spacing" between the values of the distribution function at adjacent order statistics, starting at 0 and ending at 1, as D_1 = F(X_1;θ) − 0.
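
As a concrete illustration of the spacings, the following minimal sketch computes them for a small sample under a standard normal distribution. The function name `first_order_spacings` and the data values are hypothetical and chosen only for illustration; `NormalDist` comes from the Python standard library.

```python
from statistics import NormalDist


def first_order_spacings(sample, cdf):
    """Return D_1, ..., D_{n+1} for the sorted sample under the given CDF."""
    # Pad the CDF values with 0 and 1 so the first and last spacings
    # are F(X_1) - 0 and 1 - F(X_n).
    u = [0.0] + [cdf(x) for x in sorted(sample)] + [1.0]
    return [hi - lo for lo, hi in zip(u, u[1:])]


sample = [-1.3, -0.4, 0.1, 0.8, 1.9]  # illustrative data, not from the source
spacings = first_order_spacings(sample, NormalDist(0.0, 1.0).cdf)
print(len(spacings), round(sum(spacings), 12))
```

A sample of size n yields n + 1 spacings, and they sum to 1 up to floating-point rounding because the differences telescope.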

Let G_n(θ) be the statistic defined as the geometric mean of the first-order spacings of the ordered sample, and let S_n(θ) be the natural logarithm of G_n(θ).


\begin{align}
 G_n(\theta) & = \left( \prod_{i=1}^{n+1}{D_i}(\theta) \right)^{1/(n+1)}\\
 S_n(\theta) & = \ln G_n(\theta) \\
 S_n(\theta) & = \frac{1}{n+1}\sum_{i=1}^{n+1}\ln{D_i}(\theta)\\
 \hat{\theta} & = \underset{\theta\in\Theta}{\operatorname{arg\,max}} \; S_n(\theta)
\end{align}

In other words, if there exists a \hat{\theta}\in\Theta that maximizes S_n(θ), then \hat{\theta} is the maximum spacing estimator of θ0.

In practice, optimization is usually performed by minimizing −S_n(θ), similar to maximum likelihood procedures, which usually minimize the negative log-likelihood.
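
The definition can be turned into a small numerical procedure. The sketch below is one possible illustration, not the method of any cited paper: it assumes a one-parameter exponential model with unknown rate, uses made-up data, and minimizes the negative mean log spacing with a hand-written golden-section search (the names `neg_mean_log_spacing`, `exp_cdf` and `golden_min` are hypothetical).

```python
import math


def neg_mean_log_spacing(sample, cdf):
    """Return -S_n: minus the mean of the log first-order spacings."""
    u = [0.0] + [cdf(x) for x in sorted(sample)] + [1.0]
    d = [hi - lo for lo, hi in zip(u, u[1:])]
    if min(d) <= 0.0:
        return math.inf  # zero spacing (ties or underflow): log undefined
    return -sum(math.log(v) for v in d) / len(d)


def exp_cdf(rate):
    """CDF of an exponential distribution with the given rate."""
    return lambda x: 1.0 - math.exp(-rate * x)


def golden_min(f, lo, hi, tol=1e-6):
    """Golden-section search for the minimum of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


data = [0.12, 0.35, 0.61, 0.94, 1.40, 2.25]  # illustrative data


def objective(rate):
    return neg_mean_log_spacing(data, exp_cdf(rate))


rate_hat = golden_min(objective, 1e-3, 10.0)
print(rate_hat)
```

For models with several parameters, a general-purpose numerical optimizer would take the place of the one-dimensional search; the objective function stays the same.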

Ranneby (1984) defines S_n(θ) somewhat differently as:

S_n(\theta) = \frac{1}{n+1}\sum_{i=1}^{n+1}\ln {\left((n+1)\cdot D_i(\theta) \right)}

The maximum spacing estimate under this statistic is identical to that under Cheng and Amin's original definition, as the (n + 1) term is constant with respect to θ. Similarly, Cheng and Stephens (1989), when discussing goodness of fit, use Moran's statistic which is the Cheng and Amin definition multiplied by − (n + 1), and minimization instead of maximization is used to find the estimate.
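
The relationships among the three forms can be written out explicitly. In this sketch, the label S_n^{R} for Ranneby's version is notation introduced here for illustration only, and M_n denotes Moran's statistic:

\begin{align}
 S_n^{R}(\theta) & = \frac{1}{n+1}\sum_{i=1}^{n+1}\ln\left((n+1)\cdot D_i(\theta)\right) = S_n(\theta) + \ln(n+1)\\
 M_n(\theta) & = -\sum_{i=1}^{n+1}\ln D_i(\theta) = -(n+1)\, S_n(\theta)
\end{align}

Adding the constant \ln(n+1) leaves the maximizer unchanged, and multiplying by the negative constant −(n + 1) turns the maximizer into a minimizer, so all three forms lead to the same estimate.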

Properties

Consistency

The maximum spacing estimator is a consistent estimator in that it converges in probability to the true value of the parameter, θ0, as the sample size increases to infinity (Ranneby 1984). The consistency of maximum spacing estimation holds under much more general conditions than for maximum likelihood estimators (Cheng & Amin 1983).

Efficiency

Maximum spacing estimators are asymptotically at least as efficient as maximum likelihood estimators, where the latter exist. However, maximum spacing estimators may exist in cases where maximum likelihood estimators do not (Cheng & Amin 1983).

Sensitivity

Maximum spacing estimators are sensitive to closely spaced observations, and especially ties (Cheng & Stephens 1989). Given

X_{i+k} = X_{i+k-1}=\cdots=X_i, \,

we get

 D_{i+k}(\theta) = D_{i+k-1}(\theta) = \cdots = D_{i+1}(\theta) = 0. \,

When the ties are due to multiple observations, Cheng and Amin (1983) show the repeated spacings (those that would otherwise be zero) should be replaced by the corresponding likelihood. That is, one should substitute f_i(θ) for D_i(θ), as


\lim_{x_i \to x_{i-1}} \frac{\int_{x_{i-1}}^{x_i} f(t;\theta) \; dt}{x_i - x_{i-1}} = f(x_{i-1};\theta) = f(x_{i};\theta),

since x_i = x_{i-1}.

When ties are due to rounding error, Cheng and Stephens (1989) suggest another method to remove the effects.[note 2] Given r tied observations from x_i to x_{i+r-1} with common value x, let δ represent the round-off error. All of the true values should then fall in the range x \pm \delta. The corresponding points on the distribution should now fall between y_L = F(x-\delta, \hat\theta) and y_U = F(x+\delta, \hat\theta).

Set:


D_j = \frac{y_U-y_L}{r-1} \quad (j=i+1,\ldots,i+r-1).

This is equivalent to assuming that the rounded values are uniformly spaced in the interval.
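
The rounding correction can be sketched in code. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function name `rounded_tie_spacings` is hypothetical, only the zero spacings inside each tied group are replaced (as in the formula above), and the example uses a made-up sample with a uniform distribution on the interval from 0 to 4.

```python
def rounded_tie_spacings(sample, cdf, delta):
    """First-order spacings with zero spacings from rounding ties replaced.

    For each run of r tied values at x, the r - 1 zero spacings are set to
    (F(x + delta) - F(x - delta)) / (r - 1).
    """
    xs = sorted(sample)
    n = len(xs)
    u = [0.0] + [cdf(x) for x in xs] + [1.0]
    d = [hi - lo for lo, hi in zip(u, u[1:])]  # d[k] is D_{k+1}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and xs[j + 1] == xs[i]:
            j += 1  # extend the run of tied values xs[i..j]
        r = j - i + 1
        if r > 1:
            y_low = cdf(xs[i] - delta)
            y_up = cdf(xs[i] + delta)
            for k in range(i + 1, j + 1):  # the spacings that were zero
                d[k] = (y_up - y_low) / (r - 1)
        i = j + 1
    return d


def uniform_0_4_cdf(x):
    """CDF of a uniform distribution on [0, 4] (illustrative choice)."""
    return min(max(x / 4.0, 0.0), 1.0)


spacings = rounded_tie_spacings([1.0, 2.0, 2.0, 2.0, 3.0], uniform_0_4_cdf, 0.5)
print(spacings)
```

After the replacement the spacings no longer sum exactly to 1, since the untied spacings on either side of the group are left as they were.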

The MSE method is also sensitive to secondary clustering (Cheng & Stephens 1989). One example is a set of observations thought to come from a single normal distribution that in fact comes from a mixture of normals with different means. Another is data thought to come from an exponential distribution that actually comes from a gamma distribution. In the latter case, smaller spacings may occur in the lower tail. A high value of M(θ) would indicate this secondary clustering effect and should prompt a closer look at the data (Cheng & Stephens 1989).

Goodness of fit

The statistic S_n(θ) is also a form of Moran or Moran–Darling statistic, M(θ), which can be used to test goodness of fit.[note 3] It has been shown that the statistic, when defined as

S_n(\theta) = M_n(\theta)= -\sum_{j=1}^{n+1}\ln{D_j(\theta)},

is asymptotically normal, and a chi-squared approximation exists for small samples (Cheng & Stephens 1989). When the true parameter θ0 is known, the statistic is approximately normally distributed with mean and variance


\begin{align}
 \mu_M & \approx (n+1)(\ln(n+1)+\gamma)-\frac{1}{2}-\frac{1}{12(n+1)},\\
 \sigma^2_M & \approx (n+1)\left ( \frac{\pi^2}{6} -1 \right ) -\frac{1}{2}-\frac{1}{6(n+1)},
\end{align}

where γ is the Euler–Mascheroni constant which is approximately 0.57722 (Cheng & Stephens 1989).[note 4]

The distribution can also be approximated by that of A, where

A = C_1 + C_2\chi^2_n \,,

in which


\begin{align}
C_1 &= \mu_M - \sqrt{\frac{\sigma^2_M n}{2}},\\
C_2 &= \sqrt{\frac{\sigma^2_M}{2n}},
\end{align}

and where \chi^2_n follows a chi-squared distribution with n degrees of freedom. Therefore, to test the hypothesis H_0 that a random sample of n values comes from the distribution F(x;θ), the statistic T(\theta)= \frac{M(\theta)-C_1}{C_2} can be calculated. Then H_0 should be rejected at significance level α if T(θ) is greater than the corresponding upper critical value of the chi-squared distribution with n degrees of freedom (Cheng & Stephens 1989).

Where θ0 is being estimated by \hat\theta, Cheng and Stephens (1989) showed that S_n(\hat\theta) = M_n(\hat\theta) has the same asymptotic mean and variance as in the known case. However, the test statistic to be used is

T(\hat\theta) =  \frac{M(\hat\theta)+\frac{k}{2}-C_1}{C_2},

where k is the number of parameters in the estimate \hat\theta.
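
The test can be sketched as a short calculation. The code below is an illustration under stated assumptions rather than a reference implementation: the function name `moran_test_statistic` is hypothetical, the sample is made up, the hypothesized distribution is uniform on the interval from 0 to 1, and the critical value 18.307 is the tabulated upper 5% point of the chi-squared distribution with 10 degrees of freedom.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def moran_test_statistic(sample, cdf, k=0):
    """Return (M, T) for Moran's statistic.

    k is the number of estimated parameters (0 when the parameter is known).
    Raises ValueError if a spacing is zero, for example with tied data.
    """
    n = len(sample)
    u = [0.0] + [cdf(x) for x in sorted(sample)] + [1.0]
    m_stat = -sum(math.log(hi - lo) for lo, hi in zip(u, u[1:]))
    # Approximate mean and variance of M from the formulas above.
    mu = (n + 1) * (math.log(n + 1) + EULER_GAMMA) - 0.5 - 1.0 / (12 * (n + 1))
    var = (n + 1) * (math.pi ** 2 / 6 - 1) - 0.5 - 1.0 / (6 * (n + 1))
    c1 = mu - math.sqrt(var * n / 2)
    c2 = math.sqrt(var / (2 * n))
    return m_stat, (m_stat + k / 2 - c1) / c2


def uniform_cdf(x):
    """CDF of a uniform distribution on [0, 1] (hypothesized model)."""
    return min(max(x, 0.0), 1.0)


sample = [i / 11 for i in range(1, 11)]  # 10 evenly spread illustrative points
m_stat, t_stat = moran_test_statistic(sample, uniform_cdf)
CRITICAL_5PCT_10DF = 18.307  # upper 5% point of chi-squared, 10 d.f.
reject = t_stat > CRITICAL_5PCT_10DF
print(m_stat, t_stat, reject)
```

Evenly spread points give the smallest possible value of M, so the hypothesis is not rejected here; strongly clustered data would inflate M and push T toward the rejection region.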

Generalized maximum spacing

Alternate measures and spacings

Ranneby and Ekström (1997) generalized the method to approximate other measures besides the Kullback–Leibler measure. Ekström (1997) further expanded the method to investigate properties of estimators using higher order spacings, where an m-order spacing would be defined as F(X_{j+m}) − F(X_j).
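
Higher order spacings can be computed in the same way as first-order ones. The sketch below assumes overlapping spacings and the boundary conventions F(X_0) = 0 and F(X_{n+1}) = 1; both are assumptions made for illustration, the source gives only the formula above, and the function name `m_order_spacings` is hypothetical.

```python
def m_order_spacings(sample, cdf, m=1):
    """Overlapping m-order spacings F(X_{j+m}) - F(X_j), j = 0, ..., n + 1 - m.

    Assumes F(X_0) = 0 and F(X_{n+1}) = 1 at the boundaries.
    """
    u = [0.0] + [cdf(x) for x in sorted(sample)] + [1.0]
    return [u[j + m] - u[j] for j in range(len(u) - m)]


def uniform_cdf(x):
    """CDF of a uniform distribution on [0, 1] (illustrative choice)."""
    return min(max(x, 0.0), 1.0)


sample = [0.2, 0.5, 0.9]  # illustrative data
first_order = m_order_spacings(sample, uniform_cdf, m=1)
second_order = m_order_spacings(sample, uniform_cdf, m=2)
print(first_order, second_order)
```

With m = 1 the function reduces to the first-order spacings defined earlier.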

Multivariate distributions

Ranneby et al. (2005) extend maximum spacing methods to the multivariate case. As there is no natural order for \mathbb{R}^k (k>1), they discuss two alternative approaches: a geometric approach based on Dirichlet cells and a probabilistic approach based on a "nearest neighbor ball" metric.

Notes

  1. ^ The actual definition is sourced to (Pyke 1965), but without direct access to that paper, sourcing is given to Pyke's later paper which defines the spacings in passing. -- Editor
  2. ^ There appear to be some minor typographical errors in the paper. For example, in section 4.2, equation (4.1), the rounding replacement for D_j should not have the log term. In section 1, equation (1.2), D_j is defined to be the spacing itself, and M(θ) is the negative sum of the logs of D_j. If D_j is logged at this step, the result is always ≤ 0, as the difference between two adjacent points on a cumulative distribution is always ≤ 1, and strictly < 1 unless there are only two points at the bookends. Also, in section 4.3, on page 392, calculation shows that it is the variance \textstyle\tilde{\sigma^2} which has MPS estimate of 6.87, not the standard deviation \textstyle\tilde{\sigma}. -- Editor
  3. ^ The literature refers to related statistics as Moran or Moran–Darling statistics. For example, Cheng & Stephens (1989) analyze the form M(\theta)= -\sum_{j=1}^{n+1}\log{D_j(\theta)} where D_j(θ) is defined as above. Wong & Li (2006) use the same form as well. However, Beirlant et al. (2001) use the form M_n= -\sum_{j=0}^{n}\ln{((n + 1)(X_{n,j+1} - X_{n,j}))}, with the additional factor of (n + 1) inside the logged summation. The extra factors will make a difference in terms of the expected mean and variance of the statistic. For consistency, this article will continue to use the Cheng & Amin/Wong & Li form. -- Editor
  4. ^ Wong & Li (2006) leave out the Euler–Mascheroni constant from their description. -- Editor

References

Anatolyev, Stanislav; Grigory Kosenok (2005). "An Alternative to Maximum Likelihood Based on Spacings" (PDF). Econometric Theory (Cambridge University Press) 21 (2): 472–476. doi:10.1017/S0266466605050255. http://www.nes.ru/~gkosenok/MPS.pdf. Retrieved on 2009-01-21.

Beirlant, J.; E. J. Dudewicz, L. Györfi, E. C. van der Meulen (1997). "Nonparametric entropy estimation: an overview" (PDF). International Journal of Mathematical and Statistical Sciences 6 (1): 17–40. ISSN 1055-7490. http://www.menem.com/ilya/digital_library/entropy/beirlant_etal_97.pdf. Retrieved on 2008-12-31. Note: Linked paper is the updated 2001 version.

Cheng, R.C.H.; N.A.K. Amin (1983). "Estimating Parameters in Continuous Univariate Distributions with a Shifted Origin". Journal of the Royal Statistical Society Series B (Royal Statistical Society) 45 (3): 394–403. ISSN 0035-9246.

Cheng, R.C.H.; M. A. Stephens (1989). "A goodness-of-fit test using Moran's statistic with estimated parameters". Biometrika (Oxford University Press) 76 (2): 385–392. doi:10.1093/biomet/76.2.385.

Ekström, Magnus (1997). "Generalized Maximum Spacing Estimates" (PostScript). Research Report (Umeå University) 6. ISSN 0345-3928. http://www.matstat.umu.se/varia/reports/rep9706.ps.gz. Retrieved on 2008-12-30.

Hall, M.J.; H.F.P. van den Boogaard, R.C. Fernando, A.E. Mynett (2004). "The construction of confidence intervals for frequency analysis using resampling techniques" (PDF). Hydrology and Earth System Sciences (European Geosciences Union) 8 (2): 235–246. ISSN 1027-5606. http://www.hydrol-earth-syst-sci.net/8/235/2004/hess-8-235-2004.pdf. Retrieved on 2009-01-21.

Pyke, Ronald (1965). "Spacings". Journal of the Royal Statistical Society Series B (Royal Statistical Society) 27: 395–449. ISSN 0035-9246.

Pyke, Ronald (1972). "Spacings Revisited" (PDF). Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (University of California Press) 1: 417–427. MR0405709. Zbl 0234.62008. ISSN 0097-0433. http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.bsmsp/1200514103. Retrieved on 2008-12-30.

Ranneby, Bo (1984). "The Maximum Spacing Method. An Estimation Method Related to the Maximum Likelihood Method". Scandinavian Journal of Statistics 11 (2): 93–112. ISSN 0303-6898.

Ranneby, Bo; Magnus Ekström (1997). "Maximum Spacing Estimates Based on Different Metrics" (PostScript). Research Report (Umeå University) 5. ISSN 0345-3928. http://www.matstat.umu.se/varia/reports/rep9705.ps.gz. Retrieved on 2008-12-30.

Ranneby, Bo; S. Rao Jammalamadaka, Alex Teterukovskiy (2005). "The maximum spacing estimation for multivariate observations" (PDF). Journal of Statistical Planning and Inference (Elsevier) 129 (1–2): 427–446. doi:10.1016/j.jspi.2004.06.059. http://www.pstat.ucsb.edu/faculty/jammalam/html/research%20publication_files/MSP2.pdf. Retrieved on 2008-12-31.

Wong, T.S.T.; W.K. Li (2006). "A note on the estimation of extreme value distributions using maximum product of spacings" (PDF). IMS Lecture Notes–Monograph Series (Institute of Mathematical Statistics) 52: 272–283. doi:10.1214/074921706000001102. arXiv:math/0702830v1. http://projecteuclid.org/DPubS/Repository/1.0/Disseminate?view=body&id=pdf_1&handle=euclid.lnms/1196285981. Retrieved on 2008-12-31.
