In statistics, it is often necessary to fit an empirical distribution, obtained from observed data, to a theoretical distribution that best describes the data. This allows for the calculation of probabilities and other statistical inferences.
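SciPy provides numerous distributions in scipy.stats that can be fitted to data. A common way to choose among them is to fit each candidate with its fit method, which estimates parameters by maximum likelihood, and then compare the candidates by the sum of squared errors (SSE) between the density histogram of the data and the fitted probability density function (PDF) evaluated at the bin centers. The candidate with the lowest SSE is taken as the most suitable.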
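The SSE comparison can be sketched as a small helper. This is only an illustration under assumptions: the function name histogram_sse, the synthetic sample, and the choice of 10 bins are hypothetical and not part of the original example.

```python
import numpy as np
import scipy.stats as st


def histogram_sse(data, dist, params, bins=10):
    """Sum of squared errors between a density histogram and a fitted PDF."""
    # Empirical density: the bar areas sum to 1 over the data range.
    heights, edges = np.histogram(data, bins=bins, density=True)
    # Evaluate the PDF at the bin midpoints so both arrays have equal length.
    centers = (edges[:-1] + edges[1:]) / 2.0
    pdf = dist.pdf(centers, *params)
    return float(np.sum((heights - pdf) ** 2))


# Synthetic sample, for illustration only (hypothetical data).
rng = np.random.default_rng(0)
sample = rng.normal(loc=5.0, scale=2.0, size=500)

norm_params = st.norm.fit(sample)  # (loc, scale), maximum likelihood
sse_norm = histogram_sse(sample, st.norm, norm_params)

# A deliberately poor fit for comparison: same scale, shifted location.
shifted_params = (norm_params[0] + 3.0, norm_params[1])
sse_shifted = histogram_sse(sample, st.norm, shifted_params)
```

Evaluating the PDF at the bin midpoints keeps the two arrays the same length, which is what makes the element-wise subtraction valid. The SSE value depends on the number of bins, so the same binning should be used for every candidate being compared.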
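import warnings

import numpy as np
import scipy.stats as st

# Data points
data = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

# Empirical density histogram and the midpoint of each bin
hist, bin_edges = np.histogram(data, bins=10, density=True)
bin_centers = (bin_edges[:-1] + bin_edges[1:]) / 2

# Candidate theoretical distributions
distributions = ['norm', 'beta', 'gamma']

# Iterate over distributions and find best fit
best_dist = None
best_params = None
lowest_sse = float('inf')
for dist_name in distributions:
    dist = getattr(st, dist_name)
    try:
        with warnings.catch_warnings():
            warnings.simplefilter('ignore')
            # Fit distribution to data (maximum likelihood estimate)
            params = dist.fit(data)
        # Evaluate SSE between the histogram and the fitted PDF at the bin centers
        sse = np.sum((hist - dist.pdf(bin_centers, *params)) ** 2)
    except Exception:
        continue  # Skip candidates whose fit fails
    # Update best distribution if lower SSE found
    if sse < lowest_sse:
        lowest_sse = sse
        best_dist = dist
        best_params = params

# Cumulative probability P(X <= value) under the best fit
# (named p_value here, although it is not a hypothesis-test p-value)
value = 5
p_value = best_dist.cdf(value, *best_params)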
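In the example above, three theoretical distributions (normal, beta, and gamma) are fitted to the data, and the one with the lowest SSE is taken as the best fit. Which candidate wins depends on the data and on the binning, so the result should be checked rather than assumed. The value computed for 5 is the cumulative distribution function of the best-fitting distribution evaluated at 5 with its fitted parameters, that is, the probability of observing a value of at most 5. Strictly speaking, this is a cumulative probability rather than a p-value in the hypothesis-testing sense.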
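Once a distribution has been chosen, its fitted parameters can be bound into a "frozen" distribution object so they do not have to be passed to every call. The sketch below assumes a gamma fit on synthetic data with the location pinned at zero via floc=0; the sample and variable names are hypothetical and serve only as illustration.

```python
import numpy as np
import scipy.stats as st

# Synthetic sample, for illustration only (hypothetical data).
rng = np.random.default_rng(1)
sample = rng.gamma(shape=2.0, scale=1.5, size=1000)

# For gamma, fit returns (shape, loc, scale); floc=0 fixes the location at zero.
params = st.gamma.fit(sample, floc=0)

# Freeze the distribution: the fitted parameters are bound to the object.
fitted = st.gamma(*params)

value = 5
p_below = fitted.cdf(value)  # P(X <= 5)
p_above = fitted.sf(value)   # P(X > 5), the survival function 1 - cdf
```

Freezing avoids the easy mistake of calling cdf on the unfitted distribution without its shape, location, and scale parameters.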