Class StatisticsHelper<T>

Namespace
AiDotNet.Helpers
Assembly
AiDotNet.dll

Provides statistical calculation methods for various data analysis tasks.

public static class StatisticsHelper<T>

Type Parameters

T

The numeric type used for calculations (e.g., double, float, decimal).

Inheritance
StatisticsHelper<T>

Remarks

For Beginners: This class contains methods to calculate common statistical measures like averages, variations, and statistical tests. These help you understand your data's patterns and make decisions based on statistical evidence.

Methods

CalculateAIC(int, int, T)

Calculates the Akaike Information Criterion (AIC) for model comparison.

public static T CalculateAIC(int sampleSize, int parameterSize, T rss)

Parameters

sampleSize int

The number of observations in the sample.

parameterSize int

The number of parameters in the model.

rss T

The residual sum of squares (sum of squared errors).

Returns

T

The AIC value.

Remarks

For Beginners: The Akaike Information Criterion (AIC) helps you compare different models for the same data. It's calculated as 2k + n*[ln(2πRSS/n) + 1], where k is the number of parameters, n is the sample size, and RSS is the residual sum of squares. The AIC balances model fit against model complexity - a lower AIC indicates a better model. This formulation of AIC is based on the likelihood function assuming normally distributed errors. When comparing models, the absolute AIC value isn't important; what matters is the difference between models. A model with an AIC that's 2 or more points lower than another is considered better. This metric helps you avoid overfitting by penalizing models that use too many parameters.
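
As a rough illustration of the formula above, the sketch below computes AIC with plain double arithmetic; the sample size, parameter count, and RSS are made-up values for demonstration.

using System;

int n = 100;        // sample size (hypothetical)
int k = 3;          // number of parameters (hypothetical)
double rss = 250.0; // residual sum of squares (hypothetical)

// AIC = 2k + n * [ln(2*pi*RSS/n) + 1], matching the formula above
double aic = 2.0 * k + n * (Math.Log(2.0 * Math.PI * rss / n) + 1.0);
Console.WriteLine(aic); // lower is better when comparing models fit to the same data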

CalculateAICAlternative(int, int, T)

Calculates an alternative formulation of the Akaike Information Criterion (AIC).

public static T CalculateAICAlternative(int sampleSize, int parameterSize, T rss)

Parameters

sampleSize int

The number of observations in the sample.

parameterSize int

The number of parameters in the model.

rss T

The residual sum of squares (sum of squared errors).

Returns

T

The AIC value.

Remarks

For Beginners: The Akaike Information Criterion (AIC) is a measure used to compare different models for the same data. This alternative formulation calculates AIC as n*ln(RSS/n) + 2k, where n is the sample size, RSS is the residual sum of squares, and k is the number of parameters. The AIC balances model fit (the first term) against model complexity (the second term). Lower AIC values indicate better models. When comparing models, differences of 2 or less are considered negligible, differences between 4 and 7 indicate the model with the lower AIC is considerably better, and differences greater than 10 indicate the model with the lower AIC is substantially better. This metric helps prevent overfitting by penalizing models with too many parameters.

CalculateAUC(Vector<T>, Vector<T>)

Calculates the Area Under a Curve (AUC) given x and y coordinates.

public static T CalculateAUC(Vector<T> fpr, Vector<T> tpr)

Parameters

fpr Vector<T>

The x-coordinates (typically false positive rates for ROC curves).

tpr Vector<T>

The y-coordinates (typically true positive rates for ROC curves).

Returns

T

The area under the curve.

Remarks

For Beginners: This method calculates the area under a curve defined by a series of points. It uses the trapezoidal rule, which approximates the area by dividing it into trapezoids and summing their areas. For each adjacent pair of points, it calculates the width (difference in x-coordinates) and the average height (average of y-coordinates), then multiplies them to get the area of that trapezoid. While this method is typically used for ROC curves (where x is the false positive rate and y is the true positive rate), it can be used for any curve where you need to calculate the area underneath. The more points you have defining your curve, the more accurate the area calculation will be.
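
A minimal sketch of the trapezoidal rule described above, written against plain arrays; the coordinate values are hypothetical and assumed to be sorted by their x-coordinate.

using System;

double[] fpr = { 0.0, 0.1, 0.3, 0.6, 1.0 }; // x-coordinates (hypothetical ROC points)
double[] tpr = { 0.0, 0.5, 0.7, 0.9, 1.0 }; // y-coordinates

double auc = 0.0;
for (int i = 1; i < fpr.Length; i++)
{
    double width = fpr[i] - fpr[i - 1];             // difference in x
    double avgHeight = (tpr[i] + tpr[i - 1]) / 2.0; // average of y
    auc += width * avgHeight;                       // area of one trapezoid
}
Console.WriteLine(auc);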

CalculateAccuracy(Vector<T>, Vector<T>)

Calculates the accuracy of predictions by comparing them to actual values.

public static T CalculateAccuracy(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values.

Returns

T

The accuracy as a proportion between 0 and 1.

Remarks

For Beginners: Accuracy is the simplest measure of prediction performance - it's the proportion of predictions that exactly match the actual values. This method calculates accuracy by counting how many predictions are exactly equal to their corresponding actual values, then dividing by the total number of predictions. The result ranges from 0 (no correct predictions) to 1 (all predictions correct). While easy to understand, this strict definition of accuracy can be limiting for many problems, especially with continuous values where exact matches are rare. This basic version is most appropriate for classification problems with discrete categories.

CalculateAccuracy(Vector<T>, Vector<T>, PredictionType, T?)

Calculates the accuracy of predictions with support for different prediction types and tolerance levels.

public static T CalculateAccuracy(Vector<T> actual, Vector<T> predicted, PredictionType predictionType, T? tolerance = default)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values.

predictionType PredictionType

The type of prediction (Binary or Regression).

tolerance T

For regression, the acceptable error tolerance as a proportion (default is 0.05 or 5%).

Returns

T

The accuracy as a proportion between 0 and 1.

Remarks

For Beginners: This enhanced accuracy calculation supports both binary classification and regression problems. For binary classification, it works like the simpler version, counting exact matches. For regression problems, it introduces a tolerance parameter that allows predictions to be "close enough" rather than requiring exact matches. A prediction is considered correct if it's within a certain percentage (the tolerance) of the actual value. For example, with the default 5% tolerance, a prediction of 95 would be considered correct for an actual value of 100. This makes the accuracy metric much more useful for continuous values, where exact matches are unlikely.
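
The tolerance idea can be sketched with plain arrays: a prediction counts as correct when its relative error is within the tolerance. The values below are hypothetical, and the exact comparison rule the library applies may differ in edge cases.

using System;

double[] actual    = { 100.0, 50.0, 80.0 }; // hypothetical data
double[] predicted = {  95.0, 56.0, 81.0 };
double tolerance = 0.05; // 5%, the documented default

int correct = 0;
for (int i = 0; i < actual.Length; i++)
{
    // a prediction within 5% of the actual value counts as correct
    if (Math.Abs(predicted[i] - actual[i]) <= tolerance * Math.Abs(actual[i]))
        correct++;
}
double accuracy = (double)correct / actual.Length; // 2 of 3 correct here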

CalculateAccuracy(Vector<T>, Vector<T>, PredictionType, T?, bool)

public static T CalculateAccuracy(Vector<T> actual, Vector<T> predicted, PredictionType predictionType, T? tolerance, bool treatDefaultAsMissing)

Parameters

actual Vector<T>
predicted Vector<T>
predictionType PredictionType
tolerance T
treatDefaultAsMissing bool

Returns

T

CalculateAdjustedR2(T, int, int)

Calculates the adjusted R² value, which accounts for the number of predictors in the model.

public static T CalculateAdjustedR2(T r2, int n, int p)

Parameters

r2 T

The standard R² value.

n int

The number of observations (sample size).

p int

The number of predictors (independent variables) in the model.

Returns

T

The adjusted R² value.

Remarks

For Beginners: Adjusted R² is a modified version of R² that accounts for the number of predictors in your model. Regular R² always increases when you add more variables to your model, even if those variables don't actually improve predictions. Adjusted R² penalizes you for adding variables that don't help, making it more useful when comparing models with different numbers of variables. Like regular R², higher values indicate better model fit.
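
The usual adjustment is adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1); the snippet below applies that standard formula to hypothetical values (the library may handle edge cases such as n ≤ p + 1 differently).

using System;

double r2 = 0.85; // hypothetical R² from the fitted model
int n = 50;       // observations
int p = 4;        // predictors

// Standard adjustment: 1 - (1 - R²)(n - 1)/(n - p - 1)
double adjustedR2 = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1.0);
Console.WriteLine(adjustedR2); // slightly below 0.85, penalizing the extra predictors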

CalculateAdjustedRandIndex(Vector<T>, Vector<T>)

Calculates the Adjusted Rand Index (ARI) between two clusterings.

public static T CalculateAdjustedRandIndex(Vector<T> labels1, Vector<T> labels2)

Parameters

labels1 Vector<T>

The first set of cluster labels.

labels2 Vector<T>

The second set of cluster labels (e.g., ground truth).

Returns

T

The Adjusted Rand Index value, ranging from -1 to 1, where 1 indicates perfect agreement.

Remarks

The Adjusted Rand Index (ARI) is a measure of the similarity between two data clusterings, adjusted for chance. It computes a similarity measure between two clusterings by considering all pairs of samples and counting pairs that are assigned in the same or different clusters in the predicted and true clusterings.

For Beginners: This method compares two different ways of grouping the same data and tells you how similar they are, while accounting for random chance.

The score ranges from -1 to 1:

  • 1.0 means the two clusterings are identical
  • 0.0 means the agreement is what you'd expect from random clustering
  • Negative values mean the clusterings are worse than random

This is useful for:

  • Comparing your clustering results to known "ground truth" labels
  • Evaluating how stable your clustering algorithm is across different runs
  • Comparing different clustering algorithms on the same data

For example, if you cluster customer data and want to see how well it matches manually-defined segments, ARI tells you how similar your automated clustering is to the manual segmentation.

CalculateAucF1Score<TInput, TOutput>(ModelEvaluationData<T, TInput, TOutput>)

Calculates both the AUC and F1 score for model evaluation.

public static (T, T) CalculateAucF1Score<TInput, TOutput>(ModelEvaluationData<T, TInput, TOutput> evaluationData)

Parameters

evaluationData ModelEvaluationData<T, TInput, TOutput>

The model evaluation data containing actual and predicted values.

Returns

(T, T)

A tuple containing the AUC and F1 score.

Type Parameters

TInput
TOutput

Remarks

For Beginners: This method calculates two important classification metrics in one go: the Area Under the precision-recall Curve (AUC) and the F1 score. AUC measures the model's ability to rank positive instances higher than negative ones across all possible thresholds, while the F1 score balances precision and recall at a specific threshold. Together, these metrics provide a comprehensive view of model performance. AUC gives a threshold-independent assessment of ranking ability, while F1 shows performance at an operating point. This is useful because a model might have a good AUC (ranking ability) but still perform poorly at the chosen threshold, or vice versa. Having both metrics helps you understand different aspects of your model's performance.

CalculateAutoCorrelationFunction(Vector<T>, int)

Calculates the autocorrelation function (ACF) for a time series up to a specified maximum lag.

public static Vector<T> CalculateAutoCorrelationFunction(Vector<T> series, int maxLag)

Parameters

series Vector<T>

The time series data.

maxLag int

The maximum lag to calculate autocorrelation for.

Returns

Vector<T>

A vector containing autocorrelation values for lags 0 to maxLag.

Remarks

For Beginners: The autocorrelation function (ACF) measures the correlation between a time series and a lagged version of itself. It helps identify patterns, seasonality, and the degree of randomness in time series data. This method calculates autocorrelation for lags from 0 to maxLag by first computing the mean and variance of the series, then for each lag, calculating the sum of products of deviations from the mean for points separated by that lag, and finally normalizing by the variance. ACF values range from -1 to 1, with values close to 1 indicating strong positive correlation, values close to -1 indicating strong negative correlation, and values close to 0 indicating little correlation. The ACF is a fundamental tool in time series analysis, used for model identification in ARIMA modeling, detecting seasonality, and testing for randomness.
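
The procedure described above (mean, deviations, lag products, normalization by the lag-0 sum) can be sketched directly with plain arrays; the series values below are made up.

using System;

double[] series = { 2.0, 4.0, 3.0, 5.0, 4.0, 6.0, 5.0, 7.0 }; // hypothetical time series
int maxLag = 3;

double mean = 0.0;
foreach (double v in series) mean += v;
mean /= series.Length;

// denominator: sum of squared deviations from the mean (the lag-0 term)
double denom = 0.0;
foreach (double v in series) denom += (v - mean) * (v - mean);

double[] acf = new double[maxLag + 1];
for (int lag = 0; lag <= maxLag; lag++)
{
    double num = 0.0;
    for (int t = lag; t < series.Length; t++)
        num += (series[t] - mean) * (series[t - lag] - mean);
    acf[lag] = num / denom; // acf[0] is always 1
}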

CalculateBIC(int, int, T)

Calculates the Bayesian Information Criterion (BIC) for model comparison.

public static T CalculateBIC(int sampleSize, int parameterSize, T rss)

Parameters

sampleSize int

The number of observations in the sample.

parameterSize int

The number of parameters in the model.

rss T

The residual sum of squares (sum of squared errors).

Returns

T

The BIC value.

Remarks

For Beginners: The Bayesian Information Criterion (BIC) is similar to AIC but penalizes model complexity more strongly. It's calculated as n*ln(RSS/n) + k*ln(n), where n is the sample size, RSS is the residual sum of squares, and k is the number of parameters. Like AIC, lower BIC values indicate better models. The BIC tends to favor simpler models than AIC does, especially with larger sample sizes, because its penalty for additional parameters increases with sample size. When comparing models, differences of 2-6 points indicate positive evidence for the model with the lower BIC, differences of 6-10 indicate strong evidence, and differences greater than 10 indicate very strong evidence. BIC is particularly useful when you want to be more conservative about adding parameters to your model.
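
Applying the BIC formula above with plain doubles and hypothetical values:

using System;

int n = 200;        // sample size (hypothetical)
int k = 5;          // number of parameters (hypothetical)
double rss = 480.0; // residual sum of squares (hypothetical)

// BIC = n * ln(RSS / n) + k * ln(n), per the formula above
double bic = n * Math.Log(rss / n) + k * Math.Log(n);
Console.WriteLine(bic); // lower BIC indicates the preferred model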

CalculateBayesFactor<TInput, TOutput>(ModelStats<T, TInput, TOutput>)

Calculates the Bayes Factor for comparing two models.

public static T CalculateBayesFactor<TInput, TOutput>(ModelStats<T, TInput, TOutput> modelStats)

Parameters

modelStats ModelStats<T, TInput, TOutput>

The model statistics object containing necessary information.

Returns

T

The Bayes Factor.

Type Parameters

TInput
TOutput

Remarks

For Beginners: The Bayes Factor is a ratio that compares the evidence for two competing models. It's calculated as the ratio of the marginal likelihood of one model to the marginal likelihood of another (reference) model. A Bayes Factor greater than 1 indicates evidence in favor of the first model, while a value less than 1 favors the reference model. The strength of evidence is often interpreted using guidelines: 1-3 is considered weak evidence, 3-10 is substantial, 10-30 is strong, 30-100 is very strong, and >100 is decisive evidence. Unlike p-values, Bayes Factors can provide evidence in favor of the null hypothesis and allow for direct comparison of non-nested models. They're a fundamental tool in Bayesian model selection.

CalculateBetaCDF(T, T, T)

Calculates the cumulative distribution function (CDF) of the Beta distribution.

public static T CalculateBetaCDF(T x, T alpha, T beta)

Parameters

x T

The value at which to evaluate the CDF, must be between 0 and 1.

alpha T

The first shape parameter (α), must be positive.

beta T

The second shape parameter (β), must be positive.

Returns

T

The probability that a Beta(α, β) random variable is less than or equal to x.

Remarks

For Beginners: The Beta distribution CDF gives the probability that a random value from a Beta distribution falls below a certain point. This is essential for calculating confidence intervals for proportions, such as in Clopper-Pearson intervals.

Exceptions

ArgumentOutOfRangeException

Thrown when parameters are out of valid range.

CalculateBootstrapInterval(Vector<T>, Vector<T>, T)

Calculates confidence intervals using bootstrap resampling.

public static (T Lower, T Upper) CalculateBootstrapInterval(Vector<T> actual, Vector<T> predicted, T confidenceLevel)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the bootstrap confidence interval.

Remarks

For Beginners: Bootstrap intervals are a powerful way to estimate confidence intervals without making assumptions about the underlying distribution of your data. This method creates many new samples by randomly selecting values from your original predictions (with replacement), calculates the mean for each of these samples, and then determines the interval bounds based on the distribution of these means. For example, for a 95% confidence interval, it finds the values that contain the middle 95% of the bootstrap sample means. This approach is particularly useful when your data doesn't follow a normal distribution or when you have a small sample size.
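
A rough sketch of the percentile bootstrap described above: resample with replacement, collect the resampled means, and take the middle 95%. The data, seed, and resample count are arbitrary, and the library's internal procedure may differ in detail.

using System;

double[] values = { 1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 0.7, 1.3 }; // hypothetical sample
int resamples = 1000;
var rng = new Random(42); // fixed seed so the sketch is repeatable

double[] means = new double[resamples];
for (int b = 0; b < resamples; b++)
{
    double sum = 0.0;
    for (int i = 0; i < values.Length; i++)
        sum += values[rng.Next(values.Length)]; // draw with replacement
    means[b] = sum / values.Length;
}

Array.Sort(means);
double lower = means[(int)(0.025 * resamples)]; // 2.5th percentile of the bootstrap means
double upper = means[(int)(0.975 * resamples)]; // 97.5th percentile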

CalculateCRPS(Vector<T>, Vector<T>)

Calculates the CRPS for point predictions (assumes zero uncertainty).

public static T CalculateCRPS(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values (point forecasts without uncertainty).

Returns

T

The average CRPS value, which equals MAE for deterministic predictions.

Remarks

For Beginners: This overload is for deterministic predictions (no uncertainty estimates). In this case, CRPS is equivalent to Mean Absolute Error (MAE). Use the overload with predictedStdDev for probabilistic forecasts.

CalculateCRPS(Vector<T>, Vector<T>, Vector<T>)

Calculates the Continuous Ranked Probability Score (CRPS) for probabilistic forecasts.

public static T CalculateCRPS(Vector<T> actual, Vector<T> predictedMean, Vector<T> predictedStdDev)

Parameters

actual Vector<T>

The actual observed values.

predictedMean Vector<T>

The predicted mean values (point forecasts).

predictedStdDev Vector<T>

The predicted standard deviations (uncertainty estimates).

Returns

T

The average CRPS value across all observations. Lower values indicate better probabilistic forecasts.

Remarks

The Continuous Ranked Probability Score (CRPS) is a proper scoring rule that measures the accuracy of probabilistic predictions. It compares the predicted cumulative distribution function (CDF) with the observed value, rewarding forecasts that assign high probability to what actually happens.

For Beginners: CRPS evaluates how well a probabilistic forecast (one that includes uncertainty) matches reality. Unlike MAE which only looks at point predictions, CRPS considers the entire predicted probability distribution.

Imagine a weather forecast that says "temperature will be 20°C ± 3°C" (mean=20, std=3). If the actual temperature is 20°C, that's a good forecast. If the actual temperature is 25°C, that's outside the predicted range - not as good. CRPS quantifies this, penalizing forecasts that are both inaccurate and overconfident.

Key properties:

  • Lower CRPS is better (like MAE)
  • When predictions have zero uncertainty, CRPS equals MAE
  • CRPS rewards well-calibrated uncertainty estimates
  • Units are the same as the predicted variable

This implementation assumes Gaussian (normal) predictive distributions, which is appropriate for most time series forecasting models like DeepAR, TFT, and Chronos.

CalculateCalinskiHarabaszIndex(Matrix<T>, Vector<T>)

Calculates the Calinski-Harabasz Index (Variance Ratio Criterion) for a clustering result.

public static T CalculateCalinskiHarabaszIndex(Matrix<T> data, Vector<T> labels)

Parameters

data Matrix<T>

The data matrix where each row is an observation.

labels Vector<T>

The cluster labels for each observation.

Returns

T

The Calinski-Harabasz Index.

Remarks

For Beginners: The Calinski-Harabasz Index (CHI), also known as the Variance Ratio Criterion, measures the ratio of between-cluster variance to within-cluster variance, adjusted for the number of clusters and data points. Higher values indicate better clustering with dense, well-separated clusters. It's calculated as [(n-k)/(k-1)] * [B/W], where n is the number of points, k is the number of clusters, B is the between-cluster variance, and W is the within-cluster variance. This method computes these components by first calculating cluster centroids and the global centroid, then measuring the variances based on these. CHI is particularly useful for comparing clustering results with different numbers of clusters, as it accounts for this difference in its formula. It works best for convex, well-separated clusters.

CalculateChiSquareCDF(int, T)

Calculates the chi-square cumulative distribution function (CDF) value for a given chi-square value and degrees of freedom.

public static T CalculateChiSquareCDF(int degreesOfFreedom, T x)

Parameters

degreesOfFreedom int

The degrees of freedom parameter for the chi-square distribution.

x T

The chi-square value to evaluate.

Returns

T

The probability that a chi-square random variable with the specified degrees of freedom is less than or equal to x.

Remarks

For Beginners: The chi-square CDF function calculates the probability that a chi-square random variable is less than or equal to a specific value. This is useful in hypothesis testing to determine if your observed results could have happened by chance. This method uses the incomplete gamma function to calculate the result, which is a standard approach for computing chi-square probabilities.

CalculateChiSquarePDF(int, T)

Calculates the probability density function (PDF) of the chi-square distribution.

public static T CalculateChiSquarePDF(int degreesOfFreedom, T x)

Parameters

degreesOfFreedom int

The degrees of freedom parameter for the chi-square distribution.

x T

The value at which to evaluate the PDF.

Returns

T

The probability density at point x.

Remarks

For Beginners: The chi-square PDF tells you the relative likelihood of observing a particular value in a chi-square distribution. Think of it as measuring how common a specific value is within this distribution. The degrees of freedom parameter determines the shape of the distribution - higher values create a more symmetric bell curve.

CalculateClopperPearsonInterval(int, int, T)

Calculates the Clopper-Pearson (exact) confidence interval for a binomial proportion.

public static (T Lower, T Upper) CalculateClopperPearsonInterval(int successes, int trials, T confidence)

Parameters

successes int

The number of observed successes.

trials int

The total number of trials.

confidence T

The confidence level, between 0 and 1 (e.g., 0.95 for 95% confidence).

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the confidence interval.

Remarks

For Beginners: The Clopper-Pearson interval provides exact confidence bounds for the true probability of success in a binomial experiment. Unlike the normal approximation, it guarantees the specified coverage probability, making it suitable for small samples and extreme proportions.

This is the mathematically correct way to compute confidence intervals for proportions, especially important in certified robustness where we need guaranteed coverage.
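
Because the method takes plain counts and a confidence level, a call with double as the numeric type might look like the following; the counts are hypothetical, and double is assumed to satisfy the library's numeric type requirements.

using AiDotNet.Helpers;

// 47 successes out of 50 trials at 95% confidence (hypothetical counts)
var (lower, upper) = StatisticsHelper<double>.CalculateClopperPearsonInterval(47, 50, 0.95);
// lower and upper bound the true success probability with the guaranteed coverage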

Exceptions

ArgumentOutOfRangeException

Thrown when parameters are out of valid range.

CalculateClopperPearsonLowerBound(int, int, T)

Calculates the Clopper-Pearson lower confidence bound for a binomial proportion.

public static T CalculateClopperPearsonLowerBound(int successes, int trials, T confidence)

Parameters

successes int

The number of observed successes.

trials int

The total number of trials.

confidence T

The confidence level for a one-sided bound.

Returns

T

The lower bound of the one-sided confidence interval.

Remarks

This is particularly useful for randomized smoothing certification where we need a guaranteed lower bound on the top class probability.

CalculateConditionNumber(Matrix<T>, ModelStatsOptions)

Calculates the condition number of a matrix using the specified method.

public static T CalculateConditionNumber(Matrix<T> matrix, ModelStatsOptions options)

Parameters

matrix Matrix<T>

The matrix to analyze.

options ModelStatsOptions

Options for model statistics calculations, including the condition number calculation method.

Returns

T

The condition number of the matrix.

Remarks

For Beginners: The condition number of a matrix measures how sensitive a linear system is to errors or changes in the input. A high condition number indicates that small changes in the input can lead to large changes in the output, which is a sign of an ill-conditioned problem. In the context of regression, a high condition number suggests that the model might be unstable and sensitive to small changes in the data. This method supports several approaches to calculate the condition number, including Singular Value Decomposition (SVD), L1 norm, infinity norm, and power iteration. Each approach has different computational characteristics, but they all provide a measure of the matrix's conditioning. Lower condition numbers (closer to 1) indicate better conditioning.

Exceptions

ArgumentException

Thrown when an unsupported condition number calculation method is specified.

CalculateConfidenceIntervals(Vector<T>, T, DistributionType)

Calculates confidence intervals for the mean of a set of values based on the specified distribution type and confidence level.

public static (T LowerBound, T UpperBound) CalculateConfidenceIntervals(Vector<T> values, T confidenceLevel, DistributionType distributionType)

Parameters

values Vector<T>

The sample data.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

distributionType DistributionType

The type of distribution to use for the calculation.

Returns

(T LowerBound, T UpperBound)

A tuple containing the lower and upper bounds of the confidence interval.

Remarks

For Beginners: Confidence intervals tell you the range where the true population mean is likely to be, based on your sample data. For example, a 95% confidence interval means that if you were to take many samples and calculate the confidence interval for each, about 95% of these intervals would contain the true population mean. This method calculates these intervals for different types of distributions (Normal, Laplace, Student's t, LogNormal, Exponential, or Weibull). The calculation approach varies by distribution type, but all involve finding the appropriate critical values based on the confidence level and using them to calculate the margin of error around the sample mean or median.

Exceptions

ArgumentException

Thrown when an invalid distribution type is specified.

CalculateConfusionMatrix(Vector<T>, Vector<T>, T)

Calculates a confusion matrix for binary classification at a specified threshold.

public static ConfusionMatrix<T> CalculateConfusionMatrix(Vector<T> actual, Vector<T> predicted, T threshold)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

threshold T

The threshold above which predictions are considered positive.

Returns

ConfusionMatrix<T>

A confusion matrix containing counts of true positives, true negatives, false positives, and false negatives.

Remarks

For Beginners: A confusion matrix summarizes the performance of a classification model by counting how many instances were correctly and incorrectly classified for each class. This method creates a confusion matrix for binary classification by comparing actual values to predictions at a specified threshold. It counts true positives (correctly identified positives), true negatives (correctly identified negatives), false positives (negatives incorrectly classified as positives), and false negatives (positives incorrectly classified as negatives). The confusion matrix is the foundation for many classification metrics, including accuracy, precision, recall, F1 score, and specificity. It provides a more detailed view of model performance than single metrics, showing exactly where the model makes mistakes.

Exceptions

ArgumentException

Thrown when inputs have different lengths.

CalculateCorrelationMatrix(Matrix<T>, ModelStatsOptions)

Calculates a correlation matrix for a set of features.

public static Matrix<T> CalculateCorrelationMatrix(Matrix<T> features, ModelStatsOptions options)

Parameters

features Matrix<T>

The matrix of features, where each column represents a feature.

options ModelStatsOptions

Options for model statistics calculations, including multicollinearity threshold.

Returns

Matrix<T>

A matrix of correlation coefficients between each pair of features.

Remarks

For Beginners: A correlation matrix shows how each feature in your dataset relates to every other feature. Each cell in the matrix contains the Pearson correlation coefficient between two features, which ranges from -1 (perfect negative correlation) to 1 (perfect positive correlation), with 0 indicating no linear relationship. The diagonal of the matrix always contains 1s since each feature perfectly correlates with itself. This method also checks for multicollinearity, which occurs when features are highly correlated with each other. High multicollinearity can cause problems in regression models because it makes it difficult to determine the individual effect of each feature. The method logs a warning when it detects correlation above the threshold specified in the options.

CalculateCovarianceMatrix(Matrix<T>)

Calculates the covariance matrix for a dataset.

public static Matrix<T> CalculateCovarianceMatrix(Matrix<T> matrix)

Parameters

matrix Matrix<T>

The data matrix where each row is an observation and each column is a variable.

Returns

Matrix<T>

The covariance matrix.

Remarks

For Beginners: The covariance matrix measures how variables in a dataset vary together. Each element (i,j) in the matrix represents the covariance between the i-th and j-th variables. Diagonal elements are variances of individual variables, while off-diagonal elements show how pairs of variables co-vary. This method calculates the covariance matrix by first computing the mean of each variable, then for each pair of variables, calculating the average product of their deviations from their respective means. The resulting matrix is symmetric (covariance of X with Y equals covariance of Y with X). The covariance matrix is essential for many statistical techniques, including principal component analysis, Mahalanobis distance, and multivariate normal distributions. It helps understand the structure and relationships in multivariate data.

CalculateCredibleIntervals(Vector<T>, T, DistributionType)

Calculates the credible intervals for a given set of values based on the specified distribution type and confidence level.

public static (T LowerBound, T UpperBound) CalculateCredibleIntervals(Vector<T> values, T confidenceLevel, DistributionType distributionType)

Parameters

values Vector<T>

The sample data.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

distributionType DistributionType

The type of distribution to use for the calculation.

Returns

(T LowerBound, T UpperBound)

A tuple containing the lower and upper bounds of the credible interval.

Remarks

For Beginners: Credible intervals are the Bayesian equivalent of confidence intervals. They give you a range of values that likely contains the true parameter value with a certain probability (the confidence level). For example, a 95% credible interval means there's a 95% probability that the true value falls within that range. This method calculates these intervals for different types of distributions (Normal, Laplace, Student's t, LogNormal, Exponential, or Weibull) based on your data. The method automatically calculates the necessary parameters from your data and then uses the appropriate inverse CDF function to find the interval bounds.

Exceptions

ArgumentException

Thrown when an invalid distribution type is specified.

CalculateDIC<TInput, TOutput>(ModelStats<T, TInput, TOutput>)

Calculates the Deviance Information Criterion (DIC) for Bayesian model comparison.

public static T CalculateDIC<TInput, TOutput>(ModelStats<T, TInput, TOutput> modelStats)

Parameters

modelStats ModelStats<T, TInput, TOutput>

The model statistics object containing necessary information.

Returns

T

The DIC value.

Type Parameters

TInput
TOutput

Remarks

For Beginners: The Deviance Information Criterion (DIC) is a hierarchical modeling generalization of the AIC and BIC, used for Bayesian model comparison. It's calculated as D(θ̄) + 2pD, where D(θ̄) is the deviance at the posterior mean (a measure of how well the model fits the data), and pD is the effective number of parameters (a measure of model complexity). Lower DIC values indicate better models. DIC is particularly useful for comparing Bayesian models where the posterior distributions have been obtained using Markov Chain Monte Carlo (MCMC) methods. Like AIC and BIC, DIC balances model fit against complexity, but it's specifically designed for Bayesian models where the effective number of parameters might not be clear due to prior information and hierarchical structure.

CalculateDaviesBouldinIndex(Matrix<T>, Vector<T>)

Calculates the Davies-Bouldin Index for a clustering result.

public static T CalculateDaviesBouldinIndex(Matrix<T> data, Vector<T> labels)

Parameters

data Matrix<T>

The data matrix where each row is an observation.

labels Vector<T>

The cluster labels for each observation.

Returns

T

The Davies-Bouldin Index.

Remarks

For Beginners: The Davies-Bouldin Index (DBI) measures the average similarity between each cluster and its most similar cluster. Lower values indicate better clustering with more distinct, well-separated clusters. For each cluster, it calculates the ratio of the sum of within-cluster scatter to the between-cluster separation for the most similar cluster, then averages these ratios. This method first calculates centroids for all clusters, then for each cluster, finds the maximum ratio with any other cluster and adds it to the total. DBI is particularly useful because it considers both the compactness of clusters (how close points are to their centroids) and the separation between clusters. Unlike some metrics, it doesn't improve simply by increasing the number of clusters, making it valuable for comparing clustering results with different numbers of clusters.

CalculateDistance(Vector<T>, Vector<T>, DistanceMetricType, Matrix<T>?)

Calculates the distance or similarity between two vectors using the specified metric.

public static T CalculateDistance(Vector<T> v1, Vector<T> v2, DistanceMetricType metric, Matrix<T>? covarianceMatrix = null)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

metric DistanceMetricType

The distance metric to use.

covarianceMatrix Matrix<T>

The covariance matrix (required only for Mahalanobis distance).

Returns

T

The calculated distance or similarity value.

Remarks

For Beginners: This method provides a unified interface for calculating various distance and similarity measures between two vectors. It supports common metrics like Euclidean (straight-line) distance, Manhattan (city block) distance, cosine similarity (angle between vectors), Jaccard similarity (overlap between sets), Hamming distance (number of different positions), and Mahalanobis distance (accounting for correlations). Each metric has different properties and is suitable for different types of data and applications. For example, cosine similarity is good for text data, Euclidean distance works well for low-dimensional continuous data, and Hamming distance is appropriate for categorical features. This method lets you easily switch between metrics to find the one that works best for your specific problem.

Exceptions

ArgumentException

Thrown when an unsupported distance metric is specified.

ArgumentNullException

Thrown when covariance matrix is null for Mahalanobis distance.

CalculateDurbinWatsonStatistic(Vector<T>, Vector<T>)

Calculates the Durbin-Watson statistic to test for autocorrelation in residuals.

public static T CalculateDurbinWatsonStatistic(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The Durbin-Watson statistic.

Remarks

For Beginners: The Durbin-Watson statistic tests whether there is autocorrelation in the residuals (errors) of a regression model. Autocorrelation means that the error at one point is correlated with the error at another point, which violates a key assumption of many regression models. The statistic ranges from 0 to 4, with a value of 2 indicating no autocorrelation. Values less than 2 suggest positive autocorrelation (adjacent errors tend to have the same sign), while values greater than 2 suggest negative autocorrelation (adjacent errors tend to have opposite signs). This method calculates the statistic by first computing the residuals (actual - predicted) and then passing them to the other overload of this method.

CalculateDurbinWatsonStatistic(List<T>)

Calculates the Durbin-Watson statistic from a list of residuals.

public static T CalculateDurbinWatsonStatistic(List<T> residualList)

Parameters

residualList List<T>

The list of residuals (errors) from a model.

Returns

T

The Durbin-Watson statistic.

Remarks

For Beginners: This version of the Durbin-Watson statistic calculation takes a list of residuals (errors) directly. The statistic is calculated as the sum of squared differences between consecutive residuals divided by the sum of squared residuals. It tests for autocorrelation in the residuals, which means checking whether the error at one point is related to the error at adjacent points. The statistic ranges from 0 to 4, with a value of 2 indicating no autocorrelation. Values less than 2 suggest positive autocorrelation, while values greater than 2 suggest negative autocorrelation. Detecting autocorrelation is important because it can indicate that your model is missing important variables or structure in the data.
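
The statistic as described above, computed directly from a small list of hypothetical residuals:

using System;
using System.Collections.Generic;

var residuals = new List<double> { 0.5, -0.2, 0.1, -0.4, 0.3, -0.1 }; // hypothetical errors

double numerator = 0.0;   // sum of squared differences between consecutive residuals
double denominator = 0.0; // sum of squared residuals
for (int i = 0; i < residuals.Count; i++)
{
    if (i > 0)
    {
        double diff = residuals[i] - residuals[i - 1];
        numerator += diff * diff;
    }
    denominator += residuals[i] * residuals[i];
}

double dw = numerator / denominator; // values near 2 suggest no autocorrelation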

CalculateDynamicTimeWarping(Vector<T>, Vector<T>)

Calculates the Dynamic Time Warping (DTW) distance between two time series.

public static T CalculateDynamicTimeWarping(Vector<T> series1, Vector<T> series2)

Parameters

series1 Vector<T>

The first time series.

series2 Vector<T>

The second time series.

Returns

T

The DTW distance.

Remarks

For Beginners: Dynamic Time Warping (DTW) is a technique for measuring similarity between two temporal sequences that may vary in speed or timing. Unlike Euclidean distance, which compares points at the same time index, DTW finds the optimal alignment between sequences by warping the time axis. This method implements DTW using dynamic programming to build a matrix of distances and find the path with minimal cumulative distance. Lower DTW distances indicate more similar sequences. DTW is particularly useful for comparing patterns in time series data, such as speech recognition, gesture recognition, and financial time series analysis. It can detect similarities even when sequences are shifted, stretched, or compressed in time, making it more robust than point-by-point comparison methods.
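
A compact sketch of the dynamic-programming recurrence described above, using the absolute difference as the local cost; the two series are hypothetical.

using System;

double[] a = { 1.0, 2.0, 3.0, 4.0, 3.0 };      // first series (hypothetical)
double[] b = { 1.0, 1.5, 3.0, 3.5, 4.0, 3.0 }; // second series (hypothetical)

int n = a.Length, m = b.Length;
double[,] dtw = new double[n + 1, m + 1];
for (int i = 0; i <= n; i++)
    for (int j = 0; j <= m; j++)
        dtw[i, j] = double.PositiveInfinity;
dtw[0, 0] = 0.0;

for (int i = 1; i <= n; i++)
{
    for (int j = 1; j <= m; j++)
    {
        double cost = Math.Abs(a[i - 1] - b[j - 1]);
        // extend the cheapest of the three possible alignment paths
        dtw[i, j] = cost + Math.Min(dtw[i - 1, j], Math.Min(dtw[i, j - 1], dtw[i - 1, j - 1]));
    }
}

double distance = dtw[n, m]; // lower means more similar sequences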

CalculateEffectiveNumberOfParameters(Matrix<T>, Vector<T>)

Calculates the effective number of parameters in a model using the trace of the hat matrix.

public static T CalculateEffectiveNumberOfParameters(Matrix<T> features, Vector<T> coefficients)

Parameters

features Matrix<T>

The feature matrix.

coefficients Vector<T>

The model coefficients.

Returns

T

The effective number of parameters.

Remarks

For Beginners: The effective number of parameters measures the complexity of a model, accounting for regularization and prior information that might reduce the effective complexity below the actual number of parameters. This method calculates it using the trace of the "hat matrix" (H = X(X'X)^(-1)X'), which maps the observed values to the fitted values in linear regression. The trace of this matrix equals the number of parameters in ordinary least squares regression, but can be less in regularized models. This metric is important in information criteria like AIC, BIC, and DIC, which penalize model complexity to prevent overfitting. It's particularly useful for hierarchical and regularized models where the nominal number of parameters might overstate the model's complexity.

CalculateExplainedVarianceScore(Vector<T>, Vector<T>)

Calculates the explained variance score between actual and predicted values.

public static T CalculateExplainedVarianceScore(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The explained variance score, typically between 0 and 1.

Remarks

For Beginners: The explained variance score measures how much of the variance in the actual data is captured by your model. It's similar to R², but focuses specifically on variance.

  • A score of 1 means your model perfectly captures the variance in the data
  • A score of 0 means your model doesn't explain any of the variance
  • Negative scores can occur when the model is worse than just predicting the mean

This metric helps you understand how well your model accounts for the spread in your data.
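
The score is commonly computed as 1 - Var(actual - predicted) / Var(actual); the sketch below applies that definition to hypothetical values (the exact variance convention the library uses may differ).

using System;

double[] actual    = { 3.0, 5.0, 2.5, 7.0 }; // hypothetical data
double[] predicted = { 2.8, 5.3, 2.9, 6.8 };

int n = actual.Length;
double meanActual = 0.0, meanResidual = 0.0;
for (int i = 0; i < n; i++)
{
    meanActual += actual[i] / n;
    meanResidual += (actual[i] - predicted[i]) / n;
}

double varActual = 0.0, varResidual = 0.0;
for (int i = 0; i < n; i++)
{
    varActual += Math.Pow(actual[i] - meanActual, 2) / n;
    double r = actual[i] - predicted[i];
    varResidual += Math.Pow(r - meanResidual, 2) / n;
}

double explainedVariance = 1.0 - varResidual / varActual; // 1 = perfect, 0 = no better than the mean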

CalculateExponentialPDF(T, T)

Calculates the probability density function (PDF) value for an exponential distribution.

public static T CalculateExponentialPDF(T lambda, T x)

Parameters

lambda T

The rate parameter of the exponential distribution.

x T

The value at which to evaluate the PDF.

Returns

T

The PDF value at the specified point.

Remarks

For Beginners: The exponential PDF function calculates the height of the probability curve at a specific point for an exponential distribution. The exponential distribution is commonly used to model the time between events in a process where events occur continuously and independently at a constant average rate. The lambda parameter represents this rate - higher values of lambda mean events happen more frequently on average. The PDF is zero for negative values of x, reflecting that you can't have negative time between events.

CalculateF1Score(T, T)

Calculates the F1 score from precision and recall values.

public static T CalculateF1Score(T precision, T recall)

Parameters

precision T

The precision value.

recall T

The recall value.

Returns

T

The F1 score.

Remarks

For Beginners: The F1 score is a single metric that balances precision and recall. It's calculated as 2 * (precision * recall) / (precision + recall), which is the harmonic mean of precision and recall. The F1 score ranges from 0 to 1, with higher values indicating better performance. It's particularly useful when you need a single metric to evaluate your model and when the classes in your data are imbalanced. The F1 score gives equal weight to precision and recall, making it a good choice when both false positives and false negatives are important to minimize. If the denominator is zero (which happens when both precision and recall are zero), this method returns zero to avoid division by zero errors.
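
The harmonic-mean formula with the zero-denominator guard mentioned above, using hypothetical precision and recall values:

using System;

double precision = 0.82; // hypothetical
double recall = 0.74;    // hypothetical

double denominator = precision + recall;
double f1 = denominator == 0.0
    ? 0.0                                     // avoid division by zero when both are 0
    : 2.0 * precision * recall / denominator;
Console.WriteLine(f1);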

CalculateForecastInterval(Vector<T>, Vector<T>, T)

Calculates a forecast interval for future predictions.

public static (T Lower, T Upper) CalculateForecastInterval(Vector<T> actual, Vector<T> predicted, T confidenceLevel)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the forecast interval.

Remarks

For Beginners: A forecast interval is a range where future observations are expected to fall with a certain probability. It's similar to a prediction interval but specifically designed for time series forecasting. This method calculates forecast intervals based on the mean squared error between actual and predicted values, adjusted by a t-value corresponding to the desired confidence level. The resulting interval gives you a range where you can expect future values to fall, with the specified level of confidence. Wider intervals indicate greater uncertainty in your forecasts.

CalculateFriedmanMSE(Vector<T>, List<int>, List<int>)

Calculates the Friedman Mean Squared Error for a potential split in a decision tree.

public static T CalculateFriedmanMSE(Vector<T> y, List<int> leftIndices, List<int> rightIndices)

Parameters

y Vector<T>

The target values vector.

leftIndices List<int>

Indices of data points in the left branch.

rightIndices List<int>

Indices of data points in the right branch.

Returns

T

The Friedman MSE value for the split.

Remarks

For Beginners: This method evaluates how good a split is in a decision tree by measuring the weighted squared difference between the means of the two resulting groups. A higher value indicates a better split that creates more distinct groups. This is one of several criteria used to determine where to split data in decision tree algorithms.

CalculateGoodnessOfFit(Vector<T>, Func<T, T>)

Calculates the goodness of fit for a probability distribution against sample data.

public static T CalculateGoodnessOfFit(Vector<T> values, Func<T, T> pdfFunction)

Parameters

values Vector<T>

The sample data.

pdfFunction Func<T, T>

A function that calculates the PDF value for a given data point.

Returns

T

The negative log-likelihood, which measures how well the distribution fits the data (smaller values indicate better fit).

Remarks

For Beginners: This method measures how well a particular probability distribution fits your data. It works by calculating the probability density function (PDF) value for each data point, taking the logarithm of these values, summing them up, and then negating the result. This gives what's called the "negative log-likelihood" - a common measure in statistics. The smaller this value, the better the distribution fits your data. This is useful when you want to determine which type of distribution (normal, exponential, Weibull, etc.) best describes your data.
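
The procedure described above (evaluate the PDF at each point, take logs, sum, negate), sketched with a normal PDF and hypothetical fitted parameters:

using System;

double[] data = { 4.8, 5.1, 5.3, 4.9, 5.0, 5.4 }; // hypothetical sample
double mean = 5.08, stdDev = 0.22;                // hypothetical fitted parameters

// Normal PDF used as the candidate distribution
Func<double, double> pdf = x =>
    Math.Exp(-0.5 * Math.Pow((x - mean) / stdDev, 2)) / (stdDev * Math.Sqrt(2.0 * Math.PI));

double negLogLikelihood = 0.0;
foreach (double x in data)
    negLogLikelihood -= Math.Log(pdf(x)); // smaller means a better fit

Console.WriteLine(negLogLikelihood);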

CalculateInverseBetaCDF(T, T, T)

Calculates the inverse cumulative distribution function (quantile function) of the Beta distribution.

public static T CalculateInverseBetaCDF(T probability, T alpha, T beta)

Parameters

probability T

The probability for which to find the quantile, must be between 0 and 1.

alpha T

The first shape parameter (α), must be positive.

beta T

The second shape parameter (β), must be positive.

Returns

T

The value x such that P(X ≤ x) = probability for X ~ Beta(α, β).

Remarks

For Beginners: This function finds the value below which a certain percentage of the Beta distribution falls. For example, if probability = 0.95, it finds the value below which 95% of the distribution lies. This is used in calculating exact confidence intervals for proportions (Clopper-Pearson intervals).

The implementation uses Newton-Raphson iteration for numerical stability and accuracy.

Exceptions

ArgumentOutOfRangeException

Thrown when parameters are out of valid range.

CalculateInverseChiSquareCDF(int, T)

Calculates the inverse of the chi-square cumulative distribution function (CDF).

public static T CalculateInverseChiSquareCDF(int degreesOfFreedom, T probability)

Parameters

degreesOfFreedom int

The degrees of freedom parameter for the chi-square distribution.

probability T

A value between 0 and 1 representing the probability.

Returns

T

The chi-square value corresponding to the given probability and degrees of freedom.

Remarks

For Beginners: The inverse chi-square function helps you find a specific chi-square value when you know the probability and degrees of freedom. The chi-square distribution is commonly used in statistical tests to determine if observed data matches expected data. "Degrees of freedom" refers to the number of values that are free to vary in a calculation, which affects the shape of the distribution. This method uses an initial approximation followed by a refinement technique called the Newton-Raphson method to find an accurate result.

Exceptions

ArgumentOutOfRangeException

Thrown when probability is not between 0 and 1 or when degrees of freedom is not positive.

CalculateInverseExponentialCDF(T, T)

Calculates the inverse cumulative distribution function (CDF) of the exponential distribution.

public static T CalculateInverseExponentialCDF(T lambda, T probability)

Parameters

lambda T

The rate parameter of the exponential distribution.

probability T

The probability value (between 0 and 1).

Returns

T

The value x such that P(X ≤ x) = probability for an exponential random variable X.

Remarks

For Beginners: The inverse CDF helps you find a value in your distribution given a probability. For example, if you want to know what value represents the 90th percentile in an exponential distribution, you would use this function with probability = 0.9. The exponential distribution is often used to model the time between events, like customer arrivals or equipment failures.
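
For the exponential distribution the inverse CDF has the closed form x = -ln(1 - p) / λ; the sketch below applies it to the 90th-percentile example mentioned above with a hypothetical rate.

using System;

double lambda = 0.5;      // rate parameter (hypothetical)
double probability = 0.9; // the 90th percentile

double x = -Math.Log(1.0 - probability) / lambda;
Console.WriteLine(x); // the value below which 90% of the distribution falls (about 4.6 here)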

Exceptions

ArgumentOutOfRangeException

Thrown when probability is not between 0 and 1.

CalculateInverseLaplaceCDF(T, T, T)

Calculates the inverse of the Laplace cumulative distribution function (CDF).

public static T CalculateInverseLaplaceCDF(T median, T mad, T probability)

Parameters

median T

The median (location parameter) of the Laplace distribution.

mad T

The mean absolute deviation (scale parameter) of the Laplace distribution.

probability T

A value between 0 and 1 representing the probability.

Returns

T

The value from the Laplace distribution corresponding to the given probability.

Remarks

For Beginners: The inverse Laplace CDF function helps you find a specific value when you know the probability for a Laplace distribution. The Laplace distribution (also called the double exponential distribution) has a peak at the median and falls off exponentially on both sides. It's often used to model data that has heavier tails than a normal distribution. The median parameter tells you where the center of the distribution is, while the mean absolute deviation (mad) tells you how spread out the values are.

CalculateInverseNormalCDF(T)

Calculates the inverse of the standard normal cumulative distribution function (CDF).

public static T CalculateInverseNormalCDF(T probability)

Parameters

probability T

A value between 0 and 1 representing the probability.

Returns

T

The z-score corresponding to the given probability.

Remarks

For Beginners: The inverse normal CDF function helps you find a specific value (called a z-score) when you know the probability. Think of it like this: if you know that 95% of values in a normal distribution are below a certain point, this function tells you what that point is. The normal distribution is the familiar bell-shaped curve used in statistics. This method uses a mathematical approximation to calculate the result quickly and accurately.

Exceptions

ArgumentOutOfRangeException

Thrown when probability is not between 0 and 1.

CalculateInverseNormalCDF(T, T, T)

Calculates the inverse of the normal cumulative distribution function (CDF) with specified mean and standard deviation.

public static T CalculateInverseNormalCDF(T mean, T stdDev, T probability)

Parameters

mean T

The mean (average) of the normal distribution.

stdDev T

The standard deviation of the normal distribution.

probability T

A value between 0 and 1 representing the probability.

Returns

T

The value from the normal distribution with the given mean and standard deviation corresponding to the specified probability.

Remarks

For Beginners: This version of the inverse normal CDF function allows you to specify the mean (average) and standard deviation (a measure of how spread out the values are) of your normal distribution. It converts the result from the standard normal distribution (which has mean 0 and standard deviation 1) to your specific distribution. For example, if you want to find the value that 90% of the data falls below in a normal distribution with mean 100 and standard deviation 15, you would use this function with those parameters and a probability of 0.9.
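
Following the example in the paragraph above, a call with double as the numeric type might look like this (double is assumed to satisfy the library's numeric type requirements):

using AiDotNet.Helpers;

// Value that 90% of a Normal(mean: 100, stdDev: 15) distribution falls below
double value = StatisticsHelper<double>.CalculateInverseNormalCDF(100.0, 15.0, 0.9);
// roughly 100 + 15 * 1.28, so about 119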

CalculateInverseStudentTCDF(int, T)

Calculates the inverse of the Student's t cumulative distribution function (CDF).

public static T CalculateInverseStudentTCDF(int degreesOfFreedom, T probability)

Parameters

degreesOfFreedom int

The degrees of freedom parameter for the t-distribution.

probability T

A value between 0 and 1 representing the probability.

Returns

T

The t-value corresponding to the given probability and degrees of freedom.

Remarks

For Beginners: The inverse Student's t-distribution function helps you find a specific t-value when you know the probability and degrees of freedom. The Student's t-distribution is similar to the normal distribution but has heavier tails, making it useful when working with small sample sizes. It's commonly used in hypothesis testing when the population standard deviation is unknown. This method uses a series of approximations to calculate the result, with special handling for low degrees of freedom where the approximation needs to be more careful.

Exceptions

ArgumentOutOfRangeException

Thrown when probability is not between 0 and 1.

CalculateJackknifeInterval(Vector<T>, Vector<T>)

Calculates confidence intervals using the jackknife resampling method.

public static (T Lower, T Upper) CalculateJackknifeInterval(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the jackknife confidence interval.

Remarks

For Beginners: The jackknife method is a resampling technique that helps estimate the bias and variance of a statistic. Unlike bootstrap (which creates new samples by randomly selecting with replacement), jackknife creates new samples by leaving out one observation at a time. This method calculates jackknife confidence intervals by computing the mean of each leave-one-out sample, then using the distribution of these means to estimate the standard error. It then applies a t-value to calculate the confidence interval. Jackknife intervals are useful when you have a small sample size or when you want to reduce the influence of potential outliers.

CalculateKendallTau(Vector<T>, Vector<T>)

Calculates Kendall's tau correlation coefficient between two vectors.

public static T CalculateKendallTau(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first vector.

y Vector<T>

The second vector.

Returns

T

Kendall's tau correlation coefficient.

Remarks

For Beginners: Kendall's tau is another rank correlation measure that assesses the ordinal association between two variables. It's calculated by comparing every possible pair of observations and counting concordant pairs (where both variables change in the same direction) and discordant pairs (where variables change in opposite directions). The coefficient is the difference between concordant and discordant pairs, divided by the total number of possible pairs. Like other correlation coefficients, it ranges from -1 to 1. Kendall's tau is particularly robust to outliers and doesn't assume any particular distribution. It's often used when the data has many tied ranks or when a more robust measure than Spearman's correlation is needed.
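
A direct sketch of the pair-counting definition above, ignoring the tie corrections that a full implementation would handle; the paired values are hypothetical.

using System;

double[] x = { 1.0, 2.0, 3.0, 4.0, 5.0 }; // hypothetical paired observations
double[] y = { 2.0, 1.0, 4.0, 3.0, 5.0 };

int concordant = 0, discordant = 0;
for (int i = 0; i < x.Length; i++)
{
    for (int j = i + 1; j < x.Length; j++)
    {
        double sign = (x[i] - x[j]) * (y[i] - y[j]);
        if (sign > 0) concordant++;      // both variables change in the same direction
        else if (sign < 0) discordant++; // opposite directions (ties ignored here)
    }
}

int totalPairs = x.Length * (x.Length - 1) / 2;
double tau = (double)(concordant - discordant) / totalPairs;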

Exceptions

ArgumentException

Thrown when vectors have different lengths.

CalculateLOO<TInput, TOutput>(ModelStats<T, TInput, TOutput>)

Calculates the Leave-One-Out Cross-Validation (LOO-CV) criterion for Bayesian model comparison.

public static T CalculateLOO<TInput, TOutput>(ModelStats<T, TInput, TOutput> modelStats)

Parameters

modelStats ModelStats<T, TInput, TOutput>

The model statistics object containing necessary information.

Returns

T

The LOO-CV value.

Type Parameters

TInput
TOutput

Remarks

For Beginners: Leave-One-Out Cross-Validation (LOO-CV) is a method for estimating how well a model will perform on unseen data. It works by fitting the model multiple times, each time leaving out one observation, and then predicting that observation with the model trained on all other data. This method calculates the LOO-CV criterion as -2 times the sum of the logarithms of these leave-one-out predictive densities. Lower values indicate better models. LOO-CV is particularly useful because it directly estimates out-of-sample prediction accuracy without requiring you to hold out a separate validation set. It's more computationally intensive than information criteria like AIC or BIC, but it often provides a more accurate estimate of a model's predictive performance.

CalculateLaplacePDF(T, T, T)

Calculates the probability density function (PDF) value for a Laplace distribution.

public static T CalculateLaplacePDF(T median, T mad, T x)

Parameters

median T

The median (location parameter) of the Laplace distribution.

mad T

The mean absolute deviation (scale parameter) of the Laplace distribution.

x T

The value at which to evaluate the PDF.

Returns

T

The PDF value at the specified point.

Remarks

For Beginners: The Laplace PDF function calculates the height of the probability curve at a specific point for a Laplace distribution. The Laplace distribution (also called the double exponential distribution) has a peak at the median and falls off exponentially on both sides. Unlike the normal distribution which has a bell shape, the Laplace distribution has a sharper peak and heavier tails. The median parameter tells you where the center of the distribution is, while the mean absolute deviation (mad) tells you how spread out the values are.

CalculateLearningCurve(Vector<T>, Vector<T>, int)

Calculates a learning curve by evaluating model performance on increasingly larger subsets of data.

public static List<T> CalculateLearningCurve(Vector<T> yActual, Vector<T> yPredicted, int steps)

Parameters

yActual Vector<T>

The actual values.

yPredicted Vector<T>

The predicted values from a model.

steps int

The number of points to calculate in the learning curve.

Returns

List<T>

A list of R-squared values representing the model performance at each step.

Remarks

For Beginners: A learning curve helps you understand how a model's performance improves as it sees more training data. This method creates a learning curve by calculating the R-squared (a measure of how well the model fits the data) for increasingly larger subsets of your data. For example, it might calculate R-squared for the first 10% of the data, then the first 20%, and so on. The resulting curve can help you determine if your model would benefit from more training data or if it's already reached its potential. A curve that's still rising at the end suggests that more data could improve performance, while a plateau indicates that additional data might not help much.

CalculateLeaveOneOutPredictiveDensities(Matrix<T>, Vector<T>, Func<Matrix<T>, Vector<T>, Vector<T>>)

Calculates leave-one-out predictive densities for each observation.

public static List<T> CalculateLeaveOneOutPredictiveDensities(Matrix<T> features, Vector<T> actualValues, Func<Matrix<T>, Vector<T>, Vector<T>> modelFitFunction)

Parameters

features Matrix<T>

The feature matrix.

actualValues Vector<T>

The actual observed values.

modelFitFunction Func<Matrix<T>, Vector<T>, Vector<T>>

A function that fits the model and returns coefficients.

Returns

List<T>

A list of leave-one-out predictive densities.

Remarks

For Beginners: This method implements leave-one-out cross-validation by systematically excluding each observation, fitting the model on the remaining data, and then calculating how well that model predicts the excluded observation. For each observation, it removes that data point, trains the model on the remaining data, predicts the value for the excluded point, and calculates the likelihood of the actual value given this prediction. The result is a list of predictive densities, one for each observation. These values are used in the LOO-CV criterion to assess the model's predictive performance. This approach is computationally intensive but provides a robust estimate of out-of-sample prediction accuracy.

CalculateLikelihood(T, T)

Calculates the likelihood of an observed value given a predicted value.

public static T CalculateLikelihood(T actual, T predicted)

Parameters

actual T

The actual observed value.

predicted T

The predicted value from a model.

Returns

T

The likelihood value.

Remarks

For Beginners: Likelihood measures how probable the observed data is under a specific model. This method calculates the likelihood for a single observation using a Gaussian (normal) distribution centered at the predicted value. It computes exp(-0.5 * residual²), where residual is the difference between the actual and predicted values. Higher likelihood values indicate that the model's prediction is closer to the actual value. Likelihood is a fundamental concept in statistics and forms the basis for many estimation methods, including maximum likelihood estimation. In Bayesian statistics, the likelihood function is combined with prior distributions to obtain posterior distributions, which are used for inference and prediction.

CalculateLogLikelihood(Vector<T>, Vector<T>)

Calculates the log-likelihood of a model given actual and predicted values.

public static T CalculateLogLikelihood(Vector<T> actualValues, Vector<T> predictedValues)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

Returns

T

The log-likelihood value.

Remarks

For Beginners: The log-likelihood measures how well a model fits the observed data. It's calculated by summing the logarithms of the absolute residuals (differences between actual and predicted values) and multiplying by -0.5. Higher log-likelihood values (closer to zero) indicate better fit. Log-likelihood is used in many statistical contexts, including maximum likelihood estimation and information criteria like AIC and BIC. Working with log-likelihood instead of likelihood directly helps avoid numerical underflow with very small probability values. This particular implementation assumes a Laplace distribution for the errors, which is more robust to outliers than the normal distribution typically used in log-likelihood calculations.

CalculateLogNormalPDF(T, T, T)

Calculates the probability density function (PDF) value for a log-normal distribution.

public static T CalculateLogNormalPDF(T mu, T sigma, T x)

Parameters

mu T

The mean of the natural logarithm of the distribution.

sigma T

The standard deviation of the natural logarithm of the distribution.

x T

The value at which to evaluate the PDF.

Returns

T

The PDF value at the specified point.

Remarks

For Beginners: The log-normal PDF function calculates the height of the probability curve at a specific point for a log-normal distribution. A log-normal distribution occurs when the logarithm of a variable follows a normal distribution. This distribution is useful for modeling quantities that can't be negative and are positively skewed, such as income, house prices, or certain biological measurements. The parameters mu and sigma are the mean and standard deviation of the variable's natural logarithm, not of the variable itself.

CalculateLogPointwisePredictiveDensity(Vector<T>, Vector<T>)

Calculates the log pointwise predictive density (LPPD) for a model.

public static T CalculateLogPointwisePredictiveDensity(Vector<T> actualValues, Vector<T> predictedValues)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

Returns

T

The log pointwise predictive density.

Remarks

For Beginners: The log pointwise predictive density (LPPD) measures how well a model's predictions match the observed data. It's calculated by summing the logarithms of the likelihoods for each observation. Higher LPPD values indicate better fit. This metric is used in information criteria like WAIC (Widely Applicable Information Criterion) to assess model performance. Unlike simple measures like mean squared error, LPPD accounts for the uncertainty in predictions by using the full likelihood function. It's particularly useful in Bayesian statistics because it can be calculated directly from posterior samples. The LPPD forms the basis for more complex model comparison metrics that balance fit against model complexity.

CalculateMarginalLikelihood(Vector<T>, Vector<T>, int)

Calculates an approximation of the marginal likelihood for a model.

public static T CalculateMarginalLikelihood(Vector<T> actualValues, Vector<T> predictedValues, int numParameters)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

numParameters int

The number of parameters in the model.

Returns

T

The approximated marginal likelihood.

Remarks

For Beginners: The marginal likelihood (also called the evidence) is a key quantity in Bayesian statistics that measures how well a model explains the observed data, integrating over all possible parameter values. This method approximates the marginal likelihood using the Bayesian Information Criterion (BIC). It first calculates the log-likelihood of the model, then computes the BIC as -2*log-likelihood + k*log(n), where k is the number of parameters and n is the sample size. Finally, it converts this to an approximation of the marginal likelihood using the formula exp(-0.5*BIC). The marginal likelihood is used in Bayes factors for model comparison, with higher values indicating models that better explain the data while accounting for model complexity.

CalculateMaxError(Vector<T>, Vector<T>)

Calculates the maximum absolute error between actual and predicted values.

public static T CalculateMaxError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The maximum absolute error.

Remarks

For Beginners: The maximum error measures the largest absolute difference between any actual value and its corresponding prediction. This metric gives you the "worst-case scenario" for your model's predictions. While mean or median error metrics tell you about the typical performance, the maximum error tells you about the extreme cases. This can be important in applications where even a single large error could have serious consequences. Like other error metrics, it's in the same units as your original data, making it easy to interpret in the context of your problem.

CalculateMean(IEnumerable<T>)

Calculates the arithmetic mean (average) of a collection of values.

public static T CalculateMean(IEnumerable<T> values)

Parameters

values IEnumerable<T>

The collection of values to average.

Returns

T

The arithmetic mean of the values.

Remarks

For Beginners: The mean is simply the average of all values in a dataset. It's calculated by adding up all values and dividing by the number of values.

CalculateMeanAbsoluteDeviation(Vector<T>, T)

Calculates the Mean Absolute Deviation (MAD) of a vector of values from a given median.

public static T CalculateMeanAbsoluteDeviation(Vector<T> values, T median)

Parameters

values Vector<T>

The vector of values to calculate MAD for.

median T

The median value to calculate deviations from.

Returns

T

The Mean Absolute Deviation of the values.

Remarks

For Beginners: Mean Absolute Deviation measures how spread out your data is from a central value (median). It calculates the average of the absolute differences between each value and the median.

For example, for values [2, 4, 6, 8] with median 5:
1. Calculate absolute differences: |2-5|=3, |4-5|=1, |6-5|=1, |8-5|=3
2. Calculate average: (3+1+1+3)/4 = 2

CalculateMeanAbsoluteError(Vector<T>, Vector<T>)

Calculates the Mean Absolute Error (MAE) between actual and predicted values.

public static T CalculateMeanAbsoluteError(Vector<T> actualValues, Vector<T> predictedValues)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

Returns

T

The mean absolute error.

Remarks

For Beginners: Mean Absolute Error measures the average magnitude of errors between predicted and actual values, without considering their direction (positive or negative). Unlike RMSE, MAE gives equal weight to all errors, making it less sensitive to outliers. Lower MAE values indicate better model performance.

CalculateMeanAbsoluteError(Vector<T>, List<int>, List<int>)

Calculates the Mean Absolute Error (MAE) for a split in decision tree algorithms.

public static T CalculateMeanAbsoluteError(Vector<T> y, List<int> leftIndices, List<int> rightIndices)

Parameters

y Vector<T>

The target values vector.

leftIndices List<int>

Indices of data points in the left branch.

rightIndices List<int>

Indices of data points in the right branch.

Returns

T

The negative weighted MAE (negative because lower MAE is better).

Remarks

For Beginners: This method helps decision trees decide where to split data. It calculates how far, on average, each data point is from the median value in its group. The result is negated because in optimization, we typically minimize a cost function, but for MAE, lower values are better (indicating less error).

CalculateMeanAbsolutePercentageError(Vector<T>, Vector<T>)

Calculates the Mean Absolute Percentage Error (MAPE) between actual and predicted values.

public static T CalculateMeanAbsolutePercentageError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

Vector of actual observed values.

predicted Vector<T>

Vector of predicted values.

Returns

T

The MAPE as a percentage.

Remarks

For Beginners: MAPE measures how accurate a prediction is as a percentage of the actual value. For example, if MAPE = 5%, it means predictions are off by 5% on average. Lower values indicate better predictions. The method skips pairs where the actual value is very close to zero to avoid division by zero issues.

Exceptions

ArgumentException

Thrown when actual and predicted vectors have different lengths.
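
A minimal sketch of this calculation in plain C# (illustrative only; the near-zero tolerance used for skipping is an arbitrary choice, not necessarily the one the library uses):

using System;

static double Mape(double[] actual, double[] predicted)
{
    if (actual.Length != predicted.Length) throw new ArgumentException("Vectors must have the same length.");
    double sum = 0.0;
    int count = 0;
    for (int i = 0; i < actual.Length; i++)
    {
        // Skip pairs where the actual value is (nearly) zero to avoid division by zero.
        if (Math.Abs(actual[i]) < 1e-10) continue;
        sum += Math.Abs((actual[i] - predicted[i]) / actual[i]);
        count++;
    }
    return count == 0 ? 0.0 : 100.0 * sum / count; // expressed as a percentage
}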

CalculateMeanAndStandardDeviation(Vector<T>)

Calculates the mean (average) and standard deviation of a set of values.

public static (T Mean, T StandardDeviation) CalculateMeanAndStandardDeviation(Vector<T> values)

Parameters

values Vector<T>

The sample data.

Returns

(T Mean, T StandardDeviation)

A tuple containing the mean and standard deviation.

Remarks

For Beginners: This method calculates two fundamental statistical measures from your data. The mean is simply the average of all values - add them up and divide by how many there are. The standard deviation measures how spread out the values are from the mean. A small standard deviation means the values tend to be close to the mean, while a large standard deviation means they're more spread out. This method uses a computationally efficient approach that only requires a single pass through the data.
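
A single-pass sketch in plain C# (illustrative only; whether the library divides by n or n - 1 for the standard deviation is not specified here):

using System;

static (double Mean, double StandardDeviation) MeanAndStdDev(double[] values)
{
    // Single pass: accumulate the sum and the sum of squares.
    double sum = 0.0, sumSq = 0.0;
    foreach (double v in values)
    {
        sum += v;
        sumSq += v * v;
    }
    double mean = sum / values.Length;
    // Population variance = E[x^2] - mean^2; clamp at zero to guard against rounding.
    double variance = Math.Max(0.0, sumSq / values.Length - mean * mean);
    return (mean, Math.Sqrt(variance));
}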

CalculateMeanAveragePrecision(Vector<T>, Vector<T>, int)

Calculates the Mean Average Precision (MAP) at k for a ranking task.

public static T CalculateMeanAveragePrecision(Vector<T> actual, Vector<T> predicted, int k)

Parameters

actual Vector<T>

The actual relevance scores.

predicted Vector<T>

The predicted scores used for ranking.

k int

The number of top items to consider.

Returns

T

The Mean Average Precision at k.

Remarks

For Beginners: Mean Average Precision (MAP) is a metric for evaluating ranking algorithms, particularly in information retrieval and recommendation systems. It measures how well a system ranks relevant items higher than irrelevant ones. This method calculates MAP by first sorting items by their predicted scores, then for each relevant item in the top k positions, calculating the precision at that position (the fraction of relevant items up to that point) and averaging these precision values. MAP ranges from 0 to 1, with higher values indicating better ranking performance. It's particularly useful because it considers both the order of relevant items and their positions in the ranking. Unlike metrics that only count relevant items, MAP rewards algorithms that place relevant items higher in the ranking.
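
A sketch of the average-precision-at-k idea for one ranked list, in plain C# (illustrative only; treating a positive actual score as "relevant" and normalizing by the number of relevant items found are assumptions of this sketch):

using System;
using System.Linq;

static double AveragePrecisionAtK(double[] actual, double[] predicted, int k)
{
    // Rank items by predicted score, highest first.
    int[] order = Enumerable.Range(0, predicted.Length)
                            .OrderByDescending(i => predicted[i])
                            .ToArray();
    int hits = 0;
    double sumPrecision = 0.0;
    for (int rank = 1; rank <= Math.Min(k, order.Length); rank++)
    {
        if (actual[order[rank - 1]] > 0) // relevant item at this position
        {
            hits++;
            sumPrecision += (double)hits / rank; // precision at this position
        }
    }
    return hits == 0 ? 0.0 : sumPrecision / hits;
}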

CalculateMeanBiasError(Vector<T>, Vector<T>)

Calculates the mean bias error between actual and predicted values.

public static T CalculateMeanBiasError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The mean bias error.

Remarks

For Beginners: The mean bias error (MBE) measures the average direction of errors in your predictions. It's calculated by taking the average of the differences between predicted and actual values (predicted - actual). A positive MBE indicates that your model tends to overestimate values (predictions are too high on average), while a negative MBE indicates that your model tends to underestimate values (predictions are too low on average). An MBE close to zero suggests that your model's errors are balanced in both directions. This metric is useful for detecting systematic bias in your predictions, but it can mask the magnitude of errors since positive and negative errors can cancel each other out.

CalculateMeanPredictionError(Vector<T>, Vector<T>)

Calculates the mean absolute prediction error between actual and predicted values.

public static T CalculateMeanPredictionError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The mean absolute prediction error.

Remarks

For Beginners: The mean prediction error (specifically, the mean absolute error or MAE) measures how far off your predictions are from the actual values, on average. It calculates the absolute difference between each predicted value and its corresponding actual value, then takes the average of these differences. This gives you a single number that represents the typical magnitude of your prediction errors. The MAE is in the same units as your original data, making it easy to interpret. For example, if you're predicting house prices in dollars and get an MAE of $10,000, it means your predictions are off by about $10,000 on average. Lower values indicate better predictive accuracy.

CalculateMeanReciprocalRank(Vector<T>, Vector<T>)

Calculates the Mean Reciprocal Rank (MRR) for a ranking task.

public static T CalculateMeanReciprocalRank(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual relevance scores.

predicted Vector<T>

The predicted scores used for ranking.

Returns

T

The Mean Reciprocal Rank.

Remarks

For Beginners: Mean Reciprocal Rank (MRR) is a metric for evaluating ranking algorithms that focuses on the position of the first relevant item. It's calculated as the reciprocal of the rank of the first relevant item (1/rank). This method sorts items by their predicted scores and returns the reciprocal of the position of the first item with a positive actual score. MRR ranges from 0 to 1, with higher values indicating better performance. It's particularly useful in scenarios where the user is likely to stop after finding the first relevant result, such as question answering or search engines. Unlike metrics that consider all relevant items, MRR only cares about how quickly the system can provide at least one correct answer, making it a good measure of the system's ability to quickly satisfy the user's information need.
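
A sketch of the reciprocal-rank idea for one ranked list (illustrative only; a positive actual score is treated as "relevant", which is an assumption of this sketch):

using System;
using System.Linq;

static double ReciprocalRank(double[] actual, double[] predicted)
{
    // Rank items by predicted score, highest first.
    int[] order = Enumerable.Range(0, predicted.Length)
                            .OrderByDescending(i => predicted[i])
                            .ToArray();
    for (int rank = 1; rank <= order.Length; rank++)
    {
        // The first item with a positive actual score determines the reciprocal rank.
        if (actual[order[rank - 1]] > 0)
            return 1.0 / rank;
    }
    return 0.0; // no relevant item found
}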

CalculateMeanSquaredError(Vector<T>, List<int>, List<int>)

Calculates the Mean Squared Error (MSE) for a potential split in a decision tree.

public static T CalculateMeanSquaredError(Vector<T> y, List<int> leftIndices, List<int> rightIndices)

Parameters

y Vector<T>

The target values vector.

leftIndices List<int>

Indices of data points in the left branch.

rightIndices List<int>

Indices of data points in the right branch.

Returns

T

The weighted MSE for the split.

Remarks

For Beginners: This method measures how much the values in each group (after splitting) vary from their group's average. Lower MSE values indicate that the data points in each group are closer to their group's average, suggesting a better split. The MSE is weighted by the size of each group to account for uneven splits.
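
A sketch of the weighted within-group MSE for a candidate split (illustrative only, not the library implementation):

using System;
using System.Collections.Generic;
using System.Linq;

static double SplitMse(double[] y, List<int> leftIndices, List<int> rightIndices)
{
    // Within-group MSE: average squared distance from the group's own mean.
    double GroupMse(List<int> indices)
    {
        if (indices.Count == 0) return 0.0;
        double mean = indices.Average(i => y[i]);
        return indices.Average(i => (y[i] - mean) * (y[i] - mean));
    }

    int total = leftIndices.Count + rightIndices.Count;
    // Weight each group's MSE by its share of the data points.
    return (leftIndices.Count * GroupMse(leftIndices)
          + rightIndices.Count * GroupMse(rightIndices)) / total;
}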

CalculateMeanSquaredError(IEnumerable<T>, IEnumerable<T>)

Calculates the Mean Squared Error (MSE) between actual and predicted values.

public static T CalculateMeanSquaredError(IEnumerable<T> actualValues, IEnumerable<T> predictedValues)

Parameters

actualValues IEnumerable<T>

The actual observed values.

predictedValues IEnumerable<T>

The predicted or estimated values.

Returns

T

The Mean Squared Error between the actual and predicted values.

Remarks

For Beginners: Mean Squared Error measures how accurate your predictions are compared to actual values. Lower MSE means better predictions.

It's calculated by:
1. Finding the difference between each actual and predicted value
2. Squaring each difference (to make all values positive and emphasize larger errors)
3. Calculating the average of these squared differences

MSE is commonly used to evaluate machine learning models, especially regression models.

CalculateMeanSquaredError(IEnumerable<T>, T)

Calculates the Mean Squared Error (MSE) between a set of values and their mean.

public static T CalculateMeanSquaredError(IEnumerable<T> values, T mean)

Parameters

values IEnumerable<T>

The collection of values to analyze.

mean T

The mean value to compare against.

Returns

T

The mean squared error.

Remarks

For Beginners: Mean Squared Error measures how far, on average, values are from their mean. It squares the differences before averaging them, which gives more weight to larger errors. This is useful for understanding how spread out your data is from its average value.

CalculateMeanSquaredLogError(Vector<T>, Vector<T>)

Calculates the Mean Squared Logarithmic Error (MSLE) between actual and predicted values.

public static T CalculateMeanSquaredLogError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values.

Returns

T

The Mean Squared Logarithmic Error.

Remarks

For Beginners: Mean Squared Logarithmic Error (MSLE) is a regression metric that measures the average squared difference between the logarithms of predicted and actual values. It's calculated by taking the logarithm of both actual and predicted values (after adding 1 to avoid issues with zeros), finding the squared differences, and averaging them. MSLE is particularly useful when you're more concerned with relative errors than absolute ones, or when the target variable spans multiple orders of magnitude. It penalizes underestimation more than overestimation, which is desirable in some applications like demand forecasting where underestimating can be more costly. The logarithmic transformation makes MSLE less sensitive to outliers compared to mean squared error (MSE), making it suitable for datasets with skewed distributions.
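
A sketch of the calculation using log(1 + value) on both sides (illustrative only):

using System;

static double Msle(double[] actual, double[] predicted)
{
    double sum = 0.0;
    for (int i = 0; i < actual.Length; i++)
    {
        // log(1 + value) avoids taking the logarithm of zero.
        double diff = Math.Log(1.0 + actual[i]) - Math.Log(1.0 + predicted[i]);
        sum += diff * diff;
    }
    return sum / actual.Length;
}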

CalculateMedian(IEnumerable<T>)

Calculates the median value from a collection of numeric values.

public static T CalculateMedian(IEnumerable<T> values)

Parameters

values IEnumerable<T>

The collection of values to calculate the median from.

Returns

T

The median value of the collection.

Remarks

For Beginners: The median is the middle value when all values are arranged in order. If there's an even number of values, it's the average of the two middle values.

For example, the median of [1, 3, 5, 7, 9] is 5, and the median of [1, 3, 5, 7] is 4 (average of 3 and 5).

CalculateMedianAbsoluteDeviation(Vector<T>)

Calculates the median absolute deviation (MAD) of a set of values.

public static T CalculateMedianAbsoluteDeviation(Vector<T> values)

Parameters

values Vector<T>

The values to analyze.

Returns

T

The median absolute deviation.

Remarks

For Beginners: The median absolute deviation (MAD) is a robust measure of variability in a dataset. It's calculated by finding the median of the absolute deviations from the data's median. First, calculate the median of the data. Then, calculate how far each value is from this median (the absolute deviations). Finally, find the median of these absolute deviations. MAD is less sensitive to outliers than the standard deviation, making it useful for datasets with extreme values. In robust statistics, MAD is often used as an alternative to standard deviation. It can be scaled (multiplied by 1.4826) to make it comparable to standard deviation for normally distributed data.
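
A sketch of the two-step median-of-absolute-deviations calculation (illustrative only):

using System;
using System.Linq;

static double MedianAbsoluteDeviation(double[] values)
{
    double median = Median(values);
    // Absolute deviation of each value from the data's median.
    double[] deviations = values.Select(v => Math.Abs(v - median)).ToArray();
    return Median(deviations);
}

static double Median(double[] values)
{
    double[] sorted = values.OrderBy(v => v).ToArray();
    int n = sorted.Length;
    return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
}

// Multiply the result by 1.4826 to make it comparable to the standard deviation for normal data.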

CalculateMedianAbsoluteError(Vector<T>, Vector<T>)

Calculates the median absolute error between actual and predicted values.

public static T CalculateMedianAbsoluteError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The median absolute error.

Remarks

For Beginners: The median absolute error (MedAE) is a measure of prediction accuracy that's more robust to outliers than the mean absolute error. It calculates the absolute difference between each actual and predicted value, then finds the median of these differences. This gives you the "typical" error in your predictions without being overly influenced by a few very large errors. Like the mean absolute error, it's in the same units as your original data, making it easy to interpret. The MedAE is particularly useful when your error distribution is skewed or contains outliers.

CalculateMedianPredictionError(Vector<T>, Vector<T>)

Calculates the median absolute prediction error between actual and predicted values.

public static T CalculateMedianPredictionError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The median absolute prediction error.

Remarks

For Beginners: The median prediction error (specifically, the median absolute error) is similar to the mean absolute error, but it uses the median instead of the mean. It calculates the absolute difference between each predicted value and its corresponding actual value, then finds the middle value (median) of these differences. This metric is less sensitive to outliers than the mean absolute error, making it useful when your data contains extreme values that might skew the average. Like the mean absolute error, it's in the same units as your original data, making it easy to interpret. Lower values indicate better predictive accuracy.

CalculateMutualInformation(Vector<T>, Vector<T>)

Calculates the mutual information between two discrete random variables.

public static T CalculateMutualInformation(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first variable.

y Vector<T>

The second variable.

Returns

T

The mutual information value.

Remarks

For Beginners: Mutual information measures how much knowing one variable reduces uncertainty about another. It's a fundamental concept in information theory that quantifies the "shared information" between two random variables. This method calculates mutual information by estimating the joint and marginal probability distributions of the variables, then computing the sum of p(x,y) * log(p(x,y)/(p(x)*p(y))) over all possible value combinations. Higher mutual information indicates stronger dependency between variables. Unlike correlation, mutual information can detect non-linear relationships. It's particularly useful in feature selection, as it helps identify variables that provide unique information about the target. This implementation treats the input vectors as discrete variables, counting occurrences to estimate probabilities.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
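
A sketch of the discrete estimate described above, counting joint and marginal frequencies (illustrative only; it treats each distinct value as its own symbol and uses natural logarithms):

using System;
using System.Collections.Generic;

static double MutualInformation(double[] x, double[] y)
{
    if (x.Length != y.Length) throw new ArgumentException("Vectors must have the same length.");
    int n = x.Length;
    var jointCounts = new Dictionary<(double, double), int>();
    var xCounts = new Dictionary<double, int>();
    var yCounts = new Dictionary<double, int>();
    for (int i = 0; i < n; i++)
    {
        // Treat each value as a discrete symbol and count occurrences.
        jointCounts[(x[i], y[i])] = jointCounts.GetValueOrDefault((x[i], y[i])) + 1;
        xCounts[x[i]] = xCounts.GetValueOrDefault(x[i]) + 1;
        yCounts[y[i]] = yCounts.GetValueOrDefault(y[i]) + 1;
    }
    double mi = 0.0;
    foreach (var ((xv, yv), count) in jointCounts)
    {
        double pxy = (double)count / n;
        double px = (double)xCounts[xv] / n;
        double py = (double)yCounts[yv] / n;
        mi += pxy * Math.Log(pxy / (px * py)); // natural log; use base 2 for bits instead of nats
    }
    return mi;
}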

CalculateNDCG(Vector<T>, Vector<T>, int)

Calculates the Normalized Discounted Cumulative Gain (NDCG) at k for a ranking task.

public static T CalculateNDCG(Vector<T> actual, Vector<T> predicted, int k)

Parameters

actual Vector<T>

The actual relevance scores.

predicted Vector<T>

The predicted scores used for ranking.

k int

The number of top items to consider.

Returns

T

The NDCG at k.

Remarks

For Beginners: Normalized Discounted Cumulative Gain (NDCG) is a metric for evaluating ranking algorithms that accounts for both the relevance of items and their positions in the ranking. It's particularly useful when items have graded relevance (not just relevant/irrelevant). This method calculates NDCG by first computing the Discounted Cumulative Gain (DCG), which sums the relevance scores of items divided by the logarithm of their position (to discount items lower in the ranking). It then normalizes this by dividing by the Ideal DCG (IDCG), which is the DCG of the perfect ranking. NDCG ranges from 0 to 1, with 1 indicating a perfect ranking. It's widely used in search engines, recommendation systems, and other ranking applications because it captures both the quality and position of relevant items.
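
A sketch of the DCG / IDCG calculation at k (illustrative only; the log2(position + 1) discount shown here is the common convention and an assumption of this sketch):

using System;
using System.Linq;

static double NdcgAtK(double[] actual, double[] predicted, int k)
{
    // DCG: sum of relevance scores discounted by log2(position + 1).
    double Dcg(double[] relevanceInRankOrder)
    {
        double dcg = 0.0;
        for (int i = 0; i < Math.Min(k, relevanceInRankOrder.Length); i++)
            dcg += relevanceInRankOrder[i] / Math.Log(i + 2, 2);
        return dcg;
    }

    // Relevance scores arranged by predicted rank, and by the ideal (best possible) rank.
    double[] byPrediction = actual.Zip(predicted, (a, p) => (a, p))
                                  .OrderByDescending(t => t.p)
                                  .Select(t => t.a)
                                  .ToArray();
    double[] ideal = actual.OrderByDescending(a => a).ToArray();

    double idcg = Dcg(ideal);
    return idcg == 0.0 ? 0.0 : Dcg(byPrediction) / idcg;
}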

CalculateNormalCDF(T, T, T)

Calculates the Cumulative Distribution Function (CDF) for a normal distribution.

public static T CalculateNormalCDF(T mean, T stdDev, T x)

Parameters

mean T

The mean of the normal distribution.

stdDev T

The standard deviation of the normal distribution.

x T

The value at which to evaluate the CDF.

Returns

T

The probability that a random variable from the normal distribution is less than or equal to x.

Remarks

For Beginners: The normal CDF tells you the probability that a random value from a normal distribution will be less than or equal to a given value (x). For example, if the CDF equals 0.95 at x=10, it means there's a 95% chance that a random value from this distribution will be 10 or less.

The normal distribution is the familiar "bell curve" shape, defined by its mean (center) and standard deviation (width/spread).

CalculateNormalPDF(T, T, T)

Calculates the probability density function (PDF) for a normal (Gaussian) distribution.

public static T CalculateNormalPDF(T mean, T stdDev, T x)

Parameters

mean T

The mean (average) of the distribution.

stdDev T

The standard deviation of the distribution.

x T

The point at which to evaluate the PDF.

Returns

T

The probability density at point x.

Remarks

For Beginners: The normal PDF tells you the relative likelihood of observing a specific value in a normal distribution. It's the familiar bell curve shape. The mean determines where the peak of the curve is located, and the standard deviation determines how wide or narrow the bell curve is. Larger standard deviations create wider, flatter curves, while smaller standard deviations create narrower, taller curves.

CalculateNormalizedMutualInformation(Vector<T>, Vector<T>)

Calculates the normalized mutual information between two variables.

public static T CalculateNormalizedMutualInformation(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first variable.

y Vector<T>

The second variable.

Returns

T

The normalized mutual information value.

Remarks

For Beginners: Normalized mutual information (NMI) scales the mutual information to a range between 0 and 1, making it easier to interpret and compare across different variable pairs. It's calculated by dividing the mutual information by the square root of the product of the entropies of the individual variables. A value of 0 means the variables are independent, while 1 indicates perfect dependency (one variable completely determines the other). NMI is particularly useful in clustering evaluation, where it measures how well cluster assignments match ground truth labels, and in feature selection, where it helps identify informative but non-redundant features. Unlike raw mutual information, NMI accounts for the different entropy levels of the variables, providing a more balanced measure of their relationship.

CalculateObservedTestStatistic(Vector<T>, Vector<T>, TestStatisticType)

Calculates a test statistic comparing actual and predicted values.

public static T CalculateObservedTestStatistic(Vector<T> actualValues, Vector<T> predictedValues, TestStatisticType testType = TestStatisticType.ChiSquare)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

testType TestStatisticType

The type of test statistic to calculate (default is ChiSquare).

Returns

T

The calculated test statistic.

Remarks

For Beginners: This method calculates a test statistic that measures the discrepancy between observed and predicted values. It supports two types of statistics: Chi-square and F-test. The Chi-square statistic sums the squared residuals divided by the predicted values, which is useful for assessing goodness of fit, especially for count data. The F-test statistic compares the variance explained by the model to the unexplained variance, which helps determine if the model is significantly better than a simpler model. Both statistics can be used in hypothesis testing to determine if the model's predictions are significantly different from what would be expected by chance. Higher values generally indicate a greater discrepancy between the model and the data.

Exceptions

ArgumentException

Thrown when an unsupported test statistic type is specified.

CalculatePValue(Vector<T>, Vector<T>, TestStatisticType)

Calculates the p-value for a statistical test comparing two groups.

public static T CalculatePValue(Vector<T> leftY, Vector<T> rightY, TestStatisticType testType)

Parameters

leftY Vector<T>

The first group of values.

rightY Vector<T>

The second group of values.

testType TestStatisticType

The type of statistical test to perform.

Returns

T

The p-value from the statistical test.

Remarks

For Beginners: A p-value tells you how likely your results could have happened by random chance. Smaller p-values (typically < 0.05) suggest that the differences between groups are statistically significant and not just due to random variation.

Different test types are appropriate for different kinds of data:
- T-Test: For comparing means when data is normally distributed
- Mann-Whitney U: For comparing distributions when data might not be normal
- Permutation Test: A flexible test that works by randomly shuffling data
- Chi-Square: For comparing categorical data
- F-Test: For comparing variances

CalculatePartialAutoCorrelationFunction(Vector<T>, int)

Calculates the partial autocorrelation function (PACF) for a time series up to a specified maximum lag.

public static Vector<T> CalculatePartialAutoCorrelationFunction(Vector<T> series, int maxLag)

Parameters

series Vector<T>

The time series data.

maxLag int

The maximum lag to calculate partial autocorrelation for.

Returns

Vector<T>

A vector containing partial autocorrelation values for lags 0 to maxLag.

Remarks

For Beginners: The partial autocorrelation function (PACF) measures the correlation between a time series and a lagged version of itself, after removing the effects of intermediate lags. While the ACF shows all correlations (direct and indirect), the PACF isolates the direct correlation at each lag. This method implements the Durbin-Levinson algorithm to calculate partial autocorrelations, which involves solving a system of equations based on the regular autocorrelations. PACF values also range from -1 to 1, with the interpretation similar to ACF. The PACF is particularly useful for identifying the order of an autoregressive (AR) model in time series analysis. A significant spike at lag k in the PACF suggests that an AR(k) model might be appropriate. Together with the ACF, the PACF provides crucial information for time series model selection and specification.

CalculatePeakDifference(Vector<T>, Vector<T>, Vector<T>, Vector<T>)

Calculates the difference between peak values in two distributions.

public static T CalculatePeakDifference(Vector<T> x1, Vector<T> y1, Vector<T> x2, Vector<T> y2)

Parameters

x1 Vector<T>

The x-coordinates of the first distribution.

y1 Vector<T>

The y-coordinates (values) of the first distribution.

x2 Vector<T>

The x-coordinates of the second distribution.

y2 Vector<T>

The y-coordinates (values) of the second distribution.

Returns

T

The absolute difference between the x-coordinates of the peak values.

Remarks

For Beginners: This method finds the difference between the locations of peak values in two distributions. For each distribution, it identifies the x-coordinate where the y-value is at its maximum (the peak), then calculates the absolute difference between these two x-coordinates. This is useful for comparing distributions to see how far apart their modes (most common values) are. For example, in spectroscopy, you might want to know how much a peak has shifted between two spectra. In statistics, this could help identify shifts in the central tendency of distributions. The method works with any paired x-y data that represents a distribution or curve.

CalculatePearsonCorrelation(Vector<T>, Vector<T>)

Calculates the Pearson correlation coefficient between two sets of values.

public static T CalculatePearsonCorrelation(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first set of values.

y Vector<T>

The second set of values.

Returns

T

The Pearson correlation coefficient, a value between -1 and 1.

Remarks

For Beginners: The Pearson correlation coefficient measures the linear relationship between two variables. It ranges from -1 to 1, where 1 means a perfect positive linear relationship (as one variable increases, the other increases proportionally), -1 means a perfect negative linear relationship (as one variable increases, the other decreases proportionally), and 0 means no linear relationship. This method calculates this coefficient using the standard formula that involves the covariance of the variables divided by the product of their standard deviations. It's useful for understanding how strongly two variables are related to each other.

CalculatePearsonCorrelationCoefficient(Vector<T>, Vector<T>)

Calculates the Pearson correlation coefficient between two vectors.

public static T CalculatePearsonCorrelationCoefficient(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first vector.

y Vector<T>

The second vector.

Returns

T

The Pearson correlation coefficient.

Remarks

For Beginners: The Pearson correlation coefficient measures the linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive linear relationship, -1 indicates a perfect negative linear relationship, and 0 indicates no linear relationship. This method calculates the coefficient by first finding the mean of each vector, then computing the sum of products of the deviations from the means, and finally dividing by the square root of the product of the sum of squared deviations. The Pearson correlation is widely used in statistics to measure how strongly two variables are related. It's sensitive to outliers and only captures linear relationships, so it might miss non-linear patterns in the data.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
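
A sketch of the deviation-based formula described above (illustrative only):

using System;
using System.Linq;

static double PearsonCorrelation(double[] x, double[] y)
{
    if (x.Length != y.Length) throw new ArgumentException("Vectors must have the same length.");
    double meanX = x.Average();
    double meanY = y.Average();
    double sumXY = 0.0, sumXX = 0.0, sumYY = 0.0;
    for (int i = 0; i < x.Length; i++)
    {
        double dx = x[i] - meanX;
        double dy = y[i] - meanY;
        sumXY += dx * dy;   // sum of products of deviations
        sumXX += dx * dx;   // sum of squared deviations of x
        sumYY += dy * dy;   // sum of squared deviations of y
    }
    return sumXY / Math.Sqrt(sumXX * sumYY);
}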

CalculatePercentileInterval(Vector<T>, T)

Calculates a percentile-based confidence interval from predicted values.

public static (T Lower, T Upper) CalculatePercentileInterval(Vector<T> predicted, T confidenceLevel)

Parameters

predicted Vector<T>

The predicted values from a model.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the percentile interval.

Remarks

For Beginners: A percentile interval is a simple way to create a confidence interval directly from your data without assuming any particular distribution. This method sorts your predicted values and then finds the values at the percentiles corresponding to the edges of your desired confidence level. For example, for a 95% confidence interval, it finds the values at the 2.5th and 97.5th percentiles. This approach is non-parametric (doesn't assume a normal distribution) and is useful when your data doesn't follow a normal distribution or when you want a straightforward interpretation of your interval.

CalculatePopulationStandardError(Vector<T>, Vector<T>)

Calculates the population standard error of the estimate.

public static T CalculatePopulationStandardError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The population standard error of the estimate.

Remarks

For Beginners: The population standard error is simply the square root of the mean squared error between actual and predicted values. Unlike the sample standard error, it doesn't adjust for the number of model parameters. It represents the standard deviation of the prediction errors and gives you an idea of how much your predictions typically deviate from the actual values, in the same units as your original data. This metric is useful when you're treating your entire dataset as the population rather than as a sample from a larger population. Lower values indicate better predictive accuracy.

CalculatePosteriorPredictiveCheck<TInput, TOutput>(ModelStats<T, TInput, TOutput>)

Calculates a posterior predictive p-value for model checking.

public static T CalculatePosteriorPredictiveCheck<TInput, TOutput>(ModelStats<T, TInput, TOutput> modelStats)

Parameters

modelStats ModelStats<T, TInput, TOutput>

The model statistics object containing necessary information.

Returns

T

The posterior predictive p-value.

Type Parameters

TInput
TOutput

Remarks

For Beginners: Posterior predictive checking is a way to assess whether a Bayesian model fits the observed data well. This method calculates a posterior predictive p-value, which is the proportion of simulated datasets (generated from the posterior distribution) that are more extreme than the observed data according to some test statistic. A p-value close to 0.5 suggests the model fits well, while values close to 0 or 1 indicate poor fit. For example, if 95% of simulated datasets have a test statistic more extreme than the observed data (p-value = 0.95), this suggests the model doesn't capture some important aspect of the data. Posterior predictive checks are valuable because they directly assess the model's ability to generate data similar to what was observed.

CalculatePosteriorPredictiveSamples(Vector<T>, Vector<T>, int, int)

Generates samples from the posterior predictive distribution for a model.

public static List<T> CalculatePosteriorPredictiveSamples(Vector<T> actual, Vector<T> predicted, int featureCount, int numSamples = 1000)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

featureCount int

The number of features (parameters) in the model.

numSamples int

The number of samples to generate (default is 1000).

Returns

List<T>

A list of samples from the posterior predictive distribution.

Remarks

For Beginners: Posterior predictive sampling generates new data that might be observed if the model is correct, accounting for both parameter uncertainty and random variation. This method estimates the error variance from the residuals, then generates samples by adding random noise to the model's predictions. Each sample is an average of n simulated values, where n is the sample size. These samples form a distribution that represents our uncertainty about future observations. Posterior predictive samples are useful for model checking (comparing the distribution of simulated data to actual data) and for making predictions with appropriate uncertainty intervals. This approach is particularly valuable in Bayesian statistics but can be used with any regression model.

CalculatePrecisionRecallAUC(Vector<T>, Vector<T>)

Calculates the area under the precision-recall curve (PR AUC).

public static T CalculatePrecisionRecallAUC(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The area under the precision-recall curve.

Remarks

For Beginners: The precision-recall AUC measures how well a model can identify positive cases without raising too many false alarms. It's particularly useful for imbalanced datasets where negative cases are much more common than positive ones. This method calculates the AUC by sorting predictions from highest to lowest, then tracking how precision and recall change as the threshold is lowered. The area is calculated using the trapezoidal rule. PR AUC ranges from 0 to 1, with higher values indicating better performance. Unlike ROC AUC, which can be misleadingly high with imbalanced data, PR AUC focuses on the positive class and provides a more realistic assessment when the positive class is rare but important to identify correctly.

Exceptions

ArgumentException

Thrown when inputs have different lengths or when there are no positive or negative samples.

CalculatePrecisionRecallF1(Vector<T>, Vector<T>, PredictionType, T?)

Calculates precision, recall, and F1 score for a set of predictions.

public static (T Precision, T Recall, T F1Score) CalculatePrecisionRecallF1(Vector<T> actual, Vector<T> predicted, PredictionType predictionType, T? threshold = default)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values.

predictionType PredictionType

The type of prediction (Binary or Regression).

threshold T

For regression, the absolute error threshold for considering a prediction "close enough" (default is 0.1). For binary/multi-label classification, the probability threshold for converting scores to labels (default is 0.5).

Returns

(T Precision, T Recall, T F1Score)

A tuple containing the precision, recall, and F1 score.

Remarks

For Beginners: This method calculates three important metrics for evaluating prediction performance. Precision measures how many of your positive predictions were actually correct (true positives ÷ (true positives + false positives)). Recall measures how many of the actual positives your model correctly identified (true positives ÷ (true positives + false negatives)). The F1 score is the harmonic mean of precision and recall, providing a single metric that balances both concerns. For binary classification, these metrics are calculated based on the standard definitions. For regression problems, the method adapts these concepts by considering predictions within a threshold of the actual value as "correct." These metrics are particularly useful when the classes are imbalanced or when false positives and false negatives have different costs.
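
A sketch of the binary-classification case (illustrative only; here both vectors are thresholded at 0.5 to obtain labels, which is an assumption of this sketch):

using System;

static (double Precision, double Recall, double F1) PrecisionRecallF1(double[] actual, double[] predicted, double threshold = 0.5)
{
    int tp = 0, fp = 0, fn = 0;
    for (int i = 0; i < actual.Length; i++)
    {
        bool actualPositive = actual[i] >= threshold;
        bool predictedPositive = predicted[i] >= threshold;
        if (predictedPositive && actualPositive) tp++;       // true positive
        else if (predictedPositive && !actualPositive) fp++; // false positive
        else if (!predictedPositive && actualPositive) fn++; // false negative
    }
    double precision = tp + fp == 0 ? 0.0 : (double)tp / (tp + fp);
    double recall = tp + fn == 0 ? 0.0 : (double)tp / (tp + fn);
    double f1 = precision + recall == 0 ? 0.0 : 2 * precision * recall / (precision + recall);
    return (precision, recall, f1);
}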

CalculatePrecisionRecallF1(Vector<T>, Vector<T>, PredictionType, T?, bool)

public static (T Precision, T Recall, T F1Score) CalculatePrecisionRecallF1(Vector<T> actual, Vector<T> predicted, PredictionType predictionType, T? threshold, bool treatDefaultAsMissing)

Parameters

actual Vector<T>
predicted Vector<T>
predictionType PredictionType
threshold T
treatDefaultAsMissing bool

Returns

(T Precision, T Recall, T F1Score)

CalculatePredictionIntervalCoverage(Vector<T>, Vector<T>, T, T)

Calculates the proportion of actual values that fall within a specified prediction interval.

public static T CalculatePredictionIntervalCoverage(Vector<T> actual, Vector<T> predicted, T lowerInterval, T upperInterval)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

lowerInterval T

The lower bound of the prediction interval.

upperInterval T

The upper bound of the prediction interval.

Returns

T

The proportion of actual values that fall within the prediction interval.

Remarks

For Beginners: This method helps you evaluate how well your prediction intervals are calibrated. It calculates the percentage of actual values that fall within the specified prediction interval. For example, if you've calculated a 95% prediction interval, ideally about 95% of the actual values should fall within this interval. If a much lower percentage falls within the interval, your model might be underestimating uncertainty. If a much higher percentage falls within the interval, your model might be overestimating uncertainty (making the intervals unnecessarily wide). This metric is useful for assessing whether your prediction intervals are appropriately sized for your data and model.

CalculatePredictionIntervals(Vector<T>, Vector<T>, T)

Calculates prediction intervals for future observations based on a model's predictions.

public static (T LowerInterval, T UpperInterval) CalculatePredictionIntervals(Vector<T> actual, Vector<T> predicted, T confidenceLevel)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T LowerInterval, T UpperInterval)

A tuple containing the lower and upper bounds of the prediction interval.

Remarks

For Beginners: Prediction intervals tell you the range where future individual observations are likely to fall, based on your model's predictions. Unlike confidence intervals (which estimate where the true mean is), prediction intervals account for both the uncertainty in estimating the mean and the natural variability in individual observations. This method calculates these intervals by first determining the mean squared error (MSE) between actual and predicted values, then using this to estimate the standard error. It then applies a t-value (based on the confidence level) to calculate the margin of error around the mean prediction. The resulting interval gives you a range where you can expect future observations to fall with the specified level of confidence.

CalculateQuantile(T[], T)

Calculates a specific quantile from sorted data.

public static T CalculateQuantile(T[] sortedData, T quantile)

Parameters

sortedData T[]

The data array, already sorted in ascending order.

quantile T

The quantile to calculate (between 0 and 1).

Returns

T

The value at the specified quantile.

Remarks

For Beginners: A quantile divides your sorted data into equal portions. For example, the 0.25 quantile (also called the 25th percentile or first quartile) is the value below which 25% of your data falls. This method calculates any quantile between 0 and 1 using linear interpolation, which means it estimates values between actual data points when necessary. The method requires that your data is already sorted in ascending order (smallest to largest). Common quantiles include 0.25 (first quartile), 0.5 (median), and 0.75 (third quartile).
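
A sketch of quantile lookup with linear interpolation on sorted data (illustrative only; the position formula quantile * (n - 1) is one common convention and an assumption of this sketch):

using System;

static double Quantile(double[] sortedData, double quantile)
{
    // Position of the quantile within the sorted array (0-based, may fall between indices).
    double position = quantile * (sortedData.Length - 1);
    int lower = (int)Math.Floor(position);
    int upper = (int)Math.Ceiling(position);
    if (lower == upper) return sortedData[lower];
    // Linear interpolation between the two surrounding data points.
    double fraction = position - lower;
    return sortedData[lower] + fraction * (sortedData[upper] - sortedData[lower]);
}

// Example: Quantile(new[] { 1.0, 2.0, 3.0, 4.0 }, 0.5) returns 2.5 (the median of four values).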

CalculateQuantileIntervals(Vector<T>, Vector<T>, T[])

Calculates confidence intervals around specified quantiles of the predicted values.

public static List<(T Quantile, T Lower, T Upper)> CalculateQuantileIntervals(Vector<T> actual, Vector<T> predicted, T[] quantiles)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

quantiles T[]

An array of quantiles to calculate intervals for.

Returns

List<(T Quantile, T Lower, T Upper)>

A list of tuples containing the quantile and its corresponding lower and upper interval bounds.

Remarks

For Beginners: This method calculates confidence intervals around specific quantiles (percentiles) of your predicted values. For example, you might want to know the range around the median (50th percentile) or the 90th percentile of your predictions. For each quantile you specify, this method calculates a lower and upper bound by looking at nearby quantiles (±2.5%). This gives you an idea of the uncertainty around different parts of your prediction distribution. These intervals are useful when you're interested in specific parts of the distribution rather than just the mean or a single prediction.

CalculateQuantiles(Vector<T>)

Calculates the first and third quartiles (25th and 75th percentiles) of a data set.

public static (T FirstQuantile, T ThirdQuantile) CalculateQuantiles(Vector<T> data)

Parameters

data Vector<T>

The data vector.

Returns

(T FirstQuantile, T ThirdQuantile)

A tuple containing the first quartile (Q1) and third quartile (Q3).

Remarks

For Beginners: Quartiles divide your data into four equal parts after sorting it from smallest to largest. The first quartile (Q1) is the value below which 25% of the data falls, and the third quartile (Q3) is the value below which 75% of the data falls. These values are useful for understanding the spread of your data and identifying potential outliers. The difference between Q3 and Q1 is called the interquartile range (IQR) and is a robust measure of variability. This method sorts your data and then uses linear interpolation to estimate the quartile values.

CalculateR2(Vector<T>, Vector<T>)

Calculates the coefficient of determination (R²) between actual and predicted values.

public static T CalculateR2(Vector<T> actualValues, Vector<T> predictedValues)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

Returns

T

The R² value, ranging from 0 to 1 (or negative in case of poor fit).

Remarks

For Beginners: R² (R-squared) tells you how well your model explains the variation in your data. It ranges from 0 to 1, where:
- 1 means your model perfectly predicts the data
- 0 means your model is no better than just using the average value
- Negative values can occur when the model performs worse than using the average

For example, an R² of 0.75 means your model explains 75% of the variation in the data.
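
A sketch of the standard R² = 1 - SSE/SST computation (illustrative only):

using System;
using System.Linq;

static double RSquared(double[] actual, double[] predicted)
{
    double meanActual = actual.Average();
    double sse = 0.0; // residual sum of squares (unexplained variation)
    double sst = 0.0; // total sum of squares (variation around the mean)
    for (int i = 0; i < actual.Length; i++)
    {
        sse += (actual[i] - predicted[i]) * (actual[i] - predicted[i]);
        sst += (actual[i] - meanActual) * (actual[i] - meanActual);
    }
    return 1.0 - sse / sst;
}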

CalculateROCAUC(Vector<T>, Vector<T>)

Calculates the Area Under the Receiver Operating Characteristic Curve (ROC AUC).

public static T CalculateROCAUC(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The ROC AUC value.

Remarks

For Beginners: The ROC AUC (Area Under the Curve) is a performance measurement for classification problems that tells how well a model can distinguish between classes. It ranges from 0 to 1, where 1 means perfect classification, 0.5 means the model is no better than random guessing, and values below 0.5 indicate worse-than-random performance. This method calculates the ROC curve (plotting true positive rate against false positive rate at various thresholds) and then computes the area under this curve. ROC AUC is particularly useful when you need a single metric to compare models and when the classes are somewhat balanced. It's threshold-invariant, meaning it measures the model's ability to rank positive instances higher than negative ones, regardless of the specific threshold used.

CalculateROCCurve(Vector<T>, Vector<T>)

Calculates the Receiver Operating Characteristic (ROC) curve for a set of predictions.

public static (Vector<T> fpr, Vector<T> tpr) CalculateROCCurve(Vector<T> actualValues, Vector<T> predictedValues)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

Returns

(Vector<T> fpr, Vector<T> tpr)

A tuple containing vectors of false positive rates and true positive rates.

Remarks

For Beginners: The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1-specificity) at various threshold settings. This method calculates these rates for each unique threshold in the predicted values. The resulting curve shows the tradeoff between catching true positives and avoiding false positives. A perfect classifier would reach the top-left corner (100% sensitivity, 0% false positives), while a random classifier would follow the diagonal line. The area under this curve (ROC AUC) is a common metric for classification performance. ROC curves are particularly useful when you need to balance sensitivity and specificity, or when the optimal threshold isn't known in advance.

CalculateReferenceModelMarginalLikelihood(Vector<T>)

Calculates the marginal likelihood for a reference model (intercept-only model).

public static T CalculateReferenceModelMarginalLikelihood(Vector<T> actual)

Parameters

actual Vector<T>

The actual observed values.

Returns

T

The marginal likelihood of the reference model.

Remarks

For Beginners: This method calculates the marginal likelihood for a simple reference model that predicts the mean of the data for all observations (an intercept-only model). It first calculates the mean and variance of the actual values, then computes the log marginal likelihood using a formula based on the normal distribution. The result is exponentiated to get the marginal likelihood. This reference model serves as a baseline for comparison in Bayes factor calculations. By comparing the marginal likelihood of a more complex model to this reference model, you can assess whether the additional complexity is justified by improved fit to the data. A Bayes factor greater than 1 indicates evidence in favor of the more complex model.

CalculateResidualSumOfSquares(Vector<T>, Vector<T>)

Calculates the residual sum of squares (SSE) between actual and predicted values.

public static T CalculateResidualSumOfSquares(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The residual sum of squares.

Remarks

For Beginners: The residual sum of squares (SSE) measures how much variation in the data remains unexplained by a model. It's calculated by summing the squared differences between each actual value and its corresponding predicted value. Lower SSE values indicate a better fit to the data. SSE is used in many statistical calculations, including mean squared error (MSE = SSE/n), R-squared (1 - SSE/SST), and F-tests for model comparison. It's also used in calculating standard errors and confidence intervals. The SSE is particularly important because it penalizes larger errors more heavily than smaller ones, making it sensitive to outliers and large prediction errors.

CalculateResiduals(Vector<T>, Vector<T>)

Calculates the residuals (errors) between actual and predicted values.

public static Vector<T> CalculateResiduals(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

Returns

Vector<T>

A vector of residuals.

Remarks

For Beginners: Residuals are the differences between observed values and the values predicted by a model. They represent the part of the data that the model doesn't explain. This method simply subtracts each predicted value from its corresponding actual value. Analyzing residuals is a crucial step in assessing model fit. In a good model, residuals should be randomly distributed around zero with no obvious patterns. Patterns in residuals (like trends, curves, or changing variance) can indicate that the model is missing important structure in the data. Residuals are used in many diagnostic plots and tests, including residual plots, Q-Q plots, and tests for autocorrelation like the Durbin-Watson test.

CalculateRootMeanSquaredError(Vector<T>, Vector<T>)

Calculates the Root Mean Squared Error (RMSE) between actual and predicted values.

public static T CalculateRootMeanSquaredError(Vector<T> actualValues, Vector<T> predictedValues)

Parameters

actualValues Vector<T>

The actual observed values.

predictedValues Vector<T>

The predicted values from a model.

Returns

T

The root mean squared error.

Remarks

For Beginners: Root Mean Squared Error is a way to measure how well a model's predictions match the actual values. It's the square root of the average of squared differences between predictions and actual values. Lower RMSE values indicate better model performance. The units of RMSE match the units of your original data, making it easier to interpret than MSE.
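A minimal standalone sketch of the RMSE formula described above, using plain arrays rather than Vector<T>:

using System;

double[] actual    = { 3.0, 5.0, 7.0, 9.0 };
double[] predicted = { 2.0, 5.0, 8.0, 9.0 };

double sumSquaredError = 0.0;
for (int i = 0; i < actual.Length; i++)
{
    double error = actual[i] - predicted[i];
    sumSquaredError += error * error;
}

double mse  = sumSquaredError / actual.Length;   // mean squared error: 0.5
double rmse = Math.Sqrt(mse);                    // root mean squared error: ~0.707
Console.WriteLine(rmse);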

CalculateSampleStandardError(Vector<T>, Vector<T>, int)

Calculates the sample standard error of the estimate, adjusted for the number of model parameters.

public static T CalculateSampleStandardError(Vector<T> actual, Vector<T> predicted, int numberOfParameters)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

numberOfParameters int

The number of parameters in the model.

Returns

T

The sample standard error of the estimate.

Remarks

For Beginners: The sample standard error of the estimate measures the accuracy of predictions made by a regression model, adjusted for the complexity of the model. It's calculated by taking the square root of the sum of squared errors divided by the degrees of freedom (sample size minus the number of parameters). This adjustment accounts for the fact that models with more parameters tend to fit the training data better by chance. The standard error gives you an idea of how much your predictions typically deviate from the actual values, in the same units as your original data. Lower values indicate better predictive accuracy.
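The standard formula described above, sqrt(SSE / (n - p)), can be sketched as follows with plain arrays; this is an illustration of the idea, not the library's code:

using System;

double[] actual    = { 2.0, 4.0, 6.0, 8.0, 10.0 };
double[] predicted = { 2.5, 3.5, 6.5, 7.5, 10.0 };
int numberOfParameters = 2;                                  // e.g., slope and intercept

double sse = 0.0;
for (int i = 0; i < actual.Length; i++)
{
    double error = actual[i] - predicted[i];
    sse += error * error;
}

int degreesOfFreedom = actual.Length - numberOfParameters;   // 5 - 2 = 3
double standardError = Math.Sqrt(sse / degreesOfFreedom);    // sqrt(1.0 / 3) ~ 0.577
Console.WriteLine(standardError);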

CalculateSilhouetteScore(Matrix<T>, Vector<T>)

Calculates the silhouette score for a clustering result.

public static T CalculateSilhouetteScore(Matrix<T> data, Vector<T> labels)

Parameters

data Matrix<T>

The data matrix where each row is an observation.

labels Vector<T>

The cluster labels for each observation.

Returns

T

The average silhouette score across all observations.

Remarks

For Beginners: The silhouette score measures how well each object fits within its assigned cluster compared to other clusters. For each point, it calculates (b-a)/max(a,b), where a is the average distance to other points in the same cluster, and b is the average distance to points in the nearest different cluster. The score ranges from -1 to 1, where higher values indicate better clustering. A score near 1 means points are well-matched to their clusters and far from neighboring clusters. A score near 0 indicates overlapping clusters, while negative scores suggest points might be assigned to the wrong clusters. This method calculates the silhouette for each point and returns the average across all points. It's a valuable tool for evaluating clustering quality and comparing different clustering algorithms or parameter settings.

CalculateSimultaneousPredictionInterval(Vector<T>, Vector<T>, T)

Calculates a simultaneous prediction interval for multiple future observations.

public static (T Lower, T Upper) CalculateSimultaneousPredictionInterval(Vector<T> actual, Vector<T> predicted, T confidenceLevel)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the simultaneous prediction interval.

Remarks

For Beginners: While a regular prediction interval gives you a range where a single future observation is likely to fall, a simultaneous prediction interval gives you a range where multiple future observations are all likely to fall at the same time. This is important when you're making multiple predictions and want to ensure that all of them (not just each one individually) are within the interval with a certain probability. This method calculates such intervals by adjusting the margin of error to account for multiple comparisons. The resulting interval is wider than a regular prediction interval but provides stronger guarantees for multiple predictions.

CalculateSkewnessAndKurtosis(Vector<T>, T, T, int)

Calculates the skewness and kurtosis of a sample.

public static (T skewness, T kurtosis) CalculateSkewnessAndKurtosis(Vector<T> sample, T mean, T stdDev, int n)

Parameters

sample Vector<T>

The sample data.

mean T

The pre-calculated mean of the sample.

stdDev T

The pre-calculated standard deviation of the sample.

n int

The sample size.

Returns

(T skewness, T kurtosis)

A tuple containing the skewness and kurtosis values.

Remarks

For Beginners: Skewness and kurtosis are measures that describe the shape of your data's distribution. Skewness measures the asymmetry - a positive skew means the distribution has a longer tail on the right side, while a negative skew means a longer tail on the left. A skewness of zero indicates a symmetric distribution. Kurtosis measures how "heavy" the tails are compared to a normal distribution. Higher kurtosis means more of the variance comes from infrequent extreme deviations, while lower kurtosis indicates more frequent moderate deviations. This method calculates both measures in a single pass through the data, using pre-calculated mean and standard deviation values for efficiency.
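The following standalone sketch computes sample skewness and kurtosis as the average third and fourth standardized moments, which is one common convention; the library may apply bias corrections or report excess kurtosis, so treat this only as an illustration of the idea:

using System;
using System.Linq;

double[] sample = { 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 };
int n = sample.Length;
double mean = sample.Average();                                             // 5.0
double stdDev = Math.Sqrt(sample.Sum(v => (v - mean) * (v - mean)) / n);    // population standard deviation: 2.0

double skewness = sample.Sum(v => Math.Pow((v - mean) / stdDev, 3)) / n;    // asymmetry of the distribution
double kurtosis = sample.Sum(v => Math.Pow((v - mean) / stdDev, 4)) / n;    // tail heaviness (3 for a normal distribution)
Console.WriteLine($"skewness = {skewness:F3}, kurtosis = {kurtosis:F3}");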

CalculateSpearmanRankCorrelationCoefficient(Vector<T>, Vector<T>)

Calculates the Spearman rank correlation coefficient between two vectors.

public static T CalculateSpearmanRankCorrelationCoefficient(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first vector.

y Vector<T>

The second vector.

Returns

T

The Spearman rank correlation coefficient.

Remarks

For Beginners: The Spearman rank correlation measures the monotonic relationship between two variables, which means it detects whether one variable tends to increase or decrease as the other increases, regardless of whether the relationship is linear. This method calculates the coefficient by first converting the values to ranks, then applying the Pearson correlation formula to these ranks. The result ranges from -1 to 1, with the same interpretation as Pearson correlation. Spearman correlation is more robust to outliers and can detect non-linear relationships as long as they're monotonic. It's particularly useful when the data doesn't meet the assumptions required for Pearson correlation, such as normality or linear relationship.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
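To make the description concrete, here is a standalone sketch that converts both variables to ranks and applies the Pearson correlation formula to those ranks. Ties are not handled, so it only illustrates the idea and is not the library's implementation:

using System;
using System.Linq;

double[] x = { 10.0, 20.0, 30.0, 40.0, 50.0 };
double[] y = { 1.0, 4.0, 9.0, 16.0, 25.0 };     // monotonic but non-linear in x

// Rank each value (1 = smallest); ties are ignored in this sketch.
double[] Rank(double[] values) =>
    values.Select(v => (double)(Array.IndexOf(values.OrderBy(u => u).ToArray(), v) + 1)).ToArray();

double[] rx = Rank(x);
double[] ry = Rank(y);

double meanX = rx.Average(), meanY = ry.Average();
double cov = rx.Zip(ry, (a, b) => (a - meanX) * (b - meanY)).Sum();
double sx  = Math.Sqrt(rx.Sum(a => (a - meanX) * (a - meanX)));
double sy  = Math.Sqrt(ry.Sum(b => (b - meanY) * (b - meanY)));

double spearman = cov / (sx * sy);              // 1.0: a perfectly monotonic relationship
Console.WriteLine(spearman);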

CalculateSplitScore(Vector<T>, List<int>, List<int>, SplitCriterion)

Calculates a score for a data split based on the specified criterion.

public static T CalculateSplitScore(Vector<T> y, List<int> leftIndices, List<int> rightIndices, SplitCriterion splitCriterion)

Parameters

y Vector<T>

The target values vector.

leftIndices List<int>

Indices of values in the left group after splitting.

rightIndices List<int>

Indices of values in the right group after splitting.

splitCriterion SplitCriterion

The criterion to use for evaluating the split quality.

Returns

T

A score indicating the quality of the split (higher is better).

Remarks

For Beginners: This method helps determine how good a split is when dividing data into two groups. Different criteria measure quality differently - some look at how similar items are within each group, others at how well the split helps with predictions.

Think of it like sorting fruits: if you split apples and oranges perfectly, you'd get a high score because each group is very "pure" (contains only one type of fruit).

CalculateStandardDeviation(IEnumerable<T>)

Calculates the standard deviation of a collection of values.

public static T CalculateStandardDeviation(IEnumerable<T> values)

Parameters

values IEnumerable<T>

The collection of values to calculate standard deviation for.

Returns

T

The standard deviation of the values.

Remarks

For Beginners: Standard deviation is the square root of variance. It measures how spread out your data is, in the same units as your original data (unlike variance, which is in squared units).

A low standard deviation means data points tend to be close to the mean, while a high standard deviation means data points are spread out over a wider range.

CalculateStudentPDF(T, T, T, int)

Calculates the probability density function (PDF) value for a Student's t-distribution.

public static T CalculateStudentPDF(T x, T mean, T stdDev, int df)

Parameters

x T

The value at which to evaluate the PDF.

mean T

The mean (location parameter) of the distribution.

stdDev T

The standard deviation (scale parameter) of the distribution.

df int

The degrees of freedom parameter.

Returns

T

The PDF value at the specified point.

Remarks

For Beginners: The Student's t-distribution PDF function calculates the height of the probability curve at a specific point. The Student's t-distribution looks similar to the normal distribution but has heavier tails, making it more appropriate when working with small sample sizes. The mean parameter specifies the center of the distribution, the standard deviation controls how spread out it is, and the degrees of freedom parameter affects the shape - smaller values give heavier tails. This distribution is commonly used in hypothesis testing when the population standard deviation is unknown.

CalculateSymmetricMeanAbsolutePercentageError(Vector<T>, Vector<T>)

Calculates the Symmetric Mean Absolute Percentage Error (SMAPE) between actual and predicted values.

public static T CalculateSymmetricMeanAbsolutePercentageError(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual observed values.

predicted Vector<T>

The predicted values from a model.

Returns

T

The SMAPE value as a percentage.

Remarks

For Beginners: The Symmetric Mean Absolute Percentage Error (SMAPE) measures the accuracy of predictions as a percentage, but unlike the standard MAPE, it treats over-predictions and under-predictions symmetrically. It's calculated as 200% * average(|actual - predicted| / (|actual| + |predicted|)). The result ranges from 0% (perfect predictions) to 200% (worst possible predictions). SMAPE is particularly useful when the data contains zeros or very small values, which can cause standard percentage errors to explode. It's also more balanced than MAPE, which penalizes over-predictions more heavily than under-predictions. This method handles zero denominators by skipping those points, ensuring the calculation remains valid even with zero values.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
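A standalone sketch of the SMAPE formula quoted above, skipping points where both values are zero; this is an illustration only, not the library's code:

using System;

double[] actual    = { 100.0, 200.0, 0.0, 50.0 };
double[] predicted = { 110.0, 180.0, 0.0, 60.0 };

double sum = 0.0;
int counted = 0;
for (int i = 0; i < actual.Length; i++)
{
    double denominator = Math.Abs(actual[i]) + Math.Abs(predicted[i]);
    if (denominator == 0) continue;                            // skip points where both values are zero
    sum += Math.Abs(actual[i] - predicted[i]) / denominator;
    counted++;
}

double smape = 200.0 * sum / counted;                          // a percentage between 0% and 200% (~12.7% here)
Console.WriteLine(smape);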

CalculateTValue(int, T)

Calculates the t-value for a given degrees of freedom and confidence level.

public static T CalculateTValue(int degreesOfFreedom, T confidenceLevel)

Parameters

degreesOfFreedom int

The degrees of freedom.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

T

The t-value corresponding to the given confidence level and degrees of freedom.

Remarks

For Beginners: The t-value is a critical value used in statistical hypothesis testing and confidence interval calculations. It represents how many standard errors away from the mean you need to go to capture a certain percentage of the data. This method calculates the t-value based on the Student's t-distribution for a given confidence level and degrees of freedom. For example, with a 95% confidence level, the t-value tells you how far from the mean you need to go to include 95% of the data in your interval. Higher confidence levels result in larger t-values, meaning wider intervals.

CalculateTheilUStatistic(Vector<T>, Vector<T>)

Calculates Theil's U statistic, a measure of forecast accuracy.

public static T CalculateTheilUStatistic(Vector<T> actual, Vector<T> predicted)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

Returns

T

Theil's U statistic.

Remarks

For Beginners: Theil's U statistic is a measure of forecast accuracy. As described here, it's calculated by dividing the root mean squared error of your predictions by the sum of the root mean square of the actual series and the root mean square of the predicted series. With this formulation the statistic is bounded between 0 and 1: a value of 0 indicates perfect forecasts, and values closer to 1 indicate increasingly poor forecasts. Note that a second common formulation (often called U2) instead compares your model against a naive no-change forecast, where values below 1 mean your model outperforms the naive forecast and values above 1 mean it performs worse. In either form, the metric is particularly useful in time series forecasting to evaluate whether your model adds value beyond simple forecasting methods.

CalculateToleranceInterval(Vector<T>, Vector<T>, T)

Calculates a tolerance interval for a set of predicted values.

public static (T Lower, T Upper) CalculateToleranceInterval(Vector<T> actual, Vector<T> predicted, T confidenceLevel)

Parameters

actual Vector<T>

The actual values.

predicted Vector<T>

The predicted values from a model.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T Lower, T Upper)

A tuple containing the lower and upper bounds of the tolerance interval.

Remarks

For Beginners: A tolerance interval is a range that is expected to contain a specified proportion of a population with a certain confidence level. Unlike confidence intervals (which estimate where the true mean is) or prediction intervals (which predict where a single future observation will fall), tolerance intervals aim to capture a specified proportion of the entire population. This method calculates tolerance intervals for predicted values, taking into account both the variability in the data and the sample size. The resulting interval gives you a range where you can expect a certain percentage of all future values to fall, with the specified level of confidence.

CalculateTotalSumOfSquares(Vector<T>)

Calculates the total sum of squares (SST) for a set of values.

public static T CalculateTotalSumOfSquares(Vector<T> values)

Parameters

values Vector<T>

The values to analyze.

Returns

T

The total sum of squares.

Remarks

For Beginners: The total sum of squares (SST) measures the total variation in a dataset. It's calculated by summing the squared differences between each value and the mean of all values. SST represents how much the data points vary from their average, regardless of any model. In regression analysis, SST is the total variance that a model attempts to explain. It can be partitioned into the explained sum of squares (SSR, the variation explained by the model) and the residual sum of squares (SSE, the unexplained variation). The ratio SSR/SST gives the coefficient of determination (R²), which indicates the proportion of variance explained by the model.
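The relationship between SST, SSE, and R² described above can be sketched as follows; this is a standalone illustration with plain arrays, not the library's code:

using System;
using System.Linq;

double[] actual    = { 2.0, 4.0, 6.0, 8.0 };
double[] predicted = { 2.5, 3.5, 6.5, 7.5 };

double mean = actual.Average();                                            // 5.0
double sst  = actual.Sum(v => (v - mean) * (v - mean));                    // total variation: 20.0
double sse  = actual.Zip(predicted, (a, p) => (a - p) * (a - p)).Sum();    // unexplained variation: 1.0

double rSquared = 1.0 - sse / sst;                                         // proportion of variance explained: 0.95
Console.WriteLine(rSquared);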

CalculateVIF(Matrix<T>, ModelStatsOptions)

Calculates the Variance Inflation Factor (VIF) for each feature based on a correlation matrix.

public static List<T> CalculateVIF(Matrix<T> correlationMatrix, ModelStatsOptions options)

Parameters

correlationMatrix Matrix<T>

The correlation matrix between features.

options ModelStatsOptions

Options for model statistics calculations, including maximum VIF threshold.

Returns

List<T>

A list of VIF values, one for each feature.

Remarks

For Beginners: The Variance Inflation Factor (VIF) measures how much the variance of a regression coefficient is increased due to multicollinearity (correlation between features). For each feature, the VIF is calculated by regressing that feature against all other features and then using the formula 1/(1-R²), where R² is the coefficient of determination from that regression. A VIF of 1 means there's no correlation between this feature and others, while higher values indicate increasing multicollinearity. As a rule of thumb, VIF values above 5-10 are considered problematic. This method calculates VIF for each feature and logs a warning when it detects values above the threshold specified in the options. High VIF values suggest you might want to remove or combine some features to reduce multicollinearity.

CalculateVariance(Vector<T>, T)

Calculates the variance of a vector of values from a given mean.

public static T CalculateVariance(Vector<T> values, T mean)

Parameters

values Vector<T>

The vector of values to calculate variance for.

mean T

The mean value to calculate deviations from.

Returns

T

The variance of the values.

Remarks

For Beginners: Variance measures how spread out your data is from the average (mean). Higher variance means data points are more scattered; lower variance means they're closer together.

It's calculated by:
1. Finding the difference between each value and the mean
2. Squaring each difference
3. Calculating the average of these squared differences
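The three steps above in a minimal standalone sketch (this computes the population variance; a sample-variance convention would divide by n - 1 instead of n):

using System;
using System.Linq;

double[] values = { 2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0 };
double mean = values.Average();                                    // the mean is 5.0

double variance = values
    .Select(v => v - mean)                                         // step 1: difference from the mean
    .Select(d => d * d)                                            // step 2: square each difference
    .Average();                                                    // step 3: average the squared differences

Console.WriteLine(variance);                                       // 4.0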

CalculateVariance(IEnumerable<T>)

Calculates the variance of a collection of values.

public static T CalculateVariance(IEnumerable<T> values)

Parameters

values IEnumerable<T>

The collection of values to calculate variance for.

Returns

T

The variance of the values.

Remarks

For Beginners: This method calculates variance without requiring you to provide the mean. It first calculates the mean internally, then computes the variance.

Variance is zero when all values are identical, and increases as values become more spread out.

CalculateVarianceReduction(Vector<T>, List<int>, List<int>)

Calculates the variance reduction achieved by splitting data into left and right groups.

public static T CalculateVarianceReduction(Vector<T> y, List<int> leftIndices, List<int> rightIndices)

Parameters

y Vector<T>

The target values vector.

leftIndices List<int>

Indices of values in the left group.

rightIndices List<int>

Indices of values in the right group.

Returns

T

The variance reduction achieved by the split.

Remarks

For Beginners: Variance reduction measures how much a split improves the "purity" of data. It's used in decision trees to find the best way to split data into groups.

Higher variance reduction means the split creates more homogeneous (similar) groups, which is desirable when building decision trees.
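One common way to compute variance reduction is to subtract the weighted average of the child-group variances from the parent variance. The sketch below illustrates that idea with plain arrays and is not necessarily the exact formulation used by this method:

using System;
using System.Collections.Generic;
using System.Linq;

double[] y = { 1.0, 1.2, 0.8, 9.0, 9.5, 8.5 };
var leftIndices  = new List<int> { 0, 1, 2 };    // low-valued group
var rightIndices = new List<int> { 3, 4, 5 };    // high-valued group

double Variance(IEnumerable<double> values)
{
    double mean = values.Average();
    return values.Average(v => (v - mean) * (v - mean));
}

double parentVariance = Variance(y);
double leftVariance   = Variance(leftIndices.Select(i => y[i]));
double rightVariance  = Variance(rightIndices.Select(i => y[i]));

// Weight each child's variance by the fraction of samples it contains.
double weighted  = (leftIndices.Count * leftVariance + rightIndices.Count * rightVariance) / y.Length;
double reduction = parentVariance - weighted;    // large here: the split separates the two clusters cleanly
Console.WriteLine(reduction);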

CalculateVariationOfInformation(Vector<T>, Vector<T>)

Calculates the variation of information (also known as shared information distance) between two variables.

public static T CalculateVariationOfInformation(Vector<T> x, Vector<T> y)

Parameters

x Vector<T>

The first variable.

y Vector<T>

The second variable.

Returns

T

The variation of information value.

Remarks

For Beginners: The variation of information (VI) is a measure of the distance between two clusterings or partitions of the same data. It's calculated as the sum of the entropies of the variables minus twice their mutual information: VI(X,Y) = H(X) + H(Y) - 2*MI(X,Y). Lower values indicate more similar distributions, with 0 meaning identical distributions. VI satisfies the properties of a true metric (non-negativity, symmetry, and triangle inequality), making it useful for comparing different clusterings of the same data. It's particularly valuable in clustering evaluation and consensus clustering, where you need to measure the distance between different partitions of the data. Unlike some other measures, VI penalizes both splitting and merging of clusters equally.

CalculateWAIC<TInput, TOutput>(ModelStats<T, TInput, TOutput>)

Calculates the Widely Applicable Information Criterion (WAIC) for Bayesian model comparison.

public static T CalculateWAIC<TInput, TOutput>(ModelStats<T, TInput, TOutput> modelStats)

Parameters

modelStats ModelStats<T, TInput, TOutput>

The model statistics object containing necessary information.

Returns

T

The WAIC value.

Type Parameters

TInput
TOutput

Remarks

For Beginners: The Widely Applicable Information Criterion (WAIC) is a fully Bayesian approach to estimating the out-of-sample expectation. It's calculated as -2 * (lppd - pWAIC), where lppd is the log pointwise predictive density (a measure of how well the model fits the data), and pWAIC is the effective number of parameters (a measure of model complexity). Lower WAIC values indicate better models. WAIC is considered an improvement over DIC because it's fully Bayesian, uses the entire posterior distribution, and is invariant to parameterization. It's particularly useful for hierarchical models and models with many parameters. Like other information criteria, WAIC helps you select the model that best balances fit and complexity.

CalculateWeibullConfidenceIntervals(Vector<T>, T)

Calculates confidence intervals for Weibull distribution parameters using bootstrap resampling.

public static (T LowerBound, T UpperBound) CalculateWeibullConfidenceIntervals(Vector<T> values, T confidenceLevel)

Parameters

values Vector<T>

The sample data.

confidenceLevel T

The confidence level (e.g., 0.95 for 95% confidence).

Returns

(T LowerBound, T UpperBound)

A tuple containing the lower and upper bounds of the confidence interval.

Remarks

For Beginners: This method estimates confidence intervals for Weibull distribution parameters using a technique called bootstrap resampling. Bootstrapping works by creating many new samples by randomly selecting values from your original data (with replacement), then calculating the parameters for each of these samples. This gives you a distribution of possible parameter values, from which you can determine confidence intervals. The Weibull distribution is commonly used to model things like failure rates and lifetimes of components.

CalculateWeibullCredibleIntervals(Vector<T>, T, T)

Calculates the credible intervals for a Weibull distribution.

public static (T LowerBound, T UpperBound) CalculateWeibullCredibleIntervals(Vector<T> sample, T lowerProbability, T upperProbability)

Parameters

sample Vector<T>

The sample data.

lowerProbability T

The lower probability bound (e.g., 0.025 for a 95% interval).

upperProbability T

The upper probability bound (e.g., 0.975 for a 95% interval).

Returns

(T LowerBound, T UpperBound)

A tuple containing the lower and upper bounds of the credible interval.

Remarks

For Beginners: Credible intervals are the Bayesian equivalent of confidence intervals. They give you a range of values where you can be reasonably confident the true parameter lies. For example, a 95% credible interval means there's a 95% probability that the true value falls within that range, based on your observed data and the Weibull distribution assumption.

CalculateWeibullPDF(T, T, T)

Calculates the probability density function (PDF) value for a Weibull distribution.

public static T CalculateWeibullPDF(T k, T lambda, T x)

Parameters

k T

The shape parameter of the Weibull distribution.

lambda T

The scale parameter of the Weibull distribution.

x T

The value at which to evaluate the PDF.

Returns

T

The PDF value at the specified point.

Remarks

For Beginners: The Weibull PDF function calculates the height of the probability curve at a specific point for a Weibull distribution. The Weibull distribution is versatile and can model many different shapes depending on its parameters. It's commonly used in reliability engineering to model failure rates. The shape parameter (k) determines the overall shape of the distribution - values less than 1 give a decreasing failure rate, equal to 1 gives a constant failure rate (equivalent to an exponential distribution), and greater than 1 gives an increasing failure rate. The scale parameter (lambda) stretches or compresses the distribution.
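For reference, the standard Weibull density is f(x) = (k/lambda) * (x/lambda)^(k-1) * exp(-(x/lambda)^k) for x >= 0. Here is a minimal standalone sketch of that formula (an illustration, not the library's implementation):

using System;

double WeibullPdf(double k, double lambda, double x)
{
    if (x < 0) return 0.0;                                   // the Weibull density is zero for negative values
    double z = x / lambda;
    return (k / lambda) * Math.Pow(z, k - 1) * Math.Exp(-Math.Pow(z, k));
}

// Shape k = 1 reduces to an exponential distribution with rate 1/lambda.
Console.WriteLine(WeibullPdf(1.0, 2.0, 1.0));                // 0.5 * exp(-0.5) ~ 0.303
Console.WriteLine(WeibullPdf(2.0, 2.0, 1.0));                // increasing-failure-rate example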

ChiSquareTest(Vector<T>, Vector<T>, T?)

Performs a Chi-Square test to determine if there is a significant association between two categorical variables.

public static ChiSquareTestResult<T> ChiSquareTest(Vector<T> leftY, Vector<T> rightY, T? significanceLevel = default)

Parameters

leftY Vector<T>

The first group of categorical values.

rightY Vector<T>

The second group of categorical values.

significanceLevel T

The threshold p-value to determine statistical significance (default is 0.05).

Returns

ChiSquareTestResult<T>

A result object containing the test statistics and conclusion.

Remarks

For Beginners: The Chi-Square test helps determine if there's a relationship between two categorical variables (variables that have distinct categories rather than continuous values). It compares the observed frequencies in your data with what would be expected if there was no relationship.

For example, if you want to know if preference for ice cream flavors differs between children and adults, the Chi-Square test can tell you if any observed differences are statistically significant or just due to chance.

CosineSimilarity(Vector<T>, Vector<T>)

Calculates the cosine similarity between two vectors.

public static T CosineSimilarity(Vector<T> v1, Vector<T> v2)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

Returns

T

The cosine similarity.

Remarks

For Beginners: Cosine similarity measures the cosine of the angle between two vectors, indicating how similar their orientations are, regardless of their magnitudes. It ranges from -1 (exactly opposite) through 0 (orthogonal or unrelated) to 1 (exactly the same direction). This method calculates it by dividing the dot product of the vectors by the product of their magnitudes. Cosine similarity is particularly useful in text analysis and recommendation systems, where the absolute values (like document length or user rating scale) are less important than the pattern of values. It's invariant to scaling, meaning that multiplying a vector by a constant doesn't change its direction and therefore doesn't affect the cosine similarity with other vectors.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
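A standalone sketch of the cosine-similarity formula (dot product divided by the product of the magnitudes), using plain arrays rather than Vector<T>:

using System;

double[] v1 = { 1.0, 2.0, 3.0 };
double[] v2 = { 2.0, 4.0, 6.0 };      // same direction as v1, different magnitude

double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
for (int i = 0; i < v1.Length; i++)
{
    dot   += v1[i] * v2[i];
    norm1 += v1[i] * v1[i];
    norm2 += v2[i] * v2[i];
}

double cosine = dot / (Math.Sqrt(norm1) * Math.Sqrt(norm2));   // 1.0: identical orientation despite different scales
Console.WriteLine(cosine);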

DetermineBestFitDistribution(Vector<T>)

Analyzes a dataset and determines which statistical distribution best fits the data.

public static DistributionFitResult<T> DetermineBestFitDistribution(Vector<T> values)

Parameters

values Vector<T>

The vector of data values to analyze.

Returns

DistributionFitResult<T>

A result object containing the best-fitting distribution type and its parameters.

Remarks

For Beginners: This method helps you find which statistical pattern (distribution) best describes your data. It tests several common distributions (Normal, Laplace, Student's t, Log-Normal, Exponential, and Weibull) and returns the one that most closely matches your data's pattern. This is useful for understanding the underlying structure of your data and making predictions based on that structure.

Digamma(T)

Calculates the digamma function, which is the logarithmic derivative of the gamma function.

public static T Digamma(T x)

Parameters

x T

The input value.

Returns

T

The digamma function value at x.

Remarks

For Beginners: The digamma function is a special mathematical function that appears in various statistical calculations. It's the derivative of the natural logarithm of the gamma function. In simpler terms, it helps us understand how quickly the gamma function changes at different points, which is useful for parameter estimation in certain probability distributions.

EstimateWeibullParameters(Vector<T>)

Estimates the shape and scale parameters of a Weibull distribution from sample data.

public static (T Shape, T Scale) EstimateWeibullParameters(Vector<T> values)

Parameters

values Vector<T>

The sample data.

Returns

(T Shape, T Scale)

A tuple containing the estimated shape and scale parameters.

Remarks

For Beginners: The Weibull distribution is commonly used to model things like failure rates and lifetimes of components. The shape parameter determines the overall shape of the distribution, while the scale parameter stretches or compresses it. This method analyzes your data to find the Weibull distribution that best describes it, using a technique called the "method of moments" followed by refinement with the Newton-Raphson method.

EuclideanDistance(Vector<T>, Vector<T>)

Calculates the Euclidean distance between two vectors.

public static T EuclideanDistance(Vector<T> v1, Vector<T> v2)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

Returns

T

The Euclidean distance.

Remarks

For Beginners: The Euclidean distance is the "ordinary" straight-line distance between two points in Euclidean space. It's calculated as the square root of the sum of squared differences between corresponding elements of the two vectors. This is a direct application of the Pythagorean theorem extended to multiple dimensions. Euclidean distance is widely used in machine learning for tasks like clustering (e.g., k-means), nearest neighbor searches, and measuring similarity between data points. It works well when the data dimensions have similar scales and are somewhat independent. However, it can be sensitive to outliers and may not perform well when dimensions have very different scales or are highly correlated.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
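A standalone sketch of the Euclidean distance formula described above, using plain arrays:

using System;

double[] v1 = { 1.0, 2.0, 3.0 };
double[] v2 = { 4.0, 6.0, 3.0 };

double sumOfSquares = 0.0;
for (int i = 0; i < v1.Length; i++)
{
    double difference = v1[i] - v2[i];
    sumOfSquares += difference * difference;
}

double distance = Math.Sqrt(sumOfSquares);    // sqrt(9 + 16 + 0) = 5
Console.WriteLine(distance);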

FTest(Vector<T>, Vector<T>, T?)

Performs an F-test to compare the variances of two samples.

public static FTestResult<T> FTest(Vector<T> leftY, Vector<T> rightY, T? significanceLevel = default)

Parameters

leftY Vector<T>

The first sample vector.

rightY Vector<T>

The second sample vector.

significanceLevel T

The significance level for hypothesis testing (default is 0.05).

Returns

FTestResult<T>

An FTestResult object containing the test statistics and results.

Remarks

For Beginners: The F-test compares the variability (spread) between two groups of data. It helps determine if one group is significantly more variable than the other. The significance level (typically 0.05 or 5%) represents how willing you are to be wrong when rejecting the null hypothesis that both groups have equal variance.

Exceptions

ArgumentNullException

Thrown when either input vector is null.

ArgumentException

Thrown when either input vector has fewer than two elements.

InvalidOperationException

Thrown when both groups have zero variance.

Gamma(T)

Calculates the gamma function for a given value.

public static T Gamma(T x)

Parameters

x T

The input value.

Returns

T

The gamma function value at x.

Remarks

For Beginners: The gamma function is an extension of the factorial function to real numbers. While factorial (n!) is only defined for positive integers, the gamma function works for any positive real number. For a positive integer n, Gamma(n) = (n-1)!. This function is widely used in probability distributions and statistical calculations.

GeneratePosteriorPredictiveSamples(Matrix<T>, Vector<T>, int)

Generates samples from the posterior predictive distribution.

public static IEnumerable<T> GeneratePosteriorPredictiveSamples(Matrix<T> features, Vector<T> coefficients, int numSamples)

Parameters

features Matrix<T>

The feature matrix.

coefficients Vector<T>

The model coefficients.

numSamples int

The number of samples to generate.

Returns

IEnumerable<T>

A collection of test statistics calculated from the posterior predictive samples.

Remarks

For Beginners: Posterior predictive sampling is a way to generate new data that might be observed if the model is correct. This method creates samples by first calculating predicted values using the model's coefficients, then adding random noise to simulate the natural variability in the data. For each sample, it calculates a test statistic that compares the predicted values to the noisy predictions. These samples form the posterior predictive distribution, which can be used for model checking (as in posterior predictive checks) or to make predictions with uncertainty estimates. This approach is valuable because it accounts for both the uncertainty in the model parameters and the inherent randomness in the data generation process.

GenerateThresholds(Vector<T>)

Generates a set of unique threshold values from predicted values.

public static Vector<T> GenerateThresholds(Vector<T> predictedValues)

Parameters

predictedValues Vector<T>

The predicted values from a model.

Returns

Vector<T>

A vector of unique threshold values.

Remarks

For Beginners: This utility method extracts all unique values from a set of predictions to use as thresholds for classification. When evaluating a classification model that outputs continuous scores (like probabilities), you need to choose a threshold above which predictions are considered positive. Instead of arbitrarily selecting thresholds, this method identifies all possible thresholds by finding the unique values in the predictions. These thresholds can then be used to calculate performance metrics at different operating points, allowing you to create curves like ROC or precision-recall curves. This approach ensures that you evaluate the model at all meaningful threshold values without redundancy.

HammingDistance(Vector<T>, Vector<T>)

Calculates the Hamming distance between two vectors.

public static T HammingDistance(Vector<T> v1, Vector<T> v2)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

Returns

T

The Hamming distance.

Remarks

For Beginners: The Hamming distance counts the number of positions at which two vectors differ. It's a simple but powerful measure of dissimilarity that's particularly useful for categorical data, binary vectors, or strings. This method increments a counter each time it finds a position where the two vectors have different values. The result ranges from 0 (identical vectors) to the length of the vectors (completely different). Hamming distance is commonly used in information theory, coding theory, and error detection/correction. In machine learning, it's useful for comparing categorical features or binary representations. Unlike metrics that consider the magnitude of differences, Hamming distance only cares about whether values match exactly.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
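A standalone sketch of the Hamming distance for two vectors of equal length, counting the positions that differ:

using System;

double[] v1 = { 1, 0, 1, 1, 0 };
double[] v2 = { 1, 1, 1, 0, 0 };

int differingPositions = 0;
for (int i = 0; i < v1.Length; i++)
{
    if (v1[i] != v2[i]) differingPositions++;   // only exact mismatches count; magnitude is ignored
}

Console.WriteLine(differingPositions);          // 2 (positions 1 and 3 differ)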

IncompleteGamma(T, T)

Calculates the incomplete gamma function, which is used in various statistical distributions.

public static T IncompleteGamma(T a, T x)

Parameters

a T

The shape parameter.

x T

The value at which to evaluate the incomplete gamma function.

Returns

T

The value of the incomplete gamma function at the specified point.

Remarks

For Beginners: The incomplete gamma function is a mathematical function used in many statistical calculations. It's a building block for other statistical functions like the chi-square distribution. This implementation uses a numerical approximation to calculate its value by summing a series of terms until the desired precision is reached. Don't worry too much about the details - this is an advanced mathematical function that works behind the scenes to make other statistical calculations possible.

JaccardSimilarity(Vector<T>, Vector<T>)

Calculates the Jaccard similarity coefficient between two vectors.

public static T JaccardSimilarity(Vector<T> v1, Vector<T> v2)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

Returns

T

The Jaccard similarity coefficient.

Remarks

For Beginners: The Jaccard similarity coefficient measures the similarity between two sets by comparing what they have in common with what they have in total. It's calculated as the size of the intersection divided by the size of the union. This method adapts this concept to numeric vectors by treating each position as a partial membership in a set, using the minimum value as the intersection and the maximum value as the union at each position. The result ranges from 0 (no overlap) to 1 (identical). Jaccard similarity is particularly useful for sparse binary data, like presence/absence features or one-hot encoded categorical variables. It focuses on the attributes that are present in at least one of the vectors, ignoring positions where both vectors have zeros.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
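A standalone sketch of the numeric adaptation described above, using the element-wise minimum as the intersection and the element-wise maximum as the union:

using System;

double[] v1 = { 1.0, 0.0, 2.0, 0.0 };
double[] v2 = { 1.0, 1.0, 1.0, 0.0 };

double intersection = 0.0, union = 0.0;
for (int i = 0; i < v1.Length; i++)
{
    intersection += Math.Min(v1[i], v2[i]);   // what the vectors share at this position
    union        += Math.Max(v1[i], v2[i]);   // what at least one of them has
}

double jaccard = intersection / union;        // (1 + 0 + 1 + 0) / (1 + 1 + 2 + 0) = 0.5
Console.WriteLine(jaccard);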

MahalanobisDistance(Vector<T>, Vector<T>, Matrix<T>)

Calculates the Mahalanobis distance between two vectors, accounting for correlations in the data.

public static T MahalanobisDistance(Vector<T> v1, Vector<T> v2, Matrix<T> covarianceMatrix)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

covarianceMatrix Matrix<T>

The covariance matrix of the data.

Returns

T

The Mahalanobis distance.

Remarks

For Beginners: The Mahalanobis distance measures how many standard deviations away a point is from the mean of a distribution, taking into account the correlations between variables. Unlike Euclidean distance, which treats all dimensions equally, Mahalanobis distance gives less weight to dimensions with high variance and to dimensions that are highly correlated with others. This method calculates it by first finding the difference between the vectors, then multiplying by the inverse of the covariance matrix, and finally taking the square root of the dot product with the original difference. Mahalanobis distance is particularly useful for multivariate outlier detection, classification with correlated features, and when features have very different scales or units.

Exceptions

ArgumentException

Thrown when dimensions of vectors and covariance matrix don't match.

ManhattanDistance(Vector<T>, Vector<T>)

Calculates the Manhattan distance (L1 norm) between two vectors.

public static T ManhattanDistance(Vector<T> v1, Vector<T> v2)

Parameters

v1 Vector<T>

The first vector.

v2 Vector<T>

The second vector.

Returns

T

The Manhattan distance.

Remarks

For Beginners: The Manhattan distance (also known as taxicab or city block distance) measures the distance between two points as the sum of the absolute differences of their coordinates. It's called Manhattan distance because it's like measuring the distance a taxi would drive in a city laid out in a grid (like Manhattan). Unlike Euclidean distance, which measures the shortest path "as the crow flies," Manhattan distance follows the grid. This metric is less sensitive to outliers than Euclidean distance and can be more appropriate when the dimensions represent features that are not directly comparable or when movement along the axes has different costs. It's commonly used in machine learning algorithms like k-nearest neighbors when dealing with high-dimensional data.

Exceptions

ArgumentException

Thrown when vectors have different lengths.
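A standalone sketch of the Manhattan distance (sum of absolute coordinate differences), using plain arrays:

using System;

double[] v1 = { 1.0, 2.0, 3.0 };
double[] v2 = { 4.0, 0.0, 3.0 };

double distance = 0.0;
for (int i = 0; i < v1.Length; i++)
{
    distance += Math.Abs(v1[i] - v2[i]);   // move along each axis of the "grid"
}

Console.WriteLine(distance);               // |1-4| + |2-0| + |3-3| = 5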

MannWhitneyUTest(Vector<T>, Vector<T>, T?)

Performs a Mann-Whitney U test to compare distributions between two groups.

public static MannWhitneyUTestResult<T> MannWhitneyUTest(Vector<T> leftY, Vector<T> rightY, T? significanceLevel = default)

Parameters

leftY Vector<T>

The first group of values.

rightY Vector<T>

The second group of values.

significanceLevel T

The significance level for hypothesis testing (default: 0.05).

Returns

MannWhitneyUTestResult<T>

A result object containing the U statistic, Z-score, p-value, and significance level.

Remarks

For Beginners: The Mann-Whitney U test (also called Wilcoxon rank-sum test) compares two groups without assuming they follow a normal distribution. It works by ranking all values and analyzing the distribution of ranks between groups.

This test is useful when:
- Your data might not be normally distributed
- You have outliers that could skew results
- You're working with ordinal data (like ratings from 1-5)

The U statistic represents the number of times values in one group precede values in the other group. The Z-score standardizes this value, and the p-value tells you if the difference is statistically significant.

PermutationTest(Vector<T>, Vector<T>, T?)

Performs a permutation test to determine if there is a significant difference between two groups.

public static PermutationTestResult<T> PermutationTest(Vector<T> leftY, Vector<T> rightY, T? significanceLevel = default)

Parameters

leftY Vector<T>

The first group of values.

rightY Vector<T>

The second group of values.

significanceLevel T

The threshold p-value to determine statistical significance (default is 0.05).

Returns

PermutationTestResult<T>

A result object containing the test statistics and conclusion.

Remarks

For Beginners: A permutation test is a statistical technique that helps determine if the difference between two groups is due to chance or represents a real difference. It works by repeatedly shuffling all the data and recalculating the difference between groups to see how often a difference as large as the observed one occurs by random chance.

Think of it like shuffling a deck of cards many times to see how often a particular arrangement happens naturally. If your observed arrangement rarely occurs by chance, it's likely significant.
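The shuffling idea can be sketched as follows: repeatedly pool and reshuffle the two groups, recompute the difference in means, and count how often a difference at least as large as the observed one appears. This is a generic illustration of a permutation test, not the library's implementation (which may use a different test statistic or number of permutations):

using System;
using System.Linq;

double[] groupA = { 5.1, 4.9, 5.4, 5.0, 5.2 };
double[] groupB = { 5.9, 6.1, 5.8, 6.0, 6.2 };

double observed = Math.Abs(groupA.Average() - groupB.Average());
double[] pooled = groupA.Concat(groupB).ToArray();

var random = new Random(42);                                  // fixed seed so the sketch is reproducible
int permutations = 10_000, atLeastAsExtreme = 0;

for (int p = 0; p < permutations; p++)
{
    double[] shuffled = pooled.OrderBy(_ => random.Next()).ToArray();   // random relabeling of all values
    double diff = Math.Abs(shuffled.Take(groupA.Length).Average()
                         - shuffled.Skip(groupA.Length).Average());
    if (diff >= observed) atLeastAsExtreme++;
}

double pValue = (double)atLeastAsExtreme / permutations;      // a small p-value means the observed difference is unlikely to be chance
Console.WriteLine(pValue);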

TTest(Vector<T>, Vector<T>, T?)

Performs a Student's t-test to compare means between two groups.

public static TTestResult<T> TTest(Vector<T> leftY, Vector<T> rightY, T? significanceLevel = default)

Parameters

leftY Vector<T>

The first group of values.

rightY Vector<T>

The second group of values.

significanceLevel T

The significance level for hypothesis testing (default: 0.05).

Returns

TTestResult<T>

A result object containing the t-statistic, degrees of freedom, p-value, and significance level.

Remarks

For Beginners: The t-test helps determine if there's a significant difference between the averages of two groups. It's commonly used when you want to know if one group's average is truly different from another's.

For example, you might use a t-test to compare the average test scores of students who studied using two different methods.

The t-statistic measures how many standard errors the two means are apart. A larger absolute t-value suggests a greater difference between groups.

The p-value tells you the probability of seeing such a difference by random chance. If p < significanceLevel (typically 0.05), the difference is considered statistically significant.