When performing regression on a dataset, what kind of error should be minimized?
There are two options: ordinary least squares (OLS) and weighted least squares (WLS)
From Google search AI summary:
- Use OLS when your data is consistent and the variance of residuals is constant across all predictors (homoscedastic).
- Use WLS when the data is heteroscedastic, when using sample statistics (like means) that have unequal variances, or when some observations are known to be more accurate than others. [1, 2, 4, 5, 6]
Google search AI summary - ‘difference in linear fitting with least squares and weighted least squares’
The primary difference between linear fitting with Ordinary Least Squares (OLS) and Weighted Least Squares (WLS) is how they handle the variance of errors across data points. OLS treats all observations equally (assuming constant variance), while WLS assigns different weights to each data point based on its variance or reliability, making it ideal for data with non-constant variance (heteroscedasticity). [1, 2]
Key Differences at a Glance
| Feature [1, 2, 3, 4, 5] | Ordinary Least Squares (OLS) | Weighted Least Squares (WLS) |
|---|---|---|
| Error Assumption | Homoscedasticity (constant variance). | Heteroscedasticity (non-constant variance). |
| Weighting | All points weighted equally (w_i = 1). | Points weighted by w_i = 1/σ_i² (inverse of variance). |
| Objective | Minimize sum of squared residuals: Σ (y_i − ŷ_i)². | Minimize weighted sum of squared residuals: Σ w_i (y_i − ŷ_i)². |
| Accuracy | Less efficient when variances are not equal. | More efficient; precise points have higher impact. |
| Use Case | General, clean, consistent data. | Data with outliers, grouped data, or changing noise levels. |
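The two objectives in the table can be compared directly on synthetic data. A minimal sketch (hypothetical data, assuming the per-point noise level σ_i is known) that solves both fits via the normal equations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical heteroscedastic data: noise standard deviation grows with x.
x = np.linspace(1, 10, 50)
sigma = 0.2 * x                          # assumed known per-point noise level
y = 2.0 + 0.5 * x + rng.normal(0, sigma)

X = np.column_stack([np.ones_like(x), x])    # design matrix [1, x]

# OLS: minimize sum of squared residuals (all points weighted equally).
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# WLS: minimize sum of w_i * residual_i^2 with w_i = 1 / sigma_i^2.
w = 1.0 / sigma**2
W = np.diag(w)
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

print("OLS intercept/slope:", beta_ols)   # both near (2.0, 0.5)
print("WLS intercept/slope:", beta_wls)   # WLS leans on the precise low-x points
```

Both estimators are unbiased here; the difference is efficiency, so the WLS estimates vary less across repeated samples.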
Detailed Breakdown
- Handling Variance (Heteroscedasticity): OLS assumes that the noise (residuals) is constant across all measurements. When the noise is not constant—such as when smaller values are more precise and larger values are less precise—OLS becomes unreliable. WLS corrects this by giving less weight to high-variance (noisy) points and more weight to low-variance (precise) points.
- Influence on the Line: In OLS, an outlier or a noisy measurement at a high x-value can disproportionately pull the regression line away from the true relationship. WLS reduces the weight of these noisy points, resulting in a fit that more accurately reflects the bulk of the data.
- Precision and Efficiency: When data contains points of varying quality, WLS produces more efficient and precise estimates of the regression coefficients than OLS.
- When to Use Which:
  - Use OLS when your data is consistent and the variance of residuals is constant across all predictors (homoscedastic).
  - Use WLS when the data is heteroscedastic, when using sample statistics (like means) that have unequal variances, or when some observations are known to be more accurate than others. [1, 2, 4, 5, 6]
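The outlier-influence point above can be illustrated with a toy example (made-up numbers): one noisy point pulls the equal-weight fit, while downweighting it, as WLS would if its variance were known to be large, restores the bulk trend.

```python
import numpy as np

# Toy data (made up): y ≈ x, except the last point is a noisy outlier.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.1, 2.0, 2.9, 4.2, 5.0, 12.0])

X = np.column_stack([np.ones_like(x), x])  # design matrix [1, x]

def fit(w):
    """Solve the weighted normal equations (X^T W X) b = X^T W y."""
    W = np.diag(w)
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

b_equal = fit(np.ones_like(x))                            # OLS (equal weights)
b_down = fit(np.array([1.0, 1.0, 1.0, 1.0, 1.0, 0.01]))   # outlier downweighted

print("equal-weight slope: %.2f" % b_equal[1])   # pulled up by the outlier
print("downweighted slope: %.2f" % b_down[1])    # close to the bulk trend (~1)
```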
Example of WLS Weights
If a data point is the average of 10 observations, it should be given a higher weight than a data point that is the average of only 2 observations, because the former is more precise. If the variance (σ_i²) of a point is known, the weight is typically defined as the inverse of that variance (w_i = 1/σ_i²). [1, 4]
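The averaging rule follows from the variance of a mean: averaging n observations of variance σ² gives a point with variance σ²/n, so its inverse-variance weight is proportional to n. A quick arithmetic check (σ² = 4 is an arbitrary example value):

```python
# Variance of a mean of n observations is sigma2 / n, so the WLS
# weight 1/variance is proportional to n.
sigma2 = 4.0   # arbitrary example: variance of one raw observation

var_mean_10 = sigma2 / 10   # point averaged from 10 observations
var_mean_2 = sigma2 / 2     # point averaged from only 2 observations

w10 = 1.0 / var_mean_10
w2 = 1.0 / var_mean_2

print(w10 / w2)   # → 5.0: the 10-observation mean gets 5x the weight
```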
[1] https://statisticsbyjim.com/regression/weighted-least-squares/
[3] https://online.stat.psu.edu/stat501/lesson/13/13.1
[4] https://www.stat.cmu.edu/~larry/=stat401/lecture-24.pdf
[6] https://online.stat.psu.edu/stat501/book/export/html/1086
P-value and R² value
https://rcompanion.org/handbook/G_10.html
Google search AI summary
P-values and R-squared (R²) are distinct regression metrics: R² measures the strength of the relationship (proportion of variance explained), while the P-value determines the statistical significance of that relationship. A high R² shows strong predictive power, while a low P-value (< 0.05) indicates the model is better than a null model. [1, 2, 3, 4]
Key Differences and Relationships:
- R-Squared (R²): Ranges from 0 to 1 (0% to 100%). It gives the proportion of variation in the dependent variable explained by the independent variables. A higher value means the model explains more variation.
- P-value: Indicates whether the model’s predictive power is likely due to chance. A p-value < 0.05 suggests the model is statistically significant, meaning the relationship is likely real.
- Relationship: They do not always correlate. It is possible to have a high R² with a high p-value (good fit, but not significant) or a very low R² with a low p-value (poor fit, but significant, often with large sample sizes).
- Interpretation: R² indicates how well the data points fit the line, while the p-value indicates whether the slope is significantly different from zero. [1, 2, 4, 5, 6, 7]
Example: In a large dataset, a very small, weak relationship (low R² of 0.05) might still be statistically significant (p < 0.05) because the sample size is large enough to confirm the relationship isn’t just noise. [1, 2, 8, 9, 10]
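This large-sample effect can be reproduced with synthetic data (hypothetical slope and sample size; p-value via a normal approximation, which is reasonable at this n):

```python
import math
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical large sample with a real but weak linear relationship.
n = 10_000
x = rng.normal(size=n)
y = 0.1 * x + rng.normal(size=n)   # true slope 0.1, noise sd 1

# Ordinary least-squares line.
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

# R^2: proportion of variance in y explained by the fit.
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

# t-statistic for H0: slope = 0, with a two-sided normal-approximation
# p-value (fine because n is large).
se_slope = math.sqrt(ss_res / (n - 2)) / math.sqrt(np.sum((x - x.mean()) ** 2))
t = slope / se_slope
p_value = math.erfc(abs(t) / math.sqrt(2))

print(f"R^2 = {r_squared:.4f}")   # low: only ~1% of variance explained
print(f"p   = {p_value:.1e}")     # far below 0.05: significant anyway
```

So the fit is statistically significant but practically weak: significance tells you the slope is real, not that it matters.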
[2] https://rcompanion.org/handbook/G_10.html
[3] https://www.displayr.com/what-is-r-squared/
[4] https://rstudio-pubs-static.s3.amazonaws.com/411980_85836ec0002244819a77e3f8ad5f20bd.html
[5] https://www.rpubs.com/shubh2565/pvalue-rsquared
[7] https://www.reddit.com/r/statistics/comments/2sb9gv/eli5_pvalue_vs_rsquared/
[8] https://www.reddit.com/r/AskStatistics/comments/1havonz/multiple_linear_regression_low_r2_value_but/
[10] https://medium.com/@nivedita.home/understand-p-value-and-confidence-interval-6fcf3e68667b