Talk:Correlation/Archive 1
It is possible that, in the paragraph headed "The Sample Correlation" in the section labelled "Pearson's product-moment coefficient", the author has inadvertently left an expression for the convenient calculation of r in the position it occupied during a draft (it occurs on the first line of a set of equations). The definition of r appears on the next line and the first expression (which may be derived from it) appears for a second time below this. From my reading, I suspect that the author intended to move the first expression in two stages using copy and paste followed by deletion of the original (but the deletion was forgotten). I did not wish to delete the relevant line without providing an opportunity for the original author to check. Xenoglossophobe (talk) 14:01, 29 June 2008 (UTC)
The "n" term doesn't make any sense to me. If i=1,2,...n for both X and Y, wouldn't the total number of observations be 2n? In fact, usually for X, i=1,2,...n and for Y, j=1,2...m, making the total n+m. So should the bottom part of that formula be n+m? Akshayaj 15:27, 7 August 2007 (UTC)
However, the n-1 term from before seems correct, as opposed to the n term that's there now: http://stattrek.com/AP-Statistics-1/Correlation.aspx?Tutorial=AP Akshayaj 16:08, 7 August 2007 (UTC)
I think including the Python code is too much, since Wikipedia is not really a code repository. I have removed it. Please discuss if you have a problem with that. Brianboonstra (talk) 13:49, 18 January 2008 (UTC)
To be more precise, the expression of the algorithm in Python (which actually happens to be a wonderful language) is less clear, due to the zero offsets. For example, the range(1,N) expression has nonobvious effects to someone who does not know the language. Brianboonstra (talk) 13:56, 18 January 2008 (UTC)
I recently checked the algorithm and it computes the correlation with slightly different formula: , is this on purpose (in that case some note in the text would be needed) or an error? --Tomas.hruz 09:53, 6 October 2006 (UTC)
I believe the calculation is fine. Note that the std devs used in the denominator are population std devs. Brianboonstra (talk) 13:49, 18 January 2008 (UTC)
The algorithm does not take into account the case when either pop_sd_x or pop_sd_y is zero, causing a divide by zero on the last line. holopoj 17:06, 5 August 2006 (UTC)
It seems like the algorithm is calculating something wrong (or maybe it is just that I coded it wrong!), but I wrote it in C++, and it does not calculate the correct covariance for the proposed example, which should be -841.667 as calculated in Excel and R. I used a straight algorithm in C++ (without any optimization) and it gave the right answer. Could somebody tell me what was my mistake in coding it? Thanks in advance. Here is the code:
double cov(double* x, double* y, int tamano, int tipo) {
    int i;
    double sumCuadX = 0.0, sumCuadY = 0.0, sumCoprod = 0.0,
           mediaX = x[0], mediaY = y[0],
           barre, deltaX, deltaY, pobSDX, pobSDY, covcor;
    for (i = 1; i < tamano; i++) {
        barre = ((double)i - 1.0)/(double)i;
        deltaX = x[i] - mediaX;
        deltaY = y[i] - mediaY;
        sumCuadX += deltaX*deltaX*barre;
        sumCuadY += deltaY*deltaY*barre;
        sumCoprod += deltaX*deltaY*barre;
        mediaX += deltaX/(double)i;
        mediaY += deltaY/(double)i;
    }
    pobSDX = sqrt(sumCuadX/(double)tamano);
    pobSDY = sqrt(sumCuadY/(double)tamano);
    covcor = sumCoprod/(double)tamano;
    if (tipo == CORRELACION)
        covcor /= pobSDX*pobSDY;
    return covcor;
}
Paulrc 25 19:39, 27 December 2006 (UTC)
The problem lies in your incomplete translation of the one-based index code to your zero-based index version. There appear to be three changes, all within the loop and all dealing with adjustments to i:
#include <cmath>

// CORRELACION is the caller's constant that selects correlation over covariance.
double cov(double* x, double* y, int tamano, int tipo) {
    int i;
    double sumCuadX = 0.0, sumCuadY = 0.0, sumCoprod = 0.0,
           mediaX = x[0], mediaY = y[0],
           barre, deltaX, deltaY, pobSDX, pobSDY, covcor;
    for (i = 1; i < tamano; i++) {
        // Zero-based i: this is observation number i + 1.
        barre = i/(1.0 + i);
        deltaX = x[i] - mediaX;
        deltaY = y[i] - mediaY;
        sumCuadX += deltaX*deltaX*barre;
        sumCuadY += deltaY*deltaY*barre;
        sumCoprod += deltaX*deltaY*barre;
        // Running means are updated with the count i + 1, not i.
        mediaX += deltaX/(1 + i);
        mediaY += deltaY/(1 + i);
    }
    pobSDX = sqrt(sumCuadX/(double)tamano);
    pobSDY = sqrt(sumCuadY/(double)tamano);
    covcor = sumCoprod/(double)tamano;
    if (tipo == CORRELACION)
        covcor /= pobSDX*pobSDY;
    return covcor;
}
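For comparison, here is a minimal Python sketch of the same single-pass update with zero-based indexing, plus a guard for the zero-standard-deviation case raised earlier in this thread. The function name `one_pass_stats` and the choice to return `None` when the correlation is undefined are assumptions made for illustration; they are not taken from the article.

```python
import math


def one_pass_stats(x, y):
    """Return (population covariance, correlation or None) in a single pass."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    n = len(x)
    mean_x, mean_y = x[0], y[0]
    sum_sq_x = sum_sq_y = sum_coproduct = 0.0
    for i in range(1, n):
        # Zero-based i: this is observation number i + 1.
        sweep = i / (i + 1.0)
        delta_x = x[i] - mean_x
        delta_y = y[i] - mean_y
        sum_sq_x += delta_x * delta_x * sweep
        sum_sq_y += delta_y * delta_y * sweep
        sum_coproduct += delta_x * delta_y * sweep
        mean_x += delta_x / (i + 1)
        mean_y += delta_y / (i + 1)
    pop_sd_x = math.sqrt(sum_sq_x / n)
    pop_sd_y = math.sqrt(sum_sq_y / n)
    cov = sum_coproduct / n
    if pop_sd_x == 0.0 or pop_sd_y == 0.0:
        # A constant series has no defined correlation; avoid dividing by zero.
        return cov, None
    return cov, cov / (pop_sd_x * pop_sd_y)


print(one_pass_stats([1, 2, 3, 4], [2, 4, 6, 8]))
print(one_pass_stats([1, 1, 1], [1, 2, 3]))
```

The exact comparison with `0.0` only catches exactly constant input; a nearly constant series would still give a numerically unstable result.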
Could we see some account of this concept of "correlation ratio"? All I can find is on Eric Weisstein's site, and it looks like what in conventional nomenclature is called an F-statistic. Michael Hardy 21:02 Mar 19, 2003 (UTC)
it goes somewhat like:
correlation_ratio(Y|X) = 1 - E(var(Y|X))/var(Y)
I don't know the conventional nomenclature, but in the literature on similarity measures for image registration it is called just this...
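A small Python sketch of the formula as written above, assuming X takes a handful of discrete values so that var(Y|X) can be computed per group. The function name `correlation_ratio` and the use of population (divide-by-n) variances are assumptions. The quantity is often written η² elsewhere, but the naming here just follows the comment.

```python
from collections import defaultdict


def correlation_ratio(x_labels, y):
    """1 - E(var(Y|X)) / var(Y), with X given as discrete labels."""
    groups = defaultdict(list)
    for label, value in zip(x_labels, y):
        groups[label].append(value)
    n = len(y)
    grand_mean = sum(y) / n
    total_var = sum((v - grand_mean) ** 2 for v in y) / n
    if total_var == 0:
        raise ValueError("var(Y) is zero; the ratio is undefined")
    # E(var(Y|X)): within-group variances weighted by group size.
    within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        within += sum((v - group_mean) ** 2 for v in values)
    expected_conditional_var = within / n
    return 1.0 - expected_conditional_var / total_var


# X fully determines Y, so the ratio is 1.
print(correlation_ratio(["a", "a", "b", "b"], [1, 1, 3, 3]))
# X carries no information about Y, so the ratio is 0.
print(correlation_ratio(["a", "b", "a", "b"], [1, 1, 3, 3]))
```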
The relation between them is already there on the autocorrelation page. "...the autocorrelation is simply the correlation of the process against a time shifted version of itself." You can see this trivially by considering the equation for correlation if the series $Y_t = X_{t-k}$. --Richard Clegg 20:43, 7 Feb 2005 (UTC)
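The substitution described above can be sketched in Python by correlating a series with a copy of itself shifted by k steps. The helper names `pearson` and `lag_correlation` are hypothetical. This plain Pearson correlation over the overlapping part is not identical to the usual sample autocorrelation estimator, which uses the overall mean and variance.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    s_aa = sum((u - mean_a) ** 2 for u in a)
    s_bb = sum((v - mean_b) ** 2 for v in b)
    return s_ab / math.sqrt(s_aa * s_bb)


def lag_correlation(x, k):
    """Correlation between X_t and Y_t = X_{t-k} over the overlapping samples."""
    if not 0 < k < len(x) - 1:
        raise ValueError("lag must leave at least two overlapping samples")
    return pearson(x[k:], x[:-k])


series = [0, 1, 0, -1] * 5  # a signal with period 4
print(lag_correlation(series, 4))  # shifted by a full period
print(lag_correlation(series, 2))  # shifted by half a period
```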
This page currently tells only the mathematical aspects of correlation. While it is, obviously, a mathematical concept, it is used in many areas of research such as Psychology (my own field; sort of) in ways that would be better defined by purpose than mathematical properties. What I mean is, I'm not sure how to add information about what correlation is used for into this article - I wanted to put in the "vicars and tarts" demonstration of "correlation doesn't prove causality", for instance. But that would require a rather different definition of correlation, in terms of "the relationship between two variables" or something. Any ideas on how to rewrite would be welcome - if not, of course, I'll do it myself at some point...
Oh, and I can't decide what to do about that ext. link - as is, it's rather useless, taking you to the homepage of a particular reference site (I suspect it of being "Wikispam"); but if you find the right page and break out of their frameset, there is actually some interesting info at http://www.statsoft.com/textbook/stbasic.html#Correlations. Ah well, maybe I'll come back to this after I've sorted out some of the memory-related pages... IMSoP 17:43, 20 May 2004 (UTC)
I thought that the deleted stuff about the sample for correlation was useful. Not enough stats people pay attention to the difference between a statistic and an estimator for that statistic. The Pearson product-moment correlation coefficient page does cover this but it would be nice to see the treatment for the standard correlation too (IMHO at least).--Richard Clegg 20:21, 10 Feb 2005 (UTC)
I think so too, but I was rushed. I will put the section back soon, but I will combine it with the Pearson's section. Paul Reiser 21:09, 10 Feb 2005 (UTC)
--Richard Clegg 22:45, 10 Feb 2005 (UTC)
what about the signal processing version of correlation? kind of the opposite of convolution, with one function not reversed. also autocorrelation. does it have an article under a different name? if so, there should be a link. after reading this article over again, i believe the two are related. i will research some and see, (and add them to my to do list) but please add a bit if you know the connection... Omegatron 20:10, Feb 13, 2004 (UTC)
Correlation matrix search redirects to this page but I can't find here what a correlation matrix is. I have some idea from http://www.vias.org/tmdatanaleng/cc_covarmat.html , but don't feel confident enough to write an entry, and I am not sure where to add it.
Covariance_matrix exists.
Scatter_matrix does not.
--Dax5 19:16, 7 May 2005 (UTC)
The "Correlation function in spreadsheets" section looks very useless to me, and the information included is probably wrong since the correlation of two real numbers does not make sense. I will delete it, if you put it back can you tell me why?
Muzzle 12:44, 6 September 2006 (UTC)
I was the one that put the disclaimer on "random" variables. If anybody would like to discuss, I'm all ears, so to speak. —The preceding unsigned comment was added by Phili (talk • contribs) .
The "unsigned" person wrote utter nonsense. This is a crackpot. Michael Hardy 02:38, 30 November 2005 (UTC)
Perhaps I'm being thick, but after a minute or two of scrutinising it I couldn't work out how to read the diagram on this article. Which scatter plot corresponds to which coefficient, and why are they arranged in that way? It is not clear. Ben Finn 22:12, 18 January 2006 (UTC)
I actually thought the figure is awesome, but now that I consider it, I wonder if it is intuitive and informative only for those who understand correlation well enough not to really need the figure. Also, I think it would be instructive to show a high-correlation scatterplot where the variances of the two underlying series are in a ratio of, say, 1:6 rather than 1:1 in the plots shown. --Brianboonstra 15:53, 3 March 2006 (UTC)
I have no clue how that figure works, and I'm in a PhD program. --Alex Storer 22:50, 17 April 2006 (UTC)
I have added this sentence to the caption to try to clarify it, in case anyone is still confused:
Michael Hardy 22:14, 22 April 2006 (UTC)
I understand the figure, but I think it's WAY too complicated, especially for someone who doesn't already know what it is. It's slightly neat to see that you have four different data sets generated, and you're looking at all pairs... but I think for most people it would be MUCH MUCH clearer if you just showed four examples in a row, with labels directly above: R² in {0, .5, .75, 1} or something. 24.7.106.155 09:27, 7 May 2006 (UTC)
I have to agree that the diagram is overcomplicated. It also doesn't show negative correlations. Would it be better to have a table with two rows? Each column could have a correlation coefficient as a number in the first row, and a scatter plot in the second row. The coefficients could range between -1 and 1. I think that this would also emphasise that a negative correlation is still a strong correlation. 80.176.151.208 07:49, 31 May 2006 (UTC)
Sometimes one sees the term "intercorrelation". What exactly does this signify? I associate "intercorrelation" with the correlation between two different variables - but that is what standard "correlation" is. It seems to me that "inter" is redundant... And the opposite of autocorrelation is not intercorrelation but cross-correlation... -fnielsen 15:43, 10 February 2006 (UTC)
The table at the beginning of the article is flawed in almost every regard. First, it is poorly designed. Suppose a reader wants to know what a low correlation is. He or she looks at the row, sees "low," and sees that the cell below it says "> -0.9." At first glance, this makes it sound as though ANY correlation that is greater than -0.9 is low, including 0, 0.9, etc. Then the next column says "low: < -0.4." It takes a moment to figure out that the author was actually intending to convey "low: -0.9 < r < -0.4." Something like this would be better:
High correlation | Low correlation | No correlation (random) | Low correlation | High correlation |
−1 < r < −0.9 | −0.9 < r < −0.4 | −0.4 < r < +0.4 | +0.4 < r < +0.9 | +0.9 < r < +1 |
though some letter other than r might be better, and less-than-or-equal-to signs belong in there somewhere. That brings up the second problem, though: Where on earth did these numbers come from? Cohen, for example, defines a "small" correlation as 0.10 <= |r| < 0.3, a "medium" correlation as 0.3 <= |r| < 0.5, and a "large" correlation as 0.5 <= |r| <=1. I know of no one who thinks that a correlation between -0.4 and 0.4 signifies no correlation.
Then there's the argument--made by Cohen himself, among others--that any such distinctions are potentially flawed and misleading, and that the "importance" of correlations depends on the context. No such disclaimer appears in the article, and the reader might take these values as dogma.
I suggest that the table be removed entirely. Failing that, it should at the very least be revised for clarity as described above, and a disclaimer should be added. The values in the table should be changed to Cohen's values, or else the source of these values should be mentioned somewhere.
I'd be happy to make all of the changes that I can, but as I'm new to Wikipedia I thought I'd defer to more experienced authors.
--Trilateral chairman 22:51, 22 March 2006 (UTC)
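The bands attributed to Cohen in the comment above can be written as a small Python lookup. This is only a sketch of the thresholds as quoted there; the function name `cohen_label` and the label used for values below 0.10 are assumptions, and the caveat that such cut-offs depend on context still applies.

```python
def cohen_label(r):
    """Label |r| with the bands quoted above: small, medium, large."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation coefficient must lie in [-1, 1]")
    size = abs(r)  # the sign does not affect the strength of the relationship
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    if size >= 0.1:
        return "small"
    return "below the 'small' threshold"


for value in (-0.9, -0.45, 0.2, 0.05):
    print(value, cohen_label(value))
```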
Okay. I've removed the old table, added Cohen's table with a citation, and added the disclaimer with the explanation you suggested. Here is the old table if anyone wants it:
High correlation | High | Low | Low | No | No correlation (random) | No | Low | Low | High | High correlation |
−1 | < −0.9 | > −0.9 | < −0.4 | > −0.4 | 0 | < +0.4 | > +0.4 | < +0.9 | > +0.9 | +1 |
--Trilateral chairman 01:18, 24 March 2006 (UTC)
What is the E in the first equations? Why isn't the E replaced by capital sigma indicating the sum of?
I have seen E before in statistics texts. If it is some standard notation, it should be explained.
Gary 16:24, 28 March 2006 (UTC)
I think it is the expected value. It indeed is the expected value. Often calculated as (k/N) where k is the total number of objects and N is the number of intervals.
I'd really appreciate it if someone could expand the first couple of paragraphs a bit to better explain correlation. While I'm sure that the rest of the article is correct, for me, as someone without a math background, it doesn't make much sense. I understand that by the very nature of the topic it is complicated, but I'd still like to have some sort of understanding of the text. Thank you! cbustapeck 17:16, 13 October 2006 (UTC)
What I find confusing is that it first defines correlation to then move to the pearsons correlation coefficient without really explaining the relation between the two. The first section on correlation is quite clear. But the Sample correlation section is entirely confusing and there is not a single mention of any relationship to what was said in the previous section. And the formula is different than the one in the first section. Computing the expected value would mean dividing by n but the formula in 'Sample Correlation' divides by n-1. —The preceding unsigned comment was added by 67.93.205.78 (talk)
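On the n versus n-1 question raised here and earlier on this page, a short Python sketch may help. If the same divisor is used for the covariance and for both standard deviations, it cancels, so the coefficient comes out the same either way. Here n counts the (x, y) pairs. The function name `correlation` and its `ddof` parameter are illustrative assumptions.

```python
import math


def correlation(x, y, ddof):
    """Correlation with divisor n - ddof used consistently throughout."""
    n = len(x)  # number of (x, y) pairs, not the total count of numbers
    mean_x, mean_y = sum(x) / n, sum(y) / n
    divisor = n - ddof
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / divisor
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x) / divisor)
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y) / divisor)
    return cov / (sd_x * sd_y)


x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(correlation(x, y, ddof=0))  # divide by n
print(correlation(x, y, ddof=1))  # divide by n - 1
```

The choice of divisor matters for the covariance and the standard deviations taken individually, but not for their ratio.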
Maybe this may sound completely ignorant, but I have a minimal background in math, and what I'm interested in is the implementation of this type of concept. What's the general purpose of this formula? What does it accomplish? How is it implemented? Should I just go back to school? I read layman physics books and the concepts are explained fully in plain language. Not just the barebones formulas, but the implications as well. Can the implications of this type of math be explained?70.66.9.70 15:45, 31 March 2007 (UTC)
Recently an editor removed a whole string of on-line publications by Herve Abdi, which did seem somewhat self-promotional to include here. However the textbook by Cohen et al. is the only major textbook (that people might use in a course) that was listed, and it also was removed. Does anyone object to restoring the Cohen book to the Reference list, or under the heading 'Further Reading' if you prefer? EdJohnston 19:07, 13 December 2006 (UTC)
Isn't there a difference between correlation and coefficient of correlation? The coefficient lying between -1 and 1, while the general term 'correlation', can have any numerical value attached to it?
If so, you will find the introduction somewhat misleading: "In probability theory and statistics, correlation, also called correlation coefficient" - correlation and correlation coefficient are not quite the same thing, but are very very similar.
Will someone who has a fresher statistics/econometrics background please confirm this and edit the main page accordingly? Cheers all, --ToyotaPanasonic 13:31, 24 December 2006 (UTC)
[A comment left here by User:Jjoffe was removed by EdJohnston. See my further note below. I left intact the response by User:Chris53516 who was responding to Joffe. -- EdJohnston 14:32, 8 January 2007 (UTC)]
I'm not crazy about this section under common misconceptions: An appropriately expanded expression may be "correlation is not causation, but it sure is a hint." I don't think this is an illuminating rephrasing, in part because the rationale behind neither dictum (correlation is not causation nor the one quoted above), is explained sufficiently. I'd be more happy with:
The conventional dictum that "correlation does not imply causation" is a commonly-used admonition to using correlation to support a direct causal relationship among the variables. However, this admonition should not be taken to mean that correlations are acausal, merely that the causes underlying the correlation may be indirect and unknown. A correlation between age and height is fairly causally transparent, but a correlation between mood and health might be less so. Does improved mood lead to improved health? Or does good health lead to good mood? Or does some other factor underlie both? In other words, a correlation can be taken as evidence for a causal relationship, but cannot indicate precisely what the causal relationship might be.
Comments? SJS1971 13:07, 24 January 2007 (UTC)
A "{{prod}}" template has been added to the article Currency correlation, suggesting that it be deleted according to the proposed deletion process. All contributions are appreciated, but the article may not satisfy Wikipedia's criteria for inclusion, and the deletion notice explains why (see also "What Wikipedia is not" and Wikipedia's deletion policy). You may contest the proposed deletion by removing the {{dated prod}} notice, but please explain why you disagree with the proposed deletion in your edit summary or on its talk page. Also, please consider improving the article to address the issues raised. Even though removing the deletion notice will prevent deletion through the proposed deletion process, the article may still be deleted if it matches any of the speedy deletion criteria or it can be sent to Articles for Deletion, where it may be deleted if consensus to delete is reached. John 10:30, 14 June 2007 (UTC)
It appears to me that the article Association (statistics) is really discussing correlation. Is there a distinction between the terms ‘association’ and ‘correlation’? If so, could someone please edit the article association (statistics) to state what that distinction is. If there is no distinction, then should the articles be merged? --Mathew5000 00:26, 4 July 2007 (UTC)
Another reference to Herve Abdi, inserted by an anonymous user with ip address 129.110.8.39 which seems to belong to the University of Texas at Dallas. Apparently the only editing activity so far has been to insert excessive references to publications by Herve Abdi, of the University of Texas at Dallas. The effect is that many Wikipedia articles on serious scientific topics currently are citing numerous rather obscure publications by Abdi et al, while ignoring much more influential original publications by others. I think this constitutes an abuse of Wikipedia. A while ago, as a matter of decency, I suggested to 129.110.8.39 to remove all the inappropriate references in the numerous articles edited by 129.110.8.39, before others do it. For several months nothing has happened. I think it is time to delete the obscure reference. Truecobb 21:32, 15 July 2007 (UTC)
Is this true? If so, can someone give me a (preferably nonmathematical) example of this? The X vs. sin(X) example does not seem to work, as I would think that a functional relationship between the two means they are correlated.
Thanks Akshayaj 20:38, 20 July 2007 (UTC)
The effect of taking Tylenol on reducing pain. Taking Tylenol does cause a patient's perceived level of pain to go down, but this may be due to the placebo effect. Here, the Tylenol does cause the reduction in pain, but the drug itself is not correlated with the reduction in pain.
Does this work? Thanks Akshayaj 20:58, 20 July 2007 (UTC)
Here is a simple example: hot weather may cause both crime and ice-cream purchases. Therefore crime is correlated with ice-cream purchases. But crime does not cause ice-cream purchases and ice-cream purchases do not cause crime. Michael Hardy 01:41, 21 July 2007 (UTC)
The section on this subject, to my mind, misses out on a very important element in the interpretation of correlation coefficients. The point that I want to make is that, in addition to calculating the correlation coefficient, a researcher will often be well advised to carry out a significance test. A significance test will guide a researcher as to how much importance should be attached to a correlation coefficient.
The easiest way to carry out the significance test is to compare the 'test value' with a 'table value'. The test value is simply equal to 'mod r', i.e the value of the correlation coefficient, ignoring a negative sign if applicable. Table values can be acquired from appropriate published tables (sorry I can't quote a reference here 'off the top of my head' but I am certain that another reader will be able to fill in the gap). The table value to be used depends on the level of significance required (e.g. 5% or 1% etc.) and on the number of 'degrees of freedom' which in this case is equal to n-2 where n is the sample size. If the test value exceeds the table value then the correlation is significant at the percentage selected.
An instructive example would be a piece of research where the sample size was, say, 32 so that n-2 was 30. In such a case, the table value for 5% significance is 0.35 and for 1% significance it is 0.45. It follows that in such a case, correlation coefficients as low as 0.35 would be significant at 5% and as low as 0.45 would be significant at 1%. Thus, although a correlation in this situation of 0.45 is only 'medium' according to Cohen's table, it is nevertheless significant at 1% i.e. there is only a 1% or less probability that the observed relationship is just 'due to chance' as distinct from being a real relationship.
A converse example would be where the sample size was just 7 (n-2=5). In this case, consultation of the tables will tell us that we need a correlation coefficient of at least 0.75 to be significant at 5% and at least 0.88 for significance at 1%! Agibbs100 21:48, 9 August 2007 (UTC)
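The table values quoted above can be reproduced approximately from critical values of Student's t, using the relation t = r·sqrt(df)/sqrt(1 - r²) solved for r. The Python sketch below assumes a two-tailed test and hard-codes four critical t values from standard tables; the names `critical_r` and `T_CRITICAL` are hypothetical.

```python
import math

# Two-tailed critical values of Student's t, keyed by (degrees of freedom, level).
T_CRITICAL = {
    (30, 0.05): 2.042,
    (30, 0.01): 2.750,
    (5, 0.05): 2.571,
    (5, 0.01): 4.032,
}


def critical_r(t_crit, df):
    """Smallest |r| that is significant, given the critical t and df = n - 2."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


for (df, level), t_crit in sorted(T_CRITICAL.items()):
    print(f"n = {df + 2}, level {level}: |r| must exceed {critical_r(t_crit, df):.3f}")
```

With these inputs the thresholds come out near 0.35 and 0.45 for n = 32, and near 0.75 and 0.87 for n = 7, close to the figures quoted in the comment.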
The following refers to following remark added by User:Hongguanglishibahao: 'However, Guang Wu does show how to deliberately make the correlation coefficient be larger than unity.' (see [1]).
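Let $X$ and $Y$ be two random variables that are not degenerate and have finite first two moments. These assumptions are required for the correlation coefficient to be properly defined; in particular they imply that the variances of $X$ and $Y$ are nonzero and finite. Then
$$0 \le \operatorname{Var}\left(\frac{X}{\sigma_X} + \frac{Y}{\sigma_Y}\right) = 2 + 2\rho_{X,Y}, \quad\text{so}\quad \rho_{X,Y} \ge -1.$$
Similarly
$$0 \le \operatorname{Var}\left(\frac{X}{\sigma_X} - \frac{Y}{\sigma_Y}\right) = 2 - 2\rho_{X,Y}, \quad\text{so}\quad \rho_{X,Y} \le 1.$$
From these two inequalities it follows that $-1 \le \rho_{X,Y} \le 1$.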
Unfortunately I cannot access the article 'Wu, G. (2003). An extremely strange observation on the equations for calculation of correlation coefficient. European Journal of Drug Metabolism and Pharmacokinetics, 28, 85-92.', so I cannot explain why it claims that the correlation coefficient can deliberately be made larger than unity. Given the proof above --- which can be found in just about any basic text on statistics (that provides proofs) --- it seems Mr. Wu is either making a mistake or using different definitions. Note in particular that the standard deviations of both variables need to be nonzero, otherwise the correlation coefficient is not defined (or is defined to be zero by some texts). A proof such as the one given above simply falsifies the claimed statement. The fact that the article is not published in a mathematics journal might explain some things here. (no pun intended) --Kuifware 10:11, 16 August 2007 (UTC)
Various equations are used to calculate the correlation coefficient, these equations are presumed equally. However we find the extraordinary results when using $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$ and $r^2 = \left(\sum(y_i - \bar{y})^2 - \sum(y_i - \hat{y}_i)^2\right) / \sum(y_i - \bar{y})^2$ to calculate the correlation coefficient, for example, a line within 95% confidence band of a regressed line. The results are so extraordinary that we do not know whether or not we can still call the results as correlation coefficient, however we are sure that these results need to be presented.
Hi, Debivort. Thanks for continuing deleting the addition, 'However, Guang Wu does show how to deliberately make the correlation coefficient be larger than unity.' However, this remark is fully referenced in international peer-reviewed journal, which as all international peer-reviewed journals, has strong and strict reviewing process.
I really wonder why you do not read the reference before deleting, and the paper can be obtained by emailing postmaster@dreamscitech.com, as you know that the paper is copyrighted, whose contents cannot be put here.
Still, what I present here is the verified fact, please be unbiased to treat real facts. —Preceding unsigned comment added by Hongguanglishibahao (talk • contribs) 02:29, August 26, 2007 (UTC)
Regards —Preceding unsigned comment added by Hongguanglishibahao (talk • contribs) 26 August, 2007.
By the way, the comment by H. Schütz in European Journal of Drug Metabolism and Pharmacokinetics should be in the form of letter to Editor, and G. Wu has answered this comment. Mr Schütz has no more responses for the answer. —Preceding unsigned comment added by Hongguanglishibahao (talk • contribs) 26 August, 2007.
By the strong request of Debivort, let us discuss how to deliberately make the correlation coefficient be larger than unity. Please note DELIBERATELY. Besides, the study was done almost 10 years ago, it is only recently that we have full accession of Wiki inside China, thus I decided to put this buried result in light.
Let us assume that we have a dataset, x = 0, 5, 10, 15, 20, 25, 30 and 35; and y = 0.1, 5.5, 9.7, 14.3, 21.7, 24.3, 32, 34.4, which resulted in y = 0.0917 + 1.0091x, with r = 0.9963, 10 years ago. If we used x = 0, 5, 10, 15, 20, 25, 30 and 35 into y = 0.0917 + 1.0091x, we get 0.0917, 5.1372, 10.1827, 15.2282, 20.2737, 25.3192, 30.3647, and 35.4102.
Put these data into $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2} = \sqrt{1069.1970/1077.0800} = 0.9963$.
When using a pharmacokinetic software to fit this dataset, let us assume that we stop fitting before reaching the global minimum, although the global minimum can be easily found analytically for this dataset. Say, we stopped very near to y = 0.0917 + 1.0091x, which is located in the 95% confidence intervals for slope (0.9219 to 1.0962) and 95% confidence band for y = 0.0917 + 1.0091x.
The stopped line, for example, is (worse y) = 0.0917 + 1.02x. Then we put x = 0, 5, 10, 15, 20, 25, 30 and 35 into (worse y) = 0.0917 + 1.02x, which then resulted in 0.0917, 5.1917, 10.2917, 15.3917, 20.4917, 25.5917, 30.6917 and 35.7917.
Let us put them into the equation, $r = \sqrt{\sum(\text{worse } \hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2} = \sqrt{1092.7139/1077.0800} = 1.0072$.
So here, r > 1.
Actually, the process is very simple, you have any dataset, then you regress them and get a regressed equation no matter linear or nonlinear. Then you slightly move this regressed line within 95% confidence band, and put x into this slightly-moved-line equation to calculate ŷ, then put ŷ into $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$. In general, you get r > 1.
Please note this process only related to our assumption that we stop fitting a curve before reaching the global minimum.
By the way, someone, who knows well how to write mathematical formulae, please edit the equations in this text. Many thanks! —Preceding unsigned comment added by Hongguanglishibahao (talk • contribs) 04:18, August 26, 2007 (UTC)
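The arithmetic in the worked example above can be checked with a short Python sketch. It evaluates the ratio formula from the comment for the least-squares line and for the shifted line 0.0917 + 1.02x, and also computes the ordinary Pearson correlation between y and the shifted predictions. The helper names are hypothetical. The identity between the ratio formula and Pearson's r holds at the least-squares fit. For any other line the ratio is a different statistic and is not bounded by 1, whereas the Pearson correlation always stays within [-1, 1].

```python
import math

x = [0, 5, 10, 15, 20, 25, 30, 35]
y = [0.1, 5.5, 9.7, 14.3, 21.7, 24.3, 32, 34.4]


def mean(values):
    return sum(values) / len(values)


def least_squares(xs, ys):
    """Return (intercept, slope) of the ordinary least-squares line."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((a - mx) * (b - my) for a, b in zip(xs, ys))
             / sum((a - mx) ** 2 for a in xs))
    return my - slope * mx, slope


def ratio_statistic(ys, y_hat):
    """sqrt(sum((yhat_i - ybar)^2) / sum((y_i - ybar)^2)), as in the comment."""
    my = mean(ys)
    return math.sqrt(sum((f - my) ** 2 for f in y_hat)
                     / sum((b - my) ** 2 for b in ys))


def pearson(a, b):
    ma, mb = mean(a), mean(b)
    s_ab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    return s_ab / math.sqrt(sum((u - ma) ** 2 for u in a)
                            * sum((v - mb) ** 2 for v in b))


intercept, slope = least_squares(x, y)
fitted = [intercept + slope * a for a in x]   # least-squares predictions
shifted = [0.0917 + 1.02 * a for a in x]      # the "stopped" line from the comment

print(ratio_statistic(y, fitted))    # agrees with pearson(x, y)
print(ratio_statistic(y, shifted))   # exceeds 1 for the shifted line
print(pearson(y, shifted))           # stays within [-1, 1]
```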
In fact, this study was inspired by seeing the figures in papers and books that how the sum of squared residuals reduces during the fitting process. The idea then was how the correlation coefficient would change during this fitting process. And the only equation suited for calculation of this process seemed to be $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$. However, when writing a program for fitting with this equation, the resulting correlation coefficient can be larger than unity.
One may say why we have not met the case that the correlation coefficient is larger than 1 during a fitting process because a fitting using the least squares method often stops before reaching the global minimum sum of squared residuals, i.e. the fitted line is near to the “real” regressed line as our example. This is because the correlation coefficient is calculated based on different graphic presentations, i.e. to use regressed and measured yi as axes to construct a plot.
In more plain words of this process, we have a dataset, x and y, we get the regressed linear equation, y = ax + b (the same for the more complicated dataset with multi-linear as well as nonlinear regressed equations). Then you move this line to ŷ = (a + delta)x + b or ŷ = ax + (b + delta), put x into this moved line, you get ŷi for each x. At this moment, the only equation that can calculate the correlation coefficient is $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$, by which we can get r > 1.
I argue that until now we have no restriction on how to use this equation, and we have no other equations to calculate the correlation coefficient during the fitting, but if we calculate the correlation coefficient in such a way, that is, a line near to the global minimum within 95% confidence interval, then this only equation, which can be used in such circumstance, will result in r>1.
Best wishes
Guang Wu Hongguanglishibahao 05:45, 26 August 2007 (UTC)
Moreover, I had no business to do with Cauchy-Schwarz inequality 10 years ago, what I was interested in was how the correlation coefficient behaved during the fitting and how to calculate the value of correlation coefficient when the fitting did not reach the global minimum.
Guang Wu Hongguanglishibahao 06:38, 26 August 2007 (UTC)
Another important issue is that if we have, for example, x = 1, 2, 3, 4, 5 and y = 10, 20, 30, 40, 50. And then we have ŷ = 10.1, 20.1, 30.1, 40.1, 50.1.
We absolutely need to write a program according to $r = \sqrt{\sum(\hat{y}_i - \bar{y})^2 / \sum(y_i - \bar{y})^2}$ to calculate r, you cannot either use x vs ŷ or y vs ŷ in any regression program for calculation of r. Even in the latter case y vs ŷ, the program will result in r <= 1, because the regression program uses y vs ŷ as x vs y to make calculation. This is particularly important when the dataset is multiple linear or nonlinear regressions. Hongguanglishibahao 11:04, 26 August 2007 (UTC)
Thanks, EdJohnston
However, please do not pay so much worship to mathematicians and their mathematics journals. The correlation coefficient is an extremely old topic, no modern mathematicians and statisticians have more working knowledge on this topic than you and me. What they are interested in is the current problems in mathematics and statistics rather than such an old topic. Besides, the pioneer generation of statisticians, who worked on this topic more than 100 years ago, had no imagination of how the correlation coefficient would be when the regressed line is approaching the global minimum.
With the advance of technology and the open Wikipedia, I think that everyone should control his own fate rather than dictated by others. We should respect the fact, even disputed. The fact is simple that r can be larger than unity, even much more in fitting process. By the way, Eur J Drug Metab Pharmacokinet is also a statistical journal in the field I once worked.
Besides, I did not want to spend time to dispute this topic with anyone that was why I did not provide you the data earlier. However, the spirit of Wikipedia is the presentation of referenced facts, not cycled referenced. Still, I stressed several times, deliberately. Please do not make Wikipedia an old style media, which is giving its dominated role step-by-step.
Regards Hongguanglishibahao 18:17, 26 August 2007 (UTC)
I did not use any specific software to numerical approximations. I told you already that I was interested in how the correlation coefficient behaved during the fitting and how to calculate the value of correlation coefficient when the fitting did not reach the global minimum.
Can you tell me how to calculate the correlation coefficient during the fitting? We should face the new problems raised in current situation rather than avoid them with various excuses. Hongguanglishibahao 18:55, 26 August 2007 (UTC)
Dear All
Since I posted the discussion here, I could not enter Wiki until today. Perhaps, the connection will be blocked soon. I only would like to say that I will discuss this issue when I have full access again.
Guang Wu —Preceding unsigned comment added by Hongguanglishibahao (talk • contribs) 14:09, 1 April 2008 (UTC)
By the way, I am not happy with the comments made by EdJohnston “The type of argument you are making is one that you should get accepted in a math or statistics journal before bringing it to Wikipedia”.
I wonder who give you the power to decide which sentence can be added? Do you want to make Wiki another UN, where you can veto everything you do not like but you do not even pay the UN fee. —Preceding unsigned comment added by Hongguanglishibahao (talk • contribs) 14:32, 1 April 2008 (UTC)
The pseudo algorithm describe on the page will not work when pop_sd_x or pop_sd_y is 0, you'll get a nan value. —Preceding unsigned comment added by 195.167.237.98 (talk) 10:13, August 24, 2007 (UTC)
Isn't correlation also a term in projective geometry? When PG(V) and PG(W) are projective spaces, a correlation is a bijection $\varphi$ from the subspaces of V to the subspaces of W, such that $U \subseteq U'$ is equivalent with $\varphi(U') \subseteq \varphi(U)$.
11:12, 12 October 2008 (UTC)
The first image (shown to the right) is really quite difficult to understand. Could we find an easier image to introduce correlation with? --Apoc2400 (talk) 20:45, 6 December 2007 (UTC)
This article needs a link to explain the concept of a linear relationship. Ianhowlett 18:47, 6 July 2007 (UTC)
I have implemented this formula in R. Could it be that the correct assertion is that "The covariance matrix of T will be 1/(m-1) times the identity matrix" rather than "The covariance matrix of T will be the identity matrix"? Is there a textbook or a paper that can be cited regarding this formula, preferably containing a derivation? I would like to check whether it is Wikipedia or my implementation that is incorrect.
Thanks, Leo. —Preceding unsigned comment added by 201.9.189.100 (talk) 22:38, 12 June 2008 (UTC)