I don't think this one-to-group comparison really measures such a notion. It seems to me that the described technique is essentially an averaging process among the engine group, so it can't be a surprise that the results end up tending toward a whitewash. I also don't think that this "comparing to a group" (even when done in a more careful manner than simply taking maximal intra-group overlap) is much related philosophically to plagiarism.

Trotsky@rybkaforum.net wrote:
I apologise for making this report at this late stage, but I was excluded from the panel deliberations and therefore only have access to the material now. [...]
Watkins data was used by him to make 1 to 1 comparisons. However it is also possible to make 1 to group comparisons for all the programs in the group. Just as a high score in the 1 to 1 comparisons is suggested by Watkins to indicate plagiarism between the two programs concerned, so a high score in the 1 to group comparison would indicate plagiarism within the group as a whole.
I'd say the new percentages correspond to something a bit different. Speaking (loosely) in terms of graph theory, this new measurement determines the minimal distance from a given node to any other node, while the original method determined the distance from a given node to one specific other node. I don't think there is any obvious interpretation of the numbers obtained, and I would certainly expect the resulting percentages to be larger when comparing to a group rather than to a singleton. [On a pedantic note, the raw percentages don't matter anyway, only the probabilities that are derived from them.] If one were (say) to apply the Abstraction-Filtration-Comparison Test to a 1-to-group situation, I'd guess a significant amount of material would be evicted at the Filtration stage.

Trotsky@rybkaforum.net wrote:
Percentage plagiarism (corresponds to the 1 to 1 Watkins percentage headline figure of 74%, although I might get corrected on this)
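To make the distinction concrete, here is a minimal Python sketch (invented engines and feature sets, not the actual EVAL_COMP data, and with each feature treated simply as present/absent, which is cruder than the real comparison) contrasting a 1-to-1 overlap score with a 1-to-group score of the "best match against any other member" kind:

# Toy feature sets for three hypothetical engines (names and features are
# made up purely for illustration).
FEATURES = {
    "EngineA": {"pst", "mobility", "king_safety", "passed_pawns"},
    "EngineB": {"pst", "mobility", "king_safety", "pawn_shelter"},
    "EngineC": {"pst", "bishop_pair", "rook_open_file", "passed_pawns"},
}

def overlap_1to1(a, b):
    # Fraction of a's features that also appear in b.
    fa, fb = FEATURES[a], FEATURES[b]
    return len(fa & fb) / len(fa)

def overlap_1togroup(a):
    # Fraction of a's features that appear in at least one *other* engine.
    fa = FEATURES[a]
    rest = set().union(*(FEATURES[x] for x in FEATURES if x != a))
    return len(fa & rest) / len(fa)

for a in FEATURES:
    pairs = {b: overlap_1to1(a, b) for b in FEATURES if b != a}
    print(a, "1-to-1:", pairs, "1-to-group:", overlap_1togroup(a))

Since the group score lets each feature be matched by whichever engine happens to agree on it, it can never be smaller than the best pairwise score, which is one reason to expect the group percentages to come out larger than the 1-to-1 ones.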
Another comment is that the resulting numbers need to be adjusted and/or re-interpreted for the size/span of the group. Simply making the group really big would tend to make everything end up at 100% plagiarism [then again, maybe that's the point?!]. Putting this as a mathematical example, suppose I double the size of your data set by adding 8 new engines: the first copies Crafty for the first 6 features, then RESP for the next 6 features, and so on; the second copies RESP for the first 6 features, and so on. Then every engine would end up 100% plagiarised. The fact that this 1-to-group comparison collapses when such "averaged" engines are added indicates to me that it is not too useful a statistic for distinguishing engines and/or engine pairs. Contrariwise, the EVAL_COMP methodology is somewhat robust against the addition of such averaged engines (at least until you throw in so many "(re-)averaged" engines that they dominate the analysis).
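Here is a toy version of that construction (randomly generated engines and feature choices, not the real data), just to illustrate the collapse: once every block of an original engine's feature choices has been copied into some "averaged" engine, the 1-to-group score of every original engine reaches 100%.

import random
random.seed(0)

# 8 synthetic engines, each making one of a few "implementation choices"
# for each of 48 features (all numbers invented for illustration).
N_ENGINES, N_FEATURES, N_VARIANTS = 8, 48, 5
engines = {f"E{i}": [random.randrange(N_VARIANTS) for _ in range(N_FEATURES)]
           for i in range(N_ENGINES)}

def one_to_group(name, pool):
    # Fraction of name's choices matched by at least one other engine in pool.
    mine = pool[name]
    return sum(
        any(other[j] == mine[j] for k, other in pool.items() if k != name)
        for j in range(N_FEATURES)
    ) / N_FEATURES

print("original group: ", {k: round(one_to_group(k, engines), 2) for k in engines})

# Add 8 "averaged" engines: the i-th copies engine i for features 0-5,
# engine i+1 for features 6-11, and so on, wrapping around the group.
augmented = dict(engines)
names = list(engines)
for i in range(N_ENGINES):
    copied = []
    for b in range(N_FEATURES // 6):
        src = engines[names[(i + b) % N_ENGINES]]
        copied.extend(src[6 * b : 6 * b + 6])
    augmented[f"AVG{i}"] = copied

print("augmented group:", {k: round(one_to_group(k, augmented), 2) for k in names})

Note that the pairwise 1-to-1 scores between the original engines are completely unaffected by the augmentation, which is the sense in which a peer-to-peer measure is more robust here.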
At a copyright level, individual features would likely not be "protected content", but a specific selection of them could be considered such. For such purposes I think it is clear that a 1-to-group analysis (using the "maximal overlap" metric) is not as useful as a 1-to-1 analysis, particularly when the group is large. Analogously, individual elements of a book plot are not (typically) subject to copyright, but the specific combining of said elements can often be so, to the extent that the combination was "creative" (a subjective term of course) -- I would guess one could re-phrase the AFC test to analyse book plots if desired, first identifying layers of abstraction for the plot, etc.
Even with these issues about said method, I might point out that the "average plagiarising/plagiarised" level obtained here is 65%, with (other than Fruit and Rybka) only RESP and Faile reaching that level [another point: you probably need to adjust the 1-to-group comparison for engines like Faile that have few features]. Removing Fruit and Rybka [applying an outlier test], the other 6 engines come out to a 61% mean with a 4.5% standard deviation. So it would only be natural to investigate Rybka and Fruit more specifically, given that this 1-to-group analysis puts them at 3-4 sigma upon comparison to the other 6 engines (admittedly a small sample); the arithmetic is checked in the snippet after the quoted figures below. Something that measures peer-to-peer overlap rather than peer-to-group would then be a logical addition to the methodology.

Trotsky@rybkaforum.net wrote:
Crafty 0.58
RESP 0.65
Ryb1 0.76
Phal 0.54
Fail 0.67
Fr21 0.81
Pepi 0.60
EX5b 0.64
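As mentioned above, the 65% / 61% / 4.5% figures can be checked directly from the quoted numbers; here is a short Python snippet doing so (the population standard deviation is used, as that appears to be what reproduces the 4.5%; the sample standard deviation comes out slightly larger):

from statistics import mean, pstdev

scores = {"Crafty": 0.58, "RESP": 0.65, "Ryb1": 0.76, "Phal": 0.54,
          "Fail": 0.67, "Fr21": 0.81, "Pepi": 0.60, "EX5b": 0.64}

print(f"mean of all 8: {mean(scores.values()):.3f}")

# Drop the two suspected outliers and look at the remaining six.
others = [v for k, v in scores.items() if k not in ("Ryb1", "Fr21")]
mu, sigma = mean(others), pstdev(others)
print(f"other 6: mean {mu:.3f}, std dev {sigma:.3f}")

# How far above the rest do Ryb1 and Fr21 sit?
for k in ("Ryb1", "Fr21"):
    print(f"{k}: {(scores[k] - mu) / sigma:.1f} sigma above the mean of the others")

Whether one uses the population or the sample standard deviation here, Ryb1 and Fr21 come out several standard deviations above the other six.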
It's still not clear to me what exactly "inter-group plagiarism" really means [as any significant intra-group commonalities should be filtered out], while the above statistic notes that Rybka and Fruit do in fact stand out above the other 6 programs.

Trotsky@rybkaforum.net wrote:
[...] there appears to be a massive inter-group plagiarism in the evaluation function of the all selected programs. No one program stands out above any other program.