Calculating article uniqueness
I know that various software tools compare articles and come up with a
uniqueness percentage.
Does anyone know what the calculation is based on?
I have a script which goes through an article
and compares four-word groups from the first article
with the next article.
This is how it works:
It breaks the article into a list of words.
Then it takes the first four words as four-word-group-1.
It then checks whether that exact sequence of four words appears in the comparison article.
If it finds a match, it records a hit.
Then four-word-group-2 is created from the article's
2nd word + 3rd word + 4th word + 5th word.
It does the same comparison and records a hit or not.
It then continues through the whole article until
all the four-word groups have been created and tested.
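
For anyone who wants to try it, the steps above can be sketched roughly like this in Python. The function names (word_groups, count_hits) and the plain whitespace split are my own assumptions for illustration. A real script would probably also lowercase the text and strip punctuation first.

```python
def word_groups(text, n=4):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()  # assumption: words are separated by whitespace
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def count_hits(article, comparison, n=4):
    """Count the groups from article that appear at least once in comparison."""
    comparison_groups = set(word_groups(comparison, n))
    return sum(1 for group in word_groups(article, n) if group in comparison_groups)


article = "the quick brown fox jumps over the lazy dog"
comparison = "a quick brown fox jumps high"
hits = count_hits(article, comparison)
total = len(word_groups(article))
print(hits, "hits out of", total, "groups")
```

Putting the comparison article's groups in a set means each lookup is fast, so the whole pass stays roughly linear in the length of the two articles.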
The total number of four-word groups will always
be the number of words in the article less 3.
So in a 1000-word article, you get 997 four-word groups.
If the comparison finds 120 hits,
then the percentage would be 120/997 * 100 = 12% (rounded).
Would this be a suitable measure of uniqueness?
(Actually, as we have just measured similarity,
maybe we would say the uniqueness percentage is 88% in this case.)
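
To make the arithmetic concrete, here is a small sketch of the two percentages using the numbers from my example. The function names are made up for illustration, and treating an article with no groups at all as 0% similar is just an assumption on my part.

```python
def similarity_percent(hits, total_groups):
    """Share of the article's word groups that were found in the other article."""
    if total_groups == 0:
        return 0.0  # assumption: an article too short to form a group has no overlap
    return hits / total_groups * 100


def uniqueness_percent(hits, total_groups):
    """Uniqueness is simply whatever is left after similarity."""
    return 100 - similarity_percent(hits, total_groups)


total_groups = 1000 - 3  # a 1000-word article gives 997 four-word groups
print(round(similarity_percent(120, total_groups), 1))
print(round(uniqueness_percent(120, total_groups), 1))
```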
Is this the same calculation that the software tools
are using?
Thanks for any insights
EDIT
I have just thought ... I am only counting a maximum of one hit for each phrase
or "four-word-group". Maybe if the phrase occurs more than once in the second
article, I should record each one as a separate hit?
Of course, if the phrase also occurs more than once in the first article
as well, then that would lead to a lot of double counting.
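
One possible way around the double counting (just an idea, not something I know any of the tools to do) is to count how often each phrase occurs in both articles and only credit the smaller of the two counts. Python's collections.Counter does this with the & operator. The function names below are hypothetical.

```python
from collections import Counter


def word_groups(text, n=4):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def clipped_hits(article, comparison, n=4):
    """Count hits so that a repeated phrase is credited at most as many
    times as it occurs in BOTH articles (the minimum of the two counts)."""
    article_counts = Counter(word_groups(article, n))
    comparison_counts = Counter(word_groups(comparison, n))
    overlap = article_counts & comparison_counts  # keeps the minimum count per phrase
    return sum(overlap.values())


# "a b c d" occurs twice in the first text but only once in the second,
# so it is credited once.
print(clipped_hits("a b c d a b c d", "a b c d x"))
```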
Hmmm...
Also, is FOUR words the best number?
Why not 3?
I chose that number because I've seen it used by SpinnerChief.
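
One way to get a feel for the group size is to run the same comparison with different sizes and see how the percentage moves. This is only a rough sketch with hypothetical names and two made-up sentences. Shorter groups match more easily, so the similarity figure goes up as the size goes down.

```python
def word_groups(text, n):
    """Return every run of n consecutive words in text, in order."""
    words = text.split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def similarity_percent(article, comparison, n):
    """Percentage of the article's n-word groups found in the comparison text."""
    groups = word_groups(article, n)
    if not groups:
        return 0.0  # assumption: too few words to form a group means no overlap
    comparison_groups = set(word_groups(comparison, n))
    hits = sum(1 for group in groups if group in comparison_groups)
    return hits / len(groups) * 100


article = "the cat sat on the mat today"
comparison = "the cat sat under the mat today"
for n in (2, 3, 4):
    print(n, round(similarity_percent(article, comparison, n), 1))
```

With these two sentences a single changed word is enough to break every four-word group, while several two-word and three-word groups still match.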