Calculating article uniqueness

Hi,

I know that various software compares articles and comes up with a
uniqueness percentage.

Does anyone know what the calculation is based on ?


I have a script which goes through an article
and compares each of its four-word groups
against a second article.

This is how it works:

It breaks the article into a list of words.
Then it selects the first four words as four-word-group-1.
It then checks whether that group of four words appears in the comparison article.
If it finds a match, it records a hit.

Then four-word-group-2 is created by using the article's
2nd word + 3rd word + 4th word + 5th word.

It does the comparison and records a hit or not.

It then continues through the whole article until
all the four-word-groups have been created and tested.

The total number of four-word-groups will always
be the number of words in the article minus 3.

So in a 1000 word article, you get 997 four-word-groups.

If the comparison finds 120 hits,
then the percentage would be 120/997 * 100 ≈ 12 %
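
For reference, the steps above can be sketched in Python roughly like this. It is only a sketch: splitting on whitespace and the names `four_word_groups` and `similarity_percent` are my own assumptions for illustration, and I am not claiming this is what any commercial tool does.

```python
def four_word_groups(text, n=4):
    """Return every run of n consecutive words (n=4 as described above)."""
    words = text.split()  # assumption: words are separated by whitespace
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def similarity_percent(article, comparison, n=4):
    """Percentage of the article's word groups that also occur in comparison."""
    groups = four_word_groups(article, n)
    if not groups:
        return 0.0  # article shorter than n words: nothing to compare
    comparison_groups = set(four_word_groups(comparison, n))
    # At most one hit per group position in the first article.
    hits = sum(1 for group in groups if group in comparison_groups)
    return 100.0 * hits / len(groups)
```

A 1000 word article gives 997 groups here, and 120 hits would come out at about 12 % similarity. The `n` argument is only there so that other group sizes can be tried.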

Would this be a suitable measure of uniqueness ?

( Actually as we have just measured similarity,
maybe we would say the uniqueness percentage is 88 % in this case.)

Is this the same calculation that the software
packages are using ?


Thanks for any insights

EDIT

I have just thought ... I am only counting a maximum of one hit for each phrase
(or "four-word-group"). Maybe if the phrase occurs more than once in the second
article, I should record each one as a separate hit ?

Of course if the phrase occurs more than once in the first article
as well, then that would lead to a lot of double counting.
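
One possible way round the double counting, sketched in Python: count how often each phrase occurs in each article and credit only the smaller of the two counts. This "clipped count" idea and the name `clipped_hits` are my own assumptions, not something I know the existing tools to do.

```python
from collections import Counter


def clipped_hits(article, comparison, n=4):
    """Count repeated phrases, but never more often than both articles contain them."""
    def group_counts(text):
        words = text.split()  # assumption: whitespace-separated words
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    first, second = group_counts(article), group_counts(comparison)
    # A phrase seen twice in the first article and once in the second scores 1.
    return sum(min(count, second[group]) for group, count in first.items())
```

Dividing the result by the number of four-word-groups in the first article gives a percentage in the same way as before.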

hemmm...

Also is FOUR words the best number ?
Why not 3 ??

I chose that number because I've seen it used by SpinnerChief.




  • WebPen
    I highly doubt anyone here could tell you the black magic the software companies used to decide on using 4 words instead of 3, 5, etc.

    Also, what do you mean by "suitable measure of uniqueness"? Do you mean that Google will see it as unique?

    Because that's another question nobody here can really answer.

    Plus, it doesn't really matter unless you're re-using tons of content on the same site, which isn't a great idea anyway.