SANTTU
01 Sep 2018

Jaccard Similarity Coefficient for finding similarity amongst the documents

A qualitative way of analysing similarity is to visually check for the words and phrases that two documents have in common. Since most documents are written in natural language, they all contain the most commonly used words, such as prepositions, articles, and pronouns. The presence of these commonly used words (also called stopwords [1, 2]) can therefore make an expert's qualitative judgement significantly harder. Nevertheless, using techniques from the natural language processing domain, we address these challenges and quantify the similarity between two documents using the Jaccard Similarity Coefficient [2].

First, we tokenise the textual data in all the documents. This process removes punctuation marks, including brackets, hyphens, etc. The text is also converted to lower case for consistency among the documents. Other inconsistencies, such as differing number or time formats, abbreviations, and acronyms, were transformed into a standard form. Second, the most frequently used words (stopwords), which account for 30-40% of the total word count, are removed. In this paper, we have used the “Long Stopword List” compiled by Ranks NL [1]. The resulting word list contains meaningful words that are significant on their own.
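The two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `STOPWORDS` set is a tiny hypothetical sample (not the Ranks NL list), the function names are made up for this example, and the normalisation of number formats, time formats, and abbreviations is not covered.

```python
import re

# Hypothetical, tiny stopword set for illustration only; the post uses the
# much longer Ranks NL list, which is not reproduced here.
STOPWORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to", "with"}


def tokenise(text):
    """Lower-case the text and keep only runs of letters.

    Punctuation (brackets, hyphens, commas, ...) and digits are dropped,
    so a hyphenated word is split into its parts.
    """
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens, stopwords=STOPWORDS):
    """Return the tokens that are not in the stopword set, in order."""
    return [t for t in tokens if t not in stopwords]


tokens = tokenise("The vaccine (MMR) is given to children, and boosters follow.")
print(tokens)
print(remove_stopwords(tokens))
```

With a real stopword list, the only change would be how `STOPWORDS` is loaded; the filtering step stays the same.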

For quantitative analysis, identifying words with a common meaning is beneficial; e.g., the words ‘Vaccine’ and ‘Vaccination’ stem from the same word, ‘Vaccine’. Therefore, a stemming process is used to reduce such words to a common form. We have used the widely used Martin Porter’s stemming algorithm [5] for our analysis. Once we have obtained the stemmed word list for each document, duplicate entries are removed so that each document is represented by a list of unique significant words.
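To make the stem-then-deduplicate step concrete, here is a minimal sketch. Note the assumption: `crude_stem` is a deliberately simple suffix stripper written for this example and is not the Porter algorithm, which applies a much richer set of rules. In practice a Porter implementation, such as NLTK's `PorterStemmer` class, would take its place.

```python
# Crude suffix stripper used as a stand-in for illustration; it is NOT the
# Porter algorithm. Suffixes are tried in the order listed.
SUFFIXES = ("ations", "ation", "ing", "es", "e", "s")


def crude_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def unique_stems(tokens):
    """Stem every token and collapse duplicates by collecting into a set."""
    return {crude_stem(t) for t in tokens}


words = ["vaccine", "vaccination", "vaccines", "risk", "risks"]
print(sorted(unique_stems(words)))
```

Using a set for the result removes duplicate entries automatically, which is exactly what the later set-based similarity measure needs.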

The Jaccard Similarity Coefficient between two documents (A and B) is then calculated as the size of the intersection of the sets of unique significant words in A and B divided by the size of the union of the two sets. Mathematically, jaccard(A, B) = \frac{|A \cap B|}{|A \cup B|}. This measure gives the fraction (or, multiplied by 100, the percentage) of all unique significant terms across the two documents that are common to both.
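The definition maps directly onto Python's built-in set operations. The sketch below is illustrative: the two word sets are made-up examples, and returning 0.0 when both sets are empty is an assumption of this example, since the formula is undefined in that case.

```python
def jaccard(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: treat two empty documents as having similarity 0.0,
        # because the ratio is undefined when the union is empty.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical stemmed word sets, for illustration only.
doc_a = {"vaccin", "risk", "child", "dose"}
doc_b = {"vaccin", "benefit", "child", "immun"}

# Two shared stems out of six distinct stems overall, i.e. 2/6.
print(jaccard(doc_a, doc_b))
```

The coefficient is symmetric, equals 1 for identical non-empty sets, and equals 0 when the sets share no words.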

For the case study on four medical documents, the results in Table-1 were produced using the Python Natural Language Toolkit (NLTK) [6]. The results show a similarity of about 35% (0.349 in Table-1) between Vaccine Benefit and Vaccine Risk. The most common words between them are shown in Table-2.

Table-1 Jaccard Similarity Coefficient of each use-case report with respect to the other use-case reports.

Documents   Jaccard Similarity Coefficient
            doc-1   doc-2   doc-3   doc-4
doc-1       1       0.239   0.266   0.241
doc-2       0.239   1       0.296   0.349
doc-3       0.266   0.296   1       0.253
doc-4       0.241   0.349   0.253   1
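A symmetric matrix of this shape can be obtained by applying the coefficient to every pair of documents. The sketch below assumes each document has already been reduced to a set of unique stems; the four sets are hypothetical placeholders, not the medical documents of the case study, so the printed values will not match the table.

```python
def jaccard(a, b):
    """Jaccard coefficient of two sets; 0.0 if both are empty (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(word_sets):
    """Return an n x n list of lists with the pairwise Jaccard coefficients."""
    n = len(word_sets)
    return [[jaccard(word_sets[i], word_sets[j]) for j in range(n)]
            for i in range(n)]


# Hypothetical placeholder documents, already tokenised, filtered and stemmed.
docs = [
    {"vaccin", "risk", "dose"},
    {"vaccin", "benefit", "child"},
    {"vaccin", "risk", "child", "immun"},
    {"trial", "dose", "benefit"},
]

matrix = similarity_matrix(docs)
for row in matrix:
    # Print each row with three decimals, as in the table.
    print(" ".join(f"{value:.3f}" for value in row))
```

Every diagonal entry is 1 because each non-empty document is identical to itself, and the matrix is symmetric because intersection and union do not depend on the order of the two sets.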

Table-2 Some of the common words present in two documents.



[1] Fox, C., 1989, September. A stop list for general text. In ACM SIGIR Forum (Vol. 24, No. 1-2, pp. 19-21). ACM.

[2] Tirunagari, S., Hänninen, M., Ståhlberg, K. and Kujala, P., 2012. Mining causal relations and concepts in maritime accidents investigation reports. International Journal of Innovative Research and Development, 1(10), pp.548-566.

[3] Tirunagari, S., 2015. Data Mining of Causal Relations from Text: Analysing Maritime Accident Investigation Reports. arXiv preprint arXiv:1507.02447.

[4] Niwattanakul, S., Singthongchai, J., Naenudorn, E. and Wanapu, S., 2013. Using of Jaccard coefficient for keywords similarity. In Proceedings of the International MultiConference of Engineers and Computer Scientists (Vol. 1, No. 6).

[5] Willett, P., 2006. The Porter stemming algorithm: then and now. Program, 40(3), pp.219-223.

[6] Loper, E. and Bird, S., 2002. NLTK: The natural language toolkit. In Proceedings of the ACL-02 Workshop on Effective tools and methodologies for teaching natural language processing and computational linguistics, Volume 1 (pp. 63-70). Association for Computational Linguistics.

