Introduction to TF-IDF

Reading time ~1 minute

Preliminary

Since one month ago, I’ve been feeling interested with text mining and machine learning. In the learning process, I found some alien words for me. One of that is TF-IDF.


What is TF-IDF

Based on tfidf.com, TF-IDF stands for Term Frequency - Inverse Document Frequency. And then, show up new alien word again, that is TF-IDF weight.

TF-IDF weight is a statistical measure used to evaluate how important a word to a document in a collection of documents. - tfidf.com


How to Determine TF-IDF Weight

  • First, we have to compute Term Frequency (TF). TF measures how frequently a word appeared in a document. The formula is number of times a word show up in the document divided by total number of words in that document.
  • Second, we have to compute Inverse Document Frequency (IDF). IDF measures how important a word. The formula is log(total number of documents divided by number of documents where the word appeared).
  • Third, finally we have to compute TF-IDF weight. The formula is TF multiplied by IDF


Example

There is a document with 50 words in it. The word book appears 5 times. The TF for book is 5 / 50 = 0.1. I have 1000000 documents and the word book appears in 1000 document. so the IDF is log(1000000 / 1000) = 3. And then the TF-IDF weight is 0.1 * 3 = 0.3.


References

Hello Kotlin, Nice to Meet You!

My first experience using Kotlin to make a simple android app Continue reading

A Good Git Branching Model #CodeSharing

Published on February 05, 2017

What I Learn From Xamarin Party Event

Published on January 18, 2017