

Donny Donny
March 12, 2019
March 13, 2019


AutoTag is a program that generates tags for documents automatically. The main process consists of four steps:

  1. Word segmentation (N-gram + dictionary lookup).

  2. Generate bag-of-words for each document.

  3. Calculate term frequency and inverse document frequency.

  4. Pick the top x words with the highest tf-idf values as tags.


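An N-gram is a contiguous sequence of n words. N-gram extraction generates one such sequence starting at every position of a sentence, without crossing sentence boundaries.[1]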

sentences = 'Lucy like to listen to music. Luna like music too.'
items = ngram(sentences, 2)
# output:
    'Lucy like',
    'like to',
    'to listen',
    'listen to',
    'to music',
    'Luna like',
    'like music',
    'music too',
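
The snippet above does not show how `ngram` is implemented. A minimal sketch that reproduces the listed output could look like the following. It assumes that sentences end at `.`, `!` or `?` and that words are separated by whitespace; the dictionary lookup step is left out, and the function body is an illustration rather than AutoTag's actual code.

```python
import re

def ngram(text, n):
    """Return every run of n consecutive words, never crossing a sentence boundary."""
    items = []
    # Assumption: a sentence ends at '.', '!' or '?'.
    for sentence in re.split(r"[.!?]+", text):
        words = sentence.split()
        # Slide a window of n words over the sentence.
        for i in range(len(words) - n + 1):
            items.append(" ".join(words[i:i + n]))
    return items

sentences = 'Lucy like to listen to music. Luna like music too.'
bigrams = ngram(sentences, 2)
unigrams = ngram(sentences, 1)
```

Splitting into sentences first is what keeps 'music Luna' out of the result.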

Bag of words

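The bag-of-words model is a simplifying representation used in natural language processing (NLP) and information retrieval (IR): a text is represented as the multiset of its words, disregarding grammar and word order.[1]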

  1. Split the text into N-grams.

  2. Count how many times each item appears.

# items = ngram(sentences, 1)
bow = bagOfWords(items)
# output:
    'Lucy': 1,
    'like': 2,
    'to': 2,
    'listen': 1,
    'music': 2,
    'Luna': 1,
    'too': 1,
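
The counting step can be sketched with `collections.Counter` from the Python standard library. The name `bag_of_words` is a hypothetical stand-in for the `bagOfWords` call above, and the word list is written out by hand so that the example runs on its own.

```python
from collections import Counter

def bag_of_words(items):
    """Map each distinct item to the number of times it occurs."""
    return dict(Counter(items))

# The unigrams of the two example sentences, written out by hand.
items = ['Lucy', 'like', 'to', 'listen', 'to', 'music',
         'Luna', 'like', 'music', 'too']
bow = bag_of_words(items)
```

Word order is lost in `bow`; only the counts remain, which is exactly the simplification the model makes.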


TF-IDF is intended to reflect how important a word is to a document in a collection or corpus.[2]

TF is term frequency. IDF is inverse document frequency.

$$ \text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D) $$

There are various ways to calculate TF and IDF. Here are some of them, starting with TF:

( \( f_{t, d} \) is the number of times that term t occurs in document d. )

| weighting scheme | tf weight |
| --- | --- |
| binary | \( \{ 0, 1 \} \) |
| raw count | \( f_{t, d} \) |
| term frequency | \( \frac{ f_{t,d} }{ \sum_{t'} f_{t', d} } \) |
| log normalization | \( \log(1 + f_{t, d}) \) |
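
As a rough illustration, the four tf weights can be computed from a single document's bag of words. The function name `tf_variants`, the dictionary keys, and the use of the natural logarithm are assumptions made for this sketch.

```python
import math

def tf_variants(bow, term):
    """Return the four tf weights of `term` for one document's bag of words."""
    f = bow.get(term, 0)        # f_{t,d}: raw count of the term in the document
    total = sum(bow.values())   # sum of f_{t',d} over all terms t'
    return {
        "binary": 1 if f > 0 else 0,
        "raw_count": f,
        "term_frequency": f / total if total else 0.0,
        "log_normalization": math.log(1 + f),  # natural log is an assumption
    }

bow = {'Lucy': 1, 'like': 2, 'to': 2, 'listen': 1,
       'music': 2, 'Luna': 1, 'too': 1}
weights = tf_variants(bow, 'music')
```

Here 'music' occurs 2 times among 10 words, so its term frequency is 0.2.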

For IDF:

( \(n_t\) is the number of documents where term t appears, and \(N\) is the total number of documents in the corpus. )

| weighting scheme | idf weight |
| --- | --- |
| unary | 1 |
| inverse document frequency | \( \log\left( \frac{N}{n_t} \right) \) |
| inverse document frequency smooth | \( \log\left( \frac{N}{1 + n_t} \right) \) |
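
To show how the pieces fit together, here is a minimal sketch that computes idf over a small corpus and then picks the top x terms of one document by tf-idf. The three-document corpus, the names `idf` and `top_tags`, the natural logarithm, and the choice of the "term frequency" tf weight are all assumptions made for illustration, not AutoTag's actual implementation.

```python
import math
from collections import Counter

def idf(term, docs, smooth=False):
    """Inverse document frequency of `term` over a list of bag-of-words dicts."""
    n_docs = len(docs)                               # N
    n_t = sum(1 for bow in docs if term in bow)      # documents containing the term
    if smooth:
        # log(N / (1 + n_t)); negative when the term occurs in every document.
        return math.log(n_docs / (1 + n_t))
    # Assumption: a term that appears in no document gets idf 0.
    return math.log(n_docs / n_t) if n_t else 0.0

def top_tags(doc, docs, x):
    """Return the x terms of `doc` with the highest tf-idf score."""
    total = sum(doc.values())
    scores = {t: (f / total) * idf(t, docs) for t, f in doc.items()}
    return sorted(scores, key=scores.get, reverse=True)[:x]

# Hypothetical toy corpus, already lower-cased and split into words.
corpus = [
    Counter("lucy like to listen to music".split()),
    Counter("luna like music too".split()),
    Counter("lucy like to read".split()),
]
tags = top_tags(corpus[0], corpus, 2)
```

In this toy corpus 'like' appears in every document, so its idf is 0 and it can never be chosen as a tag, while 'listen' appears in only one document and ranks first.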


  1. Bag-of-words Model

  2. TF-IDF