# AutoTag
AutoTag is a program that generates tags for documents automatically. The main process:

- Word segmentation (N-gram + dictionary lookup)
- Generate a bag of words for each document.
- Calculate the term frequency and inverse document frequency of each word.
- Pick the top k words with the highest tf-idf values as tags.
## N-gram
N-gram generates the sequence of n consecutive words at every position of a sentence.[1]
```python
sentences = 'Lucy like to listen to music. Luna like music too.'
items = ngram(sentences, 2)
print(items)
# output:
# [
#     'Lucy like',
#     'like to',
#     'to listen',
#     'listen to',
#     'to music',
#     'Luna like',
#     'like music',
#     'music too',
# ]
```
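The `ngram` helper used above isn't defined in this README; here is a minimal sketch of it, assuming sentences are split on end-of-sentence punctuation and words on whitespace, so that n-grams never cross a sentence boundary (which matches the output above):

```python
import re

def ngram(text, n):
    # Split on sentence-ending punctuation so that n-grams
    # never cross a sentence boundary.
    items = []
    for sentence in re.split(r'[.!?]+', text):
        words = sentence.split()
        # Collect the n consecutive words starting at every position.
        for i in range(len(words) - n + 1):
            items.append(' '.join(words[i:i + n]))
    return items
```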
## Bag of words
The bag-of-words model is a simplifying representation used in NLP and IR.[1] Building one takes two steps:

- Run N-gram (here with n = 1, i.e. single words).
- Count the number of times each word appears.
```python
# items = ngram(sentences, 1)
bow = bagOfWords(items)
print(bow)
# output:
# {
#     'Lucy': 1,
#     'like': 2,
#     'to': 2,
#     'listen': 1,
#     'music': 2,
#     'Luna': 1,
#     'too': 1,
# }
```
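`bagOfWords` is likewise assumed; a minimal sketch (Python's `collections.Counter` does the same job):

```python
def bagOfWords(items):
    # Map each distinct item to the number of times it appears.
    bow = {}
    for item in items:
        bow[item] = bow.get(item, 0) + 1
    return bow
```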
## TF-IDF
TF-IDF is intended to reflect how important a word is to a document in a collection or corpus.[2]
TF is term frequency. IDF is inverse document frequency.
$$ \text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) $$
There are various ways to calculate TF and IDF. Here are some of them:
( \( f_{t, d} \) is the number of times that term t occurs in document d. )
| weighting scheme | tf weight |
| --- | --- |
| binary | \( \{0, 1\} \) |
| raw count | \( f_{t,d} \) |
| term frequency | \( \frac{ f_{t,d} }{ \sum_{t'} f_{t',d} } \) |
| log normalization | \( \log(1 + f_{t,d}) \) |
For IDF:
( N is the total number of documents in the corpus; \(n_t\) is the number of documents where term t appears. )
| weighting scheme | idf weight |
| --- | --- |
| unary | 1 |
| inverse document frequency | \( \log\left( \frac{N}{n_t} \right) \) |
| inverse document frequency smooth | \( \log\left( \frac{N}{1 + n_t} \right) \) |
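To tie the steps together, here is a minimal sketch of the tagging step itself. The helper names `tfidf` and `topTags` are hypothetical, and the sketch uses the "term frequency" and plain "inverse document frequency" weightings from the tables above:

```python
import math

def tfidf(term, doc_bow, all_bows):
    # tf: relative frequency of the term within the document.
    tf = doc_bow.get(term, 0) / sum(doc_bow.values())
    # idf: log of (total documents / documents containing the term).
    n_t = sum(1 for bow in all_bows if term in bow)
    idf = math.log(len(all_bows) / n_t)
    return tf * idf

def topTags(doc_bow, all_bows, k=5):
    # Score every word in the document and keep the k highest.
    scores = {term: tfidf(term, doc_bow, all_bows) for term in doc_bow}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Running the whole pipeline is then `ngram` → `bagOfWords` for each document → `topTags` on the document to be tagged.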