# AutoTag
AutoTag is a program that generates tags for documents automatically. The main process:

- Word segmentation (N-gram + dictionary lookup)
- Generate a bag of words for each document.
- Calculate the term frequency and inverse document frequency of each word.
- Pick the top k words with the highest tf-idf values as tags.
## N-gram
N-gram generates the sequence of n consecutive words at every position of a sentence.[1]
```python
sentences = 'Lucy like to listen to music. Luna like music too.'
items = ngram(sentences, 2)
print(items)
# output:
# [
#     'Lucy like',
#     'like to',
#     'to listen',
#     'listen to',
#     'to music',
#     'Luna like',
#     'like music',
#     'music too',
# ]
```
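The `ngram` helper used above isn't defined in this README; here is a minimal sketch of it, assuming sentences are split on end-of-sentence punctuation and words on whitespace, so that n-grams never cross a sentence boundary (which matches the output above):

```python
import re

def ngram(text, n):
    # Split on sentence-ending punctuation so that n-grams
    # never cross a sentence boundary.
    items = []
    for sentence in re.split(r'[.!?]+', text):
        words = sentence.split()
        # Collect the n consecutive words starting at every position.
        for i in range(len(words) - n + 1):
            items.append(' '.join(words[i:i + n]))
    return items
```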
## Bag of words
The bag-of-words model is a simplifying representation used in NLP and IR.[1] Building one takes two steps:

- Run N-gram (here with n = 1, i.e. single words).
- Count the number of times each word appears.
```python
# items = ngram(sentences, 1)
bow = bagOfWords(items)
print(bow)
# output:
# {
#     'Lucy': 1,
#     'like': 2,
#     'to': 2,
#     'listen': 1,
#     'music': 2,
#     'Luna': 1,
#     'too': 1,
# }
```
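`bagOfWords` is likewise assumed; a minimal sketch (Python's `collections.Counter` does the same job):

```python
def bagOfWords(items):
    # Map each distinct item to the number of times it appears.
    bow = {}
    for item in items:
        bow[item] = bow.get(item, 0) + 1
    return bow
```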
## TF-IDF
TF-IDF is intended to reflect how important a word is to a document in a collection or corpus.[2]
TF is term frequency. IDF is inverse document frequency.
$$ \text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D) $$
There are various ways to calculate TF and IDF. Here are some of them:
( \( f_{t, d} \) is the number of times that term t occurs in document d. )
| weighting scheme | tf weight |
| --- | --- |
| binary | \( \{0, 1\} \) |
| raw count | \( f_{t,d} \) |
| term frequency | \( \frac{ f_{t,d} }{ \sum_{t'} f_{t',d} } \) |
| log normalization | \( \log(1 + f_{t,d}) \) |
For IDF:
( N is the total number of documents in the corpus; \(n_t\) is the number of documents where term t appears. )
| weighting scheme | idf weight |
| --- | --- |
| unary | 1 |
| inverse document frequency | \( \log\left( \frac{N}{n_t} \right) \) |
| inverse document frequency smooth | \( \log\left( \frac{N}{1 + n_t} \right) \) |
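To tie the steps together, here is a minimal sketch of the tagging step itself. The helper names `tfidf` and `topTags` are hypothetical, and the sketch uses the "term frequency" and plain "inverse document frequency" weightings from the tables above:

```python
import math

def tfidf(term, doc_bow, all_bows):
    # tf: relative frequency of the term within the document.
    tf = doc_bow.get(term, 0) / sum(doc_bow.values())
    # idf: log of (total documents / documents containing the term).
    n_t = sum(1 for bow in all_bows if term in bow)
    idf = math.log(len(all_bows) / n_t)
    return tf * idf

def topTags(doc_bow, all_bows, k=5):
    # Score every word in the document and keep the k highest.
    scores = {term: tfidf(term, doc_bow, all_bows) for term in doc_bow}
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

Running the whole pipeline is then `ngram` → `bagOfWords` for each document → `topTags` on the document to be tagged.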