# AutoTag

Donny
March 12, 2019
March 13, 2019

Tags in blue are handcrafted; tags in green are generated by AutoTag.

## AutoTag

AutoTag is a program that generates tags for documents automatically. The main process is:

1. Tokenization (N-gram + dictionary lookup).

2. Generate bag-of-words for each document.

3. Calculate term frequency and inverse document frequency.

4. Pick the top x words with the highest tf-idf values as tags.

## N-gram

An N-gram is a sequence of n consecutive words; the model generates one at every position of a sentence.[1]

```python
sentences = 'Lucy like to listen to music. Luna like music too.'
items = ngram(sentences, 2)
print(items)
# output:
# [
#     'Lucy like',
#     'like to',
#     'to listen',
#     'listen to',
#     'to music',
#     'Luna like',
#     'like music',
#     'music too',
# ]
```
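A minimal sketch of such an `ngram` helper (an assumption, since the post does not show its implementation): it splits the text into sentences on periods, so no N-gram crosses a sentence boundary, and then slides a window of n words over each sentence.

```python
def ngram(sentences, n):
    """Return every run of n consecutive words, one sentence at a time."""
    items = []
    for sentence in sentences.split('.'):
        words = sentence.split()
        # slide a window of n words over the sentence
        for i in range(len(words) - n + 1):
            items.append(' '.join(words[i:i + n]))
    return items

sentences = 'Lucy like to listen to music. Luna like music too.'
print(ngram(sentences, 2))
```

With `n = 1` the same helper yields the unigrams used by the bag-of-words step below.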


## Bag of words

The bag-of-words model is a simplifying representation in NLP and IR.[1]

1. N-gram

2. Count the number of times each word appears.

```python
# items = ngram(sentences, 1)
bow = bagOfWords(items)
print(bow)
# output:
# {
#     'Lucy': 1,
#     'like': 2,
#     'to': 2,
#     'listen': 1,
#     'music': 2,
#     'Luna': 1,
#     'too': 1,
# }
```
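`bagOfWords` itself can be little more than a counter; a sketch, assuming the 1-grams have already been produced by the tokenization step:

```python
from collections import Counter

def bagOfWords(items):
    """Map each item (here: a 1-gram) to its number of occurrences."""
    return dict(Counter(items))

# 1-grams from the example sentences, punctuation already stripped
items = ['Lucy', 'like', 'to', 'listen', 'to', 'music',
         'Luna', 'like', 'music', 'too']
print(bagOfWords(items))
# output: {'Lucy': 1, 'like': 2, 'to': 2, 'listen': 1, 'music': 2, 'Luna': 1, 'too': 1}
```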


## TF-IDF

TF-IDF is intended to reflect how important a word is to a document in a collection or corpus.[2]

TF is term frequency. IDF is inverse document frequency.

$$\text{tf-idf}(t, d, D) = \text{tf}(t, d) \cdot \text{idf}(t, D)$$

There are various ways to calculate TF and IDF. Here are some of them:

( $$f_{t, d}$$ is the number of times that term t occurs in document d. )

| weighting scheme | tf weight |
| --- | --- |
| binary | $$\{ 0, 1 \}$$ |
| raw count | $$f_{t, d}$$ |
| term frequency | $$\frac{ f_{t,d} }{ \sum_{t'}{ f_{t', d} } }$$ |
| log normalization | $$\log(1 + f_{t, d})$$ |

For IDF:

( $$n_t$$ is the number of documents in which term t appears, and $$N$$ is the total number of documents in the corpus. )

| weighting scheme | idf weight |
| --- | --- |
| unary | 1 |
| inverse document frequency | $$\log( \frac{N}{n_t} )$$ |
| inverse document frequency smooth | $$\log( \frac{N}{1 + n_t} )$$ |
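Putting the pieces together, a sketch of the scoring and tag-picking steps, assuming the "term frequency" and plain "inverse document frequency" schemes from the tables above (the documents and the top-2 cutoff are illustrative, not from the post):

```python
import math

def tf(term, doc):
    # term frequency: raw count normalized by document length
    return doc.count(term) / len(doc)

def idf(term, docs):
    # inverse document frequency: log(N / n_t)
    n_t = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_t)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

docs = [
    ['Lucy', 'like', 'to', 'listen', 'to', 'music'],
    ['Luna', 'like', 'music', 'too'],
    ['Lucy', 'like', 'Luna'],
]

# pick the top 2 words of the first document as its tags
doc = docs[0]
scores = {t: tfidf(t, doc, docs) for t in set(doc)}
tags = sorted(scores, key=scores.get, reverse=True)[:2]
print(tags)
```

Note that 'like' appears in every document, so its idf (and tf-idf) is 0: words shared by the whole corpus never become tags, which is exactly the behavior the scheme is designed for.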