AutoTag
2019-03-12
Tech
AutoTag is a program that generates tags for documents automatically. The main steps are:
Word segmentation (N-gram + dictionary lookup).
Generate bag-of-words for each document.
Calculate term frequency and inverse document frequency.
Pick the x words with the highest tf-idf values as tags.
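The last two steps can be sketched as follows. This is a minimal illustration, not AutoTag's actual code: the function name `top_tags`, the sample documents, and the weighting (tf as relative frequency, idf as `log(N / df)`) are assumptions, since the post does not say which tf-idf variant it uses.

```python
import math
from collections import Counter

def top_tags(docs, x):
    """docs is a list of term lists; return the x highest tf-idf terms per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    result = []
    for doc in docs:
        counts = Counter(doc)
        # Assumed weighting: tf = count / document length, idf = log(N / df).
        scores = {
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        }
        # Highest score first; ties are broken alphabetically.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:x])
    return result

# Hypothetical, already segmented documents.
docs = [
    ['lucy', 'like', 'music', 'music'],
    ['luna', 'like', 'tea', 'tea'],
]
tags = top_tags(docs, 1)
print(tags)
```

With this weighting, a word that occurs in every document (here `like`) gets an idf of zero and is never picked as a tag.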
N-gram
An N-gram is a sequence of n consecutive words. The N-gram step generates one such sequence starting at every position of a sentence.[1]
sentences = 'Lucy like to listen to music. Luna like music too.'
items = ngram(sentences, 2)
print(items)
# output:
[
'Lucy like',
'like to',
'to listen',
'listen to',
'to music',
'Luna like',
'like music',
'music too',
]
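The `ngram` function called above is not defined in the post. A minimal sketch that reproduces the listed output, assuming sentences end at `.`, `!` or `?` and words are separated by whitespace:

```python
import re

def ngram(text, n):
    """Return every run of n consecutive words, sentence by sentence."""
    items = []
    # Split on sentence-ending punctuation so no n-gram spans two sentences.
    for sentence in re.split(r"[.!?]+", text):
        words = sentence.split()
        # One n-gram starts at each position that still has n words left.
        for i in range(len(words) - n + 1):
            items.append(" ".join(words[i:i + n]))
    return items

sentences = 'Lucy like to listen to music. Luna like music too.'
items = ngram(sentences, 2)
print(items)
```

Sentences shorter than n words contribute nothing, because the inner range is empty.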
Bag of words
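The bag-of-words model is a simplifying representation used in natural language processing (NLP) and information retrieval (IR): a text is represented as the multiset of its words, disregarding word order.[1] Building it takes two steps: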
Split each document into N-grams.
Count the number of times each term appears.
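The counting step can be sketched with `collections.Counter` from the standard library. The function name `bag_of_words` and the sample term list are illustrative assumptions, not part of AutoTag:

```python
from collections import Counter

def bag_of_words(terms):
    """Map each term to the number of times it occurs; word order is discarded."""
    return Counter(terms)

# Hypothetical document, already split into single-word terms.
doc = ['Lucy', 'like', 'to', 'listen', 'to', 'music']
bag = bag_of_words(doc)
print(bag['to'])     # 2
print(bag['music'])  # 1
print(bag['Luna'])   # 0, Counter returns 0 for terms it has not seen
```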