Title: | Modern Text Mining Framework for R |
---|---|
Description: | Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines. |
Authors: | Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph] (Coherence measures for topic models), Qing Wang [aut, cph] (Author of the WarpLDA C++ code) |
Maintainer: | Dmitriy Selivanov <[email protected]> |
License: | GPL (>= 2) | file LICENSE |
Version: | 0.6.4 |
Built: | 2025-01-13 03:33:47 UTC |
Source: | https://github.com/dselivanov/text2vec |
Converts 'dgCMatrix' (or coercible to 'dgCMatrix') to 'lda_c' format
as.lda_c(X)
X |
Document-Term matrix |
Creates a BNS (bi-normal separation) model. BNS is defined as Q(true positive rate) - Q(false positive rate), where Q is the quantile function of the standard normal distribution.
BNS
R6Class object.
Bi-Normal Separation
bns_stat: data.table with computed BNS statistic. Useful for feature selection.
For usage details see Methods, Arguments and Examples sections.
bns = BNS$new(treshold = 0.0005)
bns$fit_transform(x, y)
bns$transform(x)
$new(treshold = 0.0005)
Creates a BNS model.
$fit_transform(x, y)
Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
Transforms new data x using the BNS statistics learned from the training data.
bns |
A BNS object |
x |
An input document-term matrix, preferably in dgCMatrix format. |
y |
Binary target variable coercible to logical. |
treshold |
Clipping threshold to avoid infinities in the quantile function. |
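The definition above can be sketched in base R. This only illustrates the formula and is not the package's internal code: the helper name bns_score and the toy data are hypothetical, and clipping both rates to the interval [treshold, 1 - treshold] is an assumption about how the clipping is applied.

```r
# Hypothetical helper: BNS score per term for a document-term matrix.
# Rates are clipped so that qnorm() never sees 0 or 1 (which give -Inf / Inf).
bns_score = function(x, y, treshold = 0.0005) {
  y = as.logical(y)
  present = x > 0
  tpr = colSums(present[y, , drop = FALSE]) / sum(y)    # true positive rate
  fpr = colSums(present[!y, , drop = FALSE]) / sum(!y)  # false positive rate
  clip = function(p) pmin(pmax(p, treshold), 1 - treshold)
  qnorm(clip(tpr)) - qnorm(clip(fpr))
}

# Toy data: 4 documents, 3 terms
x = matrix(c(1, 0, 1,
             1, 0, 0,
             0, 1, 1,
             0, 1, 0),
           nrow = 4, byrow = TRUE,
           dimnames = list(NULL, c("good", "bad", "film")))
y = c(TRUE, TRUE, FALSE, FALSE)

scores = bns_score(x, y)
print(scores)
```

Terms that occur equally often in both classes score zero, while terms concentrated in one class receive large positive or negative scores, which is what makes the statistic usable for feature selection.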
## Not run: 
data("movie_review")
N = 1000
it = itoken(head(movie_review$review, N), preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(vocab))
model_bns = BNS$new()
dtm_bns = model_bns$fit_transform(dtm, head(movie_review$sentiment, N))
## End(Not run)
This function checks how well the GloVe word embeddings do on the analogy task. For full examples see GloVe.
check_analogy_accuracy(questions_list, m_word_vectors)
questions_list |
|
m_word_vectors |
word vectors |
prepare_analogy_questions, GloVe
Given a topic model with topics represented as ordered term lists, the coherence may be used to assess the quality of individual topics.
This function implements several of the many possible metrics for this kind of assessment.
Coherence calculation is sensitive to the content of the reference tcm that is used for evaluation and that may be created with different parameter settings. Please refer to the details section (or reference section) for information on typical combinations of metric and type of tcm. For more general information on measuring coherence, a starting point is given in the reference section.
coherence(x, tcm, metrics = c("mean_logratio", "mean_pmi", "mean_npmi", "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2"), smooth = 1e-12, n_doc_tcm = -1)
x |
A |
tcm |
The term co-occurrence matrix, e.g, a |
metrics |
Character vector specifying the metrics to be calculated. Currently the following metrics are implemented:
|
smooth |
Numeric smoothing constant to avoid logarithm of zero. By default, set to |
n_doc_tcm |
The |
The currently implemented coherence metrics are described below, including a description of the content type of the tcm that showed good performance in combination with a specific metric.
For details on how to create a tcm, see the example section.
For details on the performance of the metrics, see the resources in the reference section that served to define the standard settings for individual metrics.
Note that, depending on the use case, settings other than the standard settings may still be reasonable for creating the tcm.
Note that for all currently implemented metrics the tcm is reduced to the top word space on the basis of the terms in x.
When the goal is to find the optimum number of topics among several models using different metrics, consider calculating the mean score over all topics and normalizing these mean coherence scores across metrics so that they can be compared directly.
Each metric usually favors a different optimum number of topics. From initial experience it may be assumed that logratio, pmi and npmi usually favor smaller numbers, whereas the other metrics tend to propose higher numbers.
Implemented metrics:
"mean_logratio"
The logarithmic ratio is calculated as log(smooth + tcm[x,y]) - log(tcm[y,y]), where x and y are term index pairs from a "preceding" term index combination. Given the indices c(1,2,3), combinations are list(c(2,1), c(3,1), c(3,2)).
The tcm should represent the boolean term co-occurrence (internally the actual counts are used) in the original documents and, therefore, is an intrinsic metric in the standard use case.
This metric is similar to the UMass metric, however, with a smaller smoothing constant by default and using the mean for aggregation instead of the sum.
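As a worked example of the formula above, the following base R sketch computes the log ratio for the three "preceding" pairs of a single topic. The toy tcm and its counts are made up for illustration, and the code mirrors the formula as written rather than the package's internal implementation.

```r
# Toy symmetric term co-occurrence matrix (document counts); the diagonal
# holds the number of documents that contain each term.
tcm = matrix(c(10, 4, 2,
                4, 8, 1,
                2, 1, 5),
             nrow = 3, byrow = TRUE)
smooth = 1e-12

# "Preceding" index pairs for the ordered top terms c(1, 2, 3):
# each term x is paired with every term y ranked before it.
pairs = list(c(2, 1), c(3, 1), c(3, 2))

logratio = sapply(pairs, function(p) {
  x = p[1]
  y = p[2]
  log(smooth + tcm[x, y]) - log(tcm[y, y])
})

# Aggregate with the mean (the UMass metric would use the sum).
mean_logratio = mean(logratio)
print(mean_logratio)
```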
"mean_pmi"
The pointwise mutual information is calculated as log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm), where x and y are term index pairs from an arbitrary term index combination that subsets the lower or upper triangle of tcm, e.g. "preceding".
The tcm should represent term co-occurrences within a boolean sliding window of size 10 (internally probabilities are used) in an external reference corpus and, therefore, is an extrinsic metric in the standard use case.
This metric is similar to the UCI metric, however, with a smaller smoothing constant by default and using the mean for aggregation instead of the sum.
"mean_npmi"
Similar (in terms of all parameter settings, etc.) to the "mean_pmi" metric but using the normalized pmi instead, which is calculated as (log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm)) / -log2((tcm[x,y]/n_doc_tcm) + smooth).
This metric may perform better than the simpler pmi metric.
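The pmi and npmi formulas above can be checked by hand on a small matrix. The sketch below uses a made-up tcm and an assumed n_doc_tcm of 20; it follows the formulas as written and is not taken from the package source.

```r
# Toy co-occurrence counts; diagonal = marginal counts of each term.
tcm = matrix(c(10, 4, 2,
                4, 8, 1,
                2, 1, 5),
             nrow = 3, byrow = TRUE)
n_doc_tcm = 20   # assumed number of (virtual) documents
smooth = 1e-12

# Pointwise mutual information for one term index pair
pmi = function(x, y) {
  log2(tcm[x, y] / n_doc_tcm + smooth) -
    log2(tcm[x, x] / n_doc_tcm) -
    log2(tcm[y, y] / n_doc_tcm)
}

# Normalized pmi: divide by -log2 of the (smoothed) joint probability
npmi = function(x, y) {
  pmi(x, y) / -log2(tcm[x, y] / n_doc_tcm + smooth)
}

pairs = list(c(2, 1), c(3, 1), c(3, 2))
pmi_values = sapply(pairs, function(p) pmi(p[1], p[2]))
npmi_values = sapply(pairs, function(p) npmi(p[1], p[2]))

mean_pmi = mean(pmi_values)
mean_npmi = mean(npmi_values)
print(c(mean_pmi = mean_pmi, mean_npmi = mean_npmi))
```

The normalization keeps each npmi value between -1 and 1, which makes scores easier to compare across term pairs with very different frequencies.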
"mean_difference"
The difference is calculated as tcm[x,y]/tcm[x,x] - (tcm[y,y]/n_tcm_windows), where x and y are term index pairs from a "preceding" term index combination. Given the indices c(1,2,3), combinations are list(c(1,2), c(1,3), c(2,3)).
The tcm should represent the boolean term co-occurrence (internally probabilities are used) in the original documents and, therefore, is an intrinsic metric in the standard use case.
"mean_npmi_cosim"
First, the npmi of an individual top word with each of the top words is calculated as in "mean_npmi". This results in a vector of npmi values for each top word. On this basis, the cosine similarity between each pair of vectors is calculated.
The tcm should represent term co-occurrences within a boolean sliding window of size 5 (internally probabilities are used) in an external reference corpus and, therefore, is an extrinsic metric in the standard use case.
"mean_npmi_cosim2"
First, a vector of npmi values for each top word is calculated as in "mean_npmi_cosim". On this basis, the cosine similarity between each vector and the sum of all vectors is calculated (instead of the similarity between each pair).
The tcm should represent term co-occurrences within a boolean sliding window of size 110 (internally probabilities are used) in an external reference corpus and, therefore, is an extrinsic metric in the standard use case.
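The two cosine-based metrics above can be sketched in base R as follows. The toy tcm and n_doc_tcm are made up, and including each word's npmi with itself (the diagonal) in its vector is an assumption of this sketch; the package may treat the diagonal differently.

```r
tcm = matrix(c(10, 4, 2,
                4, 8, 1,
                2, 1, 5),
             nrow = 3, byrow = TRUE)
n_doc_tcm = 20   # assumed number of sliding windows / virtual documents
smooth = 1e-12

p = tcm / n_doc_tcm
joint = p + smooth   # smoothed joint probabilities
marg = diag(p)       # marginal probabilities

# Row i holds the npmi of top word i with every top word.
npmi_mat = (log2(joint) - log2(outer(marg, marg))) / -log2(joint)

# "mean_npmi_cosim": cosine similarity between each pair of npmi vectors
norms = sqrt(rowSums(npmi_mat^2))
cosim = (npmi_mat %*% t(npmi_mat)) / outer(norms, norms)
mean_npmi_cosim = mean(cosim[lower.tri(cosim)])

# "mean_npmi_cosim2": cosine similarity between each vector and the sum of all vectors
total = colSums(npmi_mat)
cosim2 = as.vector(npmi_mat %*% total) / (norms * sqrt(sum(total^2)))
mean_npmi_cosim2 = mean(cosim2)

print(c(mean_npmi_cosim = mean_npmi_cosim, mean_npmi_cosim2 = mean_npmi_cosim2))
```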
A numeric matrix with the coherence scores of the specified metrics per topic.
The paper cited below is the main theoretical basis for this code.
Currently only a selection of the metrics described in this paper is included in this R implementation.
Authors: Roeder, Michael; Both, Andreas; Hinneburg, Alexander (2015)
Title: Exploring the Space of Topic Coherence Measures.
In: Xueqi Cheng, Hang Li, Evgeniy Gabrilovich and Jie Tang (Eds.):
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM '15.
Shanghai, China, 02.02.2015 - 06.02.2015.
New York, USA: ACM Press, p. 399-408.
https://dl.acm.org/citation.cfm?id=2685324
This paper has been implemented by the authors listed above as the Java program "palmetto".
See https://github.com/dice-group/Palmetto or http://aksw.org/Projects/Palmetto.html.
## Not run: 
library(data.table)
library(text2vec)
library(Matrix)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
n_topics = 10
lda_model = LDA$new(n_topics)
fitted = lda_model$fit_transform(dtm, n_iter = 20)
tw = lda_model$get_top_words(n = 10, lambda = 1)
# for demonstration purposes create intrinsic TCM from original documents
# scores might not make sense for metrics that are designed for extrinsic TCM
tcm = crossprod(sign(dtm))
# check coherence
logger = lgr::get_logger('text2vec')
logger$set_threshold('debug')
res = coherence(tw, tcm, n_doc_tcm = N)
res
# example how to create TCM for extrinsic measures from an external corpus
external_reference_corpus = tolower(movie_review$review[501:1000])
tokens_ext = word_tokenizer(external_reference_corpus)
iterator_ext = itoken(tokens_ext, progressbar = FALSE)
v_ext = create_vocabulary(iterator_ext)
# for reasons of efficiency vocabulary may be reduced to the terms matched in the original corpus
v_ext = v_ext[v_ext$term %in% v$term, ]
# external vocabulary may be pruned depending on the use case
v_ext = prune_vocabulary(v_ext, term_count_min = 5, doc_proportion_max = 0.2)
vectorizer_ext = vocab_vectorizer(v_ext)
# for demonstration purposes a boolean co-occurrence within a sliding window is used
# a size of 10 represents sentence co-occurrence, a size of 110 would, e.g., be paragraph co-occurrence
window_size = 5
tcm_ext = create_tcm(iterator_ext, vectorizer_ext
                     ,skip_grams_window = window_size
                     ,weights = rep(1, window_size)
                     ,binary_cooccurence = TRUE
                     )
# add marginal probabilities in diagonal (by default only upper triangle of tcm is created)
diag(tcm_ext) = attributes(tcm_ext)$word_count
# get number of sliding windows that serve as virtual documents, i.e. n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))
## End(Not run)
Creates a Collocations model which can be used for phrase extraction.
Collocations
R6Class object.
collocation_stat: data.table with collocation (phrase) statistics. Useful for filtering non-relevant phrases.
For usage details see Methods, Arguments and Examples sections.
model = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0, sep = "_")
model$partial_fit(it, ...)
model$fit(it, n_iter = 1, ...)
model$transform(it)
model$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
model$collocation_stat
$new(vocabulary = NULL, collocation_count_min = 50, sep = "_")
Constructor for the Collocations model. For a description of the arguments see the Arguments section.
$fit(it, n_iter = 1, ...)
Fits the Collocations model to the input iterator it. Iterates over it n_iter times, so the model can learn multi-word phrases hierarchically. Invisibly returns collocation_stat.
$partial_fit(it, ...)
Iterates once over the data and learns collocations. Invisibly returns collocation_stat. Workhorse for $fit().
$transform(it)
Transforms the input iterator using the learned collocations model. The result of the transformation is a new itoken or itoken_parallel iterator which will produce tokens with phrases collapsed into a single token.
$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
Filters out non-relevant phrases with low scores. The user can also do this directly by modifying the collocation_stat object.
model |
A Collocation model object |
n_iter |
number of iterations over the data |
pmi_min, gensim_min, lfmd_min, llr_min |
minimal scores of the corresponding statistics required to collapse tokens into a collocation:
pointwise mutual information
"gensim" scores - https://radimrehurek.com/gensim/models/phrases.html adapted from word2vec paper
log-frequency biased mutual dependency
Dunning's logarithm of the ratio between the likelihoods of the hypotheses of dependence and independence
See https://aclanthology.org/I05-1050/ for details. Also see the data in model$collocation_stat for better intuition. |
it |
An input itoken or itoken_parallel iterator |
vocabulary |
text2vec_vocabulary - if provided, the model will look only for collocations that consist of terms from the vocabulary |
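To give some intuition for the pmi score used as a collocation threshold, the sketch below counts unigrams and adjacent bigrams in a toy corpus and applies the textbook definition of pointwise mutual information. The toy tokens and the helper pmi are hypothetical, and the package's exact counting and normalization may differ from this simplified version.

```r
tokens = list(c("new", "york", "is", "big"),
              c("i", "love", "new", "york"),
              c("new", "york", "is", "new"))

# Unigram counts and counts of adjacent token pairs
unigrams = table(unlist(tokens))
bigrams = table(unlist(lapply(tokens, function(tk) {
  paste(head(tk, -1), tail(tk, -1), sep = "_")
})))
n_words = sum(unigrams)

# Textbook PMI: log2( p(a, b) / (p(a) * p(b)) )
pmi = function(a, b) {
  p_ab = bigrams[[paste(a, b, sep = "_")]] / n_words
  p_a = unigrams[[a]] / n_words
  p_b = unigrams[[b]] / n_words
  log2(p_ab / (p_a * p_b))
}

pmi_new_york = pmi("new", "york")
pmi_is_new = pmi("is", "new")
print(c(new_york = pmi_new_york, is_new = pmi_is_new))
```

A pair such as "new york", whose parts almost always appear together, scores higher than an incidental pair such as "is new", so raising pmi_min keeps only the more strongly associated phrases.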
library(text2vec)
data("movie_review")
preprocessor = function(x) {
  gsub("[^[:alnum:]\\s]", replacement = " ", tolower(x))
}
sample_ind = 1:100
tokens = word_tokenizer(preprocessor(movie_review$review[sample_ind]))
it = itoken(tokens, ids = movie_review$id[sample_ind])
system.time(v <- create_vocabulary(it))
v = prune_vocabulary(v, term_count_min = 5)
model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it, n_iter = 2)
model$collocation_stat
it2 = model$transform(it)
v2 = create_vocabulary(it2)
v2 = prune_vocabulary(v2, term_count_min = 5)
# check what phrases model has learned
setdiff(v2$term, v$term)
# [1] "main_character" "jeroen_krabb" "boogey_man" "in_order"
# [5] "couldn_t" "much_more" "my_favorite" "worst_film"
# [9] "have_seen" "characters_are" "i_mean" "better_than"
# [13] "don_t_care" "more_than" "look_at" "they_re"
# [17] "each_other" "must_be" "sexual_scenes" "have_been"
# [21] "there_are_some" "you_re" "would_have" "i_loved"
# [25] "special_effects" "hit_man" "those_who" "people_who"
# [29] "i_am" "there_are" "could_have_been" "we_re"
# [33] "so_bad" "should_be" "at_least" "can_t"
# [37] "i_thought" "isn_t" "i_ve" "if_you"
# [41] "didn_t" "doesn_t" "i_m" "don_t"
# and same way we can create document-term matrix which contains
# words and phrases!
dtm = create_dtm(it2, vocab_vectorizer(v2))
# check that dtm contains phrases
which(colnames(dtm) == "jeroen_krabb")
Combines multiple vocabularies into one
combine_vocabularies(..., combine_stopwords = function(x) unique(unlist(lapply(x, attr, which = "stopwords"), use.names = FALSE)), combine_ngram = function(x) attr(x[[1]], "ngram"), combine_sep_ngram = function(x) attr(x[[1]], "sep_ngram"))
... |
vocabulary objects created with create_vocabulary. |
combine_stopwords |
function to combine stopwords from input vocabularies. By default we take a union of all stopwords. |
combine_ngram |
function to combine lower and upper boundary for n-grams from input vocabularies. Usually these values should be the same, so we take this parameter from first vocabulary. |
combine_sep_ngram |
function to combine n-gram separators from input vocabularies. Usually these values should be the same, so we take this parameter from first vocabulary. |
text2vec_vocabulary, see details in create_vocabulary.
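The default combiner functions shown in the usage can be tried in isolation. The sketch below applies them to two plain data.frames that merely carry the same attribute names as a vocabulary; these stand-in objects are hypothetical and are not real text2vec_vocabulary objects.

```r
# Stand-ins for vocabularies: data.frames with the relevant attributes attached.
v1 = data.frame(term = c("a", "b"), stringsAsFactors = FALSE)
attr(v1, "stopwords") = c("the", "of")
attr(v1, "ngram") = c(ngram_min = 1L, ngram_max = 2L)
attr(v1, "sep_ngram") = "_"

v2 = data.frame(term = c("b", "c"), stringsAsFactors = FALSE)
attr(v2, "stopwords") = c("of", "and")
attr(v2, "ngram") = c(ngram_min = 1L, ngram_max = 2L)
attr(v2, "sep_ngram") = "_"

x = list(v1, v2)

# Defaults copied from the usage section
combine_stopwords = function(x) unique(unlist(lapply(x, attr, which = "stopwords"), use.names = FALSE))
combine_ngram = function(x) attr(x[[1]], "ngram")
combine_sep_ngram = function(x) attr(x[[1]], "sep_ngram")

stopwords = combine_stopwords(x)   # union of all stopwords, in order of first appearance
ngram = combine_ngram(x)           # taken from the first vocabulary
sep_ngram = combine_sep_ngram(x)   # taken from the first vocabulary
print(stopwords)
```

Passing a different function, for example one that intersects the stopword sets, changes how the combined vocabulary's metadata is derived without touching the term statistics.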
This is a high-level function for creating a document-term matrix.
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix", "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)
## S3 method for class 'itoken'
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix", "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)
## S3 method for class 'itoken_parallel'
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix", "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)
it |
itoken iterator or |
vectorizer |
|
type |
|
... |
placeholder for additional arguments (not used at the moment).
over |
If a parallel backend is registered and the first argument is a list of itoken iterators, the function will construct the DTM in multiple threads.
The user must split the data and provide a list of itoken iterators. Each element of it will be handled in a separate thread, and the results are combined at the end of processing.
A document-term matrix
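The idea of building the matrix per chunk and combining the pieces at the end can be illustrated with a dense base R sketch. The helper build_dtm is hypothetical, a sequential lapply stands in for the worker threads, and the real function returns a sparse matrix rather than the dense integer matrix used here.

```r
# Hypothetical helper: count vocabulary terms per document (rows = documents).
build_dtm = function(tokens, vocab) {
  counts = vapply(tokens,
                  function(tk) as.integer(table(factor(tk, levels = vocab))),
                  integer(length(vocab)))
  dtm = t(counts)
  colnames(dtm) = vocab
  dtm
}

tokens = list(doc1 = c("a", "b", "a"), doc2 = c("b", "c"))
vocab = sort(unique(unlist(tokens)))

# Whole corpus at once
dtm = build_dtm(tokens, vocab)

# Same corpus split into two chunks, processed independently, then row-bound
chunks = split(tokens, c(1, 2))
parts = lapply(chunks, build_dtm, vocab = vocab)
dtm_combined = do.call(rbind, unname(parts))

print(dtm_combined)
```

Because every chunk is vectorized against the same vocabulary, the per-chunk matrices share their columns and can simply be stacked by rows.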
## Not run: 
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer)
v = create_vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(pruned_vocab)
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
tfidf = TfIdf$new()
dtm_tfidf = tfidf$fit_transform(dtm)
## Example of parallel mode
it = itoken_parallel(movie_review$review[1:N], tolower, word_tokenizer, ids = movie_review$id[1:N])
vectorizer = hash_vectorizer()
dtm = create_dtm(it, vectorizer, type = 'TsparseMatrix')
## End(Not run)
This function constructs a term-co-occurrence matrix (TCM). The TCM is usually used with the GloVe word embedding model.
create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)
## S3 method for class 'itoken'
create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)
## S3 method for class 'itoken_parallel'
create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)
it |
|
vectorizer |
|
skip_grams_window |
|
skip_grams_window_context |
one of |
weights |
weights for context/distant words during co-occurrence statistics calculation.
By default we are setting |
binary_cooccurence |
|
... |
placeholder for additional arguments (not used at the moment).
|
If a parallel backend is registered, the function will construct the TCM in multiple threads.
The user must split the data and provide a list of itoken iterators. Each element of it will be handled in a separate thread, and the results are combined at the end of processing.
TsparseMatrix TCM matrix
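How the window and the default weights 1/seq_len(skip_grams_window) interact can be seen in a small dense sketch. The helper build_tcm is hypothetical; it assumes a symmetric context stored once in the upper triangle and is not the package's implementation, which works on sparse matrices.

```r
# Hypothetical helper: weighted co-occurrence counts within a sliding window.
# A pair of tokens that are d positions apart contributes weights[d].
build_tcm = function(tokens, vocab, window = 5L, weights = 1 / seq_len(window)) {
  tcm = matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
  for (tk in tokens) {
    n = length(tk)
    for (i in seq_len(n)) {
      for (d in seq_len(window)) {
        j = i + d
        if (j > n) break
        # store each unordered pair once, in the upper triangle
        idx = sort(match(c(tk[i], tk[j]), vocab))
        tcm[idx[1], idx[2]] = tcm[idx[1], idx[2]] + weights[d]
      }
    }
  }
  tcm
}

tokens = list(c("a", "b", "c"), c("a", "c"))
vocab = c("a", "b", "c")
tcm = build_tcm(tokens, vocab, window = 2L)
print(tcm)
```

In the first document "a" and "c" are two positions apart and contribute 1/2, while in the second they are adjacent and contribute 1, which gives a total weight of 1.5 for that pair.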
## Not run: 
data("movie_review")
# single thread
tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)
# parallel version (requires a registered parallel backend, sized to the cores on your machine)
it = itoken_parallel(movie_review$review, tolower, word_tokenizer, ids = movie_review$id)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = 'TsparseMatrix')
tcm = create_tcm(it, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")
## End(Not run)
This function collects unique terms and corresponding statistics. See below for details.
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
it |
iterator over a |
ngram |
|
stopwords |
|
sep_ngram |
|
window_size |
|
... |
placeholder for additional arguments (not used at the moment). |
text2vec_vocabulary object, which is actually a data.frame with the following columns:
term |
|
term_count |
|
doc_count |
|
It also contains meta-information in attributes:
ngram: integer vector, the lower and upper boundary of the range of n-gram-values.
document_count: integer number of documents the vocabulary was built from.
stopwords: character vector of stopwords
sep_ngram: character separator for ngrams
character: creates text2vec_vocabulary from predefined character vector. Terms will be inserted as is, without any checks (ngrams number, ngram delimiters, etc.).
itoken: collects unique terms and corresponding statistics from object.
itoken_parallel: collects unique terms and corresponding statistics from iterator.
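The kind of statistics collected here can be reproduced on a toy corpus in base R. The sketch assumes the natural reading of the column names (term_count as the total number of occurrences, doc_count as the number of documents containing the term) and builds an ordinary data.frame, not a real text2vec_vocabulary object.

```r
tokens = list(c("a", "b", "a"), c("b", "c"), c("a"))

# Total occurrences of each term across the corpus
term_count = table(unlist(tokens))
# Number of documents in which each term occurs at least once
doc_count = table(unlist(lapply(tokens, unique)))

vocab = data.frame(term = names(term_count),
                   term_count = as.integer(term_count),
                   doc_count = as.integer(doc_count[names(term_count)]),
                   stringsAsFactors = FALSE)

# Meta-information stored as an attribute, as described above
attr(vocab, "document_count") = length(tokens)
print(vocab)
```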
data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8, doc_proportion_min = 0.001, vocab_term_max = 20000)
dist2 calculates pairwise distances/similarities between the rows of two data matrices. Note that some methods work only on sparse matrices and others work only on dense matrices.
pdist2 calculates "parallel" distances between the rows of two data matrices.
dist2(x, y = NULL, method = c("cosine", "euclidean", "jaccard"), norm = c("l2", "l1", "none"))
pdist2(x, y, method = c("cosine", "euclidean", "jaccard"), norm = c("l2", "l1", "none"))
x |
first matrix. |
y |
second matrix. For |
method |
usually |
norm |
|
Computes the distance matrix using the specified method. Similar to the dist function, but works with two matrices.
pdist2 takes two matrices and returns a single vector giving the "parallel" distances of the vectors.
dist2 returns a matrix of distances/similarities between each row of matrix x and each row of matrix y.
pdist2 returns a vector of "parallel" distances between rows of x and y.
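The difference between the two return shapes can be shown with a dense base R sketch. The helper cosine_dist is hypothetical and defines cosine distance as one minus cosine similarity, which is an assumption of this example rather than a statement about the package internals.

```r
# Hypothetical helper: cosine distance between two numeric vectors
cosine_dist = function(a, b) 1 - sum(a * b) / sqrt(sum(a^2) * sum(b^2))

x = rbind(c(1, 0), c(0, 1), c(1, 1))
y = rbind(c(1, 0), c(1, 1), c(2, 2))

# dist2-like result: every row of x against every row of y,
# giving an nrow(x) by nrow(y) matrix
pairwise = outer(seq_len(nrow(x)), seq_len(nrow(y)),
                 Vectorize(function(i, j) cosine_dist(x[i, ], y[j, ])))

# pdist2-like result: row i of x against row i of y only,
# giving a vector of length nrow(x)
parallel = vapply(seq_len(nrow(x)),
                  function(i) cosine_dist(x[i, ], y[i, ]),
                  numeric(1))

print(pairwise)
print(parallel)
```

The "parallel" distances are the diagonal of the pairwise matrix, so pdist2 is the cheaper choice when only matched rows need to be compared.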
re-export rsparse::GloVe
GlobalVectors
An object of class R6ClassGenerator of length 25.
The result of this function is usually passed to an itoken function.
ifiles(file_paths, reader = readLines)
idir(path, reader = readLines)
ifiles_parallel(file_paths, reader = readLines, ...)
file_paths |
|
reader |
|
path |
|
... |
other arguments (not used at the moment) |
## Not run: 
current_dir_files = list.files(path = ".", full.names = TRUE)
files_iterator = ifiles(current_dir_files)
parallel_files_iterator = ifiles_parallel(current_dir_files, n_chunks = 4)
it = itoken_parallel(parallel_files_iterator)
dtm = create_dtm(it, hash_vectorizer(2**16), type = 'TsparseMatrix')
## End(Not run)
dir_files_iterator = idir(path = ".")
This family of functions creates iterators over input objects in order to create vocabularies, or DTM and TCM matrices. The iterators are usually used in the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See them for details.
itoken(iterable, ...)
## S3 method for class 'character'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer, n_chunks = 10, progressbar = interactive(), ids = NULL, ...)
## S3 method for class 'list'
itoken(iterable, n_chunks = 10, progressbar = interactive(), ids = names(iterable), ...)
## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer, progressbar = interactive(), ...)
itoken_parallel(iterable, ...)
## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer, n_chunks = 10, ids = NULL, ...)
## S3 method for class 'iterator'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer, n_chunks = 1L, ...)
## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = 10, ids = NULL, ...)
iterable |
an object from which to generate an iterator |
... |
arguments passed to other methods |
preprocessor |
|
tokenizer |
|
n_chunks |
|
progressbar |
|
ids |
|
S3 methods for creating an itoken iterator from a list of tokens:
list: all elements of the input list should be character vectors containing tokens
character: raw text source; the user must provide a tokenizer function
ifiles: from files; the user must provide a function to read in the file (to ifiles) and a function to tokenize it (to itoken)
idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize them (to itoken)
ifiles_parallel: from files in parallel
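The difference between the character and list inputs can be sketched as follows. This is a minimal illustration that assumes text2vec is installed; the two toy documents and the variable names are made up for the example.

```r
library(text2vec)

# two toy documents (hypothetical data)
docs = c(d1 = "The quick brown fox", d2 = "jumps over the lazy dog")

# character input: itoken applies the preprocessor, then the tokenizer
it_chr = itoken(docs, preprocessor = tolower, tokenizer = word_tokenizer,
                ids = names(docs), progressbar = FALSE)

# list input: tokens are prepared up front, so no tokenizer is passed
tokens = word_tokenizer(tolower(docs))
it_lst = itoken(tokens, ids = names(docs), progressbar = FALSE)

# both iterators should describe the same set of terms
v_chr = create_vocabulary(it_chr)
v_lst = create_vocabulary(it_lst)
setequal(v_chr$term, v_lst$term)
```

Tokenizing up front is convenient when the same tokens are reused by several iterators; the character method keeps preprocessing inside the streaming pipeline.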
ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of stemming tokenizer
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }
it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'TsparseMatrix'))
This function is largely a copy of the respective function in https://github.com/cpsievert/LDAvis/blob/master/R/createJSON.R, with a fix to avoid log(0) proposed by Maren-Eckhoff in https://github.com/cpsievert/LDAvis/issues/56.
jsPCA_robust(phi)
phi |
matrix, with each row containing the distribution over terms for a topic, with as many rows as there are topics in the model, and as many columns as there are terms in the vocabulary. |
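The log(0) problem mentioned above appears when a topic assigns zero probability to a term. The following base-R sketch shows a Jensen-Shannon divergence that treats 0 * log(0) as 0. The names js_divergence and the toy phi matrix are illustrative assumptions, not the package's internal code.

```r
# Jensen-Shannon divergence between two discrete distributions p and q.
# Entries with zero probability contribute 0 instead of NaN (0 * log(0)).
js_divergence = function(p, q) {
  m = 0.5 * (p + q)
  kl = function(a, b) sum(ifelse(a == 0, 0, a * (log(a) - log(b))))
  0.5 * kl(p, m) + 0.5 * kl(q, m)
}

# toy topic-term matrix (hypothetical): 3 topics x 3 terms, rows sum to 1
phi = rbind(c(0.7, 0.3, 0.0),
            c(0.0, 0.3, 0.7),
            c(0.4, 0.2, 0.4))

# pairwise divergences between topics
n_topics = nrow(phi)
js = matrix(0, n_topics, n_topics)
for (i in seq_len(n_topics)) {
  for (j in seq_len(n_topics)) {
    js[i, j] = js_divergence(phi[i, ], phi[j, ])
  }
}
js
```

All entries stay finite even though phi contains zeros, which is the property the fix is meant to guarantee.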
Creates a Latent Dirichlet Allocation model. At the moment only 'WarpLDA' is implemented. WarpLDA is an LDA sampler which achieves O(1) time complexity per token and O(K) scope of random access. According to the empirical results reported by its authors across a wide range of testing conditions, WarpLDA is consistently 5-15x faster than the Metropolis-Hastings based LightLDA, and is comparable to or faster than the sparsity-aware F+LDA.
LatentDirichletAllocation LDA
R6Class object.
topic_word_distribution
distribution of words for each topic. Available after model fitting with the model$fit_transform() method.
components
unnormalized word counts for each topic-word entry. Available after model fitting with the model$fit_transform() method.
For usage details see Methods, Arguments and Examples sections.
lda = LDA$new(n_topics = 10L, doc_topic_prior = 50 / n_topics, topic_word_prior = 1 / n_topics)
lda$fit_transform(x, n_iter = 1000, convergence_tol = 1e-3, n_check_convergence = 10, progressbar = interactive())
lda$transform(x, n_iter = 1000, convergence_tol = 1e-3, n_check_convergence = 5, progressbar = FALSE)
lda$get_top_words(n = 10, topic_number = 1L:private$n_topics, lambda = 1)
$new(n_topics,
doc_topic_prior = 50 / n_topics, # alpha
topic_word_prior = 1 / n_topics, # beta
method = "WarpLDA")
Constructor for LDA model. For description of arguments see Arguments section.
$fit_transform(x, n_iter, convergence_tol = -1,
n_check_convergence = 0, progressbar = interactive())
Fits the LDA model to the input matrix x and transforms the input documents into topic space. The result is a matrix where each row represents the corresponding document. The values in a row form a distribution over topics.
$transform(x, n_iter, convergence_tol = -1,
n_check_convergence = 0, progressbar = FALSE)
Transforms new documents into topic space. The result is a matrix where each row is the distribution of a document over the latent topic space.
$get_top_words(n = 10, topic_number = 1L:private$n_topics, lambda = 1)
Returns the "top words" for a given topic (or several topics). Words for each topic can be sorted by the probability of observing the word in that topic (lambda = 1) or by "relevance", which also takes into account the frequency of the word in the corpus (lambda < 1). In our experience, setting 0.2 < lambda < 0.4 works well in most cases. See http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf for details.
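The effect of lambda can be illustrated with the relevance formula from the paper linked above, relevance = lambda * log(p(w|t)) + (1 - lambda) * log(p(w|t) / p(w)). The sketch below is plain base R on a made-up 2-topic distribution; the marginal p(w) is taken as the mean over topics, which is an assumption for the example and not necessarily how the package computes it.

```r
# toy topic-word distribution (hypothetical): 2 topics x 4 terms, rows sum to 1
phi = rbind(c(0.50, 0.30, 0.15, 0.05),
            c(0.50, 0.05, 0.05, 0.40))
colnames(phi) = c("the", "film", "plot", "music")

# marginal term probability, assuming equally weighted topics
p_w = colMeans(phi)

# relevance of each term for each topic
relevance = function(phi, p_w, lambda) {
  lambda * log(phi) + (1 - lambda) * log(sweep(phi, 2, p_w, "/"))
}

# top n terms per topic; one column per topic in the result
top_words = function(phi, p_w, lambda, n = 2) {
  r = relevance(phi, p_w, lambda)
  apply(r, 1, function(row) names(sort(row, decreasing = TRUE))[seq_len(n)])
}

top_words(phi, p_w, lambda = 1)    # ranked by probability within the topic
top_words(phi, p_w, lambda = 0.3)  # ranked by relevance
```

With lambda = 1 the corpus-wide frequent term "the" leads both topics; with lambda = 0.3 the topic-specific terms "film" and "music" move to the top.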
$plot(lambda.step = 0.1, reorder.topics = FALSE, ...)
Plots the LDA model using the LDAvis package (https://cran.r-project.org/package=LDAvis). Arguments in ... are passed to the LDAvis::createJSON and LDAvis::serVis functions.
An LDA object.
An input document-term matrix (should have column names = terms). CSR RsparseMatrix is used internally; other formats will be converted to CSR via an as() function call.
integer: desired number of latent topics. Also known as K.
numeric: prior for document-topic multinomial distribution. Also known as alpha.
numeric: prior for topic-word multinomial distribution. Also known as eta.
integer: number of sampling iterations while fitting the model.
integer: number of iterations used when sampling from the converged model for inference. In other words, the number of samples from the distribution after burn-in.
Defines how often to calculate the score to check convergence.
numeric = -1: defines the early stopping strategy. Fitting stops when one of the following two conditions is satisfied: (a) all iterations have been used, or (b) score_previous_check / score_current < 1 + convergence_tol.
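Condition (b) can be written as a one-line check. The sketch below is an illustration in base R; the function name should_stop and the example scores are hypothetical, and the scores are assumed to be negative log-likelihood-style values that move toward zero as the fit improves.

```r
# early stopping rule (b): stop when the relative change in score is small
should_stop = function(score_previous, score_current, convergence_tol) {
  score_previous / score_current < 1 + convergence_tol
}

# small improvement between checks: ratio is about 1.0005, below 1 + 1e-3
should_stop(-1000, -999.5, convergence_tol = 1e-3)

# large improvement between checks: ratio is about 1.11, keep iterating
should_stop(-1000, -900, convergence_tol = 1e-3)

# with convergence_tol = -1 the threshold is 0, so same-sign scores
# never trigger early stopping and all iterations are used
should_stop(-1000, -999.5, convergence_tol = -1)
```

Under these assumptions, convergence_tol = -1 effectively disables early stopping.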
## Not run:
library(text2vec)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, ids = movie_review$id[1:N])
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
lda_model = LDA$new(n_topics = 10)
doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 20)
# run LDAvis visualisation if needed (make sure LDAvis package installed)
# lda_model$plot()
## End(Not run)
Creates an LSA (latent semantic analysis) model. See https://en.wikipedia.org/wiki/Latent_semantic_analysis for details.
LatentSemanticAnalysis LSA
R6Class
object.
For usage details see Methods, Arguments and Examples sections.
lsa = LatentSemanticAnalysis$new(n_topics)
lsa$fit_transform(x, ...)
lsa$transform(x, ...)
lsa$components
$new(n_topics)
create LSA model with n_topics latent topics
$fit_transform(x, ...)
fit model to an input sparse matrix (preferably in dgCMatrix format) and then transform x to latent space
$transform(x, ...)
transform new data x to latent space
An LSA object.
An input document-term matrix. Preferably in dgCMatrix format.
integer: desired number of latent topics.
Arguments to internal functions. Notably useful for fit_transform(): these arguments will be passed to rsparse::soft_svd.
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer(2**10))
n_topics = 5
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(dtm)
# the same, but wrapped with S3 methods
d2 = fit_transform(dtm, lsa_1)
The labeled dataset consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary, meaning an IMDB rating < 5 results in a sentiment score of 0, and a rating >=7 has a sentiment score of 1. No individual movie has more than 30 reviews. Important note: we removed non ASCII symbols from the original dataset to satisfy CRAN policy.
data("movie_review")
A data frame with 5000 rows and 3 variables:
id: Unique ID of each review
sentiment: Sentiment of the review; 1 for positive reviews and 0 for negative reviews
review: Text of the review (UTF-8)
http://ai.stanford.edu/~amaas/data/sentiment/
Normalizes matrix rows using the given norm.
normalize(m, norm = c("l1", "l2", "none"))
m |
|
norm |
|
normalized matrix
Given a document-term matrix, a topic-word distribution, and a document-topic distribution, calculates perplexity.
perplexity(X, topic_word_distribution, doc_topic_distribution)
X |
sparse document-term matrix which contains terms counts. Internally |
topic_word_distribution |
dense matrix for topic-word distribution. Number of rows = |
doc_topic_distribution |
dense matrix for document-topic distribution. Number of rows = |
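How the three inputs combine can be sketched in base R, assuming the usual definition perplexity = exp(-sum(n_dw * log(p(w|d))) / sum(n_dw)) with p(w|d) = sum over topics of doc_topic[d, k] * topic_word[k, w]. The function perplexity_sketch and the toy matrices are illustrative; the package's own implementation works on sparse input and may differ in details.

```r
# perplexity under the standard definition (assumed here), dense toy version
perplexity_sketch = function(X, topic_word, doc_topic) {
  # p[d, w]: probability of term w in document d under the model
  p = doc_topic %*% topic_word
  nz = X > 0  # only observed (document, term) pairs contribute
  exp(-sum(X[nz] * log(p[nz])) / sum(X))
}

# toy data (hypothetical): 2 documents, 3 terms, 2 topics
X = rbind(c(2, 0, 1),
          c(0, 3, 1))
topic_word = rbind(c(0.6, 0.1, 0.3),
                   c(0.1, 0.7, 0.2))
doc_topic = rbind(c(0.9, 0.1),
                  c(0.2, 0.8))

perplexity_sketch(X, topic_word, doc_topic)

# sanity check: a uniform model over 3 terms has perplexity 3
perplexity_sketch(X, matrix(1 / 3, 2, 3), doc_topic)
```

Lower values indicate that the model assigns higher probability to the observed term counts.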
## Not run:
library(text2vec)
data("movie_review")
n_iter = 10
train_ind = 1:200
ids = movie_review$id[train_ind]
txt = tolower(movie_review[['review']][train_ind])
names(txt) = ids
tokens = word_tokenizer(txt)
it = itoken(tokens, progressbar = FALSE, ids = ids)
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab, term_count_min = 5, doc_proportion_min = 0.02)
dtm = create_dtm(it, vectorizer = vocab_vectorizer(vocab))
n_topic = 10
model = LDA$new(n_topic, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr = model$fit_transform(dtm, n_iter = n_iter, n_check_convergence = 1, convergence_tol = -1, progressbar = FALSE)
topic_word_distr_10 = model$topic_word_distribution
perplexity(dtm, topic_word_distr_10, doc_topic_distr)
## End(Not run)
This function prepares a list of questions from a questions-words.txt format. For full examples see GloVe.
prepare_analogy_questions(questions_file_path, vocab_terms)
questions_file_path |
|
vocab_terms |
|
Print a vocabulary.
## S3 method for class 'text2vec_vocabulary' print(x, ...)
x |
vocabulary |
... |
optional arguments to print methods. |
This function filters the input vocabulary and throws out very frequent and very infrequent terms. See the examples for the create_vocabulary function. The parameter vocab_term_max can also be used to limit the absolute size of the vocabulary to only the most frequently used terms.
prune_vocabulary(vocabulary, term_count_min = 1L, term_count_max = Inf,
  doc_proportion_min = 0, doc_proportion_max = 1,
  doc_count_min = 1L, doc_count_max = Inf, vocab_term_max = Inf)
vocabulary |
a vocabulary from the create_vocabulary function. |
term_count_min |
minimum number of occurrences over all documents. |
term_count_max |
maximum number of occurrences over all documents. |
doc_proportion_min |
minimum proportion of documents which should contain term. |
doc_proportion_max |
maximum proportion of documents which should contain term. |
doc_count_min |
term will be kept if the number of documents containing this term is larger than this value |
doc_count_max |
term will be kept if the number of documents containing this term is smaller than this value |
vocab_term_max |
maximum number of terms in vocabulary. |
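The filtering logic described by these arguments can be sketched on a plain data frame. This is an illustration in base R, not the package's implementation: the toy vocabulary, the helper prune_sketch, and the choice of inclusive bounds are all assumptions made for the example.

```r
# toy vocabulary (hypothetical counts) for a corpus of 10 documents
vocab = data.frame(
  term       = c("the", "film", "plot", "zyzzyva"),
  term_count = c(120L, 15L, 6L, 1L),
  doc_count  = c(10L, 8L, 5L, 1L),
  stringsAsFactors = FALSE
)
n_docs = 10L

# keep terms within the count and proportion limits (bounds assumed inclusive),
# then keep at most vocab_term_max of the most frequent remaining terms
prune_sketch = function(vocab, n_docs, term_count_min = 1L,
                        doc_proportion_max = 1, vocab_term_max = Inf) {
  keep = vocab$term_count >= term_count_min &
    vocab$doc_count / n_docs <= doc_proportion_max
  pruned = vocab[keep, ]
  pruned = pruned[order(pruned$term_count, decreasing = TRUE), ]
  head(pruned, min(vocab_term_max, nrow(pruned)))
}

# drops the rare term "zyzzyva" and the ubiquitous term "the"
prune_sketch(vocab, n_docs, term_count_min = 5L, doc_proportion_max = 0.9)$term
```

Adding vocab_term_max = 1 to the same call would keep only the most frequent surviving term.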
The RWMD model can be used to query the "relaxed word mover's distance" from a document to a collection of documents. RWMD measures the distance between a query document and a collection of documents by calculating how hard it is to transform the words of the query document into the words of each document in the collection. For more detail see the following article: http://mkusner.github.io/publications/WMD.pdf. In contrast to that article, however, we calculate the "easiness" of converting one word into another using cosine similarity rather than Euclidean distance. In addition, text2vec implements an efficient RWMD using the tricks from the article "Linear-Complexity Relaxed Word Mover's Distance with GPU Acceleration", https://arxiv.org/abs/1711.07227.
RelaxedWordMoversDistance RWMD
R6Class
object.
For usage details see Methods, Arguments and Examples sections.
rwmd = RelaxedWordMoversDistance$new(x, embeddings)
rwmd$sim2(x)
$new(x, embeddings)
Constructor for RWMD model. x is a document-term matrix which represents the collection of documents against which you want to perform queries. embeddings is a matrix of word embeddings which will be used to calculate similarities between words (each row represents a word vector).
$sim2(x)
calculates similarity from a collection of documents to a collection of query documents x. Here x is a document-term matrix which represents the set of query documents.
$dist2(x)
calculates distance from a collection of documents to a collection of query documents x. Here x is a document-term matrix which represents the set of query documents.
## Not run:
library(text2vec)
library(rsparse)
data("movie_review")
tokens = word_tokenizer(tolower(movie_review$review))
v = create_vocabulary(itoken(tokens))
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.5)
it = itoken(tokens)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
glove_model = GloVe$new(rank = 50, x_max = 10)
wv = glove_model$fit_transform(tcm, n_iter = 5)
# get average of main and context vectors as proposed in GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RelaxedWordMoversDistance$new(dtm, wv)
rwms = rwmd_model$sim2(dtm[1:10, ])
head(sort(rwms[1, ], decreasing = TRUE))
## End(Not run)
sim2 calculates pairwise similarities between the rows of two data matrices. Note that some methods work only on sparse matrices and others work only on dense matrices.
psim2 calculates "parallel" similarities between the rows of two data matrices.
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
psim2(x, y, method = c("cosine", "jaccard"), norm = c("l2", "none"))
x |
first matrix. |
y |
second matrix. For |
method |
|
norm |
|
Computes the similarity matrix using the given method.
psim2 takes two matrices and returns a single vector giving the "parallel" similarities of the vectors.
sim2 returns a matrix of similarities between each row of matrix x and each row of matrix y.
psim2 returns a vector of "parallel" similarities between the rows of x and y.
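The difference between pairwise and "parallel" similarity can be shown with cosine similarity on small dense matrices. This base-R sketch only illustrates the two output shapes; it does not call sim2 or psim2, and the helper l2_normalize is a name introduced for the example.

```r
# scale each row to unit Euclidean length
l2_normalize = function(m) m / sqrt(rowSums(m^2))

# two toy matrices with the same number of rows and columns
x = rbind(c(1, 0), c(1, 1), c(0, 2))
y = rbind(c(2, 0), c(0, 3), c(0, 1))

# pairwise: every row of x against every row of y, a 3 x 3 matrix
pairwise = tcrossprod(l2_normalize(x), l2_normalize(y))

# parallel: row i of x against row i of y only, a vector of length 3
parallel = rowSums(l2_normalize(x) * l2_normalize(y))

pairwise
parallel
```

The parallel result equals the diagonal of the pairwise matrix, but is computed without forming the full matrix.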
This function splits a vector into n parts of roughly equal size. These splits can be used for parallel processing. In general, n should be equal to the number of jobs you want to run, which should be the number of cores you want to use.
split_into(vec, n)
vec |
input vector |
n |
|
list with n elements, each of roughly equal length
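One way to produce such a split is to assign each position to a contiguous group. The sketch below is a base-R illustration; split_sketch is a hypothetical helper, and the package's split_into may order or size the parts differently.

```r
# split a vector into n contiguous parts whose sizes differ by at most one
split_sketch = function(vec, n) {
  groups = ceiling(seq_along(vec) * n / length(vec))
  unname(split(vec, groups))
}

parts = split_sketch(1:10, 3)
lengths(parts)  # sizes of the three parts
parts
```

Each part can then be handed to a separate worker, for example one chunk of documents per core.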
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
To learn more about text2vec visit project website: http://text2vec.org
Or start with the vignettes:
browseVignettes(package = "text2vec")
Creates a TfIdf (term frequency-inverse document frequency) model.
"smooth" IDF (default) is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears) )
"non-smooth" IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears) )
TfIdf
R6Class
object.
Term Frequency Inverse Document Frequency
For usage details see Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c('l1', 'l2', 'none'), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
Creates tf-idf model
$fit_transform(x)
fit model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
transform new data x using tf-idf from train data
A TfIdf object
An input document-term matrix. Preferably in dgCMatrix format
smooth_idf = TRUE: smooth IDF weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.
norm = c("l1", "l2", "none"): Type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.
sublinear_tf = FALSE: Apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF).
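The IDF definitions and the normalization options above can be traced on a small dense matrix. This is a base-R sketch with made-up counts; it follows the formulas as stated in this section, and the order in which the package applies scaling and normalization internally is not assumed here.

```r
# toy document-term matrix (hypothetical counts): 3 documents x 3 terms
dtm = rbind(c(2, 0, 2),
            c(0, 1, 3),
            c(1, 0, 1))
colnames(dtm) = c("film", "plot", "the")

n_docs = nrow(dtm)
doc_freq = colSums(dtm > 0)              # number of documents containing each term

idf_smooth = log(1 + n_docs / doc_freq)  # "smooth" IDF
idf_plain  = log(n_docs / doc_freq)      # "non-smooth" IDF

# "l1" normalization: divide each row by the number of words in the document
tf_l1 = dtm / rowSums(dtm)

# sublinear scaling: replace each non-zero TF with 1 + log(TF)
tf_sublinear = dtm
nz = dtm > 0
tf_sublinear[nz] = 1 + log(dtm[nz])

# tf-idf weights: scale each term column by its IDF
tfidf = sweep(tf_l1, 2, idf_smooth, "*")
tfidf
```

Note that the non-smooth IDF of "the" is zero because it occurs in every document, while the smooth variant keeps a positive weight.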
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)
A few simple tokenization functions. For a more comprehensive list see the tokenizers package: https://cran.r-project.org/package=tokenizers. Also check stringi::stri_split_*.
word_tokenizer(strings, ...)
char_tokenizer(strings, ...)
space_tokenizer(strings, sep = " ", xptr = FALSE, ...)
postag_lemma_tokenizer(strings, udpipe_model, tagger = "default", tokenizer = "tokenizer",
  pos_keep = character(0),
  pos_remove = c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ", "AUX", "X", "INTJ"))
strings |
|
... |
other parameters (usually not used - see source code for details). |
sep |
|
xptr |
|
udpipe_model |
- udpipe model, can be loaded with |
tagger |
|
tokenizer |
|
pos_keep |
|
pos_remove |
|
list of character vectors. Each element of the list contains a vector of tokens.
doc = c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - perform split by a fixed single whitespace symbol.
space_tokenizer(doc, " ")
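What "split by a fixed single whitespace symbol" means can be seen with a base-R analogue. The sketch uses strsplit with fixed = TRUE as a stand-in; it is not the implementation of space_tokenizer, only an illustration of the same kind of splitting.

```r
doc = c("first second", "bla, bla, blaa")

# split on a literal single space; no regular expression, no punctuation handling
tokens = strsplit(doc, " ", fixed = TRUE)
tokens
```

Punctuation stays attached to the tokens ("bla," rather than "bla"), which is the trade-off for speed compared with boundary-aware word splitting.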
This function creates an object (closure) which defines how to transform a list of tokens into vector space, i.e. how to map words to indices. It is supposed to be used only as an argument to create_dtm, create_tcm, create_vocabulary.
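The two ways of mapping words to indices can be contrasted in a few lines of base R. This is a conceptual sketch: toy_hash_index is a made-up hash for illustration and is not the hash function text2vec uses, and the vocabulary and tokens are hypothetical.

```r
# toy string hash (NOT the hash used by text2vec):
# maps a token to a column index in 1..hash_size
toy_hash_index = function(token, hash_size) {
  h = 0
  for (code in utf8ToInt(token)) {
    h = (h * 31 + code) %% hash_size
  }
  h + 1
}

tokens = c("the", "film", "the", "plot")

# vocabulary-based mapping: index is the position of the term in a fixed vocabulary
vocabulary = c("film", "plot", "the")
vocab_index = match(tokens, vocabulary)

# hash-based mapping: no vocabulary is stored, but distinct tokens may collide
hash_size = 2^18
hash_index = vapply(tokens, toy_hash_index, numeric(1), hash_size = hash_size)

vocab_index
hash_index
```

A vocabulary gives compact, interpretable columns at the cost of a pass over the corpus to build it; hashing skips that pass but loses the mapping from column back to term.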
vocab_vectorizer(vocabulary)
hash_vectorizer(hash_size = 2^18, ngram = c(1L, 1L), signed_hash = FALSE)
vocabulary |
|
hash_size |
|
ngram |
|
signed_hash |
|
A vectorizer object (closure).
create_dtm, create_tcm, create_vocabulary
data("movie_review")
N = 100
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
# preprocessor is the argument name in the itoken signature
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L))
vectorizer = vocab_vectorizer(v)
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer, n_chunks = 10)
dtm = create_dtm(it, vectorizer)