| Title: | Modern Text Mining Framework for R |
|---|---|
| Description: | Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines. |
| Authors: | Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph] (Coherence measures for topic models), Qing Wang [aut, cph] (Author of the WarpLDA C++ code) |
| Maintainer: | Dmitriy Selivanov <[email protected]> |
| License: | GPL (>= 2) \| file LICENSE |
| Version: | 0.6.6 |
| Built: | 2026-05-25 06:29:52 UTC |
| Source: | https://github.com/dselivanov/text2vec |
Converts 'dgCMatrix' (or coercible to 'dgCMatrix') to 'lda_c' format
as.lda_c(X)
| Argument | Description |
|---|---|
| X | Document-term matrix |
Creates a BNS (bi-normal separation) model. BNS is defined as Q(true positive rate) - Q(false positive rate), where Q is the quantile function of the standard normal distribution.
BNS
R6Class object.
Bi-Normal Separation
bns_stat: data.table with computed BNS statistic. Useful for feature selection.
For usage details see Methods, Arguments and Examples sections.
bns = BNS$new(treshold = 0.0005)
bns$fit_transform(x, y)
bns$transform(x)
$new(treshold = 0.0005): creates a BNS model.
$fit_transform(x, y): fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x): transforms new data x using the BNS statistics learned from the training data.
bns: a BNS object.
x: an input document-term matrix, preferably in dgCMatrix format.
y: binary target variable, coercible to logical.
treshold: clipping threshold to avoid infinities in the quantile function.
## Not run:
data("movie_review")
N = 1000
it = itoken(head(movie_review$review, N), preprocessor = tolower, tokenizer = word_tokenizer)
vocab = create_vocabulary(it)
dtm = create_dtm(it, vocab_vectorizer(vocab))
model_bns = BNS$new()
dtm_bns = model_bns$fit_transform(dtm, head(movie_review$sentiment, N))
## End(Not run)
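The BNS definition above can be sketched in base R. This is an illustration only, not the package's implementation: the helper name `bns_score` is hypothetical, and clipping both rates to the interval [treshold, 1 - treshold] is an assumption about how the `treshold` argument is applied.

```r
# Hypothetical helper: BNS of one binary feature against a binary target.
bns_score = function(feature_present, y, treshold = 0.0005) {
  y = as.logical(y)
  feature_present = as.logical(feature_present)
  tpr = sum(feature_present & y) / sum(y)      # true positive rate
  fpr = sum(feature_present & !y) / sum(!y)    # false positive rate
  # Assumed clipping: keep rates away from 0 and 1 so qnorm() stays finite.
  clip = function(p) pmin(pmax(p, treshold), 1 - treshold)
  qnorm(clip(tpr)) - qnorm(clip(fpr))
}

# Toy data: the feature occurs in 3 of 4 positive and 1 of 4 negative documents.
y = c(1, 1, 1, 1, 0, 0, 0, 0)
feature_present = c(1, 1, 1, 0, 1, 0, 0, 0)
bns = bns_score(feature_present, y)

# A perfectly separating feature would give Q(1) - Q(0) = Inf without clipping.
bns_perfect = bns_score(y, y)
```

Without the clipping step the second call would evaluate qnorm(1) - qnorm(0), which is infinite.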
This function checks how well the GloVe word embeddings do on the analogy task. For full examples see GloVe.
check_analogy_accuracy(questions_list, m_word_vectors)
questions_list |
|
m_word_vectors |
word vectors |
prepare_analogy_questions, GloVe
Given a topic model with topics represented as ordered term lists, coherence can be used to assess the quality of individual topics.
This function implements a selection of the many metrics proposed for this kind of assessment.
Coherence calculation is sensitive to the content of the reference tcm used for evaluation,
which may be created with different parameter settings. See the details section (or reference section) for information
on typical combinations of metric and type of tcm. For more general information on measuring coherence,
a starting point is given in the reference section.
coherence(x, tcm, metrics = c("mean_logratio", "mean_pmi", "mean_npmi", "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2"), smooth = 1e-12, n_doc_tcm = -1)
x |
A |
tcm |
The term co-occurrence matrix, e.g, a |
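metrics |
Character vector specifying the metrics to be calculated. Currently the following metrics are implemented: c("mean_logratio", "mean_pmi", "mean_npmi", "mean_difference", "mean_npmi_cosim", "mean_npmi_cosim2"). |
smooth |
Numeric smoothing constant to avoid logarithm of zero. By default, set to 1e-12. |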
n_doc_tcm |
The |
The currently implemented coherence metrics are described below, together with the
type of tcm that showed good performance in combination with each metric.
For details on how to create a tcm, see the example section.
For details on the performance of the metrics, see the resources in the reference section,
which served to define the standard settings for the individual metrics.
Note that, depending on the use case, settings other than the standard ones may still be reasonable for creating the tcm.
Note that for all currently implemented metrics the tcm is reduced to the top word space on the basis of the terms in x.
When the goal is to find the optimum number of topics among several models using different metrics, consider calculating the mean score over all topics and normalizing these mean coherence scores so that the metrics can be compared directly.
Each metric usually favors a different optimum number of topics. From initial experience, logratio, pmi and npmi usually favor smaller numbers, whereas the other metrics tend to favor higher numbers.
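The normalization suggested above might look as follows. This is a minimal sketch: the score values are made up purely for illustration, and min-max scaling is only one possible (assumed) choice of normalization.

```r
# Made-up mean coherence per model (rows: models with 10/20/30 topics,
# columns: metrics). In practice each cell would be the mean over topics
# of a column returned by coherence().
mean_coherence = rbind(
  k10 = c(mean_logratio = -2.1, mean_npmi = 0.05),
  k20 = c(mean_logratio = -2.6, mean_npmi = 0.08),
  k30 = c(mean_logratio = -3.4, mean_npmi = 0.07)
)

# Min-max scaling per metric, so every metric ranges over [0, 1].
rescale01 = function(v) (v - min(v)) / (max(v) - min(v))
normalized = apply(mean_coherence, 2, rescale01)

# Model preferred by each metric; different metrics may disagree.
best_k = rownames(normalized)[apply(normalized, 2, which.max)]
```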
Implemented metrics:
"mean_logratio"
The logarithmic ratio is calculated as log(smooth + tcm[x,y]) - log(tcm[y,y]),
where x and y are term index pairs from a "preceding" term index combination.
Given the indices c(1,2,3), combinations are list(c(2,1), c(3,1), c(3,2)).
The tcm should represent the boolean term co-occurrence (internally the actual counts are used)
in the original documents and, therefore, is an intrinsic metric in the standard use case.
This metric is similar to the UMass metric, however, with a smaller smoothing constant by default
and using the mean for aggregation instead of the sum.
"mean_pmi"
The pointwise mutual information is calculated as log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm),
where x and y are term index pairs from an arbitrary term index combination
that subsets the lower or upper triangle of tcm, e.g. "preceding".
The tcm should represent term co-occurrences within a boolean sliding window of size 10 (internally probabilities are used)
in an external reference corpus and, therefore, is an extrinsic metric in the standard use case.
This metric is similar to the UCI metric, however, with a smaller smoothing constant by default
and using the mean for aggregation instead of the sum.
"mean_npmi"
Similar (in terms of all parameter settings, etc.) to "mean_pmi" metric
but using the normalized pmi instead, which is calculated as (log2((tcm[x,y]/n_doc_tcm) + smooth) - log2(tcm[x,x]/n_doc_tcm) - log2(tcm[y,y]/n_doc_tcm)) / -log2((tcm[x,y]/n_doc_tcm) + smooth).
This metric may perform better than the simpler pmi metric.
"mean_difference"
The difference is calculated as tcm[x,y]/tcm[x,x] - (tcm[y,y]/n_tcm_windows),
where x and y are term index pairs from a "succeeding" term index combination.
Given the indices c(1,2,3), combinations are list(c(1,2), c(1,3), c(2,3)).
The tcm should represent the boolean term co-occurrence (internally probabilities are used)
in the original documents and, therefore, is an intrinsic metric in the standard use case.
"mean_npmi_cosim"
First, the npmi of an individual top word with each of the top words is calculated as in "mean_npmi".
This results in a vector of npmi values for each top word.
On this basis, the cosine similarity between each pair of vectors is calculated.
The tcm should represent term co-occurrences within a boolean sliding window of size 5 (internally probabilities are used)
in an external reference corpus and, therefore, is an extrinsic metric in the standard use case.
"mean_npmi_cosim2"
First, a vector of npmi values for each top word is calculated as in "mean_npmi_cosim".
On this basis, the cosine similarity between each vector and the sum of all vectors is calculated
(instead of the similarity between each pair).
The tcm should represent term co-occurrences within a boolean sliding window of size 110 (internally probabilities are used)
in an external reference corpus and, therefore, is an extrinsic metric in the standard use case.
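The formulas for "mean_logratio", "mean_pmi" and "mean_npmi" can be traced by hand on a toy example. The sketch below is not the package's code: the toy matrix is assumed data, and the same intrinsic boolean tcm is reused for all three metrics only to show the arithmetic (the pmi-based metrics are normally paired with a sliding-window tcm).

```r
# Toy document-term matrix: 4 documents, 3 top terms of one topic.
dtm = matrix(c(1, 1, 0,
               1, 0, 1,
               1, 1, 1,
               0, 1, 0),
             nrow = 4, byrow = TRUE,
             dimnames = list(NULL, c("a", "b", "c")))
tcm = crossprod(sign(dtm))   # boolean co-occurrence, document counts on the diagonal
n_doc_tcm = nrow(dtm)
smooth = 1e-12

# "preceding" combinations for indices c(1,2,3): (2,1), (3,1), (3,2)
idx = t(combn(3, 2))[, 2:1, drop = FALSE]

# mean_logratio: log(smooth + tcm[x,y]) - log(tcm[y,y])
logratio = log(smooth + tcm[idx]) - log(diag(tcm)[idx[, 2]])
mean_logratio = mean(logratio)

# mean_pmi and mean_npmi, written with probabilities
p_xy = tcm[idx] / n_doc_tcm
p_x = diag(tcm)[idx[, 1]] / n_doc_tcm
p_y = diag(tcm)[idx[, 2]] / n_doc_tcm
pmi = log2(p_xy + smooth) - log2(p_x) - log2(p_y)
npmi = pmi / -log2(p_xy + smooth)
mean_pmi = mean(pmi)
mean_npmi = mean(npmi)
```

Dividing by -log2 of the joint probability bounds npmi to [-1, 1], which makes values easier to compare across term pairs than raw pmi.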
A numeric matrix with the coherence scores of the specified metrics per topic.
The paper mentioned below is the main theoretical basis for this code.
Currently, only a selection of the metrics described in this paper is included in this R implementation.
Authors: Roeder, Michael; Both, Andreas; Hinneburg, Alexander (2015)
Title: Exploring the Space of Topic Coherence Measures.
In: Xueqi Cheng, Hang Li, Evgeniy Gabrilovich and Jie Tang (Eds.):
Proceedings of the Eighth ACM International Conference on Web Search and Data Mining - WSDM '15.
Shanghai, China, 02.02.2015 - 06.02.2015.
New York, USA: ACM Press, p. 399-408.
https://dl.acm.org/citation.cfm?id=2685324
The authors listed above have implemented this paper as the Java program "palmetto".
See https://github.com/dice-group/Palmetto or http://aksw.org/Projects/Palmetto.html.
## Not run:
library(data.table)
library(text2vec)
library(Matrix)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, progressbar = FALSE)
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
n_topics = 10
lda_model = LDA$new(n_topics)
fitted = lda_model$fit_transform(dtm, n_iter = 20)
tw = lda_model$get_top_words(n = 10, lambda = 1)
# for demonstration purposes create intrinsic TCM from original documents
# scores might not make sense for metrics that are designed for extrinsic TCM
tcm = crossprod(sign(dtm))
# check coherence
logger = lgr::get_logger('text2vec')
logger$set_threshold('debug')
res = coherence(tw, tcm, n_doc_tcm = N)
res
# example how to create TCM for extrinsic measures from an external corpus
external_reference_corpus = tolower(movie_review$review[501:1000])
tokens_ext = word_tokenizer(external_reference_corpus)
iterator_ext = itoken(tokens_ext, progressbar = FALSE)
v_ext = create_vocabulary(iterator_ext)
# for reasons of efficiency vocabulary may be reduced to the terms matched in the original corpus
v_ext = v_ext[v_ext$term %in% v$term, ]
# external vocabulary may be pruned depending on the use case
v_ext = prune_vocabulary(v_ext, term_count_min = 5, doc_proportion_max = 0.2)
vectorizer_ext = vocab_vectorizer(v_ext)
# for demonstration purposes a boolean co-occurrence within a sliding window of size 5 is used
# a size of 10 would represent sentence co-occurrence, a size of 110 would, e.g., be paragraph co-occurrence
window_size = 5
tcm_ext = create_tcm(iterator_ext, vectorizer_ext,
                     skip_grams_window = window_size,
                     weights = rep(1, window_size),
                     binary_cooccurence = TRUE)
# add marginal probabilities in diagonal (by default only upper triangle of tcm is created)
diag(tcm_ext) = attributes(tcm_ext)$word_count
# get number of sliding windows that serve as virtual documents, i.e. n_doc_tcm argument
n_skip_gram_windows = sum(sapply(tokens_ext, function(x) {length(x)}))
## End(Not run)
Creates Collocations model which can be used for phrase extraction.
Collocations
R6Class object.
collocation_stat: data.table with collocation (phrase) statistics. Useful for filtering non-relevant phrases.
For usage details see Methods, Arguments and Examples sections.
model = Collocations$new(vocabulary = NULL, collocation_count_min = 50, pmi_min = 5, gensim_min = 0,
lfmd_min = -Inf, llr_min = 0, sep = "_")
model$partial_fit(it, ...)
model$fit(it, n_iter = 1, ...)
model$transform(it)
model$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0)
model$collocation_stat
$new(vocabulary = NULL, collocation_count_min = 50, sep = "_"): constructor for the Collocations model. For a description of the arguments see the Arguments section.
$fit(it, n_iter = 1, ...): fits the Collocations model to the input iterator it. Iterates over it n_iter times, so it can hierarchically learn multi-word phrases. Invisibly returns collocation_stat.
$partial_fit(it, ...): iterates once over the data and learns collocations. Invisibly returns collocation_stat. Workhorse for $fit().
$transform(it): transforms the input iterator using the learned collocations model. The result is a new itoken or itoken_parallel iterator that produces tokens with phrases collapsed into a single token.
$prune(pmi_min = 5, gensim_min = 0, lfmd_min = -Inf, llr_min = 0): filters out non-relevant phrases with low scores. The user can also do this directly by modifying the collocation_stat object.
model: a Collocations model object.
n_iter: number of iterations over the data.
pmi_min, gensim_min, lfmd_min, llr_min: minimal scores of the corresponding statistics required to collapse tokens into a collocation:
pmi_min: pointwise mutual information.
gensim_min: "gensim" scores - https://radimrehurek.com/gensim/models/phrases.html adapted from word2vec paper.
lfmd_min: log-frequency biased mutual dependency.
llr_min: Dunning's logarithm of the ratio between the likelihoods of the hypotheses of dependence and independence.
See https://aclanthology.org/I05-1050/ for details.
Also see the data in model$collocation_stat for better intuition.
it: an input itoken or itoken_parallel iterator.
vocabulary: text2vec_vocabulary - if provided, only collocations consisting of terms from the vocabulary are considered.
library(text2vec)
data("movie_review")
preprocessor = function(x) {
  gsub("[^[:alnum:]\\s]", replacement = " ", tolower(x))
}
sample_ind = 1:100
tokens = word_tokenizer(preprocessor(movie_review$review[sample_ind]))
it = itoken(tokens, ids = movie_review$id[sample_ind])
system.time(v <- create_vocabulary(it))
v = prune_vocabulary(v, term_count_min = 5)
model = Collocations$new(collocation_count_min = 5, pmi_min = 5)
model$fit(it, n_iter = 2)
model$collocation_stat
it2 = model$transform(it)
v2 = create_vocabulary(it2)
v2 = prune_vocabulary(v2, term_count_min = 5)
# check what phrases model has learned
setdiff(v2$term, v$term)
# [1] "main_character" "jeroen_krabb" "boogey_man" "in_order"
# [5] "couldn_t" "much_more" "my_favorite" "worst_film"
# [9] "have_seen" "characters_are" "i_mean" "better_than"
# [13] "don_t_care" "more_than" "look_at" "they_re"
# [17] "each_other" "must_be" "sexual_scenes" "have_been"
# [21] "there_are_some" "you_re" "would_have" "i_loved"
# [25] "special_effects" "hit_man" "those_who" "people_who"
# [29] "i_am" "there_are" "could_have_been" "we_re"
# [33] "so_bad" "should_be" "at_least" "can_t"
# [37] "i_thought" "isn_t" "i_ve" "if_you"
# [41] "didn_t" "doesn_t" "i_m" "don_t"
# and same way we can create document-term matrix which contains
# words and phrases!
dtm = create_dtm(it2, vocab_vectorizer(v2))
# check that dtm contains phrases
which(colnames(dtm) == "jeroen_krabb")
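To build intuition for the pmi score and for what collapsing tokens into a phrase means, here is a base R sketch on toy tokens. It is not the package's implementation: the helper names `pmi` and `collapse` are hypothetical, and estimating probabilities as simple relative frequencies is an assumption.

```r
# Toy tokenized documents.
tokens = list(c("new", "york", "is", "big"),
              c("i", "love", "new", "york"),
              c("new", "york", "new", "ideas"))

unigrams = table(unlist(tokens))
bigrams = table(unlist(lapply(tokens, function(x) {
  paste(head(x, -1), tail(x, -1), sep = "_")
})))
n_tokens = sum(unigrams)
n_bigrams = sum(bigrams)

# Pointwise mutual information of the bigram "w1 w2" (assumed estimator).
pmi = function(w1, w2) {
  p_xy = bigrams[[paste(w1, w2, sep = "_")]] / n_bigrams
  p_x = unigrams[[w1]] / n_tokens
  p_y = unigrams[[w2]] / n_tokens
  log2(p_xy / (p_x * p_y))
}
pmi_new_york = pmi("new", "york")

# Collapse every adjacent occurrence of w1, w2 into the single token w1_w2.
collapse = function(x, w1, w2, sep = "_") {
  out = character(0)
  i = 1
  while (i <= length(x)) {
    if (i < length(x) && x[i] == w1 && x[i + 1] == w2) {
      out = c(out, paste(w1, w2, sep = sep))
      i = i + 2
    } else {
      out = c(out, x[i])
      i = i + 1
    }
  }
  out
}
collapsed = lapply(tokens, collapse, w1 = "new", w2 = "york")
```

Running the same procedure again on `collapsed` is the idea behind n_iter > 1: a phrase such as "new_york" can then pair with a neighbouring token to form a longer phrase.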
Combines multiple vocabularies into one
combine_vocabularies(..., combine_stopwords = function(x) unique(unlist(lapply(x, attr, which = "stopwords"), use.names = FALSE)), combine_ngram = function(x) attr(x[[1]], "ngram"), combine_sep_ngram = function(x) attr(x[[1]], "sep_ngram"))
| Argument | Description |
|---|---|
| ... | vocabulary objects created with create_vocabulary. |
| combine_stopwords | function to combine stopwords from input vocabularies. By default we take a union of all stopwords. |
| combine_ngram | function to combine lower and upper boundary for n-grams from input vocabularies. Usually these values should be the same, so we take this parameter from the first vocabulary. |
| combine_sep_ngram | function to combine n-gram separators from input vocabularies. Usually these values should be the same, so we take this parameter from the first vocabulary. |

A text2vec_vocabulary object; see details in create_vocabulary.
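The three default combiner functions from the usage above can be tried on plain data frames that carry the same attributes. The helper `make_vocab` is hypothetical and only mimics the attributes of a vocabulary; real vocabularies come from create_vocabulary.

```r
# Hypothetical stand-in for a vocabulary: a data.frame with the three attributes
# that the default combiners read.
make_vocab = function(terms, stopwords) {
  v = data.frame(term = terms, stringsAsFactors = FALSE)
  attr(v, "stopwords") = stopwords
  attr(v, "ngram") = c(ngram_min = 1L, ngram_max = 1L)
  attr(v, "sep_ngram") = "_"
  v
}
vocabs = list(make_vocab(c("cat", "dog"), c("the", "a")),
              make_vocab(c("dog", "fish"), c("a", "of")))

# Defaults copied from the combine_vocabularies() signature.
combine_stopwords = function(x) unique(unlist(lapply(x, attr, which = "stopwords"), use.names = FALSE))
combine_ngram = function(x) attr(x[[1]], "ngram")
combine_sep_ngram = function(x) attr(x[[1]], "sep_ngram")

stopwords_union = combine_stopwords(vocabs)   # union of all stopwords
ngram_first = combine_ngram(vocabs)           # taken from the first vocabulary
sep_first = combine_sep_ngram(vocabs)         # taken from the first vocabulary
```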
This is a high-level function for creating a document-term matrix.
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix", "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)
## S3 method for class 'itoken'
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix", "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)
## S3 method for class 'itoken_parallel'
create_dtm(it, vectorizer, type = c("dgCMatrix", "dgTMatrix", "dgRMatrix", "CsparseMatrix", "TsparseMatrix", "RsparseMatrix"), ...)
it |
itoken iterator or |
vectorizer |
|
type |
|
... |
placeholder for additional arguments (not used at the moment).
over |
If a parallel backend is registered and the first argument is a list of itoken
iterators, the function will construct the DTM in multiple threads.
Keep in mind that you have to split the data yourself and provide a list of
itoken iterators. Each element of it will be handled in a separate
thread, and the results are combined at the end of processing.
A document-term matrix
## Not run:
data("movie_review")
N = 1000
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer)
v = create_vocabulary(it)
# remove very common and uncommon words
pruned_vocab = prune_vocabulary(v, term_count_min = 10, doc_proportion_max = 0.5, doc_proportion_min = 0.001)
vectorizer = vocab_vectorizer(pruned_vocab)
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# get tf-idf matrix from bag-of-words matrix
tfidf = TfIdf$new()
dtm_tfidf = tfidf$fit_transform(dtm)
## Example of parallel mode
it = itoken_parallel(movie_review$review[1:N], tolower, word_tokenizer, ids = movie_review$id[1:N])
vectorizer = hash_vectorizer()
dtm = create_dtm(it, vectorizer, type = 'TsparseMatrix')
## End(Not run)
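The "process each chunk separately, then combine" idea described above can be sketched with the Matrix package alone. This is an illustration under assumptions, not the package's code: the helper `chunk_dtm` is hypothetical, and dropping out-of-vocabulary tokens is an assumed behaviour.

```r
library(Matrix)

vocab = c("a", "b", "c")

# Hypothetical helper: count matrix for one chunk of tokenized documents.
chunk_dtm = function(tokens, vocab) {
  i = rep(seq_along(tokens), lengths(tokens))   # document index of every token
  j = match(unlist(tokens), vocab)              # column index, NA if not in vocab
  keep = !is.na(j)
  # sparseMatrix() sums the values of duplicated (i, j) pairs, giving counts.
  sparseMatrix(i = i[keep], j = j[keep], x = 1,
               dims = c(length(tokens), length(vocab)),
               dimnames = list(names(tokens), vocab))
}

chunk1 = list(doc1 = c("a", "b", "a"), doc2 = c("b", "c"))
chunk2 = list(doc3 = c("c", "c", "z"))

# Each chunk could be processed by a separate worker; the pieces are stacked at the end.
dtm = rbind(chunk_dtm(chunk1, vocab), chunk_dtm(chunk2, vocab))
```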
This function constructs a term-co-occurrence matrix (TCM). A TCM is usually used with the GloVe word embedding model.
create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)
## S3 method for class 'itoken'
create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)
## S3 method for class 'itoken_parallel'
create_tcm(it, vectorizer, skip_grams_window = 5L, skip_grams_window_context = c("symmetric", "right", "left"), weights = 1/seq_len(skip_grams_window), binary_cooccurence = FALSE, ...)
it |
|
vectorizer |
|
skip_grams_window |
|
skip_grams_window_context |
one of |
weights |
weights for context/distant words during co-occurrence statistics calculation.
By default we are setting weights = 1/seq_len(skip_grams_window). |
binary_cooccurence |
|
... |
placeholder for additional arguments (not used at the moment).
|
If a parallel backend is registered, the function will construct the TCM in multiple threads.
Keep in mind that you have to split the data yourself and provide a list
of itoken iterators. Each element of it will be handled
in a separate thread, and the results are combined at the end of processing.
TsparseMatrix TCM matrix
## Not run:
data("movie_review")
# single thread
tokens = word_tokenizer(tolower(movie_review$review))
it = itoken(tokens)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
tcm = create_tcm(itoken(tokens), vectorizer, skip_grams_window = 3L)
# parallel version
# register a parallel backend first, with the number of workers set to the number of cores on your machine
it = itoken_parallel(movie_review$review, tolower, word_tokenizer, ids = movie_review$id)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer, type = 'TsparseMatrix')
tcm = create_tcm(it, vectorizer, skip_grams_window = 3L, skip_grams_window_context = "symmetric")
## End(Not run)
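How a skip-gram window and the default weights = 1/seq_len(skip_grams_window) turn a token sequence into weighted co-occurrence counts can be sketched in base R. This is a simplified illustration, not the package's algorithm: storing each unordered pair once in the upper triangle is an assumption made here for a symmetric context.

```r
tokens = c("a", "b", "c", "a")
vocab = sort(unique(tokens))
skip_grams_window = 2L
weights = 1 / seq_len(skip_grams_window)   # distance 1 -> 1, distance 2 -> 0.5

tcm = matrix(0, length(vocab), length(vocab), dimnames = list(vocab, vocab))
for (i in seq_along(tokens)) {
  for (d in seq_len(skip_grams_window)) {
    j = i + d
    if (j > length(tokens)) break
    # Assumed storage: each unordered pair goes to the upper triangle.
    pair = sort(c(tokens[i], tokens[j]))
    tcm[pair[1], pair[2]] = tcm[pair[1], pair[2]] + weights[d]
  }
}
```

Here "a" and "b" co-occur once at distance 1 and once at distance 2, so their cell holds 1 + 0.5 = 1.5.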
This function collects unique terms and corresponding statistics. See below for details.
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'character'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'itoken'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
## S3 method for class 'itoken_parallel'
create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 1L), stopwords = character(0), sep_ngram = "_", window_size = 0L, ...)
it |
iterator over a |
ngram |
|
stopwords |
|
sep_ngram |
|
window_size |
|
... |
placeholder for additional arguments (not used at the moment). |
text2vec_vocabulary object, which is actually a data.frame
with following columns:
term |
|
term_count |
|
doc_count |
|
Also it contains metainformation in attributes:
ngram: integer vector, the lower and upper boundary of the
range of n-gram-values.
document_count: integer number of documents the vocabulary was
built from.
stopwords: character vector of stopwords
sep_ngram: character separator for ngrams
character: creates text2vec_vocabulary from predefined
character vector. Terms will be inserted as is, without any checks
(ngrams number, ngram delimiters, etc.).
itoken: collects unique terms and corresponding statistics from object.
itoken_parallel: collects unique terms and corresponding
statistics from iterator.
data("movie_review")
txt = movie_review[['review']][1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
vocab = create_vocabulary(it)
pruned_vocab = prune_vocabulary(vocab, term_count_min = 10, doc_proportion_max = 0.8, doc_proportion_min = 0.001, vocab_term_max = 20000)
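The statistics stored in a vocabulary (term, term_count, doc_count and the document_count attribute) can be reproduced on toy tokens in base R. This sketch only mirrors the structure described above; the pruning rule at the end is an assumed reading of the term_count_min and doc_proportion_max arguments of prune_vocabulary.

```r
# Toy tokenized documents.
docs = list(c("a", "b", "a"), c("b", "c"), c("a"))

term_count = table(unlist(docs))                  # total occurrences per term
doc_count = table(unlist(lapply(docs, unique)))   # number of documents containing the term

vocab = data.frame(term = names(term_count),
                   term_count = as.integer(term_count),
                   doc_count = as.integer(doc_count[names(term_count)]),
                   stringsAsFactors = FALSE)
attr(vocab, "document_count") = length(docs)

# Assumed pruning rule: keep terms that occur at least twice overall and
# appear in at most 80% of the documents.
doc_proportion = vocab$doc_count / attr(vocab, "document_count")
pruned = vocab[vocab$term_count >= 2 & doc_proportion <= 0.8, ]
```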
dist2 calculates pairwise distances/similarities between the
rows of two data matrices. Note that some methods work only on sparse matrices and
others work only on dense matrices.
pdist2 calculates "parallel" distances between the rows of two data matrices.
dist2(x, y = NULL, method = c("cosine", "euclidean", "jaccard"), norm = c("l2", "l1", "none"))
pdist2(x, y, method = c("cosine", "euclidean", "jaccard"), norm = c("l2", "l1", "none"))
x |
first matrix. |
y |
second matrix. For |
method |
usually |
norm |
|
Computes the distance matrix using the specified method. Similar to the dist function, but works with two matrices.
pdist2 takes two matrices and returns a single vector giving the ‘parallel’ distances of the vectors.
dist2 returns matrix of distances/similarities between each row of
matrix x and each row of matrix y.
pdist2 returns vector of "parallel" distances between rows
of x and y.
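The difference between pairwise and "parallel" distances can be shown with dense base R matrices. This sketch assumes that the cosine distance is 1 minus the cosine similarity of L2-normalized rows; the helper names are hypothetical and the functions are not the package's implementation.

```r
x = rbind(c(1, 0), c(0, 2))
y = rbind(c(2, 0), c(1, 1))

# Divide every row by its Euclidean norm.
l2_normalize = function(m) m / sqrt(rowSums(m^2))

# Pairwise: entry [i, j] is the distance between row i of x and row j of y.
cosine_dist2 = function(x, y) 1 - l2_normalize(x) %*% t(l2_normalize(y))

# Parallel: entry i is the distance between row i of x and row i of y.
cosine_pdist2 = function(x, y) 1 - rowSums(l2_normalize(x) * l2_normalize(y))

d_pairwise = cosine_dist2(x, y)
d_parallel = cosine_pdist2(x, y)
```

The parallel distances equal the diagonal of the pairwise matrix, but are obtained without computing the off-diagonal entries.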
re-export rsparse::GloVe
GlobalVectors
An object of class R6ClassGenerator of length 25.
The result of this function is usually used in an itoken function.
ifiles(file_paths, reader = readLines)
idir(path, reader = readLines)
ifiles_parallel(file_paths, reader = readLines, ...)
file_paths |
|
reader |
|
path |
|
... |
other arguments (not used at the moment) |
## Not run:
current_dir_files = list.files(path = ".", full.names = TRUE)
files_iterator = ifiles(current_dir_files)
parallel_files_iterator = ifiles_parallel(current_dir_files, n_chunks = 4)
it = itoken_parallel(parallel_files_iterator)
dtm = create_dtm(it, hash_vectorizer(2**16), type = 'TsparseMatrix')
## End(Not run)
dir_files_iterator = idir(path = ".")
This family of functions creates iterators over input objects in order to create vocabularies, or DTM and TCM matrices. These iterators are usually used in the following functions: create_vocabulary, create_dtm, vectorizers, create_tcm. See them for details.
itoken(iterable, ...)
## S3 method for class 'character'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer, n_chunks = 10, progressbar = interactive(), ids = NULL, ...)
## S3 method for class 'list'
itoken(iterable, n_chunks = 10, progressbar = interactive(), ids = names(iterable), ...)
## S3 method for class 'iterator'
itoken(iterable, preprocessor = identity, tokenizer = space_tokenizer, progressbar = interactive(), ...)
itoken_parallel(iterable, ...)
## S3 method for class 'character'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer, n_chunks = 10, ids = NULL, ...)
## S3 method for class 'iterator'
itoken_parallel(iterable, preprocessor = identity, tokenizer = space_tokenizer, n_chunks = 1L, ...)
## S3 method for class 'list'
itoken_parallel(iterable, n_chunks = 10, ids = NULL, ...)
iterable |
an object from which to generate an iterator |
... |
arguments passed to other methods |
preprocessor |
|
tokenizer |
|
n_chunks |
|
progressbar |
|
ids |
|
S3 methods for creating an itoken iterator from a list of tokens:
list: all elements of the input list should be character vectors containing tokens
character: raw text source; the user must provide a tokenizer function
ifiles: from files; the user must provide a function to read in the files (to ifiles) and a function to tokenize them (to itoken)
idir: from a directory; the user must provide a function to read in the files (to idir) and a function to tokenize them (to itoken)
ifiles_parallel: from files in parallel
ifiles, idir, create_vocabulary, create_dtm, vectorizers, create_tcm
data("movie_review")
txt = movie_review$review[1:100]
ids = movie_review$id[1:100]
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10)
it = itoken(txt, tolower, word_tokenizer, n_chunks = 10, ids = ids)
# Example of stemming tokenizer
# stem_tokenizer = function(x) {
#   lapply(word_tokenizer(x), SnowballC::wordStem, language = "en")
# }
it = itoken_parallel(movie_review$review[1:100], n_chunks = 4)
system.time(dtm <- create_dtm(it, hash_vectorizer(2**16), type = 'TsparseMatrix'))
This function is largely a copy of the respective function in https://github.com/cpsievert/LDAvis/blob/master/R/createJSON.R, with a fix to avoid log(0) proposed by Maren-Eckhoff in https://github.com/cpsievert/LDAvis/issues/56
jsPCA_robust(phi)
phi |
matrix, with each row containing the distribution over terms for a topic, with as many rows as there are topics in the model, and as many columns as there are terms in the vocabulary. |
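To illustrate the log(0) issue mentioned above, here is a minimal base R sketch, not the package's actual code. It assumes the topic distances are Jensen-Shannon divergences between rows of phi and that topics are projected to two dimensions with classical multidimensional scaling. The helper names kl_safe and js_divergence and the toy phi are hypothetical.

```r
# Kullback-Leibler term that treats 0 * log(0 / m) as 0 instead of NaN.
kl_safe <- function(p, m) {
  nz <- p > 0
  sum(p[nz] * log(p[nz] / m[nz]))
}

# Jensen-Shannon divergence between two discrete distributions.
js_divergence <- function(p, q) {
  m <- 0.5 * (p + q)
  0.5 * kl_safe(p, m) + 0.5 * kl_safe(q, m)
}

# Toy phi: 3 topics x 4 terms, each row sums to 1, with exact zeros.
phi <- matrix(c(0.50, 0.50, 0.00, 0.00,
                0.00, 0.50, 0.50, 0.00,
                0.25, 0.00, 0.25, 0.50), nrow = 3, byrow = TRUE)

# Pairwise topic-to-topic divergences.
k <- nrow(phi)
d <- matrix(0, k, k)
for (i in seq_len(k)) {
  for (j in seq_len(k)) {
    d[i, j] <- js_divergence(phi[i, ], phi[j, ])
  }
}

# Project topics to 2 dimensions with classical multidimensional scaling.
coords <- stats::cmdscale(as.dist(d), k = 2)
print(d)
print(coords)
```

Without the `p > 0` guard, any zero entry in phi would produce `0 * log(0) = NaN` and the whole distance matrix would be unusable.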
Creates a Latent Dirichlet Allocation model. At the moment only 'WarpLDA' is implemented. The WarpLDA authors describe it as an LDA sampler which achieves both the best O(1) time complexity per token and the best O(K) scope of random access. They report that, in a wide range of testing conditions, WarpLDA is consistently 5-15x faster than the state-of-the-art Metropolis-Hastings based LightLDA, and is comparable to or faster than the sparsity-aware F+LDA.
LatentDirichletAllocation
LDA
R6Class object.
topic_word_distribution: distribution of words for each topic. Available after model fitting with the model$fit_transform() method.
components: unnormalized word counts for each topic-word entry. Available after model fitting with the model$fit_transform() method.
For usage details see Methods, Arguments and Examples sections.
lda = LDA$new(n_topics = 10L, doc_topic_prior = 50 / n_topics, topic_word_prior = 1 / n_topics)
lda$fit_transform(x, n_iter = 1000, convergence_tol = 1e-3, n_check_convergence = 10, progressbar = interactive())
lda$transform(x, n_iter = 1000, convergence_tol = 1e-3, n_check_convergence = 5, progressbar = FALSE)
lda$get_top_words(n = 10, topic_number = 1L:private$n_topics, lambda = 1)
$new(n_topics,
     doc_topic_prior = 50 / n_topics, # alpha
     topic_word_prior = 1 / n_topics, # beta
     method = "WarpLDA")
Constructor for the LDA model. For a description of the arguments see the Arguments section.
$fit_transform(x, n_iter, convergence_tol = -1, n_check_convergence = 0, progressbar = interactive())
Fits the LDA model to the input matrix x and transforms the input documents into topic space. The result is a matrix in which each row represents the corresponding document. The values in a row form a distribution over topics.
$transform(x, n_iter, convergence_tol = -1, n_check_convergence = 0, progressbar = FALSE)
Transforms new documents into topic space. The result is a matrix in which each row is the distribution of a document over the latent topic space.
$get_top_words(n = 10, topic_number = 1L:private$n_topics, lambda = 1)
Returns the "top words" for a given topic (or several topics). Words for each topic can be sorted by the probability of observing the word in the given topic (lambda = 1) or by "relevance", which also takes into account the frequency of the word in the corpus (lambda < 1). From our experience, in most cases setting 0.2 < lambda < 0.4 works well. See http://nlp.stanford.edu/events/illvi2014/papers/sievert-illvi2014.pdf for details.
$plot(lambda.step = 0.1, reorder.topics = FALSE, ...)
Plots the LDA model using the LDAvis package (https://cran.r-project.org/package=LDAvis). ... will be passed to the LDAvis::createJSON and LDAvis::serVis functions.
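The effect of lambda in get_top_words can be sketched in base R. This follows the relevance definition from the Sievert and Shirley paper linked above: lambda * log(p(w | t)) + (1 - lambda) * log(p(w | t) / p(w)). The helper name relevance, the toy topic-word matrix and the marginal term probabilities p_w are assumptions made for illustration, and this is not the package's internal code.

```r
# Toy topic-word distribution: 2 topics x 3 terms, each row sums to 1.
phi <- matrix(c(0.70, 0.25, 0.05,
                0.60, 0.10, 0.30), nrow = 2, byrow = TRUE,
              dimnames = list(NULL, c("film", "plot", "alien")))

# Assumed marginal probability of each term in the whole corpus.
p_w <- c(film = 0.65, plot = 0.20, alien = 0.15)

# Relevance of every term for one topic (phi_k is one row of phi).
relevance <- function(phi_k, p_w, lambda) {
  lambda * log(phi_k) + (1 - lambda) * log(phi_k / p_w)
}

# lambda = 1 ranks terms by their probability within the topic.
by_prob <- names(sort(relevance(phi[2, ], p_w, lambda = 1), decreasing = TRUE))

# A smaller lambda promotes terms that are distinctive for the topic.
by_relevance <- names(sort(relevance(phi[2, ], p_w, lambda = 0.3), decreasing = TRUE))

print(by_prob)
print(by_relevance)
```

In this toy data "film" is the most probable term in both topics, so it tops the lambda = 1 ranking for topic 2, while lambda = 0.3 moves "alien" to the top because it is much more frequent in topic 2 than in the corpus overall.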
An LDA object
An input document-term matrix (should have column names = terms). A CSR RsparseMatrix is used internally; other formats will be converted to CSR via an as() function call where possible.
integer desired number of latent topics. Also known as K
numeric prior for document-topic multinomial distribution. Also known as alpha
numeric prior for topic-word multinomial distribution. Also known as eta
integer number of sampling iterations while fitting model
integer number of iterations used when sampling from converged model for inference. In other words, the number of samples from the distribution after burn-in.
defines how often to calculate the score to check convergence
numeric = -1 defines the early stopping strategy. Fitting stops when one of the following two conditions is satisfied: (a) all iterations have been used, or (b) score_previous_check / score_current < 1 + convergence_tol
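Stopping rule (b) can be written as a one-line check. The sketch below uses a hypothetical helper named converged and assumes, for illustration only, that the score is a negative quantity such as a log-likelihood, so that the ratio of two consecutive scores is positive.

```r
# Hypothetical helper that mirrors condition (b) above.
converged <- function(score_previous_check, score_current, convergence_tol) {
  score_previous_check / score_current < 1 + convergence_tol
}

# Large improvement between checks: ratio is about 1.111, keep fitting.
print(converged(-1000, -900, convergence_tol = 1e-3))

# Tiny improvement between checks: ratio is about 1.0006, stop.
print(converged(-900.5, -900, convergence_tol = 1e-3))

# With convergence_tol = -1 the threshold is 0, so a positive ratio
# never triggers early stopping and all iterations are used.
print(converged(-900.5, -900, convergence_tol = -1))
```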
## Not run: library(text2vec) data("movie_review") N = 500 tokens = word_tokenizer(tolower(movie_review$review[1:N])) it = itoken(tokens, ids = movie_review$id[1:N]) v = create_vocabulary(it) v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2) dtm = create_dtm(it, vocab_vectorizer(v)) lda_model = LDA$new(n_topics = 10) doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 20) # run LDAvis visualisation if needed (make sure LDAvis package installed) # lda_model$plot() ## End(Not run)## Not run: library(text2vec) data("movie_review") N = 500 tokens = word_tokenizer(tolower(movie_review$review[1:N])) it = itoken(tokens, ids = movie_review$id[1:N]) v = create_vocabulary(it) v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2) dtm = create_dtm(it, vocab_vectorizer(v)) lda_model = LDA$new(n_topics = 10) doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 20) # run LDAvis visualisation if needed (make sure LDAvis package installed) # lda_model$plot() ## End(Not run)
Creates an LSA (latent semantic analysis) model. See https://en.wikipedia.org/wiki/Latent_semantic_analysis for details.
LatentSemanticAnalysis
LSA
R6Class object.
For usage details see Methods, Arguments and Examples sections.
lsa = LatentSemanticAnalysis$new(n_topics)
lsa$fit_transform(x, ...)
lsa$transform(x, ...)
lsa$components
$new(n_topics)
Creates an LSA model with n_topics latent topics.
$fit_transform(x, ...)
Fits the model to an input sparse matrix (preferably in dgCMatrix format) and then transforms x to the latent space.
$transform(x, ...)
Transforms new data x to the latent space.
An LSA object.
An input document-term matrix. Preferably in dgCMatrix format
integer desired number of latent topics.
Arguments to internal functions. Notably useful for fit_transform() -
these arguments will be passed to rsparse::soft_svd
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer(2**10))
n_topics = 5
lsa_1 = LatentSemanticAnalysis$new(n_topics)
d1 = lsa_1$fit_transform(dtm)
# the same, but wrapped with S3 methods
d2 = fit_transform(dtm, lsa_1)
The labeled dataset consists of 5000 IMDB movie reviews, specially selected for sentiment analysis. The sentiment of the reviews is binary, meaning an IMDB rating < 5 results in a sentiment score of 0, and a rating >=7 has a sentiment score of 1. No individual movie has more than 30 reviews. Important note: we removed non ASCII symbols from the original dataset to satisfy CRAN policy.
data("movie_review")
A data frame with 5000 rows and 3 variables:
id: Unique ID of each review
sentiment: Sentiment of the review; 1 for positive reviews and 0 for negative reviews
review: Text of the review (UTF-8)
http://ai.stanford.edu/~amaas/data/sentiment/
Normalizes matrix rows using the given norm.
normalize(m, norm = c("l1", "l2", "none"))
m |
|
norm |
|
normalized matrix
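As a rough illustration of what row normalization means, the following base R sketch applies the usual L1 and L2 definitions to a small dense matrix. It is not the package's implementation, which is aimed at sparse matrices. The variable names are made up for this example.

```r
# Toy dense matrix: 2 rows (documents), 3 columns (terms).
m <- matrix(c(1, 2, 1,
              3, 0, 4), nrow = 2, byrow = TRUE)

# L1: divide each row by the sum of the absolute values in that row.
m_l1 <- m / rowSums(abs(m))

# L2: divide each row by its Euclidean length.
m_l2 <- m / sqrt(rowSums(m^2))

print(rowSums(m_l1))    # every row now sums to 1
print(rowSums(m_l2^2))  # every row now has unit length
```

Note that a row consisting only of zeros would produce NaN in this sketch, so real code needs to handle empty documents separately.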
Given a document-term matrix, a topic-word distribution and a document-topic distribution, calculates perplexity.
perplexity(X, topic_word_distribution, doc_topic_distribution)
X |
sparse document-term matrix which contains terms counts. Internally |
topic_word_distribution |
dense matrix for topic-word distribution. Number of rows = |
doc_topic_distribution |
dense matrix for document-topic distribution. Number of rows = |
## Not run:
library(text2vec)
data("movie_review")
n_iter = 10
train_ind = 1:200
ids = movie_review$id[train_ind]
txt = tolower(movie_review[['review']][train_ind])
names(txt) = ids
tokens = word_tokenizer(txt)
it = itoken(tokens, progressbar = FALSE, ids = ids)
vocab = create_vocabulary(it)
vocab = prune_vocabulary(vocab, term_count_min = 5, doc_proportion_min = 0.02)
dtm = create_dtm(it, vectorizer = vocab_vectorizer(vocab))
n_topic = 10
model = LDA$new(n_topic, doc_topic_prior = 0.1, topic_word_prior = 0.01)
doc_topic_distr = model$fit_transform(dtm, n_iter = n_iter, n_check_convergence = 1, convergence_tol = -1, progressbar = FALSE)
topic_word_distr_10 = model$topic_word_distribution
perplexity(dtm, topic_word_distr_10, doc_topic_distr)
## End(Not run)
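For readers who want to see the quantity itself, here is a small base R sketch of the textbook definition of perplexity: the exponential of the negative log-likelihood per token. It uses dense toy matrices with made-up values and is assumed, not verified, to match what the function computes on sparse input.

```r
# X: 2 documents x 3 terms, term counts.
X <- matrix(c(2, 1, 0,
              0, 1, 3), nrow = 2, byrow = TRUE)

# doc_topic: 2 documents x 2 topics, each row sums to 1.
doc_topic <- matrix(c(0.9, 0.1,
                      0.2, 0.8), nrow = 2, byrow = TRUE)

# topic_word: 2 topics x 3 terms, each row sums to 1.
topic_word <- matrix(c(0.6, 0.3, 0.1,
                       0.1, 0.2, 0.7), nrow = 2, byrow = TRUE)

# p[d, w] = sum over topics k of doc_topic[d, k] * topic_word[k, w]
p <- doc_topic %*% topic_word

# Log-likelihood over the observed (non-zero) counts only.
nz <- X > 0
log_lik <- sum(X[nz] * log(p[nz]))

# Perplexity = exp(-log-likelihood / total number of tokens).
perp <- exp(-log_lik / sum(X))
print(perp)
```

Lower values are better; a model that assigned uniform probability to the 3 terms would score exactly 3 on this toy data.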
This function prepares a list of questions from a
questions-words.txt format. For full examples see GloVe.
prepare_analogy_questions(questions_file_path, vocab_terms)
questions_file_path |
|
vocab_terms |
|
Print a vocabulary.
## S3 method for class 'text2vec_vocabulary'
print(x, ...)
x |
vocabulary |
... |
optional arguments to print methods. |
This function filters the input vocabulary and throws out very frequent and very infrequent terms. See the examples for the vocabulary function. The parameter vocab_term_max can also be used to limit the absolute size of the vocabulary to only the most frequently used terms.
prune_vocabulary(vocabulary, term_count_min = 1L, term_count_max = Inf, doc_proportion_min = 0, doc_proportion_max = 1, doc_count_min = 1L, doc_count_max = Inf, vocab_term_max = Inf)
vocabulary |
a vocabulary from the vocabulary function. |
term_count_min |
minimum number of occurrences over all documents. |
term_count_max |
maximum number of occurrences over all documents. |
doc_proportion_min |
minimum proportion of documents which should contain term. |
doc_proportion_max |
maximum proportion of documents which should contain term. |
doc_count_min |
the term is kept if the number of documents containing it is larger than this value |
doc_count_max |
the term is kept if the number of documents containing it is smaller than this value |
vocab_term_max |
maximum number of terms in vocabulary. |
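The filters combine as simple row selections on a table of per-term statistics. The base R sketch below uses a hypothetical data frame whose column names (term, term_count, doc_count) and inclusive bounds are assumptions made for illustration. It does not call prune_vocabulary itself.

```r
# Hypothetical vocabulary table with per-term statistics.
vocab <- data.frame(
  term       = c("the", "movie", "plot", "zyzzyva"),
  term_count = c(500, 120, 40, 1),
  doc_count  = c(100, 80, 30, 1),
  stringsAsFactors = FALSE
)
n_docs <- 100  # number of documents the counts were collected from

# Analogous to term_count_min = 5 and doc_proportion_max = 0.9.
keep <- vocab$term_count >= 5 & vocab$doc_count / n_docs <= 0.9
pruned <- vocab[keep, ]

# Analogous to vocab_term_max = 1: keep only the most frequent remaining term.
pruned <- pruned[order(pruned$term_count, decreasing = TRUE), ]
pruned <- head(pruned, 1)
print(pruned$term)
```

Here "the" is dropped because it appears in every document, "zyzzyva" is dropped because it is too rare, and the size limit then keeps only "movie".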
The RWMD model can be used to query the "relaxed word mover's distance" from a document to a collection of documents. RWMD measures the distance between a query document and each document in a collection by calculating how hard it is to transform the words of the query document into the words of that document. For more detail see the following article: http://mkusner.github.io/publications/WMD.pdf. In contrast to that article, we calculate the "easiness" of converting one word into another using cosine similarity rather than Euclidean distance. text2vec also implements an efficient RWMD using the tricks from the article "Linear-Complexity Relaxed Word Mover's Distance with GPU Acceleration": https://arxiv.org/abs/1711.07227
RelaxedWordMoversDistance
RWMD
R6Class object.
For usage details see Methods, Arguments and Examples sections.
rwmd = RelaxedWordMoversDistance$new(x, embeddings)
rwmd$sim2(x)
$new(x, embeddings)
Constructor for the RWMD model. x is the document-term matrix which represents the collection of documents against which you want to perform queries. embeddings is the matrix of word embeddings which will be used to calculate similarities between words (each row represents a word vector).
$sim2(x)
Calculates the similarity from a collection of documents to a collection of query documents x. Here x is a document-term matrix which represents the set of query documents.
$dist(x)
Calculates the distance from a collection of documents to a collection of query documents x. Here x is a document-term matrix which represents the set of query documents.
## Not run:
library(text2vec)
library(rsparse)
data("movie_review")
tokens = word_tokenizer(tolower(movie_review$review))
v = create_vocabulary(itoken(tokens))
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.5)
it = itoken(tokens)
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
tcm = create_tcm(it, vectorizer, skip_grams_window = 5)
glove_model = GloVe$new(rank = 50, x_max = 10)
wv = glove_model$fit_transform(tcm, n_iter = 5)
# sum main and context vectors as proposed in the GloVe paper
wv = wv + t(glove_model$components)
rwmd_model = RelaxedWordMoversDistance$new(dtm, wv)
rwms = rwmd_model$sim2(dtm[1:10, ])
head(sort(rwms[1, ], decreasing = TRUE))
## End(Not run)
sim2 calculates pairwise similarities between the
rows of two data matrices. Note that some methods work only on sparse matrices and
others work only on dense matrices.
psim2 calculates "parallel" similarities between the rows of two data matrices.
sim2(x, y = NULL, method = c("cosine", "jaccard"), norm = c("l2", "none"))
psim2(x, y, method = c("cosine", "jaccard"), norm = c("l2", "none"))
x |
first matrix. |
y |
second matrix. For |
method |
|
norm |
|
Computes the similarity matrix using the given method.
psim2 takes two matrices and returns a single vector giving the "parallel" similarities of the vectors.
sim2 returns a matrix of similarities between each row of matrix x and each row of matrix y.
psim2 returns a vector of "parallel" similarities between the rows of x and y.
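The difference between pairwise and "parallel" similarities can be shown with plain cosine similarity on dense matrices. This is a base R sketch of the concept only and does not call sim2 or psim2. The helper name l2_rows and the toy matrices are made up for illustration.

```r
# Two toy dense matrices with the same number of rows and columns.
x <- matrix(c(1, 0,
              1, 1), nrow = 2, byrow = TRUE)
y <- matrix(c(1, 0,
              0, 2), nrow = 2, byrow = TRUE)

# Scale every row to unit L2 length.
l2_rows <- function(m) m / sqrt(rowSums(m^2))
xn <- l2_rows(x)
yn <- l2_rows(y)

# Pairwise: an nrow(x) by nrow(y) matrix; entry [i, j] compares x[i, ] with y[j, ].
pairwise <- xn %*% t(yn)

# Parallel: a vector of length nrow(x); entry i compares x[i, ] with y[i, ].
parallel <- rowSums(xn * yn)

print(pairwise)
print(parallel)
```

The parallel result is the diagonal of the pairwise matrix, computed without materializing the full matrix, which is why the parallel form requires x and y to have the same number of rows.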
This function splits a vector into n parts of roughly
equal size. These splits can be used for parallel processing. In general,
n should be equal to the number of jobs you want to run, which
should be the number of cores you want to use.
split_into(vec, n)
vec |
input vector |
n |
|
list with n elements, each of roughly equal length
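One way to produce such splits in base R is sketched below. The helper split_roughly is hypothetical and is not the package's implementation. It assumes n is no larger than the length of the vector.

```r
# Hypothetical helper: split vec into n contiguous parts of near-equal size.
split_roughly <- function(vec, n) {
  # Map position i to a group number between 1 and n.
  group <- ceiling(seq_along(vec) * n / length(vec))
  unname(split(vec, group))
}

parts <- split_roughly(1:10, 3)
print(lengths(parts))  # group sizes: 3 3 4
print(parts)
```

Each part could then be handed to a separate worker, for example one chunk of file paths per core.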
Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
To learn more about text2vec visit project website: http://text2vec.org
Or start with the vignettes:
browseVignettes(package = "text2vec")
Creates a TfIdf (term frequency-inverse document frequency) model.
"smooth" IDF (default) is defined as follows: idf = log(1 + (# documents in the corpus) / (# documents where the term appears) )
"non-smooth" IDF is defined as follows: idf = log((# documents in the corpus) / (# documents where the term appears) )
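The two IDF definitions above can be checked by hand on a tiny example. The base R sketch below applies the formulas exactly as written to a made-up dense document-term matrix. It does not use the TfIdf class, and the variable names are illustrative.

```r
# Toy document-term matrix: 4 documents x 3 terms (counts are made up).
dtm <- matrix(c(1, 0, 2,
                1, 1, 0,
                1, 0, 0,
                1, 3, 0), nrow = 4, byrow = TRUE,
              dimnames = list(NULL, c("the", "cat", "zebra")))

n_docs <- nrow(dtm)

# Number of documents in which each term appears at least once.
doc_freq <- colSums(dtm > 0)

# "smooth" IDF: log(1 + n_docs / doc_freq)
idf_smooth <- log(1 + n_docs / doc_freq)

# "non-smooth" IDF: log(n_docs / doc_freq)
idf_non_smooth <- log(n_docs / doc_freq)

print(rbind(doc_freq, idf_smooth, idf_non_smooth))
```

A term that occurs in every document ("the" here) gets a non-smooth IDF of exactly zero and so vanishes from the weighted matrix, while the smooth variant keeps a small positive weight for it.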
TfIdf
R6Class object.
Term Frequency Inverse Document Frequency
For usage details see Methods, Arguments and Examples sections.
tfidf = TfIdf$new(smooth_idf = TRUE, norm = c('l1', 'l2', 'none'), sublinear_tf = FALSE)
tfidf$fit_transform(x)
tfidf$transform(x)
$new(smooth_idf = TRUE, norm = c("l1", "l2", "none"), sublinear_tf = FALSE)
Creates a tf-idf model.
$fit_transform(x)
Fits the model to an input sparse matrix (preferably in "dgCMatrix" format) and then transforms it.
$transform(x)
Transforms new data x using the tf-idf weights learned from the training data.
A TfIdf object
An input document-term matrix. Preferably in dgCMatrix format
smooth_idf = TRUE: smooth IDF weights by adding one to document frequencies, as if an extra document was seen containing every term in the collection exactly once.
norm = c("l1", "l2", "none"): type of normalization to apply to term vectors. "l1" by default, i.e., scale by the number of words in the document.
sublinear_tf = FALSE: apply sublinear term-frequency scaling, i.e., replace the term frequency with 1 + log(TF)
data("movie_review")
N = 100
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
dtm = create_dtm(itoken(tokens), hash_vectorizer())
model_tfidf = TfIdf$new()
dtm_tfidf = model_tfidf$fit_transform(dtm)
A few simple tokenization functions. For a more comprehensive list see the tokenizers package: https://cran.r-project.org/package=tokenizers.
Also check stringi::stri_split_*.
word_tokenizer(strings, ...)
char_tokenizer(strings, ...)
space_tokenizer(strings, sep = " ", xptr = FALSE, ...)
postag_lemma_tokenizer(strings, udpipe_model, tagger = "default", tokenizer = "tokenizer", pos_keep = character(0), pos_remove = c("PUNCT", "DET", "ADP", "SYM", "PART", "SCONJ", "CCONJ", "AUX", "X", "INTJ"))
strings |
|
... |
other parameters (usually not used - see source code for details). |
sep |
|
xptr |
|
udpipe_model |
- udpipe model, can be loaded with |
tagger |
|
tokenizer |
|
pos_keep |
|
pos_remove |
|
list of character vectors. Each element of list contains vector of tokens.
doc = c("first second", "bla, bla, blaa")
# split by words
word_tokenizer(doc)
# faster, but far less general - perform split by a fixed single whitespace symbol.
space_tokenizer(doc, " ")
This function creates an object (closure) which defines how to transform a list of tokens into vector space, i.e. how to map words to indices. It is supposed to be used only as an argument to create_dtm, create_tcm, create_vocabulary.
vocab_vectorizer(vocabulary)
hash_vectorizer(hash_size = 2^18, ngram = c(1L, 1L), signed_hash = FALSE)
vocabulary |
|
hash_size |
|
ngram |
|
signed_hash |
|
A vectorizer object (closure).
create_dtm create_tcm create_vocabulary
data("movie_review")
N = 100
vectorizer = hash_vectorizer(2 ^ 18, c(1L, 2L))
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer, n_chunks = 10)
hash_dtm = create_dtm(it, vectorizer)
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer, n_chunks = 10)
v = create_vocabulary(it, c(1L, 1L))
vectorizer = vocab_vectorizer(v)
it = itoken(movie_review$review[1:N], preprocessor = tolower, tokenizer = word_tokenizer, n_chunks = 10)
dtm = create_dtm(it, vectorizer)