Package: text2vec 0.6.4

text2vec: Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), similarities. This package provides a source-agnostic streaming API, which allows researchers to perform analysis of collections of documents which are larger than available RAM. All core functions are parallelized to benefit from multicore machines.

Authors: Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph], Qing Wang [aut, cph]

text2vec_0.6.4.tar.gz
text2vec_0.6.4.zip (r-4.5), text2vec_0.6.4.zip (r-4.4), text2vec_0.6.4.zip (r-4.3)
text2vec_0.6.4.tgz (r-4.5-x86_64), text2vec_0.6.4.tgz (r-4.5-arm64), text2vec_0.6.4.tgz (r-4.4-x86_64), text2vec_0.6.4.tgz (r-4.4-arm64), text2vec_0.6.4.tgz (r-4.3-x86_64), text2vec_0.6.4.tgz (r-4.3-arm64)
text2vec_0.6.4.tar.gz (r-4.5-noble), text2vec_0.6.4.tar.gz (r-4.4-noble)
text2vec_0.6.4.tgz (r-4.4-emscripten), text2vec_0.6.4.tgz (r-4.3-emscripten)
text2vec.pdf | text2vec.html
text2vec/json (API)
NEWS

# Install 'text2vec' in R:

install.packages('text2vec', repos = c('https://dselivanov.r-universe.dev', 'https://cloud.r-project.org'))
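
The package summary above mentions a streaming API for text vectorization. As a rough illustration (a minimal sketch, not the authors' canonical workflow), the following builds a document-term matrix from the bundled movie_review dataset. The subset size of 500 reviews and the pruning threshold of 5 are arbitrary choices made for this example.

```r
library(text2vec)

# movie_review ships with the package (IMDB movie reviews).
data("movie_review")
reviews <- movie_review$review[1:500]   # subset size chosen only for speed
ids     <- movie_review$id[1:500]

# itoken() creates an iterator over tokens; input is processed in chunks,
# so the same pattern applies to collections that are read lazily.
it <- itoken(reviews,
             preprocessor = tolower,
             tokenizer    = word_tokenizer,
             ids          = ids,
             n_chunks     = 10,
             progressbar  = FALSE)

# First pass: collect unique terms, then drop rare ones
# (the threshold 5 is an assumption for this sketch).
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)

# Second pass: map each document onto the vocabulary.
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

dim(dtm)  # documents x terms
```

For text stored in files on disk, ifiles() or idir() can be passed to itoken() in place of the character vector.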

Bug tracker: https://github.com/dselivanov/text2vec/issues

Uses libs:

c++ - GNU Standard C++ Library v3

Datasets:

movie_review - IMDB movie reviews

On CRAN:

Topics: glove latent-dirichlet-allocation natural-language-processing text-mining topic-modeling vectorization word-embeddings word2vec cpp

13.48 score, 860 stars, 23 packages, 1.3k scripts, 8.2k downloads, 4 mentions, 42 exports, 14 dependencies

Last updated 7 months ago from: bea438b105. Checks: 8 OK, 4 NOTE. Indexed: yes.

Target	Result	Latest binary
Doc / Vignettes	OK	Mar 14 2025
R-4.5-win-x86_64	NOTE	Mar 14 2025
R-4.5-mac-x86_64	NOTE	Mar 14 2025
R-4.5-mac-aarch64	NOTE	Mar 14 2025
R-4.5-linux-x86_64	NOTE	Mar 14 2025
R-4.4-win-x86_64	OK	Mar 14 2025
R-4.4-mac-x86_64	OK	Mar 14 2025
R-4.4-mac-aarch64	OK	Mar 14 2025
R-4.4-linux-x86_64	OK	Mar 14 2025
R-4.3-win-x86_64	OK	Mar 14 2025
R-4.3-mac-x86_64	OK	Mar 14 2025
R-4.3-mac-aarch64	OK	Mar 14 2025

Exports: as.lda_c BNS char_tokenizer check_analogy_accuracy coherence Collocations combine_vocabularies create_dtm create_tcm create_vocabulary dist2 fit fit_transform GlobalVectors GloVe hash_vectorizer idir ifiles ifiles_parallel itoken itoken_parallel jsPCA_robust LatentDirichletAllocation LatentSemanticAnalysis LDA LSA normalize pdist2 perplexity postag_lemma_tokenizer prepare_analogy_questions prune_vocabulary psim2 RelaxedWordMoversDistance RWMD sim2 space_tokenizer split_into TfIdf vocab_vectorizer vocabulary word_tokenizer

Dependencies: data.table digest float lattice lgr Matrix MatrixExtra mlapi R6 Rcpp RcppArmadillo RhpcBLASctl rsparse stringi

Analyzing Texts with the text2vec package

Dmitriy Selivanov

Rendered from text-vectorization.Rmd using knitr::knit on Mar 14 2025.

Last update: 2023-11-13
Started: 2016-01-10

GloVe Word Embeddings

Dmitriy Selivanov

Rendered from glove.Rmd using knitr::knit on Mar 14 2025.

Last update: 2023-11-13
Started: 2016-01-10
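
For orientation, a GloVe fit with the exported GlobalVectors class might look roughly like the sketch below. It is an illustration under stated assumptions rather than a reproduction of the vignette: the 1000-review subset, rank = 20, n_iter = 5, and the query word "movie" are all arbitrary choices made here, and a corpus this small will not yield meaningful embeddings.

```r
library(text2vec)

data("movie_review")
it <- itoken(movie_review$review[1:1000],
             preprocessor = tolower,
             tokenizer    = word_tokenizer,
             progressbar  = FALSE)

vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)

# Term-co-occurrence matrix with a symmetric context window of 5 tokens.
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# GlobalVectors is re-exported from rsparse; rank is the embedding dimension.
glove <- GlobalVectors$new(rank = 20, x_max = 10)
wv_main    <- glove$fit_transform(tcm, n_iter = 5, n_threads = 1)
wv_context <- glove$components

# A common choice is to sum main and context vectors.
word_vectors <- wv_main + t(wv_context)

# Cosine similarity of every term to one query term (hypothetical query).
query <- word_vectors["movie", , drop = FALSE]
sims  <- sim2(word_vectors, query, method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 5)
```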

Readme and manuals

Help Manual

Help page	Topics
Converts a sparse document-term matrix to 'lda_c' format	as.lda_c
BNS	BNS
Checks accuracy of word embeddings on the analogy task	check_analogy_accuracy
Coherence metrics for topic models	coherence
Collocations model.	Collocations
Combines multiple vocabularies into one	combine_vocabularies
Document-term matrix construction	create_dtm create_dtm.itoken create_dtm.itoken_parallel
Term-co-occurrence matrix construction	create_tcm create_tcm.itoken create_tcm.itoken_parallel
Creates a vocabulary of unique terms	create_vocabulary create_vocabulary.character create_vocabulary.itoken create_vocabulary.itoken_parallel vocabulary
Pairwise Distance Matrix Computation	dist2 distances pdist2
re-export rsparse::GloVe	GlobalVectors GloVe
Creates iterator over text files from the disk	idir ifiles ifiles_parallel
Iterators (and parallel iterators) over input objects	itoken itoken.character itoken.iterator itoken.list itoken_parallel itoken_parallel.character itoken_parallel.iterator itoken_parallel.list
(numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components	jsPCA_robust
Creates Latent Dirichlet Allocation model.	LatentDirichletAllocation LDA
Latent Semantic Analysis model	LatentSemanticAnalysis LSA
IMDB movie reviews	movie_review
Matrix normalization	normalize
Perplexity of a topic model	perplexity
Prepares list of analogy questions	prepare_analogy_questions
Printing Vocabulary	print.text2vec_vocabulary
Prune vocabulary	prune_vocabulary
Creates Relaxed Word Movers Distance (RWMD) model	RelaxedWordMoversDistance RWMD
Pairwise Similarity Matrix Computation	psim2 sim2 similarities
Split a vector for parallel processing	split_into
text2vec	text2vec-package text2vec
TfIdf	TfIdf
Simple tokenization functions for string splitting	char_tokenizer postag_lemma_tokenizer space_tokenizer tokenizers word_tokenizer
Vocabulary and hash vectorizers	hash_vectorizer vectorizers vocab_vectorizer
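
Several of the topics listed above are typically combined. The sketch below is one possible combination, not a recommended pipeline: it uses hash_vectorizer to skip the vocabulary pass, applies TfIdf weighting, and computes pairwise cosine similarities with sim2. The hash size of 2^14 and the 200-review subset are assumptions made for this example.

```r
library(text2vec)

data("movie_review")
it <- itoken(movie_review$review[1:200],
             preprocessor = tolower,
             tokenizer    = word_tokenizer,
             ids          = movie_review$id[1:200],
             progressbar  = FALSE)

# Feature hashing: no vocabulary is built, terms are hashed into fixed columns.
vectorizer <- hash_vectorizer(hash_size = 2^14, ngram = c(1L, 1L))
dtm <- create_dtm(it, vectorizer)

# TfIdf is a model object; fit_transform() learns IDF weights and applies them.
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)

# Pairwise cosine similarity between all documents.
doc_sims <- sim2(dtm_tfidf, method = "cosine", norm = "l2")
dim(doc_sims)
```

Hashing trades interpretability (columns no longer map back to named terms) for a single pass over the data, which can matter for collections too large to scan twice.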