Package: text2vec 0.6.4

text2vec: Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarity computation. The package provides a source-agnostic streaming API that lets researchers analyze document collections larger than available RAM. All core functions are parallelized to benefit from multicore machines.

Authors: Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph], Qing Wang [aut, cph]

text2vec_0.6.4.tar.gz
text2vec_0.6.4.zip (r-4.5), text2vec_0.6.4.zip (r-4.4), text2vec_0.6.4.zip (r-4.3)
text2vec_0.6.4.tgz (r-4.5-x86_64), text2vec_0.6.4.tgz (r-4.5-arm64), text2vec_0.6.4.tgz (r-4.4-x86_64), text2vec_0.6.4.tgz (r-4.4-arm64), text2vec_0.6.4.tgz (r-4.3-x86_64), text2vec_0.6.4.tgz (r-4.3-arm64)
text2vec_0.6.4.tar.gz (r-4.5-noble), text2vec_0.6.4.tar.gz (r-4.4-noble)
text2vec_0.6.4.tgz (r-4.4-emscripten), text2vec_0.6.4.tgz (r-4.3-emscripten)
text2vec.pdf | text2vec.html
text2vec/json (API)
NEWS

# Install 'text2vec' in R:
install.packages('text2vec', repos = c('https://dselivanov.r-universe.dev', 'https://cloud.r-project.org'))
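
Once the package is installed, the vectorization workflow named in the description can be sketched roughly as follows. This is a minimal illustration, not the vignette's exact code: it assumes the bundled movie_review dataset, and the 500-document subset and the term_count_min = 5L threshold are arbitrary choices made to keep the run short.

```r
library(text2vec)

# movie_review is the IMDB review dataset shipped with text2vec
data("movie_review")

# Assumption: a 500-document subset, chosen only to keep the sketch fast
reviews <- movie_review[1:500, ]

# itoken() creates an iterator over the tokenized documents
it <- itoken(reviews$review,
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             ids = reviews$id,
             progressbar = FALSE)

# Build the vocabulary, then drop rare terms (threshold is illustrative)
vocab <- create_vocabulary(it)
vocab <- prune_vocabulary(vocab, term_count_min = 5L)

# Map documents onto the vocabulary to get a sparse document-term matrix
vectorizer <- vocab_vectorizer(vocab)
dtm <- create_dtm(it, vectorizer)

# Optionally re-weight the raw counts with TF-IDF
tfidf <- TfIdf$new()
dtm_tfidf <- fit_transform(dtm, tfidf)

dim(dtm)
```

The same iterator is reused for the vocabulary pass and the matrix pass. For collections that do not fit in RAM, the iterator would presumably be built from files on disk (see idir and ifiles in the help manual) rather than from an in-memory character vector.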

Bug tracker: https://github.com/dselivanov/text2vec/issues

Uses libs:
  • c++ – GNU Standard C++ Library v3
Datasets:
  • movie_review – IMDB movie reviews

Topics: glove, latent-dirichlet-allocation, natural-language-processing, text-mining, topic-modeling, vectorization, word-embeddings, word2vec, cpp

13.52 score, 855 stars, 25 packages, 1.3k scripts, 8.3k downloads, 4 mentions, 42 exports, 14 dependencies

Last updated 6 months ago from: bea438b105. Checks: 7 OK, 4 NOTE. Indexed: yes.

Target               Result   Latest binary
Doc / Vignettes      OK       Feb 12 2025
R-4.5-win-x86_64     NOTE     Feb 12 2025
R-4.5-mac-x86_64     NOTE     Feb 12 2025
R-4.5-mac-aarch64    NOTE     Feb 12 2025
R-4.5-linux-x86_64   NOTE     Feb 12 2025
R-4.4-win-x86_64     OK       Feb 12 2025
R-4.4-mac-x86_64     OK       Feb 12 2025
R-4.4-mac-aarch64    OK       Feb 12 2025
R-4.3-win-x86_64     OK       Feb 12 2025
R-4.3-mac-x86_64     OK       Feb 12 2025
R-4.3-mac-aarch64    OK       Feb 12 2025

Exports: as.lda_c, BNS, char_tokenizer, check_analogy_accuracy, coherence, Collocations, combine_vocabularies, create_dtm, create_tcm, create_vocabulary, dist2, fit, fit_transform, GlobalVectors, GloVe, hash_vectorizer, idir, ifiles, ifiles_parallel, itoken, itoken_parallel, jsPCA_robust, LatentDirichletAllocation, LatentSemanticAnalysis, LDA, LSA, normalize, pdist2, perplexity, postag_lemma_tokenizer, prepare_analogy_questions, prune_vocabulary, psim2, RelaxedWordMoversDistance, RWMD, sim2, space_tokenizer, split_into, TfIdf, vocab_vectorizer, vocabulary, word_tokenizer

Dependencies: data.table, digest, float, lattice, lgr, Matrix, MatrixExtra, mlapi, R6, Rcpp, RcppArmadillo, RhpcBLASctl, rsparse, stringi

Analyzing Texts with the text2vec package

Rendered from text-vectorization.Rmd using knitr::knitr on Feb 12 2025.

Last update: 2023-11-13
Started: 2016-01-10

GloVe Word Embeddings

Rendered from glove.Rmd using knitr::knitr on Feb 12 2025.

Last update: 2023-11-13
Started: 2016-01-10
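
As a rough idea of what fitting GloVe embeddings with the exported functions might look like, here is a hedged sketch. It is not the vignette's code: it assumes the bundled movie_review dataset in place of a large corpus, and the values for rank, x_max, n_iter, the window size, and the 1000-document subset are illustrative rather than tuned.

```r
library(text2vec)

data("movie_review")

# Assumption: a 1000-document subset keeps the example quick to run
it <- itoken(movie_review$review[1:1000],
             preprocessor = tolower,
             tokenizer = word_tokenizer,
             progressbar = FALSE)

vocab <- prune_vocabulary(create_vocabulary(it), term_count_min = 5L)
vectorizer <- vocab_vectorizer(vocab)

# Term-co-occurrence matrix built with a 5-word context window
tcm <- create_tcm(it, vectorizer, skip_grams_window = 5L)

# GlobalVectors is re-exported from rsparse; hyperparameters are illustrative
glove <- GlobalVectors$new(rank = 20, x_max = 10)
wv_main <- glove$fit_transform(tcm, n_iter = 5)
wv_context <- glove$components

# Sum the main and context vectors into one embedding per term
word_vectors <- wv_main + t(wv_context)

# Cosine similarity of every term to "good"
sims <- sim2(word_vectors,
             word_vectors["good", , drop = FALSE],
             method = "cosine", norm = "l2")
head(sort(sims[, 1], decreasing = TRUE), 5)
```

With a corpus this small the nearest neighbours are unlikely to be meaningful; the point is only the order of calls: iterator, vocabulary, co-occurrence matrix, model fit, similarity query.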

Readme and manuals

Help Manual

Help page | Topics
Converts document-term matrix sparse matrix to 'lda_c' format | as.lda_c
BNS | BNS
Checks accuracy of word embeddings on the analogy task | check_analogy_accuracy
Coherence metrics for topic models | coherence
Collocations model. | Collocations
Combines multiple vocabularies into one | combine_vocabularies
Document-term matrix construction | create_dtm create_dtm.itoken create_dtm.itoken_parallel
Term-co-occurrence matrix construction | create_tcm create_tcm.itoken create_tcm.itoken_parallel
Creates a vocabulary of unique terms | create_vocabulary create_vocabulary.character create_vocabulary.itoken create_vocabulary.itoken_parallel vocabulary
Pairwise Distance Matrix Computation | dist2 distances pdist2
re-export rsparse::GloVe | GlobalVectors GloVe
Creates iterator over text files from the disk | idir ifiles ifiles_parallel
Iterators (and parallel iterators) over input objects | itoken itoken.character itoken.iterator itoken.list itoken_parallel itoken_parallel.character itoken_parallel.iterator itoken_parallel.list
(numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components | jsPCA_robust
Creates Latent Dirichlet Allocation model. | LatentDirichletAllocation LDA
Latent Semantic Analysis model | LatentSemanticAnalysis LSA
IMDB movie reviews | movie_review
Matrix normalization | normalize
Perplexity of a topic model | perplexity
Prepares list of analogy questions | prepare_analogy_questions
Printing Vocabulary | print.text2vec_vocabulary
Prune vocabulary | prune_vocabulary
Creates Relaxed Word Movers Distance (RWMD) model | RelaxedWordMoversDistance RWMD
Pairwise Similarity Matrix Computation | psim2 sim2 similarities
Split a vector for parallel processing | split_into
text2vec | text2vec-package text2vec
TfIdf | TfIdf
Simple tokenization functions for string splitting | char_tokenizer postag_lemma_tokenizer space_tokenizer tokenizers word_tokenizer
Vocabulary and hash vectorizers | hash_vectorizer vectorizers vocab_vectorizer