Package: text2vec 0.6.4

text2vec: Modern Text Mining Framework for R

Fast and memory-friendly tools for text vectorization, topic modeling (LDA, LSA), word embeddings (GloVe), and similarity computation. The package provides a source-agnostic streaming API that lets researchers analyze document collections larger than available RAM. All core functions are parallelized to benefit from multicore machines.

Authors: Dmitriy Selivanov [aut, cre, cph], Manuel Bickel [aut, cph], Qing Wang [aut, cph]

text2vec_0.6.4.tar.gz
text2vec_0.6.4.zip (r-4.5), text2vec_0.6.4.zip (r-4.4), text2vec_0.6.4.zip (r-4.3)
text2vec_0.6.4.tgz (r-4.4-x86_64), text2vec_0.6.4.tgz (r-4.4-arm64), text2vec_0.6.4.tgz (r-4.3-x86_64), text2vec_0.6.4.tgz (r-4.3-arm64)
text2vec_0.6.4.tar.gz (r-4.5-noble), text2vec_0.6.4.tar.gz (r-4.4-noble)
text2vec_0.6.4.tgz (r-4.4-emscripten), text2vec_0.6.4.tgz (r-4.3-emscripten)
text2vec.pdf | text2vec.html
text2vec/json (API)
NEWS

# Install 'text2vec' in R:
install.packages('text2vec', repos = c('https://dselivanov.r-universe.dev', 'https://cloud.r-project.org'))
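
Once the package is installed, a typical first step is to turn raw text into a document-term matrix. The following is a minimal sketch, not taken from this page: the three-document corpus `docs` and its names are made up for illustration, and the calls used (`itoken`, `create_vocabulary`, `vocab_vectorizer`, `create_dtm`, `word_tokenizer`) are among the exports listed further down.

```r
library(text2vec)

# Hypothetical toy corpus; names serve as document ids.
docs <- c(
  doc1 = "The quick brown fox",
  doc2 = "The lazy dog",
  doc3 = "A quick dog"
)

# Iterator over tokens: lowercase each document, then split it into words.
it <- itoken(
  docs,
  preprocessor = tolower,
  tokenizer = word_tokenizer,
  ids = names(docs),
  progressbar = FALSE
)

# Collect the unique terms, then map each term to a column index.
vocab <- create_vocabulary(it)
vectorizer <- vocab_vectorizer(vocab)

# Sparse document-term matrix: one row per document, one column per term.
dtm <- create_dtm(it, vectorizer)
print(dim(dtm))
```

The same iterator is passed to both `create_vocabulary` and `create_dtm`, which mirrors the streaming design described above: the text is consumed in chunks rather than held as one large token list.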

Bug tracker: https://github.com/dselivanov/text2vec/issues

Uses libs:
  • c++ – GNU Standard C++ Library v3

glove, latent-dirichlet-allocation, natural-language-processing, text-mining, topic-modeling, vectorization, word-embeddings, word2vec

13.62 score, 852 stars, 21 packages, 1.2k scripts, 6.5k downloads, 4 mentions, 42 exports, 14 dependencies

Last updated 3 months ago from:bea438b105. Checks: OK: 7, NOTE: 2. Indexed: yes.

Target | Result | Date
Doc / Vignettes | OK | Nov 14 2024
R-4.5-win-x86_64 | NOTE | Nov 14 2024
R-4.5-linux-x86_64 | NOTE | Nov 14 2024
R-4.4-win-x86_64 | OK | Nov 14 2024
R-4.4-mac-x86_64 | OK | Nov 14 2024
R-4.4-mac-aarch64 | OK | Nov 14 2024
R-4.3-win-x86_64 | OK | Nov 14 2024
R-4.3-mac-x86_64 | OK | Nov 14 2024
R-4.3-mac-aarch64 | OK | Nov 14 2024

Exports: as.lda_c, BNS, char_tokenizer, check_analogy_accuracy, coherence, Collocations, combine_vocabularies, create_dtm, create_tcm, create_vocabulary, dist2, fit, fit_transform, GlobalVectors, GloVe, hash_vectorizer, idir, ifiles, ifiles_parallel, itoken, itoken_parallel, jsPCA_robust, LatentDirichletAllocation, LatentSemanticAnalysis, LDA, LSA, normalize, pdist2, perplexity, postag_lemma_tokenizer, prepare_analogy_questions, prune_vocabulary, psim2, RelaxedWordMoversDistance, RWMD, sim2, space_tokenizer, split_into, TfIdf, vocab_vectorizer, vocabulary, word_tokenizer

Dependencies: data.table, digest, float, lattice, lgr, Matrix, MatrixExtra, mlapi, R6, Rcpp, RcppArmadillo, RhpcBLASctl, rsparse, stringi

Analyzing Texts with the text2vec package

Rendered from text-vectorization.Rmd using knitr::knit on Nov 14 2024.

Last update: 2023-11-13
Started: 2016-01-10

GloVe Word Embeddings

Rendered from glove.Rmd using knitr::knit on Nov 14 2024.

Last update: 2023-11-13
Started: 2016-01-10

Readme and manuals

Help Manual

Help page | Topics
Converts document-term matrix sparse matrix to 'lda_c' format | as.lda_c
BNS | BNS
Checks accuracy of word embeddings on the analogy task | check_analogy_accuracy
Coherence metrics for topic models | coherence
Collocations model. | Collocations
Combines multiple vocabularies into one | combine_vocabularies
Document-term matrix construction | create_dtm create_dtm.itoken create_dtm.itoken_parallel
Term-co-occurrence matrix construction | create_tcm create_tcm.itoken create_tcm.itoken_parallel
Creates a vocabulary of unique terms | create_vocabulary create_vocabulary.character create_vocabulary.itoken create_vocabulary.itoken_parallel vocabulary
Pairwise Distance Matrix Computation | dist2 distances pdist2
re-export rsparse::GloVe | GlobalVectors GloVe
Creates iterator over text files from the disk | idir ifiles ifiles_parallel
Iterators (and parallel iterators) over input objects | itoken itoken.character itoken.iterator itoken.list itoken_parallel itoken_parallel.character itoken_parallel.iterator itoken_parallel.list
(numerically robust) Dimension reduction via Jensen-Shannon Divergence & Principal Components | jsPCA_robust
Creates Latent Dirichlet Allocation model. | LatentDirichletAllocation LDA
Latent Semantic Analysis model | LatentSemanticAnalysis LSA
IMDB movie reviews | movie_review
Matrix normalization | normalize
Perplexity of a topic model | perplexity
Prepares list of analogy questions | prepare_analogy_questions
Printing Vocabulary | print.text2vec_vocabulary
Prune vocabulary | prune_vocabulary
Creates Relaxed Word Movers Distance (RWMD) model | RelaxedWordMoversDistance RWMD
Pairwise Similarity Matrix Computation | psim2 sim2 similarities
Split a vector for parallel processing | split_into
text2vec | text2vec-package text2vec
TfIdf | TfIdf
Simple tokenization functions for string splitting | char_tokenizer postag_lemma_tokenizer space_tokenizer tokenizers word_tokenizer
Vocabulary and hash vectorizers | hash_vectorizer vectorizers vocab_vectorizer