Changes in version 0.6.6 (2025-12-01)

- fix R CMD check notes in documentation of R6 classes

Changes in version 0.6.5

- fix test discovered with Matrix==1.6-2 release

Changes in version 0.6.4 (2023-11-09)

- update dependency Matrix>=1.5-2, fixes #338

Changes in version 0.6.2 (2022-09-11)

- removed test which is not needed with Matrix package v 1.5

Changes in version 0.6 (2020-02-18)

1. 2019-12-17
    - breaking change - removed construction of a vocabulary in parallel on Windows
    - use rsparse package for SVD and GloVe factorizations
    - updated RWMD implementation (hopefully bug free)
2. 2018-09-10
    - breaking change - changed IDF formula - see #280 for details.
3. 2018-05-28
    - Added postag_lemma_tokenizer() (wrapper around udpipe::udpipe_annotate). Can be used as a drop-in replacement for simpler tokenizers in text2vec.
4. 2018-05-25
    - Made combine_vocabularies() part of public API - see #260 for details.
5. 2018-05-10
    - Added coherence() function for comprehensive coherence metrics. Thanks to Manuel Bickel ( @manuelbickel ) for contribution.
6. 2018-05-02
    - Fixed bug in LSA model - document embeddings are now calculated as left singular vectors multiplied by singular values (not by the square root of the values as before). Thanks to Sloane Simmons ( @singularperturbation )
    - Now fit_transform and transform methods in LDA model produce the same results. Thanks to @jiunsiew for reporting. Also, LDA now has an n_iter_inference parameter. It controls the number of samples from the converged distribution for document-topic inference. This leads to more robust document-topic probabilities (reduced variance). Default value is 10.
7. 2018-01-17
    - more numerically robust PMI, LFMD - thanks to @andland. Also adds iteration number iter to collocation_stat. iter shows the iteration number when collocation stats (and counters) were calculated.

Changes in version 0.5.1 (2018-01-11)

1. 2018-01-10
    - removed rank* columns from collocation_stat - they were never used internally. Users can easily calculate ranks themselves
2. 2018-01-09
    - Added Bi-Normal Separation transformation, thanks to Pavel Shashkin ( @pshashk )
    - Added Dunning's log-likelihood ratio for collocations, thanks to Chris Lee ( @Chrisss93 )
    - Early stopping for collocations learning
3. 2017-12-18
    - fixed several bugs #219 #217 #205
    - decreased number of dependencies - no more magrittr, uuid, tokenizers
    - removed distributed LDA which didn't work correctly
4. 2017-10-18
    - Now tokenization is based on the tokenizers and stringi packages.
    - models API follows the mlapi package. No API changes on text2vec side - we just put abstract scikit-learn-like classes into a separate package in order to make them more reusable.

Changes in version 0.5.0 (2017-08-08)

1. 2017-06-12
    - Add additional filters to prune_vocabulary - filter by document counts
    - Clean up LSA, fixed transform method. Added option to use randomized SVD algorithm from irlba.
2. 2017-05-17
    - Improve dist2 performance for RWMD - incorporate ideas from gensim PR discussion.
3. 2017-05-17
    - API breaking change - vocabulary format change - now plain data.frame with meta-information in attributes (stopwords, ngram, number of docs, etc).
4. 2017-03-25
    - No longer relies on RcppModules
    - API breaking change - removed lda_c from formats in DTM construction
    - added ifiles_parallel, itoken_parallel high-level functions for parallel computing
    - API breaking change - chunks_number parameter renamed to n_chunks
5. 2017-01-02
    - API breaking change - removed create_corpus from public API, moved co-occurrence related options to create_tcm from vectorizers
    - add ability to add custom weights for co-occurrence statistics calculations
6. 2016-12-30
    - Noticeable speedup (1.5x) and even more noticeable improvement in memory usage (2x less!) for create_dtm, create_tcm. Now the package relies on the sparsepp library for underlying hash maps.
7. 2016-10-30
    - Collocations - detection of multi-word phrases using different heuristics - PMI, gensim, LFMD.
8. 2016-10-20
    - Fixed bug in as.lda_c() function

Changes in version 0.4.0 (2016-10-04)

2016-10-03. See 0.4 milestone tags.

1. Now under GPL (>= 2) Licence
2. "immutable" iterators - no need to reinitialize them
3. unified models interface
4. New models: LSA, LDA, GloVe with L1 regularization
5. Fast similarity and distances calculation: Cosine, Jaccard, Relaxed Word Mover's Distance, Euclidean
6. Better handling of UTF-8 strings, thanks to @qinwf
7. iterators and models rely on R6 package

Changes in version 0.3.0 (2016-03-31)

1. 2016-01-13 fix for #46, thanks to @buhrmann for reporting
2. 2016-01-16 format of vocabulary changed.
    - do not keep doc_proportions. see #52.
    - add stop_words argument to prune_vocabulary. signature also was changed.
3. 2016-01-17 fix for #51. if iterator over tokens returns list with names, these names will be:
    - stored as attr(corpus, 'ids')
    - rownames in dtm
    - names for dtm list in lda_c format
4. 2016-02-02 high level function for corpus and vocabulary construction.
    - construction of vocabulary from list of itoken.
    - construction of dtm from list of itoken.
5. 2016-02-10 rename transformers
    - now all transformers start with transform_* - more intuitive + simpler usage with autocompletion
6. 2016-03-29 (accumulated since 2016-02-10)
    - rename vocabulary to create_vocabulary.
    - new functions create_dtm, create_tcm.
    - All core functions are able to benefit from multicore machines (users have to register a parallel backend themselves)
    - Fix for progress bars. Now they are able to reach 100% and ticks increased after computation.
    - ids argument to itoken. Simplifies assignment of ids to rows of DTM
    - create_vocabulary now can handle stopwords
    - see all updates here
7. 2016-03-30 more robust split_into() util.

Changes in version 0.2.0 (2016-01-10)

First CRAN release of text2vec.

- Fast text vectorization with stable streaming API on arbitrary n-grams.
- Functions for vocabulary extraction and management
- Hash vectorizer (based on digest murmurhash3)
- Vocabulary vectorizer
- GloVe algorithm word embeddings.
- Fast term-co-occurrence matrix factorization via parallel async AdaGrad.
- All core functions written in C++.