V-Gram features construction library
About
Vgram is an implementation of a new method for constructing an optimal feature set from sequential data. It builds a dictionary of variable-length n-grams (v-grams) based on the minimum description length principle. The method is a dictionary coder, so it works both as a compression algorithm and as an unsupervised feature extractor. The length of a constructed v-gram is not bounded in advance, and in the reported experiments some v-grams exceed 100 characters. The constructed v-grams can be used for any sequential data analysis and allow bag-of-words techniques to be transferred to non-text data types. The extracted features provide a practical basis for text classification and show competitive results on standard text classification collections without using the text structure. Combining the extracted character v-grams with the words of the original text gave substantially better classification quality than either words or v-grams alone.
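To make the idea of a dictionary coder that doubles as a feature extractor more concrete, here is a minimal, self-contained sketch. It does not use the vgram library and is not the algorithm from the paper: the dictionary is fixed by hand, parsing is a simple greedy longest match, and the description length ignores the cost of storing the dictionary itself. The names `greedy_parse` and `description_length_bits` are hypothetical and exist only for this illustration.

```python
import math
from collections import Counter


def greedy_parse(text, dictionary):
    """Split text left to right into the longest matching dictionary entries.

    Characters not covered by any entry become single-character tokens.
    """
    # Longest entry length; at least 1 so the loop always makes progress.
    max_len = max([1] + [len(gram) for gram in dictionary])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, fall back to one character.
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens


def description_length_bits(tokens):
    """Code length of a token stream under its own empirical distribution."""
    counts = Counter(tokens)
    total = len(tokens)
    return -sum(c * math.log2(c / total) for c in counts.values())


text = "abracadabra abracadabra"

# Baseline: every character is its own token.
char_tokens = greedy_parse(text, set())

# Toy hand-picked dictionary of variable-length grams (an assumption,
# not something learned by the library).
vgram_tokens = greedy_parse(text, {"abra", "cad"})

# Bag-of-v-grams features: token counts, as in bag-of-words.
features = Counter(vgram_tokens)

print(vgram_tokens)
print(description_length_bits(char_tokens), description_length_bits(vgram_tokens))
```

In this toy case the stream of longer grams needs fewer bits than the character stream, which is the intuition behind choosing dictionary entries by description length. The same token counts can then be used as features wherever bag-of-words counts would be used.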
See the CIKM '18 paper for details.
Igor Kuralenok, Natalia Starikova, Aleksandr Khvorov, and Julian Serdyuk. Construction of Efficient V-Gram Dictionary for Sequential Data Analysis. In CIKM '18: Proceedings of the 27th ACM International Conference on Information and Knowledge Management, pages 1343-1352.
Install
Install vgram by running:
pip install vgram
You may see some error messages about pybind11 not being installed, but this is okay.
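Because the installer can print such messages, it may help to confirm afterwards that the package can actually be found by Python. The following is a small sketch using only the standard library; it assumes the import name is `vgram`, the same as the name used with pip, and `is_installed` is a hypothetical helper written for this check.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be located."""
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package is importable under the name "vgram".
print("vgram importable:", is_installed("vgram"))
```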
Contents:
- Use cases and theory
  - V-Gram construction
  - Stream V-Gram construction
  - Save dictionary
  - Tokenizers
- Examples
  - Basic example
  - Save dictionary
  - Include in scikit-learn pipeline
  - Real example
  - Words and v-grams union
  - Build v-grams on int sequences
  - IntVGram in text pipeline
  - Save and load v-grams
  - Construct VGram from file
  - Saving intermediate dictionaries to file
  - StreamVGram
  - Load StreamVGram from file
  - Fine-tune StreamVGram
- Our experiments
License
The project is licensed under the MIT license.