Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
Updated Mar 17, 2025 (Python)
Crawler for linguistic corpora
Crawl Ask.fm Q&A lists and create a corpus for machine learning.
Automated text preprocessing pipeline for large corpora. Features customizable filters for diacritics, stop words, punctuation, and regex.
App and scripts that work with the corpus builder CorpusCook to keep a corpus updated with corrections of wrong predictions.
Extract text from Vikidia/Wikipedia articles [fr]
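One of the entries above describes a preprocessing pipeline with customizable filters for diacritics, stop words, punctuation, and regex. As a rough illustration of that idea, here is a minimal sketch using only the Python standard library. It is not code from any of the listed repositories; every function name (`strip_diacritics`, `strip_punctuation`, `make_stop_word_filter`, `make_regex_filter`, `run_pipeline`) is hypothetical, and the whitespace tokenization is an assumption made to keep the example short.

```python
import re
import string
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def strip_punctuation(text):
    """Delete ASCII punctuation characters."""
    return text.translate(str.maketrans("", "", string.punctuation))


def make_stop_word_filter(stop_words):
    """Build a filter that drops whitespace-separated tokens found in stop_words."""
    stop = {word.lower() for word in stop_words}

    def _filter(text):
        return " ".join(tok for tok in text.split() if tok.lower() not in stop)

    return _filter


def make_regex_filter(pattern, replacement=""):
    """Build a filter that substitutes every match of pattern."""
    compiled = re.compile(pattern)

    def _filter(text):
        return compiled.sub(replacement, text)

    return _filter


def run_pipeline(text, filters):
    """Apply each filter in order, then collapse runs of whitespace."""
    for apply_filter in filters:
        text = apply_filter(text)
    return " ".join(text.split())


# Hypothetical configuration: order matters, so URLs are removed
# before punctuation stripping would break them apart.
filters = [
    make_regex_filter(r"https?://\S+"),
    strip_diacritics,
    strip_punctuation,
    make_stop_word_filter({"the", "a"}),
]
cleaned = run_pipeline("The café's menu: see https://example.com today!", filters)
```

Because each filter is a plain function from string to string, the list can be reordered or extended without touching `run_pipeline`.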