Weekly | Top 12 GitHub Repos | Week 37 - 2025
Noteworthy data-ops & analytics repos that first shipped less than a year ago.
#12. denizsafak/abogen
Generate audiobooks from EPUBs, PDFs and text with synchronized captions.
["audiobook", "audiobooks", "content-creation", "content-creator", "epub-converter", "kokoro", "media-generation", "narrator", "speech-synthesis", "subtitles", "text-to-audio", "text-to-speech", "tts", "voice-synthesis", "kokoro-82m", "kokoro-tts"]
This repo was first pushed to GitHub on 2025-04-25. Its license was listed as: MIT License. Its primary language is Python.
#11. drasi-project/drasi-platform
This repo was first pushed to GitHub on 2024-09-23. Its license was listed as: Apache License 2.0. Its primary language is Rust.
#10. JayLZhou/GraphRAG
An in-depth study of GraphRAG.
This repo was first pushed to GitHub on 2024-11-11. Its primary language is Python.
#9. NanoNets/docext
An on-premises unstructured data extraction tool powered by vision language models.
["document", "document-analysis", "extraction", "llms", "machine-learning", "nlp", "ocr", "pdf-extractor", "rag", "unstructured-data", "vlms", "onprem", "document-data-extraction", "ocr-onpremise", "llm-ocr", "onprem-ocr", "onprem-vision", "onpremise", "table-extraction"]
This repo was first pushed to GitHub on 2025-04-04. Its license was listed as: Apache License 2.0. Its primary language is Python.
#8. duckdb/ducklake
DuckLake is an integrated data lake and catalog format.
This repo was first pushed to GitHub on 2025-05-27. Its license was listed as: MIT License. Its primary language is C++.
#7. hristo2612/SQLNoir
Solve mysteries through SQL.
This repo was first pushed to GitHub on 2025-02-04. Its license was listed as: MIT License. Its primary language is TypeScript.
#6. opendatalab/DocLayout-YOLO
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception
This repo was first pushed to GitHub on 2024-10-14. Its license was listed as: GNU Affero General Public License v3.0. Its primary language is Python.
#5. OpenCoder-llm/OpenCoder-llm
The Open Cookbook for Top-Tier Code Large Language Model
This repo was first pushed to GitHub on 2024-10-26. Its license was listed as: MIT License. Its primary language is Python.
#4. apple/embedding-atlas
Embedding Atlas is a tool that provides interactive visualizations for large embeddings. It allows you to visualize, cross-filter, and search embeddings and metadata.
["embedding", "visualization"]
This repo was first pushed to GitHub on 2025-05-13. Its license was listed as: MIT License. Its primary language is TypeScript/Svelte.
🔥 InfiniteYou: Flexible Photo Recrafting While Preserving Your Identity
["diffusion-transformer", "face", "flux", "identity-preserving", "image-editing", "image-generation", "personalization", "text-to-image", "diffusion", "diffusers", "pytorch"]
This repo was first pushed to GitHub on 2025-03-20. Its license was listed as: Apache License 2.0. Its primary language is Python.
#2. CatchTheTornado/text-extract-api
Document (PDF, Word, PPTX, ...) extraction and parsing API using state-of-the-art OCRs and Ollama-supported models. Anonymize documents, remove PII, and convert any document or picture to structured JSON or Markdown.
["api", "extract", "json", "llm", "pdf", "anonymization", "ocr", "ocr-python", "pii"]
This repo was first pushed to GitHub on 2024-10-23. Its license was listed as: GNU General Public License v3.0. Its primary language is Python.
#1. jwohlwend/boltz
Official repository for the Boltz-1 biomolecular interaction model.
This repo was first pushed to GitHub on 2024-11-17. Its license was listed as: MIT License. Its primary language is Python.



