Weekly | Top 11 GitHub Repos | Week 34 - 2025
Noteworthy data-ops & analytics repos that first shipped less than a year ago.
#11. OpenDCAI/DataFlow
Easy Data Preparation with latest LLMs-based Operators and Pipelines.
["data", "data-cleaning", "data-pipelines", "data-processing", "data-science", "data-synthesis", "llms", "operators", "data-agent", "sglang-bankend", "vllm-backend"]
This repo was first pushed to GitHub on 2024-10-13. Its license was listed as: Apache License 2.0. Its primary language is Python.
#10. risesoft-y9/DataFlow-Engine
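DataFlow Engine is a low-level, data-driven engine for data integration, data synchronization, data exchange, data sharing, task configuration, and task scheduling. It uses separation of management and execution, multiple stream layers, and a plugin library to handle practical data problems: large-scale data tasks, high-frequency data reporting, high-frequency data collection, and compatibility with heterogeneous data.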
This repo was first pushed to GitHub on 2024-08-26. Its license was listed as: GNU General Public License v3.0. Its primary language is Java.
#9. open-thoughts/open-thoughts
Open Thoughts: Fully Open Data Curation for Thinking Models
["open-data", "reasoning"]
This repo was first pushed to GitHub on 2025-01-28. Its license was listed as: Apache License 2.0. Its primary language is Python.
#8. facebookresearch/uco3d
Uncommon Objects in 3D dataset
This repo was first pushed to GitHub on 2024-12-29. Its license was listed as: Creative Commons Attribution 4.0 International. Its primary language is Jupyter Notebook.
#7. firecrawl/fire-enrich
🔥 AI-powered data enrichment tool that transforms emails into rich datasets with company profiles, funding data, tech stacks, and more using Firecrawl and multi-agent AI
This repo was first pushed to GitHub on 2025-06-11. Its license was listed as: MIT License. Its primary language is TypeScript.
#6. worldbank/metadata-editor
Web tool to edit, validate, and manage metadata for Microdata (DDI Codebook), documents, tables, media, and geospatial data.
["ddi-codebook", "metadata-editor"]
This repo was first pushed to GitHub on 2025-05-13. Its license was listed as: MIT License. Its primary language is PHP.
#5. microsoft/data-formulator
🪄 Create rich visualizations with AI
This repo was first pushed to GitHub on 2024-08-29. Its license was listed as: MIT License. Its primary language is TypeScript.
#4. deepseek-ai/profile-data
Analyze computation-communication overlap in V3/R1.
This repo was first pushed to GitHub on 2025-02-27. Its primary language is Mixed/Unspecified.
#3. shaiwz/data-platform-open
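🔥🔥🔥 Backend for a visual drag-and-drop big data integration platform, including data flows, data sources, data alignment, query templates, comprehensive monitoring, and more.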
["big-data", "dataflow", "doris", "java", "kafka", "starrocks"]
This repo was first pushed to GitHub on 2025-01-19. Its license was listed as: Other. Its primary language is Java.
#2. boxed-dev/cognidb
This repo was first pushed to GitHub on 2024-12-06. Its primary language is Python.
#1. shiyu-coder/Kronos
Kronos: A Foundation Model for the Language of Financial Markets
This repo was first pushed to GitHub on 2025-07-01. Its license was listed as: MIT License. Its primary language is Python.
You may notice this week's selections feel slightly different, and hopefully more relevant. We've switched repo classification from keyword matching to semantic matching with text embeddings. This has increased the number of repos that get tagged effectively, which shows up in two ways:
1. Star counts of the repos in the list. A larger pool of tagged repos means a better chance of finding ones with high star counts.
2. Better coverage of niche projects. We can now classify repos even when they have limited metadata, using their README content to understand what they actually do.
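For readers curious what the move from keyword matching to embeddings looks like in practice, here is a minimal sketch. It is not the newsletter's actual pipeline: the topic names, keyword lists, similarity threshold, and function names are all hypothetical, and the three-dimensional vectors are hand-made stand-ins for what a real text-embedding model would produce from README text.

```python
import math
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def keyword_tag(text: str, keywords: Dict[str, List[str]]) -> Optional[str]:
    """Keyword approach: first topic with a keyword found in the text."""
    lowered = text.lower()
    for topic, words in keywords.items():
        if any(word in lowered for word in words):
            return topic
    return None


def embedding_tag(
    text_vec: Vector,
    topic_vecs: Dict[str, Vector],
    threshold: float = 0.5,  # hypothetical cutoff, would need tuning
) -> Optional[str]:
    """Embedding approach: most similar topic, if it clears the threshold."""
    best_topic, best_score = None, threshold
    for topic, vec in topic_vecs.items():
        score = cosine_similarity(text_vec, vec)
        if score >= best_score:
            best_topic, best_score = topic, score
    return best_topic


# Hypothetical README snippet that never uses the word "data".
readme = "Forecasts candlestick sequences for financial markets"
keywords = {"data-analytics": ["data", "analytics"], "web-frontend": ["react", "css"]}

# Made-up vectors standing in for real embedding-model output.
topic_vecs = {"data-analytics": [0.9, 0.1, 0.2], "web-frontend": [0.1, 0.9, 0.1]}
readme_vec = [0.8, 0.2, 0.3]

print(keyword_tag(readme, keywords))          # None: no keyword appears
print(embedding_tag(readme_vec, topic_vecs))  # data-analytics
```

The keyword matcher misses the repo because none of the listed words appear in its text, while the similarity-based matcher can still place it near the right topic. In a real setup the quality of the result depends entirely on the embedding model and the threshold chosen.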
What is coming next?
Improvements to the re-ranking algorithm for surfacing the best repos. This will help to optimize what gets seen each week in the newsletter. Having more repos to choose from is a good problem to have, but we still have to refine things further to make these newsletters as relevant as possible.
Let me know what you think! Drop a comment below or hit the heart if you notice the improvement. Your engagement helps me understand what's working.
Want to discuss classification improvements? Join us on Discord: https://discord.gg/jCeSn3M7


