Over the last year, the team has scaled up to 100,000 vCPUs in the IBM Cloud to produce 40 trillion tokens for training AI models from 14 petabytes of raw data drawn from web crawls and other sources. They have been working on automating the pipelines they built on Kubeflow on IBM Cloud. Zerfos said the team found that stripping HTML out of the Common Crawl data and mapping the content into markdown format was 24 times faster with their pipeline than with the other methods they had used. All the Granite code and language models that IBM has open-sourced over the last year have been trained on data that went through the team's data-preparation process.
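To give a feel for the kind of HTML-to-markdown step described above, here is a minimal sketch, not IBM's actual pipeline code, assuming only the BeautifulSoup (bs4) library: it strips markup from a crawled page and maps a few common tags into plain markdown.

```python
# Illustrative sketch only -- not the team's pipeline code.
# Assumes BeautifulSoup (bs4) is installed.
from bs4 import BeautifulSoup


def html_to_markdown(html: str) -> str:
    """Strip markup from a crawled HTML page and emit simple markdown."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for tag in soup.find_all(["h1", "h2", "h3", "p", "li"]):
        text = tag.get_text(strip=True)
        if not text:
            continue
        if tag.name in ("h1", "h2", "h3"):
            # Map heading level to the matching number of '#' characters.
            lines.append("#" * int(tag.name[1]) + " " + text)
        elif tag.name == "li":
            lines.append("- " + text)
        else:
            lines.append(text)
    return "\n\n".join(lines)


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Some crawled text.</p></body></html>"
    print(html_to_markdown(sample))
```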
The team has turned much of their work into a community GitHub project called Data Prep Kit. It is a toolkit for streamlining data preparation for LLM applications (currently code and language models), supporting pre-training, fine-tuning, and RAG use cases. The modules in the kit are built on common distributed processing frameworks (including Spark and Ray) that allow developers to build their own custom modules, which can run across a variety of runtimes and scale readily from a laptop to a data center.
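The sketch below illustrates that scaling pattern using Ray directly rather than the Data Prep Kit's own module interface (whose exact API is not shown here): a per-document transform is fanned out across whatever workers are available, so the same script runs unchanged on a laptop or, by pointing Ray at a cluster, across many nodes. The transform itself is a made-up length filter, used only for illustration.

```python
# Minimal sketch of the scale-out pattern, using plain Ray
# (not the Data Prep Kit's actual module API).
from typing import Optional

import ray

# Local by default; ray.init(address="auto") would join an existing cluster.
ray.init()


@ray.remote
def clean_doc(doc: str) -> Optional[str]:
    """Toy per-document transform: normalize whitespace, drop very short docs."""
    cleaned = " ".join(doc.split())
    return cleaned if len(cleaned) > 20 else None


docs = [
    "A web page with enough text to keep for pre-training.",
    "too short",
    "Another crawled document that survives the length filter easily.",
]

# Fan the transform out across however many workers Ray has available.
futures = [clean_doc.remote(d) for d in docs]
results = [r for r in ray.get(futures) if r is not None]
print(results)
```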