Technical note

Introducing the GneissWeb dataset

The amount and quality of data that a model is trained on play a vital role in determining the performance of a large language model (LLM). High-quality data, in particular, can significantly boost the LLM’s ability to generalize on a wide range of downstream tasks. To better serve the needs of IBM’s burgeoning family of Granite models, this team focused on producing a 10 trillion-token dataset, named GneissWeb, that is higher quality than all other datasets of similar size available.

In this post, we introduce the GneissWeb dataset along with the recipe we used to produce it. The GneissWeb recipe consists of sharded exact substring deduplication and a judiciously constructed ensemble of quality filters. Below, we present the key evaluations that guided our design choices and provide filtering thresholds that can be used to filter the dataset to match the token and quality needs of Stage-1 (early pre-training) or Stage-2 (annealing) datasets.

Our evaluations demonstrate that GneissWeb outperforms state-of-the-art large open datasets over 5T tokens. Specifically, ablation models trained on GneissWeb outperform those trained on FineWeb.V1.1 by 2.14 percentage points in terms of average score computed on a set of 11 benchmarks (for both zero-shot and few-shot) commonly used to evaluate pre-training datasets. When the evaluation set is extended to 20 benchmarks (both zero-shot and few-shot), ablation models trained on GneissWeb outperform those trained on FineWeb.V1.1 by 1.49 percentage points. In the future, we plan to release a detailed technical paper with fine-grained details and the IBM Data Prep Kit to create the GneissWeb dataset.

The GneissWeb Recipe in a Nutshell

Building on Top of FineWeb

Hugging Face introduced FineWeb V1.1, a large-scale dataset for LLM pre-training, consisting of 15 trillion tokens (which takes up 44TB of disk space). FineWeb is derived from 96 Common Crawl snapshots and focuses on English text, applying a series of processing steps that include language classification, deduplication, and heuristic rule-based quality filters. Models trained on FineWeb are shown to outperform those trained on other publicly available datasets, such as C4, RefinedWeb, Dolma, RedPajamav2, SlimPajama, and The Pile. While we focused on FineWeb V1.1 to prepare GneissWeb, our recipe can also be applied to FineWeb V1.2, which was recently released.

Subsequently, Hugging Face released two smaller but higher quality versions called FineWeb.Edu (also referred to as FineWeb-Edu-Small) and FineWeb.Edu.Score2 (also referred to as FineWeb-Edu-Large), derived from FineWeb. These datasets consist of 1.3 trillion and 5.4 trillion tokens respectively. The smaller high-quality versions of FineWeb are created by retaining documents perceived to have higher educational value from the original FineWeb dataset.

We started with the goal of distilling roughly 10 trillion high-quality tokens from FineWeb V1.1, so that we get a sufficiently large number of quality tokens suitable for Stage-1 pre-training. Unlike the FineWeb.Edu families, which rely on a single quality annotator and perform aggressive filtering, we developed a multi-faceted ensemble of quality annotators to enable fine-grained quality filtering. This allowed us to achieve a finer trade-off between the quality and quantity of the tokens retained. While the GneissWeb recipe was focused on obtaining 10T+ high-quality tokens suitable for Stage-1 pre-training, it is also possible to adapt the recipe by tuning filtering parameters to produce smaller and higher quality datasets fit for Stage-2 training.

An Overview of the GneissWeb Recipe

The GneissWeb dataset was obtained by applying the following processing steps to FineWeb:

  • Exact substring deduplication at line level
  • Custom built fastText quality filter
  • Custom built fastText category classifier
  • Custom built Category-aware readability score quality filter
  • Custom built Category-aware extreme_tokenized quality filter

These were applied in the order shown in Fig. 1.

Figure 1: GneissWeb recipe.

The net impact was that the dataset size of 15 trillion tokens was filtered down to approximately 10 trillion tokens. In subsequent sections, we’ll describe the overall performance obtained using GneissWeb compared to other baselines. We’ll then dive deeper into each of these processing steps in detail and the impact they have individually through a series of ablations.

Evaluation Strategy

To compare GneissWeb against the baselines, we trained decoder models with 1.4B, 3B, and 7B parameters on a Llama architecture. These were trained on 35B (roughly optimal according to the Chinchilla scaling law) tokens to obtain signals and select hyperparameters for each processing step. We further trained ablation models on 100B (roughly three times the optimal according to the Chinchilla scaling law) as well as 350B tokens to validate the performance of each processing step. The data was tokenized using a StarCoder tokenizer, and training was done with a sequence length of 8,192.

The baselines from which equivalent data was subsampled and used for this comparison included:

| Dataset | Number of Tokens |
|---|---|
| FineWeb V1.1 | 15T |
| FineWeb-Edu-Score-2 | 5.4T |
| FineWeb-Edu | 1.3T |
| RefinedWeb | 600B |
| DCLM-Baseline | 3.8T |
| Dolma | 3T |

Fig. 2 shows how the subsamples were created for the FineWeb baselines as well as for GneissWeb. The other baselines were subsampled using the same strategy as the FineWeb baseline.

Figure 2: Subsampling and Ablation Strategy

We trained and evaluated our models on an LSF (Load Sharing Facility) cluster with each node equipped with eight H100 GPUs. For training tasks involving 35 billion tokens, we typically trained models with 1.4 billion trainable parameters across 64 GPUs. For more compute intensive tasks, we scaled up to 128 or 256 GPUs to reduce training time. For evaluation tasks we generally used 8 GPUs.

The tokens for an experimental dataset are read from IBM’s GPFS (General Parallel File System) to minimize network traffic during training. With this computational infrastructure, the training speed of an FSDP model with 1.4 billion parameters is approximately 32,000 tokens/GPU/sec. Consequently, training the model with 35 billion tokens on 64 GPUs typically takes about 4.6 hours. Model checkpoints are saved regularly and evaluated in real time, with results automatically uploaded, stored and visualized.
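As a rough sanity check of the training-time figure above, the arithmetic can be sketched as follows. The function name is hypothetical, and the throughput is taken as exactly 32,000 tokens/GPU/sec even though the text gives it only approximately, so the result (about 4.7 hours) is in line with, rather than identical to, the reported figure of about 4.6 hours.

```python
def training_hours(total_tokens: float, tokens_per_gpu_per_sec: float, num_gpus: int) -> float:
    """Estimate wall-clock training time in hours (hypothetical helper).

    Assumes perfectly linear scaling across GPUs and no checkpointing
    or evaluation overhead, which is an idealization.
    """
    tokens_per_sec = tokens_per_gpu_per_sec * num_gpus
    return total_tokens / tokens_per_sec / 3600


# Figures quoted in the text: 35B tokens, ~32,000 tokens/GPU/sec, 64 GPUs.
hours = training_hours(35e9, 32_000, 64)
print(f"{hours:.2f} hours")
```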

Evaluation benchmarks selection

We evaluated our ablation models using lm-evaluation-harness on two categories of tasks: 11 High-Signal tasks (0-shot and few-shot) and 20 Extended tasks (0-shot and few-shot).

High-signal tasks:

Since ablations are performed by training ‘small’ models (1.4B parameter models) for a few billion tokens (typically 35B tokens), it is important to identify benchmarks that provide a good signal at this relatively small scale. Similar to FineWeb, we used the following criteria for selecting the 11 high-signal/early-signal tasks: accuracy above random guessing; accuracy monotonically increasing over training epochs; and small variance across runs. These are shown in Fig. 3 and cover commonsense reasoning, reading comprehension, world knowledge and language understanding task categories. We used both the zero-shot as well as few-shot variations of these tasks.

Figure 3: High signal Tasks — provide good signal at relatively small scale — of 1.4B models trained on 35B to 100B tokens
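The three selection criteria can be expressed as a small check over per-seed accuracy curves. This is only a sketch under stated assumptions: the post does not specify the exact tests used, so the function name, the use of the seed-averaged curve for the monotonicity check, and the `max_cv` threshold are all hypothetical.

```python
from statistics import mean, pstdev


def is_high_signal(curves, random_baseline, max_cv=0.02):
    """Check the three criteria for one task (hypothetical helper).

    curves: one accuracy list per random seed, all evaluated at the
            same training checkpoints.
    random_baseline: accuracy of random guessing for the task.
    max_cv: assumed cap on the coefficient of variation across seeds.
    """
    # Average the seeds at each checkpoint.
    avg = [mean(vals) for vals in zip(*curves)]
    # Criterion 1: final accuracy is above random guessing.
    above_random = avg[-1] > random_baseline
    # Criterion 2: the averaged accuracy never decreases during training.
    increasing = all(later >= earlier for earlier, later in zip(avg, avg[1:]))
    # Criterion 3: small variance across runs at the final checkpoint.
    finals = [curve[-1] for curve in curves]
    stable = pstdev(finals) / mean(finals) <= max_cv
    return above_random and increasing and stable


improving = [[0.30, 0.35, 0.40], [0.31, 0.36, 0.41]]
flat = [[0.25, 0.25, 0.25], [0.25, 0.26, 0.25]]
print(is_high_signal(improving, random_baseline=0.25))
print(is_high_signal(flat, random_baseline=0.25))
```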

The high-signal tasks were used to analyze individual ingredients and possible recipe combinations via ablations. After we narrowed the field down to a few candidate recipes using these signals, we used the extended set of benchmarks to evaluate the model’s ability to generalize.

Extended tasks:

The extended tasks shown in Fig. 4 are a superset of the high-signal tasks. Besides the task categories of commonsense reasoning, reading comprehension, world knowledge, language understanding, it also has the category of symbolic problem solving. For the extended set, we also focus on zero-shot as well as few-shot variations.

Figure 4: Extended tasks — broader set of tasks to evaluate generalization at larger number of tokens and/or larger model sizes

The extended task set contains some tasks that are not in the high-signal set. At ablation scale, these tasks may have a high standard deviation (like PubMedQA), stay at random-guessing accuracy for the entire training cycle (like MMLU), or score above random guessing but show no improvement with training (like GSM8k). However, they are useful indicators of larger model performance and have therefore been retained in the Extended Tasks set.

These differences between the tasks are seen in Fig. 5, which compares the high-signal tasks with the tasks that are in the extended set but excluded from the high-signal set. The average accuracy increases for the former and is relatively static for the latter. This was one criterion for excluding them from the high-signal task set.

Figure 5: High-signal tasks show increasing accuracy with more training

The high-signal tasks also show a lower coefficient of variation compared to the excluded tasks, as shown in Figure 6. The coefficient of variation is calculated as the ratio of the standard deviation of the average score to its mean, where both statistics are computed across three random training seeds. A lower coefficient of variation indicates more stable results, due to lower variance across random seeds. Their lower coefficient of variation makes the high-signal tasks more reliable at the ablation scale.

Figure 6: Coefficient of variation (standard deviation divided by mean) for high-signal set and excluded set
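The coefficient of variation described above is simple to compute. The sketch below uses made-up scores for three seeds, and it assumes the sample standard deviation; the post does not say whether the sample or population form was used.

```python
from statistics import mean, stdev


def coefficient_of_variation(scores):
    """Standard deviation divided by mean, across random seeds.

    Uses the sample standard deviation (an assumption; the population
    form would give a slightly smaller value for three seeds).
    """
    return stdev(scores) / mean(scores)


# Illustrative average scores from three training seeds (not real results).
seed_scores = [50.0, 51.0, 52.0]
cv = coefficient_of_variation(seed_scores)
print(f"{cv:.4f}")
```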

Evaluation results

At 1.4 billion model size trained on 350 billion tokens:

| Dataset | Tokens | High-Signal Eval Score | Extended Eval Score |
|---|---|---|---|
| Large Datasets (5T+ Tokens), suitable for Stage-1 Pre-Training | | | |
| FineWeb.V1.1 | 15T | 56.26 ± 0.14 | 47.33 ± 0.3 |
| GneissWeb | 10T | 58.40 ± 0.19 (+2.14) | 48.82 ± 0.27 (+1.49) |
| FineWeb-Edu-Score-2 | 5.4T | 57.36 ± 0.42 | 48.16 ± 0.29 |
| Small Datasets (<5T Tokens), which can be used for Stage-2 Pre-Training | | | |
| DCLM-Baseline | 3.8T | 61.36 ± 0.11 | 51.09 ± 0.42 |
| Dolma | 3T | 54.18 ± 0.65 | 47.39 ± 0.75 |
| FineWeb-Edu | 1.3T | 58.44 ± 0.14 | 48.91 ± 0.13 |
| RefinedWeb | 0.6T | 57.77 ± 0.10 | 48.11 ± 0.3 |

Figure 7: Average scores of 1.4 billion parameter models trained on 350 billion tokens randomly sampled from state-of-the-art open datasets. Scores are averaged over three random seeds used for data sampling and are reported along with standard deviations. GneissWeb performs the best among the class of large datasets.

The datasets evaluated are broken down into those above 5 trillion tokens in size and those below 5 trillion. The former are useful for Stage-1 training and are the primary focus of this study. The latter are useful for Stage-2 training, and by tuning the filtering parameters, a version of GneissWeb can be produced for this space.

For those in the greater than 5 trillion token set size, in Fig. 8 we show the performance broken down into the various categories of tasks — commonsense reasoning, language understanding, reading comprehension, world knowledge and symbolic problem solving. As shown, GneissWeb is not only the best overall but actually does best in all categories of tasks, barring world knowledge.

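| Dataset | Tokens | Commonsense Reasoning | Language Understanding | Reading Comprehension | World Knowledge | Symbolic Problem Solving | Average |
|---|---|---|---|---|---|---|---|
| FineWeb.V1.1 | 15T | 45.23 | 47.58 | 62.67 | 39.01 | 26.16 | 47.17 |
| GneissWeb | 10T | 45.53 | 48.77 | 65.21 | 41.09 | 27.92 | 48.82 |
| FineWeb-Edu-score-2 | 5.4T | 45.32 | 47.2 | 63.29 | 42.24 | 27.25 | 48.16 |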

Figure 8: Comparison of average evaluation scores grouped by categories for 1.4 billion models trained on 350 billion tokens.

In Fig. 9, we show the progression of accuracy with training on the high-signal tasks for a 1.4 billion parameter model trained on 350 billion tokens. For all three datasets compared, the accuracy increases over time, and the accuracy of GneissWeb is consistently higher than that of FineWeb.V1.1 and FineWeb-Edu-score-2.

Figure 9: Average evaluation score on High-Signal tasks versus the number of tokens for 1.4 billion parameter models. The model trained on GneissWeb consistently outperforms the ones trained on FineWeb.V1.1 and FineWeb-Edu-score-2.

At 3 and 7 billion model size with 100 billion tokens:

Because training and evaluating models with 3 billion and 7 billion parameters requires much more compute, we limited training to 100 billion tokens. The 7 billion parameter models do better than the 3 billion parameter models, and at both sizes the models trained on GneissWeb outperform the models trained on FineWeb.V1.1 and FineWeb-Edu-score-2.

At the 3 billion model size, models trained on GneissWeb outperform those trained on FineWeb.V1.1 by 1.83 percentage points in terms of the average score computed on a set of 11 high-signal benchmarks (both zero-shot and few-shot), and by 1.09 percentage points on extended benchmarks (both zero-shot and few-shot).

| Dataset | High-Signal Eval Score | Extended Eval Score |
|---|---|---|
| FineWeb.V1.1 | 57.46 ± 0.06 | 48.31 ± 0.31 |
| GneissWeb | 59.29 ± 0.05 (+1.83) | 49.4 ± 0.24 (+1.09) |
| FineWeb-Edu-score-2 | 58.81 ± 0.009 | 49.25 ± 0.34 |

Figure 10: Comparison of average evaluation scores on high-signal and extended eval tasks at 3B model size. Scores are averaged over three random seeds used for data sampling and are reported along with standard deviations.

Figure 11: Average evaluation score on high-signal tasks versus the number of tokens at 3 billion model size for 100 billion tokens. The model trained on GneissWeb consistently outperforms the one trained on FineWeb.V1.1 throughout the training.

This gain increases further at the 7 billion model size: models trained on GneissWeb outperform those trained on FineWeb.V1.1 by 2.04 percentage points in terms of the average score computed on a set of 11 high-signal benchmarks (both zero-shot and few-shot), and by 1.32 percentage points on extended benchmarks (both zero-shot and few-shot).

| Dataset | High-Signal Eval Score | Extended Eval Score |
|---|---|---|
| FineWeb.V1.1 | 61.05 ± 0.25 | 51.01 ± 0.28 |
| GneissWeb | 63.09 ± 0.10 (+2.04) | 52.33 ± 0.24 (+1.32) |
| FineWeb-Edu-score-2 | 62.30 ± 0.002 | 51.81 ± 0.15 |

Figure 12: Comparison of average evaluation scores on high-signal and extended eval tasks at a 7 billion model size. Scores are averaged over three random seeds used for data sampling and are reported along with standard deviations.

Figure 13: Average evaluation score on high-signal tasks versus the number of tokens at a 7 billion model size for 100 billion tokens. The model trained on GneissWeb consistently outperforms the one trained on FineWeb.V1.1 throughout the training.

GneissWeb recipe details

In this section, we describe the key ingredients of the GneissWeb recipes that provide significant gains by explaining each of the components (or processing steps) along with the evaluation results of their individual ablation experiments.

Exact substring deduplication

Similar to RefinedWeb, we applied line-level deduplication to reduce memorization. We used the implementation from Lee et al. (2022) to identify exact duplicate text that matched character-for-character across multiple documents using a suffix array. These exact duplicates may have bypassed the MinHash deduplication stage for several reasons: they might not represent a significant enough portion of a document, or a single document could include repeated sections from various documents. This line-level deduplication process can be tuned through parameters such as length_threshold (the minimum length of repeated text sequences) and frequency_threshold. We used a length_threshold of 50, consistent with the original implementation from Google Research and RefinedWeb.

Several modifications were made to the original implementation. First, we adapted it to remove exact duplicates at the level of individual Parquet files. Second, rather than removing all copies of a duplicate, our approach retains the first instance of each duplicate cluster. Specifically, we keep the first match and remove any subsequent matches exceeding 50 consecutive tokens.
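The keep-first behavior can be illustrated with a deliberately naive sketch. It is not the suffix-array implementation from Lee et al. (2022) that the recipe builds on: it swaps in hashed sliding windows over whitespace tokens, handles only repeats across documents (not within one document), and holds every window in memory. The function name and the tiny threshold in the usage example are illustrative assumptions.

```python
def dedup_keep_first(docs, length_threshold=50):
    """Remove token spans that already appeared in an earlier document.

    Naive stand-in for suffix-array deduplication: any window of
    `length_threshold` consecutive whitespace tokens seen in a previous
    document is dropped; the first occurrence is always kept.
    """
    seen = set()  # every window observed in earlier documents
    cleaned = []
    n = length_threshold
    for doc in docs:
        tokens = doc.split()
        windows = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        drop = [False] * len(tokens)
        for i, window in enumerate(windows):
            if window in seen:
                # Mark every token covered by the repeated window.
                for j in range(i, i + n):
                    drop[j] = True
        # Register this document's windows only after checking it, so the
        # first occurrence of a span is never removed.
        seen.update(windows)
        cleaned.append(" ".join(t for t, d in zip(tokens, drop) if not d))
    return cleaned


# Tiny threshold so the effect is visible on toy inputs.
result = dedup_keep_first(["a b c d e", "x a b c y"], length_threshold=3)
print(result)
```

In the toy run, the first document is left unchanged, while the repeated span "a b c" is removed from the second.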

In Fig. 14, we show the progression of accuracy with training on the high-signal tasks for a 1.4 billion parameter model trained on 350 billion tokens. For both datasets compared, the accuracy increases over time, and the accuracy of the dataset with exact substring deduplication is consistently higher, ending at 57.39 compared to 55.99 for the baseline.

Figure 14: Ablation experiment comparing exact substring deduplication against the FineWeb.V1.1 baseline at 1.4 billion model size for 350 billion tokens.

Custom data quality classifiers (fastText)

The fastText family of binary classifiers has been shown to perform well in identifying high-quality pre-training documents. Specifically, DCLM trained a fastText classifier on a mix of instruction-formatted data (OpenHermes-2.5) and high-scoring posts from ELI5, and demonstrated its effectiveness for quality filtering, surpassing compute-heavy methods such as AskLLM (prompting an LLM to ask if a document is helpful). After annotating a subset of documents using DCLM-fastText, we observed that it favors well-structured, well-formatted documents (such as those including bullet points), but tends to miss high-quality informational documents without substantial formatting.

In addition to DCLM-fastText, we trained a custom fastText classifier on a mix of high-quality synthetic data and data annotated by an LLM for high educational value. Specifically, we used 400,000 documents, equally split between positive (i.e., high-quality) and negative (i.e., low-quality) classes. We obtained the 200,000 positive documents as:

  • 190,000 synthetic documents randomly sampled from the Cosmopedia dataset — an open synthetic dataset consisting of textbooks, blogposts, stories, posts and WikiHow articles generated by Mixtral-8x7B-Instruct-v0.1.
  • 10,000 documents with high educational value as follows: we annotated 600,000 random documents from FineWeb.V1.1, using the Mixtral-8x22B-Instruct model to score each document between 1 and 5 for its educational quality (with 5 being the highest quality), using a prompt similar to the one used by FineWeb-Edu. Next, we selected 10k random documents with scores >= 4.

We selected 200k documents out of 600k Mixtral-annotated documents with scores <=2 as the negative documents.
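The construction of the training set can be sketched as below. This is an illustration rather than the pipeline used for GneissWeb: the function name, the label names `hq` and `lq`, and the toy data are hypothetical. The output follows fastText's supervised input format, where each line starts with a `__label__` prefixed label followed by the text on a single line.

```python
import random


def build_fasttext_lines(synthetic_docs, scored_docs, n_synth, n_edu, n_neg, seed=0):
    """Build fastText supervised training lines (hypothetical helper).

    synthetic_docs: list of synthetic document texts (positives).
    scored_docs: list of (text, score) pairs, with LLM scores from 1 to 5.
    """
    rng = random.Random(seed)
    edu_positive = [text for text, score in scored_docs if score >= 4]
    negative_pool = [text for text, score in scored_docs if score <= 2]

    positives = rng.sample(synthetic_docs, n_synth) + rng.sample(edu_positive, n_edu)
    negatives = rng.sample(negative_pool, n_neg)

    def to_line(label, text):
        # fastText expects one example per line, so collapse whitespace.
        return f"__label__{label} " + " ".join(text.split())

    lines = [to_line("hq", t) for t in positives] + [to_line("lq", t) for t in negatives]
    rng.shuffle(lines)
    return lines


# Toy inputs; the post uses 190,000 + 10,000 positives and 200,000 negatives.
synthetic = ["A textbook chapter\non photosynthesis.", "A tutorial on sorting."]
scored = [("Great explainer.", 5), ("Buy now!!!", 1), ("Spam spam.", 2), ("So-so page.", 3)]
lines = build_fasttext_lines(synthetic, scored, n_synth=2, n_edu=1, n_neg=2)
print(len(lines))
```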

We performed an ablation where we combined the DCLM-fastText filter and the Cosmopedia-Edu-fastText filter using an OR rule. In particular, we retained documents that at least one filter votes as high-quality. Using the OR rule allowed us to achieve similar performance to the AND rule (wherein documents are retained only if both classifiers vote as high-quality) and better performance than the individual fastText classifiers, while retaining a substantially larger number of tokens.

Figure 15: Ablation experiment comparing a combination of fastText filters against the FineWeb.V1.1 baseline.

In Figure 15, we show the plot of the average eval score on high-signal tasks versus the number of training tokens for a 1.4 billion parameter model. We observe that filtering with the combination of fastText classifiers outperforms the FineWeb.V1.1 baseline throughout the training.

Readability scores

We applied a novel document-quality filter based on human readability: we used the McAlpine_eflaw readability score to identify hard-to-read documents and filter them out as low quality. The McAlpine_eflaw readability score measures how readable a body of English text is for a foreign learner; the lower the score, the easier the document is to read.

We observed that the readability score distributions in certain categories, such as science, education, technology, and medical health, differ from the overall distribution across all categories in our dataset. This variation occurs because some documents in these categories demand a higher level of education to understand and, as such, have a high readability score.

Based on this observation, there is a risk of losing high-quality documents when selecting a threshold based on the overall data distribution and applying the same threshold to all documents. Guided by readability score distributions in different categories, we leveraged the category information of documents and performed category-aware readability score quality filtering. Specifically, we used a more lenient threshold for these specific categories to prevent filtering out documents with potential educational value solely because of their high readability scores. This results in better performance compared to filtering without leveraging category information.
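A minimal sketch of category-aware readability filtering follows. It uses the standard McAlpine EFLAW definition, (words + mini-words) / sentences, where mini-words have three letters or fewer. The tokenization is simplified, and the function names, category labels, and both threshold values are assumptions for illustration; the post does not state the thresholds it used.

```python
import re

# Categories that get the more lenient threshold (labels are assumed).
LENIENT_CATEGORIES = {"science", "education", "technology", "medical health"}


def mcalpine_eflaw(text):
    """McAlpine EFLAW score: (words + mini-words) / sentences.

    Uses simplified regex-based word and sentence splitting.
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    mini_words = [w for w in words if len(w) <= 3]
    return (len(words) + len(mini_words)) / max(len(sentences), 1)


def keep_document(text, category, default_max=30.0, lenient_max=50.0):
    """Keep a document if its score is at or below its category's limit.

    Both limits are hypothetical placeholders, not the GneissWeb values.
    """
    limit = lenient_max if category in LENIENT_CATEGORIES else default_max
    return mcalpine_eflaw(text) <= limit


sample = "The cat sat on the mat. It was happy."
score = mcalpine_eflaw(sample)
print(score)
```

The sample has 9 words, 8 of them mini-words, across 2 sentences, so its score is (9 + 8) / 2 = 8.5.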

In Figure 16, we show the progression of accuracy with training on the high-signal tasks for a 1.4 billion parameter model trained on 35 billion tokens. For both datasets compared, the accuracy increases over time, and the accuracy of the dataset with the readability score quality filter is consistently higher, ending at 53.20 compared to 51.94 for the baseline.

Figure 16: Ablation experiment comparing readability score filter against the FineWeb.V1.1 baseline at 1.4 billion model size for 35 billion tokens.

Extreme tokenized documents removal

Despite applying various heuristic rules to filter out low-quality documents, we found that many abnormal documents were still misidentified. In particular, we noticed that while many documents had similar lengths, they produced significantly different token counts after being processed by a tokenizer. Motivated by this observation, we proposed to combine information from the “pre-tokenization” stage (document character length, document size) and the “post-tokenization” stage (token counts) to identify low-quality documents as those with an extremely high or low number of tokens per character (or per byte), which we refer to as extreme tokenized documents. Calculating the average number of tokens per character lets us compare documents of varying lengths. Plotting this statistic across all documents reveals a Gaussian-shaped distribution, and extreme tokenized documents are those falling into the two extremes of the distribution.

Figure 17: A schematic outlining the steps for removing extreme tokenized documents.

As a simple and straightforward approach, one can treat extreme tokenized documents as those falling into the two extremes of this distribution, where their average number of tokens per character deviates by at least two standard deviations from the distribution's mean. More aggressive thresholds can be applied if the data are numerous or if the performance of trained proxy models is not negatively impacted.

The computed statistic of average number of tokens per character can vary across different document categories or domains. For instance, it may differ between code documents and textbooks, or between science and medical texts. There is a risk of losing good-quality documents by using a single set of thresholds for all categories. Hence, we computed and filtered extreme tokenized documents within each document category separately. Categories of documents can be identified based on the source from which the documents were downloaded, or inferred using fastText category classifiers.
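The per-category version of the simple two-standard-deviation rule can be sketched as follows. The function name and the input layout (dicts with `text`, `n_tokens`, and `category` keys) are assumptions; `n_tokens` would come from whichever tokenizer is in use, and the actual GneissWeb thresholds are not stated in this post.

```python
from collections import defaultdict
from statistics import mean, pstdev


def extreme_tokenized_flags(docs, num_std=2.0):
    """Flag documents whose tokens-per-character ratio is extreme.

    docs: list of dicts with 'text', 'n_tokens' and 'category' keys
          (an assumed layout). Returns one bool per document, True when
          the ratio is at least `num_std` standard deviations from the
          mean of the document's own category.
    """
    ratios = [d["n_tokens"] / max(len(d["text"]), 1) for d in docs]

    # Group ratios by category so each category gets its own statistics.
    by_category = defaultdict(list)
    for doc, ratio in zip(docs, ratios):
        by_category[doc["category"]].append(ratio)
    stats = {cat: (mean(rs), pstdev(rs)) for cat, rs in by_category.items()}

    flags = []
    for doc, ratio in zip(docs, ratios):
        mu, sigma = stats[doc["category"]]
        # A category with zero spread has no extremes to flag.
        flags.append(sigma > 0 and abs(ratio - mu) >= num_std * sigma)
    return flags


# Nine typical documents (0.25 tokens/char) and one outlier (1.0 tokens/char).
docs = [{"text": "abcdefgh", "n_tokens": 2, "category": "science"} for _ in range(9)]
docs.append({"text": "abcdefgh", "n_tokens": 8, "category": "science"})
flags = extreme_tokenized_flags(docs)
print(flags)
```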

In Figure 18, we show the progression of accuracy with training on the high-signal tasks for a 1.4 billion parameter model trained on 35 billion tokens. For both datasets compared, the accuracy increases over time, and the accuracy of the dataset with the Extreme_tokenized quality filter, at 52.78, is higher than that of the baseline, at 51.94.

Figure 18: Ablation experiment comparing Extreme_tokenized filter against the FineWeb.V1.1 baseline at 1.4 billion model size for 35 billion tokens.

Document categorization classifiers

As mentioned earlier, the quality score distributions in certain categories that potentially contain documents requiring a higher education level (such as science, education, technology and computing, and medical health) differ from the overall distribution across all categories in our dataset.

Guided by quality score distributions in different categories, we leveraged the category information of documents and performed category-aware quality filtering, to avoid losing high-quality documents by using a single set of thresholds for all categories in our data quality filtering steps.

We trained binary fastText category classifiers for the categories with a significant difference between their score distribution plots and the overall distribution across all categories. Specifically, we trained four fastText classifiers for education, science, technology and computing, and medical health categories using WatsonNLP category annotation.

For each category, we fixed the size of the training set to be 800,000 examples (i.e., 400,000 positive, 400,000 negative). 400,000 positively labeled samples were sampled randomly from documents that are labeled with that specific category with a confidence score 0.95 and above, and 400,000 negatively labeled samples were selected randomly from documents labeled with any category other than these four categories with a confidence score of 0.95 and above.
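The sampling scheme for one category classifier can be sketched as below. The function name, the `(text, label, confidence)` tuple layout, and the category label strings are hypothetical, and the sketch assumes each document carries a single category annotation.

```python
import random

# The four target categories (label strings are assumed).
TARGET_CATEGORIES = {"education", "science", "technology and computing", "medical health"}


def sample_category_training_set(annotations, category, n_per_class, min_conf=0.95, seed=0):
    """Sample positives and negatives for one binary category classifier.

    annotations: list of (text, label, confidence) tuples (assumed layout).
    Positives carry `category` with confidence >= min_conf; negatives carry
    a label outside all four target categories with confidence >= min_conf.
    """
    rng = random.Random(seed)
    positive_pool = [t for t, label, conf in annotations
                     if label == category and conf >= min_conf]
    negative_pool = [t for t, label, conf in annotations
                     if label not in TARGET_CATEGORIES and conf >= min_conf]
    return rng.sample(positive_pool, n_per_class), rng.sample(negative_pool, n_per_class)


# Toy annotations; the post uses 400,000 examples per class.
annotations = [
    ("Cell biology notes", "science", 0.99),
    ("Quantum primer", "science", 0.97),
    ("Maybe science", "science", 0.60),
    ("Lesson plan", "education", 0.98),
    ("Match report", "sports", 0.96),
    ("Recipe blog", "food", 0.99),
    ("Unclear page", "travel", 0.50),
]
positives, negatives = sample_category_training_set(annotations, "science", n_per_class=2)
print(sorted(positives), sorted(negatives))
```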

Once we had trained the document categorization classifiers, we annotated all 96 Common Crawl snapshots. We leveraged these category annotations in our category-aware readability score quality filtering and extreme_tokenized quality filtering.

Combining GneissWeb components into a winning recipe

There are various ways to combine key ingredients and build a recipe, including deciding which components to include and their order, as well as designing ensemble filtering rules using multiple quality annotators. A specific combination of ingredients along with a filtering rule determines the quantity of retained data — as well as its quality.

Figure 19: Key ingredients selected for building the GneissWeb recipe.

We combined the key ingredients in various combinations and orders, with the aim of maximizing downstream task performance under the constraint of retaining at least 10 trillion tokens from FineWeb.V1.1. Through our ablations, we determined that the following combination of processing steps produces the best results. We first applied exact substring deduplication, followed by our ensemble quality filter, as shown in Figure 1.

Using the notation:

A: Custom built fastText quality filter
B: Custom built category-aware readability score quality filter by leveraging custom built fastText category classifier
C: Custom built category-aware extreme_tokenized quality filter by leveraging custom built fastText category classifier

GneissWeb recipe:

Exact substring deduplication → ((A AND B) OR (A AND C))

Union of (Intersection of fastText Classifiers and Category-Aware Readability score) and (Intersection of fastText Classifiers and Category-Aware Extreme_tokenized filter)

Recipe 2:

Exact substring deduplication → (A AND B AND C)

Intersection of fastText Classifiers AND Category-Aware Readability score AND Category-Aware Extreme_tokenized filter
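The two ensemble rules above can be written as boolean functions over the per-document votes of A, B, and C. The function names are hypothetical. The exhaustive check at the end illustrates that the GneissWeb rule is logically equivalent to A AND (B OR C), and that every document Recipe 2 keeps is also kept by the GneissWeb rule, which is why the GneissWeb rule retains at least as much data.

```python
from itertools import product


def gneissweb_rule(a: bool, b: bool, c: bool) -> bool:
    """Keep a document if (A AND B) OR (A AND C)."""
    return (a and b) or (a and c)


def recipe2_rule(a: bool, b: bool, c: bool) -> bool:
    """Keep a document only if A AND B AND C."""
    return a and b and c


# Enumerate all eight vote combinations.
votes = list(product([False, True], repeat=3))
equivalent = all(gneissweb_rule(a, b, c) == (a and (b or c)) for a, b, c in votes)
recipe2_is_subset = all(gneissweb_rule(a, b, c) for a, b, c in votes if recipe2_rule(a, b, c))
print(equivalent, recipe2_is_subset)
```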

Using this combination of components, FineWeb.V1.1 was filtered down from the initial 15 trillion tokens to 10 trillion tokens.

In Figure 20, we show the accuracy of the two recipes on high-signal tasks and extended tasks for a 7 billion parameter model trained on 100 billion tokens. The GneissWeb recipe outperforms both the other recipe and the FineWeb.V1.1 baseline.

| Dataset | High-Signal Eval Score | Extended Eval Score |
|---|---|---|
| FineWeb.V1.1_7b | 61.05 ± 0.25 | 51.01 ± 0.28 |
| Recipe2_7b | 62.65 ± 0.37 | 51.82 ± 0.41 |
| GneissWeb_7b | 63.09 ± 0.10 (+2.04) | 52.33 ± 0.24 (+1.32) |

Figure 20: Comparison of ablations at 7 billion model size for 100 billion tokens.

Conclusion and future work

This blog presents the GneissWeb dataset, produced by IBM Research using the IBM Data Prep Kit. GneissWeb is built from 96 Common Crawl snapshots and outperforms some state-of-the-art datasets of comparable size. We continue to perform further data ablation experiments and plan to open-source the recipe via the IBM Data Prep Kit. We are currently processing the latest seven snapshots, which we aim to include in GneissWeb after conducting further evaluations and verifications.