Common LLM Settings

| Name | Dataset | Tokenizer | Training Library | Embeddings | Normalization | Parallel Layers | Biases | Activation Function | D Attn / D FF | Optimizer | Optimizer Hyperparameters | LR Warmup | LR Decay | Precision | Clipping | Dropout | Weight Decay | Date | Source |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GPT-1 | Unreleased | GPT-1 | Unreleased | Learned | LayerNorm | No | Yes | GeLU | 4 | Adam | Not Disclosed | | Cosine to 0 | Not Disclosed | Not Disclosed | 0.1 | Not Disclosed | 6/11/2018 | https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf |
| Pythia | the Pile | GPT-NeoX-20B | GPT-NeoX | Rotary | LayerNorm | Yes | Yes | GeLU | 4 | Adam | 0.9, 0.95 | | Cosine to 10% | fp32 / fp16 | 1.00 | | 0.1 | 12/10/2022 | https://arxiv.org/abs/2304.01373 |
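To make the LR Decay column concrete, here is a minimal sketch of linear warmup followed by cosine decay. Setting `min_ratio=0.0` corresponds to "Cosine to 0" (GPT-1) and `min_ratio=0.1` to "Cosine to 10%" (Pythia); the function name and the specific step counts and peak learning rate in the example are hypothetical, not taken from either paper.

```python
import math

def lr_at_step(step, max_steps, peak_lr, warmup_steps, min_ratio):
    """Linear warmup to peak_lr, then cosine decay to min_ratio * peak_lr.

    min_ratio=0.0 -> decay all the way to 0 ("Cosine to 0").
    min_ratio=0.1 -> decay to 10% of peak ("Cosine to 10%").
    """
    if step < warmup_steps:
        # Linear ramp from 0 up to peak_lr over the warmup period.
        return peak_lr * step / warmup_steps
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    # Cosine goes from 1 at progress=0 down to 0 at progress=1.
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak_lr * (min_ratio + (1.0 - min_ratio) * cosine)
```

For example, with `max_steps=1000`, `warmup_steps=100`, and `peak_lr=1.0`, the schedule rises linearly to 1.0 at step 100, then decays along a cosine curve, ending at 0.1 when `min_ratio=0.1` and at 0.0 when `min_ratio=0.0`.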

Notes

Many papers state that they follow another paper except for certain changes. When such a paper doesn't mention a config, we assume it matches the previous paper unless there's a specific reason not to. Assumed values, whether because of this principle or because they've been inferred from materials but not explicitly stated, are in italics.
My primary goal is to document best practices and how they evolve over time. Therefore, if something was supposed to be done but failed due to a bug, I list what was intended. All such examples have a note disclaiming them where they occur.
Close to 100 LMs have been trained in the past three years. For ease of use, I've been focusing on what I consider to be central examples that people look to when informing their decision-making. If there are models you'd like added, feel free to leave a comment! Some of this info has been divined from staring at model inference code for a long time. I'm reasonably confident it's correct, but please comment with any corrections (with a source!)