Gemini is powered by Google’s most capable AI models, designed for varying capabilities and use cases. Like most LLMs today, these models are pre-trained on a variety of data from publicly available sources. We apply quality filters to all datasets, using both heuristic rules and model-based classifiers. We also perform safety filtering to remove content likely to produce policy-violating outputs. To maintain the integrity of model evaluations, we search for and remove any evaluation data that may have been in our training corpus before using data for training. The final data mixtures and weights are determined through ablations on smaller models. We stage training so that the mixture composition changes over the course of training, increasing the weight of domain-relevant data towards the end. Data quality can be an important factor for high-performing models, and we believe that many interesting questions remain around finding the optimal dataset distribution for pre-training.
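As a purely illustrative sketch of what staged mixture weighting can look like, the snippet below samples a data source for each training batch according to stage-dependent weights. The source names and weights here are hypothetical placeholders, not Gemini’s actual data mixtures or schedule.

```python
import random

# Hypothetical data sources and stage-dependent mixture weights.
# The real mixtures and schedules are not public; these numbers are
# illustrative only.
STAGE_WEIGHTS = {
    "early": {"web": 0.60, "code": 0.20, "books": 0.15, "domain": 0.05},
    "late":  {"web": 0.40, "code": 0.20, "books": 0.15, "domain": 0.25},
}

def sample_source(stage: str) -> str:
    """Pick a data source for the next training batch according to the
    mixture weights of the current training stage."""
    weights = STAGE_WEIGHTS[stage]
    sources, probs = zip(*weights.items())
    return random.choices(sources, weights=probs, k=1)[0]

# Early in training, "domain" data is sampled rarely; late in training its
# weight is increased, mirroring the staged mixture described above.
print(sample_source("early"))
print(sample_source("late"))
```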
This pre-training allows the model to learn patterns in language and use them to predict the next probable word or words in a sequence. For example, as an LLM learns, it can predict that the next word in “peanut butter and ___” is more likely to be “jelly” than “shoelace.” However, if an LLM picks only the most probable next word, its responses will be less creative. So LLMs are often given flexibility to pick from reasonable, albeit slightly less probable, choices (say, “banana”) in order to generate more interesting responses. It’s worth noting that while LLMs can perform well on factual prompts and create the impression of retrieving information, they are neither information databases nor deterministic information retrieval systems. So while you can expect a consistent response to a database query (one that is a literal retrieval of the fixed information stored in the database), an LLM’s response to the same prompt will not necessarily be the same every time (nor will it literally retrieve the information it was trained on). This is also an important reason why LLMs can generate plausible-sounding responses that at times contain factual errors: not ideal when factuality matters, but potentially useful for generating creative or unexpected outputs.
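To make this sampling behavior concrete, here is a minimal, self-contained sketch of temperature-based sampling over a toy next-token distribution. The tokens and scores are invented for the “peanut butter and ___” example and do not reflect any real model’s vocabulary or probabilities; real models score many thousands of tokens at each step.

```python
import math
import random

# Toy next-token scores (logits) for the prompt "peanut butter and ___".
# These values are made up for illustration.
next_token_logits = {"jelly": 5.0, "banana": 2.0, "toast": 1.5, "shoelace": -3.0}

def sample_next_token(logits: dict, temperature: float = 1.0) -> str:
    """Softmax the logits at a given temperature and sample one token.

    A temperature near 0 approaches greedy decoding (always "jelly");
    higher temperatures give less probable tokens such as "banana" a
    better chance, which is one source of the non-determinism described
    above."""
    scaled = {t: l / temperature for t, l in logits.items()}
    max_l = max(scaled.values())  # subtract the max for numerical stability
    exp = {t: math.exp(l - max_l) for t, l in scaled.items()}
    total = sum(exp.values())
    tokens, probs = zip(*((t, e / total) for t, e in exp.items()))
    return random.choices(tokens, weights=probs, k=1)[0]

# Repeated calls with the same prompt can return different completions.
print([sample_next_token(next_token_logits, temperature=0.8) for _ in range(5)])
```

Because the sampled token feeds back into the context for the next prediction, a single early divergence (say, “banana” instead of “jelly”) can send the rest of the response down a different path, which is why identical prompts can yield noticeably different outputs.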