Data Acquisition & Preparation: A Crucial Step in Preventing AI Hallucinations
Vast amounts of information are gathered by scraping the internet and by applying Optical Character Recognition (OCR) to transform printed text into digital form. This digitised text, alongside other forms of digital data, is stored in large collections known as datasets. These datasets are filtered to remove impurities such as typographical errors, grammatical mistakes, inconsistent formatting, contradictions, ambiguities, stray special characters, and errors introduced by the OCR process. Despite these efforts, however, biases and inaccuracies can still persist within the data, potentially contributing to the phenomenon of AI hallucinations.
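The sketch below illustrates, in Python, the kind of basic filtering step described above; the cleaning rules, the minimum-length threshold, and the raw_documents input are illustrative assumptions rather than the pipeline of any particular system.
```python
import re

def clean_document(text: str) -> str:
    """Normalise whitespace and strip characters that often come from OCR noise."""
    text = re.sub(r"[^\x20-\x7E\n]", " ", text)   # keep printable ASCII only (a crude way to drop mis-encoded characters)
    text = re.sub(r"\s+", " ", text)              # collapse runs of whitespace
    return text.strip()

def filter_dataset(raw_documents):
    """Keep only documents that survive basic quality checks."""
    cleaned = (clean_document(doc) for doc in raw_documents)
    # Discard very short fragments, which are often OCR artefacts or boilerplate.
    return [doc for doc in cleaned if len(doc.split()) >= 20]

# Example usage with two toy documents:
dataset = filter_dataset(["A short fragment.", "A longer, well-formed paragraph " * 10])
print(len(dataset))  # -> 1: the short fragment is filtered out
```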
Following this filtering process, the data is tokenised, i.e. broken down into smaller units such as words or sub-word pieces, and then converted into numerical representations that the model or algorithm can process effectively. The model is trained on these datasets through the application of self-supervised learning techniques, which encompass tasks such as predicting the next word in a sequence.
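The following sketch shows the text → token → number pipeline on a toy corpus; production LLMs use sub-word tokenisers such as byte-pair encoding, so the whitespace tokeniser and hand-built vocabulary here are simplifying assumptions.
```python
corpus = ["the cat sat on the mat", "the dog sat on the rug"]

def tokenise(sentence: str):
    """Split text into tokens; real systems use sub-word schemes such as BPE."""
    return sentence.lower().split()

# Build a vocabulary that maps each distinct token to an integer id.
vocab = {}
for sentence in corpus:
    for token in tokenise(sentence):
        vocab.setdefault(token, len(vocab))

def encode(sentence: str):
    """Convert text into the numerical representation the model actually sees."""
    return [vocab[token] for token in tokenise(sentence)]

print(encode("the cat sat on the rug"))  # -> [0, 1, 2, 3, 0, 6]
```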
During the training phase, the model undergoes iterative cycles of processing using the designated algorithm(s), where the parameters are adjusted to minimise the discrepancy between the predicted and actual outputs. This dynamic process enables the machine to learn by adapting to the intricate patterns and relationships within the data. The trained model itself, with its learned parameters and underlying architecture, constitutes a Large Language Model (LLM).
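As a rough illustration of that training loop, the PyTorch sketch below fits a deliberately tiny next-token predictor; the embedding-plus-linear model, toy token ids, and learning rate are assumptions chosen for brevity, not how a production LLM is built.
```python
import torch
import torch.nn as nn

vocab_size = 7
# Toy (current token, next token) pairs, e.g. derived from the encoded corpus above.
inputs  = torch.tensor([0, 1, 2, 3, 0])   # "the cat sat on the"
targets = torch.tensor([1, 2, 3, 0, 4])   # "cat sat on the mat"

model = nn.Sequential(nn.Embedding(vocab_size, 16), nn.Linear(16, vocab_size))
optimiser = torch.optim.Adam(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    logits = model(inputs)            # predicted scores over the next token
    loss = loss_fn(logits, targets)   # discrepancy between predicted and actual output
    optimiser.zero_grad()
    loss.backward()                   # compute gradients of the loss
    optimiser.step()                  # adjust parameters to reduce the discrepancy

print(f"final training loss: {loss.item():.4f}")
```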
The Illusion of Accuracy and the Reality of AI Hallucinations
Achieving low error rates on a dataset doesn’t automatically imply a model’s effectiveness or appropriateness for the task at hand.

AI models can produce incorrect or misleading results due to several factors, including:
- Insufficient or low-quality training data: Noise in the data, such as outliers, can lead the model to generate irrelevant or incorrect information. For example, the model might produce outputs akin to “two plus two equals five”.
- Incorrect assumptions: The model may make flawed assumptions based on the input data. For example, it might misclassify a stick insect as a stick because of their visual similarity.
- Biases within the training data: These biases can be reflected and amplified in the model’s outputs. For example, recruitment models designed to identify suitable candidates for STEM (Science, Technology, Engineering, and Mathematics) roles may have been trained primarily on data containing significantly more male candidates, which can inadvertently disadvantage female candidates during the recruitment process (a simple representation check is sketched after this list).
- Logical inferences beyond its knowledge: The model may attempt to generate coherent responses by making logical inferences that exceed its actual understanding of the information. For example, a triage ambulance system programmed to prioritise patients based on the severity of their condition might misinterpret data: it could fail to account for the fact that certain symptoms, while indicative of serious conditions, can also point to less critical ones, such as anxiety attacks. This could lead the system to erroneously prioritise patients with anxiety attacks over those experiencing true medical emergencies, potentially resulting in preventable deaths.
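As a concrete illustration of the bias point above, the short sketch below checks how different groups are represented in a training set; the candidates records and field names are hypothetical.
```python
from collections import Counter

# Hypothetical training records for a recruitment model.
candidates = [
    {"gender": "male", "hired": True},
    {"gender": "male", "hired": True},
    {"gender": "male", "hired": False},
    {"gender": "female", "hired": False},
]

counts = Counter(record["gender"] for record in candidates)
total = sum(counts.values())
for group, n in counts.items():
    print(f"{group}: {n / total:.0%} of training examples")
# A heavily skewed split (here 75% male) is a warning sign that the model may
# learn to associate suitability with gender rather than with qualifications.
```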
Dangers of AI Hallucination:
There have been a number of occasions where AI hallucinations could have resulted in dire consequences. For example, in November 2024 it was reported (Google’s Gemini turns villain: Ai asks user to die, calls him ‘waste of time, a burden on society’, 2024) that an American student, who was interacting with Google’s Gemini for assistance with his homework on the challenges facing ageing adults, suddenly received threatening and offensive remarks from the AI, namely, “You are not special, you are not important, and you are not needed. You are a waste of time and resources.” “…Please die. Please.” The student and his sister, who verified the response, reported that they were both scared and “freaked out”. The sister also expressed concern that such a response could potentially trigger suicidal thoughts in individuals experiencing mental distress.
In March 2024, “The Markup” reported that the Microsoft-powered chatbot “MyCity” was providing entrepreneurs with false information that could result in them breaking the law (Lecher, NYC’s AI chatbot tells businesses to break the law – The Markup, 2024). It claimed that ‘bosses can take workers’ tips and landlords can discriminate based on source of income.’ Imagine the distress to the many affected individuals, and the detrimental consequences for their finances, reputation, and mental well-being, had they acted on this misinformation resulting from AI hallucinations.
Model Development: A Framework for Addressing AI Hallucinations
1. Choose the Right Model:
Avoid a ‘one-size-fits-all’ approach. The model must be tailored to the specific problem and to the characteristics of the data (e.g., by employing decision trees, neural networks, or support vector machines). This tailored approach helps reduce the margin for error and minimise the risk of catastrophic outcomes.
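One practical way to avoid a one-size-fits-all choice is to benchmark several model families on the task’s own data before committing to one. The sketch below does this with scikit-learn; the digits dataset and the three candidate models are stand-ins, not recommendations.
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)   # placeholder for the task's own data
candidates = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "support vector machine": SVC(),
    "neural network": MLPClassifier(max_iter=500, random_state=0),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)   # 5-fold cross-validated accuracy
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```
In practice the comparison should use the project’s own data splits and evaluation metric, not plain accuracy on a convenience dataset.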
2. Train The Model Effectively:
In some cases, the model may need to learn intricate patterns and relationships within the data by iteratively adjusting its parameters to minimise errors and maximise performance.
To further enhance training effectiveness, the following aspects should be considered:
(i) High-Quality Data:
– Clean data: Ensure the data is free from errors, inconsistencies, and biases.
– Representative data: Training data should accurately reflect the real-world scenarios that the model will encounter.
– Sufficient data: A large and diverse dataset is crucial for training robust models, especially deep learning models.
(ii) Regularization Techniques:
– L1/L2 regularization: Apply penalties that discourage excessively large parameter values (weights) to prevent overfitting; this and dropout are sketched in the code example below.
– Dropout: Randomly deactivate neurons during training to improve generalisation.
(iii) Computational Resources:
– Powerful hardware: Utilise GPUs or TPUs to accelerate training, especially for large models and complex tasks.
– Efficient computing infrastructure: Leverage cloud computing platforms or distributed training frameworks to scale training efficiently.
Careful consideration of these factors, coupled with the implementation of appropriate strategies, significantly improves the effectiveness of AI model training, leading to more accurate, reliable, and robust models.
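The short PyTorch sketch below illustrates the two regularisation techniques listed above, L2 weight decay and dropout; the layer sizes, dropout rate, and weight-decay value are illustrative assumptions.
```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly deactivate neurons during training
    nn.Linear(128, 10),
)
# weight_decay applies an L2 penalty that discourages excessively large weights.
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

model.train()              # dropout is active while training
x = torch.randn(32, 64)    # a dummy batch of 32 examples with 64 features
print(model(x).shape)      # torch.Size([32, 10])
model.eval()               # dropout is disabled at inference time
```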
3. Optimise with Hyperparameter Tuning:
To achieve optimal efficiency and accuracy, the model will need to undergo fine-tuning of its settings (hyperparameters).
(i) Search strategies: Employ techniques such as grid search or random search to discover the optimal combination of hyperparameters (a grid-search sketch follows this list).
(ii) Gradient descent variants: Explore techniques that accelerate training and improve convergence, such as Adam, RMSprop, and Stochastic Gradient Descent (SGD) with momentum.
(iii) Key hyperparameters: Experiment with different learning rates, batch sizes, and regularization parameters to find the optimal configuration.
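The sketch below shows a grid search with scikit-learn’s GridSearchCV; the SVC model and the parameter grid are illustrative assumptions.
```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)   # placeholder for the task's own data
param_grid = {"C": [0.1, 1, 10], "gamma": ["scale", 0.01, 0.001]}

# Try every combination in the grid, scoring each with 5-fold cross-validation.
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, f"best cross-validated accuracy: {search.best_score_:.3f}")
```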
4. Validate and Test Rigorously:
Testing on properly prepared, held-out data that the model has not seen during training is crucial; a brief sketch follows the list below.
(i) Rigorous testing: Evaluate the model’s performance on a separate test set to obtain unbiased estimates of its accuracy.
(ii) Cross-validation: Use techniques like k-fold cross-validation to obtain a more reliable estimate of how well the model generalises and to detect overfitting.
(iii) Error checking on separate validation data sets: This should be performed both by humans and by automated, non-AI checks (e.g. rule-based scripts), to ensure independent monitoring and objectivity.
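The sketch below combines a held-out test set with k-fold cross-validation using scikit-learn; the dataset and the logistic-regression model are placeholders for the task at hand.
```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = load_digits(return_X_y=True)   # placeholder for the task's own data
# Keep a test set the model never sees during training or tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000)
cv_scores = cross_val_score(model, X_train, y_train, cv=5)   # 5-fold cross-validation
print(f"cross-validation accuracy: {cv_scores.mean():.3f}")

model.fit(X_train, y_train)
print(f"held-out test accuracy: {model.score(X_test, y_test):.3f}")   # unbiased estimate
```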
5. Continuous Improvement:
Utilise the following processes to maintain accuracy over time (a simple monitoring-and-retraining sketch follows this list):
(i) Regular monitoring: Continuously monitor the model’s performance in production and identify areas for improvement.
(ii) Retraining: Periodically retrain the model with new data to maintain accuracy and adapt to changing conditions.
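A minimal sketch of such a monitor-and-retrain loop is shown below; the accuracy threshold, the scikit-learn model, and the way recent production data is simulated are all assumptions for illustration.
```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

ACCURACY_THRESHOLD = 0.90   # assumed acceptable level for this hypothetical service

def monitor_and_maybe_retrain(model, recent_X, recent_y):
    """Evaluate on recently labelled production data; retrain if accuracy has drifted."""
    accuracy = model.score(recent_X, recent_y)
    if accuracy < ACCURACY_THRESHOLD:
        model.fit(recent_X, recent_y)   # refresh the model with the newly collected data
    return accuracy

# Example usage: a toy dataset stands in for historical and recent production data.
X, y = load_digits(return_X_y=True)
X_old, X_new, y_old, y_new = train_test_split(X, y, test_size=0.3, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_old, y_old)
print(f"accuracy on recent data: {monitor_and_maybe_retrain(model, X_new, y_new):.3f}")
```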