Why We Need Automated Fact-Checking
Unlike traditional media, where articles are edited and verified before publication, social media changed the approach completely. Suddenly, everyone could raise their voice. Posts are shared instantly, giving access to ideas and perspectives from all over the world. That was the dream, at least.
What began as a way to protect freedom of speech, giving individuals the opportunity to express opinions without censorship, has come with a trade-off: very little of that information gets checked. And that makes it harder than ever to tell what's accurate and what's not.
An additional challenge arises because false claims rarely appear just once. They are reshared across platforms, often altered in wording, format, length, or even language, which makes detection and verification even more difficult. As these variations circulate, they start to feel familiar and therefore believable to readers.
The original idea of a space for open, uncensored, and reliable information has run into a paradox. The very openness meant to empower people also makes it easy for misinformation to spread. That’s exactly where fact-checking systems come in.
The Development of Fact-Checking Pipelines
Traditionally, fact-checking was a manual process that relied on experts (journalists, researchers, or fact-checking organizations) to verify claims by cross-referencing them against sources such as official documents or expert opinions. This approach was reliable and thorough, but also very time-consuming. The delay gave false narratives more time to circulate, shape public opinion, and enable further manipulation.
This is where automation comes in. Researchers have developed fact-checking pipelines that mimic the work of human fact-checking experts but can scale to massive amounts of online content. A fact-checking pipeline follows a structured process, which usually includes the following five steps:
- Claim Detection – find statements with factual implications.
- Claim Prioritization – rank them by speed of spread, potential harm, or public interest, prioritizing the most impactful cases.
- Retrieval of Evidence – gather supporting material and the context needed to evaluate the claim.
- Veracity Prediction – decide whether the claim is true, false, or something in between.
- Generation of Explanation – produce a justification that readers can understand.
In addition to these five steps, many pipelines add a sixth: retrieval of previously fact-checked claims (PFCR). Instead of redoing the work from scratch, the system checks whether a claim, even reformulated, has already been verified. If so, it is linked to the existing fact-check and its verdict. If not, the pipeline proceeds with evidence retrieval.
This shortcut saves effort, speeds up verification, and is especially valuable in multilingual settings, as it allows fact-checks in one language to support verification in another.
This component is known by many names: verified claim retrieval, claim matching, or previously fact-checked claim retrieval (PFCR). Regardless of the name, the idea is the same: reuse knowledge that already exists to fight misinformation faster and more effectively.
Designing the PFCR Component (Retrieval Pipeline)
At its core, previously fact-checked claim retrieval (PFCR) is an information retrieval task: given a claim from a social media post, we want to find the most relevant match in a large collection of already fact-checked (verified) claims. If a match exists, we can immediately link it to the source and the verdict, so there is no need to start verification from scratch!
Most modern information retrieval systems use a retriever–reranker architecture. The retriever acts as a first-stage filter, returning a broad set of candidate documents (the top k) from the corpus. The reranker then takes those candidates and refines the ranking using a deeper, more computationally intensive model. This two-stage design balances speed (retriever) and accuracy (reranker).
Models used for retrieval can be grouped into two categories:
- Lexical models: fast, interpretable, and effective when there’s strong word overlap. But they struggle when ideas are phrased differently (synonyms, paraphrases, translations).
- Semantic models: capture meaning rather than surface words, making them ideal for PFCR. They would recognize that, for example, “the Earth orbits the Sun” and “our planet revolves around the star at the center of the solar system” are describing the same fact, even though the wording is completely different.
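To make the difference concrete, here is a minimal sketch contrasting word overlap with embedding similarity on that example. The embedding checkpoint is just an illustrative choice, not the one used in the pipeline; any reasonably strong multilingual sentence-embedding model would behave similarly.

```python
# Minimal sketch: lexical overlap vs. semantic similarity for two paraphrased claims.
# The embedding model below is only an illustrative choice.
from sentence_transformers import SentenceTransformer, util

claim_a = "the Earth orbits the Sun"
claim_b = "our planet revolves around the star at the center of the solar system"

# Lexical view: plain word overlap, which is roughly what a term-matching model "sees".
tokens_a, tokens_b = set(claim_a.lower().split()), set(claim_b.lower().split())
print("shared words:", tokens_a & tokens_b)  # only "the" is shared

# Semantic view: cosine similarity between sentence embeddings.
model = SentenceTransformer("sentence-transformers/paraphrase-multilingual-mpnet-base-v2")
embeddings = model.encode([claim_a, claim_b], normalize_embeddings=True)
print("cosine similarity:", float(util.cos_sim(embeddings[0], embeddings[1])))
```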
Once candidates are retrieved, the reranking stage applies more powerful models (often cross-encoders) to carefully re-score the top results, ensuring that the most relevant fact-checks rank higher. Because rerankers are more expensive to run, they're only applied to a smaller pool of candidates (e.g., the top 100).
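As an illustration, a reranking pass over a small candidate pool might look like the sketch below. The cross-encoder checkpoint is only an example (an English MS MARCO reranker); the actual PFCR setting would use a multilingual reranker instead.

```python
# Sketch of reranking a small candidate pool with a cross-encoder.
# The checkpoint is an illustrative English model; a multilingual reranker would be used for PFCR.
from sentence_transformers import CrossEncoder

query = "the Earth orbits the Sun"
candidates = [
    "Our planet revolves around the star at the center of the solar system.",
    "The Moon orbits the Earth roughly once a month.",
    "The Sun orbits the Earth, according to a viral post.",
]

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = reranker.predict([(query, cand) for cand in candidates])

# Sort candidates by the cross-encoder score (higher = more relevant).
for cand, score in sorted(zip(candidates, scores), key=lambda x: -x[1]):
    print(f"{score:.3f}  {cand}")
```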
Together, the retriever–reranker pipeline provides both coverage (by recognizing a wider range of possible matches) and precision (by ranking the most similar ones higher). For PFCR, this balance is crucial: it enables a fast and scalable way to detect repeating claims while keeping accuracy high enough that users can trust the information they read.
Building the Ensemble
The retriever–reranker pipeline already delivers solid performance. But as I evaluated the models and ran the experiments, one thing became clear: no single model is good enough on its own.
Lexical models, like BM25, are great at exact keyword matches, but as soon as the claim is phrased differently, they fail. That's where semantic models step in. They have no trouble handling paraphrases, translations, or cross-lingual scenarios, but they sometimes struggle with straightforward matches where exact wording matters most. Not all semantic models are the same either; each has its own niche: some work better in English, others in multilingual settings, and others at capturing subtle contextual nuances. In other words, just as misinformation mutates and reappears in countless variations, semantic retrieval models bring different strengths depending on how they were trained. If misinformation is adaptable, the retrieval system must be as well.
That's where the idea of an ensemble came in. Instead of betting on a single "best" model, I combined the predictions of multiple models so they could collaborate and complement each other. Why not let them work as a team?
Before going further into the ensemble design, I will briefly explain how I chose the retrievers.
Establishing a Baseline (Lexical Models)
BM25 is one of the most effective and widely used lexical retrieval models, and it often serves as a baseline in modern IR research. Before evaluating the embedding-based (semantic) models, I wanted to see how well (or poorly) BM25 would perform. As it turns out, not badly at all!
Tech detail:
BM25 is a ranking function built on top of TF-IDF. It improves on plain term-frequency scoring in two ways: a saturation function makes repeated occurrences of a term yield diminishing returns, and document length normalization prevents long documents from being unfairly favoured. Two parameters control this behaviour: k1 sets how quickly term frequency saturates, and b sets how strongly scores are normalized by document length.
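For illustration, a BM25 baseline can be set up in a few lines, for example with the rank_bm25 package (the whitespace tokenization and toy corpus below are deliberately simplistic):

```python
# BM25 baseline over a toy fact-check corpus, using the rank_bm25 package.
# k1 controls term-frequency saturation, b controls document-length normalization.
from rank_bm25 import BM25Okapi

fact_checks = [
    "The Earth orbits the Sun once every 365.25 days.",
    "Vaccines do not cause autism.",
    "The Great Wall of China is not visible from the Moon with the naked eye.",
]
tokenized_corpus = [doc.lower().split() for doc in fact_checks]  # naive whitespace tokenization

bm25 = BM25Okapi(tokenized_corpus, k1=1.5, b=0.75)

query = "does the earth go around the sun"
scores = bm25.get_scores(query.lower().split())
best_score, best_doc = max(zip(scores, fact_checks))
print(f"best lexical match ({best_score:.2f}): {best_doc}")
```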
Semantic Models
As a starting point for the semantic (embedding-based) models, I referred to Hugging Face's Massive Text Embedding Benchmark (MTEB) and evaluated the leading models while keeping the GPU resource constraints in mind.
The two models that stood out were E5 (intfloat/multilingual-e5-large-instruct) and BGE (BAAI/bge-m3). Both achieved strong results when retrieving the top 100 candidates, so I selected them for further tuning and integration with BM25.
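As a rough sketch of how these two retrievers produce candidate rankings, assuming both checkpoints can be loaded through sentence-transformers for dense embeddings (BGE-M3 also has a dedicated FlagEmbedding loader) and following the instruction-style query format described on the E5-instruct model card:

```python
# Dense retrieval sketch with E5 and BGE over a toy corpus.
# Assumes both checkpoints load via sentence-transformers; the instruction text is illustrative.
from sentence_transformers import SentenceTransformer, util

fact_checks = [
    "The Earth orbits the Sun once every 365.25 days.",
    "Vaccines do not cause autism.",
    "5G networks do not spread viruses.",
]
claim = "our planet revolves around the star at the center of the solar system"

# E5-instruct expects an instruction-prefixed query (check the model card for the exact format).
e5 = SentenceTransformer("intfloat/multilingual-e5-large-instruct")
task = "Given a social media claim, retrieve fact-checked claims that match it"
e5_query = e5.encode(f"Instruct: {task}\nQuery: {claim}", normalize_embeddings=True)
e5_docs = e5.encode(fact_checks, normalize_embeddings=True)

# BGE-M3 dense embeddings (no special query prefix needed).
bge = SentenceTransformer("BAAI/bge-m3")
bge_query = bge.encode(claim, normalize_embeddings=True)
bge_docs = bge.encode(fact_checks, normalize_embeddings=True)

# Rank the corpus per model; in the full pipeline, each model's top-k list feeds the ensemble.
for name, q, d in [("E5", e5_query, e5_docs), ("BGE", bge_query, bge_docs)]:
    scores = util.cos_sim(q, d)[0]
    best = int(scores.argmax())
    print(f"{name}: best match -> {fact_checks[best]} ({float(scores[best]):.3f})")
```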
Ensemble Design
With retrievers in place, the question was: how do we combine them? I tested different aggregation strategies including majority voting, exponential decay weighting, and reciprocal rank fusion (RRF).
RRF worked best: rather than simply averaging scores, it rewards documents that consistently appear near the top across different rankings, regardless of which model produced them. This way, the ensemble favored claims that multiple models "agreed on," while still allowing each model to contribute independently.
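RRF itself is only a few lines of code: every document receives 1/(k0 + rank) from each ranked list it appears in, and the contributions are summed, optionally with a per-model weight (which is how the model weighting discussed below comes in). A minimal sketch with toy rankings:

```python
# Minimal reciprocal rank fusion (RRF) over several ranked candidate lists.
from collections import defaultdict

def rrf_fuse(ranked_lists, weights=None, k0=60):
    """Fuse rankings: score(doc) = sum over models of weight / (k0 + rank of doc)."""
    weights = weights or {name: 1.0 for name in ranked_lists}
    scores = defaultdict(float)
    for name, ranking in ranked_lists.items():
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weights[name] / (k0 + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Toy example: three retrievers that only partially agree on the candidates.
fused = rrf_fuse(
    {
        "bm25": ["fc_3", "fc_1", "fc_7"],
        "e5":   ["fc_1", "fc_3", "fc_9"],
        "bge":  ["fc_1", "fc_9", "fc_3"],
    },
    weights={"bm25": 0.5, "e5": 1.0, "bge": 1.0},  # e.g. down-weighting the lexical model
)
print(fused)  # fc_1 comes first because it is consistently near the top
```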
I also experimented with the number of candidates retrieved in the first stage (commonly referred to as hyperparameter k). The idea is simple: if you only pull in a very small set of candidates, you risk missing relevant fact-checks altogether. On the other hand, if you select too many, the reranker has to go through a lot of noise, which adds computational cost without actually improving accuracy.
Through the experiments, I found that as k increased, performance improved at first because the ensemble had more chances to find the right fact-checks. But after a certain point, adding more candidates stopped helping. The reranker could already see enough relevant fact-checks to make good decisions, and the extra ones were mostly irrelevant. In practice, this meant finding a “sweet spot” where the candidate pool was large enough to ensure coverage, but not so large that it decreased the reranker’s effectiveness.
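One simple way to locate that sweet spot is to measure, for different values of k, how often the gold fact-check already appears in the candidate pool. The sketch below shows this coverage check on purely hypothetical rankings and labels:

```python
# Sketch: measuring candidate-pool coverage for different values of k
# (the ranked lists and gold labels below are hypothetical).

def coverage_at_k(ranked_candidates, gold, k):
    """Fraction of queries whose correct fact-check appears in the top-k candidate pool."""
    hits = sum(1 for q, ranking in ranked_candidates.items() if gold[q] in ranking[:k])
    return hits / len(ranked_candidates)

candidates = {"q1": ["fc_1", "fc_4", "fc_9", "fc_2"],
              "q2": ["fc_7", "fc_3", "fc_5", "fc_8"],
              "q3": ["fc_6", "fc_2", "fc_4", "fc_3"]}
gold = {"q1": "fc_9", "q2": "fc_3", "q3": "fc_3"}

for k in (1, 2, 4):
    print(f"coverage@{k}: {coverage_at_k(candidates, gold, k):.2f}")
# Coverage climbs as k grows, then flattens once the relevant fact-checks are already in the pool.
```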
As a final step, I adjusted the weights of each model. Reducing BM25's influence while giving more weight to the semantic retrievers boosted performance. In other words, BM25 is useful, but the heavy lifting is done by E5 and BGE.
To briefly recap the PFCR component: the pipeline consists of retrieval and reranking, where retrieval can use lexical or semantic models, while reranking relies on a semantic model. Additionally, combining multiple models in an ensemble improves both retrieval and reranking performance. OK, so where do we integrate the ensemble?
Where Does the Ensemble Fit?
The ensemble wasn't limited to just one part of the pipeline. I applied it at both the retrieval and reranking stages.
- Retriever stage → I merged the candidate lists produced by BM25, E5, and BGE. This way, the system didn’t rely on a single model’s “view” of what might be relevant but instead pooled their perspectives into a stronger starting set.
- Reranker stage → I then combined the rankings from multiple rerankers (again referring to MTEB and my GPU constraints). Since each reranker captures slightly different nuances of similarity, blending them helped refine the final ordering of fact-checks with greater accuracy.
At the retriever stage, the ensemble widened the pool of candidates, making sure that fewer relevant claims slipped through the cracks (improving recall), while the reranker stage narrowed the focus, pushing the most relevant fact-checks to the top (improving precision).
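Put together, the two uses of the ensemble can be wired up roughly as in the sketch below, with placeholder ranked lists standing in for the real outputs of BM25/E5/BGE and of the rerankers:

```python
# Sketch of rank fusion applied at both stages (toy ranked lists stand in for real model outputs).
from collections import defaultdict

def rrf_fuse(ranked_lists, k0=60):
    """Unweighted RRF, as in the earlier sketch."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k0 + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Stage 1: merge the candidate lists from the three retrievers into one wider pool (recall).
retriever_lists = [
    ["fc_3", "fc_1", "fc_7", "fc_5"],  # BM25 (placeholder ranking)
    ["fc_1", "fc_3", "fc_9", "fc_2"],  # E5 (placeholder ranking)
    ["fc_1", "fc_9", "fc_3", "fc_4"],  # BGE (placeholder ranking)
]
candidate_pool = rrf_fuse(retriever_lists)[:100]  # pooled top-k candidates for the rerankers

# Stage 2: each reranker re-orders the pool, and their rankings are fused again (precision).
reranker_lists = [
    ["fc_1", "fc_9", "fc_3", "fc_7"],  # reranker A's ordering of the pool (placeholder)
    ["fc_1", "fc_3", "fc_9", "fc_5"],  # reranker B's ordering of the pool (placeholder)
]
final_ranking = rrf_fuse(reranker_lists)
print(final_ranking[:3])
```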
Bringing It All Together (TL;DR)
Long story short: the envisioned digital utopia of open information sharing does not work without verification, and can even turn into the opposite – a channel for misinformation.
That was the driving force behind the development of automated fact-checking pipelines, which have helped us move closer to that original promise. They make it easier to verify information quickly and at scale, so when false claims pop up in new forms, they can be spotted and addressed without delay, helping maintain accuracy and trust in the digital world.
The takeaway is simple: diversity is key. Just as misinformation spreads by taking on many forms, a resilient fact-checking system benefits from multiple perspectives working together. With an ensemble, the pipeline becomes more robust, more adaptable, and ultimately better able to support a trustworthy digital space.
For the curious minds
If you’re interested in a deeper technical dive into the retrieval and ensemble strategies behind this pipeline, you can check out my full paper here. It goes into the model choices, experiments, and detailed evaluation metrics within the system.
References
Scott A. Hale, Adriano Belisario, Ahmed Mostafa, and Chico Camargo. 2024. Analyzing Misinformation Claims During the 2022 Brazilian General Election on WhatsApp, Twitter, and Kwai. ArXiv:2401.02395.
Rrubaa Panchendrarajan and Arkaitz Zubiaga. 2024. Claim detection for automated fact-checking: A survey on monolingual, multilingual and cross-lingual research. Natural Language Processing Journal, 7:100066.
Matúš Pikuliak, Ivan Srba, Robert Moro, Timo Hromadka, Timotej Smolen, Martin Melišek, Ivan Vykopal, Jakub Simko, Juraj Podroužek, and Maria Bielikova. 2023. Multilingual Previously Fact-Checked Claim Retrieval. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 16477–16500, Singapore. Association for Computational Linguistics.
Preslav Nakov, David Corney, Maram Hasanain, Firoj Alam, Tamer Elsayed, Alberto Barrón-Cedeño, Paolo Papotti, Shaden Shaar, and Giovanni Da San Martino. 2021. Automated Fact-Checking for Assisting Human Fact-Checkers. ArXiv:2103.07769.
Oana Balalau, Pablo Bertaud-Velten, Younes El Fraihi, Garima Gaur, Oana Goga, Samuel Guimaraes, Ioana Manolescu, and Brahim Saadi. 2024. FactCheckBureau: Build Your Own Fact-Check Analysis Pipeline. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, CIKM '24, pages 5185–5189, New York, NY, USA. Association for Computing Machinery.
Alberto Barrón-Cedeño, Tamer Elsayed, Preslav Nakov, Giovanni Da San Martino, Maram Hasanain, Reem Suwaileh, Fatima Haouari, Nikolay Babulkov, Bayan Hamdan, Alex Nikolov, Shaden Shaar, and Zien Sheikh Ali. 2020. Overview of CheckThat! 2020: Automatic Identification and Verification of Claims in Social Media. In Experimental IR Meets Multilinguality, Multimodality, and Interaction, pages 215–236, Cham. Springer International Publishing.
Ashkan Kazemi, Kiran Garimella, Devin Gaffney, and Scott Hale. 2021a. Claim Matching Beyond English to Scale Global Fact-Checking. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4504–4517, Online. Association for Computational Linguistics.
Shaden Shaar, Nikolay Babulkov, Giovanni Da San Martino, and Preslav Nakov. 2020. That is a Known Lie: Detecting Previously Fact-Checked Claims. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3607– 3618, Online. Association for Computational Linguistics.
Gordon V. Cormack, Charles L. A. Clarke, and Stefan Buettcher. 2009. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '09, pages 758–759, New York, NY, USA. Association for Computing Machinery.
Iva Pezo, Allan Hanbury, and Moritz Staudinger. 2025. ipezoTU at SemEval-2025 Task 7: Hybrid Ensemble Retrieval for Multilingual Fact-Checking. In Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025), pages 1159–1167, Vienna, Austria. Association for Computational Linguistics.