"Fake-News-Detector" does not detect fake news, only word-level associations, can cause serious harm and contains no disclaimer. (legal liability)
@JainilP30 This project is advertised as a "Fake News" detector that uses an ensemble method framed as improving accuracy. That claim simply does not hold up, and if people use this in a serious way to fact-check news, it can cause real harm (even if it is intended as a demo, many people don't know how this technology works, so they may just trust it).
Here are the main fatal flaws:
There is a hard cap of 300 tokens/words, so everything after that gets cut off. This hurts longer articles, such as pieces in research journals or articles that cover an issue more comprehensively.
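As a rough illustration of what a hard cap like that does (this is a sketch of the behaviour described above, not the Space's actual code):

```python
# Minimal sketch (assumed behaviour): anything past the first 300 whitespace-separated
# tokens never reaches the models.
MAX_TOKENS = 300

def truncate(text: str) -> str:
    # Keep only the first 300 tokens; the rest of the article is silently discarded.
    return " ".join(text.split()[:MAX_TOKENS])

long_article = "word " * 1000
print(len(truncate(long_article).split()))  # 300 -- the remaining 700 words are gone
```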
Text cleaning: all of string.punctuation (!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~) is stripped, so important emotive, grammatical, and structural information is removed. How is the model supposed to know whether the article contains a quote or whether it is what the author wrote? This is one of the largest flaws: the project relies on non-contextual, word-level text embeddings, and that is one of the main issues.
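Here is a minimal sketch of that kind of cleaning step (the exact preprocessing in the Space is an assumption on my part); note how the quotation marks, dash, and question mark vanish, so a quoted claim and the author's own claim become indistinguishable:

```python
# Sketch of punctuation stripping (assumed preprocessing, not copied from the Space).
import string

text = 'The minister said the report was "completely fabricated" - was it?'
cleaned = text.translate(str.maketrans("", "", string.punctuation)).lower()
print(cleaned)  # -> the minister said the report was completely fabricated  was it
```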
Models used:
#1 Naive Bayes with TF-IDF (55% of the prediction weight): this model doesn't understand word order, grammar, or context (that's what "naive" means here), so it cannot distinguish "a man bit a dog" from "a dog bit a man."
#2 Logistic Regression (10% of the prediction weight): this model is linear in nature and cannot pick up nuanced connections; it only learns which words appear frequently in articles labelled "fake news" or "real news" in the training datasets. It will fail when the language is more subtle, such as parody articles, or articles that happen to use less emotive language.
#3 GloVe embeddings (35% of the prediction weight): these are static, general-purpose pre-trained embeddings trained on a public corpus, with no fine-tuning for news classification. Even if they were fine-tuned, they are word-level, non-contextual embeddings, so "river bank" and "investment bank" end up with the same representation for "bank". See the sketch below for both failure modes.
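A small sketch of both failure modes, using scikit-learn's TfidfVectorizer and gensim's downloadable GloVe vectors (the model name glove-wiki-gigaword-50 is just an example, not necessarily what the Space uses):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import gensim.downloader as api

# 1) Bag-of-words TF-IDF: word order is discarded, so these two sentences
#    become identical vectors and no downstream model can tell them apart.
vec = TfidfVectorizer()
X = vec.fit_transform(["a man bit a dog", "a dog bit a man"])
print(np.allclose(X[0].toarray(), X[1].toarray()))  # True

# 2) Static GloVe embeddings: one fixed vector per word, no context.
glove = api.load("glove-wiki-gigaword-50")   # ~66 MB download on first use
bank_near_river = glove["bank"]              # vector used when embedding "river bank"
bank_near_money = glove["bank"]              # the very same vector for "investment bank"
print(np.array_equal(bank_near_river, bank_near_money))  # True: context never enters
```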
Frozen NLTK-data zip: you should let the NLTK data be pulled dynamically instead of storing a snapshot from when you built this; parts of it may change over time, and although I don't use it often, you might see issues when it is used in an unintended way like that.
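Something like the following is the usual pattern (which corpora the Space actually needs is an assumption on my part):

```python
# Sketch: fetch NLTK resources at startup if missing, instead of shipping a frozen zip.
import nltk

for resource in ["stopwords", "punkt"]:
    nltk.download(resource, quiet=True)  # no-op if the data is already present locally

from nltk.corpus import stopwords
print(len(stopwords.words("english")))  # sanity check that the data is available
```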
Legal issues: you may run into licensing problems by keeping checkpoints of models without attribution (even permissive licenses often require attribution). Even if some of them are fine-tunes, you usually still need to cite them, and the lack of a disclaimer opens you up to potential legal liability.
I pulled a topic sentence from a real GB News article: Rachel Reeves poked fun at Nigel Farage during her Spending Review announcement in the Commons, suggesting the Reform UK leader should "spend less time in the Westminster Arms" pub.
Model 1 0.5837
Model 2 0.2951
Model 3 0.2330
Ensemble Score: 0.4321
Final Prediction: ❌ Fake News
If you replace Nigel Farage with Nelson Mandela, or anyone likely to have positive sentiment/alignment in the dataset, it gives this:
Model 1 0.6026
Model 2 0.4013
Model 3 0.3942
Ensemble Score: 0.5096
Final Prediction: ✅ Real News
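For what it's worth, these numbers are consistent with a plain weighted average at the stated 55/10/35 weights (my assumption about how the ensemble combines the scores):

```python
# Quick check, assuming the ensemble score is a simple weighted average.
weights = [0.55, 0.10, 0.35]                 # NB / LogReg / GloVe, as stated above
farage  = [0.5837, 0.2951, 0.2330]
mandela = [0.6026, 0.4013, 0.3942]

for name, scores in [("Farage", farage), ("Mandela", mandela)]:
    ensemble = sum(w * s for w, s in zip(weights, scores))
    print(name, round(ensemble, 4))
# -> Farage 0.4321, Mandela 0.5095 (matches the reported scores up to rounding)
```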
Nelson Mandela died in 2013, so this is not a training-data issue; of course the model doesn't know who he is, as none of these methods provide meaningful contextual embedding information.
Conclusion:
I realise this is a personal project and it's a cool NLP project, but please be careful with claims; people are gullible and do stupid things, especially when the technology is not well understood.
"AI" is hot right now(meaning mostly transformer models), and it is very useful, it will change society, but that's also whats scary because its hard to know what is genuinely useful and what is snake oil from a outsiders perspective.
I would recommend renaming this Space to "Negative Word/Phrase Presence Detector", or similar, and adding a disclaimer.
Or you could do something with text entailment or political bias detection, which are attainable.
Using ML for fact checking is largely an unsolved problem with many traps. What you essentially need is ground truth, and the problem is reality itself: there is no ground truth for every statement (some things have no answer, or are so nuanced that a broad statement is untestable). That is why, instead of "fake" vs "real", one would need to split a statement into parts and evaluate the statements of fact separately. I have seen some Spaces on here that do this, but most use naive approaches that don't even hold up to basic scrutiny.
For now, fact checking should be done by humans. Humans can use models for semantic search to find quality sources, but ultimately AI can't do this right now. (Provide a link to an open-source project or peer-reviewed research resource that does this and I would be delighted to be proven wrong; I've been trying to find something like that.)
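By "semantic search" I mean something like the sketch below: embed the claim and candidate source passages, rank by similarity, and let a human read the top hits (the model name and the example passages are illustrative assumptions, not part of this project):

```python
# Sketch of semantic search over candidate sources with sentence-transformers.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

claim = "Rachel Reeves joked about Nigel Farage during the Spending Review."
sources = [
    "Hansard transcript of the Spending Review statement in the Commons.",
    "A recipe blog post about baking sourdough bread.",
    "GB News report on Reeves's remarks about the Reform UK leader.",
]

claim_emb = model.encode(claim, convert_to_tensor=True)
source_embs = model.encode(sources, convert_to_tensor=True)
scores = util.cos_sim(claim_emb, source_embs)[0]

# Rank sources by similarity so a human fact-checker can read the best candidates.
for score, passage in sorted(zip(scores.tolist(), sources), reverse=True):
    print(f"{score:.2f}  {passage}")
```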
Hi,
Thank you so much for your in-depth and honest feedback — I truly appreciate the time and effort you took to break down the limitations and risks of this project.
You're absolutely right that fact-checking is a very complex task, and the simple methods I’ve used in this project are not nearly enough to handle it properly. I understand that these basic models can't truly understand the meaning or context of news content, and using them for something as serious as identifying fake news can be misleading—especially for people who may not know how these tools work behind the scenes.
This project was built for educational purposes only and is my first hands-on exploration of NLP and ML deployment. I intended it only for learning and experimentation, not for production use or serious fact-checking.
Following your suggestion, I’ve already made the following updates:
Added a clear disclaimer in the README highlighting that the app is not meant for real-world use.
Included the MIT license and proper attributions for external tools like Gradio and GloVe.
Started actively exploring better modeling approaches for contextual understanding.
Your comments have helped me realize where I need to be more responsible, especially when working with something as sensitive as misinformation detection. I’ll make sure to carry these lessons forward into future work.
Once again, thank you for your insights; they genuinely helped me grow as a learner in this space.
Best regards,
Jainil