Question Answering is a task within the field of natural language processing concerned with building systems that automatically answer questions posed by humans in natural language. The ability to read a passage of text and then answer questions about it is a challenging task for machines, as it requires knowledge about the world. Existing question-answering datasets have two main weaknesses: those with high-quality, human-authored questions are too small for training modern data-hungry models, while those that are large do not share the characteristics of genuine reading-comprehension questions.
To address the need for large, high-quality question-answering datasets, we will discuss some of the popular datasets and how to load them using TensorFlow and PyTorch. Further, we will look at the benchmark models that achieve the best published results on each of these datasets.
SQuAD
The Stanford Question Answering Dataset (SQuAD) is a reading-comprehension dataset consisting of 100,000+ questions posed by crowd workers on a set of Wikipedia articles, where the answer to every question is a segment of text from the corresponding reading passage. The dataset was introduced by Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang of Stanford University. The code below loads SQuAD v2.0, which extends the original dataset with over 50,000 unanswerable questions.
Loading the dataset using PyTorch
import json
import os

from torchnlp.download import download_file_maybe_extract


def squad_dataset(directory='data/',
                  train=False,
                  dev=False,
                  train_filename='train-v2.0.json',
                  dev_filename='dev-v2.0.json',
                  check_files_train=['train-v2.0.json'],
                  check_files_dev=['dev-v2.0.json'],
                  url_train='https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json',
                  url_dev='https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v2.0.json'):
    # Download the splits if they are not already cached locally.
    download_file_maybe_extract(url=url_dev, directory=directory, check_files=check_files_dev)
    download_file_maybe_extract(url=url_train, directory=directory, check_files=check_files_train)

    squad = []
    splits = [(train, train_filename), (dev, dev_filename)]
    splits = [f for (requested, f) in splits if requested]
    for filename in splits:
        full_path = os.path.join(directory, filename)
        with open(full_path, 'r') as temp:
            squad.append(json.load(temp)['data'])

    if len(squad) == 1:
        return squad[0]
    else:
        return tuple(squad)
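Assuming the helper above, the training split can then be loaded and a single question-answer pair inspected. The indexing below follows the SQuAD v2.0 JSON layout (articles contain paragraphs, and each paragraph pairs a context with its questions):

train = squad_dataset(train=True)

# Pick the first question-answer pair of the first article.
paragraph = train[0]['paragraphs'][0]
qa = paragraph['qas'][0]
print(qa['question'])
# Unanswerable SQuAD v2.0 questions have an empty answers list.
if qa['answers']:
    print(qa['answers'][0]['text'])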
Loading the dataset using TensorFlow
import json

import tensorflow as tf


def squad(url, filename='train-v2.0.json'):
    # Download the SQuAD JSON file and cache it locally.
    path = tf.keras.utils.get_file(filename, origin=url)
    with open(path, 'r') as f:
        data = json.load(f)['data']
    # Flatten the nested article -> paragraph -> QA structure
    # into (question, answer) string pairs, skipping unanswerable questions.
    questions, answers = [], []
    for article in data:
        for paragraph in article['paragraphs']:
            for qa in paragraph['qas']:
                if qa['answers']:
                    questions.append(qa['question'])
                    answers.append(qa['answers'][0]['text'])
    return tf.data.Dataset.from_tensor_slices((questions, answers))


train = squad('https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v2.0.json')
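With eager execution enabled (the default in TensorFlow 2), the resulting tf.data.Dataset can be inspected directly; a brief sketch assuming the squad helper above:

# Print a few (question, answer) pairs from the training pipeline.
for question, answer in train.take(3):
    print(question.numpy().decode('utf-8'), '->', answer.numpy().decode('utf-8'))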
State of the Art
The current state of the art on the SQuAD 2.0 leaderboard is SA-Net on Albert, which achieved an F1 score of 93.011.
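SA-Net on Albert is a leaderboard submission rather than a released library, but a publicly available model fine-tuned on SQuAD 2.0 can be tried in a few lines with the Hugging Face transformers package (an extra dependency, used here purely as an illustration; deepset/roberta-base-squad2 is a stand-in checkpoint, not the leaderboard model):

from transformers import pipeline

# A question-answering pipeline backed by a model fine-tuned on SQuAD 2.0.
qa_model = pipeline('question-answering', model='deepset/roberta-base-squad2')

result = qa_model(
    question='Who introduced SQuAD?',
    context='The Stanford Question Answering Dataset (SQuAD) was introduced by '
            'researchers at Stanford University, and every answer is a span of '
            'the corresponding passage.')
print(result['answer'], result['score'])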
bAbI
bAbI is a dataset for question answering and text understanding, introduced by Facebook AI Research. It is composed of a set of contexts (short synthetic stories), with multiple question-answer pairs available for each context, organized into 20 task types. Both English and Hindi versions are provided, and each task comes in a 1,000-example and a 10,000-example (10k) variant.
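To make this concrete, here is what the raw data looks like for task 1 ("single supporting fact"): sentences within a story are numbered, and each question line carries a tab-separated answer plus the index of its supporting sentence (the lines below are illustrative of the file format):

1 Mary moved to the bathroom.
2 John went to the hallway.
3 Where is Mary?	bathroom	1
4 Daniel went back to the hallway.
5 Sandra moved to the garden.
6 Where is Daniel?	hallway	4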
Loading the dataset using PyTorch
import os
from io import open

import torch
# Requires a torchtext release with the legacy data API (torchtext < 0.9).
from torchtext.data import Dataset, Field, Example, Iterator


class BABI20Field(Field):

    def __init__(self, memory_size, **kwargs):
        super(BABI20Field, self).__init__(**kwargs)
        self.memory_size = memory_size
        self.unk_token = None
        self.batch_first = True

    def preprocess(self, x):
        if isinstance(x, list):
            return [super(BABI20Field, self).preprocess(s) for s in x]
        else:
            return super(BABI20Field, self).preprocess(x)

    def pad(self, minibatch):
        if isinstance(minibatch[0][0], list):
            self.fix_length = max(max(len(x) for x in ex) for ex in minibatch)
            padded = []
            for ex in minibatch:
                # Sentences are indexed in reverse order and truncated to memory_size.
                nex = ex[::-1][:self.memory_size]
                padded.append(super(BABI20Field, self).pad(nex)
                              + [[self.pad_token] * self.fix_length]
                              * (self.memory_size - len(nex)))
            self.fix_length = None
            return padded
        else:
            return super(BABI20Field, self).pad(minibatch)

    def numericalize(self, arr, device=None):
        if isinstance(arr[0][0], list):
            tmp = [super(BABI20Field, self).numericalize(x, device=device).data
                   for x in arr]
            arr = torch.stack(tmp)
            if self.sequential:
                arr = arr.contiguous()
            return arr
        else:
            return super(BABI20Field, self).numericalize(arr, device=device)


class BABI20(Dataset):
    urls = ['http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz']
    name = ''
    dirname = ''

    def __init__(self, path, text_field, only_supporting=False, **kwargs):
        fields = [('story', text_field), ('query', text_field), ('answer', text_field)]
        self.sort_key = lambda x: len(x.query)
        with open(path, 'r', encoding="utf-8") as f:
            triplets = self._parse(f, only_supporting)
        examples = [Example.fromlist(triplet, fields) for triplet in triplets]
        super(BABI20, self).__init__(examples, fields, **kwargs)

    @staticmethod
    def _parse(file, only_supporting):
        data, story = [], []
        for line in file:
            tid, text = line.rstrip('\n').split(' ', 1)
            # Line id 1 marks the start of a new story.
            if tid == '1':
                story = []
            # Sentence lines end with '.'; question lines carry tab-separated fields.
            if text.endswith('.'):
                story.append(text[:-1])
            else:
                query, answer, supporting = (x.strip() for x in text.split('\t'))
                if only_supporting:
                    # Keep only the sentences flagged as supporting facts.
                    substory = [story[int(i) - 1] for i in supporting.split()]
                else:
                    substory = [x for x in story if x]
                data.append((substory, query[:-1], answer))  # remove '?'
                story.append("")
        return data

    @classmethod
    def iters(cls, batch_size=32, root='.data', memory_size=50, task=1, joint=False,
              tenK=False, only_supporting=False, sort=False, shuffle=False,
              device=None, **kwargs):
        text = BABI20Field(memory_size)
        # BABI20.splits (provided by the full torchtext implementation this
        # snippet is adapted from) downloads the archive and builds the
        # train/validation/test datasets for the chosen task.
        train, val, test = BABI20.splits(text, root=root, task=task, joint=joint,
                                         tenK=tenK, only_supporting=only_supporting,
                                         **kwargs)
        text.build_vocab(train)
        return Iterator.splits((train, val, test), batch_size=batch_size, sort=sort,
                               shuffle=shuffle, device=device)
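Assuming the full implementation (older torchtext releases ship this same class as torchtext.datasets.BABI20, including the splits method referenced above), iterators for a single task can then be built like this:

# Build train/validation/test iterators for bAbI task 1 (English 10k variant).
train_iter, val_iter, test_iter = BABI20.iters(batch_size=32, task=1, tenK=True)

batch = next(iter(train_iter))
# story is padded to (batch, memory_size, sentence_length);
# query and answer are (batch, tokens).
print(batch.story.shape, batch.query.shape, batch.answer.shape)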
Loading the dataset using Keras
import tarfile

from keras.utils.data_utils import get_file

# Download the bAbI archive into the Keras cache and open it for reading.
try:
    path_new = get_file('babi-tasks-v1-2.tar.gz',
                        origin='https://s3.amazonaws.com/text-datasets/babi_tasks_1-20_v1-2.tar.gz')
except Exception:
    print('Error downloading dataset, please download it manually:\n'
          '$ wget http://www.thespermwhale.com/jaseweston/babi/tasks_1-20_v1-2.tar.gz\n'
          '$ mv tasks_1-20_v1-2.tar.gz ~/.keras/datasets/babi-tasks-v1-2.tar.gz')
    raise

readfile = tarfile.open(path_new)
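Individual task files can then be read straight from the archive without unpacking it; the member path below follows the directory layout of the official tarball:

# Read the training file for task 1 (English, 1k variant) from the archive.
task_path = 'tasks_1-20_v1-2/en/qa1_single-supporting-fact_train.txt'
with readfile.extractfile(task_path) as f:
    lines = f.read().decode('utf-8').splitlines()
print(len(lines), 'lines; first:', lines[0])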
State of the Art
The current state of the art on the bAbI dataset is STM, which achieved an accuracy of 99.85%.
Natural Questions
Natural Questions contains 307,373 training questions, 7,830 development questions, and 7,842 test questions, together with human-annotated answers drawn from Wikipedia pages, for use in training question-answering systems. The dataset is the first to replicate the end-to-end process by which people find answers to questions: the questions are real, anonymized queries issued to the Google search engine. It was released by Google Research.
Loading the dataset using TensorFlow
import glob

import jsonlines
import tensorflow as tf

# Eager execution is required for TensorFlow 1.x; it is the default in 2.x.
tf.enable_eager_execution()

# Pattern matching a local copy of the Natural Questions training shards.
_train_file_path = '/Users/deniz/natural_questions/data/nq-train-*.jsonl'
train_files = glob.glob(_train_file_path)

examples = []
for train_file in train_files:
    print(train_file)
    with jsonlines.open(train_file) as reader:
        for i, example in enumerate(reader):
            # Drop the raw page HTML to keep the examples lightweight.
            del example['document_html']
            examples.append(example)
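Each line in these files is one JSON record. Assuming the full (HTML) Natural Questions format implied by the document_html field above, the question text and its human annotations can then be inspected like so (field names follow the official release; treat this as a sketch):

# Look at the structure of the first loaded example.
first = examples[0]
print(sorted(first.keys()))
print(first['question_text'])
# Each annotation marks a long-answer span, any short-answer spans,
# and a yes/no flag where applicable.
print(first['annotations'][0])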
State of the Art
The current state of the art on the Natural Questions dataset is GPT-3 175B (few-shot), which achieved an accuracy of 29.9%.
Conclusion
In this article, we have covered some of the high-quality datasets used for question answering and shown how to load each corpus with different Python libraries. These datasets feature a diverse range of question and answer types. From the results above, we can see that the STM model performed exceptionally well on the bAbI dataset, with an accuracy above 99%.