
214 Best Big Data Research Topics for Your Thesis Paper


Finding an ideal big data research topic can take a long time. Big data, IoT, and robotics have evolved rapidly, and future generations will be immersed in major technologies that make work easier. Work that was once done by ten people can now be done by one person or a machine. This is remarkable because, even though some jobs will be lost, more jobs will be created.

Big data is a major topic that is being embraced globally. Data science and analytics are helping institutions, governments, and the private sector. We will share with you the best big data research topics.

On top of that, we can offer you the best writing tips to help you prosper in your academics. As a university student, you need to do proper research to get top grades. Hence, you can consult us if you need research paper writing services.

Big Data Analytics Research Topics for your Research Project

Are you looking for an ideal big data analytics research topic? Once you choose a topic, consult your professor to evaluate whether it is a strong one. This will help you get good grades.

  • Which are the best tools and software for big data processing?
  • Evaluate the security issues that face big data.
  • An analysis of large-scale data for social networks globally.
  • The influence of big data storage systems.
  • The best platforms for big data computing.
  • The relation between business intelligence and big data analytics.
  • The importance of semantics and visualization of big data.
  • Analysis of big data technologies for businesses.
  • The common methods used for machine learning in big data.
  • The difference between self-tuning and symmetrical spectral clustering.
  • The importance of information-based clustering.
  • Evaluate the hierarchical clustering and density-based clustering application.
  • How is data mining used to analyze transaction data?
  • The major importance of dependency modeling.
  • The influence of probabilistic classification in data mining.

Interesting Big Data Analytics Topics

Who said big data had to be boring? Here are some interesting big data analytics topics that you can try. They are based on phenomena that help make the world a better place.

  • Discuss the privacy issues in big data.
  • Evaluate scalable storage systems in big data.
  • The best big data processing software and tools.
  • Popularly used data mining tools and techniques.
  • Evaluate the scalable architectures for parallel data processing.
  • The major natural language processing methods.
  • Which are the best big data tools and deployment platforms?
  • The best algorithms for data visualization.
  • Analyze anomaly detection in cloud servers.
  • The screening normally done when recruiting for big data job profiles.
  • The malicious user detection in big data collection.
  • Learning long-term dependencies via the Fourier recurrent units.
  • Nomadic computing for big data analytics.
  • The elementary estimators for graphical models.
  • The memory-efficient kernel approximation.

Big Data Latest Research Topics

Do you know the latest research topics at the moment? These 15 topics will help you to dive into interesting research. You may even build on research done by other scholars.

  • Evaluate the data mining process.
  • The influence of the various dimension reduction methods and techniques.
  • The best data classification methods.
  • The simple linear regression modeling methods.
  • Evaluate the logistic regression modeling.
  • What are the commonly used theorems?
  • The influence of cluster analysis methods in big data.
  • The importance of smoothing methods analysis in big data.
  • How is fraud detection done through AI?
  • Analyze the use of GIS and spatial data.
  • How important is artificial intelligence in the modern world?
  • What is agile data science?
  • Analyze the behavioral analytics process.
  • Semantic analytics distribution.
  • How is domain knowledge important in data analysis?

Big Data Debate Topics

If you want to prosper in the field of big data, you need to try even harder topics. These big data debate topics are interesting and will help you gain a better understanding.

  • The difference between big data analytics and traditional data analytics methods.
  • Why do you think the organization should think beyond the Hadoop hype?
  • Does the size of the data matter more than how recent the data is?
  • Is it true that bigger data are not always better?
  • The debate of privacy and personalization in maintaining ethics in big data.
  • The relation between data science and privacy.
  • Do you think data science is a rebranding of statistics?
  • Who delivers better results between data scientists and domain experts?
  • According to your view, is data science dead?
  • Do you think analytics teams need to be centralized or decentralized?
  • The best methods to resource an analytics team.
  • The best business case for investing in analytics.
  • The societal implications of the use of predictive analytics within Education.
  • Is there a need for greater control to prevent experimentation on social media users without their consent?
  • How is the government using big data: to improve public statistics or to control the population?

University Dissertation Topics on Big Data

Are you doing your Master’s or Ph.D. and wondering about the best dissertation or thesis topic? Why not try any of these? They are interesting and based on various phenomena. While doing the research, ensure you relate the phenomenon to modern society.

  • The machine learning algorithms used for fall recognition.
  • The divergence and convergence of the internet of things.
  • Reliable data movement using bandwidth provisioning strategies.
  • How is big data analytics using artificial neural networks in cloud gaming?
  • How is Twitter account classification done using network-based features?
  • How is online anomaly detection done in the cloud collaborative environment?
  • Evaluate the public transportation insights provided by big data.
  • Evaluate paradigms that use nursing EHR data to predict outcomes for cancer patients.
  • Discuss the current data lossless compression in the smart grid.
  • How does online advertising traffic prediction help boost businesses?
  • How is the hyperspectral classification done using the multiple kernel learning paradigm?
  • The analysis of large data sets downloaded from websites.
  • How does social media data help advertising companies globally?
  • Which are the systems recognizing and enforcing ownership of data records?
  • The alternate possibilities emerging for edge computing.

The Best Big Data Analysis Research Topics and Essays

There are many issues associated with big data. Here are some research topics that you can use in your essays. These topics are ideal whether you are in high school or college.

  • The various errors and uncertainty in making data decisions.
  • The application of big data on tourism.
  • The automation innovation with big data and related technologies.
  • The business models of big data ecosystems.
  • Privacy awareness in the era of big data and machine learning.
  • The data privacy for big automotive data.
  • How is traffic managed in software-defined data center networks?
  • Big data analytics for fault detection.
  • The need for machine learning with big data.
  • The innovative big data processing used in health care institutions.
  • The money normalization and extraction from texts.
  • How is text categorization done in AI?
  • The opportunistic development of data-driven interactive applications.
  • The use of data science and big data towards personalized medicine.
  • The programming and optimization of big data applications.

The Latest Big Data Research Topics for your Research Proposal

Doing a research proposal can be hard at first unless you choose an ideal topic. If you are just diving into the big data field, you can use any of these topics to get a deeper understanding.

  • The data-centric network of things.
  • Big data management using artificial intelligence in the supply chain.
  • The big data analytics for maintenance.
  • The high confidence network predictions for big biological data.
  • The performance optimization techniques and tools for data-intensive computation platforms.
  • The predictive modeling in the legal context.
  • Analysis of large data sets in life sciences.
  • How can we understand mobility and transport modal disparities using emerging data sources?
  • How do you think data analytics can support asset management decisions?
  • An analysis of travel patterns for cellular network data.
  • The data-driven strategic planning for citywide building retrofitting.
  • How is money normalization done in data analytics?
  • Major techniques used in data mining.
  • The big data adaptation and analytics of cloud computing.
  • The predictive data maintenance for fault diagnosis.

Interesting Research Topics on A/B Testing In Big Data

A/B testing topics differ from the usual big data topics, although the methodology for investigating them is similar. These topics are interesting and will help you gain a deeper understanding.

  • How is ultra-targeted marketing done?
  • The transition of A/B testing from digital to offline.
  • How can big data and A/B testing be done to win an election?
  • Evaluate the use of A/B testing on big data.
  • Evaluate A/B testing as a randomized control experiment.
  • How does A/B testing work?
  • The mistakes to avoid while conducting the A/B testing.
  • The most ideal time to use A/B testing.
  • The best way to interpret results for an A/B test.
  • The major principles of A/B tests.
  • Evaluate the cluster randomization in big data.
  • The best way to analyze A/B test results and the statistical significance.
  • How is A/B testing used in boosting businesses?
  • The importance of data analysis in conversion research
  • The importance of A/B testing in data science.

Amazing Research Topics on Big Data and Local Governments

Governments are now using big data to improve citizens’ lives, both in central government and in various institutions. These topics are based on real-life experiences and on making the world better.

  • Assess the benefits and barriers of big data in the public sector.
  • The best approach to smart city data ecosystems.
  • The big data analytics used for policymaking.
  • Evaluate smart technology and the emergence of algorithmic bureaucracy.
  • Evaluate the use of citizen scoring in public services.
  • An analysis of the government administrative data globally.
  • The public values found in the era of big data.
  • Public engagement on local government data use.
  • Data analytics use in policymaking.
  • How are algorithms used in public sector decision-making?
  • The democratic governance in the big data era.
  • The best business model innovation to be used in sustainable organizations.
  • How does the government use the collected data from various sources?
  • The role of big data for smart cities.
  • How does big data play a role in policymaking?

Easy Research Topics on Big Data

Who said big data topics had to be hard? Here are some of the easiest research topics. They are based on data management, research, and data retention. Pick one and try it!

  • Who uses big data analytics?
  • Evaluate structured machine learning.
  • Explain the whole deep learning process.
  • Which are the best ways to manage platforms for enterprise analytics?
  • Which are the new technologies used in data management?
  • What is the importance of data retention?
  • The best ways to work with images when doing research.
  • The best way to promote research outreach is through data management.
  • The best way to source and manage external data.
  • Does machine learning improve the quality of data?
  • Describe the security technologies that can be used in data protection.
  • Evaluate token-based authentication and its importance.
  • How can poor data security lead to the loss of information?
  • How to determine whether data is secure.
  • What is the importance of centralized key management?

Unique IoT and Big Data Research Topics

The Internet of Things has evolved, and many devices now use it. There are smart devices, smart cities, smart locks, and much more. Things can now be controlled at the touch of a button.

  • Evaluate the 5G networks and IoT.
  • Analyze the use of Artificial intelligence in the modern world.
  • How do ultra-low-power IoT technologies work?
  • Evaluate the adaptive systems and models at runtime.
  • How have smart cities and smart environments improved the living space?
  • The importance of the IoT-based supply chains.
  • How does smart agriculture influence water management?
  • Naming and identifiers for internet applications.
  • How does the smart grid influence energy management?
  • Which are the best design principles for IoT application development?
  • The best human-device interactions for the Internet of Things.
  • The relation between urban dynamics and crowdsourcing services.
  • The best wireless sensor network for IoT security.
  • The best intrusion detection in IoT.
  • The importance of big data on the Internet of Things.

Big Data Database Research Topics You Should Try

Big data is broad and interesting. These big data database research topics will put you in a better place in your research. You also get to evaluate the roles of various phenomena.

  • The best cloud computing platforms for big data analytics.
  • The parallel programming techniques for big data processing.
  • The importance of big data models and algorithms in research.
  • Evaluate the role of big data analytics for smart healthcare.
  • How is big data analytics used in business intelligence?
  • The best machine learning methods for big data.
  • Evaluate the Hadoop programming in big data analytics.
  • What is privacy-preserving to big data analytics?
  • The best tools for massive big data processing.
  • IoT deployment in Governments and Internet service providers.
  • How will IoT be used for future internet architectures?
  • How does big data close the gap between research and implementation?
  • What are the cross-layer attacks in IoT?
  • The influence of big data and smart city planning in society.
  • Why do you think user access control is important?

Big Data Scala Research Topics

Scala is a programming language that is used in data management and is closely related to other data programming languages. Here are some of the best Scala questions that you can research.

  • Which are the most used languages in big data?
  • How is Scala used in big data research?
  • Is Scala better than Java for big data?
  • How is Scala a concise programming language?
  • How does the Scala language support real-time stream processing?
  • Which are the various libraries for data science and data analysis?
  • How does Scala allow imperative programming in data collection?
  • Evaluate how Scala includes a useful REPL for interaction.
  • Evaluate Scala’s IDE support.
  • The data catalog reference model.
  • Evaluate the basics of data management and its influence on research.
  • Discuss the behavioral analytics process.
  • What can be termed the experience economy?
  • The difference between agile data science and the Scala language.
  • Explain the graph analytics process.

Independent Research Topics for Big Data

These independent research topics for big data are based on various technologies and how they are related. Big data will be greatly important for modern society.

  • The biggest investment is in big data analysis.
  • How are multi-cloud and hybrid settings taking deep root?
  • Why do you think machine learning will be in focus for a long while?
  • Discuss in-memory computing.
  • What is the difference between edge computing and in-memory computing?
  • The relation between the Internet of things and big data.
  • How will digital transformation make the world a better place?
  • How does data analysis help in social network optimization?
  • How will complex big data be essential for future enterprises?
  • Compare the various big data frameworks.
  • The best way to gather and monitor traffic information using CCTV images.
  • Evaluate the hierarchical structure of groups and clusters in the decision tree.
  • Which are the 3D mapping techniques for live streaming data?
  • How does machine learning help to improve data analysis?
  • Evaluate DataStream management in task allocation.
  • How is big data provisioned through edge computing?
  • The model-based clustering of texts.
  • The best ways to manage big data.
  • The use of machine learning in big data.

Is Your Big Data Thesis Giving You Problems?

These are some of the best topics that you can use to prosper in your studies. Not only are they easy to research, but they also reflect real-time issues. Whether in university or college, you need to put enough effort into your studies to prosper. However, if you have time constraints, we can provide professional writing help. Are you looking for online expert writers? Look no further; we will provide quality work at an affordable price.




Top 20 Latest Research Problems in Big Data and Data Science

Problem statements in 5 categories, research methodology and research labs to follow.

Dr. Sunil Kumar Vuppala

Towards Data Science

Even though big data has been in the mainstream of operations as of 2020, there are still potential issues and challenges researchers can address, some of which overlap with the data science field. In this article, the top 20 interesting latest research problems at the intersection of big data and data science are covered, based on my personal experience (with due respect to the intellectual property of my organizations) and the latest trends in these domains [1,2]. These problems fall under 5 different categories, namely:

  • Core big data area to handle the scale
  • Handling noise and uncertainty in the data
  • Security and privacy aspects
  • Data engineering
  • Intersection of big data and data science

The article also covers a research methodology to solve the specified problems, and top research labs working in these areas to follow.

I encourage researchers to solve applied research problems that will have more impact on society at large. The reason to stress this point is that we hardly analyze 1% of the available data, while generating terabytes of data every day. These problems are not specific to one domain and can be applied across domains.

Let me first introduce the 8 V’s of big data (based on an interesting article from Elena), namely Volume, Value, Veracity, Visualization, Variety, Velocity, Viscosity, and Virality. If we look closely at the questions on the individual V’s in Fig 1, they trigger interesting points for researchers. Even though they are business questions, there are underlying research problems. For instance, 02-Value: “Can you find it when you most need it?” qualifies as analyzing the available data and giving context-sensitive answers when needed.

Having understood the 8 V’s of big data, let us look into the details of the research problems to be addressed. General big data research topics [3] are along the lines of:

  • Scalability — Scalable Architectures for parallel data processing
  • Real-time big data analytics — Stream data processing of text, image, and video
  • Cloud Computing Platforms for Big Data Adoption and Analytics — Reducing the cost of complex analytics in the cloud
  • Security and Privacy issues
  • Efficient storage and transfer
  • How to efficiently model uncertainty
  • Graph databases
  • Quantum computing for Big Data Analytics
Next, let me cover some of the specific research problems across the five categories listed above.

The problems related to the core big data area of handling the scale:

1. Scalable architectures for parallel data processing:

Hadoop- or Spark-style environments are used for offline or online processing of data. The industry is looking for scalable architectures to carry out parallel processing of big data. There has been a lot of progress in recent years; however, there is still huge potential to improve performance.
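To make the idea concrete, here is a minimal PySpark sketch of a distributed, parallel aggregation, assuming a local Spark installation; the file name and column names (events.csv, user_id, latency_ms) are hypothetical stand-ins, not from the article.

```python
# Minimal PySpark sketch: Spark partitions the input and runs the
# aggregation in parallel across executor cores/nodes.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("parallel-aggregation").getOrCreate()

# Hypothetical input file and schema (events.csv, user_id, latency_ms).
df = spark.read.csv("events.csv", header=True, inferSchema=True)

# A wide aggregation that Spark distributes across partitions.
summary = df.groupBy("user_id").agg(
    F.count("*").alias("events"),
    F.avg("latency_ms").alias("avg_latency_ms"),
)
summary.show(10)
spark.stop()
```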

2. Handling real-time video analytics in a distributed cloud:

With increased accessibility to the internet, even in developing countries, video has become a common medium of data exchange. Telecom infrastructure, operators, deployments of the Internet of Things (IoT), and CCTVs all play a role in this regard. Can the existing systems be enhanced with lower latency and more accuracy? Once real-time video data is available, the questions are how the data can be transferred to the cloud and how it can be processed efficiently, both at the edge and in a distributed cloud.

3. Efficient graph processing at scale:

Social media analytics is one area that demands efficient graph processing. The role of graph databases in big data analytics is covered extensively in the reference article [4]. Handling efficient graph processing at a large scale is still a fascinating problem to work on.
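For intuition about what graph processing involves, here is a toy NetworkX sketch computing PageRank on a made-up follower graph; at the scale discussed here one would instead use a distributed engine (e.g., GraphX or a Pregel-style system), so treat this purely as an illustration.

```python
# Toy graph-analytics sketch: PageRank as a proxy for influence
# in a small, hypothetical follower graph (edges: follower -> followee).
import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")])

scores = nx.pagerank(G, alpha=0.85)  # damping factor 0.85 is conventional
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(node, round(score, 3))
```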

The research problems to handle noise and uncertainty in the data:

4. Identify fake news in near real-time:

This is a very pressing issue: handling fake news in real time and at scale, as fake news spreads like a virus in a bursty way. The data may come from Twitter, fake URLs, or WhatsApp. Sometimes a source may look authentic but still be fake, which makes the problem more interesting to solve.

5. Dimensionality reduction approaches for large-scale data:

One can extend existing dimensionality reduction approaches to handle large-scale data, or propose new approaches. This also includes visualization aspects. One can use existing open-source contributions to start with and contribute back to the open source.

6. Training / inference in noisy environments and with incomplete data:

Sometimes one may not get the complete distribution of the input data, or data may be lost due to a noisy environment. Can the data be augmented in a meaningful way by oversampling, the Synthetic Minority Oversampling Technique (SMOTE), or Generative Adversarial Networks (GANs)? Can augmentation help improve performance? How one can train and infer under these conditions is the challenge to be addressed.
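As a minimal illustration of the oversampling option, here is a sketch using SMOTE from the imbalanced-learn library on synthetic data; whether such augmentation actually improves performance is exactly the research question posed above.

```python
# Sketch: rebalance a skewed dataset with SMOTE (imbalanced-learn).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic binary dataset with roughly a 9:1 class imbalance.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)
print("before:", Counter(y))

# SMOTE synthesizes minority samples by interpolating between neighbours.
X_res, y_res = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_res))
```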

7. Handling uncertainty in big data processing:

There are multiple ways to handle uncertainty in big data processing [4]. This includes sub-topics such as how to learn from low-veracity, incomplete, or imprecise training data, and how to handle uncertainty with unlabeled data when the volume is high. One can try active learning, distributed learning, deep learning, and fuzzy logic theory to solve these sets of problems.

The research problems in the security and privacy [5] area:

8. Anomaly Detection in Very Large Scale Systems:

Anomaly detection is a standard problem, but it is not trivial at a large scale in real time. The range of application domains includes the healthcare, telecom, and financial domains.
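For a baseline sense of the problem, here is a single-machine anomaly-detection sketch with scikit-learn’s IsolationForest on synthetic data; the research challenge is doing the equivalent at very large scale and in real time.

```python
# Sketch: flag outliers with an Isolation Forest on synthetic data.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(1000, 3))   # bulk of the observations
outliers = rng.normal(6, 1, size=(10, 3))   # a few anomalous points
X = np.vstack([normal, outliers])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)                   # -1 = anomaly, 1 = normal
print("flagged anomalies:", int((labels == -1).sum()))
```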

9. Effective anonymization of sensitive fields in large-scale systems:

Let me take an example from healthcare systems. A chest X-ray image may contain PHR (Personal Health Record) data. How can one anonymize the sensitive fields to preserve privacy in a large-scale system in near real time? This can be applied in other fields as well, primarily to preserve privacy.
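A toy sketch of the rule-based end of this problem follows; production systems combine such patterns with trained named-entity recognition models, and the patterns below (including the MRN record-number format) are purely illustrative assumptions.

```python
# Sketch: rule-based redaction of sensitive fields in free text.
import re

PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b"),
    "MRN":   re.compile(r"\bMRN[:\s]*\d+\b"),  # hypothetical record-number format
}

def anonymize(text: str) -> str:
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

print(anonymize("Patient MRN: 48213, contact jane.doe@example.com, 555-010-1234."))
```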

10. Secure federated learning with real-world applications:

Federated learning enables model training on decentralized data. It can be adopted where data cannot be shared due to regulatory or privacy issues, but models still need to be built locally and then shared across boundaries. Making federated learning work at scale, and securing it with standard software- and hardware-level security, is the next challenge to be addressed. Interested researchers can explore further information from RISELab at UC Berkeley in this regard.
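To show the core loop, here is a bare-bones federated averaging (FedAvg) sketch on synthetic linear-regression data; the secure-aggregation, differential-privacy, and communication-efficiency pieces that make the problem hard are deliberately omitted.

```python
# Sketch: FedAvg -- clients train locally; only weights are shared and averaged.
import numpy as np

def local_update(w, X, y, lr=0.1, epochs=5):
    """One client's local gradient-descent steps on its private data."""
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w = w - lr * grad
    return w

rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(3):                       # three clients, private datasets
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + rng.normal(0, 0.1, 50)))

w_global = np.zeros(2)
for _ in range(20):                      # rounds: broadcast, train, average
    updates = [local_update(w_global, X, y) for X, y in clients]
    w_global = np.mean(updates, axis=0)
print("recovered weights:", np.round(w_global, 2))
```

The raw data never leaves a client; only weight vectors cross the boundary, which is exactly the property that regulation-constrained settings need.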

11. Scalable privacy preservation on big data:

Privacy preservation for large-scale data is a challenging research problem, as the range of applications varies from text and images to video. Differences in country- and region-level privacy regulations make the problem even more challenging to handle.

The research problems related to data engineering aspects:

12. Lightweight Big Data analytics as a Service:

Offering everything as a service is a new trend in the industry, as in Software as a Service (SaaS). Can we work towards providing lightweight big data analytics as a service?

13. Auto conversion of algorithms to MapReduce problems:

MapReduce is a well-known programming model in big data. It is not just map and reduce functions; it also provides scalability and fault tolerance to applications. However, not many algorithms support MapReduce directly. Can we build a library to automatically convert standard algorithms to MapReduce form?
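For readers new to the model, here is word count expressed as explicit map, shuffle, and reduce phases in plain Python; an auto-conversion library would have to emit this shape from ordinary algorithm code, with the shuffle distributed across machines.

```python
# Sketch: the MapReduce word-count pattern, single-machine version.
from collections import defaultdict

docs = ["big data is big", "data science meets big data"]

# Map phase: emit (key, value) pairs.
mapped = [(word, 1) for doc in docs for word in doc.split()]

# Shuffle phase: group values by key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: fold each key's values into a single result.
counts = {key: sum(values) for key, values in groups.items()}
print(counts)  # {'big': 3, 'data': 3, 'is': 1, 'science': 1, 'meets': 1}
```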

14. Automated Deployment of Spark Clusters:

A lot of progress has been witnessed in the usage of Spark clusters in recent times, but they are not completely ready for automated deployment. This is yet another challenging problem to explore further.

The research problems at the intersection of big data and data science:

15. Approaches to make models learn from fewer data samples:

In the last 10 years, the complexity of deep learning models has increased with the availability of more data and compute power. Some researchers proudly claim that they solved a complex problem with hundreds of layers of deep learning; for instance, image segmentation may need a 100-layer network. However, the recent trend asks: can anyone solve the same problem with less data and lower complexity? The reason behind this thinking is to run the models on edge devices, not only in a cloud environment using GPUs/TPUs. For instance, deep learning models trained on big data might need deployment on CCTVs or drones for real-time usage. This is fundamentally changing the approach to solving complex problems. You may work on challenging problems in this sub-topic.

16. Neural Machine Translation to Local languages:

One can use Google Translate for neural machine translation (NMT) activities. However, there is a lot of research in local universities on neural machine translation into local languages, with support from governments. The latest advances in Bidirectional Encoder Representations from Transformers (BERT) are changing the way these problems are solved. One can collaborate with those efforts to solve real-world problems.

17. Handling Data and Model drift for real-world applications:

Do we need to run the model on inference data if one knows that the data pattern is changing and the performance of the model will drop? Can we identify the drift in the data distribution even before passing the data to the model? If one can identify the drift, why pass the data to the model for inference and waste compute power? This is a compelling research problem to solve at scale in the real world. Active learning and online learning are some of the approaches to solving the model drift problem.
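One simple, widely used building block here is a two-sample statistical test on a feature’s distribution; the sketch below uses SciPy’s Kolmogorov–Smirnov test on synthetic data to decide whether incoming data has drifted from what the model was trained on.

```python
# Sketch: detect feature drift with a two-sample KS test before inference.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
train_feature = rng.normal(0.0, 1.0, 5000)  # distribution seen at training
live_feature = rng.normal(0.4, 1.0, 1000)   # incoming data, mean has shifted

stat, p_value = ks_2samp(train_feature, live_feature)
if p_value < 0.01:
    print(f"drift detected (KS={stat:.3f}); consider retraining")
else:
    print("no significant drift; safe to run inference")
```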

18. Handling interpretability of deep learning models in real-time applications:

Explainable AI is a recent buzzword. Interpretability is a subset of explainability. Machine and deep learning models need no longer be black boxes. A few models, such as decision trees, are inherently interpretable. However, as complexity increases, the base model itself may not be useful for interpreting the results. We may need to depend on surrogate models such as Local Interpretable Model-agnostic Explanations (LIME) or SHapley Additive exPlanations (SHAP) to interpret. These can give decision-makers a justification for the results produced, for instance, the rejection of a loan application or the classification of a chest X-ray as COVID-19 positive. Can interpretable models handle large-scale real-time applications?
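LIME and SHAP are the surrogate approaches named above; as a lighter-weight illustration of the same model-agnostic idea, here is scikit-learn’s permutation importance, which scores each feature by how much shuffling it degrades a fitted model.

```python
# Sketch: model-agnostic feature attribution via permutation importance.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)

# Report the five features whose shuffling hurts accuracy the most.
top = result.importances_mean.argsort()[::-1][:5]
for i in top:
    print(f"{X.columns[i]}: {result.importances_mean[i]:.4f}")
```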

19. Building context-sensitive large scale systems:

Building a large-scale context-sensitive system is the latest trend. There are some open-source efforts to kick-start from. However, it requires a lot of effort to collect the right set of data and build context-sensitive systems that improve search capability. One can choose a research problem in this topic given a background in search, knowledge graphs, and Natural Language Processing (NLP). This is applicable across domains.

20. Building large scale generative based conversational systems (Chatbot frameworks):

One specific area gaining momentum is building conversational systems, such as Q&A and generative chatbot systems. A lot of chatbot frameworks are available; making them generative and producing summaries of real-time conversations are still challenging problems. The complexity of the problem increases with scale. A lot of research is going on in this area. It requires a good understanding of Natural Language Processing and of the latest advances, such as Bidirectional Encoder Representations from Transformers (BERT), to expand the scope of what conversational systems can solve at scale.

Research Methodology:

I hope you can frame specific problems using your domain and technical expertise from the topics highlighted above. Let me recommend a methodology for solving any of these problems. Some points may look obvious to researchers; however, let me cover them in the interest of a larger audience:

Identify your core strengths, whether in theory, implementation, tools, security, or a specific domain; other new skills you can acquire while doing the research. Identifying the right research problem with suitable data takes you roughly 50% of the way. The problem may overlap with other technology areas such as the Internet of Things (IoT), Artificial Intelligence (AI), and cloud. Your passion for research will determine how far you can go in solving the problem. The trend is towards interdisciplinary research problems across departments, so one may choose a specific domain in which to apply big data and data science skills.

Literature survey: I strongly recommend following only authenticated publications such as IEEE, ACM, Springer, Elsevier, ScienceDirect, etc. Do not fall into the trap of “International journal …” outlets that publish without peer review. At the same time, do not limit the literature survey to IEEE/ACM papers only; a lot of interesting papers are available on arxiv.org and Papers with Code. One also needs to check and follow the top research labs in industry and academia for the shortlisted topic. That gives the latest research updates and helps identify the gaps to fill.

Lab ecosystem: Create a good lab environment in which to carry out strong research. This can be your research lab with professors, post-docs, Ph.D. scholars, master’s, and bachelor’s students in an academic setup, or with senior and junior researchers in an industry setup. Having the right partnership is key to collaboration, and you may try virtual groups as well. A good ecosystem boosts results, as members can challenge each other’s approaches and improve them further.

Publish at the right avenues: As mentioned in the literature survey, publish research papers in the right forums, where you will receive peer reviews from experts around the world. There will be obstacles in this process in the form of rejections; however, as long as you receive constructive feedback, you should be thankful to the anonymous reviewers. You may see a potential opportunity to patent the ideas if the approach is novel, non-obvious, and inventive. The recent trend is to open-source the code while publishing the paper. If your institution permits it, you may do so by uploading the relevant code to GitHub with appropriate licensing terms and conditions.

Top Research labs to follow:

Some of these research areas are active in top research centers around the world. I encourage you to follow them and identify further gaps to continue the work. Here are some of the top research centers around the world to follow in the big data and data science area:

RISELab at the University of California, Berkeley, USA

Doctoral Research Centre in Data Science, The University of Edinburgh, United Kingdom

Data Science Institute, Columbia University, USA

The Institute for Data Intensive Engineering and Science, Johns Hopkins University, USA

Facebook Data Science research

Big Data Institute, University of Oxford, United Kingdom

Center for Big Data Analytics, The University of Texas at Austin, USA

Center for data science and big data analytics, Oakland University, USA

Institute for Machine Learning, ETH Zurich, Switzerland

The Alan Turing Institute, United Kingdom

Department of Computational and Data Sciences, Indian Institute of Science (IISc), India

Data Lab, Carnegie Mellon University, USA

If you wish to continue your learning in big data, here are my recommendations:

Coursera Big Data Specialization

Big data course from the University of California San Diego

Top 10 books matched to your needs can be picked from the summary article in Analytics India Magazine.

Data Challenges:

In the process of solving real-world problems, one may come across these data-related challenges:

  • What is the relevant data within the available data?
  • The lack of international standards for data privacy regulations.
  • General Data Protection Regulation (GDPR)-like rules differing across countries.
  • Federated learning concepts to adhere to the rules: one can build and share the model while the data still belongs to the country or organization.

Conclusion:

In this article, I briefly introduced big data research issues in general and listed the top 20 latest research problems in big data and data science in 2020. These problems are divided and presented in 5 categories so that researchers can pick a problem based on their interests and skill set. This list is by no means exhaustive, but I hope these inputs excite some of you to solve real problems in big data and data science. I covered these points, along with some background on big data, in a webinar for your reference [7]. You may also refer to my other article, which lists problems to solve with data science amid Covid-19 [8]. Let us come together to build a better world with technology.

References:

[1] https://www.gartner.com/en/newsroom/press-releases/2019-10-02-gartner-reveals-five-major-trends-shaping-the-evoluti

[2] https://www.forbes.com/sites/louiscolumbus/2019/09/25/whats-new-in-gartners-hype-cycle-for-ai-2019/#d3edc37547bb

[3] https://arxiv.org/ftp/arxiv/papers/1705/1705.04928.pdf

[4] https://www.xenonstack.com/insights/graph-databases-big-data/

[5] https://journalofbigdata.springeropen.com/articles/10.1186/s40537-019-0206-3

[6] https://www.rd-alliance.org/group/big-data-ig-data-security-and-trust-wg/wiki/big-data-security-issues-challenges-tech-concerns

[7] https://www.youtube.com/watch?v=maZonSZorGI

[8] https://medium.com/@sunil.vuppala/ds4covid-19-what-problems-to-solve-with-data-science-amid-covid-19-a997ebaadaa6

Choose the right research problem and apply your skills to solve it. All the very best. Please share your feedback in the comments section. Feel free to add if you come across further topics in this area.

Written by Dr. Sunil Kumar Vuppala

Dr. Sunil is a Director of Data Science at Ericsson, with 16+ years of experience in ML/DL, IoT, and analytics; an inventor, speaker, and thought leader, and among the top data scientists in India.


166 Latest Big Data Research Topics And Fascinating Ideas


Big data refers to a huge volume of data, whether organized or unorganized, whose analysis shapes technologies and methodologies. Big data is so massive and complicated that it cannot be handled using ordinary application software; frameworks such as Hadoop are built specifically to process large amounts of it. Big data has gained much attention, and it is a trendy topic for students and researchers writing theses, projects, and dissertations. There are many searchable and interesting topics to explore for undergraduate, master’s, and doctoral theses in big data. In this article, we have provided every topic you need on big data. Our topics stretch from big data analytics and big data research questions to IoT and database essays. If you’ve been looking for the latest big data research topics, your search stops here. Read on to see some of the most interesting topics for your thesis.

Interesting Big Data Analytics Research Topics

Data analytics is the lifeblood of the modern IT sector. Big data is one of the strategies and technologies for analyzing large amounts of data. Data analytics is being used by the industry to acquire knowledge of system performance and customer behavior. Here are some of the best big data analytics topics and ideas for academic papers.

  • The surge of Internet of Things (IoT)
  • Explain the significance of augmented reality.
  • What is the significance of artificial intelligence?
  • Describe the graph analytics procedure.
  • What is agile data science, and how does it differ from traditional data science?
  • What role does machine intelligence play in today’s businesses?
  • What is hyper-personalization, and how does it work?
  • Describe how behavioral analytics works.
  • What is the experience economy, and how does it work?
  • Talk about the science of travel.
  • Discuss the validation and extraction of knowledge.
  • What is semantic data management, and how does it work?
  • Describe the process of deep learning.
  • Describe software engineering in the context of big data science.
  • What is structured machine learning, and how does it work?
  • Describe how to answer a semantic question
  • What is distributed semantic analytics, and how does it work?
  • What role does domain knowledge play in data analysis?
  • Why is data exploration important in data analysis?
  • Who uses big data analytics?

So, it’s not an easy task to write a paper that earns a high grade. Sometimes a student needs professional help with research paper writing. Therefore, don’t be afraid to hire a writer to complete your assignment. Just contact us and get your paper done soon.

Trending Big Data Research Topics

Students and researchers who want to write about the latest big data research topics on emerging issues should pick current topics in data science. Below are some current big data analysis research topics and essays to look into when writing a research essay or paper.

  • Analyze the digital tools and programs for processing large data.
  • Discuss the effect of the sophistication of big data on human privacy.
  • Evaluate how scalable architectures can be used for processing parallel data.
  • List the different growth-oriented big data storage mechanisms.
  • Visualizing big data.
  • Business acumen in combination with big data analytics.
  • Map-reduce architecture.
  • Methods of machine learning in big data.
  • Big data analytics and impact on privacy preservation.
  • The processing of big data and impact on climate change.
  • Risks and uncertainties in big data management.
  • Detecting anomalies in large-scale data systems.
  • Analyze the big data for social networks.
  • Platforms for large scale data computing: big data analysis and acceptance.
  • Discuss the procedures of analyzing big data.
  • Discuss the many effective ways of managing big data.
  • Big data programming and process methods.
  • Big data semantics.
  • How big data influences biomedical information and strategies.
  • The significance of big data strategies on small and medium-sized businesses.

Most Debatable Big Data Research Topics and Essays

The rapid rise of big data in our current time is not without controversy. There is a myriad of ongoing debates in the discipline that have gone unresolved for quite some time. The list below contains the most common big data debate topics.

  • Big data and its major vulnerabilities.
  • What measures are in place to recognize a legit user of big data?
  • Explain the significance of user-access control.
  • Investigate the importance of centralized key management.
  • Identify ways to prevent illegal access of data.
  • Intrusion-detection system: Which is the best?
  • Does machine learning enhance data quality?
  • Which security technology has proven to be the best for big data protection?
  • What strategies should be used for data governance and who should implement data policies?
  • Should tech giants regularly update security measures and be transparent about them?
  • How has poor data security contributed to the loss of historical evidence?
  • What are the most important big data management tools and strategies?
  • What is data retention and explain its relevance?
  • Artificial intelligence will lead to the loss of employment and human interaction.
  • Enterprise analytics: How to manage platforms?
  • Can data management foster the promotion of peace and freedom?
  • Who should be in control of data security: Tech giants or the government?
  • What are the functions of the government in big data management and security?
  • Discuss how big data is leading to the end of morals and ethics.
  • How big data is contributing to global climate change, and why tech should pay carbon taxes.

Interesting Dissertation Topics on Big Data

Many research theses and big data topics can be found online for undergraduates, Masters, and Ph.D. students. The list below comprises some dissertation topics on big data.

  • Privacy and security issues in big data and how to curtail them.
  • Impacts of storage systems of scalable big data.
  • The significance of big data processing and data management to industrial development.
  • Techniques and data mining tools for big data.
  • The benefits of data analytics and cloud computing to the future of work.
  • Parallel data processing: effective data architecture and how to go about it.
  • Impacts of machine learning algorithms on the fashion industry.
  • Using bandwidth provision, how the world of streaming is changing.
  • What are the benefits and threats of dedicated networks to governance?
  • Cloud gaming and impacts on Millennials and Generation Z.
  • Ways to enhance and maximize spread efficiency using the flow authority model.
  • How divergent and convergent is the Internet of Things (IoT) on manufacturing?
  • Data mining and environmental impact: The way forward.
  • Geopolitics and the surge of demographic mapping in big data.
  • Impacts of travel patterns on big data analytics and data management.
  • The rise of deep learning in the automotive industry.
  • The sophistication of big data and its implications on cybersecurity.
  • Discuss how the big data manufacturing process indicates positive globalization.
  • Evaluate the future of data mining and the adaptation of humans to big data.
  • Human and material wastes in big data management.

Interesting Research Topics on A/B Testing in Big Data

A/B testing, also known as controlled experimentation, is used widely by companies and firms to make decisions about product launches. Tech companies use such tests to gauge the acceptability of a certain product among users. Below are some key research topics on A/B testing in big data; a minimal significance-test sketch follows the list.

  • Evaluate the common A/B pitfalls in the automotive industry.
  • Discuss the benefits of improving library user experience with A/B Testing.
  • How to design A/B tests in a collaboration network.
  • Analyze how the future of social network advertising can be improved by A/B testing.
  • Effectiveness of A/B experiments in MOOCs for better instructional methods.
  • Strategies of Bayesian A/B testing for business decisions.
  • A/B testing challenges in large-scale social networks and online controlled experiments.
  • Illustrate how consumer behaviors and trends are shaped by A/B testing.
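As promised above, here is a minimal sketch of judging an A/B test’s outcome with a two-proportion z-test; the visitor and conversion counts are made up, and real experiments also need pre-registered sample sizes and guards against peeking.

```python
# Sketch: two-proportion z-test for an A/B conversion experiment.
from math import sqrt
from scipy.stats import norm

conv_a, n_a = 480, 10_000   # control: conversions / visitors (made up)
conv_b, n_b = 560, 10_000   # variant

p_a, p_b = conv_a / n_a, conv_b / n_b
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))

z = (p_b - p_a) / se
p_value = 2 * (1 - norm.cdf(abs(z)))     # two-sided test
print(f"lift={p_b - p_a:.2%}, z={z:.2f}, p={p_value:.4f}")
```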

List of Research Topics on Big Data and Local Governments

Big data offers tremendous value to grassroots governments, with the ability to optimize costs through data-driven decisions that reduce crime rates and traffic congestion and improve the environment. Below are interesting topics on big data and local governments.

  • How local governments can measure crime using big data testing.
  • Big data and algorithmic policy in local government policies.
  • Application of data science technologies to civil service in the local government.
  • Combating grassroots crime and corruption through algorithmic government.
  • Big data in the public sector: how local governments can benefit from the algorithmic policy.

Innovative Research Topics on Big Data and IoT

Big data has a lot in common with the Internet of Things (IoT); indeed, IoT is an integral part of big data. Below are researchable IoT and big data research topics.

  • The impacts of big data and the Internet of Things (IoT) on the fourth industrial revolution.
  • The importance of big data and the Internet of Things (IoT) on public health systems.
  • Explain how big data and the Internet of Things (IoT) dictate the flow of information in the media sector.
  • Challenges of big data and the Internet of Things (IoT) on governance and sustainability.
  • The disruption of big data and its attendant effects on the Internet of Things (IoT).
  • Illustrate the surge in household smart devices and the role of big data analytics.
  • An analysis of the disruption of the supply chain of traditional goods through the Internet of Things (IoT).
  • A comprehensive evaluation of machine and deep learning for IoT-enabled healthcare systems.
  • The future evaluation of the internet of things and big data analytics in the public infrastructure systems.
  • Discuss how AI-induced security can guarantee effective data protection.
  • IoT privacy: what data protection means to households and the impacts of security infringement.
  • Discuss the role of big data and the integrity of the Internet of Things (IoT).
  • How do dedicated networks work through the Internet of Things (IoT)?
  • The threats and benefits of the Internet of Things (IoT) forensic science.
  • Big data distributed storage and impacts on IoT-enabled industries.

Most Engaging Database Big Data Research Topics

The database category of big data has some interesting data science research topics. Because modern companies must analyze large, hard-to-handle volumes of data every day, strict management is essential to ensure that data is used effectively. Check out some intriguing big data database research topics that students and researchers can write about.

  • Explain the most innovative big data management concepts and strategies.
  • Describe the most suitable data management strategies and techniques for present-day businesses.
  • New advancements and AI in data management.
  • What is data retention and why is it significant?
  • Describe the essentials of data management.
  • Explain the use of data management in e-learning.
  • Data distribution and access in present-day organizations.
  • Explain the process of analyzing and managing data for biomedical research.
  • Explain how to work with 3D images during research.
  • How can an organization guarantee secure and confidential data management?
  • Data catalogs: describe approaches, their implementation, and their adoption.
  • Discuss the effect of data quality on a business.
  • How to advance medical research and scientific outreach through data management.
  • The most effective methods to source and manage external data.
  • Evaluate the procedures available to organizations for ensuring data security through appropriate governance.
  • The data catalog reference model and a global market study.
  • What is data valuation and why does it matter in data management?
  • How can AI improve database security?
  • How can an organization implement effective data governance?
  • Database management and the cost of disruptive cybersecurity.

If you are ready to boost your grades with a little effort, just write a message “Please, write a custom research paper for me” to hire one of our professional writers. Contact us today to get a 100% original paper.

Compelling Big Data Scala Research Topics

Scala is a programming language widely used to build big data, deep learning, and machine learning frameworks. Below are topics on big data and Scala for students and young researchers.

  • Large data scalability based on Scala and Spark machine learning libraries.
  • Analyze scalable big data storage frameworks in deep learning.
  • Dealing with data and model drift in practical applications.
  • Building generative conversational systems (chatbot frameworks).
  • Scalable architectures for parallel data processing.
  • Handling real-time video analytics in cloud computing.
  • Efficient graph processing at machine learning scale.
  • Dimensionality reduction approaches for data management.
  • Effective anonymization of sensitive fields in computer vision.
  • Scalable privacy preservation on big data.

List of Independent Research Topics for Big Data

Independent research projects are pieces of research that may be considered unorthodox in big data testing and management; they are generated by individual researchers. Here is a list of the most fascinating independent research topics on big data.

  • Significance of effective data mining tools and procedures.
  • What is data-driven clustering in deep and machine learning?
  • How impactful is the graph analytics process to the Internet of Things?
  • Explain the significance of AI for present-day businesses.
  • The significance of data exploration in data analysis for deep learning.
  • Evaluate the usefulness of coding in Artificial Intelligence.
  • Explain the AI strategies used in big data management.
  • Data security: what it means to computer vision.
  • Impact of open-source deep learning libraries on developers.
  • The significance of token-based authentication to data security.
  • Using big data to identify disinformation and misinformation.
  • Data management and the fundamental principles of Artificial Intelligence.
  • Big data analytics and why it should be more user-friendly.
  • Why business intelligence should focus more on privacy preservation.
  • Social networks and impact on privacy infringement.

Is Your Big Data Paper Not Coming Along?

Although we have provided you with a list of big data essays to choose from, we dare say university research topics go beyond mere writing tips. As a student, you may need quality college paper writing services and professional assistance to write an A-grade, top-notch thesis or dissertation. This is where we come in. You can consult our reliable and professional writing experts to ease your degree courses at a pocket-friendly price. Besides, you can also refer your colleagues to enjoy our discounted services, which will make your research experience less taxing and frustrating.




Research Topics & Ideas: Data Science

50 Topic Ideas To Kickstart Your Research Project


If you’re just starting out exploring data science-related topics for your dissertation, thesis or research project, you’ve come to the right place. In this post, we’ll help kickstart your research by providing a hearty list of data science and analytics-related research ideas , including examples from recent studies.

PS – This is just the start…

We know it’s exciting to run through a list of research topics, but please keep in mind that this list is just a starting point. The topic ideas provided here are intentionally broad and generic, so you will need to develop them further. Nevertheless, they should inspire some ideas for your project.

To develop a suitable research topic, you’ll need to identify a clear and convincing research gap, and a viable plan to fill that gap. If this sounds foreign to you, check out our free research topic webinar that explores how to find and refine a high-quality research topic, from scratch. Alternatively, consider our 1-on-1 coaching service.


Data Science-Related Research Topics

  • Developing machine learning models for real-time fraud detection in online transactions.
  • The use of big data analytics in predicting and managing urban traffic flow.
  • Investigating the effectiveness of data mining techniques in identifying early signs of mental health issues from social media usage.
  • The application of predictive analytics in personalizing cancer treatment plans.
  • Analyzing consumer behavior through big data to enhance retail marketing strategies.
  • The role of data science in optimizing renewable energy generation from wind farms.
  • Developing natural language processing algorithms for real-time news aggregation and summarization.
  • The application of big data in monitoring and predicting epidemic outbreaks.
  • Investigating the use of machine learning in automating credit scoring for microfinance.
  • The role of data analytics in improving patient care in telemedicine.
  • Developing AI-driven models for predictive maintenance in the manufacturing industry.
  • The use of big data analytics in enhancing cybersecurity threat intelligence.
  • Investigating the impact of sentiment analysis on brand reputation management.
  • The application of data science in optimizing logistics and supply chain operations.
  • Developing deep learning techniques for image recognition in medical diagnostics.
  • The role of big data in analyzing climate change impacts on agricultural productivity.
  • Investigating the use of data analytics in optimizing energy consumption in smart buildings.
  • The application of machine learning in detecting plagiarism in academic works.
  • Analyzing social media data for trends in political opinion and electoral predictions.
  • The role of big data in enhancing sports performance analytics.
  • Developing data-driven strategies for effective water resource management.
  • The use of big data in improving customer experience in the banking sector.
  • Investigating the application of data science in fraud detection in insurance claims.
  • The role of predictive analytics in financial market risk assessment.
  • Developing AI models for early detection of network vulnerabilities.


Data Science Research Ideas (Continued)

  • The application of big data in public transportation systems for route optimization.
  • Investigating the impact of big data analytics on e-commerce recommendation systems.
  • The use of data mining techniques in understanding consumer preferences in the entertainment industry.
  • Developing predictive models for real estate pricing and market trends.
  • The role of big data in tracking and managing environmental pollution.
  • Investigating the use of data analytics in improving airline operational efficiency.
  • The application of machine learning in optimizing pharmaceutical drug discovery.
  • Analyzing online customer reviews to inform product development in the tech industry.
  • The role of data science in crime prediction and prevention strategies.
  • Developing models for analyzing financial time series data for investment strategies.
  • The use of big data in assessing the impact of educational policies on student performance.
  • Investigating the effectiveness of data visualization techniques in business reporting.
  • The application of data analytics in human resource management and talent acquisition.
  • Developing algorithms for anomaly detection in network traffic data.
  • The role of machine learning in enhancing personalized online learning experiences.
  • Investigating the use of big data in urban planning and smart city development.
  • The application of predictive analytics in weather forecasting and disaster management.
  • Analyzing consumer data to drive innovations in the automotive industry.
  • The role of data science in optimizing content delivery networks for streaming services.
  • Developing machine learning models for automated text classification in legal documents.
  • The use of big data in tracking global supply chain disruptions.
  • Investigating the application of data analytics in personalized nutrition and fitness.
  • The role of big data in enhancing the accuracy of geological surveying for natural resource exploration.
  • Developing predictive models for customer churn in the telecommunications industry.
  • The application of data science in optimizing advertisement placement and reach.

Recent Data Science-Related Studies

While the ideas we’ve presented above are a decent starting point for finding a research topic, they are fairly generic and non-specific. So, it helps to look at actual studies in the data science and analytics space to see how this all comes together in practice.

Below, we’ve included a selection of recent studies to help refine your thinking. These are actual studies, so they can provide some useful insight as to what a research topic looks like in practice.

  • Data Science in Healthcare: COVID-19 and Beyond (Hulsen, 2022)
  • Auto-ML Web-application for Automated Machine Learning Algorithm Training and evaluation (Mukherjee & Rao, 2022)
  • Survey on Statistics and ML in Data Science and Effect in Businesses (Reddy et al., 2022)
  • Visualization in Data Science VDS @ KDD 2022 (Plant et al., 2022)
  • An Essay on How Data Science Can Strengthen Business (Santos, 2023)
  • A Deep study of Data science related problems, application and machine learning algorithms utilized in Data science (Ranjani et al., 2022)
  • You Teach WHAT in Your Data Science Course?!? (Posner & Kerby-Helm, 2022)
  • Statistical Analysis for the Traffic Police Activity: Nashville, Tennessee, USA (Tufail & Gul, 2022)
  • Data Management and Visual Information Processing in Financial Organization using Machine Learning (Balamurugan et al., 2022)
  • A Proposal of an Interactive Web Application Tool QuickViz: To Automate Exploratory Data Analysis (Pitroda, 2022)
  • Applications of Data Science in Respective Engineering Domains (Rasool & Chaudhary, 2022)
  • Jupyter Notebooks for Introducing Data Science to Novice Users (Fruchart et al., 2022)
  • Towards a Systematic Review of Data Science Programs: Themes, Courses, and Ethics (Nellore & Zimmer, 2022)
  • Application of data science and bioinformatics in healthcare technologies (Veeranki & Varshney, 2022)
  • TAPS Responsibility Matrix: A tool for responsible data science by design (Urovi et al., 2023)
  • Data Detectives: A Data Science Program for Middle Grade Learners (Thompson & Irgens, 2022)
  • Machine Learning for Non-majors: A White Box Approach (Mike & Hazzan, 2022)
  • Components of Data Science and Its Applications (Paul et al., 2022)
  • Analysis on the Application of Data Science in Business Analytics (Wang, 2022)

As you can see, these research topics are a lot more focused than the generic topic ideas we presented earlier. So, to develop a high-quality research topic, you’ll need to narrow in on a specific context with clearly defined variables of interest.

Get 1-On-1 Help

If you’re still unsure about how to find a quality research topic, check out our Private Coaching service, the perfect starting point for developing a unique, well-justified research topic.




Ten simple rules for responsible big data research

Matthew Zook, Solon Barocas, danah boyd, Kate Crawford, Emily Keller, Seeta Peña Gangadharan, Alyssa Goodman, Rachelle Hollander, Barbara A. Koenig, et al.

Affiliations: Department of Geography, University of Kentucky; Microsoft Research; Data & Society; Information Law Institute, New York University; Department of Media and Communications, London School of Economics; Harvard-Smithsonian Center for Astrophysics, Harvard University; Center for Engineering Ethics and Society, National Academy of Engineering; Institute for Health Aging, University of California-San Francisco; Ethical Resolve; Department of Computer Science, Princeton University; Department of Sociology, Columbia University; Carey School of Law, University of Maryland.

PLOS

Published: March 30, 2017

  • https://doi.org/10.1371/journal.pcbi.1005399

Citation: Zook M, Barocas S, boyd d, Crawford K, Keller E, Gangadharan SP, et al. (2017) Ten simple rules for responsible big data research. PLoS Comput Biol 13(3): e1005399. https://doi.org/10.1371/journal.pcbi.1005399

Editor: Fran Lewitter, Whitehead Institute, United States

This is an open access article, free of all copyright, and may be freely reproduced, distributed, transmitted, modified, built upon, or otherwise used by anyone for any lawful purpose. The work is made available under the Creative Commons CC0 public domain dedication.

Funding: The work for this article was supported by the National Science Foundation grant # IIS-1413864. The funders had no role in the study design, data collection and analysis, decision to publish, or preparation of the manuscript.

Competing interests: The authors have declared that no competing interests exist.

Introduction

The use of big data research methods has grown tremendously over the past five years in both academia and industry. As the size and complexity of available datasets has grown, so too have the ethical questions raised by big data research. These questions become increasingly urgent as data and research agendas move well beyond those typical of the computational and natural sciences, to more directly address sensitive aspects of human behavior, interaction, and health. The tools of big data research are increasingly woven into our daily lives, including mining digital medical records for scientific and economic insights, mapping relationships via social media, capturing individuals’ speech and action via sensors, tracking movement across space, shaping police and security policy via “predictive policing,” and much more.

The beneficial possibilities for big data in science and industry are tempered by new challenges facing researchers that often lie outside their training and comfort zone. Social scientists now grapple with data structures and cloud computing, while computer scientists must contend with human subject protocols and institutional review boards (IRBs). While the connection between an individual datum and actual human beings can appear quite abstract, the scope, scale, and complexity of many forms of big data create a rich ecosystem in which human participants and their communities are deeply embedded and susceptible to harm. This complexity challenges any normative set of rules and makes devising universal guidelines difficult.

Nevertheless, the need for direction in responsible big data research is evident, and this article provides a set of “ten simple rules” for addressing the complex ethical issues that will inevitably arise. Modeled on PLOS Computational Biology’s ongoing collection of rules, the recommendations we outline involve more nuance than the words “simple” and “rules” suggest. This nuance is inevitably tied to our paper’s starting premise: all big data research on social, medical, psychological, and economic phenomena engages with human subjects, and researchers have the ethical responsibility to minimize potential harm.

The variety in data sources, research topics, and methodological approaches in big data belies a one-size-fits-all checklist; as a result, these rules are less specific than some might hope. Rather, we exhort researchers to recognize the human participants and complex systems contained within their data and make grappling with ethical questions part of their standard workflow. Towards this end, we structure the first five rules around how to reduce the chance of harm resulting from big data research practices; the second five rules focus on ways researchers can contribute to building best practices that fit their disciplinary and methodological approaches. At the core of these rules, we challenge big data researchers who consider their data disentangled from the ability to harm to reexamine their assumptions. The examples in this paper show how often even seemingly innocuous and anonymized data have produced unanticipated ethical questions and detrimental impacts.

This paper is a result of a two-year National Science Foundation (NSF)-funded project that established the Council for Big Data, Ethics, and Society, a group of 20 scholars from a wide range of social, natural, and computational sciences ( http://bdes.datasociety.net/ ). The Council was charged with providing guidance to the NSF on how to best encourage ethical practices in scientific and engineering research, utilizing big data research methods and infrastructures [ 1 ].

1. Acknowledge that data are people and can do harm

One of the most fundamental rules of responsible big data research is the steadfast recognition that most data represent or impact people. Simply starting with the assumption that all data are people until proven otherwise places the difficulty of disassociating data from specific individuals front and center. This logic is readily evident for “risky” datasets, e.g., social media with inflammatory language, but even seemingly benign data can contain sensitive and private information, e.g., it is possible to extract data on the exact heart rates of people from YouTube videos [ 2 ]. Even data that seemingly have nothing to do with people might impact individuals’ lives in unexpected ways, e.g., oceanographic data that change the risk profiles of communities’ and properties’ values or Exchangeable Image Format (EXIF) records from photos that contain location coordinates and reveal the photographer’s movement or even home location.

Harm can also result when seemingly innocuous datasets about population-wide effects are used to shape the lives of individuals or stigmatize groups, often without procedural recourse [ 3 , 4 ]. For example, social network maps for services such as Twitter can determine credit-worthiness [ 5 ], opaque recidivism scores can shape criminal justice decisions in a racially disparate manner [ 6 ], and categorization based on zip codes resulted in less access to Amazon Prime same-day delivery service for African-Americans in United States cities [ 7 ]. These high-profile cases show that apparently neutral data can yield discriminatory outcomes, thereby compounding social inequities.

Other cases show that “public” datasets are easily adapted for highly invasive research by incorporating other data, such as Hague et al.’s [ 8 ] use of property records and geographic profiling techniques to allegedly identify the pseudonymous artist Banksy [ 9 ]. In particular, data ungoverned by substantive consent practices, whether social media or the residual DNA we continually leave behind us, may seem public but can cause unintentional breaches of privacy and other harms [ 9 , 10 ].

Start with the assumption that data are people (until proven otherwise), and use it to guide your analysis. No one gets an automatic pass on ethics.

2. Recognize that privacy is more than a binary value

Breaches of privacy are key means by which big data research can do harm, and it is important to recognize that privacy is contextual [ 11 ] and situational [ 12 ], not reducible to a simple public/private binary. Just because something has been shared publicly does not mean any subsequent use would be unproblematic. Looking at a single Instagram photo by an individual has different ethical implications than looking at someone’s full history of all social media posts. Privacy depends on the nature of the data, the context in which they were created and obtained, and the expectations and norms of those who are affected. Understand that your attitude towards acceptable use and privacy may not correspond with those whose data you are using, as privacy preferences differ across and within societies.

For example, Tene and Polonetsky [ 13 ] explore how pushing past social norms, particularly in novel situations created by new technologies, is perceived by individuals as “creepy” even when they do not violate data protection regulations or privacy laws. Social media apps that utilize users’ locations to push information, corporate tracking of individuals’ social media and private communications to gain customer intelligence, and marketing based on search patterns have been perceived by some to be “creepy” or even outright breaches of privacy. Likewise, distributing health records is a necessary part of receiving health care, but this same sharing brings new ethical concerns when it goes beyond providers to marketers.

Privacy also goes beyond single individuals and extends to groups [ 10 ]. This is particularly resonant for communities who have been on the receiving end of discriminatory data-driven policies historically, such as the practice of redlining [ 14 , 15 ]. Other examples include community maps—made to identify problematic properties or an assertion of land rights—being reused by others to identify opportunities for redevelopment or exploitation [ 16 ]. Thus, reusing a seemingly public dataset could run counter to the original privacy intents of those who created it and raise questions about whether it represents responsible big data research.

Situate and contextualize your data to anticipate privacy breaches and minimize harm. The availability or perceived publicness of data does not guarantee lack of harm, nor does it mean that data creators consent to researchers using their data.

3. Guard against the reidentification of your data

It is problematic to assume that data cannot be reidentified. There are numerous examples of researchers with good intentions and seemingly good methods failing to anonymize data sufficiently to prevent the later identification of specific individuals [ 17 ]; in other cases, these efforts were extremely superficial [ 18 , 19 ]. When datasets thought to be anonymized are combined with other variables, it may result in unexpected reidentification, much like a chemical reaction resulting from the addition of a final ingredient.

While the identificatory power of birthdate, gender, and zip code is well known [ 20 ], there are a number of other parameters—particularly the metadata associated with digital activity—that may be as or even more useful for identifying individuals [ 21 ]. Surprising to many, unlabeled network graphs—such as location and movement, DNA profiles, call records from mobile phone data, and even high-resolution satellite images of the earth—can be used to reidentify people [ 22 ]. More important than specifying the variables that allow for reidentification, however, is the realization that it is difficult to recognize these vulnerable points a priori [ 23 ]. Factors discounted today as irrelevant or inherently harmless—such as battery usage—may very well prove to be a significant vector of personal identification tomorrow [ 24 ]. For example, the addition of spatial location can turn social media posts into a means of identifying home location [ 25 ], and Google’s reverse image search can connect previously separate personal activities—such as dating and professional profiles—in unanticipated ways [ 26 ]. Even data about groups—“aggregate statistics”—can have serious implications if they reveal that certain communities, for example, suffer from stigmatized diseases or social behavior much more than others [ 27 ].
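To make this concrete, here is a minimal sketch of one defensive habit this rule suggests: checking how unique a combination of quasi-identifiers is before releasing a dataset. The pandas-based code and column names are illustrative assumptions, not part of the original article:

```python
import pandas as pd

# Hypothetical release candidate; the column names are illustrative.
df = pd.DataFrame({
    "birth_year": [1985, 1985, 1990, 1990, 1972],
    "gender":     ["F", "F", "M", "M", "F"],
    "zip3":       ["402", "402", "100", "100", "941"],
})

# Group by the quasi-identifiers and inspect the smallest group size:
# if any group contains a single record, that person is directly
# reidentifiable from these columns alone.
group_sizes = df.groupby(["birth_year", "gender", "zip3"]).size()
print("smallest group (k):", group_sizes.min())
print("singleton records:", int((group_sizes == 1).sum()))
```

A small minimum group size signals that the chosen columns, in combination, act as a fingerprint even when no single column looks identifying on its own.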

Identify possible vectors of reidentification in your data. Work to minimize them in your published results to the greatest extent possible.

4. Practice ethical data sharing

For some projects, sharing data is an expectation of the human participants involved and thus a key part of ethical research. For example, in rare genetic disease research, biological samples are shared in the hope of finding cures, making dissemination a condition of participation. In other projects, questions of the larger public good—an admittedly difficult to define category—provide compelling arguments for sharing data, e.g., the NIH-sponsored database of Genotypes and Phenotypes (dbGaP), which makes deidentified genomic data widely available to researchers, democratizing access, or the justice claim made by the Institute of Medicine about the value of mandating that individual-level data from clinical trials be shared among researchers [ 28 ]. Asking participants for broad, as opposed to narrowly structured consent for downstream data management makes it easier to share data. Careful research design and guidance from IRBs can help clarify consent processes. However, we caution that even when broad consent was obtained upfront, researchers should consider the best interests of the human participant, proactively considering the likelihood of privacy breaches and reidentification issues. This is of particular concern for human DNA data, which is uniquely identifiable.

These types of projects, however—in which rules of use and sharing are well governed by informed consent and right of withdrawal—are increasingly the exception rather than the rule for big data. In our digital society, we are followed by data clouds composed of the trace elements of daily life—credit card transactions, medical test results, closed-circuit television (CCTV) images and video, smart phone apps, etc.—collected under mandatory terms of service rather than responsible research design overseen by university compliance officers. While we might wish to have the standards of informed consent and right of withdrawal, these informal big data sources are gathered by agents other than the researcher—private software companies, state agencies, and telecommunications firms. These data are only accessible to researchers after their creation, making it impossible to gain informed consent a priori, and contacting the human participants retroactively for permission is often forbidden by the owner of the data or is impossible to do at scale.

Of course, researchers within software companies and state institutions collecting these data have a special responsibility to address the terms under which data are collected; but that does not exempt the end-user of shared data. In short, the burden of ethical use (see Rules 1 to 3) and sharing is placed on the researcher, since the terms of service under which the human subjects’ data were produced can often be extremely broad with little protection for breaches of privacy. In these circumstances, researchers must balance the requirements from funding agencies to share data [ 29 ] with their responsibilities to the human beings behind the data they acquired. A researcher needs to inform funding agencies about possible ethical concerns before the research begins and guard against reidentification before sharing.

Share data as specified in research protocols, but proactively address concerns of potential harm from informally collected big data.

5. Consider the strengths and limitations of your data; big does not automatically mean better

In order to do both accurate and responsible big data research, it is important to ground datasets in their proper context, including conflicts of interest. Context also affects every stage of research: from data acquisition, to cleaning, to interpretation of findings and dissemination of results. During data acquisition, it is crucial to understand both the source of the data and the rules and regulations under which they were gathered. This is especially important for research conducted in relatively loose regulatory environments, in which the use of answers to research questions may conflict with the expectations of those who provided the data. One possible model is the set of ethical norms used to track the provenance of artifacts, often in cooperation and collaboration with the communities from which they come (e.g., archaeologists working with indigenous communities to determine the disposition of material culture). In a similar manner, computer scientists use data lineage techniques to track the evolution of a dataset and often to trace bugs in the data.

Being mindful of the data’s context provides the foundation for clarifying when your data and analysis are working and when they are not. While it is tempting to interpret findings based on big data as a clear outcome, a key step within scientific research is clearly articulating what data or an indicator represent and what they do not. Are your findings as clear-cut if your interpretation of a social media posting switches from a recording of fact to the performance of a social identity? Given the messy, almost organic nature of many datasets derived from social actions, it is fundamental that researchers be sensitive to the potential multiple meanings of data.

For example, is a Facebook post or an Instagram photo best interpreted as an approval/disapproval of a phenomenon, a simple observation, or an effort to improve status within a friend network? While any of these interpretations are potentially valid, the lack of context makes it even more difficult to justify the choice of one understanding over another. Reflecting on the potential multiple meanings of data fosters greater clarity in research hypotheses and also makes researchers aware of the other potential uses of their data. Again, the act of interpretation is a human process, and because the judgments of those (re)using your data may differ from your own, it is essential to clarify both the strengths and shortcomings of the data.

Document the provenance and evolution of your data. Do not overstate clarity; acknowledge messiness and multiple meanings.

6. Debate the tough, ethical choices

Research involving human participants at federally funded institutions is governed by IRBs, which are charged with preventing harm through well-established procedures that are familiar to many researchers. IRBs, however, are not the sole arbiters of ethics; many ethical issues involving big data fall outside their governance mandate. Precisely because big data researchers often encounter situations that are foreign to or outside the mandate of IRBs, we emphasize the importance of debating the issues within groups of peers.

Rather than a bug, the lack of clear-cut solutions and governance protocols should be more appropriately understood as a feature that researchers should embrace within their own work. Discussion and debate of ethical issues is an essential part of professional development—both within and between disciplines—as it can establish a mature community of responsible practitioners. Bringing these debates into coursework and training can produce peer reviewers who are particularly well placed to raise these ethical questions and spur recognition of the need for these conversations.

A precondition of any formal ethics rules or regulations is the capacity to have such open-ended debates. As digital social scientist and ethicist Annette Markham [ 30 ] writes, “we can make [data ethics] an easier topic to broach by addressing ethics as being about choices we make at critical junctures; choices that will invariably have impact.” Given the nature of big data, bringing technical, scientific, social, and humanistic researchers together on projects enables this debate to emerge as a strength because, if done well, it provides the means to understand the ethical issues from a range of perspectives and disrupt the silos of disciplines [ 31 ]. There are a number of good models for interdisciplinary ethics research, such as the trainings offered by the Science and Justice research center at the University of California, Santa Cruz [ 32 ] and Values in Design curricula [ 33 ]. Research ethics consultation services, available at some universities as a result of the Clinical and Translational Science Award (CTSA) program of the National Institutes of Health (NIH), can also be resources for researchers [ 34 ].

Some of the better-known “big data” ethical cases—i.e., the Facebook emotional contagion study [ 35 ]—provide extremely productive venues for cross-disciplinary discussions. Why might one set of scholars see this as a relatively benign approach while other groups see significant ethical shortcomings? Where do researchers differ in drawing the line between responsible and irresponsible research and why? Understanding the different ways people discuss these challenges and processes provides an important check for researchers, especially if they come from disciplines not focused on human subject concerns.

Moreover, the high visibility surrounding these events means that (for better or worse) they represent the “public” view of big data research, and becoming an active member of this conversation ensures that researchers can give voice to their insights rather than simply being at the receiving end of policy decisions. In an effort to help these debates along, the Council for Big Data, Ethics, and Society has produced a number of case studies focused specifically on big data research and a white paper with recommendations to start these important conversations ( http://bdes.datasociety.net/output/ ).

Engage your colleagues and students about ethical practice for big data research.

7. Develop a code of conduct for your organization, research community, or industry

The process of debating tough choices inserts ethics directly into the workflow of research, making “faking ethics” as unacceptable as faking data or results. Internalizing these debates, rather than treating them as an afterthought or a problem to outsource, is key for successful research, particularly when using trace data produced by people. This is relevant for all research including those within industry who have privileged access to the data streams of digital daily life. Public attention to the ethical use of these data should not be avoided; after all, these datasets are based on an infrastructure that billions of people are using to live their lives, and there is a compelling public interest that research is done responsibly.

One of the best ways to cement this in daily practice is to develop codes of conduct for use in your organization or research community and for inclusion in formal education and ongoing training. The codes can provide guidance in peer review of publications and in funding consideration. In practice, a highly visible case of unethical research brings problems to an entire field, not just to those directly involved. Moreover, designing codes of conduct makes researchers more successful. Issues that might otherwise be ignored until they blow up—e.g., Are we abiding by the terms of service or users’ expectations? Does the general public consider our research “creepy”? [ 13 ]—can be addressed thoughtfully rather than in a scramble for damage control. This is particularly relevant to public-facing private businesses interested in avoiding potentially unfavorable attention.

An additional and longer-term advantage of developing codes of conduct is that it is clear that change is coming to big data research. The NSF funded the Council for Big Data, Ethics, and Society as a means of getting in front of a developing issue and pending regulatory changes within federal rules for the protection of human subjects that are currently under review [ 1 ]. Actively developing rules for responsible big data research within a research community is a key way researchers can join this ongoing process.

Establish appropriate codes of ethical conduct within your community. Make industry researchers and representatives of affected communities active contributors to this process.

8. Design your data and systems for auditability

Although codes of conduct will vary depending on the topic and research community, a particularly important element is designing data and systems for auditability. Responsible internal auditing processes flow easily into audit systems and also keep track of factors that might contribute to problematic outcomes. Developing automated testing processes for assessing problematic outcomes, and mechanisms for auditing others’ work during review processes, can help strengthen research as a whole. The goal of auditability is to clearly document when decisions are made and, if necessary, backtrack to an earlier dataset and address the issue at its root (e.g., if strategies for anonymizing data are compromised).

Designing for auditability also brings direct benefits to researchers by providing a mechanism for double-checking work and forcing oneself to be explicit about decisions, increasing understandability and replicability. For example, many types of social media and other trace data are unstructured, and answers to even basic questions such as network ties, location, and randomness depend on the steps taken to collect and collate data. Systems of auditability clarify how different datasets (and the subsequent analysis) differ from each other, aiding understanding and creating better research.
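As a hedged illustration of what designing for auditability can look like in practice, the following sketch appends a fingerprinted record of each processing decision to an audit log. The function names, fields, and placeholder data are all invented for illustration, not drawn from the article:

```python
import datetime
import hashlib
import json

def audit_step(log, step_name, data_bytes, params):
    """Record one processing decision: what was done, with which
    parameters, and a fingerprint of the data it produced."""
    log.append({
        "step": step_name,
        "params": params,
        "data_sha256": hashlib.sha256(data_bytes).hexdigest(),
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    })

audit_log = []
raw = b"user_id,timestamp,text\n..."   # placeholder for a raw export
audit_step(audit_log, "ingest", raw, {"source": "example_dump_v1"})

pseudonymized = raw.replace(b"user_id", b"pseudonym")
audit_step(audit_log, "pseudonymize", pseudonymized, {"method": "salted hash"})

# Persisted next to the dataset, this log lets a reviewer replay every
# decision and backtrack to the step where a problem was introduced.
print(json.dumps(audit_log, indent=2))
```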

Plan for and welcome audits of your big data practices.

9. Engage with the broader consequences of data and analysis practices

It is also important for responsible big data researchers to think beyond the traditional metrics of success in business and the academy. For example, the energy demands for digital daily life, a key source of big data for social science research, are significant in this era of climate change [ 36 ]. How might big data research lessen the environmental impact of data analytics work? For example, should researchers take the lead in asking cloud storage providers and data processing centers to shift to sustainable and renewable energy sources? As important and publicly visible users of the cloud, big data researchers collectively represent an interest group that could rally behind such a call for change.

The pursuit of citations, reputation, or money is a key incentive for pushing research forward, but it can also result in unintended and undesirable outcomes. In contrast, we might ask to what extent is a research project focused on enhancing the public good or the underserved of society? Are questions about equity or promoting other public values being addressed in one’s data streams, or is a big data focus rendering them invisible or irrelevant to your analysis [ 37 ]? How can increasingly vulnerable yet fundamentally important public resources—such as state-mandated cancer registries—be protected? How might research aid or inhibit different business and political actors? While all big data research need not take up social and cultural questions, a fundamental aim of research goes beyond understanding the world to considering ways to improve it.

Recognize that doing big data research has societal-wide effects.

10. Know when to break these rules

The final (and counterintuitive) rule is the charge to recognize when it is appropriate to stray from these rules. For example, in times of natural disaster or a public health emergency, it may be important to temporarily put aside questions of individual privacy in order to serve a larger public good. Likewise, the use of genetic or other biological data collected without informed consent might be vital in managing an emerging disease epidemic.

Moreover, be sure to review the regulatory expectations and legal demands associated with protection of privacy within your dataset. After all, this is an exceedingly slippery slope, so before following this rule (to break others), be cautious that the “emergency” is not simply a convenient justification. The best way to ensure this is to build experience in engaging in the tough debates (Rule 6), constructing codes of conduct (Rule 7), and developing systems for auditing (Rule 8). The more mature the community of researchers is about their processes, checks, and balances, the better equipped it is to assess when breaking the rules is acceptable. It may very well be that you do not come to a final clear set of practices. After all, just as privacy is not binary (Rule 2), neither is responsible research. Ethics is often about finding a good or better, but not perfect, answer, and it is important to ask (and try to answer) the challenging questions. Only through this engagement can a culture of responsible big data research emerge.

Understand that responsible big data research depends on more than meeting checklists.

The goal of this set of ten rules is to help researchers do better work and ultimately become more successful while avoiding larger complications, including public mistrust. To achieve this, however, scholars must shift from a mindset that is rigorous when focused on techniques and methodology and naïve when it comes to ethics. Statements to the effect that “Data is [sic] already public” [ 38 ] are unjustified simplifications of much more complex data ecosystems embedded in even more complex and contingent social practices. Data are people, and to maintain a rigorously naïve definition to the contrary [ 18 ] will end up harming research efforts in the long run as pushback comes from the people whose actions and utterances are subject to analysis.

In short, responsible big data research is not about preventing research but making sure that the work is sound, accurate, and maximizes the good while minimizing harm. The problems and choices researchers face are real, complex, and challenging and so too must be our response. We must treat big data research with the respect that it deserves and recognize that unethical research undermines the production of knowledge. Fantastic opportunities to better understand society and our world exist, but with these opportunities also come the responsibility to consider the ethics of our choices in the everyday practices and actions of our research. The Council for Big Data, Ethics, and Society ( http://bdes.datasociety.net/ ) provides an initial set of case studies, papers, and even ten simple rules for guiding this process; it is now incumbent on you to use and improve these in your research.

Acknowledgments

This article also benefitted from the input of Geoff Bowker and Helen Nissenbaum.

  • 1. Metcalf J, boyd d, Keller E. Perspectives on Big Data, Ethics, and Society. Council for Big Data, Ethics, and Society. 2016. http://bdes.datasociety.net/council-output/perspectives-on-big-data-ethics-and-society/. Accessed 31 May 2016.
  • 5. Danyllo WA, Alisson VB, Alexandre ND, Moacir LM, Jansepetrus BP, Oliveira RF. Identifying relevant users and groups in the context of credit analysis based on data from Twitter. In: Cloud and Green Computing (CGC), 2013 Third International Conference on; 2013. pp. 587–592. IEEE.
  • 6. Angwin J, Larson J, Mattu S, Kirchner L. Machine bias. ProPublica. 23 May 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing. Accessed 4 September 2016.
  • 7. Ingold D, Spencer S. Amazon doesn’t consider the race of its customers. Should it? Bloomberg.com. 21 April 2016. http://www.bloomberg.com/graphics/2016-amazon-same-day/. Accessed 12 June 2016.
  • 9. Metcalf J, Crawford K. Where are human subjects in big data research? The emerging ethics divide. Big Data & Society. 2016.
  • 11. Nissenbaum H. Privacy in Context: Technology, Policy, and the Integrity of Social Life. Stanford University Press; 2009.
  • 12. Marwick AE, boyd d. Networked privacy: How teenagers negotiate context in social media. New Media & Society. 2014.
  • 14. Massey DS, Denton NA. American Apartheid: Segregation and the Making of the Underclass. Harvard University Press; 1993.
  • 15. Davidow B. Redlining for the 21st century. The Atlantic. 5 March 2014. http://www.theatlantic.com/business/archive/2014/03/redlining-for-the-21st-century/284235/. Accessed 31 May 2016.
  • 17. Barbaro M, Zeller T, Hansell S. A face is exposed for AOL searcher no. 4417749. New York Times. 9 August 2006.
  • 18. Cox J. 70,000 OkCupid users just had their data published. Motherboard. 12 May 2016. http://motherboard.vice.com/read/70000-okcupid-users-just-had-their-data-published. Accessed 12 June 2016.
  • 19. Pandurangan V. On taxis and rainbows: Lessons from NYC’s improperly anonymized taxi logs. Medium. 2014. https://medium.com/@vijayp/of-taxis-and-rainbows-f6bc289679a1. Accessed 10 November 2015.
  • 22. Kloumann IM, Kleinberg JM. Community membership identification from small seed sets. In: Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2014. pp. 1366–1375. ACM.
  • 23. Narayanan A, Huey J, Felten EW. A precautionary approach to big data privacy. In: Data Protection on the Move; 2016. pp. 357–385. Springer Netherlands.
  • 24. Michalevsky Y, Schulman A, Veerapandian GA, Boneh D, Nakibly G. PowerSpy: Location tracking using mobile device power analysis. In: 24th USENIX Security Symposium (USENIX Security 15); 2015. pp. 785–800.
  • 30. Markham A. OKCupid data release fiasco: It’s time to rethink ethics education. Medium/Points. 18 May 2016. https://points.datasociety.net/okcupid-data-release-fiasco-ba0388348cd#.g4ofbpnc6. Accessed 12 June 2016.
  • 36. Cook G, Dowdall T, Pomerantz D, Wang Y. Clicking Clean: How Companies Are Creating the Green Internet. Greenpeace Inc., Washington, DC. 2014. http://www.greenpeace.org/usa/wp-content/uploads/legacy/Global/usa/planet3/PDFs/clickingclean.pdf
  • 38. Zimmer M. OkCupid study reveals the perils of big-data science. Wired. 14 May 2016. https://www.wired.com/2016/05/okcupid-study-reveals-perils-big-data-science/. Accessed 12 June 2016.


10 Research Question Examples to Guide your Research Project

Published on October 30, 2022 by Shona McCombes. Revised on October 19, 2023.

The research question is one of the most important parts of your research paper, thesis or dissertation. It’s important to spend some time assessing and refining your question before you get started.

The exact form of your question will depend on a few things, such as the length of your project, the type of research you’re conducting, the topic, and the research problem. However, all research questions should be focused, specific, and relevant to a timely social or scholarly issue.

Once you’ve read our guide on how to write a research question , you can use these examples to craft your own.

[Table: ten paired research questions, each contrasting a weak draft with a stronger revision. The explanations note that the weak versions are too broad, too simple (answerable with yes/no or an easily found fact), too subjective, or try to address two problems at once, while the revisions are focused and specific, use clearly defined terms, target one aspect of a problem, require in-depth investigation and an original argument, and remain feasible in scope (e.g., narrowing a comparison of drunk driving laws to one or two countries).]

Note that the design of your research question can depend on what method you are pursuing. Here are a few options for qualitative, quantitative, and statistical research questions.

[Table: example research question formulations for qualitative, quantitative, and statistical research designs.]

Other interesting articles

If you want to know more about the research process, methodology, research bias, or statistics, make sure to check out some of our other articles with explanations and examples.

Methodology

  • Sampling methods
  • Simple random sampling
  • Stratified sampling
  • Cluster sampling
  • Likert scales
  • Reproducibility

 Statistics

  • Null hypothesis
  • Statistical power
  • Probability distribution
  • Effect size
  • Poisson distribution

Research bias

  • Optimism bias
  • Cognitive bias
  • Implicit bias
  • Hawthorne effect
  • Anchoring bias
  • Explicit bias

Cite this Scribbr article


McCombes, S. (2023, October 19). 10 Research Question Examples to Guide your Research Project. Scribbr. Retrieved September 10, 2024, from https://www.scribbr.com/research-process/research-question-examples/


  • Survey paper
  • Open access
  • Published: 01 October 2015

Big data analytics: a survey

  • Chun-Wei Tsai,
  • Chin-Feng Lai,
  • Han-Chieh Chao &
  • Athanasios V. Vasilakos

Journal of Big Data, volume 2, Article number: 21 (2015)


The age of big data is now upon us, but traditional data analytics may not be able to handle such large quantities of data. The question now is how to develop a high-performance platform to efficiently analyze big data, and how to design appropriate mining algorithms to find useful things in it. To examine this issue in depth, this paper begins with a brief introduction to data analytics, followed by a discussion of big data analytics. Some important open issues and further research directions are also presented for the next steps in big data analytics.

Introduction

As information technology spreads quickly, most data today are born digital and exchanged on the internet. According to the estimate of Lyman and Varian [1], more than 92% of new data were stored on digital media by 2002, and the volume of these new data exceeded five exabytes. In fact, the problems of analyzing large-scale data did not appear suddenly; they have been with us for years, because creating data is usually much easier than finding useful things in it. Even though computer systems today are much faster than those of the 1930s, large-scale data remain a strain for the computers we have today to analyze.

In response to the problems of analyzing large-scale data, quite a few efficient methods [2] have been presented, such as sampling, data condensation, density-based approaches, grid-based approaches, divide and conquer, incremental learning, and distributed computing. These methods are constantly used to improve the performance of the operators of the data analytics process. Footnote 1 Their results illustrate that, with efficient methods at hand, we may be able to analyze large-scale data in a reasonable time. Dimensional reduction (e.g., principal components analysis, PCA [3]) is a typical example aimed at reducing the input data volume to accelerate the process of data analytics. Another reduction method, sampling [4], reduces the computations of data clustering and can also be used to speed up the computation time of data analytics.
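To illustrate the dimensional-reduction idea, here is a minimal sketch using scikit-learn’s PCA on synthetic data; the matrix shapes and component count are arbitrary assumptions for demonstration:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 50))   # synthetic large-ish input matrix

# Project onto the top 10 principal components; downstream analytics
# then operate on a representation one fifth the original width.
pca = PCA(n_components=10)
X_reduced = pca.fit_transform(X)

print(X.shape, "->", X_reduced.shape)   # (10000, 50) -> (10000, 10)
print("variance retained:", round(pca.explained_variance_ratio_.sum(), 3))
```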

Although the advances of computer systems and internet technologies have witnessed the development of computing hardware following Moore’s law for several decades, the problems of handling large-scale data still exist as we enter the age of big data. That is why Fisher et al. [5] pointed out that big data means data that cannot be handled and processed by most current information systems or methods: data in the big data era will not only be too big to load into a single machine, but most traditional data mining methods or analytics developed for a centralized data analysis process may also not be applicable directly to big data. In addition to the issue of data size, Laney [6] presented a well-known definition (also called the 3Vs) to explain what makes data “big”: volume, velocity, and variety. The 3Vs definition implies that the data size is large, that the data are created rapidly, and that the data exist in multiple types and are captured from different sources. Later studies [7, 8] pointed out that the 3Vs definition is insufficient to explain the big data we face now; thus veracity, validity, value, variability, venue, vocabulary, and vagueness were added to complement the explanation of big data [8].

Fig. 1 Expected trend of the big data market between 2012 and 2018. (Yellow, red, and blue boxes represent the order of appearance of references in this paper for a particular year.)

A report by IDC [9] indicates that the big data market was about $16.1 billion in 2014. Another IDC report [10] forecasts that it will grow to $32.4 billion by 2017. The reports of [11] and [12] further project that the big data market will reach $46.34 billion and $114 billion by 2018, respectively. As shown in Fig. 1, even though the market values of big data in these research and technology reports [9–15] differ, the forecasts consistently indicate that the scope of big data will grow rapidly in the forthcoming future.

Beyond the market itself, the results in disease control and prevention [16], business intelligence [17], and smart cities [18] make it easy to see that big data is of vital importance everywhere. Numerous studies are therefore focused on developing effective technologies to analyze big data. To discuss big data analytics in depth, this paper gives not only a systematic description of traditional large-scale data analytics but also a detailed discussion of the differences between data analytics and big data analytics frameworks, for data scientists and researchers who want to focus on big data analytics.

Moreover, although several data analytics methods and frameworks have been presented in recent years, with their pros and cons discussed in different studies, a complete discussion from the perspective of data mining and knowledge discovery in databases is still needed. This paper therefore aims to provide a brief review for researchers in the data mining and distributed computing domains, to give them a basic idea of how to use or develop data analytics for big data.

Roadmap of this paper

Figure 2 shows the roadmap of this paper, and the remainder of the paper is organized as follows. “Data analytics” begins with a brief introduction to data analytics, and then “Big data analytics” turns to the discussion of big data analytics as well as state-of-the-art data analytics algorithms and frameworks. The open issues are discussed in “The open issues”, while the conclusions and future trends are drawn in “Conclusions”.

Data analytics

To make the whole process of knowledge discovery in databases (KDD) clearer, Fayyad and his colleagues summarized the KDD process in [19] as a few operations: selection, preprocessing, transformation, data mining, and interpretation/evaluation. As shown in Fig. 3, with these operators at hand we can build a complete data analytics system that first gathers data, then finds information in the data and displays the resulting knowledge to the user. According to our observation, the number of research articles and technical reports focusing on data mining is typically larger than the number focusing on the other operators, but that does not mean the other operators of KDD are unimportant. They also play vital roles in the KDD process because they strongly impact its final result. To make the discussion of the main operators of the KDD process more concise, the following sections focus on those depicted in Fig. 3, which were simplified to three parts (input, data analytics, and output) and seven operators (gathering, selection, preprocessing, transformation, data mining, evaluation, and interpretation).

Fig. 3 The process of knowledge discovery in databases (KDD)

As shown in Fig. 3, the gathering, selection, preprocessing, and transformation operators belong to the input part. The selection operator determines which kind of data is required for data analysis and selects the relevant information from the gathered data or databases; the data gathered from different resources therefore need to be integrated into the target data. The preprocessing operator plays a different role in dealing with the input data: it aims to detect, clean, and filter unnecessary, inconsistent, and incomplete data to make them useful. After the selection and preprocessing operators, the secondary data may still be in a number of different formats; the KDD process therefore needs to transform them into a data-mining-capable format, which is performed by the transformation operator. Methods for reducing the complexity and downsizing the data scale to make the data useful for the data analysis part, such as dimensional reduction, sampling, coding, or transformation, are usually employed in this step.

The data extraction, data cleaning, data integration, data transformation, and data reduction operators can be regarded as the preprocessing processes of data analysis [20], which attempt to extract useful data from the raw data (also called the primary data) and refine them so that they can be used by the data analyses that follow. If the data are duplicated, incomplete, inconsistent, noisy, or outliers, these operators have to clean them up. If the data are too complex or too large to be handled, these operators also try to reduce them. If the raw data have errors or omissions, the role of these operators is to identify and fix them to make the data consistent. These operators can be expected to affect the analytics result of KDD, be it positively or negatively. In summary, systematic solutions usually aim to reduce the complexity of the data to accelerate the computation time of KDD and to improve the accuracy of the analytics result.
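A minimal sketch of these preprocessing operators in pandas follows; the toy data and the age threshold are invented for illustration and are not from the paper:

```python
import pandas as pd

# Toy raw data exhibiting the defects these operators target:
# a duplicate row, a missing value, and an impossible outlier.
raw = pd.DataFrame({
    "age":    [23.0, 23.0, None, 310.0, 45.0],
    "income": [40_000, 40_000, 52_000, 61_000, 58_000],
})

clean = raw.drop_duplicates()                 # cleaning: duplicate copies
clean = clean[clean["age"].between(0, 120)]   # cleaning: outliers and missing ages

# Transformation: rescale into a mining-friendly standardized format.
transformed = (clean - clean.mean()) / clean.std()
print(transformed)
```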

Data analysis

Since the data analysis step in KDD (as shown in Fig. 3) is responsible for finding the hidden patterns/rules/information in the data, most researchers in this field use the term data mining to describe how they refine the “ground” (i.e., raw data) into the “gold nugget” (i.e., information or knowledge). The data mining methods [20] are not limited to problem-specific methods; other technologies (e.g., statistical or machine learning technologies) have also been used to analyze data for many years. In the early stages of data analysis, statistical methods were used to analyze data and help us understand the situation we were facing, such as public opinion polls or TV programme ratings. Like statistical analysis, the problem-specific methods of data mining also attempt to understand the meaning of the collected data.

After the data mining problem was posed, some domain-specific algorithms were developed. An example is the apriori algorithm [21], one of the useful algorithms designed for the association rules problem. Although most definitions of data mining problems are simple, their computation costs are quite high. To speed up the response time of a data mining operator, machine learning [22], metaheuristic algorithms [23], and distributed computing [24] have been used alone or combined with traditional data mining algorithms to provide more efficient ways of solving data mining problems. One well-known combination can be found in [25], where Krishna and Murty combined a genetic algorithm with k-means to obtain better clustering results than k-means alone.

Data mining algorithm

As Fig. 4 shows, most data mining algorithms contain the initialization, data input and output, data scan, rules construction, and rules update operators [ 26 ]. In Fig. 4, D represents the raw data, d the data from the scan operator, r the rules, o the predefined measurement, and v the candidate rules. The scan, construct, and update operators are performed repeatedly until the termination criterion is met. The timing of the scan operator depends on the design of the data mining algorithm; thus, it can be considered an optional operator. Most data mining algorithms can be described by Fig. 4, which also shows that the representative algorithms—clustering, classification, association rules, and sequential patterns—apply these operators to find the hidden information in the raw data. Thus, modifying these operators is one possible way to enhance the performance of the data analysis.

Clustering is one of the well-known data mining problems because it can be used to understand “new” input data. The basic idea of this problem [ 27 ] is to separate a set of unlabeled input data Footnote 2 into k different groups, as k-means does [ 28 ]. Classification [ 20 ] is the opposite of clustering because it relies on a set of labeled input data to construct a set of classifiers (i.e., groups), which are then used to classify unlabeled input data into the groups to which they belong. To solve the classification problem, decision tree-based algorithms [ 29 ], naïve Bayesian classification [ 30 ], and support vector machines (SVM) [ 31 ] have been widely used in recent years.
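
To make the contrast concrete, the short sketch below (assuming the scikit-learn library and synthetic data, both our own choices) clusters unlabeled points with k-means and then trains a decision tree classifier on labeled points:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier

# Clustering: separate unlabeled data into k groups.
X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster labels:", kmeans.labels_[:10])

# Classification: learn from labeled data, then label unseen data.
clf = DecisionTreeClassifier(random_state=0).fit(X[:200], y_true[:200])
print("predicted groups:", clf.predict(X[200:210]))
```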

Unlike clustering and classification, which attempt to assign the input data to k groups, association rules and sequential patterns focus on finding the “relationships” among the input data. The basic idea of association rules [ 21 ] is to find all the co-occurrence relationships among the input data. For the association rules problem, the apriori algorithm [ 21 ] is one of the most popular methods. Nevertheless, because it is computationally very expensive, later studies [ 32 ] have attempted to use different approaches to reduce its cost, such as applying the genetic algorithm to this problem [ 33 ]. If, in addition to the relationships among the input data, we also consider their sequence or time series, the problem is referred to as the sequential pattern mining problem [ 34 ]. Several apriori-like algorithms have been presented for solving it, such as generalized sequential patterns [ 34 ] and sequential pattern discovery using equivalence classes (SPADE) [ 35 ].
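
A minimal sketch of the apriori idea in plain Python follows (the toy transactions and the minimum support are invented for illustration; real implementations add further candidate pruning):

```python
transactions = [{"milk", "bread"}, {"milk", "eggs"},
                {"milk", "bread", "eggs"}, {"bread", "eggs"}]
minsup = 2  # minimum number of transactions an itemset must appear in

def support(itemset):
    return sum(itemset <= t for t in transactions)

# Level 1: frequent single items.
items = {i for t in transactions for i in t}
frequent = [{i} for i in items if support({i}) >= minsup]

# Level k: only extend frequent (k-1)-itemsets (the apriori property).
k = 2
while frequent:
    print(f"frequent {k - 1}-itemsets:", frequent)
    candidates = {frozenset(a | b) for a in frequent for b in frequent
                  if len(a | b) == k}
    frequent = [set(c) for c in candidates if support(c) >= minsup]
    k += 1
```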

Output the result

Evaluation and interpretation are two vital operators of the output part. Evaluation typically plays the role of measuring the results. It can also be one of the operators of the data mining algorithm itself, such as the sum of squared errors, which was used by the selection operator of the genetic algorithm for the clustering problem [ 25 ].

To solve data mining problems that attempt to classify the input data, two of the major goals are: (1) cohesion—the distance between each datum and the centroid (mean) of its cluster should be as small as possible, and (2) coupling—the distance between data belonging to different clusters should be as large as possible. In most studies of data clustering or classification problems, the sum of squared errors (SSE), which is used to measure the cohesion of the data mining results, can be defined as

\[ \mathrm{SSE} = \sum ^k_{i=1} \sum ^{n_i}_{j=1} D(x_{ij}, c_i)^2 \]

where k is the number of clusters, which is typically given by the user; \(n_i\) is the number of data in the i th cluster; \(x_{ij}\) is the j th datum in the i th cluster; \(c_i\) is the mean of the i th cluster; and \(n= \sum ^k_{i=1} n_i\) is the total number of data. The most commonly used distance measure for the data mining problem is the Euclidean distance, which is defined as

\[ D(p_i, p_j) = \Vert p_i - p_j \Vert _2 = \left( \sum _{d} \left| p_{id} - p_{jd} \right| ^2 \right) ^{1/2} \]

where \(p_i\) and \(p_j\) are the positions of two different data. For solving different data mining problems, the distance measure \(D(p_i, p_j)\) can also be the Manhattan distance, the Minkowski distance, or even the cosine similarity [ 36 ] between two different documents.

Accuracy (ACC) is another well-known measure [ 37 ], defined as the fraction of data that are correctly classified:

\[ \mathrm{ACC} = \frac{\text{number of correctly classified data}}{\text{total number of data}} \]

To evaluate the classification results, precision ( p ), recall ( r ), and the F -measure can be used to measure how many data that do not belong to group A are incorrectly classified into group A, and how many data that belong to group A are not classified into group A. A simple confusion matrix of a classifier [ 37 ], as given in Table 1, can be used to cover all the situations of the classification results.

In Table 1, TP and TN indicate the numbers of positive examples and negative examples that are correctly classified, respectively; FN and FP indicate the numbers of positive examples and negative examples that are incorrectly classified, respectively. With the confusion matrix at hand, it is much easier to describe the meaning of precision ( p ), which is defined as

\[ p = \frac{TP}{TP+FP} \]

and the meaning of recall ( r ), which is defined as

\[ r = \frac{TP}{TP+FN} \]

The F -measure can then be computed as

\[ F = \frac{2pr}{p+r} \]
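
Assuming the four counts of the confusion matrix are at hand, these measures are straightforward to compute; a small sketch with invented counts follows:

```python
# Hypothetical counts taken from a confusion matrix such as Table 1.
TP, TN, FP, FN = 80, 90, 10, 20

accuracy  = (TP + TN) / (TP + TN + FP + FN)
precision = TP / (TP + FP)
recall    = TP / (TP + FN)
f_measure = 2 * precision * recall / (precision + recall)

print(f"ACC={accuracy:.2f}, p={precision:.2f}, "
      f"r={recall:.2f}, F={f_measure:.2f}")
```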

In addition to the above-mentioned measures for evaluating the data mining results, the computation cost and response time are two other well-known measures. When two different mining algorithms can find the same or similar results, how fast they can obtain the final mining results naturally becomes the most important research topic.

After something (e.g., classification rules) is found by data mining methods, two essential research topics remain: (1) navigating and exploring the meaning of the results from the data analysis to further support the user in making applicable decisions, which can be regarded as the interpretation operator [ 38 ] and which in most cases provides a useful interface to display the information [ 39 ]; and (2) producing a meaningful summarization of the mining results [ 40 ] to make it easier for the user to understand the information from the data analysis. Data summarization is generally expected to be one of the simplest ways to provide a concise piece of information to the user, because humans have trouble understanding vast amounts of complicated information. A simple example of data summarization can be found in clustering search engines: when the query “oasis” is sent to Carrot2 ( http://search.carrot2.org/stable/search ), it returns a set of keywords to represent each group of the clustered web links, helping the user recognize the relevant category, as shown on the left side of Fig. 5.

Screenshot of the results of a clustering search engine

A useful graphical user interface is another way to provide meaningful information to a user. As explained by Shneiderman in [ 39 ], we need “overview first, zoom and filter, then details on demand”. A useful graphical user interface [ 38 , 41 ] also makes it easier for the user to comprehend the meaning of the results when the number of dimensions is higher than three. How the results of data mining are displayed will affect the user’s perspective when making decisions. For instance, data mining can help us find “type A influenza” in a particular region, but without the time series and flu-virus infection information of the patients, the government cannot recognize the situation we are facing (pandemic or under control) and make appropriate responses. For this reason, a better solution that merges the information from different sources and from the results of different mining algorithms will be useful in letting the user make the right decision.

Since the problems of handling and analyzing large-scale and complex input data always exist in data analytics, several efficient analysis methods have been presented to accelerate the computation time or to reduce the memory cost of the KDD process, as shown in Table 2. The study of [ 42 ] shows that a basic mathematical concept (the triangle inequality) can be used to reduce the computation cost of a clustering algorithm. Another study [ 43 ] shows that new technologies (distributed computing by GPU) can also be used to reduce the computation time of a data analysis method. In addition to these well-known improvements (triangle inequality and distributed computing), a large proportion of studies designed their efficient methods around the characteristics of the mining algorithms or the problem itself, as can be found in [ 32 , 44 , 45 ], and so forth. This kind of improved method was typically designed to remedy a drawback of the mining algorithms or to solve the mining problem in a different way. These situations can be found in most association rules and sequential patterns problems because the original assumption of these problems is the analysis of large-scale datasets. Since the earlier frequent pattern algorithms (e.g., the apriori algorithm) need to scan the whole dataset many times, which is computationally very expensive, how to reduce the number of scans of the whole dataset so as to save computation cost is one of the most important concerns in frequent pattern studies. A similar situation exists in data clustering and classification studies; design concepts such as mining the patterns on-the-fly [ 46 ], mining partial patterns at different stages [ 47 ], and reducing the number of times the whole dataset is scanned [ 32 ] have therefore been presented to enhance the performance of these mining algorithms. Since some data mining problems are NP-hard [ 48 ] or their solution space is very large, several recent studies [ 23 , 49 ] have attempted to use metaheuristic algorithms as the mining algorithm to obtain an approximate solution within a reasonable time.
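
To illustrate the triangle-inequality trick mentioned above: if the distance between two centroids is at least twice the distance from a point to the first centroid, the second centroid cannot be closer, so its distance need not be computed. The sketch below (plain Python with synthetic points; a simplified illustration, not the full method of [ 42 ]) counts how many distance computations the rule skips:

```python
import math
import random

random.seed(0)
points    = [(random.random(), random.random()) for _ in range(1000)]
centroids = [(0.1, 0.1), (0.5, 0.5), (0.9, 0.9)]
skipped = 0

for p in points:
    best, best_d = 0, math.dist(p, centroids[0])
    for j in range(1, len(centroids)):
        # Triangle inequality: if d(c_best, c_j) >= 2 * d(p, c_best),
        # then d(p, c_j) >= d(p, c_best), so c_j can be skipped safely.
        if math.dist(centroids[best], centroids[j]) >= 2 * best_d:
            skipped += 1
            continue
        d = math.dist(p, centroids[j])
        if d < best_d:
            best, best_d = j, d

print(f"distance computations skipped: {skipped}")
```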

Abundant research results on data analysis [ 20 , 27 , 63 ] show possible solutions for dealing with the dilemmas of data mining algorithms. This means that the open issues of data analysis identified in the literature [ 2 , 64 ] can usually help us find possible solutions. For instance, the clustering result is extremely sensitive to the initial means, which can be mitigated by using multiple sets of initial means [ 65 ]. According to our observations, most data analysis methods have the following limitations with respect to big data:

Unscalability and centralization Most data analysis methods are not designed for large-scale and complex datasets. Traditional data analysis methods cannot be scaled up because their design does not take large or complex datasets into account; they typically assume that they run on a single machine, with all the data in memory during the analysis process. For this reason, the performance of traditional data analytics is limited in solving the volume problem of big data.

Non-dynamic Most traditional data analysis methods cannot be dynamically adjusted to different situations, meaning that they do not analyze the input data on-the-fly. For example, the classifiers are usually fixed and cannot be changed automatically. Incremental learning [ 66 ] is a promising research trend because it can dynamically adjust the classifiers during the training process with limited resources. As a result, traditional data analytics may not be able to address the velocity problem of big data.

Uniform data structure Most data mining problems assume that the format of the input data is uniform. Therefore, traditional data mining algorithms may not be able to deal with input data whose formats differ from source to source or that are partially incomplete. Converting the input data from different sources into the same format is thus a possible solution to the variety problem of big data.

Because traditional data analysis methods are not designed for large-scale and complex data, they are almost incapable of analyzing big data. Redesigning the data analysis methods and changing the way they are designed are two critical trends for big data analysis. Several important concepts in the design of big data analysis methods are given in the following sections.

Big data analytics

Nowadays, the data that need to be analyzed are not just large; they are composed of various data types and may even include streaming data [ 67 ]. Big data has the unique features of being “massive, high dimensional, heterogeneous, complex, unstructured, incomplete, noisy, and erroneous,” which may change the statistical and data analysis approaches [ 68 ]. Although it seems that big data makes it possible to collect more data and thus find more useful information, the truth is that more data do not necessarily mean more useful information; they may contain more ambiguous or abnormal data. For instance, a user may have multiple accounts, or an account may be used by multiple users, which may degrade the accuracy of the mining results [ 69 ]. Therefore, several new issues for data analytics come up, such as privacy, security, storage, fault tolerance, and quality of data [ 70 ].

The comparison between traditional data analysis and big data analysis on wireless sensor network

Big data may be created by handheld devices, social networks, the internet of things, multimedia, and many other new applications that all have the characteristics of volume, velocity, and variety. As a result, the whole data analytics process has to be re-examined from the following perspectives:

From the volume perspective, the deluge of input data is the very first thing we need to face because it may paralyze the data analytics. Different from traditional data analytics, for wireless sensor network data analysis, Baraniuk [ 71 ] pointed out that the bottleneck of big data analytics will shift from the sensors to the processing, communication, and storage of the sensed data, as shown in Fig. 6. This is because sensors can gather much more data than before, and uploading such large volumes of data to an upper-layer system may create bottlenecks everywhere.

In addition, from the velocity perspective, real-time or streaming data bring the problem of a large quantity of data coming into the data analytics within a short duration, while the device and system may not be able to handle all of the input data. This situation is similar to that of network flow analysis, where we typically cannot mirror and analyze everything we gather.

From the variety perspective, because the incoming data may be of different types or may be incomplete, how to handle them brings up another issue for the input operators of data analytics.

In this section, we will turn the discussion to the big data analytics process.

Big data input

The problem of handling a vast quantity of data that the system is unable to process is not a brand-new research issue; in fact, it appeared in several early approaches [ 2 , 21 , 72 ], e.g., marketing analysis, network flow monitoring, gene expression analysis, weather forecasting, and even astronomy analysis. This problem still exists in big data analytics today; thus, preprocessing is an important task that makes the computer, platform, and analysis algorithm able to handle the input data. The traditional data preprocessing methods [ 73 ] (e.g., compression, sampling, feature selection, and so on) are expected to remain effective in the big data age. However, a portion of the studies still focus on how to reduce the complexity of the input data, because even the most advanced computer technology cannot, in most cases, efficiently process the whole input data on a single machine. Using domain knowledge to design the preprocessing operator is one possible solution for big data. In [ 74 ], Ham and Lee used domain knowledge, the B-tree, and divide-and-conquer to filter out unrelated log information for mobile web log analysis. A later study [ 75 ] considered that the computation cost of preprocessing would be quite high for massive logs, sensor, or marketing data analysis; thus, Dawelbeit and McCrindle employed the bin packing partitioning method to divide the input data between the computing processors to handle the high computation cost of preprocessing on a cloud system. The cloud system is employed to preprocess the raw data and then output the refined data (e.g., data with a uniform format) to make it easier for the data analysis method or system to perform the further analysis work.

Sampling and compression are two representative data reduction methods for big data analytics because reducing the size of the data makes the data analytics computationally less expensive and thus faster, especially for data coming into the system rapidly. In addition to making the sampled data represent the original data effectively [ 76 ], how many instances need to be selected for the data mining method is another research issue [ 77 ] because it affects the performance of the sampling method in most cases.
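
A classic example of such a sampling method is reservoir sampling, which maintains a fixed-size uniform sample of a stream without storing the whole stream; a minimal sketch in plain Python (with a simulated stream and an invented sample size) is:

```python
import random

def reservoir_sample(stream, k):
    """Keep a uniform random sample of size k from a stream of unknown length."""
    reservoir = []
    for n, item in enumerate(stream):
        if n < k:
            reservoir.append(item)
        else:
            # The (n+1)-th item replaces a reservoir slot with probability k/(n+1).
            j = random.randrange(n + 1)
            if j < k:
                reservoir[j] = item
    return reservoir

# A simulated fast data stream of one million records, reduced to 100.
sample = reservoir_sample(range(1_000_000), k=100)
print(len(sample), sample[:5])
```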

To avoid the application-level slowdown caused by the compression process, Jun et al. [ 78 ] attempted to use an FPGA to accelerate the compression process. I/O performance optimization is another issue for compression methods. For this reason, Zou et al. [ 79 ] employed tentative selection and predictive dynamic selection, switching between two different compression strategies to improve the performance of the compression process. To make it possible for a compression method to compress the data efficiently, a promising solution is to apply a clustering method to the input data to divide them into several different groups and then compress the input data according to the clustering information. The compression method described in [ 80 ] is one such solution: it first clusters the input data and then compresses them via the clustering results, while the study [ 81 ] also used a clustering method to improve the performance of the compression process.
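
The following sketch illustrates the general cluster-then-compress idea (it is our own simplified, lossy variant using NumPy and scikit-learn, not the specific method of [ 80 ] or [ 81 ]): each point is encoded as its cluster id plus a coarsely quantized residual from the cluster centroid.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data drawn from three well-separated groups.
data = np.concatenate([rng.normal(loc, 0.1, (1000, 4))
                       for loc in (0.0, 5.0, 10.0)])

k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(data)
residuals = data - km.cluster_centers_[km.labels_]

# Residuals are small within a cluster, so 8-bit quantization suffices.
scale = np.abs(residuals).max() / 127.0
codes = np.round(residuals / scale).astype(np.int8)

raw_bytes = data.astype(np.float32).nbytes
compressed_bytes = (codes.nbytes + km.labels_.astype(np.uint8).nbytes
                    + km.cluster_centers_.astype(np.float32).nbytes)
print(raw_bytes, "->", compressed_bytes)   # roughly a 3x size reduction here
```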

In summary, in addition to handling large and fast data input, the research issues of heterogeneous data sources, incomplete data, and noisy data may also affect the performance of the data analysis. The input operators will have a stronger impact on data analytics in the big data age than they had in the past. As a result, the design of big data analytics needs to consider how to make these tasks (e.g., data cleaning, data sampling, data compression) work well.

Big data analysis frameworks and platforms

Various solutions have been presented for big data analytics, which can be divided [ 82 ] into (1) processing/compute: Hadoop [ 83 ], Nvidia CUDA [ 84 ], or Twitter Storm [ 85 ]; (2) storage: Titan or HDFS; and (3) analytics: MLPACK [ 86 ] or Mahout [ 87 ]. Although commercial products for data analysis exist [ 83 – 86 ], most studies on traditional data analysis focus on the design and development of efficient and/or effective “ways” to find useful things in the data. But as we enter the age of big data, most current computer systems cannot handle the whole dataset all at once; thus, how to design a good data analytics framework or platform Footnote 3 and how to design the analysis methods are both important for the data analysis process. In this section, we start with a brief introduction to data analysis frameworks and platforms, followed by a comparison of them.

The basic idea of big data analytics on cloud system

Research on frameworks and platforms

To date, we can easily find tools and platforms presented by well-known organizations. Cloud computing technologies are widely used on these platforms and frameworks to satisfy the large demands for computing power and storage. As shown in Fig. 7, most of the work of KDD for big data can be moved to a cloud system to speed up the response time or to increase the memory space. As the foundational functions for handling and managing big data have developed gradually, data scientists nowadays do not have to take care of everything, from raw data gathering to data analysis, by themselves if they use the existing platforms or technologies to handle and manage the data. They can instead pay more attention to finding useful information in the data, even though this task is typically like looking for a needle in a haystack. That is why several recent studies have tried to present efficient and effective frameworks to analyze big data, especially for finding out useful things.

Performance-oriented From the perspective of platform performance, Huai [ 88 ] pointed out that most traditional parallel processing models improve the performance of the system by replacing the old computer system with a new, larger one, which is usually referred to as “scale up”, as shown in Fig. 8a. For big data analytics, however, most research improves the performance of the system by adding more similar computer systems, so that the system can handle all the tasks that cannot be loaded or computed on a single machine (called “scale out”), as shown in Fig. 8b, where M1, M2, and M3 represent computer systems with different computing power. For the scale-up solution, the computing power of the three systems is in the order \(\text {M3}>\text {M2}>\text {M1}\) ; for the scale-out solution, all we have to do is keep adding similar computer systems to the system to increase its capability. To build a scalable and fault-tolerant manager for big data analysis, Huai et al. [ 88 ] presented a matrix model called DOT, which consists of three matrices: for the data set (D), concurrent data processing operations (O), and data transformations (T). The big data is divided into n subsets, each of which is processed by a computer node (worker) in such a way that all the subsets are processed concurrently; the results from these n computer nodes are then collected and transformed on a computer node. With this framework, the whole data analysis framework is composed of several DOT blocks, and the system performance can easily be enhanced by adding more DOT blocks to the system (a minimal sketch of this split-process-collect pattern is given after Fig. 8's caption below).

The comparisons between scale up and scale out
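
The split-process-collect pattern behind DOT can be sketched in a few lines of Python (the data, the per-subset operation, and the number of workers are all invented here; concurrent.futures merely stands in for real worker nodes):

```python
from concurrent.futures import ProcessPoolExecutor
import numpy as np

def process_subset(subset):
    """O: the operation each worker applies concurrently to its subset."""
    return subset.sum(), len(subset)

if __name__ == "__main__":
    data = np.arange(1_000_000, dtype=np.float64)   # D: the data set
    subsets = np.array_split(data, 8)               # one subset per worker

    with ProcessPoolExecutor() as pool:
        partials = list(pool.map(process_subset, subsets))

    # T: transform the collected partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(n for _, n in partials)
    print("global mean:", total / count)
```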

Another efficient big data analytics system, called the generalized linear aggregates distributed engine (GLADE), was presented in [ 89 ]. GLADE is a multi-level tree-based data analytics system consisting of two types of computer nodes: a coordinator and workers. The simulation results in [ 90 ] show that GLADE can provide better performance than Hadoop in terms of execution time. Because Hadoop requires large memory and storage for data replication and has a single master, Footnote 4 Essa et al. [ 91 ] presented a mobile-agent-based framework, called map reduce agent mobility (MRAM), to solve these two problems. The main idea is that each mobile agent can send its code and data to any other machine; therefore, the whole system will not go down if the master fails. Compared to Hadoop, the architecture of MRAM was changed from client/server to distributed agents. The load time of MRAM is lower than that of Hadoop, even though both use the map-reduce solution and the Java language. In [ 92 ], Herodotou et al. considered the issues of user needs and system workloads and presented a self-tuning analytics system built on Hadoop for big data analysis. Since one of the major goals of their system is to adjust itself to the user needs and system workloads so as to provide good performance automatically, the user usually does not need to understand and manipulate the Hadoop system. The study [ 93 ] took the perspectives of data-centric architecture and operational models to present a big data architecture framework (BDAF) which includes big data infrastructure, big data analytics, data structures and models, big data lifecycle management, and big data security. According to the observations of Demchenko et al. [ 93 ], cluster services, Hadoop-related services, data analytics tools, databases, servers, and massively parallel processing databases are typically the required applications and services in a big data analytics infrastructure.

Result-oriented Fisher et al. [ 5 ] presented a big data pipeline to show the workflow of big data analytics for extracting valuable knowledge from big data, which consists of acquiring the data, choosing the architecture, shaping the data into the architecture, coding/debugging, and reflection. From the perspectives of statistical computation and data mining, Ye et al. [ 94 ] presented a services platform architecture that integrates R to provide better data analysis services, called the cloud-based big data mining and analyzing services platform (CBDMASP). The design of this platform is composed of four layers: the infrastructure services layer, the virtualization layer, the dataset processing layer, and the services layer. Several large-scale clustering problems (with datasets of sizes from 0.1 GB up to 25.6 GB) were also used to evaluate the performance of the CBDMASP. The simulation results show that using map-reduce is much faster than using a single machine when the input data become too large. Although the test datasets cannot be regarded as big data, the speedup of big data analytics using map-reduce can be gauged via this kind of testing. In this study, map-reduce was the better solution for datasets larger than 0.2 GB, and a single machine was unable to handle a dataset larger than 1.6 GB.

Another study [ 95 ] presented a theorem to describe the characteristics of big data, called HACE: big data usually is of large volume and comes from Heterogeneous, Autonomous sources with distributed and decentralized control, and we usually try to find useful and interesting things in the Complex and Evolving relationships among the data. Based on these concerns and data mining issues, Wu and his colleagues [ 95 ] also presented a big data processing framework which includes a data accessing and computing tier, a data privacy and domain knowledge tier, and a big data mining algorithm tier. This work explains that data mining algorithms will become much more important and much more difficult to design; thus, challenges will also arise in the design and implementation of big data analytics platforms. In addition to platform performance and data mining issues, the privacy issue of big data analytics has been a promising research topic in recent years. In [ 96 ], Laurila et al. explained that privacy is an essential problem when we try to find something in data gathered from mobile devices; thus, data security and data anonymization should also be considered when analyzing this kind of data. Demirkan and Delen [ 97 ] presented a service-oriented decision support system (SODSS) for big data analytics which includes information sources, data management, information management, and operations management.

Comparison between the frameworks/platforms of big data

In [ 98 ], Talia pointed out that cloud-based data analytics services can be divided into data analytics software as a service, data analytics platform as a service, and data analytics infrastructure as a service. A later study [ 99 ] presented a general architecture of big data analytics which contains multi-source big data collecting, distributed big data storing, and intra/inter big data processing. Since many kinds of data analytics frameworks and platforms have been presented, some studies have attempted to compare them to give guidance in choosing the applicable frameworks or platforms for relevant work. To give a brief introduction to big data analytics, especially the platforms and frameworks, Cuzzocrea et al. [ 100 ] first discuss how recent studies have responded to the “computational emergency” issue of big data analytics. Some open issues, such as data source heterogeneity and uncorrelated data filtering, and possible research directions are also given in the same study. In [ 101 ], Zhang and Huang used the 5Ws model to explain what kind of framework and method we need for different big data approaches; they explained that the 5Ws model represents what kind of data, why we have these data, where the data come from, when the data occur, who receives the data, and how the data are transferred. A later study [ 102 ] used several features (owner, workload, source code, low latency, and complexity) to compare the Hadoop [ 83 ], Storm [ 85 ], and Drill [ 103 ] frameworks, from which it can easily be seen that Apache Hadoop has high latency compared with the other two. To better understand the strengths and weaknesses of big data solutions, Chalmers et al. [ 82 ] employed volume, variety, variability, velocity, user skill/experience, and infrastructure to evaluate eight big data analytics solutions.

In [ 104 ], in addition to defining that a big data system should include data generation, data acquisition, data storage, and data analytics modules, Hu et al. also mentioned that a big data system can be decomposed into infrastructure, computing, and application layers. Moreover, promising research on NoSQL storage systems was also discussed in this study, dividing them into key-value, column, document, and row databases. Since big data analysis is generally regarded as highly computation-intensive work, the high-performance computing cluster system (HPCC) is also a possible solution in the early stage of big data analytics. Sagiroglu and Sinanc [ 105 ] therefore compared the characteristics of HPCC and Hadoop, emphasizing that the HPCC system uses multikey and multivariate indexes on a distributed file system while Hadoop uses a column-oriented database. In [ 17 ], Chen et al. give a brief introduction to the big data analytics of business intelligence (BI) from the perspective of evolution, applications, and emerging research topics. In their survey, Chen et al. explained that the evolution of business intelligence and analytics (BI&A) went from BI&A 1.0 through BI&A 2.0 to BI&A 3.0, which are DBMS-based with structured content, web-based with unstructured content, and mobile and sensor based content, respectively.

Big data analysis algorithms

Mining algorithms for specific problems

Because big data issues have been around for nearly ten years, Fan and Bifet [ 106 ] pointed out that the terms “big data” [ 107 ] and “big data mining” [ 108 ] were both first presented in 1998. That big data and big data mining appeared at almost the same time suggests that finding something from big data will be one of the major tasks in this research domain. Data mining algorithms for data analysis also play a vital role in big data analysis, in terms of the computation cost, memory requirement, and accuracy of the end results. In this section, we give a brief discussion from the perspective of analysis and search algorithms to explain their importance for big data analytics.

Clustering algorithms In the big data age, traditional clustering algorithms will become even more limited than before because they typically require all the data to be in the same format and to be loaded into the same machine in order to find something useful in the whole dataset. Although the problem [ 64 ] of analyzing large-scale and high-dimensional datasets has attracted many researchers from various disciplines since the last century, and several solutions [ 2 , 109 ] have been presented in recent years, the characteristics of big data still bring up several new challenges for data clustering. Among them, how to reduce the data complexity is one of the important issues for big data clustering. In [ 110 ], Shirkhorshidi et al. divided big data clustering into two categories: single-machine clustering (i.e., sampling and dimension reduction solutions) and multiple-machine clustering (parallel and MapReduce solutions). This means that traditional reduction solutions can also be used in the big data age because the complexity and memory space needed for the data analysis process are decreased by sampling and dimension reduction methods. More precisely, sampling can be regarded as reducing the “amount of data” entering the data analysis process, while dimension reduction can be regarded as “downsizing the whole dataset” because irrelevant dimensions are discarded before the data analysis process is carried out.
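
A minimal single-machine sketch of this combination (assuming scikit-learn; the data sizes and parameters are invented) first reduces the dimensions with PCA, then clusters only a sample, and finally assigns the full dataset to the learned centers:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100_000, 50))        # large, high-dimensional input

# Dimension reduction: downsize the whole dataset first.
X_low = PCA(n_components=5).fit_transform(X)

# Sampling: fit the clustering on a small subset of the reduced data.
idx = rng.choice(len(X_low), size=5_000, replace=False)
km = KMeans(n_clusters=8, n_init=10, random_state=0).fit(X_low[idx])

# The cheap part: assign every point to its nearest learned center.
labels = km.predict(X_low)
print(np.bincount(labels))
```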

CloudVista [ 111 ] is a representative solution for big data clustering which uses cloud computing to perform the clustering process in parallel. BIRCH [ 44 ] and a sampling method were used in CloudVista to show that it is able to handle large-scale data, e.g., 25 million census records. Using a GPU to enhance the performance of a clustering algorithm is another promising solution for big data mining. The multiple species flocking (MSF) algorithm [ 112 ] was ported to NVIDIA's CUDA platform to reduce the computation time of the clustering algorithm in [ 113 ]. The simulation results show that the speedup factor can be increased from 30 up to 60 by using the GPU for data clustering. Since most traditional clustering algorithms (e.g., k-means) require centralized computation, how to make them capable of handling big data clustering problems is the major concern of Feldman et al. [ 114 ], who use a tree construction for generating coresets in parallel, called the “merge-and-reduce” approach. Moreover, Feldman et al. pointed out that with this solution the update time per datum and the memory requirement of traditional clustering algorithms can be significantly reduced.

Classification algorithms Similar to the clustering algorithms for big data mining, several studies have attempted to modify traditional classification algorithms to make them work in a parallel computing environment or to develop new classification algorithms that work naturally in a parallel computing environment. In [ 115 ], the design of the classification algorithm took into account input data gathered from distributed data sources and processed by a heterogeneous set of learners. Footnote 5 In this study, Tekin et al. presented a novel classification algorithm called “classify or send for classification” (CoS). They assumed that each learner in a distributed data classification system can process the input data in two different ways: it either performs the classification function by itself or forwards the input data to another learner to have them labeled; information is exchanged between different learners. In brief, this kind of solution can be regarded as cooperative learning to improve accuracy in solving the big data classification problem. An interesting alternative uses quantum computing to reduce the memory space and computing cost of a classification algorithm. For example, in [ 116 ], Rebentrost et al. presented a quantum-based support vector machine for big data classification and argued that their algorithm can be implemented with a time complexity of \(O(\log NM)\), where N is the number of dimensions and M is the number of training data. There are bright prospects for big data mining with quantum-based search algorithms once quantum computing hardware matures.

Frequent pattern mining algorithms Most research on frequent pattern mining (i.e., association rules and sequential pattern mining) focused on handling large-scale datasets from the very beginning, because some early approaches were attempts to analyze the transaction data of large shopping malls. Because the number of transactions is usually more than tens of thousands, the issue of handling large-scale data was studied for several years; FP-tree [ 32 ], for example, uses a tree structure to hold the frequent patterns and further reduce the computation time of association rule mining. In addition to the traditional frequent pattern mining algorithms, parallel computing and cloud computing technologies have of course also attracted researchers in this research domain. Among them, the map-reduce solution was used in the studies [ 117 – 119 ] to enhance the performance of frequent pattern mining algorithms. Given the map-reduce model for frequent pattern mining, its application on “cloud platforms” [ 120 , 121 ] can be expected to become a popular trend in the forthcoming future. The study of [ 119 ] not only used the map-reduce model but also allowed users to express their specific interest constraints in the process of frequent pattern mining. The performance of these map-reduce-based methods for big data analysis is, no doubt, better than that of traditional frequent pattern mining algorithms running on a single machine.
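
The map-reduce formulation can be illustrated with a small sketch (plain Python simulating the two phases; the transactions, the partitioning, and the minimum support are invented): each mapper emits candidate itemsets from its share of the transactions, and the reducer sums the counts and filters by support.

```python
from collections import Counter
from itertools import combinations

partitions = [  # transactions split across the mapper nodes
    [{"milk", "bread"}, {"milk", "eggs"}],
    [{"milk", "bread", "eggs"}, {"bread", "eggs"}],
]
minsup = 2

def mapper(transactions):
    # Emit (itemset, 1) for every item pair occurring in a transaction.
    for t in transactions:
        for pair in combinations(sorted(t), 2):
            yield frozenset(pair), 1

def reducer(emitted):
    counts = Counter()
    for itemset, one in emitted:
        counts[itemset] += one
    return {tuple(sorted(s)): c for s, c in counts.items() if c >= minsup}

emitted = [kv for part in partitions for kv in mapper(part)]
print(reducer(emitted))
```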

Machine learning for big data mining

The potential of machine learning for data analytics can easily be found in the early literature [ 22 , 49 ]. Different from data mining algorithms designed for specific problems, machine learning algorithms can be used for many different mining and analysis problems because they are typically employed as the “search” algorithm for the required solution. Since most machine learning algorithms can find an approximate solution to an optimization problem, they can be employed for most data analysis problems if those problems can be formulated as optimization problems. For example, the genetic algorithm, one of the machine learning algorithms, can be used not only to solve the clustering problem [ 25 ] but also to solve the frequent pattern mining problem [ 33 ]. The potential of machine learning is not merely in solving different mining problems in the data analysis operator of KDD; it can also enhance the performance of the other parts of KDD, such as feature reduction for the input operators [ 72 ].

A recent study [ 68 ] shows that traditional mining algorithms, statistical methods, preprocessing solutions, and even GUIs have been applied to several representative tools and platforms for big data analytics. The results show clearly that machine learning algorithms will be one of the essential parts of big data analytics. One problem in using current machine learning methods for big data analytics is shared with most traditional data mining algorithms: they are designed for sequential or centralized computing. One of the most promising solutions, however, is to make them work for parallel computing. Fortunately, some machine learning algorithms (e.g., population-based algorithms) are inherently suited for parallel computing, as has been demonstrated for several years, for instance by the parallel computing version of the genetic algorithm [ 122 ]. Different from the traditional GA shown in Fig. 9a, the population of the island-model genetic algorithm, one of the parallel GAs, is divided into several sub-populations, as shown in Fig. 9b. This means that the sub-populations can be assigned to different threads or computer nodes for parallel computing, through a simple modification of the GA; a minimal sketch is given after Fig. 9's caption below.

The comparison between basic idea of traditional GA (TGA) and parallel genetic algorithm (PGA)
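
The island model can be sketched as follows (plain Python maximizing a toy "one-max" fitness; the population sizes, mutation rate, and migration interval are all invented, and a real deployment would run each island on its own thread or node):

```python
import random

random.seed(0)
GENES, POP, ISLANDS, GENERATIONS = 20, 10, 4, 50

def fitness(ind):
    return sum(ind)          # toy "one-max" objective: count the 1-bits

def evolve(pop):
    """One generation: tournament selection, one-point crossover, mutation."""
    new = []
    for _ in range(len(pop)):
        p1 = max(random.sample(pop, 2), key=fitness)
        p2 = max(random.sample(pop, 2), key=fitness)
        cut = random.randrange(GENES)
        child = p1[:cut] + p2[cut:]
        if random.random() < 0.1:                 # bit-flip mutation
            i = random.randrange(GENES)
            child[i] ^= 1
        new.append(child)
    return new

# Each island evolves its own sub-population independently.
islands = [[[random.randint(0, 1) for _ in range(GENES)]
            for _ in range(POP)] for _ in range(ISLANDS)]

for gen in range(GENERATIONS):
    islands = [evolve(pop) for pop in islands]
    if gen % 10 == 0:                             # periodic migration
        for i in range(ISLANDS):
            migrant = max(islands[i], key=fitness)
            neighbour = islands[(i + 1) % ISLANDS]
            neighbour.pop(0)
            neighbour.append(list(migrant))

best = max((ind for pop in islands for ind in pop), key=fitness)
print("best fitness:", fitness(best), "out of", GENES)
```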

For this reason, in [ 123 ], Kiran and Babu explained that a framework for distributed data mining algorithms still needs to aggregate the information from different computer nodes. As shown in Fig. 10, the common design of distributed data mining algorithms is as follows: each mining algorithm is performed on a computer node (worker) which holds its own locally coherent data, not the whole dataset. To construct globally meaningful knowledge after each mining algorithm finds its local model, the local models from the computer nodes have to be aggregated and integrated into a final model that represents the complete knowledge. Kiran and Babu [ 123 ] also pointed out that communication will be the bottleneck when using this kind of distributed computing framework.

A simple example of distributed data mining framework [ 86 ]
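
A minimal sketch of this local-model aggregation (assuming scikit-learn; each "worker" runs k-means on its own partition, and the coordinator merges the local centroids into a global model by clustering them again, which keeps the communication down to the centroids alone):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
k = 3

# Each worker holds only its locally coherent partition of the data.
partitions = [rng.normal(loc, 0.5, (2000, 2)) for loc in (0.0, 4.0, 8.0)]

# Worker phase: each node builds a local model (its own centroids).
local_models = [KMeans(n_clusters=k, n_init=10, random_state=0)
                .fit(part).cluster_centers_ for part in partitions]

# Coordinator phase: aggregate the local centroids into the final model.
all_centroids = np.vstack(local_models)   # only centroids are communicated
global_model = KMeans(n_clusters=k, n_init=10,
                      random_state=0).fit(all_centroids)
print("global centroids:\n", global_model.cluster_centers_)
```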

Bu et al. [ 124 ] found several research issues when trying to apply machine learning algorithms to parallel computing platforms. For instance, the early versions of the map-reduce framework do not support “iteration” (i.e., recursion), but some recent works [ 87 , 125 ] have paid close attention to this problem and tried to fix it. Similar to the solutions for enhancing the performance of traditional data mining algorithms, one possible way to enhance the performance of a machine learning algorithm is to use CUDA, i.e., a GPU, to reduce the computing time of data analysis. Hasan et al. [ 126 ] used CUDA to implement the self-organizing map (SOM) and multiple back-propagation (MBP) for the classification problem. The simulation results show that using the GPU is faster than using the CPU: SOM running on a GPU is three times faster than SOM running on a CPU, and MBP running on a GPU is twenty-seven times faster than MBP running on a CPU. Another study [ 127 ] applied an ant-based algorithm to a grid computing platform. Since the proposed mining algorithm extends the ant clustering algorithm of Deneubourg et al. [ 128 ], Footnote 6 Ku-Mahamud modified the ant behavior of this clustering algorithm for big data clustering: each ant is randomly placed on the grid, which means the ant clustering algorithm can then be used in a parallel computing environment.

The trends of machine learning studies for big data analytics are twofold: one attempts to make machine learning algorithms run on parallel platforms, such as Radoop [ 129 ], Mahout [ 87 ], and PIMRU [ 124 ]; the other is to redesign the machine learning algorithms to make them suitable for parallel computing or for a parallel computing environment, such as neural network algorithms for GPUs [ 126 ] and the ant-based algorithm for grids [ 127 ]. In summary, both make it possible to apply machine learning algorithms to big data analytics, although many research issues remain to be solved, such as the communication cost between different computer nodes [ 86 ] and the large computation cost most machine learning algorithms require [ 126 ].

Output the result of big data analysis

The benchmarks PigMix [ 130 ], GridMix [ 131 ], TeraSort and GraySort [ 132 ], TPC-C, TPC-H, TPC-DS [ 133 ], and the Yahoo cloud serving benchmark (YCSB) [ 134 ] have been presented for evaluating the performance of cloud computing and big data analytics systems. Ghazal et al. [ 135 ] presented another benchmark (called BigBench) to be used as an end-to-end big data benchmark which covers the 3V characteristics of big data and uses the loading time, time for queries, time for procedural processing queries, and time for the remaining queries as metrics. With these benchmarks, the computation time is one of the intuitive metrics for evaluating the performance of different big data analytics platforms or algorithms. That is why Cheptsov [ 136 ] compared high-performance computing (HPC) and cloud systems using the measurement of computation time to understand their scalability for text file analysis. In addition to the computation time, the throughput (e.g., the number of operations per second) and the read/write latency of operations are other measurements of big data analytics [ 137 ]. In the study of [ 138 ], Zhao et al. argue that the maximum size of data and the maximum number of jobs are two important metrics for understanding the performance of a big data analytics platform. Another study [ 139 ] presented a systematic evaluation method which covers data throughput, concurrency during the map and reduce phases, response times, and the execution time of map and reduce. Moreover, most benchmarks for evaluating the performance of big data analytics typically provide only the response time or the computation cost; the fact is, however, that several factors need to be taken into account at the same time when building a big data analytics system. The hardware, bandwidth for data transmission, fault tolerance, cost, and power consumption of these systems are all issues [ 70 , 104 ] to be considered together when building a big data analytics system. Several solutions available today install the big data analytics on a cloud computing system or a cluster system; therefore, the measurements of fault tolerance, task execution, and cost of cloud computing systems can be used to evaluate the performance of the corresponding factors of big data analytics.

How to present the analysis results to the user is another important task in the output part of big data analytics, because if the user cannot easily understand the meaning of the results, the results will be entirely useless. Business intelligence and network monitoring are two common application areas in which the user interface plays the vital role of making the system workable. Zhang et al. [ 140 ] pointed out that the tasks of visual analytics for commercial systems can be divided into four categories: exploration, dashboards, reporting, and alerting. The study [ 141 ] showed that the interface for electroencephalography (EEG) interpretation is another noticeable research issue in big data analytics. The user interface for cloud systems [ 142 , 143 ] is a recent trend in big data analytics. It usually plays two vital roles in a big data analytics system: one is to simplify the explanation of the needed knowledge to the users, and the other is to make it easier for the users to steer the data analytics system with their own opinions. According to our observations, a flexible user interface is needed because although big data analytics can help us find hidden information, the information found is usually not yet knowledge. This situation is just like the example mentioned in “ Output the result ”: mining or statistical techniques can be employed to know the flu situation of each region, but data scientists sometimes need additional ways to display the information to find the knowledge they need or to prove their assumptions. Thus, the user interface should be adjustable by the user to display the knowledge that is urgently needed for big data analytics.

Summary of process of big data analytics

The discussion of big data analytics in this section was divided into input, analysis, and output, mapping onto the data analysis process of KDD. For the input (see also “ Big data input ”) and output (see also “ Output the result of big data analysis ”) of big data, several methods and solutions proposed before the big data age (see also “ Data input ”) can also be employed for big data analytics in most cases.

However, there still exist new input and output issues that data scientists need to confront. A representative example, mentioned in “ Big data input ”, is that the bottleneck will appear not only at the sensor or input devices but possibly elsewhere in the data analytics [ 71 ]. Although we can employ traditional compression and sampling technologies to deal with this problem, they can only mitigate it rather than solve it completely. Similar situations also exist in the output part. Although several measurements can be used to evaluate the performance of the frameworks, platforms, and even data mining algorithms, several new issues arise in the big data age, such as information fusion from different information sources or information accumulation over time.

Several studies have attempted to present efficient or effective solutions at the system (framework and platform) or algorithm level. A simple comparison of these big data analysis technologies from different perspectives is given in Table 3 to provide a brief overview of the current studies and trends in data analysis technologies for big data. The “Perspective” column of this table indicates whether the study focuses on the framework or algorithm level; the “Description” column gives the goal of the study; and the “Name” column gives the abbreviated name of the method or platform/framework. From the analysis-framework perspective, this table shows that big data framework , platform , and machine learning are the current research trends in big data analytics systems. From the mining-algorithm perspective, the clustering , classification , and frequent pattern mining issues play the vital roles in this research because several data analysis problems can be mapped onto these essential issues.

A promising trend that can easily be observed from these successful examples is the use of machine learning as the search algorithm (i.e., the mining algorithm) for the data mining problems of big data analytics systems. Machine learning-based methods can make the mining algorithms and relevant platforms smarter or reduce redundant computation costs. That parallel computing and cloud computing technologies have a strong impact on big data analytics can also be recognized from two facts: (1) most big data analytics frameworks and platforms use Hadoop and Hadoop-related technologies to design their solutions; and (2) most mining algorithms for big data analysis have been designed for parallel computing via software or hardware, or designed for map-reduce-based platforms.

Judging from recent studies of big data analytics, the field is still at an early stage of Nolan’s stages-of-growth model [ 146 ], a situation similar to that of the research topics of cloud computing, the internet of things, and smart grid, because several studies have simply applied traditional solutions to new problems/platforms/environments. For example, several studies [ 114 , 145 ] used k-means as an example to analyze big data, but few studies have applied state-of-the-art data mining algorithms and machine learning algorithms to the analysis of big data. This suggests that the performance of big data analytics can be improved by the data mining algorithms and metaheuristic algorithms presented in recent years [ 147 ]. The relevant technologies for compression and sampling, and even the platforms presented in recent years, may also be used to enhance the performance of big data analytics systems. As a result, although these research topics still have several open issues that need to be solved, such a situation, on the contrary, also illustrates that everything is possible in these studies.

The open issues

Today’s data analytics may be inefficient for big data because the environment, devices, systems, and even the problems are quite different from those of traditional mining problems; at the same time, several characteristics of big data also exist in traditional data analytics. In this section, several open issues caused by big data will be addressed from the platform/framework and data mining perspectives to explain the dilemmas we may confront. Here are some of the open issues:

Platform and framework perspective

Input and output ratio of platform

A large number of reports and research papers have asserted that we will enter the big data age in the near future. Some of them insinuate that the fruitful results of big data will lead us to a whole new world where “everything” is possible, and that big data analytics will therefore be an omniscient and omnipotent system. From the pragmatic perspective, big data analytics is indeed useful and has many possibilities that can help us understand the so-called “things” more accurately. However, the situation in most studies of big data analytics is that they argue that the results of big data are valuable while the business models of most big data analytics remain unclear. The fact is that assuming we have infinite computing resources for big data analytics is a thoroughly impracticable plan; the input and output ratio (e.g., return on investment) needs to be taken into account before an organization constructs a big data analytics center.

Communication between systems

Since most big data analytics systems will be designed for parallel computing, and they will typically work on other systems (e.g., a cloud platform) or with other systems (e.g., a search engine or knowledge base), the communication between the big data analytics and the other systems will strongly impact the performance of the whole KDD process. The first research issue is the communication cost incurred between the systems of data analytics; how to reduce this cost is the very first thing that data scientists need to care about. Another research issue is how the big data analytics communicates with other systems; the consistency of data between different systems, modules, and operators is an important open issue here. Because communication will occur more frequently between the systems of big data analytics, how to reduce its cost and how to make the communication between these systems as reliable as possible are the two important open issues for big data analytics.

Bottlenecks on data analytics system

Bottlenecks will appear in different places in data analytics for big data because the environments, systems, and input data have changed from those of traditional data analytics. The data deluge of big data will fill up the “input” system of the data analytics, and it will also increase the computation load of the data “analysis” system. This situation is just like a torrent of water (the data deluge) rushing down a mountain (the data analytics): how to split it and how to avoid it flowing into a narrow place (e.g., an operator unable to handle the input data) are the most important things for avoiding bottlenecks in a data analytics system. One current solution to avoiding bottlenecks in a data analytics system is to add more computation resources; the other is to split the analysis work across different computation nodes. A holistic consideration of the whole data analytics pipeline to avoid such bottlenecks is still needed for big data.

Security issues

Since much more environmental data and human behavior data will be gathered by big data analytics, how to protect them is also an open issue, because without a secure way to handle the collected data, big data analytics cannot be a reliable system. Despite the fact that security needs to be tightened for big data analytics before it can gather even more data from everywhere, until now there have not been many studies focusing on the security issues of big data analytics. According to our observation, these security issues can be divided into four parts: input, data analysis, output, and communication with other systems. The input part can be regarded as data gathering, which involves sensors, handheld devices, and even devices of the internet of things; one important security issue here is to make sure that the sensors are not compromised by attacks. The analysis and output parts can be regarded as the security problems of the analytics system itself. For communication with other systems, the security problem lies in the communication between the big data analytics and external systems. Because of these latent problems, security has become one of the open issues of big data analytics.

Data mining perspective

Data mining algorithms for map-reduce solutions

As mentioned in the previous sections, most traditional data mining algorithms are not designed for parallel computing; therefore, they are not particularly useful for big data mining. Several recent studies have attempted to modify traditional data mining algorithms to make them applicable to Hadoop-based platforms. As long as porting data mining algorithms to Hadoop is inevitable, making them work on a map-reduce architecture is the very first thing to do when applying traditional data mining methods to big data analytics. Unfortunately, not many studies have attempted to make data mining and soft computing algorithms work on Hadoop, because several different backgrounds are needed to develop and design such algorithms; for instance, a researcher and his or her research group need backgrounds in both data mining and Hadoop. Another open issue is that most data mining algorithms are designed for centralized computing, i.e., they can only work on all the data at the same time; thus, how to make them work on a parallel computing system is also difficult. The good news is that some studies [ 145 ] have successfully applied traditional data mining algorithms to the map-reduce architecture, which implies that it is possible to do so. According to our observation, although traditional mining and soft computing algorithms can help us analyze the data in big data analytics, until now not many studies have focused on this. As a consequence, it is an important open issue in big data analytics.

Noise, outliers, incomplete and inconsistent data

Although big data analytics is a new era for data analysis, many of its solutions adopt classical methods, so the open issues of traditional data mining algorithms also exist in these new systems. Noise, outliers, and incomplete and inconsistent data, long-standing problems for traditional data mining algorithms, will also appear in big data mining, and more easily so, because the data are captured by or generated from many different sensors and systems. The impact of noise, outliers, and incomplete and inconsistent data is therefore amplified in big data analytics, and how to mitigate that impact is an open issue.
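As a minimal illustration of such mitigation (an illustrative recipe, not a method from the surveyed work), the sketch below imputes missing readings with the median and drops outliers using the median absolute deviation (MAD), which, unlike the mean and standard deviation, is not itself distorted by the outliers:

```python
import statistics

def clean(readings, k=5.0):
    # Impute missing values (None) with the median of the observed ones.
    observed = [r for r in readings if r is not None]
    center = statistics.median(observed)
    filled = [center if r is None else r for r in readings]
    # Drop points farther than k * MAD from the median; note that a MAD of
    # zero (more than half the values identical) would drop everything else.
    mad = statistics.median(abs(r - center) for r in filled)
    return [r for r in filled if abs(r - center) <= k * mad]

sensor = [20.1, None, 19.8, 20.4, 250.0, 20.0]  # 250.0 is a faulty reading
print(clean(sensor))  # the None is imputed, the 250.0 is dropped
```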

Bottlenecks on data mining algorithm

Most data mining algorithms in big data analytics will be designed for parallel computing. However, once a data mining algorithm is designed or modified for parallel computing, the information exchange between its different mining procedures can itself become the bottleneck. One such problem is synchronization: different mining procedures finish their jobs at different times, even when they run the same algorithm on the same amount of data, so some procedures must wait until the others have finished. This can happen because the load on different compute nodes varies during the mining process, or because the convergence speed of the same algorithm differs across data partitions. These bottlenecks are an open issue for big data analytics, and they must be taken into account when developing and designing new data mining algorithms for it.
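The cost of this synchronization is easy to quantify: with a barrier at the end of each round, an iteration is only as fast as its slowest node. A back-of-the-envelope sketch with made-up runtimes:

```python
# Per-iteration runtimes (seconds) of four nodes mining equal shares of data.
node_times = [10.2, 9.8, 10.5, 18.7]  # the last node is a straggler

barrier_time = max(node_times)            # every node waits for the slowest
balanced_time = sum(node_times) / len(node_times)

print(f"iteration takes {barrier_time:.1f}s, "
      f"versus {balanced_time:.1f}s under a perfectly balanced load")
# -> iteration takes 18.7s, versus 12.3s under a perfectly balanced load
```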

Privacy issues

Privacy concerns typically make people uncomfortable, especially when a system cannot guarantee that their personal information will not be accessed by other people and organizations. Unlike security, the privacy issue concerns whether the system can restore or infer personal information from the results of big data analytics, even when the input data are anonymous. Privacy has become a very important issue because, with data mining and other analysis technologies widely used in big data analytics, private information may be exposed after the analysis. For example, even if all gathered records of shopping behavior are anonymous (e.g., a record of someone buying a pistol), a data mining algorithm may still infer who bought the pistol, because auxiliary attributes can be collected easily by different devices and systems (e.g., the location of the shop and the age of the buyer). More precisely, the analytics can narrow the scope of the search: the shop location and the buyer's age together identify a small set of possible persons. For this reason, any sensitive information needs to be carefully protected and used. Anonymization, temporary identifiers, and encryption are representative privacy technologies for data analytics, but the critical questions remain how, what, and why collected data are used in big data analytics.
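The pistol example can be spelled out in a few lines of code (all records below are hypothetical): even though the purchase log carries no names, joining it with auxiliary knowledge on just two quasi-identifiers shrinks the candidate set to a single person.

```python
# Anonymous purchase log: no names, only quasi-identifiers.
purchases = [
    {"item": "pistol", "shop_zip": "10001", "buyer_age": 34},
    {"item": "pistol", "shop_zip": "94105", "buyer_age": 34},
    {"item": "book",   "shop_zip": "10001", "buyer_age": 34},
]

# Auxiliary knowledge gathered from other devices and systems.
residents = [
    {"name": "Alice", "zip": "10001", "age": 34},
    {"name": "Bob",   "zip": "10001", "age": 52},
    {"name": "Carol", "zip": "94105", "age": 34},
]

target = next(p for p in purchases
              if p["item"] == "pistol" and p["shop_zip"] == "10001")
candidates = [r["name"] for r in residents
              if r["zip"] == target["shop_zip"]
              and r["age"] == target["buyer_age"]]
print(candidates)  # -> ['Alice']: the "anonymous" buyer is re-identified
```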

Conclusions

In this paper, we reviewed studies on data analytics, from traditional data analysis to recent big data analysis. From the system perspective, the KDD process is used as the framework for these studies and is summarized into three parts: input, analysis, and output. From the perspective of big data analytics frameworks and platforms, the discussion focuses on performance-oriented and result-oriented issues. From the data mining perspective, this paper gives a brief introduction to data mining and big data mining algorithms, covering clustering, classification, and frequent pattern mining technologies. To better understand the changes brought about by big data, this paper focuses on the data analysis part of KDD, from the platform/framework level down to data mining. The open issues on computation, quality of end results, security, and privacy are then discussed to explain which open issues we may face. Last but not least, to help readers find solutions for the new age of big data, possible high-impact research trends are given below:

For computation time, there is no doubt that parallel computing is one of the important future trends for making data analytics work on big data; consequently, the technologies of cloud computing, Hadoop, and map-reduce will play important roles in big data analytics. Scheduling methods, which manage the computation resources of cloud-based platforms so that data analysis tasks finish as fast as possible, are another future trend.

Efficient methods for reducing the computation time on the input side, such as data comparison, sampling, and a variety of other reduction methods, will play an important role in big data analytics. Because these methods typically were not designed with parallel computing environments in mind, making them work in such environments is a future research trend. Data mining algorithms face the same situation mentioned in the previous section: making them work in parallel computing environments is a very important research trend, because abundant research results already exist for traditional data mining algorithms.
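Reservoir sampling is one classic example of such an input-reduction method: it keeps a uniform random sample of fixed size from a stream of unknown length in a single pass. The sketch below is the standard textbook version (Algorithm R), not code from the surveyed studies:

```python
import random

def reservoir_sample(stream, k):
    # Keep a uniform random sample of k items from a stream of unknown length.
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = random.randint(0, i)  # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

print(reservoir_sample(range(1_000_000), k=10))
```

Note that the loop is inherently sequential (one pass, one counter), which illustrates why adapting such reduction methods to parallel environments is a research trend in its own right.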

How to model the mining problem so that something useful can be found in big data, and how to display the knowledge obtained from big data analytics, will be another two vital future trends, because the results of these two lines of research will decide whether data analytics can practically work in real-world settings rather than remaining theoretical.

Methods for extracting information from external and related knowledge resources to further reinforce big data analytics are, until now, not very common in big data analytics. However, combining information from different resources to add value to the output knowledge is a common solution in information retrieval, for example in clustering search engines or document summarization. For this reason, information fusion will also be a future trend for improving the end results of big data analytics.

Because metaheuristic algorithms are capable of finding an approximate solution within a reasonable time, they have been widely used for data mining problems in recent years. Until now, many state-of-the-art metaheuristic algorithms have not been applied to big data analytics. In addition, compared to some early data mining algorithms, the performance of metaheuristics is superior in terms of computation time and the quality of the end result. From these observations, the application of metaheuristic algorithms to big data analytics will also be an important research topic.
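As a toy illustration of the metaheuristic idea (hypothetical code, not an algorithm from the survey), the sketch below uses random-restart hill climbing to place k cluster centers, trading a guaranteed optimum for an approximate solution found within bounded time:

```python
import random

data = [1.0, 1.2, 0.8, 8.0, 8.4, 7.9]

def sse(centers):
    # Sum of squared errors: each point is charged to its nearest center.
    return sum(min((x - c) ** 2 for c in centers) for x in data)

def hill_climb(k=2, restarts=10, steps=200, step_size=0.5):
    best, best_cost = None, float("inf")
    for _ in range(restarts):
        centers = [random.uniform(min(data), max(data)) for _ in range(k)]
        for _ in range(steps):
            candidate = [c + random.gauss(0, step_size) for c in centers]
            if sse(candidate) < sse(centers):  # accept only improving moves
                centers = candidate
        if sse(centers) < best_cost:
            best, best_cost = centers, sse(centers)
    return sorted(best), best_cost

print(hill_climb())  # approximate centers near 1.0 and 8.1
```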

Because social networks are part of most people's daily lives, and because their data are also a kind of big data, how to analyze social network data has become a promising research issue. Such analysis can be used to predict the behavior of a user, after which applicable strategies for that user can be devised. For instance, a business intelligence system can use the analysis results to encourage particular customers to buy the goods they are interested in.

The security and privacy issues that accompany data analysis are intuitive research topics, covering how to store data safely, how to protect data communication, and how to prevent others from inferring personal information about us. Many problems of data security and privacy are essentially the same as in traditional data analysis, even as we enter the big data age. Thus, how to protect the data will also remain a topic in big data analytics research.

In this paper, by data analytics we mean the whole KDD process, while by data analysis we mean the part of data analytics aimed at finding hidden information in the data, such as data mining.

In this paper, by unlabeled input data we mean data for which it is unknown to which group they belong. If all the input data are unlabeled, the distribution of the input data is unknown.

In this paper, the analysis framework refers to the whole system, from raw data gathering, through data reformatting and data analysis, all the way to knowledge representation.

For a system that has only one master, the whole system may go down when the master machine crashes.

The learner typically represents the classification function, which creates the classifier that helps us classify unknown input data.

The basic idea of [128] is that each ant picks up and drops data items according to the similarity of its local neighbors.

Abbreviations

PCA: principal components analysis
3Vs: volume, velocity, and variety
IDC: International Data Corporation
KDD: knowledge discovery in databases
SVM: support vector machine
SSE: sum of squared errors
GLADE: generalized linear aggregates distributed engine
BDAF: big data architecture framework
CBDMASP: cloud-based big data mining and analyzing services platform
SODSS: service-oriented decision support system
HPCC: high performance computing cluster system
BI&A: business intelligence and analytics
DBMS: database management system
MSF: multiple species flocking
GA: genetic algorithm
SOM: self-organizing map
MBP: multiple back-propagation
YCSB: Yahoo Cloud Serving Benchmark
HPC: high performance computing
EEG: electroencephalography

Lyman P, Varian H. How much information 2003? Tech. Rep, 2004. [Online]. Available: http://www2.sims.berkeley.edu/research/projects/how-much-info-2003/printable_report.pdf .

Xu R, Wunsch D. Clustering. Hoboken: Wiley-IEEE Press; 2009.

Ding C, He X. K-means clustering via principal component analysis. In: Proceedings of the Twenty-first International Conference on Machine Learning, 2004, pp 1–9.

Kollios G, Gunopulos D, Koudas N, Berchtold S. Efficient biased sampling for approximate clustering and outlier detection in large data sets. IEEE Trans Knowl Data Eng. 2003;15(5):1170–87.

Fisher D, DeLine R, Czerwinski M, Drucker S. Interactions with big data analytics. Interactions. 2012;19(3):50–9.

Laney D. 3D data management: controlling data volume, velocity, and variety, META Group, Tech. Rep. 2001. [Online]. Available: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf .

van Rijmenam M. Why the 3v’s are not sufficient to describe big data, BigData Startups, Tech. Rep. 2013. [Online]. Available: http://www.bigdata-startups.com/3vs-sufficient-describe-big-data/ .

Borne K. Top 10 big data challenges a serious look at 10 big data v’s, Tech. Rep. 2014. [Online]. Available: https://www.mapr.com/blog/top-10-big-data-challenges-look-10-big-data-v .

Press G. $16.1 billion big data market: 2014 predictions from IDC and IIA, Forbes, Tech. Rep. 2013. [Online]. Available: http://www.forbes.com/sites/gilpress/2013/12/12/16-1-billion-big-data-market-2014-predictions-from-idc-and-iia/ .

Big data and analytics—an IDC four pillar research area, IDC, Tech. Rep. 2013. [Online]. Available: http://www.idc.com/prodserv/FourPillars/bigData/index.jsp .

Taft DK. Big data market to reach $46.34 billion by 2018, EWEEK, Tech. Rep. 2013. [Online]. Available: http://www.eweek.com/database/big-data-market-to-reach-46.34-billion-by-2018.html .

Research A. Big data spending to reach $114 billion in 2018; look for machine learning to drive analytics, ABI Research, Tech. Rep. 2013. [Online]. Available: https://www.abiresearch.com/press/big-data-spending-to-reach-114-billion-in-2018-loo .

Furrier J. Big data market $50 billion by 2017—HP vertica comes out #1—according to wikibon research, SiliconANGLE, Tech. Rep. 2012. [Online]. Available: http://siliconangle.com/blog/2012/02/15/big-data-market-15-billion-by-2017-hp-vertica-comes-out-1-according-to-wikibon-research/ .

Kelly J, Vellante D, Floyer D. Big data market size and vendor revenues, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Market_Size_and_Vendor_Revenues .

Kelly J, Floyer D, Vellante D, Miniman S. Big data vendor revenue and market forecast 2012-2017, Wikibon, Tech. Rep. 2014. [Online]. Available: http://wikibon.org/wiki/v/Big_Data_Vendor_Revenue_and_Market_Forecast_2012-2017 .

Mayer-Schonberger V, Cukier K. Big data: a revolution that will transform how we live, work, and think. Boston: Houghton Mifflin Harcourt; 2013.

Chen H, Chiang RHL, Storey VC. Business intelligence and analytics: from big data to big impact. MIS Quart. 2012;36(4):1165–88.

Kitchin R. The real-time city? big data and smart urbanism. Geo J. 2014;79(1):1–14.

Fayyad UM, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Mag. 1996;17(3):37–54.

Han J. Data mining: concepts and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. Proc ACM SIGMOD Int Conf Manag Data. 1993;22(2):207–16.

Witten IH, Frank E. Data mining: practical machine learning tools and techniques. San Francisco: Morgan Kaufmann Publishers Inc.; 2005.

Abbass H, Newton C, Sarker R. Data mining: a heuristic approach. Hershey: IGI Global; 2002.

Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P. Distributed data mining on grids: services, tools, and applications. IEEE Trans Syst Man Cyber Part B Cyber. 2004;34(6):2451–65.

Krishna K, Murty MN. Genetic \(k\) -means algorithm. IEEE Trans Syst Man Cyber Part B Cyber. 1999;29(3):433–9.

Tsai C-W, Lai C-F, Chiang M-C, Yang L. Data mining for internet of things: a survey. IEEE Commun Surveys Tutor. 2014;16(1):77–97.

Jain AK, Murty MN, Flynn PJ. Data clustering: a review. ACM Comp Surveys. 1999;31(3):264–323.

McQueen JB. Some methods of classification and analysis of multivariate observations. In: Proceedings of the Berkeley Symposium on Mathematical Statistics and Probability, 1967. pp 281–297.

Safavian S, Landgrebe D. A survey of decision tree classifier methodology. IEEE Trans Syst Man Cyber. 1991;21(3):660–74.

McCallum A, Nigam K. A comparison of event models for naive bayes text classification. In: Proceedings of the National Conference on Artificial Intelligence, 1998. pp. 41–48.

Boser BE, Guyon IM, Vapnik VN. A training algorithm for optimal margin classifiers. In: Proceedings of the annual workshop on Computational learning theory, 1992. pp. 144–152.

Han J, Pei J, Yin Y. Mining frequent patterns without candidate generation. In : Proceedings of the ACM SIGMOD International Conference on Management of Data, 2000. pp. 1–12.

Kaya M, Alhajj R. Genetic algorithm based framework for mining fuzzy association rules. Fuzzy Sets Syst. 2005;152(3):587–601.

Srikant R, Agrawal R. Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the International Conference on Extending Database Technology: Advances in Database Technology, 1996. pp 3–17.

Zaki MJ. Spade: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Baeza-Yates RA, Ribeiro-Neto B. Modern Information Retrieval. Boston: Addison-Wesley Longman Publishing Co., Inc; 1999.

Liu B. Web data mining: exploring hyperlinks, contents, and usage data. Berlin, Heidelberg: Springer-Verlag; 2007.

d’Aquin M, Jay N. Interpreting data mining results with linked data for learning analytics: motivation, case study and directions. In: Proceedings of the International Conference on Learning Analytics and Knowledge, pp 155–164.

Shneiderman B. The eyes have it: a task by data type taxonomy for information visualizations. In: Proceedings of the IEEE Symposium on Visual Languages, 1996, pp 336–343.

Mani I, Bloedorn E. Multi-document summarization by graph search and matching. In: Proceedings of the National Conference on Artificial Intelligence and Ninth Conference on Innovative Applications of Artificial Intelligence, 1997, pp 622–628.

Kopanakis I, Pelekis N, Karanikas H, Mavroudkis T. Visual techniques for the interpretation of data mining outcomes. In: Proceedings of the Panhellenic Conference on Advances in Informatics, 2005. pp 25–35.

Elkan C. Using the triangle inequality to accelerate k-means. In: Proceedings of the International Conference on Machine Learning, 2003, pp 147–153.

Catanzaro B, Sundaram N, Keutzer K. Fast support vector machine training and classification on graphics processors. In: Proceedings of the International Conference on Machine Learning, 2008. pp 104–111.

Zhang T, Ramakrishnan R, Livny M. BIRCH: an efficient data clustering method for very large databases. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 1996. pp 103–114.

Ester M, Kriegel HP, Sander J, Xu X. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, 1996. pp 226–231.

Ester M, Kriegel HP, Sander J, Wimmer M, Xu X. Incremental clustering for mining in a data warehousing environment. In: Proceedings of the International Conference on Very Large Data Bases, 1998. pp 323–333.

Ordonez C, Omiecinski E. Efficient disk-based k-means clustering for relational databases. IEEE Trans Knowl Data Eng. 2004;16(8):909–21.

Kogan J. Introduction to clustering large and high-dimensional data. Cambridge: Cambridge Univ Press; 2007.

Mitra S, Pal S, Mitra P. Data mining in soft computing framework: a survey. IEEE Trans Neural Netw. 2002;13(1):3–14.

Mehta M, Agrawal R, Rissanen J. SLIQ: a fast scalable classifier for data mining. In: Proceedings of the 5th International Conference on Extending Database Technology: Advances in Database Technology. 1996. pp 18–32.

Micó L, Oncina J, Carrasco RC. A fast branch and bound nearest neighbour classifier in metric spaces. Pattern Recogn Lett. 1996;17(7):731–9.

Djouadi A, Bouktache E. A fast algorithm for the nearest-neighbor classifier. IEEE Trans Pattern Anal Mach Intel. 1997;19(3):277–82.

Ververidis D, Kotropoulos C. Fast and accurate sequential floating forward feature selection with the bayes classifier applied to speech emotion recognition. Signal Process. 2008;88(12):2956–70.

Pei J, Han J, Mao R. CLOSET: an efficient algorithm for mining frequent closed itemsets. In: Proceedings of the ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 2000. pp 21–30.

Zaki MJ, Hsiao C-J. Efficient algorithms for mining closed itemsets and their lattice structure. IEEE Trans Knowl Data Eng. 2005;17(4):462–78.

Burdick D, Calimlim M, Gehrke J. MAFIA: a maximal frequent itemset algorithm for transactional databases. In: Proceedings of the International Conference on Data Engineering, 2001. pp 443–452.

Chen B, Haas P, Scheuermann P. A new two-phase sampling based algorithm for discovering association rules. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 462–468.

Zaki MJ. SPADE: an efficient algorithm for mining frequent sequences. Mach Learn. 2001;42(1–2):31–60.

Yan X, Han J, Afshar R. CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the SIAM International Conference on Data Mining, 2003. pp 166–177.

Pei J, Han J, Asl MB, Pinto H, Chen Q, Dayal U, Hsu MC. PrefixSpan mining sequential patterns efficiently by prefix projected pattern growth. In: Proceedings of the International Conference on Data Engineering, 2001. pp 215–226.

Ayres J, Flannick J, Gehrke J, Yiu T. Sequential PAttern Mining using a bitmap representation. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002. pp 429–435.

Masseglia F, Poncelet P, Teisseire M. Incremental mining of sequential patterns in large databases. Data Knowl Eng. 2003;46(1):97–121.

Xu R, Wunsch-II DC. Survey of clustering algorithms. IEEE Trans Neural Netw. 2005;16(3):645–78.

Chiang M-C, Tsai C-W, Yang C-S. A time-efficient pattern reduction algorithm for k-means clustering. Inform Sci. 2011;181(4):716–31.

Bradley PS, Fayyad UM. Refining initial points for k-means clustering. In: Proceedings of the International Conference on Machine Learning, 1998. pp 91–99.

Laskov P, Gehl C, Krüger S, Müller K-R. Incremental support vector learning: analysis, implementation and applications. J Mach Learn Res. 2006;7:1909–36.

Russom P. Big data analytics. TDWI, Tech. Rep.; 2011.

Ma C, Zhang HH, Wang X. Machine learning for big data analytics in plants. Trends Plant Sci. 2014;19(12):798–808.

Boyd D, Crawford K. Critical questions for big data. Inform Commun Soc. 2012;15(5):662–79.

Katal A, Wazid M, Goudar R. Big data: issues, challenges, tools and good practices. In: Proceedings of the International Conference on Contemporary Computing, 2013. pp 404–409.

Baraniuk RG. More is less: signal processing and the data deluge. Science. 2011;331(6018):717–9.

Lee J, Hong S, Lee JH. An efficient prediction for heavy rain from big weather data using genetic algorithm. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2014. pp 25:1–25:7.

Famili A, Shen W-M, Weber R, Simoudis E. Data preprocessing and intelligent data analysis. Intel Data Anal. 1997;1(1–4):3–23.

Zhang H. A novel data preprocessing solution for large scale digital forensics investigation on big data, Master’s thesis, Norway, 2013.

Ham YJ, Lee H-W. International journal of advances in soft computing and its applications. Calc Paralleles Reseaux et Syst Repar. 2014;6(1):1–18.

Cormode G, Duffield N. Sampling for big data: a tutorial. In: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014. pp 1975–1975.

Satyanarayana A. Intelligent sampling for big data using bootstrap sampling and chebyshev inequality. In: Proceedings of the IEEE Canadian Conference on Electrical and Computer Engineering, 2014. pp 1–6.

Jun SW, Fleming K, Adler M, Emer JS. Zip-io: architecture for application-specific compression of big data. In: Proceedings of the International Conference on Field-Programmable Technology, 2012, pp 343–351.

Zou H, Yu Y, Tang W, Chen HM. Improving I/O performance with adaptive data compression for big data applications. In: Proceedings of the International Parallel and Distributed Processing Symposium Workshops, 2014. pp 1228–1237.

Yang C, Zhang X, Zhong C, Liu C, Pei J, Ramamohanarao K, Chen J. A spatiotemporal compression based approach for efficient big data processing on cloud. J Comp Syst Sci. 2014;80(8):1563–83.

Xue Z, Shen G, Li J, Xu Q, Zhang Y, Shao J. Compression-aware I/O performance analysis for big data clustering. In: Proceedings of the International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 2012. pp 45–52.

Pospiech M, Felden C. Big data—a state-of-the-art. In: Proceedings of the Americas Conference on Information Systems, 2012, pp 1–23. [Online]. Available: http://aisel.aisnet.org/amcis2012/proceedings/DecisionSupport/22 .

Apache Hadoop, February 2, 2015. [Online]. Available: http://hadoop.apache.org .

Cuda, February 2, 2015. [Online]. Available: http://www.nvidia.com/object/cuda_home_new.html .

Apache Storm, February 2, 2015. [Online]. Available: http://storm.apache.org/ .

Curtin RR, Cline JR, Slagle NP, March WB, Ram P, Mehta NA, Gray AG. MLPACK: a scalable C++ machine learning library. J Mach Learn Res. 2013;14:801–5.

Apache Mahout, February 2, 2015. [Online]. Available: http://mahout.apache.org/ .

Huai Y, Lee R, Zhang S, Xia CH, Zhang X. DOT: a matrix model for analyzing, optimizing and deploying software for big data analytics in distributed systems. In: Proceedings of the ACM Symposium on Cloud Computing, 2011. pp 4:1–4:14.

Rusu F, Dobra A. GLADE: a scalable framework for efficient analytics. In: Proceedings of LADIS Workshop held in conjunction with VLDB, 2012. pp 1–6.

Cheng Y, Qin C, Rusu F. GLADE: big data analytics made easy. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2012. pp 697–700.

Essa YM, Attiya G, El-Sayed A. Mobile agent based new framework for improving big data analysis. In: Proceedings of the International Conference on Cloud Computing and Big Data. 2013, pp 381–386.

Wonner J, Grosjean J, Capobianco A, Bechmann D. Starfish: a selection technique for dense virtual environments. In: Proceedings of the ACM Symposium on Virtual Reality Software and Technology, 2012. pp 101–104.

Demchenko Y, de Laat C, Membrey P. Defining architecture components of the big data ecosystem. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2014. pp 104–112.

Ye F, Wang ZJ, Zhou FC, Wang YP, Zhou YC. Cloud-based big data mining and analyzing services platform integrating r. In: Proceedings of the International Conference on Advanced Cloud and Big Data, 2013. pp 147–151.

Wu X, Zhu X, Wu G-Q, Ding W. Data mining with big data. IEEE Trans Knowl Data Eng. 2014;26(1):97–107.

Laurila JK, Gatica-Perez D, Aad I, Blom J, Bornet O, Do T, Dousse O, Eberle J, Miettinen M. The mobile data challenge: big data for mobile computing research. In: Proceedings of the Mobile Data Challenge by Nokia Workshop, 2012. pp 1–8.

Demirkan H, Delen D. Leveraging the capabilities of service-oriented decision support systems: putting analytics and big data in cloud. Decision Support Syst. 2013;55(1):412–21.

Talia D. Clouds for scalable big data analytics. Computer. 2013;46(5):98–101.

Lu R, Zhu H, Liu X, Liu JK, Shao J. Toward efficient and privacy-preserving computing in big data era. IEEE Netw. 2014;28(4):46–50.

Cuzzocrea A, Song IY, Davis KC. Analytics over large-scale multidimensional data: The big data revolution!. In: Proceedings of the ACM International Workshop on Data Warehousing and OLAP, 2011. pp 101–104.

Zhang J, Huang ML. 5Ws model for big data analysis and visualization. In: Proceedings of the International Conference on Computational Science and Engineering, 2013. pp 1021–1028.

Chandarana P, Vijayalakshmi M. Big data analytics frameworks. In: Proceedings of the International Conference on Circuits, Systems, Communication and Information Technology Applications, 2014. pp 430–434.

Apache Drill, February 2, 2015. [Online]. Available: http://drill.apache.org/ .

Hu H, Wen Y, Chua T-S, Li X. Toward scalable systems for big data analytics: a technology tutorial. IEEE Access. 2014;2:652–87.

Sagiroglu S, Sinanc D. Big data: a review. In: Proceedings of the International Conference on Collaboration Technologies and Systems, 2013. pp 42–47.

Fan W, Bifet A. Mining big data: current status, and forecast to the future. ACM SIGKDD Explor Newslett. 2013;14(2):1–5.

Diebold FX. On the origin(s) and development of the term “big data”, Penn Institute for Economic Research, Department of Economics, University of Pennsylvania, Tech. Rep. 2012. [Online]. Available: http://economics.sas.upenn.edu/sites/economics.sas.upenn.edu/files/12-037.pdf .

Weiss SM, Indurkhya N. Predictive data mining: a practical guide. San Francisco: Morgan Kaufmann Publishers Inc.; 1998.

Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya A, Foufou S, Bouras A. A survey of clustering algorithms for big data: taxonomy and empirical analysis. IEEE Trans Emerg Topics Comp. 2014;2(3):267–79.

Shirkhorshidi AS, Aghabozorgi SR, Teh YW, Herawan T. Big data clustering: a review. In: Proceedings of the International Conference on Computational Science and Its Applications, 2014. pp 707–720.

Xu H, Li Z, Guo S, Chen K. Cloudvista: interactive and economical visual cluster analysis for big data in the cloud. Proc VLDB Endowment. 2012;5(12):1886–9.

Cui X, Gao J, Potok TE. A flocking based algorithm for document clustering analysis. J Syst Archit. 2006;52(89):505–15.

Cui X, Charles JS, Potok T. GPU enhanced parallel computing for large scale data clustering. Future Gener Comp Syst. 2013;29(7):1736–41.

Feldman D, Schmidt M, Sohler C. Turning big data into tiny data: Constant-size coresets for k-means, pca and projective clustering. In: Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2013. pp 1434–1453.

Tekin C, van der Schaar M. Distributed online big data classification using context information. In: Proceedings of the Allerton Conference on Communication, Control, and Computing, 2013. pp 1435–1442.

Rebentrost P, Mohseni M, Lloyd S. Quantum support vector machine for big feature and big data classification. CoRR , vol. abs/1307.0471, 2014. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1307.html#RebentrostML13 .

Lin MY, Lee PY, Hsueh SC. Apriori-based frequent itemset mining algorithms on mapreduce. In: Proceedings of the International Conference on Ubiquitous Information Management and Communication, 2012. pp 76:1–76:8.

Riondato M, DeBrabant JA, Fonseca R, Upfal E. PARMA: a parallel randomized algorithm for approximate association rules mining in mapreduce. In: Proceedings of the ACM International Conference on Information and Knowledge Management, 2012. pp 85–94.

Leung CS, MacKinnon R, Jiang F. Reducing the search space for big data mining for interesting patterns from uncertain data. In: Proceedings of the International Congress on Big Data, 2014. pp 315–322.

Yang L, Shi Z, Xu L, Liang F, Kirsh I. DH-TRIE frequent pattern mining on hadoop using JPA. In: Proceedings of the International Conference on Granular Computing, 2011. pp 875–878.

Huang JW, Lin SC, Chen MS. DPSP: Distributed progressive sequential pattern mining on the cloud. In: Proceedings of the Advances in Knowledge Discovery and Data Mining, vol. 6119, 2010, pp 27–34.

Paz CE. A survey of parallel genetic algorithms. Calc Paralleles Reseaux et Syst Repar. 1998;10(2):141–71.

Kranthi Kiran B, Babu AV. A comparative study of issues in big data clustering algorithm with constraint based genetic algorithm for associative clustering. Int J Innov Res Comp Commun Eng 2014;2(8):5423–32.

Bu Y, Borkar VR, Carey MJ, Rosen J, Polyzotis N, Condie T, Weimer M, Ramakrishnan R. Scaling datalog for machine learning on big data, CoRR , vol. abs/1203.0160, 2012. [Online]. Available: http://dblp.uni-trier.de/db/journals/corr/corr1203.html#abs-1203-0160 .

Malewicz G, Austern MH, Bik AJ, Dehnert JC, Horn I, Leiser N, Czajkowski G. Pregel: A system for large-scale graph processing. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2010. pp 135–146.

Hasan S, Shamsuddin S,  Lopes N. Soft computing methods for big data problems. In: Proceedings of the Symposium on GPU Computing and Applications, 2013. pp 235–247.

Ku-Mahamud KR. Big data clustering using grid computing and ant-based algorithm. In: Proceedings of the International Conference on Computing and Informatics, 2013. pp 6–14.

Deneubourg JL, Goss S, Franks N, Sendova-Franks A, Detrain C, Chrétien L. The dynamics of collective sorting robot-like ants and ant-like robots. In: Proceedings of the International Conference on Simulation of Adaptive Behavior on From Animals to Animats, 1990. pp 356–363.

Radoop [Online]. https://rapidminer.com/products/radoop/ . Accessed 2 Feb 2015.

PigMix [Online]. https://cwiki.apache.org/confluence/display/PIG/PigMix . Accessed 2 Feb 2015.

GridMix [Online]. http://hadoop.apache.org/docs/r1.2.1/gridmix.html . Accessed 2 Feb 2015.

TeraSoft [Online]. http://sortbenchmark.org/ . Accessed 2 Feb 2015.

TPC, transaction processing performance council [Online]. http://www.tpc.org/ . Accessed 2 Feb 2015.

Cooper BF, Silberstein A, Tam E, Ramakrishnan R, Sears R. Benchmarking cloud serving systems with ycsb. In: Proceedings of the ACM Symposium on Cloud Computing, 2010. pp 143–154.

Ghazal A, Rabl T, Hu M, Raab F, Poess M, Crolotte A, Jacobsen HA. BigBench: Towards an industry standard benchmark for big data analytics. In: Proceedings of the ACM SIGMOD International Conference on Management of Data, 2013. pp 1197–1208.

Cheptsov A. Hpc in big data age: An evaluation report for java-based data-intensive applications implemented with hadoop and openmpi. In: Proceedings of the European MPI Users’ Group Meeting, 2014. pp 175:175–175:180.

Yuan LY, Wu L, You JH, Chi Y. Rubato db: A highly scalable staged grid database system for oltp and big data applications. In: Proceedings of the ACM International Conference on Conference on Information and Knowledge Management, 2014. pp 1–10.

Zhao JM, Wang WS, Liu X, Chen YF. Big data benchmark - big DS. In: Proceedings of the Advancing Big Data Benchmarks, 2014, pp. 49–57.

Saletore V, Krishnan K, Viswanathan V, Tolentino M. HcBench: methodology, development, and full-system characterization of a customer usage representative big data/hadoop benchmark. In: Advancing Big Data Benchmarks, 2014. pp 73–93.

Zhang L, Stoffel A, Behrisch M,  Mittelstadt S, Schreck T, Pompl R, Weber S, Last H, Keim D. Visual analytics for the big data era—a comparative review of state-of-the-art commercial systems. In: Proceedings of the IEEE Conference on Visual Analytics Science and Technology, 2012. pp 173–182.

Harati A, Lopez S, Obeid I, Picone J, Jacobson M, Tobochnik S. The TUH EEG CORPUS: A big data resource for automated eeg interpretation. In: Proceeding of the IEEE Signal Processing in Medicine and Biology Symposium, 2014. pp 1–5.

Thusoo A, Sarma JS, Jain N, Shao Z, Chakka P, Anthony S, Liu H, Wyckoff P, Murthy R. Hive: a warehousing solution over a map-reduce framework. Proc VLDB Endowment. 2009;2(2):1626–9.

Beckmann M, Ebecken NFF, de Lima BSLP, Costa MA. A user interface for big data with rapidminer. RapidMiner World, Boston, MA, Tech. Rep., 2014. [Online]. Available: http://www.slideshare.net/RapidMiner/a-user-interface-for-big-data-with-rapidminer-marcelo-beckmann .

Januzaj E, Kriegel HP, Pfeifle M. DBDC: Density based distributed clustering. In: Proceedings of the Advances in Database Technology, 2004; vol. 2992, 2004, pp 88–105.

Zhao W, Ma H, He Q. Parallel k-means clustering based on mapreduce. Proceedings Cloud Comp. 2009;5931:674–9.

Nolan RL. Managing the crises in data processing. Harvard Bus Rev. 1979;57(1):115–26.

Tsai CW, Huang WC, Chiang MC. Recent development of metaheuristics for clustering. In: Proceedings of the Mobile, Ubiquitous, and Intelligent Computing, 2014; vol. 274, pp. 629–636.

Authors’ contributions

CWT contributed to the paper review and drafted the first version of the manuscript. CFL contributed to the paper collection and manuscript organization. HCC and AVV double checked the manuscript and provided several advanced ideas for this manuscript. All authors read and approved the final manuscript.

Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions on the paper. This work was supported in part by the Ministry of Science and Technology of Taiwan, R.O.C., under Contracts MOST103-2221-E-197-034, MOST104-2221-E-197-005, and MOST104-2221-E-197-014.

Compliance with ethical guidelines

Competing interests: The authors declare that they have no competing interests.

Author information

Authors and Affiliations

Department of Computer Science and Information Engineering, National Ilan University, Yilan, Taiwan

Chun-Wei Tsai & Han-Chieh Chao

Institute of Computer Science and Information Engineering, National Chung Cheng University, Chia-Yi, Taiwan

Chin-Feng Lai

Information Engineering College, Yangzhou University, Yangzhou, Jiangsu, China

Han-Chieh Chao

School of Information Science and Engineering, Fujian University of Technology, Fuzhou, Fujian, China

Department of Computer Science, Electrical and Space Engineering, Luleå University of Technology, SE-931 87, Skellefteå, Sweden

Athanasios V. Vasilakos

Corresponding author

Correspondence to Athanasios V. Vasilakos.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Tsai, CW., Lai, CF., Chao, HC. et al. Big data analytics: a survey. Journal of Big Data 2, 21 (2015). https://doi.org/10.1186/s40537-015-0030-3

Received: 14 May 2015

Accepted: 02 September 2015

Published: 01 October 2015

DOI: https://doi.org/10.1186/s40537-015-0030-3

Keywords: data analytics, data mining

Ethics review of big data research: What should stay and what should be reformed?

Agata Ferretti

1 Health Ethics and Policy Lab, Department of Health Sciences and Technology, ETH Zürich, Hottingerstrasse 10 (HOA), 8092 Zürich, Switzerland

Marcello Ienca

Mark Sheehan

2 The Ethox Centre, Department of Population Health, University of Oxford, Oxford, UK

Alessandro Blasimme

Edward S. Dove

3 School of Law, University of Edinburgh, Edinburgh, UK

Bobbie Farsides

4 Brighton and Sussex Medical School, Brighton, UK

Phoebe Friesen

5 Biomedical Ethics Unit, Department of Social Studies of Medicine, McGill University, Montreal, Canada

6 Johns Hopkins Berman Institute of Bioethics, Baltimore, USA

Walter Karlen

7 Mobile Health Systems Lab, Department of Health Sciences and Technology, ETH Zürich, Zürich, Switzerland

Peter Kleist

8 Cantonal Ethics Committee Zürich, Zürich, Switzerland

S. Matthew Liao

9 Center for Bioethics, Department of Philosophy, New York University, New York, USA

Camille Nebeker

10 Research Center for Optimal Digital Ethics in Health (ReCODE Health), Herbert Wertheim School of Public Health and Longevity Science, University of California, San Diego, USA

Gabrielle Samuel

11 Department of Global Health and Social Medicine, King’s College London, London, UK

Mahsa Shabani

12 Faculty of Law and Criminology, Ghent University, Ghent, Belgium

Minerva Rivas Velarde

13 Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland

Effy Vayena

Associated data

Not applicable.

Ethics review is the process of assessing the ethics of research involving humans. The Ethics Review Committee (ERC) is the key oversight mechanism designated to ensure ethics review. Whether or not this governance mechanism is still fit for purpose in the data-driven research context remains a debated issue among research ethics experts.

In this article, we seek to address this issue in a twofold manner. First, we review the strengths and weaknesses of ERCs in ensuring ethical oversight. Second, we map these strengths and weaknesses onto specific challenges raised by big data research. We distinguish two categories of potential weakness. The first category concerns persistent weaknesses, i.e., those which are not specific to big data research, but may be exacerbated by it. The second category concerns novel weaknesses, i.e., those which are created by and inherent to big data projects. Within this second category, we further distinguish between purview weaknesses related to the ERC’s scope (e.g., how big data projects may evade ERC review) and functional weaknesses, related to the ERC’s way of operating. Based on this analysis, we propose reforms aimed at improving the oversight capacity of ERCs in the era of big data science.

Conclusions

We believe the oversight mechanism could benefit from these reforms because they will help to overcome data-intensive research challenges and consequently benefit research at large.

The debate about the adequacy of the Ethics Review Committee (ERC) as the chief oversight body for big data studies is partly rooted in the historical evolution of the ERC. Particularly relevant is the ERC's changing response to new methods and technologies in scientific research. ERCs—also known as Institutional Review Boards (IRBs) or Research Ethics Committees (RECs)—came into existence in the 1950s and 1960s [1]. Their original mission was to protect the interests of human research participants, particularly through an assessment of potential harms to them (e.g., physical pain or psychological distress) and benefits that might accrue from the proposed research. ERCs expanded in scope during the 1970s, from participant protection towards ensuring valuable and ethical human subject research (e.g., having researchers implement an informed consent process), as well as supporting researchers in exploring their queries [2].

Fast forward fifty years, and a lot has changed. Today, biomedical projects leverage unconventional data sources (e.g., social media), partially inscrutable data analytics tools (e.g., machine learning), and unprecedented volumes of data [3–5]. Moreover, the evolution of research practices and new methodologies such as post-hoc data mining have blurred the concept of 'human subject' and elicited a shift towards the concept of 'data subject', as attested in data protection regulations [6, 7]. With data protection and privacy concerns in the spotlight of big data research review, language from data protection laws has worked its way into the vocabulary of research ethics. This terminological shift further reveals that big data, together with the modern analytic methods used to interpret the data, creates novel dynamics between researchers and participants [8]. Research data repositories about individuals and aggregates of individuals are expanding considerably in size. Researchers can remotely access and use large volumes of potentially sensitive data without communicating or actively engaging with study participants. Consequently, participants become more vulnerable and subjected to the research itself [9]. The nature of the risk involved in this new form of research changes too: it moves from the risk of physical or psychological harm towards the risk of informational harm, such as privacy breaches or algorithmic discrimination [10]. This is the case, for instance, with projects using data collected through web search engines, mobile and smart devices, entertainment websites, and social media platforms. The fact that health-related research is leaving hospital labs and spreading into online space creates novel opportunities for research, but also raises novel challenges for ERCs. For this reason, it is important to re-examine the fit between new data-driven forms of research and existing oversight mechanisms [11].

The suitability of ERCs in the context of big data research is not merely a theoretical puzzle but also a practical concern resulting from recent developments in data science. In 2014, for example, the so-called ‘emotional contagion study’ received severe criticism for avoiding ethical oversight by an ERC, failing to obtain research consent, violating privacy, inflicting emotional harm, discriminating against data subjects, and placing vulnerable participants (e.g., children and adolescents) at risk [ 12 , 13 ]. In both public and expert opinion [ 14 ], a responsible ERC would have rejected this study because it contravened the research ethics principles of preventing harm (in this case, emotional distress) and adequately informing data subjects. However, the protocol adopted by the researchers was not required to undergo ethics review under US law [ 15 ] for two reasons. First, the data analyzed were considered non-identifiable, and researchers did not engage directly with subjects, exempting the study from ethics review. Second, the study team included both scientists affiliated with a public university (Cornell) and Facebook employees. The affiliation of the researchers is relevant because—in the US and some other countries—privately funded studies are not subject to the same research protections and ethical regulations as publicly funded research [ 16 ]. An additional example is the 2015 case in which the United Kingdom (UK) National Health Service (NHS) shared 1.6 million pieces of identifiable and sensitive data with Google DeepMind. This data transfer from the public to the private party took place legally, without the need for patient consent or ethics review oversight [ 17 ]. These cases demonstrate how researchers can pursue potentially risky big data studies without falling under the ERC’s purview. The limitations of the regulatory framework for research oversight are evident, in both private and public contexts.

The gaps in the ERC's regulatory process, together with the increased sophistication of research contexts—which now include a variety of actors such as universities, corporations, funding agencies, public institutes, and citizens' associations—have led to an increase in the range of oversight bodies. For instance, besides traditional university ethics committees and national oversight committees, funding agencies and national research initiatives have increasingly created internal ethics review boards [18, 19]. New participatory models of governance have emerged, largely due to an increase in subjects' requests to control their own data [20]. Corporations are creating research ethics committees as well, modelled after the institutional ERC [21]. In May 2020, for example, Facebook welcomed the first members of its Oversight Board, whose aim is to review the company's decisions about content moderation [22]. Whether this increase in oversight models is motivated by the urge to fill the existing regulatory gaps, or whether it is just 'ethics washing', is still an open question. However, other types of specialized committees have already found their place alongside ERCs when research involves international collaboration and data sharing [23]. Among others, data safety monitoring boards, data access committees, and responsible research and innovation panels serve the purpose of covering research areas left largely unregulated by current oversight [24].

The data-driven digital transformation challenges the purview and efficacy of ERCs. It also raises fundamental questions concerning the role and scope of ERCs as the oversight body for ethical and methodological soundness in scientific research. Among these questions, this article will explore whether ERCs are still capable of their intended purpose, given the range of novel (maybe not categorically new, but at least different in practice) issues that have emerged in this type of research. To answer this question, we explore some of the challenges that the ERC oversight approach faces in the context of big data research and review the main strengths and weaknesses of this oversight mechanism. Based on this analysis, we will outline possible solutions to address current weaknesses and improve ethics review in the era of big data science.

Strengths of the ethics review via ERC

Historically, ERCs have enabled cross disciplinary exchange and assessment [ 27 ]. ERC members typically come from different backgrounds and bring their perspectives to the debate; when multi-disciplinarity is achieved, the mixture of expertise provides the conditions for a solid assessment of advantages and risks associated with new research. Committees which include members from a variety of backgrounds are also suited to promote projects from a range of fields, and research that cuts across disciplines [ 28 ]. Within these committees, the reviewers’ expertise can be paired with a specific type of content to be reviewed. This one-to-one match can bring timely and, ideally, useful feedback [ 29 ]. In many countries (e.g., European countries, the United States (US), Canada, Australia), ERCs are explicitly mandated by law to review many forms of research involving human participants; moreover, these laws also describe how such a body should be structured and the purview of its review [ 30 , 31 ]. In principle, ERCs also aim to be representative of society and the research enterprise, including members of the public and minorities, as well as researchers and experts [ 32 ]. And in performing a gatekeeping function to the research enterprise, ERCs play an important role: they recognize that both experts and lay people should have a say, with different views to contribute [ 33 ].

Furthermore, the ERC model strives to ensure independent assessment. The fact that ERCs assess projects "from the outside" and maintain a certain degree of objectivity towards what they are reviewing reduces the risk of overlooking research issues and decreases the risk of conflicts of interest. Moreover, being institutionally distinct—for example, being established by an organization separate from the researcher or the research sponsor—brings added value to the research itself, as this lessens the risk of conflict of interest. Conflict of interest is a serious issue in research ethics because it can compromise the judgment of reviewers. Institutionalized review committees might particularly suffer from political interference. This is the case, for example, for universities and health care systems (like the NHS), which tend to engage "in house" experts as ethics board members. However, ERCs that can prove themselves independent are considered more trustworthy by the general public and data subjects; it is reassuring to know that an independent committee is overseeing research projects [34].

The ex-ante (or pre-emptive) ethical evaluation of research studies is considered by many to be the standard procedural approach of ERCs [35]. Though the literature is divided on the usefulness and added value provided by this form of review [36, 37], ex-ante review is commonly used as a mechanism to ensure the ethical validity of a study design before the research is conducted [38, 39]. Early research scrutiny aims at risk mitigation: the ERC evaluates potential research risks and benefits in order to protect participants' physical and psychological well-being, dignity, and data privacy. This practice saves researchers' resources and valuable time by preventing the pursuit of unethical or illegal paths [40]. Finally, the ex-ante ethical assessment gives researchers an opportunity to receive feedback from ERCs, whose competence and experience may improve the research quality and increase public trust in the research [41].

All strengths mentioned in this section are strengths of the ERC model in principle. In practice, there are many ERCs that are not appropriately interdisciplinary or representative of the population and minorities, that lack independence from the research being reviewed, and that fail to improve research quality, and may in fact hinder it. We now turn to consider some of these weaknesses in more detail.

Weaknesses of the ethics review via ERC

In order to assess whether ERCs are adequately equipped to oversee big data research, we must consider the weaknesses of this model. We identify two categories of weaknesses which are described in the following section and summarized in Fig.  1 :

  • Persistent weaknesses : those existing in the current oversight system, which could be exacerbated by big data research
  • Novel weaknesses : those created by and inherent to big data projects

Within this second category of novel weaknesses, we further differentiate between:

  • Purview weaknesses : reasons why some big data projects may bypass the ERCs’ purview
  • Functional weaknesses : reasons why some ERCs may be inadequate to assess big data projects specifically

Fig. 1 Weaknesses of the ERCs

We base the conceptual distinction between persistent and novel weaknesses on the fact that big data research diverges from traditional biomedical research in many respects. As previously mentioned, big data projects are often broad in scope, involve new actors, use unprecedented methodologies to analyze data, and require specific expertise. Furthermore, the peculiarities of big data itself (e.g., being large in volume and from a variety of sources) make data-driven research different in practice from traditional research. However, we should not consider the category of “novel weaknesses” a closed category. We do not argue that weaknesses mentioned here do not, at least partially, overlap with others which already exist. In fact, in almost all cases of ‘novelty’, (i) there is some link back to a concept from traditional research ethics, and (ii) some thought has been given to the issue outside of a big data or biomedical context (e.g., the problem of ERCs’ expertise has arisen in other fields [ 42 ]). We believe that by creating conceptual clarity about novel oversight challenges presented by big data research, we can begin to identify tailored reforms.

Persistent weaknesses

As regulation for research oversight varies between countries, ERCs often suffer from a lack of harmonization. This weakness in the current oversight mechanism is compounded by big data research, which often relies on multi-center international consortia. These consortia in turn depend on approval by multiple oversight bodies demanding different types of scrutiny [ 43 ]. Furthermore, big data research may give rise to collaborations between public bodies, universities, corporations, foundations, and citizen science cooperatives. In this network, each stakeholder has different priorities and depends upon its own rules for regulation of the research process [ 44 – 46 ]. Indeed, this expansion of regulatory bodies and aims does not come with a coordinated effort towards agreed-upon review protocols [ 47 ]. The lack of harmonization is perpetuated by academic journals and funding bodies with diverging views on the ethics of big data. If the review bodies which constitute the “ethics ecosystem” [ 19 ] do not agree to the same ethics review requirements, a big data project deemed acceptable by an ERC in one country may be rejected by another ERC, within or beyond the national borders.

In addition, there is inconsistency in the assessment criteria used within and across committees. Researchers report subjective bias in the evaluation methodology of ERCs, as well as variations in ERC judgements which are not based on morally relevant contextual considerations [ 48 , 49 ]. Some authors have argued that the probability of research acceptance among experts increases if some research peer or same-field expert sits on the evaluation committee [ 50 , 51 ]. The judgement of an ERC can also be influenced by the boundaries of the scientific knowledge of its members. These boundaries can impact the ERC’s approach towards risk taking in unexplored fields of research [ 52 ]. Big data research might worsen this problem since the field is relatively new, with no standardized metric to assess risk within and across countries [ 53 ]. The committees do not necessarily communicate with each other to clarify their specific role in the review process, or try to streamline their approach to the assessment. This results in unclear oversight mandates and inconsistent ethical evaluations [ 27 , 54 ].

Additionally, ERCs may fall short in their efforts to justly redistribute the risks and benefits of research. The current review system is still primarily tilted toward protecting the interests of individual research participants. ERCs do not consistently assess societal benefit, or risks and benefits in light of the overall conduct of research (balancing risks for the individual with collective benefits). Although demands on ERCs vary from country to country [ 55 ], the ERC approach is still generally tailored towards traditional forms of biomedical research, such as clinical trials and longitudinal cohort studies with hospital patients. These studies are usually narrow in scope and carry specific risks only for the participants involved. In contrast, big data projects can impact society more broadly. As an example, computational technologies have shown potential to determine individuals’ sexual orientation by screening facial images [ 56 ]. An inadequate assessment of the common good resulting from this type of study can be socially detrimental [ 57 ]. In this sense, big data projects resemble public health research studies, with an ethical focus on the common good over individual autonomy [ 58 ]. Within this context, ERCs have an even greater responsibility to ensure the just distribution of research benefits across the population. Accurately determining the social value of big data research is challenging, as negative consequences may be difficult to detect before research begins. Nevertheless, this task remains a crucial objective of research oversight.

The literature reports examples of ERCs failing to be accountable and transparent [ 59 ]. This might be the result of the already unclear role of ERCs. Indeed, ERCs' practices are the outcome of several layers of legal, ethical, and professional regulation, which vary widely across jurisdictions. Some ERCs therefore function as peer counselors, others as independent advisors, and still others as legal controllers. What does seem common across countries is that ERCs rarely disclose their procedures, policies, and decision-making processes. This “secrecy” can erode trust in the ethical oversight model [ 60 ]. That is problematic because ERCs rely on public acceptance as accountable and trustworthy entities [ 61 ]. In big data research, where the number of data subjects is exponentially greater, a lack of accountability and an opaque deliberative process on the part of ERCs might provoke even greater public backlash. Ensuring the truthfulness of the stated benefits and risks of research is a major determinant of trust in both science and research oversight. Researchers are another category of stakeholders negatively affected by poor communication and publicity on the part of ERCs. Commentators have shown that ERCs often fail to provide clear guidance about the ethical standards applied in research review [ 62 ]. For instance, if researchers give data subjects unrealistic expectations of privacy and security, ERCs have an institutional responsibility to flag those promises (e.g., about data security and the secondary uses of subject data), especially when the research involves personal and highly sensitive data [ 63 ]. For their part, however, ERCs should make their own expectations and decision-making processes clear.

Finally, ERCs face the growing problem of being overwhelmed by the number of studies to review [ 64 , 65 ]. Whereas ERCs originally reviewed only human subjects research in the natural sciences and medicine, over time they also became the ethical body of reference for human research in the social sciences (e.g., behavioral psychology, educational sciences). This increase in demand puts pressure on ERC members, who often review research on an unpaid, voluntary basis. The breadth of big data research could exacerbate this existing issue. Having more research to assess and less time to accomplish the task may degrade the quality of the ERC's output and lengthen the review process [ 66 ]. Consequently, researchers might carry out potentially risky studies because the relevant ethical issues were overlooked in review. Furthermore, research itself could be significantly delayed, to the point of losing its timely scientific value.

Novel weaknesses: purview weaknesses

To determine whether the ERC is still the most fit-for-purpose entity to oversee big data research, it is important to establish under which conditions big data projects fall under the purview of ERCs.

Historically, research oversight has focused primarily on publicly funded human subject research in the biomedical field. In the US, for instance, each review board is responsible for a subtype of research based on content or methodology (there are IRBs dedicated to validating clinical trial protocols, assessing cancer treatments, examining pediatric research, and reviewing qualitative research). This traditional ethics review structure cannot accommodate big data research [ 2 ]. Big data projects often reach beyond a single institution, cut across disciplines, involve data collected from a variety of sources, re-use data not originally collected for research purposes, combine diverse methodologies, orient towards population-level research, rely on large data aggregates, and emerge from collaboration with the private sector. Given this scenario, big data projects may well fall beyond the purview of ERCs.

Another case in which big data research does not fall under ERC purview is when it relies on anonymized data. If researchers use data that cannot be traced back to subjects (anonymized or non-personal data), then according to both the US Common Rule and HIPAA regulations, the project is considered safe enough to be granted an ethics review waiver. If researchers instead use pseudonymized (or de-identified) data, they must apply for research ethics review, because in principle the key linking the de-identified data to subjects could be revealed or hacked, causing harm to subjects. In the European Union, it is left to each Member State (and to national laws or the policies of local institutions) to define whether research using anonymized data must seek ethical review. This case shows once more that current research ethics regulation is loose and disjointed across jurisdictions, and may leave areas where big data research is unregulated. In particular, the special treatment given to anonymized data stems from an emphasis on risk at the individual level. So far in the big data discourse, the concept of harm has been linked mainly to vulnerability in data protection; the assumption is that if privacy laws are respected, and protection is built into the data system, researchers can prevent harmful outcomes [ 40 ]. However, this view is myopic, as it ignores other misuses of data aggregates, such as group discrimination and dignitary harm. These types of harm are already emerging in the big data ecosystem, for instance where anonymized data reveal the health patterns of a certain sub-group, or where computational technologies embed strong racial biases [ 67 , 68 ]. Furthermore, studies using anonymized data should not be deemed oversight-free by default, as it is increasingly hard to anonymize data: technological advancements might soon make it possible to re-identify individuals from aggregate data sets [ 69 ].
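To make the anonymized/pseudonymized distinction concrete, here is a minimal Python sketch of keyed pseudonymization. The record fields, the secret key, and the `pseudonymize` helper are all hypothetical, invented for illustration; the point is that whoever holds the key can recompute the mapping and re-identify subjects, which is why pseudonymized data still triggers ethics review while truly anonymized data may not.

```python
import hmac
import hashlib

# Hypothetical health records; field names are illustrative only.
records = [
    {"patient_id": "NHS-4411", "age": 54, "diagnosis": "type 2 diabetes"},
    {"patient_id": "NHS-9120", "age": 37, "diagnosis": "hypertension"},
]

# The "key" regulators worry about: anyone holding it can re-identify subjects.
SECRET_KEY = b"held-by-the-data-custodian"

def pseudonymize(record: dict) -> dict:
    """Replace the direct identifier with a keyed hash (a pseudonym)."""
    pseudonym = hmac.new(
        SECRET_KEY, record["patient_id"].encode(), hashlib.sha256
    ).hexdigest()[:12]
    out = {k: v for k, v in record.items() if k != "patient_id"}
    out["pseudonym"] = pseudonym
    return out

for r in records:
    print(pseudonymize(r))

# Destroying SECRET_KEY (and any other re-identifying attributes) is what
# separates anonymization from pseudonymization: without the key, the
# pseudonyms cannot be mapped back to individuals.
```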

The risks associated with big data projects also increase due to the variety of actors involved in research alongside university researchers (e.g., private companies, citizen science associations, bio-citizen groups, community worker cooperatives, foundations, and non-profit organizations) [ 70 , 71 ]. What is novel about health-related big data research, compared with traditional research, is that anyone who can access large amounts of data about individuals and build predictive models on that data can now determine or infer the health status of a person without directly engaging that person in a research program [ 72 ]. Facebook, for example, is carrying out a suicide prediction and prevention project which relies exclusively on the information that users post on the social network [ 18 ]. Because this type of research is now possible, and the available ethics review model exempts many big data projects from ERC appraisal, gaps in oversight are growing [ 17 , 73 ]. Just as corporations can re-use publicly available datasets (such as social media data) to determine life insurance premiums [ 74 ], citizen science projects can be conducted without seeking research oversight [ 75 ]. Indeed, participant-led big data research, despite being increasingly common, is another area where the traditional oversight model is not effective [ 76 ]. In addition, ERCs might not take seriously research conducted outside academia or publicly funded institutions, and may therefore disregard review requests from actors outside the academic environment (e.g., from citizen science groups or health tech start-ups) [ 77 ].

Novel weaknesses: functional weaknesses

Functional weaknesses are those related to the skills, composition, and operational activities of ERCs in relation to big data research.

From this functional perspective, we argue that the ex-ante review model might not be appropriate for big data research. Assessment at the project design phase or at the point of data collection is insufficient to address the emerging challenges that characterize big data projects, especially as data could, over time, become useful for other purposes and therefore be re-used or shared [ 53 ]. Limitations of the ex-ante review model have already become apparent in the field of genetic research [ 78 ]. In this context, biobanks must often undergo a second ethics assessment to authorize a specific research use, such as exome sequencing of their primary data samples [ 79 ]. Similarly, in a case in which an ERC approved the original collection of sensitive personal data, a data access committee would ensure that secondary uses are in line with the original consent and ethics approval. However, if researchers collect data from publicly accessible platforms, they can potentially use and re-use data for research lawfully, without seeking data subject consent or ERC review. This is often the case in social media research: social media data, collected by researchers or private companies under a form of broad consent, can be re-used by researchers for additional analysis without ERC approval. And it is not only the re-use of data that poses unforeseeable risks. The ex-ante approach might not be suitable for assessing other stages of the data lifecycle [ 80 ], such as the deployment of machine learning algorithms.

Rather than re-using data, some big data studies build models on existing data (using data mining and machine learning methods), creating new data which is then used to further feed the algorithms [ 81 ]. Sometimes it is not possible to anticipate which analytic models or tools (e.g., artificial intelligence) will be leveraged in the research. And even then, the nature of the computational technologies that extract meaning from big data makes it difficult to anticipate all the correlations that will emerge from the analysis [ 37 ]. This is an additional reason why big data research often takes a tentative approach to a research question, instead of growing from a specific research hypothesis [ 82 ]. The difficulty of clearly framing big data research makes it even harder for ERCs to anticipate unforeseeable risks and potential societal consequences. Given the existing regulations and the intrinsically exploratory nature of big data projects, the mandate of ERCs does not appear well placed to guarantee research oversight. It seems even less so if we consider problems that might arise after the publication of big data studies, such as repurposing or dual-use issues [ 83 ].

ERCs also face the challenge of assessing the value of informed consent for big data projects. Re-obtaining consent from research subjects is impractical, particularly when consumer-generated data (e.g., social media data) are used for research purposes. In these cases, researchers often rely on broad consent and consent waivers. This leaves data subjects unaware of their participation in specific studies, and therefore incapable of engaging with the research as it progresses, making them and the communities they represent vulnerable to potential negative research outcomes. The tool of consent has inherent limitations in big data research: it cannot disclose all possible future uses of data, in part because those uses may be unknown at the time of data generation. Moreover, researchers can access existing datasets multiple times and re-use the same data for alternative purposes [ 84 ]. What should the ERCs' strategy be, given that the current model of informed consent leaves an ethical gap in big data projects? ERCs may be tempted to focus on the consent challenge while neglecting other pressing big data issues [ 53 ]. However, the literature reports an increasing number of authors who oppose the idea of a new consent form for big data studies [ 5 ].

A final widely discussed concern is ERCs' inadequate expertise in the area of big data research [ 85 , 86 ]. In the past, there have been questions about the technical and statistical expertise of ERC members. For example, ERCs have attempted to fit social science research into the clinical trial model, using the same knowledge and approach to review both types of research [ 87 ]. Big data research poses further challenges to ERCs' expertise. First, the distinct methodology of big data studies (based on data aggregation and mining) requires specialized technical expertise (e.g., in information systems, self-learning algorithms, and anonymization protocols). Indeed, big data projects have a strong technical component, owing to data volume and sources, which brings specific challenges (e.g., collecting data outside traditional protocols on social media) [ 88 , 89 ]. Second, ERCs may be unfamiliar with new actors involved in big data research, such as citizen science actors or private corporations. Because of this lack of relevant expertise, ERCs may demand unjustified amendments to research studies, or even reject big data projects tout court [ 36 ]. Finally, ERCs may lose credibility as oversight bodies capable of assessing ethical violations and research misconduct. In the past, ERCs addressed gaps in expertise by consulting independent experts in the relevant field when reviewing a protocol in that domain. However, this solution is not always practical, as it depends upon the availability of an expert. Furthermore, experts may be researchers working and publishing in the field themselves. This scenario would be problematic because researchers would then be defining the rules that experts must abide by, compromising the concept of independent review [ 19 ]. Nonetheless, this problem does not disqualify the idea of expertise; it simply requires high transparency standards regarding rule development and compliance. Other options include ad hoc expert committees or the provision of relevant training for existing committee members [ 47 , 90 , 91 ]. Given these options, which is best suited to address ERCs' lack of expertise in big data research?

Reforming the ERC

Our analysis shows that ERCs play a critical role in ensuring ethical oversight and risk–benefit evaluation [ 92 ], assessing the scientific validity of a project in its early stages, and offering an independent, critical, and interdisciplinary approach to review. These strengths demonstrate why the ERC is an oversight model worth holding on to. Nevertheless, ERCs carry persistent weaknesses that big data research compounds, reducing their effectiveness and appropriateness as oversight bodies for data-driven research. To answer our initial research question: we propose that the current oversight mechanism is not yet as fit for purpose to assess the ethics of big data research as it could be in principle. ERCs should be improved at several levels so that they can adequately address and overcome these challenges. Changes could be introduced at the level of the regulatory framework as well as at the level of procedures. Additionally, reforming the ERC model might mean introducing complementary forms of oversight. In this section we explore these possibilities. Figure 2 offers an overview of the reforms that could help ERCs improve their process.

[Fig. 2: Reforms overview for the research oversight mechanism]

Regulatory reforms

The regulatory design of research oversight is the first aspect which needs reform. ERCs could benefit from new guidance (e.g., in the form of a flowchart) on the ethics of big data research. This guidance could build upon a deep rethinking of the importance of data for the functioning of societies, the way we use data in society, and our justifications for this use. In the UK, for instance, individuals can generally opt out of having their data (e.g., hospital visit data, health records, prescription drugs) stored by physicians’ offices or by NHS digital services. However, exceptions to this opt-out policy apply when uses of the data are vital to the functioning of society (for example, in the case of official national statistics or overriding public interest, such as the COVID-19 pandemic) [ 93 ].

We imagine this new guidance also re-defining the scope of ERC review, from the protection of individual interests to a broader research impact assessment. In other words, it would allow the ERC's scope to expand and to address the purview issues discussed above. For example, less research would be oversight-free, because more factors would trigger ERC purview in the first place. The new governance would impose ERC review for research involving anonymized data, or for big data research within public–private partnerships. Furthermore, ERC purview could be extended beyond the initial phase of the study to other points in the data lifecycle [ 94 ]. A possible option is to assess a study after its conclusion (as is the case in the pharmaceutical industry): ERCs could then decide whether research findings and results should be released and further used by the scientific community. This new ethical guidance would serve ERCs not only in deciding whether a project requires review, but also in learning from past examples and best practices how best to proceed in the assessment. Hence, this guidance could help increase transparency around the assessment criteria used across ERCs. Transparency could be achieved by defining a minimum global standard for ethics assessment that allows international collaboration based on open data and a homogeneous evaluation model. Acceptance of a global standard would also mean that the same oversight procedures apply to research projects with similar risks and research paths, regardless of whether they are carried out by public or private entities. Increased clarity and transparency might also streamline the review process within and across committees, rendering the entire system more efficient.

Procedural reforms

Procedural reforms might target specific aspects of the ERC model to make it more suitable for the review of big data research. To begin with, ERCs should develop new operational tools to mitigate emerging big data challenges. For example, the AI Now algorithmic impact assessment tool, which appraises the ethics of automated decision systems and informs decisions about whether or not to deploy them in society, could be used [ 95 ]. Forms of broad consent [ 96 ] and dynamic consent [ 20 ] can also address some of the issues raised by the use, re-use, and sharing of big data (publicly available or not). Nonetheless, informed consent should not be considered a panacea for all ethical issues in big data research, especially in the case of publicly available social media data [ 97 ]. If the ethical implications of big data studies affect society and its vulnerable sub-groups, individual consent cannot be relied upon as an effective safeguard. For this reason, ERCs should move towards a more democratic process of review. Possible strategies include engaging research subjects and communities in the decision-making process or promoting a co-governance system. The recent Montreal Declaration for Responsible AI is an example of an ethical oversight process developed out of public involvement [ 98 ]. Furthermore, this inclusive approach could increase the trustworthiness of the ethics review mechanism itself [ 99 ]. In practice, the more ERCs involve potential data subjects in a transparent conversation about the risks of big data research, the more socially accountable the oversight mechanism will become.

ERCs must also address their lack of big data and general computing expertise. There are several potential ways to bridge this gap. First, ERCs could build capacity with formal training on big data. ERCs are willing to learn from researchers about social media data and computational methodologies used for data mining and analysis [ 85 ]. Second, ERCs could adjust membership to include specific experts from needed fields (e.g., computer scientists, biotechnologists, bioinformaticians, data protection experts). Third, ERCs could engage with external experts for specific consultations. Despite some resistance to accepting help, recent empirical research has shown that ERCs may be inclined to rely upon external experts in case of need [ 86 ].

In the data-driven research context, ERCs must embrace their role as regulatory stewards, and walk researchers through the process of ethics review [ 40 ]. ERCs should establish an open communication channel with researchers to communicate the value of research ethics while clarifying the criteria used to assess research. If ERCs and researchers agree to mutually increase transparency, they create an opportunity to learn from past mistakes and prevent future ones [ 100 ]. Universities might seek to educate researchers on ethical issues that can arise when conducting data-driven research. In general, researchers would benefit from training on identifying issues of ethics or completing ethics self-assessment forms, particularly if they are responsible for submitting projects for review [ 101 ]. As biomedical research is trending away from hospitals and clinical trials, and towards people’s homes and private corporations, researchers should strive towards greater clarity, transparency, and responsibility. Researchers should disclose both envisioned risks and benefits, as well as the anticipated impact at the individual and population level [ 54 ]. ERCs can then more effectively assess the impact of big data research and determine whether the common good is guaranteed. Furthermore, they might examine how research benefits are distributed throughout society. Localized decision making can play a role here [ 55 ]. ERCs may take into account characteristics specific to the social context, to evaluate whether or not the research respects societal values.

Complementary reforms

An additional measure to tackle the novelty of big data research might consist in reforming the current research ethics system through regulatory and procedural tools. However, this strategy may not be sufficient: the current system might require additional support from other forms of oversight to complement its work.

One possibility is the creation of hybrid review mechanisms and norms, merging valuable aspects of the traditional ERC review model with more innovative models, which have been adopted by various partners involved in the research (e.g., corporations, participants, communities) [ 102 ]. This integrated mechanism of oversight would cover all stages of big data research and involve all relevant stakeholders [ 103 ]. Journals and the publishing industry could play a role within this hybrid ecosystem in limiting potential dual use concerns. For instance, in the research publication phase, resources could be assigned to editors so as to assess research integrity standards and promote only those projects which are ethically aligned. However, these implementations can have an impact only when there is a shared understanding of best practice within the oversight ecosystem [ 19 ].

A further option is to establish specialized and distinct ethics committees alongside ERCs, whose purpose would be to assess big data research and provide sectorial accreditation to researchers. In this model, ERCs would not be overwhelmed by the number of study proposals to review and could outsource evaluations requiring specialist knowledge in the field of big data. It is true that specialized committees (data safety monitoring boards, data access committees, and responsible research and innovation panels) already exist and support big data researchers in ensuring data protection (e.g., system security, data storage, data transfer). However, something like a “data review board” could assess research implications for both the individual and society, while also reviewing a project's technical features. Peer review could play a critical role in this model: the research community retains the expertise needed to conduct ethical research and to support each other when the path is unclear [ 101 ].

Despite their promise, these scenarios all suffer from at least one primary limitation. The former might face a backlash when attempting to bring together the priorities and ethical values of various stakeholders, within common research norms. Furthermore, while decentralized oversight approaches might bring creativity over how to tackle hard problems, they may also be very dispersive and inefficient. The latter could suffer from overlapping scope across committees, resulting in confusing procedures, and multiplying efforts while diluting liability. For example, research oversight committees have multiplied within the United States, leading to redundancy and disharmony across committees [ 47 ]. Moreover, specialized big data ethics committees working in parallel with current ERCs could lead to questions over the role of the traditional ERC, when an increasing number of studies will be big data studies.

ERCs face several challenges in the context of big data research. In this article, we sought to bring clarity regarding those which might affect the ERC’s practice, distinguishing between novel and persistent weaknesses which are compounded by big data research. While these flaws are profound and inherent in the current sociotechnical transformation, we argue that the current oversight model is still partially capable of guaranteeing the ethical assessment of research. However, we also advance the notion that introducing reform at several levels of the oversight mechanism could benefit and improve the ERC system itself. Among these reforms, we identify the urgency for new ethical guidelines and new ethical assessment tools to safeguard society from novel risks brought by big data research. Moreover, we recommend that ERCs adapt their membership to include necessary expertise for addressing the research needs of the future. Additionally, ERCs should accept external experts’ consultations and consider training in big data technical features as well as big data ethics. A further reform concerns the need for transparent engagement among stakeholders. Therefore, we recommend that ERCs involve both researchers and data subjects in the assessment of big data research. Finally, we acknowledge the existing space for a coordinated and complementary support action from other forms of oversight. However, the actors involved must share a common understanding of best practice and assessment criteria in order to efficiently complement the existing oversight mechanism. We believe that these adaptive suggestions could render the ERC mechanism sufficiently agile and well-equipped to overcome data-intensive research challenges and benefit research at large.

Acknowledgements

This article reports the ideas and conclusions that emerged during a collaborative and participatory online workshop. All authors participated in the “Big Data Challenges for Ethics Review Committees” workshop, held online on 23–24 April 2020 and organized by the Health Ethics and Policy Lab, ETH Zurich.

Abbreviations

ERC(s): Ethics Review Committee(s)
HIPAA: Health Insurance Portability and Accountability Act
IRB(s): Institutional Review Board(s)
NHS: National Health Service
REC(s): Research Ethics Committee(s)
UK: United Kingdom
US: United States

Authors' contributions

AF drafted the manuscript, MI, MS1 and EV contributed substantially to the writing. EV is the senior lead on the project from which this article derives. All the authors (AF, MI, MS1, AB, ESD, BF, PF, JK, WK, PK, SML, CN, GS, MS2, MRV, EV) contributed greatly to the intellectual content of this article, edited it, and approved the final version. All authors read and approved the final manuscript.

Funding

This research is supported by the Swiss National Science Foundation under award 407540_167223 (NRP 75 Big Data). MS1 is grateful for funding from the National Institute for Health Research (NIHR) Oxford Biomedical Research Centre (BRC). The funding bodies did not take part in designing this research and writing the manuscript.

Availability of data and materials

Declarations

Competing interests

The authors declare that they have no competing interests.

1 There is an unsettled discussion about whether ERCs ought to play a role in evaluating both scientific and ethical aspects of research, or whether these can even come apart, but we will not go into detail here. See: 25. Dawson AJ, Yentis SM. Contesting the science/ethics distinction in the review of clinical research. Journal of Medical Ethics. 2007;33(3):165–7; 26. Angell EL, Bryman A, Ashcroft RE, Dixon-Woods M. An analysis of decision letters by research ethics committees: the ethics/scientific quality boundary examined. BMJ Quality & Safety. 2008;17(2):131–6.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Practical issues to consider when working with big data

Lorien Stice-Lawrence. Review of Accounting Studies 27, 1117–1124 (2022). Open access; published 05 August 2022.

Increasing access to alternative or “big data” sources has given rise to an explosion in the use of these data in economics-based research. However, in our enthusiasm to use the newest and greatest data, we as researchers may jump to use big data sources before thoroughly considering the costs and benefits of a particular dataset. This article highlights four practical issues that researchers should consider before working with a given source of big data. First, big data may not be conceptually different from traditional data. Second, big data may only be available for a limited sample of individuals, especially when aggregated to the unit of interest. Third, the sheer volume of data coupled with high levels of noise can make big data costly to process while still producing measures with low construct validity. Last, papers using big data may focus on the novelty of the data at the expense of the research question. I urge researchers, in particular PhD students, to carefully consider these issues before investing time and resources into acquiring and using big data.


1 Introduction

Researchers have increasing access to alternative or “big data” sources, giving rise to an explosion in the use of this type of data in economics-based research. However, researchers embarking on new projects using big data sometimes fail to recognize that big data, while novel, is not a panacea for the problems faced when working with more traditional data sources. This article highlights key issues that researchers should consider before working with a particular source of big data in order to ensure that any prospective project has the potential to convincingly test an economically interesting and impactful research question. Many of the issues I discuss are common to any source of new data, not just big data, and are therefore familiar to experienced data veterans. As a result, my remarks are targeted primarily at PhD students. Blankespoor et al. ( 2022 , also in this issue) is presented as one example of a study that has largely overcome the issues with big data presented here.

“Big data” refers to data that is larger and more complex than traditional machine-readable data sources; so much so that traditional approaches are often inadequate and new techniques are needed to process and analyze the data. When defining big data, Oracle, the world’s largest database management company, refers to terabytes or even petabytes of data (Oracle 2022). In actuality, much of the data frequently referred to as “big data” in business research is really not that big, especially because researchers often only use small slivers of what is available. As a result, the term “alternative data” is perhaps a more accurate way to describe the broad set of new data sources increasingly being used. For parsimony, I refer to either alternative data or big data collectively as “big data”.

In this paragraph, I briefly describe several types of big data sources that have been used by researchers in accounting and financial economics. However, the high-level issues I discuss in Sect. 2 apply to the use of big data in other disciplines as well. One of the earliest sources of big data used in accounting and finance research was textual content, for instance company disclosures, such as annual reports, press releases, and conference call transcripts, but also other sources of text, including news articles, website content, product reviews, and analyst reports (e.g., Loughran and McDonald 2016 ). Textual data can be large and is inherently unstructured, necessitating additional processing before it can be used in statistical analyses. Similarly, images, such as those of executives' faces or their signatures (e.g., Ham et al. 2017 ), as well as audio-visual content, such as conference calls or roadshow presentations (e.g., Hobson et al. 2012 ), require specialized processing techniques. In addition, research is increasingly making use of social media data: company and customer tweets, friend networks, and content likes, to name a few (e.g., Lee et al. 2015 ). Location and movement data captured by cell phone signals, satellite images, and taxi data can approximate physical movements of customers, shipments, or investors (e.g., Kang et al. 2021 ), while transaction and point-of-sale data can approximate what is being purchased, how much, and when (e.g., Blankespoor et al. 2022 ). Similarly, digital footprint data from web search and browsing, app usage, and online transactions can track users' online movement and behavior (e.g., Froot et al. 2017 ). More broadly, IoT (Internet of Things) devices such as smart and connected devices in home, healthcare, manufacturing, retail, and transportation settings generate a vast amount of data. While most IoT data in accounting and finance up to this point is restricted to cell phone data, this data will likely play a much larger role in our research going forward. Although this summary of big data sources is incomplete, and many of these categories are non-mutually exclusive, an overall takeaway is that “big data” is an umbrella term that encompasses many different types of data.

This article does not intend to give technical software tips or in-depth analysis of each of the many types of big data, and I refer readers interested in more extensive descriptions of alternative data sources and their potential applications to Cong et al. ( 2021 ) and Teoh ( 2018 ). Instead, I focus on the common issues that can arise when dealing with any of these big data sources. Because the issues that I discuss, if ignored, can limit the econometric rigor and contribution of research, successfully published papers using big data are those that have overcome these issues through careful research designs and well-thought-out research questions. This sample selection of papers in the public record may give the false impression to researchers considering working in this area, in particular PhD students, that these sorts of issues may simply “fall into place” after the researcher has found a sufficiently interesting source of big data. However, what are not as visible are the many projects that were eventually abandoned because the authors were not able to overcome the hurdles discussed below. As a result, this article is not a review of published research using big data but instead highlights the high-level issues researchers should consider at the initial project planning and selection stage. In Sect. 2, I outline the four key issues to consider when working with big data and apply this approach to Blankespoor et al. ( 2022 ).

2 Key issues to consider when conducting research using big data

2.1 Is it conceptually different?

The first question to ask when considering whether to use a particular source of big data is whether the data is conceptually different from more traditional data. This is especially important if prior research has already examined a similar research question. In other words, prospective researchers need to decide whether they believe the use of big data allows them to answer new questions, or if instead it allows them to answer old questions with more precision, for example because of increased statistical power. In most cases, the potential contribution and impact of the former is greater than the latter. Although a new source of big data may be unique in many ways, it may capture a very similar construct as data used in prior research. In that case, a researcher may inadvertently end up testing an old question without realizing it. The similarities with prior data may not just be at the construct level; big data can still suffer from the same empirical pitfalls as traditional data. For example, Teoh ( 2018 ) points out that studies using alternative data are still subject to endogeneity concerns. As a result, researchers should carefully review the prior literature and clearly articulate to themselves how their big data source differs from previously used data at the construct level, or how it overcomes an empirical challenge.

The largest contribution of using big data comes when researchers can address questions that were previously unanswerable with more traditional data sources. As a case in point, Blankespoor et al. ( 2022 ) in this issue uses real-time transaction data to estimate the performance information available to managers within a given fiscal quarter. They aggregate individual credit and debit card transactions to the firm-week level so that they can estimate a firm's performance during the quarter as of a given week. They then examine how managers' disclosure choices dynamically respond to performance throughout the quarter, which managers observe in real time as the quarter progresses. This research question would be impossible to address using traditional performance data, which provides only a summary measure of aggregate firm performance for the entire quarter. Big data can provide us with the opportunity to study behaviors that may have occurred before its existence but were previously unobservable.
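As a rough illustration of the kind of aggregation Blankespoor et al. describe, the following pandas sketch collapses transaction-level records to firm-week revenue estimates. The column names and toy data are assumptions made up for this example, not the authors' actual dataset or code.

```python
import pandas as pd

# Toy transaction-level data; columns are hypothetical stand-ins for
# the credit and debit card records described in the paper.
txns = pd.DataFrame({
    "firm":   ["A", "A", "A", "B", "B"],
    "date":   pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-12",
                              "2021-01-04", "2021-01-11"]),
    "amount": [19.99, 5.50, 42.00, 7.25, 88.10],
})

# Collapse individual transactions to one revenue estimate per firm-week,
# the unit at which within-quarter performance can be tracked in real time.
firm_week = (
    txns.assign(week=txns["date"].dt.to_period("W"))
        .groupby(["firm", "week"], as_index=False)["amount"].sum()
        .rename(columns={"amount": "est_weekly_revenue"})
)
print(firm_week)
```

Note how quickly the unit of analysis shrinks: billions of raw transactions can collapse into a panel of only a few hundred firms, which is the small-sample concern discussed in the next subsection.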

In addition, big data may allow us to answer new questions if individuals and firms behave differently in the presence of big data. For example, Zhu ( 2019 ) examines how the availability of big data about a given firm disciplines managers to make better decisions on behalf of shareholders. However, in some cases, examining whether individuals behave differently in the presence of big data is conceptually identical to examining whether individuals behave differently in the presence of data. Big data is just one type of data that is made up of many signals rather than just a few. As a result, the new availability of alternative data sources in the last two decades is conceptually similar to the introduction of the internet and widely accessible electronic databases (such as EDGAR) in the 1990s or even the original introduction of CRSP and Compustat in the 1960s. That being said, big data is more timely and more granular than traditional databases and has the ability to track behavior at the individual level in ways that have never been seen before. Consequently, its presence can have different effects on behavior than coarser data. Zhu ( 2019 ) exploits this aspect of big data by showing that firm outsiders can monitor firms more effectively when they have access to real-time big data than when they had access only to traditional financial reports.

2.2 Lots of data, few subjects

In spite of the name, big data can still suffer from small sample problems. Researchers often end up with a surprisingly small number of subjects when the data is aggregated to the unit of interest (e.g., to the firm level). For example, Blankespoor et al. ( 2022 ) begins with 1.6 billion individual transactions but aggregates these to measure the performance of only 243 firms, all in the retail sector. The result is that studies using big data can suffer from a lack of statistical power. Sometimes such sample limitations are inherent to the data: only retail firms have point-of-sale data, and only healthcare companies have patient medical records. These sorts of sample limitations aren’t problematic if the sample is the population of interest (i.e., retail or healthcare, respectively). If the goal is to predict hospital readmissions, it makes sense to only examine healthcare facilities. The problem comes when trying to make inferences outside of the original sample. Studies that wish to answer research questions about broad economic phenomena need to ensure that the economic forces of interest are not affected by unique attributes of the sample in which these research questions are tested.

Blankespoor et al. ( 2022 ) uses a sample of retail firms to document that managers respond to real-time information by suppressing negative information at the beginning, but not the end, of the fiscal quarter. However, if disclosure responses to real-time information vary across industries, then the results of this study may not generalize. This could be the case if the disclosure response to real-time information varies based on operating cycle length. Firms with relatively short operating cycles (such as retail firms) may initially choose to suppress negative real-time performance information if they expect that they may be able to take corrective actions and change performance during the quarter. On the other hand, firms in industries with relatively long operating cycles (for example, manufacturers of large machinery like Boeing) may not be able to respond quickly enough to materially change performance within a single quarter. As a result, those firms would have less incentive to delay bad news. Retail firms might also have more incentive to suppress bad news if their greater brand visibility leads to disproportionately negative investor reactions to bad news.

Generalizability is not a new problem, but it often gets little attention by authors in the face of novel data. After all, 1.6 billion transactions is a lot! Additionally, while authors should acknowledge them, generalizability problems in a single paper may not be critical if readers are careful when making inferences, especially if follow-on work can validate the results in other settings. However, generalizability problems can compound if research within a given area repeatedly uses big data sources with similar sample selection issues, giving a misleading impression of generalizability. Many sources of big data currently used by researchers in accounting and finance are somewhat or completely focused on retail firms, namely credit card transactions, brand awareness, parking lot images, and foot traffic data. Hopefully this disproportionate representation of retail firms will eventually decrease as access to new sources of big data expands. In an ideal world, researchers could avoid generalizability problems by using big data sources without sample selection issues. Practically speaking, researchers should identify ways in which their sample selection might limit generalizability and design tests to mitigate these concerns.

2.3 Noise and cleaning

Unfortunately, quantity does not always equal quality when it comes to data, and big data is no exception. Many sources of alternative data suffer from high levels of noise or require a large amount of processing before any usable information can be extracted. For example, textual, image, and audiovisual data are all unstructured and require careful analysis that often relies heavily on subjective research design or processing choices (Loughran and McDonald 2016 ). Unlike when they use CRSP or Compustat, researchers using big data sources are often the first to use the data and entirely responsible for processing and vetting it. The resources required to do so can be significant. Although the authors of Blankespoor et al. ( 2022 ) are modest in describing the difficulty of their data cleaning process, analyzing 1.6 billion transactions was undoubtedly a headache. Unlike corporate users of big data, research teams at academic institutions often consist of only three or four individuals, usually with access to either high-powered personal computers (often inadequate for large amounts of data) or university-hosted computing servers subject to a variety of restrictions and bottlenecks. In other words, the costs to clean and process big data are far from trivial. Further, processing and cleaning choices can materially impact the results of statistical tests (Denny and Spirling 2018 ) and yet are often undisclosed, study-specific, and difficult for others to replicate.
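The Denny and Spirling ( 2018 ) point about preprocessing choices is easy to demonstrate. In the toy Python sketch below (the tokenizer and the three-word stopword list are invented for the example), the same sentence yields different token sets under different, equally defensible cleaning pipelines: exactly the kind of undisclosed choice that can move statistical results.

```python
import re

doc = "The Board approved the merger; the board's approval was NOT unanimous."

def tokenize(text, lowercase=True, strip_punct=True, drop_stopwords=False):
    """A toy tokenizer whose output depends on three common cleaning choices."""
    if lowercase:
        text = text.lower()
    if strip_punct:
        text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    if drop_stopwords:
        # Tiny illustrative stopword list; real lists have hundreds of entries.
        tokens = [t for t in tokens if t not in {"the", "was", "not"}]
    return tokens

print(tokenize(doc))                       # baseline pipeline
print(tokenize(doc, drop_stopwords=True))  # "not" is dropped: negation is lost
print(tokenize(doc, lowercase=False))      # "The"/"the", "Board"/"board" now differ
```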

A considerable risk that researchers bear when they decide to incur the costs of processing big data is that the resulting data may still suffer from noise and measurement error. Unfortunately, it is frequently unclear what the quality of the data will be before these costs are incurred. A bigger problem is that the data, even if perfectly measured, might still only be loosely tied to the construct of interest. For example, consider a hypothetical scenario in which a researcher uses Google Trends data to gauge sentiment about a topic that will be voted on at a shareholder meeting in order to predict how management might preemptively respond. Unfortunately for this hypothetical researcher, Google Trends is based on searches by all users, not just those who own stock in a given company, and a large proportion of votes are actually cast by proxy voters. As a result, Google Trends data might not be a good proxy for the attitudes of those who will ultimately cast votes. Although Google Trends can be an excellent barometer for attitudes and interest in many settings, it is not necessarily suited to every research question. Ultimately, no matter how carefully researchers clean and vet their data, big data is not a cure-all for construct validity problems.

2.4 Interesting data, boring questions

Unfortunately, it is difficult to reverse engineer an interesting (or impactful) research question. Frequently, researchers are so enthusiastic about a new source of big data that they jump into cleaning and validating the data before they have clearly articulated a research question. Sometimes researchers may even have an idea for a specific construct of interest that they think they can measure with the available data. However, if it was difficult to generate a research question before processing the data, it is often just as difficult to generate one after processing the data. As a result, some researchers end up writing papers that focus entirely on validating and describing a given data source in the absence of economic intuition. A related tendency is to write papers that develop methodological approaches without discussing applications or testing new research questions. If the researcher cannot generate any ideas on how a particular source of big data or a particular innovative methodology could be used to test interesting economic questions, then it is unlikely that follow-on researchers will be able to do so either.

The temptation to jump immediately to data collection is especially high for PhD students who hope to develop their research question as they go and are intent on obtaining data with high barriers to entry. However, this is the population in which the costs of wasted effort are potentially the highest. Researchers should carefully consider their research question before collecting data. If possible, they should generate the research question first, and then search for the best data. It is much easier to scale back a grandiose research question to meet the limits of the data than to scale up an uninteresting research question to match an interesting dataset.

2.5 Other issues

The analysis of big data relies more heavily on black box methodologies such as machine learning than does the analysis of traditional data. These models can provide statistical power at the expense of economic interpretability (Loughran and McDonald 2016 ; Rudin 2019 ). Further, the use of some alternative and big data sources can lead to ethical concerns not applicable to traditional data. For example, the use of highly detailed data at the individual level, such as geolocation, web browsing, app usage, and social media data, raises concerns about individual privacy. This is compounded by the fact that such data is most available for those who are least literate in data privacy. One could argue that tracking such behavior at an aggregate level protects individual identities. But is there ever a point at which certain behaviors are invasive to study, even in aggregate? And what stops other researchers from targeting and revealing individuals? Another ethical consideration centers on the use of leaked or hacked data. In some cases, we may feel an ethical duty to study corporate behavior that firms have purposefully tried to conceal. On the other hand, using this data may serve to further the unclear agenda of those who leaked or hacked the data in the first place.

3 Conclusions

In many ways, big data is not inherently different from other types of data. However, researchers, especially PhD students, can forget this in the excitement of learning about a new dataset. This article highlights four practical issues to consider when conducting economics-based research using big data. First, a particular source of big data may not be conceptually different from traditional data. As a result, studies that simply replicate prior results using big data may lack contribution, especially if the new data suffers from the same issues as prior data (e.g., endogeneity). Second, big data sources may only be available for a limited number of entities, especially when aggregated to the unit of interest, leading to limited statistical power and generalizability. Third, high levels of noise and the large volume of data can make big data costly to process; however, an arduous data cleaning process itself does not ensure that empirical proxies are tied to the constructs of interest. Last, interesting research questions are difficult to reverse engineer after the fact, and researchers who invest heavily in big data before generating a research question may end up with a paper that focuses on validating data in the absence of economic intuition. Researchers who keep these four issues in mind can potentially save themselves considerable resources (not to mention heartache!) by avoiding low-impact, high-cost projects and by instead focusing on research questions with the greatest potential for contribution.

Blankespoor, E., B. E. Hendricks, J. D. Piotroski, and C. Synn. 2022. Real-time revenue and firm disclosure. Review of Accounting Studies 27 (3).

Denny, M. J., and A. Spirling. 2018. Text preprocessing for unsupervised learning: Why it matters, when it misleads, and what to do about it. Political Analysis 26 (2): 168–189.


Cong, L. W., B. Li, and Q. T. Zhang. 2021. Alternative data in fintech and business intelligence. In The Palgrave Handbook of FinTech and Blockchain , eds. M. Pompella, and R. Matousek, 217–242. Cham: Palgrave Macmillan.


Froot, K., N. Kang, G. Ozik, and R. Sadka. 2017. What do measures of real-time corporate sales say about earnings surprises and post-announcement returns? Journal of Financial Economics 125 (1): 143–162.

Ham, C., M. Lang, N. Seybert, and S. Wang. 2017. CFO narcissism and financial reporting quality. Journal of Accounting Research 55 (5): 1089–1135.

Hobson, J. L., W. J. Mayew, and M. Venkatachalam. 2012. Analyzing speech to detect financial misreporting. Journal of Accounting Research 50 (2): 349–392.

Kang, J. K., L. Stice-Lawrence, and Y. T. F. Wong. 2021. The firm next door: Using satellite images to study local information advantage. Journal of Accounting Research 59 (2): 713–750.

Lee, L. F., A. P. Hutton, and S. Shu. 2015. The role of social media in the capital market: Evidence from consumer product recalls. Journal of Accounting Research 53 (2): 367–404.

Loughran, T., and B. McDonald. 2016. Textual analysis in accounting and finance: A survey. Journal of Accounting Research 54 (4): 1187–1230.

Oracle Corporation. 2022. What is Big Data? Oracle.com. https://www.oracle.com/big-data/what-is-big-data/. Accessed June 23, 2022.

Rudin, C. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1 (5): 206–215.

Teoh, S. H. 2018. The promise and challenges of new datasets for accounting research. Accounting Organizations and Society 68: 109–117.

Zhu, C. 2019. Big data as a governance mechanism. The Review of Financial Studies 32 (5): 2021–2061.


Acknowledgements

This article is an expansion of discussant remarks delivered at the 2021 Review of Accounting Studies Conference. Many thanks to Patricia Dechow, Earl K. Stice, Forester Wong, and conference participants.

Author information

Authors and affiliations

University of Southern California, Los Angeles, CA, USA

Lorien Stice-Lawrence


Corresponding author

Correspondence to Lorien Stice-Lawrence.


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Stice-Lawrence, L. Practical issues to consider when working with big data. Rev Account Stud 27 , 1117–1124 (2022). https://doi.org/10.1007/s11142-022-09708-x


Keywords

  • Alternative data
  • Emerging technologies
  • Research design


Top 35 big data interview questions with answers for 2024

Big data is a hot field, and organizations are looking for talent at all levels. Get ahead of the competition for that big data job with these top interview questions and answers.

  • Robert Sheldon
  • Elizabeth Davies

Increasingly, organizations across the globe are seeing the wisdom of embracing big data. The careful analysis and synthesis of massive data sets can provide invaluable insights to help them make informed and timely strategic business decisions.

For example, big data analytics can help determine what new products to develop based on a deep understanding of customer behaviors, preferences and buying patterns. Analytics can also reveal untapped potential, such as new territories or nontraditional market segments.

As organizations race to augment their big data capabilities and skills, the demand for qualified candidates in the field is reaching new heights. If you aspire to pursue a career path in this domain, a world of opportunity awaits. Today's most challenging -- yet rewarding and in-demand -- big data roles include data analysts, data scientists, database administrators, big data engineers and Hadoop specialists. Knowing what big data questions an interviewer will likely ask and how to answer such questions is essential to success.

This article will provide some direction to help set you up for success in your next big data interview -- whether you are a recent graduate in data science or information management or already have experience working in big data-related roles or other technology fields. It also includes some of the big data interview questions that prospective employers most commonly ask.


How to prepare for a big data interview

Before delving into the specific big data interview questions and answers, here are the basics of interview preparation.

  • Prepare a tailored and compelling resume. Ideally, you should tailor your resume (and cover letter) to the particular role and position for which you are applying. Not only should these documents demonstrate your qualifications and experience, but they should also convince your prospective employer that you've researched the organization's history, financials, strategy, leadership, culture and vision. Also, don't be shy to call out what you believe to be your strongest soft skills that would be relevant to the role. These might include communication and presentation capabilities; tenacity and perseverance; an eye for detail and professionalism; and respect, teamwork and collaboration.
  • Remember, an interview is a two-way street. Of course, it is essential to provide correct and articulate answers to an interviewer's technical questions, but don't overlook the value of asking your own questions. Prepare a shortlist of these questions in advance of the appointment to ask at opportune moments.
  • The Q&A: prepare, prepare, prepare. Invest the time necessary to research and prepare your answers to the most commonly asked questions, then rehearse your answers before the interview. Be yourself during the interview. Look for ways to show your personality and convey your responses authentically and thoughtfully. Monosyllabic, vague or bland answers won't serve you well.

Now, here are the top 35 big data interview questions, with a specific focus on the Hadoop framework, given its widespread adoption and its ability to address the most difficult big data challenges.

Top 35 big data interview questions and answers

Each of the following 35 big data interview questions includes an answer. However, don't rely solely on these answers when preparing for your interview. Instead, use them as a launching point for digging more deeply into each topic.

1. What is big data?

As basic as this question might seem, you should have a clear and concise answer that demonstrates your understanding of this term and its full scope, making it clear that big data can include just about any type of data from any number of sources. The data might come from sources such as the following:

  • server logs
  • social media
  • medical records
  • temporary files
  • machinery sensors
  • automobiles
  • industrial equipment
  • internet of things (IoT) devices

Big data can include structured, semi-structured and unstructured data -- in any combination -- collected from a range of heterogeneous sources. Once collected, the data must be carefully managed so it can be mined for information and transformed into actionable insights. When mining data, data scientists and other professionals often use advanced technologies such as machine learning, deep learning, predictive modeling or other advanced analytics to gain a deeper understanding of the data.

2. How can big data analytics benefit business?

There are a number of ways that big data can benefit organizations, as long as they can extract value from the data, gain actionable insights and put those insights to work. Although you won't be expected to list every possible outcome of a big data project, you should be able to cite several examples that demonstrate what can be achieved with an effective big data project. For example, you might include any of the following:

  • Improve customer service.
  • Personalize marketing campaigns.
  • Increase worker productivity.
  • Improve daily operations and service delivery.
  • Reduce operational expenses.
  • Identify new revenue streams.
  • Improve products and services.
  • Gain a competitive advantage in your industry.
  • Gain deeper insights into customers and markets.
  • Optimize supply chains and delivery routes.

Organizations within specific industries can also gain from big data analytics. For example, a utility company might use big data to better track and manage electrical grids. Or governments might use big data to improve emergency response, help prevent crime and support smart city initiatives.

3. What are your experiences in big data?

If you have had previous roles in the field of big data, outline your title, functions, responsibilities and career path. Include any specific challenges and how you met those challenges. Also mention any highlights or achievements related either to a specific big data project or to big data in general. Be sure to include any programming languages you've worked with, especially as they pertain to big data.

4. What are some of the challenges that come with a big data project?

No big data project is without its challenges. Some of those challenges might be specific to the project itself or to big data in general. You should be aware of what some of these challenges are -- even if you haven't experienced them yourself. Below are some of the more common challenges:

  • Many organizations don't have the in-house skill sets they need to plan, deploy, manage and mine big data.
  • Managing a big data environment is a complex and time-consuming undertaking that must consider both the infrastructure and data, while ensuring that all the pieces fit together.
  • Securing data and protecting personally identifiable information is complicated by the types of data, amounts of data and the diverse origins of that data.
  • Scaling infrastructure to meet performance and storage requirements can be a complex and costly process.
  • Ensuring data quality and integrity can be difficult to achieve when working with large quantities of heterogeneous data.
  • Analyzing large sets of heterogeneous data can be time-consuming and resource-intensive, and it does not always lead to actionable insights or predictable outcomes.
  • Ensuring that you have the right tools in place and that they all work together brings its own set of challenges.
  • The cost of infrastructure, software and personnel can quickly add up, and those costs can be difficult to keep under control.

5. What are the five Vs of big data?

Big data is often discussed in terms of the following five Vs:

  • Volume. The vast amounts of data that are collected from multiple heterogeneous sources.
  • Variety. The various formats of structured, semi-structured and unstructured data, from social media, IoT devices, database tables, web applications, streaming services, machinery, business software and other sources.
  • Velocity. The ever-increasing rate at which data is being generated on all fronts in all industries.
  • Veracity. The degree of accuracy of collected data, which can vary significantly from one source to the next.
  • Value. The potential business value of the collected data.

Interviewers might ask for only four Vs rather than five. In that case, they're usually looking for the first four (volume, variety, velocity and veracity), and you can round out your answer by mentioning the fifth: value. To impress your interviewer even further, you can mention yet another V: variability, which refers to the ways in which the data can be used and formatted.

Figure: The six Vs of big data

6. What are the key steps in deploying a big data platform?

There is no one formula that defines exactly how a big data platform should be implemented. However, it's generally accepted that rolling out a big data platform follows these three basic steps:

  • Data ingestion. Start out by collecting data from multiple sources, such as social media platforms, log files or business documentation. Data ingestion might be an ongoing process in which data is continuously collected to support real-time analytics, or it might be collected at defined intervals to meet specific business requirements.
  • Data storage. After extracting the data, store it in a data store, which might be the Hadoop Distributed File System (HDFS) or a NoSQL database such as Apache HBase.
  • Data processing. The final step is to prepare the data so it is readily available for analysis. For this, you'll need to implement one or more frameworks that have the capacity to handle massive data sets, such as Hadoop, Apache Spark, Flink, Pig or MapReduce, to name a few.
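To make these steps concrete, here is a minimal PySpark sketch of the ingest-store-process flow. This is an illustrative sketch only: the HDFS paths and the "status" column are hypothetical placeholders, and a running Spark-on-Hadoop environment is assumed.

# Minimal sketch of ingest -> store -> process with PySpark.
# Paths and the "status" column are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-demo").getOrCreate()

# 1. Data ingestion: read raw JSON log records from a landing directory.
raw = spark.read.json("hdfs:///landing/web_logs/")

# 2. Data storage: persist the records in a columnar format on HDFS.
raw.write.mode("overwrite").parquet("hdfs:///warehouse/web_logs/")

# 3. Data processing: run a simple aggregation for downstream analysis.
spark.read.parquet("hdfs:///warehouse/web_logs/").groupBy("status").count().show()

spark.stop()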

7. What is Hadoop and what are its main components?

Hadoop is an open source distributed processing framework for handling large data sets across computer clusters. It can scale up to thousands of machines, each supporting local computation and storage. Hadoop can process large amounts of different data types and distribute the workloads across multiple nodes, which makes it a good fit for big data initiatives.

The Hadoop platform includes the following four primary modules (components):

  • Hadoop Common. A collection of utilities that support the other modules.
  • Hadoop Distributed File System (HDFS). A key component of the Hadoop ecosystem that serves as the platform's primary data storage system, while providing high-throughput access to application data.
  • Hadoop YARN (Yet Another Resource Negotiator). A resource management framework that schedules jobs and allocates system resources across the Hadoop ecosystem.
  • Hadoop MapReduce. A YARN-based system for parallel processing large data sets.

8. Why is Hadoop so popular in big data analytics?

Hadoop is effective in dealing with large amounts of structured, unstructured and semi-structured data. Analyzing unstructured data isn't easy, but Hadoop's storage, processing and data collection capabilities make it less onerous. In addition, Hadoop is open source and runs on commodity hardware, so it is less costly than systems that rely on proprietary hardware and software.

One of Hadoop's biggest selling points is that it can scale up to support thousands of hardware nodes. Its use of HDFS facilitates rapid data access across all nodes in a cluster, and its inherent fault tolerance makes it possible for applications to continue to run even if individual nodes fail. Hadoop also stores data in its raw form, without imposing any schemas. This allows each team to decide later how to process and filter the data, based on their specific requirements at the time.

As a follow-on to this question, an interviewer might ask you to define the following four terms, specifically in the context of Hadoop:

9. Open source

Hadoop is an open source platform, meaning users can access and modify the source code to meet their specific needs. Hadoop is licensed under Apache License 2.0, which grants users a "perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form." Because Hadoop is open source and has been so widely implemented, it has a large and active user community that helps resolve issues and improve the product.

10. Scalability

Hadoop can be scaled out to support thousands of hardware nodes, using only commodity hardware. Organizations can start out with smaller systems and then scale out by adding more nodes to their clusters. They can also scale up by adding resources to the individual nodes. This scalability makes it possible to ingest, store and process the vast amounts of data typical of a big data initiative.

11. Data recovery

Hadoop replication provides built-in fault tolerance capabilities that protect against system failure. Even if a node fails, applications can keep running while avoiding any loss of data. HDFS stores files in blocks that are replicated to ensure fault tolerance, helping to improve both reliability and performance. Administrators can configure block sizes and replication factors on a per-file basis.
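As a hedged illustration of working with replication, the following Python sketch shells out to the standard HDFS command-line tools to raise one file's replication factor and then inspect its blocks. The file path is hypothetical, and a configured Hadoop client on the PATH is assumed.

# Sketch: adjust and verify HDFS replication via the standard CLI tools.
# The path /data/events.log is a hypothetical example.
import subprocess

# Set the file's replication factor to 3 and wait (-w) until
# re-replication completes.
subprocess.run(["hdfs", "dfs", "-setrep", "-w", "3", "/data/events.log"], check=True)

# Report the file's blocks, their replicas and their locations.
subprocess.run(["hdfs", "fsck", "/data/events.log", "-files", "-blocks", "-locations"], check=True)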

12. Data locality

Hadoop moves the computation close to where data resides, rather than moving large sets of data to computation. This helps to reduce network congestion while improving the overall throughput.

13. What are some vendor-specific distributions of Hadoop?

Several vendors now offer Hadoop-based products. Some of the more notable products include the following:

  • Amazon EMR (Elastic MapReduce)
  • Microsoft Azure HDInsight
  • IBM InfoSphere Information Server
  • Hortonworks Data Platform

14. What are some of the main configuration files used in Hadoop?

The Hadoop platform provides multiple configuration files for controlling cluster settings, including the following:

  • hadoop-env.sh. Site-specific environment variables for controlling Hadoop scripts in the bin directory.
  • yarn-env.sh. Site-specific environment variables for controlling YARN scripts in the bin directory.
  • mapred-site.xml. Configuration settings specific to MapReduce, such as the mapreduce.framework.name setting.
  • core-site.xml. Core configuration settings, such as the I/O configurations common to HDFS and MapReduce.
  • yarn-site.xml. Configuration settings specific to YARN's ResourceManager and NodeManager.
  • hdfs-site.xml. Configuration settings specific to HDFS, such as the file path where the NameNode stores the namespace and transaction logs.
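Since these are plain XML files, a quick way to sanity-check a setting is to parse one directly. The following standard-library Python sketch reads a single property value; the file location shown is typical but hypothetical, so adjust it for your installation.

# Sketch: read one property from a Hadoop *-site.xml configuration file.
# The path /etc/hadoop/conf/core-site.xml is a typical but hypothetical example.
import xml.etree.ElementTree as ET

def get_hadoop_property(conf_file, name):
    # Hadoop config files are <configuration> roots holding
    # <property><name>...</name><value>...</value></property> entries.
    root = ET.parse(conf_file).getroot()
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            return prop.findtext("value")
    return None

print(get_hadoop_property("/etc/hadoop/conf/core-site.xml", "fs.defaultFS"))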

Figure: Know core Hadoop components when entering a big data interview.

15. What is HDFS and what are its main components?

HDFS is a distributed file system that serves as Hadoop's default storage environment. It can run on low-cost commodity hardware, while providing a high degree of fault tolerance. HDFS stores the various types of data in a distributed environment that offers high throughput to applications with large data sets. HDFS is deployed in a primary/secondary architecture, with each cluster supporting the following two primary node types:

  • NameNode. A single primary node that manages the file system namespace, regulates client access to files and processes the metadata information for all the data blocks in the HDFS.
  • DataNode. A secondary node that manages the storage attached to each node in the cluster. A cluster typically contains many DataNode instances, but there is usually only one DataNode per physical node. Each DataNode serves read and write requests from the file system's clients.

16. What is Hadoop YARN and what are its main components?

Hadoop YARN manages resources and provides an execution environment for required processes, while allocating system resources to the applications running in the cluster. It also handles job scheduling and monitoring. YARN decouples resource management and scheduling from the data processing component in MapReduce.

YARN separates resource management and job scheduling into the following two daemons:

  • ResourceManager. This daemon arbitrates resources for the cluster's applications. It includes two main components: Scheduler and ApplicationsManager. The Scheduler allocates resources to running applications. The ApplicationsManager has multiple roles: accepting job submissions, negotiating the execution of the application-specific ApplicationMaster and providing a service for restarting the ApplicationMaster container on failure.
  • NodeManager. This daemon launches and manages containers on a node and uses them to run specified tasks. NodeManager also runs services that determine the health of the node, such as performing disk checks. Moreover, NodeManager can execute user-specified tasks.

17. What are Hadoop's primary operational modes?

Hadoop supports three primary operational modes:

  • Standalone. Also referred to as Local mode, the Standalone mode is the default mode. It runs as a single Java process on a single node. It also uses the local file system and requires no configuration changes. The Standalone mode is used primarily for debugging purposes.
  • Pseudo-distributed. Also referred to as a single-node cluster, the Pseudo-distributed mode runs on a single machine, but each Hadoop daemon runs in a separate Java process. This mode also uses HDFS, rather than the local file system, and it requires configuration changes. This mode is often used for debugging and testing purposes.
  • Fully distributed. This is the full production mode, with all daemons running on separate nodes in a primary/secondary configuration. Data is distributed across the cluster, which can range from a few nodes to thousands of nodes. This mode requires configuration changes but offers the scalability, reliability and fault tolerance expected of a production system.

18. What are three common input formats in Hadoop?

Hadoop supports multiple input formats, which determine the shape of the data when it is collected into the Hadoop platform. The following input formats are three of the most common:

  • Text. This is the default input format. Each line within a file is treated as a separate record. The records are saved as key/value pairs, with the line of text treated as the value.
  • Key-Value Text. This input format is similar to the Text format, breaking each line into separate records. Unlike the Text format, which treats the entire line as the value, the Key-Value Text format breaks the line itself into a key and a value, using the tab character as a separator.
  • Sequence File. This format reads binary files that store sequences of user-defined key-value pairs as individual records.

Hadoop supports other input formats as well, so you should also have a good understanding of those, in addition to the three described here.
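To illustrate the difference between the first two formats, here is a small plain-Python sketch of how each one would split the same (made-up) input line into a record:

# Sketch: how the Text and Key-Value Text input formats treat one line.
# The sample line is made up for illustration.
line = "user42\tclicked checkout button"

# Text format: the whole line becomes the value; the key (not shown)
# would be the line's byte offset within the file.
text_record = (0, line)

# Key-Value Text format: the line is split at the first tab character.
key, _, value = line.partition("\t")
kv_record = (key, value)

print(text_record)  # (0, 'user42\tclicked checkout button')
print(kv_record)    # ('user42', 'clicked checkout button')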

19. What makes an HDFS environment fault-tolerant?

HDFS can be easily set up to replicate data to different DataNodes. HDFS breaks files down into blocks that are distributed across nodes in the cluster. Each block is also replicated to other nodes. If one node fails, the other nodes take over, allowing applications to access the data through one of the backup nodes.

20. What is rack awareness in Hadoop clusters?

Rack awareness is one of the mechanisms used by Hadoop to optimize data access when processing client read and write requests. When a request comes in, the NameNode identifies and selects the nearest DataNodes, preferably those on the same rack or on nearby racks. Rack awareness can help improve performance and reliability, while reducing network traffic. Rack awareness can also play a role in fault tolerance. For example, the NameNode might place data block replicas on separate racks to help ensure availability in case a network switch fails or a rack becomes unavailable for other reasons.

21. How does Hadoop protect data against unauthorized access?

Hadoop uses the Kerberos network authentication protocol to protect data from unauthorized access. Kerberos uses secret-key cryptography to provide strong authentication for client/server applications. A client must complete the following three basic steps to prove its identity to a server (each of which involves message exchanges with the server):

  • Authentication. The client sends an authentication request to the Kerberos authentication server. The server verifies the client and sends the client a ticket granting ticket (TGT) and a session key.
  • Authorization. Once authenticated, the client requests a service ticket from the ticket granting server (TGS). The TGT must be included with the request. If the TGS can authenticate the client, it sends the service ticket and credentials necessary to access the requested resource.
  • Service request. The client sends its request to the Hadoop resource it is trying to access. The request must include the service ticket issued by the TGS.

22. What is speculative execution in Hadoop?

Speculative execution is an optimization technique that Hadoop uses when it detects that a DataNode is executing a task too slowly. There can be many reasons for a slowdown, and it can be difficult to determine its actual cause. Rather than trying to diagnose and fix the problem, Hadoop identifies the task in question and launches an equivalent task -- the speculative task -- as a backup. If the original task completes before the speculative task, Hadoop kills that speculative task.

23. What is the purpose of the JPS command in Hadoop?

JPS, which is short for Java Virtual Machine Process Status, is a command used to check the status of the Hadoop daemons, specifically NameNode, DataNode, ResourceManager and NodeManager. Administrators can use the command to verify whether the daemons are up and running. The tool returns the process ID and process name of each Java Virtual Machine (JVM) running on the target system.

24. What commands can you use to start and stop all the Hadoop daemons at one time?

You can use the following command to start all the Hadoop daemons:

./sbin/start-all.sh

You can use the following command to stop all the Hadoop daemons:

./sbin/stop-all.sh

Note that recent Hadoop releases mark these combined scripts as deprecated; starting and stopping HDFS and YARN separately with start-dfs.sh, start-yarn.sh, stop-dfs.sh and stop-yarn.sh is the recommended practice.

Figure: Hadoop YARN's architecture

25. What is an edge node in Hadoop?

An edge node is a computer that acts as an end-user portal for communicating with other nodes in a Hadoop cluster. An edge node provides an interface between the Hadoop cluster and an outside network. For this reason, it is also referred to as a gateway node or edge communication node. Edge nodes are often used to run administration tools or client applications. They typically do not run any Hadoop services.

26. What are the key differences between NFS and HDFS?

NFS, which stands for Network File System, is a widely implemented distributed file system protocol used extensively in network-attached storage (NAS) systems. It is one of the oldest distributed file storage systems and is well suited to smaller data sets. NAS makes data available over a network while letting users access it as if it were stored on a local machine.

HDFS is a more recent technology. It is designed for handling big data workloads, providing high throughput and high capacity, far beyond the capabilities of an NFS-based system. HDFS also offers integrated data protections that safeguard against node failures. NFS is typically implemented on single systems that do not include the inherent fault tolerance that comes with HDFS. However, NFS-based systems are usually much less complicated to deploy and maintain than HDFS-based systems.

27. What is commodity hardware?

Commodity hardware is a device or component that is widely available, relatively inexpensive and can typically be used interchangeably with other components. Commodity hardware is sometimes referred to as off-the-shelf hardware because of its ready availability and ease of acquisition. Organizations often choose commodity hardware over proprietary hardware because it is cheaper, simpler and faster to acquire, and it is easier to replace all or some of the components in the event of hardware failure. Commodity hardware might include servers, storage systems, network equipment or other components.

28. What is MapReduce?

MapReduce is a software framework in Hadoop that's used for processing large data sets across a cluster of computers in which each node includes its own storage. MapReduce can process data in parallel on these nodes, making it possible to distribute input data and collate the results. In this way, Hadoop can run jobs split across a massive number of servers. MapReduce also provides its own level of fault tolerance, with each node periodically reporting its status to a primary node. In addition, MapReduce offers native support for writing Java applications, although you can also write MapReduce applications in other programming languages.

29. What are the two main phases of a MapReduce operation?

A MapReduce operation can be divided into the following two primary phases:

  • Map phase. MapReduce processes the input data, splits it into chunks and maps those chunks in preparation for analysis. MapReduce runs these processes in parallel.
  • Reduce phase. MapReduce processes the mapped chunks, aggregating the data based on the defined logic. The output of these phases is then written to HDFS.

MapReduce operations are sometimes divided into phases other than these two. For example, the Reduce phase might be split into the Shuffle phase and the Reduce phase. In some cases, you might also see a Combiner phase, which is an optional phase used to optimize MapReduce operations.
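To make the two phases concrete, here is the classic word count written as a Hadoop Streaming job in Python. This is an illustrative sketch rather than the framework's native Java API: the script name is hypothetical, and it relies on Hadoop Streaming's guarantee that mapper output reaches the reducer sorted by key.

#!/usr/bin/env python3
# Sketch: word count for Hadoop Streaming. Run with "map" or "reduce"
# as the sole argument; Streaming sorts mapper output by key before
# the reducer sees it.
import sys

def mapper():
    # Map phase: split each input line into words and emit (word, 1) pairs.
    for line in sys.stdin:
        for word in line.split():
            print(f"{word}\t1")

def reducer():
    # Reduce phase: counts for one word arrive contiguously, so a
    # running total per key is sufficient.
    current, total = None, 0
    for line in sys.stdin:
        word, _, count = line.rstrip("\n").partition("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = word, 0
        total += int(count)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    mapper() if sys.argv[1] == "map" else reducer()

Submitted through the Hadoop Streaming jar, the same file can serve as both mapper (wordcount.py map) and reducer (wordcount.py reduce).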

30. What is feature selection in big data?

Feature selection refers to the process of selecting only the most relevant variables (features) from a data set for use in analysis. This can reduce the amount of data that needs to be analyzed, while improving the quality of the data that is used. Feature selection makes it possible for data scientists to refine the input variables they use to model and analyze the data, leading to more accurate results while reducing the computational overhead.

Data scientists use sophisticated algorithms for feature selection, which usually fall into one of the following three categories:

  • Filter methods. A subset of input variables is selected during a preprocessing stage by ranking the data based on such factors as importance and relevance.
  • Wrapper methods. This approach is a resource-intensive operation that uses machine learning and predictive analytics to try to determine which input variables to keep, usually providing better results than filter methods.
  • Embedded methods. Embedded methods combine attributes of both the filter and wrapper methods, using fewer computational resources than wrapper methods, while providing better results than filter methods. However, embedded methods are not always as effective as wrapper methods.
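As a hedged illustration of the filter approach, the following scikit-learn sketch ranks candidate input variables with a univariate statistical test; the synthetic data set stands in for real data.

# Sketch: filter-method feature selection with scikit-learn.
# Synthetic data stands in for a real data set; the 20 candidate
# features are ranked by a univariate ANOVA F-test and the top 5 kept.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)

selector = SelectKBest(score_func=f_classif, k=5)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_selected.shape)  # (500, 5)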

31. What is an "outlier" in the context of big data?

An outlier is a data point that's abnormally distant from others in a group of random samples. The presence of outliers can potentially mislead the process of machine learning and result in inaccurate models or substandard outcomes. In fact, an outlier can potentially bias an entire result set. That said, outliers can sometimes contain nuggets of valuable information.

32. What are two common techniques for detecting outliers?

Analysts often use the following two techniques to detect outliers:

  • Extreme value analysis. This is the most basic form of outlier detection and is limited to one-dimensional data. Extreme value analysis determines the statistical tails of the data distribution. The Altman Z-score is a good example of extreme value analysis.
  • Probabilistic and statistical models. These models flag instances that are unlikely under a probabilistic model of the data. Data points with a low probability of membership are marked as outliers. However, these models assume that the data adheres to specific distributions. A common example of this type of outlier detection is the Bayesian probabilistic model.

These are only two of the core methods used to detect outliers. Other approaches include linear regression models, information theoretic models and high-dimensional outlier detection methods, among others.
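As a minimal sketch of extreme value analysis, the following NumPy example flags points by z-score on made-up one-dimensional data. Note that the cutoff (2 here, often 3 in larger samples) is a judgment call, not a fixed rule.

# Sketch: extreme value analysis with z-scores on one-dimensional data.
# The sample values are made up; points more than 2 standard deviations
# from the mean are flagged (a cutoff of 3 is common for larger samples).
import numpy as np

data = np.array([9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 9.7, 42.0])

z_scores = (data - data.mean()) / data.std()
outliers = data[np.abs(z_scores) > 2]

print("z-scores:", np.round(z_scores, 2))
print("outliers:", outliers)  # [42.]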

33. What is the FSCK command used for?

FSCK, which stands for file system consistency check, is an HDFS utility that generates a summary report about the file system's status. However, the report merely identifies the presence of errors; it does not correct them. The FSCK command can be executed against the entire file system (for example, hdfs fsck /) or a select subset of files.

Figure: YARN vs. MapReduce

34. Are you open to gaining additional learning and qualifications that could help you advance your career with us?

Here's your chance to demonstrate your enthusiasm and career ambitions. Of course, your answer will depend on your current level of academic qualifications and certifications, as well as your personal circumstances, which might include family responsibilities and financial considerations. Therefore, respond forthrightly and honestly. Bear in mind that many courses and learning modules are readily available online. Moreover, analytics vendors have established training courses aimed at those seeking to upskill themselves in this domain. You can also inquire about the company's policy on mentoring and coaching.

35. Do you have any questions for us?

As mentioned earlier, it's a good rule of thumb to go to interviews with a few prepared questions. But depending on how the conversation has unfolded during the interview, you might choose not to ask them. For instance, if they've already been answered or the discussion has sparked other, more pertinent queries in your mind, you can put your original questions aside.

You should also be aware of how you time your questions, taking your cue from the interviewer. Depending on the circumstances, it might be acceptable to ask questions during the interview, although it's generally more common to hold off on your questions until the end of the interview. That said, you should never hesitate to ask for clarification on a question the interviewer asks.

A final word on big data interview questions

Remember, the process doesn't stop when the interview ends. After the session, send a note of thanks to the interviewer(s) or your point(s) of contact. Follow up with a second message if you haven't received any feedback within a few days.

The world of big data is expanding continuously and exponentially. If you're serious and passionate about the topic and prepared to roll up your sleeves and work hard, the sky's the limit.
