data mining Recently Published Documents


Distance Based Pattern Driven Mining for Outlier Detection in High Dimensional Big Dataset

Detecting outliers, or anomalies, is one of the vital issues in pattern-driven data mining. Outlier detection identifies inconsistent behavior in individual objects. It is an important area of data mining with applications such as detecting credit card fraud, discovering hacking, and uncovering criminal activity. Tools are needed to uncover the critical information buried in extensive data. This paper investigates a novel method for detecting cluster outliers in a multidimensional dataset, capable of identifying the clusters and outliers in datasets containing noise. The proposed method can detect the groups and outliers left by the clustering process, like instant irregular sets of clusters (C) and outliers (O), to boost the results. The results obtained after applying the algorithm to the dataset improved in terms of several parameters. For the comparative analysis, the accurate average value and the recall value are computed. The accurate average value is 74.05% for the existing COID algorithm and 77.21% for the proposed algorithm. The average recall value is 81.19% for the existing algorithm and 89.51% for the proposed algorithm, which shows that the proposed method performs better than the existing COID algorithm.
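For orientation only, the general idea of distance-based outlier detection can be sketched as below. This is not the paper's algorithm and not COID; it is the generic k-nearest-neighbour distance score, and the function names, the value of k, and the toy data are all assumptions made for illustration.

```python
import math

def knn_distance_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

def top_outliers(points, k=2, n=1):
    """Return the indices of the n points with the largest k-NN distance."""
    scores = knn_distance_scores(points, k)
    ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Toy data (hypothetical): a tight cluster plus one distant point.
data = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (5.0, 5.0)]
outlier_indices = top_outliers(data, k=2, n=1)
print(outlier_indices)
```

A point far from its neighbours receives a large score, so ranking by score surfaces candidate outliers; the cost is quadratic in the number of points, which is one reason high-dimensional big datasets call for more specialised methods.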

Implementation of Data Mining Technology in Bonded Warehouse Inbound and Outbound Goods Trade

For taxed goods, the actual freight is generally determined by multiplying the allocated freight per kilogram by the actual outgoing weight, based on the outgoing order number on the outgoing bill. Because conventional logistics cannot keep up with the rapid response that e-commerce orders demand, this work discusses the implementation of data mining technology in bonded warehouse inbound and outbound goods trade. Specifically, a bonded warehouse decision-making system was developed, comprising a data warehouse, a conceptual model, an online analytical processing system, a human-computer interaction module, and a web data-sharing platform. The statistical query module can be used to perform statistics and queries on warehousing operations. After optimization of the whole warehousing business process, it takes only 19.1 hours to obtain the actual freight, nearly one third less than before optimization. This study could create a better environment for the development of China's processing trade.

Multi-objective economic load dispatch method based on data mining technology for large coal-fired power plants

User activity classification and domain-wise ranking through social interactions.

Twitter has gained significant prevalence among users across numerous domains, in the majority of countries, and among different age groups. It serves as a real-time micro-blogging service for communication and opinion sharing. Twitter shares its data for research and study purposes by exposing open APIs, which makes it a highly suitable data source for social media analytics. Applying data mining and machine learning techniques to tweets is attracting growing interest. The most prominent problem in social media analytics is automatically identifying and ranking influencers. This research aims to detect users' topics of interest in social media and rank users based on specific topics, domains, etc. A few hybrid parameters are also distinguished in this research, based on the post's content, the post's metadata, the user's profile, and the user's network features, to capture different aspects of being influential; these are used in the ranking algorithm. The results indicate that the proposed approach is effective in both the classification and ranking of individuals in a cluster.

A data mining analysis of COVID-19 cases in states of United States of America

Epidemic diseases can be extremely dangerous because of their hazardous effects. They may harm economies, businesses, the environment, humans, and the workforce. In this paper, some of the factors associated with the COVID-19 pandemic are examined using data mining methodologies and approaches. The analysis yielded a number of rules and insights, and the performance of the data mining algorithms was evaluated. According to the results, the JRip algorithm had the highest correct classification rate and the lowest root mean squared error (RMSE). On these two measures, JRip can be considered an effective method for understanding the factors related to deaths caused by the coronavirus.

Exploring distributed energy generation for sustainable development: A data mining approach

A comprehensive guideline for Bengali sentiment annotation.

Sentiment Analysis (SA) is a Natural Language Processing (NLP) and Information Extraction (IE) task that primarily aims to identify the writer's feelings, expressed as positive or negative, by analyzing a large number of documents. SA is also widely studied in the fields of data mining, web mining, text mining, and information retrieval. The fundamental task in sentiment analysis is to classify the polarity of a given piece of content as Positive, Negative, or Neutral. Although extensive research has been conducted in this area of computational linguistics, most of it has been carried out in the context of the English language. However, Bengali sentiment expression has varying degrees of sentiment labels, which can plausibly differ from those of English. Therefore, it is important that sentiment assessment for the Bengali language be developed and executed properly. In sentiment analysis, the predictive potential of an automatic model depends heavily on the quality of dataset annotation. Bengali sentiment annotation is a challenging task because of the diverse structures (syntax) of the language and its different degrees of innate sentiment (i.e., weakly and strongly positive/negative sentiments). Thus, in this article, we propose a novel and precise guideline for researchers, linguistic experts, and referees to annotate Bengali sentences carefully, with a view to building effective datasets for automatic sentiment prediction.

Capturing Dynamics of Information Diffusion in SNS: A Survey of Methodology and Techniques

Studying information diffusion in social networking services (SNS) has remarkable significance in both academia and industry. Theoretically, it boosts the development of other subjects such as statistics, sociology, and data mining. Practically, diffusion modeling provides fundamental support for many downstream applications (e.g., public opinion monitoring, rumor source identification, and viral marketing). Tremendous efforts have been devoted to this area to understand and quantify information diffusion dynamics. This survey investigates and summarizes the emerging distinguished works in diffusion modeling. We first put forward a unified information diffusion concept in terms of three components: information, user decision, and social vectors, followed by a detailed introduction of the methodologies for diffusion modeling. We then propose a new taxonomy adopting a hybrid philosophy (i.e., granularity and techniques) and make a series of comparative studies of elementary diffusion models under our taxonomy in terms of assumptions, methods, and pros and cons. We further summarize representative diffusion modeling in special scenarios and significant downstream tasks based on these elementary models. Finally, open issues in this field following the methodology of diffusion modeling are discussed.

The Influence of E-book Teaching on the Motivation and Effectiveness of Learning Law by Using Data Mining Analysis

This paper studies the motivation for learning law, compares the teaching effectiveness of two different teaching methods, e-book teaching and traditional teaching, and analyses the influence of e-book teaching on the effectiveness of learning law by using big data analysis. From the perspective of law student psychology, e-book teaching can attract students' attention, stimulate their interest in learning, deepen their impression of knowledge while learning, expand their knowledge, and ultimately improve their performance in practical assessment. Because the sample size is small, the representativeness of the research results may be limited. The findings have referential significance for stimulating motivation to learn law and other theoretical disciplines in colleges and universities, and they provide ideas for reforming teaching modes there. This paper uses a decision tree algorithm in data mining for the analysis and identifies, from the students' perspective, the factors that influence law students' learning motivation and effectiveness in the learning process.

Intelligent Data Mining based Method for Efficient English Teaching and Cultural Analysis

The emergence of online education has greatly helped improve the quality of traditional English teaching. However, it only moves the teaching process from offline to online and does not really change the essence of traditional English teaching. In this work, we study an intelligent English teaching method to further improve the quality of English teaching. Specifically, a random forest is first used to analyze and extract the grammatical and syntactic features of the English text. Then, a decision-tree-based method is proposed to predict grammar or syntax issues in the English text. The evaluation results indicate that the proposed method can effectively improve the accuracy of English grammar or syntax recognition.


Recent Advances in Data Mining

Dear Colleagues,

Data mining is the procedure of identifying valid, potentially suitable, and understandable information; detecting patterns; building knowledge graphs; and finding anomalies and relationships in big data with Artificial-Intelligence-enabled IoT (AIoT). This process is essential for advancing knowledge in various fields dealing with raw data from web, text, numeric, media, or financial transactions. Its scope has expanded through hybridizing various data mining algorithms for use in financial technology and cryptocurrency, the blockchain, data sciences, sentiment analysis, and recommender systems. Moreover, data mining provides advantages in many practical fields, such as in preserving the privacy of health data analysis and mining, biology, data security, smart cities, and smart grids. It is also necessary to investigate the recent advances in data mining involving the incorporation of machine learning algorithms and artificial neural networks. Among other fields of artificial intelligence, machine and deep learning are certainly some of the most studied in recent years. There has been a massive shift in the last few decades due to the advent of deep learning, which has opened up unprecedented theoretic and application-based opportunities for data mining. This Topic will present a collection of articles reflecting the latest developments in data mining and related fields, investigating both practical and theoretical applications; knowledge discovery and extraction; image analysis; classification and clustering; FinTech and cryptocurrency; the blockchain and data security; privacy-preserving data mining; and many others. Contributions focused on both theoretical and practical models are welcome. Papers will be selected for inclusion based on their formal and technical soundness, experimental support, and relevance.

Prof. Dr. Qingshan Jiang
Dr. John (Junhu) Wang
Dr. Min Yang
Topic Editors

  • data mining
  • text mining
  • graph mining
  • classification
  • machine learning
  • deep learning
  • knowledge graph
  • knowledge discovery and extraction
  • artificial intelligence
  • statistical modeling
  • privacy-preserving data mining
  • social networks analysis
  • natural language processing applications
  • recommendation systems
  • big data storage systems
  • big data analysis
  • data management and analysis
  • FinTech data analysis and cryptocurrency
  • blockchain data security

Journal Name | Launched Year | First Decision (median) | APC
algorithms | 2008 | 15 Days | CHF 1600
applsci | 2011 | 17.8 Days | CHF 2400
electronics | 2012 | 16.8 Days | CHF 2400
energies | 2008 | 17.5 Days | CHF 2600
mathematics | 2013 | 17.1 Days | CHF 2600
(Impact Factor and CiteScore values are not present in the source.)


Published Papers (7 papers)


Wiley Online Library


Statistical Analysis and Data Mining: The ASA Data Science Journal


Edited By: Cinzia Viroli


Statistical Analysis and Data Mining: The ASA Data Science Journal addresses the broad area of data analysis, including data mining algorithms, statistical approaches, and practical applications. Topics include problems involving massive and complex datasets, solutions utilizing innovative data mining algorithms and/or novel statistical approaches.


  • Most Recent

A new logarithmic multiplicative distortion for correlation analysis

  • First Published:  23 August 2024

Revisiting Winnow: A modified online feature selection algorithm for efficient binary classification

  • First Published:  30 July 2024

A random forest approach for interval selection in functional regression

  • First Published:  24 July 2024

Characterizing climate pathways using feature importance on echo state networks

  • First Published:  23 July 2024

Neural interval‐censored survival regression with feature selection

  • First Published:  16 July 2024


Published on behalf of the American Statistical Association


Data Mining Methods and Obstacles: A Comprehensive Analysis

  • February 2024

Sunila Fatima Ahmad at Quaid-i-Azam University

Asad Hussain at International Islamic University, Islamabad

Muhammad Asgher Nadeem at Hanyang University


Review Paper on Data Mining Techniques and Applications

International Journal of Innovative Research in Computer Science & Technology (IJIRCST), Volume-7, Issue-2, March 2019

5 Pages Posted: 2 Mar 2020

GVMGC Sonipat

Date Written: MARCH 31, 2019

Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses predict future trends and behaviors, allowing them to make proactive, knowledge-driven decisions. The aim of this paper is to show the process of data mining and how it can help decision makers make better decisions. In practice, data mining is useful for any organization that holds huge amounts of data. Data mining helps regular databases perform faster. It also helps increase profit, because of the correct decisions made with its help. This paper shows the various steps performed during the process of data mining and how it can be used by various industries to get better answers from huge amounts of data.

Keywords: Data Mining, Regression, Time Series, Prediction, Association


Anshu (Contact Author)


ORIGINAL RESEARCH article

Data mining techniques in analyzing process data: a didactic.

Xin Qiao*

  • University of Maryland, College Park, College Park, MD, United States

Due to the increasing use of technology-enhanced educational assessment, data mining methods have been explored to analyse process data in log files from such assessments. However, most studies were limited to one data mining technique under one specific scenario. The current study demonstrates the usage of four frequently used supervised techniques, including Classification and Regression Trees (CART), gradient boosting, random forest, and support vector machine (SVM), and two unsupervised methods, Self-organizing Map (SOM) and k-means, fitted to the same assessment data. The USA sample (N = 426) from the 2012 Program for International Student Assessment (PISA) responding to problem-solving items is extracted to demonstrate the methods. After concrete feature generation and feature selection, classifier development procedures are implemented using the illustrated techniques. Results show satisfactory classification accuracy for all the techniques. Suggestions for the selection of classifiers are presented based on the research questions, the interpretability and the simplicity of the classifiers. Interpretations for the results from both supervised and unsupervised learning methods are provided.

Introduction

With the advance of technology incorporated in educational assessment, researchers have been intrigued by a new type of data, process data, generated from computer-based assessment, and by new sources of data, such as keystroke or eye tracking data. Such data, often referred to as a “data ocean,” is typically of very large volume and has few ready-to-use features. How to explore, discover and extract useful information from such an ocean has been challenging.

What analyses should be performed on such process data? Although specific analytic methods are needed for data sources with specific features, some common analyses can be performed based on the generic characteristics of log files. Hao et al. (2016) summarized several common analytic actions when introducing the Python package glassPy. Summary information about the log file, such as the number of sessions, the time duration of each session, and the frequency of each event, can be obtained through a summary function. In addition, event n-grams, or event sequences of different lengths, can be formed so that similarity measures can be used to classify and compare persons' performances. To take temporal information into account, hierarchical vectorization of the rank-ordered time intervals and the time interval distribution of event pairs were also introduced. In addition to these common analytic techniques, other existing data analytic methods for process data are Social Network Analysis (SNA; Zhu et al., 2016), Bayesian Networks/Bayes nets (BNs; Levy, 2014), Hidden Markov Model (Jeong et al., 2010), Markov Item Response Theory (Shu et al., 2017), digraphs (DiCerbo et al., 2011) and process mining (Howard et al., 2010). Further, modern data mining techniques, including cluster analysis, decision trees, and artificial neural networks, have been used to reveal useful information about students' problem-solving strategies in various technology-enhanced assessments (e.g., Soller and Stevens, 2007; Kerr et al., 2011; Gobert et al., 2012).
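The event-frequency summary and event n-grams described above can be illustrated with a minimal sketch. It does not use or reproduce the glassPy API; the function names, the toy event logs, and the choice of Jaccard similarity as the similarity measure are assumptions made for this example.

```python
from collections import Counter

def event_ngrams(events, n):
    """Return every contiguous length-n event subsequence as a tuple."""
    return [tuple(events[i:i + n]) for i in range(len(events) - n + 1)]

def ngram_profile(events, n):
    """Count how often each n-gram occurs in one person's event log."""
    return Counter(event_ngrams(events, n))

def jaccard(profile_a, profile_b):
    """Jaccard similarity between the sets of n-grams in two profiles."""
    set_a, set_b = set(profile_a), set(profile_b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)

# Hypothetical event logs for two test takers.
log_a = ["start", "click", "click", "submit"]
log_b = ["start", "click", "reset", "click", "submit"]

event_frequency = Counter(log_a)        # frequency of each event in log_a
profile_a = ngram_profile(log_a, 2)     # bigram counts for log_a
profile_b = ngram_profile(log_b, 2)     # bigram counts for log_b
similarity = jaccard(profile_a, profile_b)
print(event_frequency["click"], similarity)
```

Using sets of n-grams ignores how often each sequence occurs; a count-based measure such as cosine similarity over the `Counter` values would be a reasonable alternative when repetition matters.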

The current study focuses on data mining techniques, and this paragraph briefly reviews related techniques that have been frequently used, along with lessons learned, in analyzing process data in technology-enhanced educational assessment. Two major classes of data mining techniques are supervised and unsupervised learning methods (Fu et al., 2014; Sinharay, 2016). Supervised methods are used when subjects' memberships are known and the purpose is to train a classifier that can precisely classify the subjects into their own category (e.g., score) and then be efficiently generalized to new datasets. Unsupervised methods are used when subjects' memberships are unknown and the goal is to categorize the subjects into clearly separate groups based on features that distinguish them. Decision trees, as a supervised data classification method, have been used very often in analysing process data in educational assessment. DiCerbo and Kidwai (2013) used Classification and Regression Tree (CART) methodology to create a classifier to detect a player's goal in a gaming environment. The authors demonstrated the building of the classifier, including feature generation and the pruning process, and evaluated the results using precision, recall, Cohen's Kappa and A' (Hanley and McNeil, 1982). This study showed that CART could be a reliable automated detector and illustrated how to build such a detector with a relatively small sample size (n = 527). On the other hand, cluster analysis and Self-Organizing Maps (SOMs; Kohonen, 1997) are two well-established unsupervised techniques that categorize students' problem-solving strategies. Kerr et al. (2011) showed that cluster analysis can consistently identify key features in 155 students' performances in log files extracted from an educational gaming and simulation environment called Save Patch (Chung et al., 2010), which measures mathematical competence.
The authors described how they manipulated the data for the application of clustering algorithms and showed evidence that fuzzy cluster analysis is more appropriate than hard cluster analysis in analyzing log file process data from game/simulation environments. Most importantly, the authors demonstrated how cluster analysis can identify both effective strategies and misconceptions students have with respect to the related construct. Soller and Stevens (2007) showed the power of SOM in terms of pattern recognition. They used SOM to categorize 5284 individual problem-solving performances into 36 different problem-solving strategies, each exhibiting different solution frequencies. The authors noted that the 36 strategy classifications can be used as input to a test-level scoring process or externally validated by associating them with other measures. Such detailed classifications can also serve as valuable feedback to students and instructors. Chapters in Williamson et al. (2006) also discussed extensively the promising future of using data mining techniques, like SOM, as an automated scoring method. Fossey (2017) evaluated three unsupervised methods, k-means, SOM and Robust Clustering Using Links (ROCK), for analyzing process data in log files from a game-based assessment scenario.
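Three of the evaluation measures named in this review (precision, recall, and Cohen's Kappa) can be computed from a binary confusion matrix, as in the sketch below. The function name and the toy labels are assumptions for illustration and are not taken from any of the cited studies; A' is omitted because it requires ranked scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, Cohen's kappa) for binary labels."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label lists must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    n = tp + fp + fn + tn

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0

    # Observed agreement and the agreement expected by chance.
    p_observed = (tp + tn) / n
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    kappa = (p_observed - p_expected) / (1 - p_expected) if p_expected != 1 else 1.0
    return precision, recall, kappa

# Toy labels (hypothetical): 1 = target class, 0 = other.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, kappa = binary_metrics(y_true, y_pred)
print(precision, recall, kappa)
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is often reported alongside precision and recall when class sizes are unbalanced.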

To date, however, no study has demonstrated the use of both supervised and unsupervised data mining techniques to analyze the same process data. This study aims to fill this gap and provides a didactic demonstration of analyzing process data from the 2012 PISA log files, retrieved from one of the problem-solving items, using both types of data mining methods. This log file is well-structured and representative of what researchers may encounter in complex assessments, and is thus suitable for demonstration purposes. The goal of the current study is 3-fold: (1) to demonstrate the use of data mining methods on process data in a systematic way; (2) to evaluate the consistency of the classification results from different data mining techniques, either supervised or unsupervised, with one data file; and (3) to illustrate how the results from supervised and unsupervised data mining techniques can be used to deal with psychometric issues and challenges.

The subsequent sections are organized as follows. First, the PISA 2012 public dataset, including participants and the problem-solving item analyzed, is introduced. Second, the data analytic methods used in the current study are elaborated and the concrete classifier development processes are illustrated. Third, the results from data analyses are reported. Lastly, the interpretations of the results, limitations of the current study and future research directions are discussed.

Participants

The USA sample (N = 429) was extracted from the 2012 PISA public dataset. Students were from 15 years 3 months old to 16 years 2 months old, representing 15-year-olds in the USA (Organisation for Economic Co-operation Development, 2014). Three students with missing student IDs and school IDs were deleted, yielding a sample of 426 students. There were no missing responses. The dataset was randomly partitioned into a training dataset (n = 320, 75.12%) and a test dataset (n = 106, 24.88%). The training dataset is usually about 2 to 3 times the size of the test dataset to increase the precision of prediction (e.g., Sinharay, 2016; Fossey, 2017).
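The random partition described above can be sketched in a few lines. This is only an illustrative Python sketch (the study itself was carried out in R); the function name, the seed, and the placeholder student IDs are assumptions made for the example and are not taken from the study.

```python
import random

def split_train_test(ids, n_train, seed=0):
    """Randomly partition ids into a training list and a test list."""
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    return shuffled[:n_train], shuffled[n_train:]

# Placeholder IDs standing in for the 426 retained students (hypothetical).
student_ids = range(1, 427)
train_ids, test_ids = split_train_test(student_ids, n_train=320)
print(len(train_ids), len(test_ids))
```

Fixing the seed keeps the partition reproducible, which matters when several classifiers are compared on the same split.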

Instrumentation

There are 42 problem-solving questions in 16 units in 2012 PISA. These items assess cognitive processes in solving real-life problems in computer-based simulated scenarios (Organisation for Economic Co-operation Development, 2014). The problem-solving item TICKETS task2 (CP038Q01) was analyzed in the current study. It is a level-5 question (there were six levels in total) that requires a high level of exploring and understanding ability to solve this complex problem (Organisation for Economic Co-operation Development, 2014). This interactive question requires students to explore and collect the necessary information to make a decision. The main cognitive processes involved in this task are planning and executing. Given the problem-solving scenario, students need to come up with a plan, test it, and modify it if needed. The item asks students to use their concession fare to find and buy the cheapest ticket that allows them to take 4 trips around the city on the subway within 1 day. One possible solution is to choose 4 individual concession tickets for the city subway, which cost 8 zeds; the other is to choose one daily concession ticket for the city subway, which costs 9 zeds. Figure 1 includes these two options. Students can always use the “CANCEL” button before “BUY” to make changes. Correctly completing this task requires students to consider these two alternative solutions, compare their costs, and choose the cheaper one.

Figure 1. PISA 2012 problem-solving question TICKETS task2 (CP038Q01) screenshots. (For a clearer view, please see http://www.oecd.org/pisa/test-2012/testquestions/question5/ ).

This item is scored polytomously with three score points: 0, 1, or 2. Students who derive only one solution and fail to compare it with the other get partial credit. Students who do not come up with either of the two solutions, but rather buy the wrong ticket, get no credit on this item. For example, the last picture in Figure 1 illustrates four individual full-fare tickets for country trains, which cost 72 zeds. “COUNTRY TRAINS” and “FULL FARE” are considered unrelated actions because they are not necessary to accomplish the task this item requires. In terms of scoring, unrelated actions are allowed as long as the students buy the correct ticket in the end and make comparisons during the action process.

Data Description

The PISA 2012 log file dataset for the problem-solving item was downloaded from http://www.oecd.org/pisa/pisaproducts/database-cbapisa2012.htm . The dataset consists of 4722 actions from 426 students as rows and 11 variables as columns. The eleven variables (see Figure 2) are as follows: cnt indicates the country, which is USA in the present study; schoolid and StIDStd indicate the unique school and student IDs, respectively; event_number (ranging from 1 to 47) indicates the cumulative number of actions the student took; event_value (see the raw event values presented in Table 1) gives the specific action the student took at one time stamp; time indicates the exact time stamp (in seconds) corresponding to the event_value; event indicates the nature of the action (start item, end item, or actions in process); lastly, network, fare_type, ticket_type, and number_trips all describe the current choice the student had made. The variables used were schoolid, StIDStd, event_value, and time. The ID variables helped to identify students, while the event_value and time variables were used to generate features. The scores were not provided in the log file and were therefore hand coded and carefully double checked based on the scoring rule. Among the 426 students, 121 (28.4%) got full credit, 224 (52.6%) got partial credit and 81 (19.0%) did not get any credit. Full, partial, and no credit were coded as 2, 1, and 0, respectively.

Figure 2. The screenshot of the log file for one student.

Table 1. 15 raw event values and 36 generated features.

Feature Generation and Selection

Feature Generation

The generated features can be categorized into time features and action features, as summarized in Table 1. Four time features were created: T_time, A_time, S_time, and E_time, indicating total response time, action time spent in process, starting time spent on the first action, and ending time spent on the last action, respectively. It was assumed that students with different ability levels may differ in the time they take to read the question (starting time spent on the first action), the time they spend during the response (action time spent in process), and the time they use to make the final decision (ending time spent on the last action). Different researchers have proposed various joint modeling approaches for both response accuracy and response times, which explain the relationship between the two (e.g., van der Linden, 2007; Bolsinova et al., 2017). Thus, the total response times are expected to differ as well.
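To make the four time features concrete, the following Python sketch derives them from one student's time-stamped events. The exact operational definitions are not spelled out in the text, so the ones used here are assumptions: S_time is taken as the gap between the start of the item and the first action, E_time as the gap between the last action and the end of the item, A_time as the span between the first and last action, and T_time as the whole span. The event labels are likewise hypothetical.

```python
def time_features(events):
    """Derive the four time features for one student.

    events: list of (event_type, time_in_seconds) ordered by event_number.
    The labels 'START_ITEM', 'ACTION' and 'END_ITEM' are hypothetical
    stand-ins for the start-item, in-process and end-item events.
    """
    start = next(t for e, t in events if e == "START_ITEM")
    end = next(t for e, t in events if e == "END_ITEM")
    action_times = [t for e, t in events if e == "ACTION"]
    if not action_times:
        raise ValueError("student took no actions between start and end")
    first, last = action_times[0], action_times[-1]
    return {
        "T_time": end - start,    # total response time (assumed definition)
        "S_time": first - start,  # time before the first action (assumed)
        "A_time": last - first,   # time between first and last action (assumed)
        "E_time": end - last,     # time after the last action (assumed)
    }

# Made-up log for one student.
example_events = [
    ("START_ITEM", 0.0),
    ("ACTION", 12.5),
    ("ACTION", 20.0),
    ("ACTION", 31.0),
    ("END_ITEM", 40.0),
]
features = time_features(example_events)
print(features)
```

Under these assumed definitions the three component times add up to the total response time.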

In this study, action features were created by coding adjacent action sequences of different lengths together. Specifically, this study generated 12 action features consisting of only one action (unigrams), 18 action features containing two ordered adjacent actions (bigrams), and 2 action features created from four sequential actions (four-grams). Further, all generated action sequences were assumed to have equal importance, and no weights were assigned to them. In Table 1, “concession” is a unigram consisting of only one action, that is, the student bought the concession fare; “S_city,” on the other hand, is a bigram consisting of two actions, “Start” and “city subway,” representing that the student selected the city subway ticket after starting the item.

Sao Pedro et al. (2012) showed that generated features should be theoretically important to the construct in order to achieve better interpretability and efficiency. Following their suggestion, features were generated as indicators of the problem-solving ability measured by this item, as supported by the scoring rubric. For example, one action sequence consisting of four actions, coded as “city_con_daily_cancel,” is crucial to scoring: the student first chose “city_subway” to tour the city, then used the student's concession fare (“concession”), next looked at the price of the daily pass (“daily”), and lastly clicked “Cancel” to see the other option. This action sequence is necessary but not sufficient for full credit.

The final recoded dataset for analysis is made up of 426 students as rows and 36 features (including 32 action sequence features and 4 time features) as columns. Scores for each student served as known labels when applying supervised learning methods. The frequency of each generated action feature was calculated for each student.
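The frequency coding of unigrams, bigrams, and four-grams described above can be illustrated with a short Python sketch. The abbreviated action tokens and the list of feature names below are hypothetical, chosen only to resemble the labels in Table 1; features that are not literal n-grams (such as “other_buy”) would need their own rule and are not covered here.

```python
from collections import Counter

def ngram_counts(actions, n):
    """Count every run of n adjacent actions in one student's ordered actions."""
    grams = zip(*(actions[i:] for i in range(n)))
    return Counter("_".join(gram) for gram in grams)

def action_features(actions, feature_names):
    """Frequency of each named unigram, bigram or four-gram for one student."""
    counts = Counter()
    for n in (1, 2, 4):  # the three sequence lengths used in the study
        counts.update(ngram_counts(actions, n))
    return {name: counts.get(name, 0) for name in feature_names}

# Made-up action sequence; tokens must not contain "_" themselves.
actions = ["city", "con", "daily", "cancel",
           "city", "con", "individual", "trip4", "buy"]
names = ["con", "trip4_buy", "daily_buy", "city_con_daily_cancel"]
row = action_features(actions, names)
print(row)
```

Each student contributes one such row, so stacking the rows gives a students-by-features frequency table of the kind described above.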

Feature Selection

The selection of features should be based on both the theoretical framework and the algorithms used. As features were generated from a purely theoretical perspective in this study, no further theoretical consideration is needed in feature selection.

Two other issues that need consideration are redundant variables and variables with little variance. Tree-based methods handle these two issues well and have built-in mechanisms for feature selection. The feature importance indicated by tree-based methods is shown in Figure 3. In both random forest and gradient boosting, the most important feature is “city_con_daily_cancel.” The next most important one is “other_buy,” which means the student did not choose trip_4 before the action “Buy.” The feature importance indicated by tree-based methods is especially helpful when a selection has to be made among hundreds of features, because it can help to narrow down the number of features to track, analyze, and interpret. The classification accuracy of the support vector machine (SVM) can be reduced by redundant variables. However, given that the number of features (36) is relatively small in the current study, deleting highly correlated variables (ρ ≥ 0.8) did not improve classification accuracy for SVM.

Figure 3. Feature importance indicated by tree-based methods.

Clustering algorithms are affected by variables with near-zero variance. Fossey (2017) and Kerr et al. (2011) discarded variables with 5 or fewer attempts in their studies. However, their data were binary, and no clear-cut criterion exists for feature elimination when using clustering algorithms in the analysis of process data. In the current study, 5 features with variance no greater than 0.09 in both the training and test datasets were removed to achieve optimal classification results. Descriptive statistics for all 36 features can be found in Table A1 in Appendix A.
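The variance screen can be sketched as follows in Python. The threshold of 0.09 and the requirement that it hold in both datasets come from the text; the use of the sample variance, the function name, and the toy data are assumptions made for illustration.

```python
from statistics import variance  # sample variance (n - 1 denominator)

def low_variance_features(train, test, threshold=0.09):
    """Names of features whose variance is <= threshold in BOTH datasets.

    train, test: dicts mapping a feature name to its list of values.
    """
    return [name for name in train
            if variance(train[name]) <= threshold
            and variance(test[name]) <= threshold]

# Toy data: feature "rare" is almost constant, feature "common" is not.
train_data = {"rare": [0.0] * 19 + [1.0], "common": [0.0, 1.0, 2.0, 3.0] * 5}
test_data = {"rare": [0.0] * 10, "common": [1.0, 1.0, 2.0, 0.0, 3.0] * 2}
to_drop = low_variance_features(train_data, test_data)
print(to_drop)
```

Features returned by the screen would then be excluded before running SOM and k-means, while the tree-based methods and SVM keep the full set.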

In summary, the full set of 36 features was retained for the tree-based methods and SVM, while 31 features were selected for SOM and k-means after the deletion of features with little variance.

Data Mining Techniques

This study demonstrates how to use data mining techniques to map the selected features (both action and time) to students' performance on this problem-solving item in 2012 PISA. Given that students' item scores are available in the data file, supervised learning algorithms can be trained to classify students based on their known item performance (i.e., score category) in the training dataset, while unsupervised learning algorithms categorize students into groups based on input variables without knowing their item performance. These data mining techniques make no assumptions about the data distribution.

Four supervised learning methods, Classification and Regression Tree (CART), gradient boosting, random forest, and SVM, are explored to develop classifiers, while two unsupervised learning methods, Self-Organizing Map (SOM) and k-means, are used to further examine the different strategies used by students in both the same and different score categories. CART was chosen because it worked effectively in a previous study (DiCerbo and Kidwai, 2013) and is known for its quick computation and simple interpretation. However, it might not perform optimally compared with other methods. Furthermore, small changes in the data can change the tree structure dramatically (Kuhn, 2013). Thus, gradient boosting and random forest, which can improve the performance of trees via ensemble methods, were also used for comparison. Though SVM has not been used much in the analysis of process data yet, it has been applied as one of the most popular and flexible supervised learning techniques in other psychometric analyses such as automatic scoring (Vapnik, 1995). The two clustering algorithms, SOM and k-means, have been applied in the analysis of process data in log files (Stevens and Casillas, 2006; Fossey, 2017). Researchers have suggested using more than one clustering method to validate the clustering solutions (Xu et al., 2013). All the analyses were conducted in the software program RStudio (RStudio Team, 2017).

Classifier Development

The general classifier building process for the supervised learning methods consists of three steps: (1) train the classifier by estimating model parameters; (2) determine the values of tuning parameters to avoid issues such as “overfitting” (i.e., the statistical model fits too closely to one dataset but fails to generalize to other datasets) and finalize the classifier; (3) calculate the accuracy of the classifier based on the test dataset. In general, training and tuning are often conducted on the same training dataset. However, some studies further split the training dataset into two parts, one for training and the other for tuning. Though tree-based methods are not affected by the scaling issue, the training and test datasets are scaled for SVM, SOM, and k-means.

Given the relatively small sample size of the current dataset, the training and tuning processes were both conducted on the training dataset. Classification accuracy was evaluated with the test dataset. For the CART technique, the cost-complexity parameter (cp) was tuned to find the optimal tree depth using the R package rpart. Gradient boosting was carried out using the R package gbm. The tuning parameters for gradient boosting were the number of trees, the complexity of trees, the learning rate and the minimum number of observations in the tree's terminal nodes. Random forest was tuned over the number of predictors sampled for splitting at each node (mtry) using the R package randomForest. A radial basis function kernel SVM, carried out in the R package kernlab, was tuned through two parameters: the scale function σ and the cost value C, which determine the complexity of the decision boundary. After the parameters were tuned, the classifiers were fitted to the training dataset. Ten-fold cross-validation was conducted for the supervised learning methods in the training process. Cross-validation is not necessary for random forest when estimating test error due to its statistical properties (Sinharay, 2016).
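For readers unfamiliar with the mechanics, the 10-fold cross-validation used during tuning amounts to the index bookkeeping sketched below. This Python sketch only illustrates the splitting step under assumed names; it is not the resampling routine of the R packages used in the study, and the model-fitting step is deliberately left out.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Yield (fit_indices, held_out_indices) for each of k folds over n cases."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k nearly equal-sized folds
    for i in range(k):
        held_out = folds[i]
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, held_out

# With the 320 training cases, each fold holds out 32 students.
splits = list(k_fold_indices(320, k=10))
print(len(splits), len(splits[0][0]), len(splits[0][1]))
```

For each candidate value of a tuning parameter, a classifier would be fitted on the `fit` cases and scored on the `held_out` cases, and the value with the best average score across folds would be kept.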

For the unsupervised learning methods, SOM was carried out in the R package kohonen. The learning rate declined from 0.05 to 0.01 over 2000 iterations. k-means was carried out using the kmeans function in the stats R package with 2000 iterations. Euclidean distance was used as the distance measure for both methods. The number of clusters ranged from 3 to 10. The lower bound was set to 3 because of the three score categories in this dataset. The upper bound was set to 10 given the relatively small number of features and the small sample size in the current study. The R code for both the supervised and unsupervised methods can be found in Appendix B.
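As a rough illustration of what k-means with Euclidean distance does, the following Python sketch implements the basic Lloyd iteration. It is a swapped-in teaching version, not the algorithm behind R's kmeans function (which uses the Hartigan-Wong variant by default) and not the authors' code from Appendix B; the function name, seed, and toy data are assumptions.

```python
import math
import random

def kmeans(points, k, iterations=2000, seed=0):
    """Basic Lloyd's k-means on tuples of numbers, using Euclidean distance."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k distinct cases as starting centers
    labels = None
    for _ in range(iterations):
        # Assignment step: each case joins its nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:  # assignments stable, so stop early
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return labels, centers

# Two well-separated toy groups of three cases each.
toy_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
              (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
toy_labels, toy_centers = kmeans(toy_points, k=2)
print(toy_labels)
```

In practice the features would be scaled first, as noted above, so that no single feature dominates the Euclidean distance.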

Evaluation Criterion

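For the supervised methods, students in the test dataset are classified using the classifier developed on the training dataset. The performance of the supervised learning techniques was evaluated in terms of classification accuracy. Outcome measures include overall accuracy, balanced accuracy, sensitivity, specificity, and Kappa. Since there are three item score categories, 0, 1, and 2, sensitivity, specificity and balanced accuracy were calculated for each category, treating that category as the positive class, as follows:

$$\text{Sensitivity} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} \tag{1}$$

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{True Negatives} + \text{False Positives}} \tag{2}$$

$$\text{Balanced Accuracy} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{3}$$

where sensitivity measures the ability to predict positive cases, specificity measures the ability to predict negative cases and balanced accuracy is the average of the two. Overall accuracy and Kappa were calculated for each method based on the following formulas:

$$\text{Overall Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} \tag{4}$$

$$\kappa = \frac{p_o - p_e}{1 - p_e} \tag{5}$$

where overall accuracy measures the proportion of all correct predictions. The Kappa statistic is a measure of concordance for categorical data. In its formula, p_o is the observed proportion of agreement and p_e is the proportion of agreement expected by chance. The larger these five statistics are, the better the classification decisions.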
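The five outcome measures can be computed directly from a confusion matrix, as in the Python sketch below. The per-category statistics use the standard one-vs-rest convention, and the confusion matrix shown is made up for illustration; it is not the study's result.

```python
def classification_metrics(confusion):
    """Accuracy measures from a square confusion matrix.

    confusion[i][j] = number of students with true score i predicted as j.
    """
    k = len(confusion)
    total = sum(map(sum, confusion))
    row = [sum(confusion[i]) for i in range(k)]                       # true counts
    col = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # predicted
    p_o = sum(confusion[i][i] for i in range(k)) / total  # observed agreement
    p_e = sum(row[i] * col[i] for i in range(k)) / total ** 2  # chance agreement
    per_class = {}
    for c in range(k):  # treat score c as "positive", all others as "negative"
        tp = confusion[c][c]
        fn = row[c] - tp
        fp = col[c] - tp
        tn = total - tp - fn - fp
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        per_class[c] = {"sensitivity": sens, "specificity": spec,
                        "balanced_accuracy": (sens + spec) / 2}
    return {"overall_accuracy": p_o,
            "kappa": (p_o - p_e) / (1 - p_e),
            "per_class": per_class}

# Made-up 3 x 3 confusion matrix for scores 0, 1, 2 (106 test cases).
example_confusion = [[18, 2, 0],
                     [1, 54, 1],
                     [0, 1, 29]]
metrics = classification_metrics(example_confusion)
print(round(metrics["overall_accuracy"], 3), round(metrics["kappa"], 3))
```

Because Kappa corrects the observed agreement for chance agreement, it is always somewhat lower than the overall accuracy when the classifier is better than chance.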

For the two unsupervised learning methods, the better fitting method and the number of clusters were determined for the training dataset by the following criteria:

1. The Davies-Bouldin Index (DBI; Davies and Bouldin, 1979), calculated as in Equation 6, can be applied to compare the performance of multiple clustering algorithms (Fossey, 2017). The algorithm with the lower DBI is considered the better fitting one, as it has higher between-cluster variance and smaller within-cluster variance.

$$\mathrm{DBI} = \frac{1}{k}\sum_{i=1}^{k}\max_{j \neq i}\left(\frac{S_i + S_j}{M_{ij}}\right) \tag{6}$$

where k is the number of clusters, S_i and S_j are the average distances from the cluster center to each case in cluster i and cluster j, respectively, and M_ij is the distance between the centers of cluster i and cluster j. For each cluster i, the maximum picks out the cluster j that has the smallest between-cluster distance with cluster i, the highest within-cluster variance, or both (Davies and Bouldin, 1979).

2. The Kappa value (see Equation 5) is a measure of classification consistency between the two unsupervised algorithms. It is usually expected to be no smaller than 0.8 (Landis and Koch, 1977).
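The Davies-Bouldin Index in the first criterion can be computed as in the following Python sketch, which follows the verbal definition given above (average distance to the cluster center for the within-cluster scatter, distance between centers for the separation). The function name and the toy data are assumptions made for illustration; the sketch requires at least two clusters with distinct centers.

```python
import math

def davies_bouldin(points, labels):
    """Davies-Bouldin Index for a clustering; lower values indicate better fit."""
    clusters = sorted(set(labels))
    centers, scatter = {}, {}
    for c in clusters:
        members = [p for p, lab in zip(points, labels) if lab == c]
        centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
        # S_c: average Euclidean distance from the members to their center.
        scatter[c] = sum(math.dist(p, centers[c]) for p in members) / len(members)
    # For each cluster i, keep its worst (largest) similarity ratio.
    worst = [max((scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(clusters)

# Toy example: two tight clusters whose centers are 10 units apart.
dbi_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
dbi_labels = [0, 0, 1, 1]
dbi = davies_bouldin(dbi_points, dbi_labels)
print(dbi)
```

In the toy example each cluster has an average within-cluster distance of 1 and the centers are 10 apart, so every ratio is (1 + 1) / 10 and the index is small, reflecting compact and well-separated clusters.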

To check the classification stability and consistency found in the training dataset, the methods were repeated in the test dataset and the DBI and Kappa values were computed again.

The tuning and training results for the four supervised learning techniques are reported first, followed by the evaluation of their performance on the test dataset. Lastly, the results for the unsupervised learning methods are presented.

Supervised Learning Methods

The tuning processes for all the classifiers reached satisfactory results. For the CART, cp was set to 0.02 to achieve minimum error and the simplest tree structure (error < 0.2, number of trees < 6), as shown in Figure 4. The final tuning parameters for gradient boosting were as follows: the number of trees = 250, the depth of trees = 10, the learning rate = 0.01 and the minimum number of observations in the tree's terminal nodes = 10. Figure 5 shows that when the maximum tree depth equaled 10, the RMSE reached its minimum at 250 iterations with the simplest tree structure. The number of predictors sampled for splitting at each node (mtry) in the random forest was set to 4 to achieve the highest accuracy, as shown in Figure 6. In the SVM, the scale function σ was set to 1 and the cost value C was set to 4 to reach the smallest training error of 0.038.

Figure 4. The CART tuning results for cost-complexity parameter (cp).

Figure 5. The Gradient Boosting tuning results.

Figure 6. The random forest tuning results (peak point corresponds to mtry = 4).

The performance of the four supervised techniques is summarized in Table 2. All four methods performed satisfactorily, with almost all values larger than 0.90. Gradient boosting showed the best classification accuracy overall, exhibiting the highest Kappa and overall accuracy (Kappa = 0.94, overall accuracy = 0.96). Most of its subclass specificity and balanced accuracy values also ranked at the top, with only the sensitivity for score = 0, the specificity for score = 1 and the balanced accuracy for score = 0 smaller than those from SVM. SVM, random forest, and CART performed similarly well, all with slightly smaller Kappa and overall accuracy values (Kappa = 0.92, overall accuracy = 0.95).

Table 2. Average of accuracy measures of the scores.

Among the four supervised methods, the single tree structure from CART, built from the training dataset, is the easiest to interpret and is plotted in Figure 7. Three colors represent the three score categories: red (no credit), gray (partial credit), and green (full credit). The darker the color, the more confident the predicted score in that node and the more precise the classification. Each node contains three lines of numbers. The first line indicates the main score category in that node. The second line gives the proportions of each score category, in the order of scores 0, 1, and 2. The third line is the percentage of students falling into that node. CART has a built-in mechanism for automatically choosing useful features. As shown in Figure 7, only five nodes (features), “city_con_daily_cancel,” “other_buy,” “trip4_buy,” “concession,” and “daily_buy,” were used in branching before the final stage. In each branch, if the student performs the action (>0.5), he/she is classified to the right; otherwise, to the left. As a result, students with full credit were branched into one class, in which 96% truly belonged to this class and which accounted for 29% of the total data points. Students who earned partial credit were partitioned into two classes, one consisting purely of students in this group and the other consisting of 98% students who truly got partial credit. For the no credit group, students were classified into three classes, one consisting purely of students in this group and the other two including 10% and 18% students from other categories. One major benefit of this plot is that we can clearly tell the specific action sequences that led students into each class.

Figure 7. The CART classification.

Unsupervised Learning Methods

As shown in Table 3, the candidates for the best clustering solution from the training dataset were k-means with 5 clusters (DBI = 0.19, kappa = 0.84) and SOM with 9 clusters (DBI = 0.25, kappa = 0.96), which satisfied the criteria of a smaller DBI value and a kappa value ≥ 0.8. When validated with the test dataset, the DBI values for both k-means and SOM increased. This could be caused by the smaller sample size of the test dataset. Due to the low kappa value for the 5-cluster solution in the validation sample, the final decision on the clustering solution was SOM with 9 clusters. The percentage of students in each score category in each cluster is presented in Figure 8. The cluster analysis results obtained from both SOM and k-means can be found in Table A2 in Appendix A.

Table 3. Clustering Algorithms' Fit (DBI) and Agreement (Cohen's Kappa).

Figure 8. Percentage in each score category in the final SOM clustering solution with 9 clusters from the training dataset.

To interpret, label and group the resulting clusters, it is necessary to examine and generalize the students' features and the strategy pattern in each cluster. For alignment with the scoring rubric and ease of interpretation, the nine clusters identified in the training dataset are grouped into five classes and interpreted as follows.

1. Incorrect (cluster 1): students bought neither individual tickets for 4 trips nor a daily ticket.

2. Partially correct (clusters 4 and 5): students bought either individual tickets for 4 trips or a daily ticket but did not compare the prices.

3. Correct (clusters 7 and 8): students did compare the prices between individual tickets and a daily ticket and chose to buy the cheaper one (individual tickets for 4 trips).

4. Unnecessary actions (clusters 2, 3, and 6): students tried options not required by the question, e.g., country train tickets or other numbers of individual tickets.

5. Outlier (cluster 9): the student made too many attempts and is identified as an outlier.

Such grouping and labeling can help researchers better understand the common strategies used by students in each score category. It also helps to identify errors students made and can be a good source of feedback to students. Students whose score does not match the label of their cluster nevertheless share the major characteristics of that cluster. For example, the 4% of students in cluster 4 in the training dataset who got no credit bought a daily ticket for the city subway without comparing the prices, but they bought the full fare instead of using the student's concession fare. These students are different from those in cluster 1, who bought neither daily tickets nor individual tickets for 4 trips. Thus, students in the same score category were classified into different clusters, indicating that they made different errors or took different actions during the problem-solving process. In summary, though students in the same score category generally share the actions they took, they can also follow distinct problem-solving processes. Students in different score categories can also share similar problem-solving processes.

Summary and Discussions

This study analyzed the process data in the log file from one of the 2012 PISA problem-solving items using data mining techniques. The data mining methods used, including CART, gradient boosting, random forest, SVM, SOM, and k-means, yielded satisfactory results with this dataset. The outcomes for the three major purposes of the current study are summarized as follows.

First, to demonstrate the analysis of process data using both supervised and unsupervised techniques, concrete steps in feature generation, feature selection, classifier development and outcome evaluation were presented in the current study. Among all steps, feature generation was the most crucial one because the quality of features determines the classification results to a large extent. Good features should be created based on a thorough understanding of the item scoring procedure and the construct. Key action sequences that can distinguish correct and incorrect answers served as features with good performance. Unexpectedly, time features, including total response time and its components, did not turn out to be important features for classification. This means that considerable variance in response time existed within each score group and that the differences in response time distributions among the groups were not large enough to clearly distinguish the groups (see Figure A1 in Appendix A). This study generated features based on theoretical beliefs about the construct measured and used students as the unit of analysis. The data could be structured in other ways according to different research questions. For example, instead of using students as the unit of analysis, the attempts students made can be used as rows and actions as columns, so that attempts are classified instead of people. Fossey (2017) included a detailed tutorial on clustering algorithms with such a data structure in a game-based assessment.

Second, to evaluate the classification consistency of these frequently used data mining techniques, the current study compared four supervised techniques with different properties, namely, CART, gradient boosting, random forest, and SVM. All four methods achieved satisfactory classification accuracy based on various outcome measures, with gradient boosting showing slightly better overall accuracy and Kappa value. In general, easy interpretability and graphical visualization are the major advantages of trees. Trees also deal well with noisy and incomplete data (James et al., 2013). However, trees are easily influenced by even small changes in the data due to their hierarchical splitting structure (Hastie et al., 2009). SVM, in contrast, generalizes well because once the hyperplane is found, small changes to the data cannot greatly affect it (James et al., 2013). Given the specific dataset in the current study, even the CART method worked very well. In addition, the CART method can be easily understood and provided enough information about the detailed classifications between and within each score category. Thus, based on the results of the current study, the CART method is sufficient for future studies on similar datasets. The unsupervised learning algorithms, SOM and k-means, also showed convergent clustering results based on DBI and Kappa values. In the final clustering solution, students were grouped into 9 clusters, revealing the specific problem-solving processes they went through.

Third, supervised and unsupervised learning methods serve to answer different research questions. Supervised learning methods can be used to train an algorithm to predict memberships in future data, as in automatic scoring. Unsupervised methods can reveal problem-solving strategy patterns and further differentiate students in the same score category. This is especially helpful for formative purposes. Students can be provided with more detailed and individualized diagnostic reports. Teachers can better understand students' strengths and weaknesses, and adjust classroom instruction accordingly or provide more targeted tutoring to specific students. In addition, it is necessary to check for any indication of cheating behavior in the misclassified or outlier cases from both types of data mining methods. For example, students who answered the item correctly within an extremely short amount of time may indicate item compromise.

This study has its own limitations. Other data mining methods, such as other decision tree algorithms and clustering algorithms, are worth investigating. However, the procedure demonstrated in this study can be easily generalized to other algorithms. In addition, the six methods were compared based on the same set of data rather than data under various conditions. Therefore, the generalizability of the current study is limited by factors such as sample size and number of features. Future studies can use a larger sample size and extract more features from more complicated assessment scenarios. Lastly, the current study focuses on only one item for didactic purposes. In future studies, process data for more items can be analyzed simultaneously to get a comprehensive picture of the students.

To sum up, the selection of data mining techniques for the analysis of process data in assessment depends on the purpose of the analysis and the data structure. Supervised and unsupervised techniques essentially serve different purposes in data mining, with the former serving as a confirmatory approach and the latter as an exploratory approach.

Author Contributions

XQ, as the first author, conducted the major part of the study design, data analysis and manuscript writing. HJ, as the second author, participated in the formulation and refinement of the study design and provided crucial guidance in the statistical analysis and manuscript composition.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.02231/full#supplementary-material

Bolsinova, M., De Boeck, P., and Tijmstra, J. (2017). Modeling conditional dependence between response time and accuracy. Psychometrika 82, 1126–1148. doi: 10.1007/s11336-016-9537-6


Chung, G. K. W. K., Baker, E. L., Vendlinski, T. P., Buschang, R. E., Delacruz, G. C., Michiuye, J. K., et al. (2010). “Testing instructional design variations in a prototype math game,” in Current Perspectives From Three National R&D Centers Focused on Game-based Learning: Issues in Learning, Instruction, Assessment, and Game Design, eds R. Atkinson (Chair) (Denver, CO: Structured poster session at the annual meeting of the American Educational Research Association).

Google Scholar

Davies, D. L., and Bouldin, D. W. (1979). A cluster separation measure. IEEE Trans. Pattern Anal. Mach. Intell. 1, 224–227. doi: 10.1109/TPAMI.1979.4766909

PubMed Abstract | CrossRef Full Text | Google Scholar

DiCerbo, K. E., and Kidwai, K. (2013). “Detecting player goals from game log files,” in Poster presented at the Sixth International Conference on Educational Data Mining (Memphis, TN).

DiCerbo, K. E., Liu, J., Rutstein, D. W., Choi, Y., and Behrens, J. T. (2011). “Visual analysis of sequential log data from complex performance assessments,” in Paper presented at the annual meeting of the American Educational Research Association (New Orleans, LA).

Fossey, W. A. (2017). An Evaluation of Clustering Algorithms for Modeling Game-Based Assessment Work Processes. Unpublished doctoral dissertation, University of Maryland, College Park . Available online at: https://drum.lib.umd.edu/bitstream/handle/1903/20363/Fossey_umd_0117E_18587.pdf?sequence=1 (Accessed August 26, 2018).

Fu, J., Zapata-Rivera, D., and Mavronikolas, E. (2014). Statistical Methods for Assessments in Simulations and Serious Games (ETS Research Report Series No. RR-14-12). Princeton, NJ: Educational Testing Service.

Gobert, J. D., Sao Pedro, M. A., Baker, R. S., Toto, E., and Montalvo, O. (2012). Leveraging educational data mining for real-time performance assessment of scientific inquiry skills within microworlds. J. Educ. Data Min. 4, 111–143. Available online at: https://jedm.educationaldatamining.org/index.php/JEDM/article/view/24 (Accessed November 9, 2018)

Hanley, J. A., and McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143, 29–36. doi: 10.1148/radiology.143.1.7063747

Hao, J., Smith, L., Mislevy, R. J., von Davier, A. A., and Bauer, M. (2016). Taming Log Files From Game/Simulation-Based Assessments: Data Models and Data Analysis Tools (ETS Research Report Series No. RR-16-10). Princeton, NJ: Educational Testing Service.

Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd edn. New York, NY: Springer. doi: 10.1007/978-0-387-84858-7

Howard, L., Johnson, J., and Neitzel, C. (2010). “Examining learner control in a structured inquiry cycle using process mining.” in Proceedings of the 3rd International Conference on Educational Data Mining , 71–80. Available online at: https://files.eric.ed.gov/fulltext/ED538834.pdf#page=83 (Accessed August 26, 2018).

James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning, Vol 112 . New York, NY: Springer.

Jeong, H., Biswas, G., Johnson, J., and Howard, L. (2010). “Analysis of productive learning behaviors in a structured inquiry cycle using hidden Markov models,” in Proceedings of the 3rd International Conference on Educational Data Mining , 81–90. Available online at: http://educationaldatamining.org/EDM2010/uploads/proc/edm2010_submission_59.pdf (Accessed August 26, 2018).

Kerr, D., Chung, G., and Iseli, M. (2011). The Feasibility of Using Cluster Analysis to Examine Log Data From Educational Video Games (CRESST Report No. 790). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Center for Studies in Education, UCLA . Available online at: https://files.eric.ed.gov/fulltext/ED520531.pdf (Accessed August 26, 2018).

Kohonen, T. (1997). Self-Organizing Maps . Heidelberg: Springer-Verlag. doi: 10.1007/978-3-642-97966-8

Kuhn, M. (2013). Predictive Modeling With R and the Caret Package [PDF Document] . Available online at: https://www.r-project.org/conferences/useR-2013/Tutorials/kuhn/user_caret_2up.pdf (Accessed November 9, 2018).

Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics 33, 159–174. doi: 10.2307/2529310

Levy, R. (2014). Dynamic Bayesian Network Modeling of Game Based Diagnostic Assessments (CRESST Report No.837). Los Angeles, CA: University of California, National Center for Research on Evaluation, Standards, and Student Testing (CRESST), Center for Studies in Education, UCLA . Available online at: https://files.eric.ed.gov/fulltext/ED555714.pdf (Accessed August 26, 2018).

Organisation for Economic Co-operation and Development (2014). PISA 2012 Results: Creative Problem Solving: Students' Skills in Tackling Real-Life Problems, Vol. 5 . Paris: PISA, OECD Publishing.

RStudio Team (2017). RStudio: Integrated development environment for R (Version 3.4.1) [Computer software] . Available online at: http://www.rstudio.com/

Sao Pedro, M. A., Baker, R. S. J., and Gobert, J. D. (2012). “Improving construct validity yields better models of systematic inquiry, even with less information,” in User Modeling, Adaptation, and Personalization: Proceedings of the 20th UMAP Conference , eds J. Masthoff, B. Mobasher, M. C. Desmarais, and R. Nkambou (Heidelberg: Springer-Verlag), 249–260. doi: 10.1007/978-3-642-31454-4_21

Shu, Z., Bergner, Y., Zhu, M., Hao, J., and von Davier, A. A. (2017). An item response theory analysis of problem-solving processes in scenario-based tasks. Psychol. Test Assess. Model. 59, 109–131. Available online at: https://www.psychologie-aktuell.com/fileadmin/download/ptam/1-2017_20170323/07_Shu.pdf (Accessed November 9, 2018)

Sinharay, S. (2016). An NCME instructional module on data mining methods for classification and regression. Educ. Meas. Issues Pract. 35, 38–54. doi: 10.1111/emip.12115

Soller, A., and Stevens, R. (2007). Applications of Stochastic Analyses for Collaborative Learning and Cognitive Assessment (IDA Document D-3421) . Arlington, VA: Institute for Defense Analysis.

Stevens, R. H., and Casillas, A. (2006). “Artificial neural networks,” in Automated Scoring of Complex Tasks in Computer-Based Testing , eds D. M. Williamson, R. J. Mislevy, and I. I. Bejar (Mahwah, NJ: Lawrence Erlbaum Associates, Publishers), 259–312.

van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika 72, 287–308. doi: 10.1007/s11336-006-1478-z

Vapnik, V. (1995). The Nature of Statistical Learning Theory . New York, NY: Springer-Verlag. doi: 10.1007/978-1-4757-2440-0

Williamson, D. M., Mislevy, R. J., and Bejar, I. I. (2006). Automated Scoring of Complex Tasks in Computer-Based Testing, eds . Mahwah, NJ: Lawrence Erlbaum Associates, Publishers. doi: 10.4324/9780415963572

Xu, B., Recker, M., Qi, X., Flann, N., and Ye, L. (2013). Clustering educational digital library usage data: a comparison of latent class analysis and k-means algorithms. J. Educ. Data Mining 5, 38–68. Available online at: https://jedm.educationaldatamining.org/index.php/JEDM/article/view/21 (Accessed November 9, 2018)

Zhu, M., Shu, Z., and von Davier, A. A. (2016). Using networks to visualize and analyze process data for educational assessment. J. Educ. Meas. 53, 190–211. doi: 10.1111/jedm.12107

Keywords: data mining, log file, process data, educational assessment, psychometric

Citation: Qiao X and Jiao H (2018) Data Mining Techniques in Analyzing Process Data: A Didactic. Front. Psychol . 9:2231. doi: 10.3389/fpsyg.2018.02231

Received: 14 March 2018; Accepted: 29 October 2018; Published: 23 November 2018.

Copyright © 2018 Qiao and Jiao. This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY) . The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Xin Qiao, [email protected]

Machine Learning and Data Mining

  • First Online: 07 September 2024

  • Wolfgang Ertel 14  

Part of the book series: Undergraduate Topics in Computer Science ((UTICS))

One of the major AI applications is the development of intelligent autonomous robots. Since flexibility and adaptivity are important features of really intelligent agents, research into learning mechanisms and the development of machine learning algorithms is one of the most important branches of AI. After motivating and introducing basic concepts of machine learning like classification and approximation, this chapter presents basic supervised learning algorithms such as the perceptron, nearest neighbor methods, and decision tree induction. Unsupervised clustering methods and data mining software tools complete the picture of this fascinating field.

Python is a modern scripting language with very readable syntax, powerful data types, and extensive standard libraries, which can be used to this end.

Here and in the following text, the vector’s identifying number shown by the raised index p is put in parentheses to avoid confusing it with the p th power.

Caution! This is not a proof of convergence for the perceptron learning rule. It only shows that the perceptron converges when the training dataset consists of a single example.

The functionals argmin  and argmax  determine, similarly to min and max, the minimum or maximum of a set or function. However, rather than returning the value of the maximum or minimum, they give the position, that is, the argument in which the extremum appears.
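As a small illustration of the difference between max and argmax described in this note, the following sketch uses plain Python lists. The function names and the sample values are illustrative assumptions, not taken from the chapter.

```python
def argmax(values):
    """Return the index at which the largest value occurs (first one on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


def argmin(values):
    """Return the index at which the smallest value occurs (first one on ties)."""
    return min(range(len(values)), key=lambda i: values[i])


scores = [0.2, 0.9, 0.4]   # hypothetical function values
print(max(scores))         # the value of the maximum
print(argmax(scores))      # the position at which that maximum occurs
```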

The Hamming distance between two bit vectors is the number of bit positions in which the two vectors differ.
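A minimal sketch of this definition, assuming the bit vectors are given as equal-length Python sequences (the function name is an illustrative choice):

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length bit vectors differ."""
    if len(a) != len(b):
        raise ValueError("bit vectors must have the same length")
    return sum(x != y for x, y in zip(a, b))


# The vectors differ in the second and third positions.
print(hamming_distance([1, 0, 1, 1], [1, 1, 0, 1]))  # prints 2
```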

To keep the example simple and readable, the feature vector x was deliberately kept one-dimensional.

The three-day total of snowfall is in fact an important feature for determining the hazard level. In practice, however, additional attributes are used [Bra01]. The example used here is simplified.

In practice only integers are used, but we could just let them take on arbitrary real number values between one and five.

The confusion matrix is the special case of a contingency table with two features.

In (7.9) the natural logarithm rather than \(\log_{2}\) is used in the definition of entropy. Because here, and also in the case of the MaxEnt method, entropies are only being compared, this difference does not play a role (see Exercise 8.14).

It would be better to use the error on the test data directly, at least when the amount of training data is sufficient to justify a separate test set.

Feature scaling is necessary or advantageous for many machine learning algorithms.

The nearest neighbor algorithm is not to be confused with the nearest neighbor method for classification from Sect.  8.3 .

A minimum spanning tree is an acyclic, undirected graph with the minimum sum of edge lengths.

Author information

Authors and affiliations.

Hochschule Ravensburg-Weingarten, Weingarten, Germany

Wolfgang Ertel

Corresponding author

Correspondence to Wolfgang Ertel .

Copyright information

© 2025 The Author(s), under exclusive license to Springer Fachmedien Wiesbaden GmbH, part of Springer Nature

About this chapter

Ertel, W. (2025). Machine Learning and Data Mining. In: Introduction to Artificial Intelligence. Undergraduate Topics in Computer Science. Springer, Wiesbaden. https://doi.org/10.1007/978-3-658-43102-0_8

DOI : https://doi.org/10.1007/978-3-658-43102-0_8

Published : 07 September 2024

Publisher Name : Springer, Wiesbaden

Print ISBN : 978-3-658-43101-3

Online ISBN : 978-3-658-43102-0

eBook Packages : Computer Science Artificial Intelligence (R0)

  • Open access
  • Published: 10 September 2024

Research on controlled mining of end slope fire-burned area in open-pit mine

  • Ziling Song 1 , 2 ,
  • Junfu Fan 4 ,
  • Xiaoliang Zhao 2 ,
  • Yuhang Zhang 3 ,
  • Shiyang Xia 1 &
  • Yifang Long 1  

Scientific Reports volume  14 , Article number:  21152 ( 2024 ) Cite this article

  • Energy science and technology
  • Engineering
  • Mathematics and computing

To address controlled mining of the end-slope fire-burned area in an open-pit mine, multivariate function fitting is applied to model the roofs and floors of the coal seams in a multi-seam open-pit mine, and the change in coal quality across the fire-burned area is introduced as a factor. Coal quality at each location is determined through proximate analysis of coal, a net profit model of the mining area is established, and the net profit at each mining position is obtained by numerical integration. The final mining position is then chosen so that the slope does not fail, with slope stability calculated by strength-reduction numerical simulation. The Dananhu No. 2 open-pit mine in Hami, Xinjiang, China is taken as the engineering background; fire-burned area III extends 240 m within the southern end slope boundary of the first mining area. The optimal mining position is found at an advancement of 182 m, where the ultimate pit slope angle is 25°, the three-dimensional slope stability factor is 1.305, and the profit is 671.96 million yuan. The deep boundary of the southern end slope fire-burned area is thereby reduced by 58 m. This paper solves the end-slope mining problem at the Dananhu No. 2 mine and maximizes its net profit while ensuring safe production.

Introduction

A large number of fire-burned areas (FBA) are distributed across the open-pit mining areas of northwest China. In Xinjiang in particular, which has abundant coal resources, the typical steppe climate leaves shallow coal seams dry, and once the temperature reaches a certain level the coal combusts spontaneously, producing FBA over a certain regional extent 1 , 2 .

The rocks around the FBA undergo pyrolysis at high temperature, which degrades their mechanical properties 3 . Coal at the surface of the seams is prone to spontaneous combustion, which lowers coal quality and heat content and therefore the sale price. Most open-pit mines respond to FBA by capping them and leaving them unmined, but this lowers the recovery rate of coal resources and conflicts with the concept of green mining 4 , 5 . If all of the coal is extracted instead, the low-quality coal sells at a low price, which indirectly raises the cost of mining and reduces the economic benefits of the mine. The current literature mostly focuses on pyroclastic rocks: Yang Fan investigated elemental migration and mineral phase transformation in the west of the Zhangjiaqiao coal field in the northern Ordos Basin, clarifying the mineral phases and their transformation rules during thermal metamorphism 6 ; Yu Yuanxiang studied the characteristics of pyroclastic rocks, revealing the unloading mechanism of high slopes and determining the safety width of hydraulic coal-rock columns 7 ; and Wang examined pyroclastic soil after pyrolysis, determining its mass loss rate and thermal conductivity at high temperatures 8 . Building on this research, this article introduces coal quality and slope stability as factors and proposes a method for controlled mining of coal seams in fire-burned areas.

Current status of the first mining area of Dananhu No. 2 Mine

The No. 2 open-pit mine in the west area of the Dananhu Mining Area of the CHN Energy Group Shenhua National Energy (hereinafter the Dananhu No. 2 Mine) is located in Hami City, Xinjiang, under the jurisdiction of Nanhu Township, about 84 km from Hami City and 45 km from Nanhu Township 9 . In view of the production capacity of its mining equipment and its annual output, the mine is divided into four mining areas. The first mining area lies in the middle of the mine, and its south side adjoins fire area III. The mining period is 44 years 10 . The current status of the Dananhu No. 2 Mine is shown in Fig. 1 , and the relative positions of the first mining area and fire area III are shown in Fig. 2 .

figure 1

Current status of Dananhu No. 2 open-pit mine in Xinjiang, China.

figure 2

Location of the first mining area.

At present, the first mining area has just reached the stage at which internal dumping begins. Its main coal seams are the well-developed No. 16, 18, 21, 25, and 29 seams. However, the No. 16 and 18 seams are exposed in the southern end slope (SES), where part of the coal has combusted spontaneously and the quality of the deeper coal has been affected. If mining of the No. 16 and 18 coal is abandoned, the No. 21, 25, and 29 coal beneath the SES will be wasted. If the No. 16 and 18 coal is mined, it is uncertain whether it will yield coal of normal quality and economic value. As a result, the internal dump cannot follow closely behind the stope, which has affected daily production. It is therefore urgent to study mining in the SES FBA of the first mining area of the Dananhu No. 2 Mine.

Build function model

Comparison of deep boundaries in the first mining area.

As shown in Fig. 3 , the stripping operation in the first mining area of the Dananhu No. 2 Mine has entered FBA III, which has seriously affected normal production and the economic benefits of the open-pit mine. To study an optimized mining plan for the FBA, it is first necessary to determine a reasonable mining boundary for it by simulating the change in the deep boundary of the first mining area under two extreme cases. In the first case, because spontaneous combustion has lowered the quality of the No. 16 and 18 coal and the altered rock mechanical parameters have reduced slope stability, the coal resources covered by FBA III in the SES are abandoned. In the second case, mining follows the preliminary design, and all of the No. 16 and 18 coal in the SES is mined to the boundary. In both cases the end slope is designed with the 25° slope angle of the preliminary design. The resulting deep boundaries are compared in Fig. 4 .

figure 3

Mining status of the first mining area in 2024.

figure 4

Floor boundary comparison.

Estimation of coal resources in the study area

For the two mining plans, the mining software 3DMine was used to estimate the quantity of coal in the main seams covered in the area, as shown in Table 1 .

As Table 1 shows, the SES covers more than 50.00 Mt of normally mineable coal resources and nearly 10.00 Mt of coal resources affected by spontaneous combustion to varying degrees.

Construction of multivariate functions of coal seam roof and floor

The above analysis shows that the SES of the first mining area of the Dananhu No. 2 Mine covers a large amount of coal resources, especially of the main coal seams in this area. These reserves are covered because of spontaneous combustion at the outcrops of the uppermost No. 16 and No. 18 coal seams. The quality of the No. 16 and 18 coal in this area has declined to varying degrees, and the surrounding rocks, affected by FBA III, have reduced mechanical parameters. Mining of the main No. 21, 25, and 29 seams in the SES will also be affected by burnt rock. If the SES is mined according to the initially designed slope, two problems may arise. First, as the No. 16 and 18 coal is mined to the boundary, the change in coal quality raises mining costs, so mining all of the No. 16 and 18 coal may not be the economically optimal solution 11 . Second, excessive mining may enlarge the exposed area of burnt rock and destabilize the SES 12 , 13 , 14 . Under the green mining principle for open-pit mines, how to reasonably mine the coal seams in the FBA of the SES and recover the underlying main coal seams is a problem that needs to be solved.

When constructing the block model of coal seam and rock, it is assumed that the relevant mechanical parameters of the rock, strata, and coal seams are consistent along strike and dip. The external influencing factors are then the change in coal quality caused by spontaneous combustion of the coal seams and the decline in rock mechanical parameters caused by burning.

Because the outcrop areas of the No. 16 and 18 coal seams are affected by FBA III and their coal quality has declined, the coal produced there will sell below the normal price to varying degrees, while the No. 21 coal beneath No. 18 is of better quality. For the coal seams in this area, the stripping ratio alone is not a comprehensive measure of mining benefit. The changes in quality of the spontaneously combusting No. 16 and 18 seams and of the better-quality main seams No. 21, 25, and 29 are therefore introduced. Under controlled mining, the economic benefit generated by the coal must be no less than the sum of the other costs incurred by stripping and transportation 15 . As for the slope after mining, the properties of the burnt rock have changed, so keeping the initial boundary slope would leave the slope unstable. In the end, stability of the boundary slope should be the primary goal and economic benefit the secondary goal, and the relationship among boundary mining location, slope stability, and economic benefit should be analyzed comprehensively.

The plan is to construct the net present value function of the main coal seams No. 16, 18, 21, 25, and 29 and to find the most appropriate advance by determining the profit at each mining location. Slope stability under these conditions is then added as a factor to determine the optimal mining location of the coal seams at the SES. In this way, the maximum economic benefit of SES mining and the optimal deep mining position can be achieved.

The smaller the block size in the model, the higher the accuracy, but the larger the storage required and the slower the computation. Normally, the block size depends on the type and scale of the ore body and on the mining method. Cuboids of a given volume are stacked to form the block model, and blocks at the edge of the well field must be divided into smaller sub-blocks so that they follow the coal seams more closely. Even so, some error remains when constraining the edges, and it is more pronounced for coal seams with more complex occurrence 16 .

Given that the main coal seams occur stably at the SES of the Dananhu No. 2 open-pit mine, this paper uses multiple integration to estimate the amount of coal within the boundary before and after mining. To calculate the reserves at each mining stage, the ground surface and the roof and floor of each coal seam must first be fitted as multivariate functions. A three-dimensional rectangular coordinate system was constructed with the roof and floor model of the main coal seams in its first octant, and 80 three-dimensional coordinate points were extracted from the roof and from the floor of each coal seam. Because part of the surface in the study area has already been stripped, 144 three-dimensional coordinate points were taken for the ground surface to improve the accuracy of its model. The coal seam roof and floor elevation model is shown in Fig. 5 .

figure 5

3D coordinate system construction and coal seam roof and floor elevation point extraction.

The three-dimensional coordinate data were imported into Matlab for polynomial fitting, and the fitting results for the roof and floor of each coal seam are shown below. To ensure that the fit reflects the actual occurrence of the coal seam as closely as possible while avoiding over-fitting, the highest degree of the polynomial was set to two. The roof and floor of the No. 16 coal seam are taken as an example; see Figs. 6 and 7 .

figure 6

16 Coal seam roof fitting surface.

figure 7

16 Coal seam floor fitting surface.
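The degree-two surface fit described above can be sketched as an ordinary least-squares problem. The authors used Matlab; the version below is a standard-library Python sketch that solves the normal equations by Gaussian elimination, which is a swapped-in technique chosen for self-containment rather than the authors' procedure (in practice a library routine such as NumPy's `numpy.linalg.lstsq` would be the usual choice). All function names, the coefficient values, and the 8 × 10 grid of sample points are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def quadratic_terms(x, y):
    """Basis of a degree-2 polynomial in two variables."""
    return [1.0, x, y, x * x, x * y, y * y]


def fit_quadratic_surface(points):
    """Least-squares fit of z = c0 + c1*x + c2*y + c3*x^2 + c4*x*y + c5*y^2."""
    rows = [quadratic_terms(x, y) for x, y, _ in points]
    zs = [z for _, _, z in points]
    k = 6
    # Normal equations: (A^T A) c = A^T z
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    atz = [sum(r[i] * z for r, z in zip(rows, zs)) for i in range(k)]
    return solve_linear_system(ata, atz)


def evaluate_surface(coeffs, x, y):
    """Elevation of the fitted surface at (x, y)."""
    return sum(c * t for c, t in zip(coeffs, quadratic_terms(x, y)))


# Hypothetical data: 80 elevation points sampled from a known quadratic surface.
# Coordinates are in hundreds of metres, which keeps the normal equations
# reasonably well conditioned.
true_coeffs = [300.0, -5.0, 2.0, 1.0, -0.2, 0.5]
points = [(float(x), float(y), evaluate_surface(true_coeffs, x, y))
          for x in range(8) for y in range(10)]
coeffs = fit_quadratic_surface(points)
print([round(c, 6) for c in coeffs])
```

Because the sample points here are generated from an exact quadratic, the fit recovers the generating coefficients up to rounding; with real elevation data the residuals would be non-zero and should be inspected.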

The coefficient of determination ( R² ) represents the proportion of the variation in the response variable y that is explained by the independent variable X in the regression model. The larger the R² , the more variation the model explains. To judge whether the model has been constructed successfully, its residuals must also be analyzed. Residual histograms were plotted from the residuals of the three-dimensional coordinate points above, as shown in Fig. 8 .

figure 8

Histogram of residuals.

As Fig. 8 shows, the residual histograms of the roof and floor models of the coal seam conform to a normal distribution, and their R² values are 0.9441 and 0.9747 respectively, which is very high. Some data points are abnormal, but they reflect the actual occurrence conditions of the coal seam, so there is no need to eliminate them. The model is therefore judged to be successfully established.
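For reference, the coefficient of determination and the residuals used in this check can be computed as in the following sketch. The elevation values are hypothetical and serve only to exercise the functions; they are not data from the study.

```python
def residuals(observed, predicted):
    """Observed minus fitted value at each sample point."""
    return [o - p for o, p in zip(observed, predicted)]


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum(r * r for r in residuals(observed, predicted))
    ss_tot = sum((o - mean) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical elevations (m) and fitted values, for illustration only.
observed = [310.0, 305.0, 298.0, 290.0, 285.0]
predicted = [309.0, 306.0, 297.0, 291.0, 284.0]
print(residuals(observed, predicted))
print(round(r_squared(observed, predicted), 4))
```

A histogram of the residuals, as in Fig. 8, is then a matter of binning the list returned by `residuals`.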

From this, the dependence of the roof and floor elevations of the No. 16 coal seam on the two predictors can be expressed by the following two formulas:

Formula ( 1 ) describes the roof of the No. 16 coal seam, with a coefficient of determination of R² = 0.8122. Formula ( 2 ) describes the floor of the No. 16 coal seam, with R² = 0.8140. Both are at a high level.

Final pit slope fitting surface function:

Current end slope fitting surface function:

The fitting functions of other coal seam roofs and floors can be deduced in the same way and will not be described again.

Based on the actual situation on site and the delineation of the initial boundary, and taking into account the actual construction work and production progress, 2D location maps were made of the current SES, the end slope if the current position is taken as the boundary, and the end slope at the boundary of the preliminary design, as shown in Fig. 9 .

figure 9

Schematic diagram of the SES in 2024.

In the figure, which shows the situation in 2024, the pink line is the position of the slope in 2024 and the blue line is the position of the slope in the original design. Since June the stope has stood at the pink line, which raises the difficult question of whether the SES should continue to advance southward.

To solve the above problems and maximize the economic benefit of SES mining, the change in coal quality in the burned areas of the No. 16 and 18 coal is introduced as a factor. Economic benefit is measured not only by the economically reasonable stripping ratio but also with normal coal quality as the reference. The variation in calorific value (kcal) of the No. 16 and 18 coal across the study area is established to determine the coal quality in the area, the corresponding coal price is then determined, and the curve of economic benefit from mining the area is constructed 17 .

The boreholes in the study area and the locations of the boreholes for subsequent supplementary surveys were selected based on the geological survey, as shown in Fig. 10 .

figure 10

Drilling location.

In the shallow outcrop area of SES, coal seam samples were taken from different areas of the same coal seam, and coal quality tests were conducted to determine the coal quality indicators of 16 coal and 18 coal.

First, in accordance with "GB/T 482-2008 Coal Seam Coal Sample Collection Method", fresh coal samples were collected from newly exposed coal walls in unburned, unweathered working areas as standard comparison samples. After being sealed on site, they were brought to the surface and sent to the laboratory 18 . There the samples were crushed and screened and, according to the needs of the different experiments, sealed in ziplock bags by particle size for later use, as shown in Fig. 11 .

figure 11

Coal sample processing.

According to "GB/T 213–2008 Determination of calorific value of coal", the gross calorific value on an air-dried basis, Q gr,ad , of the standard working samples of coal seams 16 and 18 was determined to be 23.36 MJ/kg and 24.07 MJ/kg respectively. Based on the borehole information at the selected locations in the SES, coal from the No. 16 and No. 18 seams was sampled and five industrial indicators of the coal were analyzed. The average of the three sample tests from each borehole was taken as the gross calorific value (air-dried basis) of the coal at that location, giving the variation in calorific value across the borehole locations, as shown in Fig. 12 .

figure 12

Calorific value of drilled coal quality.

Moisture ( M ad ), ash ( A ad ), volatile matter ( V ad ), and fixed carbon ( FC ad ) from the proximate analysis of coal are selected, and these four coal quality indicators are used to calculate the calorific value ( Q ad ) 19 . First, the borehole data in each area are converted relative to the calorific value of the standard coal sample. Second, the calorific values of all locations within the same interval, such as [0,100], are averaged to represent the average calorific value of the coal seam in that interval. Finally, the variation in coal quality across the study area [0,700] is determined through numerical fitting.

The data from each borehole are converted relative to the standard coal sample data collected by the work team, giving the coal quality change curves of the No. 16 and 18 coal in the study area. Values that exceed the standard are counted as 1, and values below the standard are expressed relative to the standard. The result represents the calorific value of the coal seam in the area. The changes in coal quality in each area are shown in Fig. 13 .

figure 13

Coal quality changes in various regions.
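The interval averaging and normalization described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the borehole positions and calorific values are hypothetical, the function name `interval_quality` is introduced here, and the sketch assumes that a value exceeding the standard is capped at 1 while other values are expressed as a ratio to the standard.

```python
from collections import defaultdict


def interval_quality(samples, standard, width=100.0):
    """Average calorific values per interval and normalize by the standard.

    samples  : list of (x_position_m, calorific_value_MJ_per_kg) pairs
    standard : calorific value of the standard coal sample, MJ/kg
    width    : interval length along x, m (100 m as in the text)
    Returns {interval_start_m: quality index in [0, 1]}.
    """
    groups = defaultdict(list)
    for x, q in samples:
        start = int(x // width) * width  # left edge of the interval
        groups[start].append(q)

    index = {}
    for start in sorted(groups):
        mean_q = sum(groups[start]) / len(groups[start])
        # Assumption: values above the standard are taken as 1.
        index[start] = min(mean_q / standard, 1.0)
    return index


# Hypothetical borehole values; 23.36 MJ/kg is the coal 16 standard from the text.
samples = [(20.0, 12.0), (80.0, 11.36), (150.0, 25.0), (260.0, 23.36)]
index = interval_quality(samples, standard=23.36)
```

With these made-up inputs, the first interval averages to half of the standard, and the second is capped at 1 because its value exceeds the standard.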

The calorific value Q ad of the drilling data in this area is fitted into a three-dimensional polynomial function curve through data fitting, thereby simulating the coal quality changes in the X-axis direction of the study area. This curve function is shown in Fig.  14 .

figure 14

Overall changes in coal quality in the study area.
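A polynomial fit of this kind can be sketched with NumPy as shown below. The data points are hypothetical, and the polynomial degree of 3 is an assumption made for illustration, since the fitted function itself is only given in Fig. 14. The names `quality_poly` and `quality` are introduced here.

```python
import numpy as np

STUDY_LENGTH = 700.0  # extent of the study area along x, m (from the text)

# Hypothetical interval midpoints (m) and normalized coal quality values.
x = np.array([50.0, 150.0, 250.0, 350.0, 450.0, 550.0, 650.0])
q = np.array([0.95, 0.80, 0.55, 0.30, 0.45, 0.75, 0.98])

# Fit in the scaled coordinate t = x / 700 so the least-squares
# problem stays well conditioned. Degree 3 is an assumption.
coeffs = np.polyfit(x / STUDY_LENGTH, q, deg=3)
quality_poly = np.poly1d(coeffs)


def quality(pos_m):
    """Fitted coal quality at position pos_m, clipped to [0, 1]."""
    return float(np.clip(quality_poly(pos_m / STUDY_LENGTH), 0.0, 1.0))
```

Clipping keeps the fitted curve inside the range of the normalized quality index, which a raw polynomial does not guarantee between or beyond the data points.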

As can be seen from the above figure, coals 16 and 18 have both undergone spontaneous combustion to a certain extent. Because of its elevation, coal 16 shows large fluctuations in coal quality along the mining direction, with the coal quality varying within [0.25,1]. Coal 18, the main mining coal seam below coal 16, is also subject to partial spontaneous combustion, but its overall fluctuation is small, and its coal quality lies within [0.9,1].

The fitting surface of the coal seam roof and floor and the coal quality change curve are shown in Fig.  15 .

figure 15

Coal seam roof and floor fitting surface.

Using the surface functions of the roof and floor of the coal seam and numerical integration, we can determine the volume of stripped rock and the total tonnage of coal mined in the study area. Before doing so, the x-axis coordinates of the intersection line between each curved surface and the slope must be determined, as shown in Fig. 16 .

figure 16

Coordinates of mining locations in the study area.
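The volume between two fitted surfaces can be approximated numerically, for example with a midpoint rule over a grid. The sketch below is an illustration under simplifying assumptions: the integration region is a rectangle rather than the region bounded by the slope intersection lines, the planar `roof` and `floor` functions are hypothetical stand-ins for the fitted surfaces, and `volume_between` is a name introduced here.

```python
def volume_between(upper, lower, x0, x1, y0, y1, nx=200, ny=200):
    """Midpoint-rule estimate of the volume between two surfaces.

    upper, lower : functions z = f(x, y) giving elevations in m
    x0..x1, y0..y1 : rectangular integration limits in m
    Cells where the upper surface lies below the lower one add nothing.
    """
    dx = (x1 - x0) / nx
    dy = (y1 - y0) / ny
    total = 0.0
    for i in range(nx):
        xc = x0 + (i + 0.5) * dx  # cell centre in x
        for j in range(ny):
            yc = y0 + (j + 0.5) * dy  # cell centre in y
            thickness = upper(xc, yc) - lower(xc, yc)
            total += max(thickness, 0.0) * dx * dy
    return total


# Hypothetical planar roof and floor, 10 m apart.
def roof(x, y):
    return 500.0 + 0.01 * x


def floor(x, y):
    return 490.0 + 0.01 * x


volume = volume_between(roof, floor, 0.0, 193.0, 0.0, 100.0)
```

For this constant 10 m thickness over a 193 m by 100 m rectangle, the estimate matches the exact value of 193,000 m³ up to rounding.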

Take the stripping and mining of coal above 16 as an example:

Topsoil stripping volume above 16 coal:

As the advancement Δ x changes, the topsoil stripping volume formula is as follows:

The volume of 16 coal mined:

As the advancement Δ x changes, the formula for mining 16 coal volume is as follows:

Taking into account the fluctuations in coal quality of 16 and 18 coal, it is necessary to bring in the bulk density and coal quality. Using the coal quality curve of Coal 16, which was fitted based on the “Overall changes in coal quality in the study area” as shown in Fig.  14 , along with the bulk density and price of the coal, these values are substituted into Eq. ( 7 ). The economic benefits of mining each block are then calculated using a differential method. When the advancement degree is Δ x , the positive economic benefit of mining 16 coal is:

Simplify to get:

Among them, γ represents the bulk density of coal and p c represents the pit price of coal. The volume integral formula of the underlying coal seam and rock formation is obtained sequentially by using multiple integration methods, which will not be described again here.
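As a rough illustration of the differential approach, the revenue from a mined strip can be accumulated as volume times bulk density times pit price times the coal quality index. The sketch below is an assumption about the structure of the calculation, not a reproduction of Eq. (7): the constant seam thickness, the strip width, and the function name `coal_revenue` are hypothetical, and the coal yield is ignored. The default values of 1.34 t/m³ and 108.6 yuan/t are the bulk density and pit price that the paper quotes for blended coal.

```python
def coal_revenue(thickness, quality, width, advance,
                 gamma=1.34, price=108.6, n=1000):
    """Approximate revenue (yuan) from mining up to a given advancement.

    thickness : function giving seam thickness (m) at position x
    quality   : function giving the coal quality index in [0, 1] at x
    width     : strip width across the mining direction, m (hypothetical)
    advance   : advancement along x, m
    gamma     : bulk density of coal, t/m^3
    price     : pit price of coal, yuan/t
    """
    dx = advance / n
    total = 0.0
    for i in range(n):
        xc = (i + 0.5) * dx  # midpoint of the thin slice
        slice_volume = thickness(xc) * width * dx
        total += slice_volume * gamma * price * quality(xc)
    return total


# Hypothetical check case: 5 m thick seam, full quality, 100 m wide, 10 m advance.
revenue = coal_revenue(lambda x: 5.0, lambda x: 1.0, width=100.0, advance=10.0)
```

In this check case the mined volume is 5,000 m³, so the revenue is 5,000 × 1.34 × 108.6 = 727,620 yuan.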

Mining net profit model construction

So far, the volume stripped at each mining position and the volume and quality of the coal extracted have been derived. The next step is to build a mathematical model relating the total profit to the advancement. Considering that the mining process stops at transporting coal to the pit entrance, the commercial coal for this project is determined to be one type: blended coal, with a yield of 95.793%, γ of 1.34 t/m³, and p c of 108.6 yuan/t. The dumping site within the eastern dump area has an irregular boundary and is currently in the initial stage of dumping, so it has not affected the recovery of coal seams in the study area. Therefore, secondary stripping is not taken into account when constructing the economic model. Mining costs mainly come from four parts: perforation blasting, mining and loading, transportation, and dumping. The profit is calculated based on the actual annual unit cost of the mine at full production. Some parameters are shown in Table 2 .

The total profit model of this region is:

Objective function:

Curves showing the variation of the operational stripping ratio and net profit with advancement are presented, as shown in Fig.  17 .

figure 17

Operational Stripping Ratio Model and Net Profit Model.
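The relationship between the stripping ratio curve and the net profit curve can be illustrated with a small tabulation over candidate advancements. Everything numeric below is hypothetical: the volume, tonnage, and revenue functions are made-up stand-ins, and the four cost components named in the text are lumped into one unit cost for stripping and one for mining. The function name `evaluate_advances` is introduced here.

```python
def evaluate_advances(advances, rock_volume, coal_tonnage, revenue,
                      strip_cost, mining_cost):
    """Tabulate (advance, stripping ratio, net profit) for each advance.

    rock_volume  : function, cumulative stripped rock volume (m^3) at x
    coal_tonnage : function, cumulative coal mined (t) at x
    revenue      : function, cumulative coal revenue (yuan) at x
    strip_cost   : unit stripping cost, yuan/m^3 (hypothetical)
    mining_cost  : unit mining cost, yuan/t (hypothetical)
    """
    rows = []
    for x in advances:
        rock = rock_volume(x)
        coal = coal_tonnage(x)
        ratio = rock / coal if coal > 0 else float("inf")
        profit = revenue(x) - strip_cost * rock - mining_cost * coal
        rows.append((x, ratio, profit))
    return rows


rows = evaluate_advances(
    advances=range(10, 201, 10),
    rock_volume=lambda x: 50.0 * x ** 2,   # stripping grows faster than coal
    coal_tonnage=lambda x: 1000.0 * x,
    revenue=lambda x: 100.0 * 1000.0 * x,  # 100 yuan per tonne
    strip_cost=10.0,
    mining_cost=20.0,
)
best_advance, best_ratio, best_profit = max(rows, key=lambda row: row[2])
```

In this toy model the stripping ratio rises steadily with advancement, while net profit peaks at an interior point, which is why the profit curve rather than the ratio alone is used to pick the advancement.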

From the operational stripping ratio curve model and the advancement model relating profit and coal quality established above, it is evident that evaluating with the economically reasonable stripping ratio alone is not accurate. When mining the coal seams in the SES of the first mining area, the overall net profit of the mine reaches its maximum of 676,607,690 yuan at an advancement of 193 m. However, it is unreasonable to fix the advancement at 193 m on this basis alone. The SES is affected by the fire area, which reduces the quality of some coal seams and changes some rock mechanical parameters. Part of this area is burnt rock formed by the baking of normal rocks. If excessive mining exposes too much of the burnt rock, it will cause slope instability, which is also undesirable. It is therefore necessary to introduce a slope stability analysis under the changed mechanical parameters and to determine the overall advancement by combining the two factors.

Analysis of slope stability in the study area

Experimental analysis of mechanical parameters of burnt rock.

To determine the slope stability caused by burning in the study area, it is first necessary to determine the changes in rock mechanical parameters of normal rocks in the area and the locations affected by fire. Rock mechanical parameter experiments were conducted through on-site sampling. The results obtained are as follows:

Burnt rock mass has very developed cracks, its physical and mechanical properties are greatly different from those of the original rock, its water absorption rate is large, and it has poor frost resistance and disintegration resistance. It rapidly disintegrates and peels off under the influence of large temperature differences, alternating freeze–thaw atmospheric environment and groundwater, causing rock mass destruction.

The typical samples taken are shown in Fig.  18 .

figure 18

Typical specimen.

The sample is prepared as a cuboid with a length and width of 5 cm and a height of 10 cm for rock deformation testing. The longitudinal (axial) and transverse (radial) deformation of the sample is measured under longitudinal pressure, and the elastic modulus and Poisson's ratio of the rock are calculated from these measurements, as shown in Fig. 19 .

figure 19

Stress and strain experiment.
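The two elastic constants can be estimated from the measured deformation with simple linear fits, as in the sketch below. It assumes that the readings passed in come from the linear elastic part of the loading curve, that compressive axial strain is recorded as positive and lateral expansion as negative, and it uses made-up readings. The function name `elastic_constants` is introduced here.

```python
import numpy as np


def elastic_constants(stress_mpa, axial_strain, lateral_strain):
    """Estimate elastic modulus (MPa) and Poisson's ratio.

    All three inputs are equal-length sequences taken from the
    linear elastic segment of a uniaxial compression test.
    """
    stress = np.asarray(stress_mpa, dtype=float)
    axial = np.asarray(axial_strain, dtype=float)
    lateral = np.asarray(lateral_strain, dtype=float)

    # Elastic modulus: slope of stress versus axial strain.
    modulus = np.polyfit(axial, stress, 1)[0]
    # Poisson's ratio: negative slope of lateral versus axial strain.
    poisson = -np.polyfit(axial, lateral, 1)[0]
    return float(modulus), float(poisson)


# Hypothetical readings: E = 5000 MPa, Poisson's ratio = 0.25.
axial = [0.0, 0.001, 0.002, 0.003]
stress = [0.0, 5.0, 10.0, 15.0]
lateral = [0.0, -0.00025, -0.0005, -0.00075]
modulus, poisson = elastic_constants(stress, axial, lateral)
```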

The stress–strain curve of the burnt rock derived through experimental instruments is shown in Fig.  20 .

figure 20

Stress–strain curve.

Based on the experimental results and the current situation of the open-pit mine, the relevant rock mechanical parameters in the SES of the first mining area were determined as shown in Table 3 .

2D slope stability analysis of typical sections

To study the stability of the SES composite overall slope under different advancement degrees, as shown in Fig.  21 , a typical profile 1 was selected on the SES burned rock slope to establish a slope model, as shown in Fig.  22 . The rigid body limit equilibrium theory is used to quantitatively analyze the slope stability under different advancement rates 20 .

figure 21

Typical section selection.

figure 22

Typical cross section.
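To make the limit equilibrium idea concrete, the sketch below computes a factor of safety with the ordinary method of slices (Fellenius method) for a given slip surface. This method is chosen here only for illustration; the text does not state which limit equilibrium formulation was used. The slice data are hypothetical, pore water pressure is ignored, and `fellenius_fos` is a name introduced here.

```python
import math


def fellenius_fos(slices):
    """Factor of safety by the ordinary method of slices.

    Each slice is a tuple, per unit slope width:
      (weight_kN, base_angle_deg, base_length_m, cohesion_kPa, friction_deg)
    Pore water pressure is not considered.
    """
    resisting = 0.0
    driving = 0.0
    for weight, alpha_deg, length, cohesion, phi_deg in slices:
        alpha = math.radians(alpha_deg)
        phi = math.radians(phi_deg)
        # Shear resistance on the slice base: cohesion plus friction.
        resisting += cohesion * length + weight * math.cos(alpha) * math.tan(phi)
        # Component of the slice weight acting along the base.
        driving += weight * math.sin(alpha)
    return resisting / driving


# Check case: a cohesionless slice whose base angle equals the friction angle.
fos_limit = fellenius_fos([(1000.0, 30.0, 5.0, 0.0, 30.0)])

# Hypothetical three-slice surface with cohesion.
fos_example = fellenius_fos([
    (800.0, 10.0, 6.0, 20.0, 25.0),
    (1500.0, 25.0, 7.0, 20.0, 25.0),
    (900.0, 40.0, 8.0, 20.0, 25.0),
])
```

The check case sits exactly at limiting equilibrium, so its factor of safety is 1. A computed value would then be compared against the required safety reserve factor.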

In this section, the rock formations are distributed nearly horizontally, and the trailing edge of the slope is the FBA. The thickness of the burnt rock on the section slope is 120 m, and the slope angle is 25° according to the final slope angle 21 . Considering that the SES is a fire area and an external dumpsite-end composite slope with a service life of 44 years, the safety reserve factor of this area is determined to be 1.30. Four locations with advancements of 60 m, 120 m, 180 m, and 240 m were selected for simulated mining in the SES. The simulation results are shown in Figs. 23 , 24 , 25 , 26 , 27 .

figure 23

120 m.

figure 26

180 m.

figure 27

240 m.

As the advancement increases, the stability of the slope becomes progressively worse, which is consistent with the inference made in advance. When the advancement is 240 m, the slope stability coefficient is 1.170, which does not meet the selected end slope safety reserve coefficient. At this point, the slip surface is arc sliding, and the exposed position is the fire area. To further improve the accuracy in the 180 ~ 240 m range, the interval [180, 240] is subdivided, and one advancement is selected every ten meters for simulated mining. The results are shown in Figs. 28 , 29 , 30 , 31 , 32 , 33 22 .

figure 28

190 m.

figure 29

200 m.

figure 30

210 m.

figure 31

220 m.

figure 32

It can be seen from the figures that the burnt rock strongly controls the slope surface. As the advancement deepens, the stability of the slope gradually decreases as mining moves from positions close to the fire area into the fire area itself. Once inside the fire area, the stability coefficient decreases rapidly. The change process is shown in Fig. 34 .

figure 34

Slope stability change curve.

It can be seen from the slope stability change curve that when the mining reaches 193 m, that is, when the net profit is maximum, the slope stability coefficient is 1.27, which does not meet the selected safety and stability coefficient. When the advancement distance is 182 m, the slope stability coefficient is 1.30, which is consistent with the selected safety and stability coefficient.

The slope stability and net profit curve are shown in Fig.  35 .

figure 35

Net profit curve model under multiple factors.
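Combining the two criteria amounts to taking the profit-optimal advancement unless the stability requirement is reached first. The sketch below uses only the three stability coefficients stated in the text (1.30 at 182 m, 1.27 at 193 m, and 1.170 at 240 m) and interpolates linearly between them, which is an assumption; the full curve is given in Fig. 34. It also assumes that the coefficient does not increase with advancement and that net profit rises up to its maximum at 193 m. The function name `max_safe_advance` is introduced here.

```python
def max_safe_advance(points, required):
    """Largest advancement whose stability coefficient meets the requirement.

    points   : list of (advance_m, coefficient), sorted by advance, with
               coefficients assumed non-increasing
    required : required safety reserve coefficient
    Returns None if even the first point fails the requirement.
    """
    safe = None
    for (x0, f0), (x1, f1) in zip(points, points[1:]):
        if f0 >= required:
            safe = x0
            if f1 < required:
                # Linear interpolation to where the curve crosses the limit.
                return x0 + (f0 - required) * (x1 - x0) / (f0 - f1)
    if points[-1][1] >= required:
        return points[-1][0]
    return safe


# Stability coefficients stated in the text.
stability_points = [(182.0, 1.30), (193.0, 1.27), (240.0, 1.170)]
safe_advance = max_safe_advance(stability_points, required=1.30)

profit_optimal_advance = 193.0  # advancement of maximum net profit (from the text)
final_advance = min(profit_optimal_advance, safe_advance)
```

With a required coefficient of 1.30, the stability limit of 182 m governs, since it is reached before the profit maximum at 193 m.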

3D slope stability verification

To further verify the stability coefficient at this location, a 3D slope model is established. The model is meshed using the free meshing method, and the surface mesh is then refined using triangles, with the minimum side length set to 10 m and the maximum side length set to 20 m. The mesh is exported as tetrahedral elements in FLAC3D 6.0 format. The resulting mesh is shown in Fig. 36 . The three-dimensional displacement cloud diagram of the slope is shown in Figs. 37 and 38 .

figure 36

Mesh division.

figure 37

3D slope displacement cloud diagram.

figure 38

2D slope displacement cloud diagram.

From the analysis of the above figures, it can be seen that for edge failure in the FBA, the burnt rock area strongly controls the stability of the failure surface. As the advancement increases, the stability of the slope gradually decreases. The decline is particularly obvious where the FBA is exposed at the slope surface, and the slip mode of the slope there is arc sliding. When mining reaches the outcrop position of the FBA, the landslide pattern tends toward layer-cutting and bedding sliding along the floor, with the burnt rock as the weak layer. When the mining position is 182 m, the slope stability coefficient is 1.305. Continued mining will reduce it to below 1.3.

Optimization of the SES deep boundary

Through the above research, the advancement was determined to be 182 m. At this advancement, the deep boundary of the study area is pulled back by 58 m. The changes are shown in Fig. 39 .

figure 39

The optimized deep boundary of the SES.

This article introduces multivariate function fitting into the construction of the roof and floor surfaces of the coal seam. Through the extraction of three-dimensional data points and the use of MATLAB tools, the coal seam roof and floor data are analyzed quantitatively and the calculation error is reduced.

By introducing the factor of coal quality fluctuation in the end slope FBA, a coal quality function that changes with the advancement is established. A net profit model of the mining area is then constructed to replace the traditional stripping ratio for optimization, and the coal quality and net profit at each mining position are determined.

Considering the influence of slope stability on the end slope advancement, and referring to the constructed net profit model, the final mining advancement is determined to be 182 m. The overall stability coefficient of the composite slope at the boundary position is 1.305, and the net profit of mining at this position is 671.96 million yuan.

Compared with the initial design mining boundary, the optimized deep boundary of the southern end slope is reduced by 58 m. Unnecessary stripping and mining of low-quality coal is avoided.

This paper solves the dilemma of mining in the FBA of Dananhu No. 2 mine, and maximizes the net profit of the study area under the premise of ensuring safe production. Its practical relevance is ensured.

The main content of the paper is the mining of end slope coal seams whose quality fluctuates within the FBA. Its focus is therefore on replacing the traditional stripping ratio model with a net profit model built on coal quality fluctuation, and on introducing multivariate function fitting to establish a visual model of the coal seam roof and floor. For the final evaluation of slope stability, the paper accordingly adopts a more intuitive and direct method and uses numerical simulation software to determine the stability of the slope at the end of mining. Compared with determining the change and distribution of pores and fractures in coal 23 , monitoring the crack width on the slope surface by photogrammetry greatly shortens the time required 24 . Follow-up research can discuss these two factors in more depth.

For the stripped waste materials, the loose and porous structure of the burnt rock makes the inner dump and the backfilled part of the open-pit mine poorly stable. For safety reasons, it is necessary to further study the optimization of backfill materials 25 and the increase of backfill intensity to improve the recovery rate of limited resources 26 .

Data availability

The data used to support the findings of this study are available from the corresponding author upon request.

Ziling, S. et al. Dust emission rules and dust suppression technology in open-pit coal mine fire rock mining and installation work. J. Liaoning Univ. Eng. Technol. (Nat. Sci. Ed.) 40 (05), 401–408 (2021).

Hai, W. Research on water damage control technology for burnt rock aquifers in hidden fire areas. Coalfield Geol. Explor. 52 (05), 88–97 (2024).

Jia, Z. Z. et al. Study on blasting fragmentation mechanism of burnt rock in open-pit coal mine. Sci. Rep. 14 , 5034 (2024).

Ziling, S. et al. Research on fuzzy evaluation of ecological environment assessment system for green mining of open-pit coal mines. Coal Sci. Technol. 47 (10), 58–66 (2019).

Bai Runcai, Fu. et al. Safety-green-efficient-low-carbon collaborative mining technology system for open-pit coal mines. J. Coal 49 (01), 298–308 (2024).

Fan, Y. et al. Geochemistry and metamorphic mineral phase characteristics of burnt rocks in shallow coal seams. J. Northwest Univ. (Nat. Sci. Ed.) 54 (02), 329–344 (2024).

Yuanxiang, Yu., Guang, Q. & Pan, C. Research on unloading mechanism and stability of burnt rock high slope in open-pit mine. J. Xi’an Univ. Sci. Technol. 43 (05), 941–951 (2023).

Wang, S. et al. Response of thermal conductivity of loess after high temperature in northern Shaanxi burnt rock area, China. Environ. Sci. Pollut. Res. 30 , 33475–33484 (2023).

Li, J. Research on Optimization of Coal Seam Blasting Scheme in Dananhu No. 2 Open-Pit Mine (Liaoning University of Engineering and Technology, 2023).

Li, C. et al. Research on rock step perforation blasting technology in the fire area of Dananhu No. 2 Mine. China Metal Bull. 12 , 173–175 (2019).

Liu, G. et al. Optimization of working slope configuration in seasonal operations of cold regions open-pit mine. Alex. Eng. J. 87 , 533–542 (2024).

Wang, D. et al. Prediction method for surface deformation around soft base dumpsite based on viscoelastic theory. Coal Sci. Technol. 1–12 (2024).

Li, J. et al. Research on the slope instability and slip mechanism and its control induced by overlying rock migration in Bianbang coal mining. Coal Sci. Technol. 1–16 (2024).

Ma, L. et al. Research on the stability of the dump site in the inclined base considering anti-sliding coal pillars. Coal Sci. Technol. 1–8 (2024).

Guangwei, L. et al. Dynamic control of working shaft morphology and optimization of propulsion intensity in seasonal stripping open-pit coal mines. Coal Sci. Technol. 51 (10), 45–54 (2023).

Cao, B. et al. Research on boundary optimization of adjacent mining areas in open pit coal mine based on calculation of sectional stripping ratio. Sci. Rep. 13 , 21286 (2023).

Honglei, W. et al. Development status and application progress of comprehensive online detection technology of coal quality and quantity. Coal Sci. Technol. 52 (02), 219–237 (2024).

GB/T 482–2008. Sampling of Coal Seams .

Vilakazi, L. & Madyira, D. Estimation of gross calorific value of coal: A literature review. Int. J. Coal Prep. Util. https://doi.org/10.1080/19392699.2024.2339340 (2024).

Jia, L. et al. Landslide risk evaluation method of open-pit mine based on numerical simulation of large deformation of landslide. Sci. Rep. 13 , 15410 (2023).

Guo, X. Research on the Stability of Water-Rich Burned Rock Slopes in Open-Pit Coal Mines (General Institute of Coal Science and Technology, 2021).

Wang, W. et al. Study of roof water inrush control technology and water resources utilization during coal mining in a karst area. Mine Water Environ. 42 , 546–559 (2023).

Li, Z. et al. Microstructure evolution in bituminous-coal pyrolysis under in situ and stress-free conditions: A comparative study. Geomech. Geophys. Geo-energy Geo-resour. 10 , 134 (2024).

Klyuev, R. V., Brigida, V. S., Lobkov, K. Y., Stupina, A. A. & Tynchenko, V. V. On the issue of monitoring crack formation in natural-technical systems during earth surface displacements. MIAB. Mining Inf. Anal. Bull. 11–1 , 292–304 (2023).

Ma, L. et al. Dynamics of backfill compressive strength obtained from enrichment tails for the circular waste management. Resour. Conserv. Recycl. Adv. 23 , 200224 (2024).

Brigida, V. S. et al. Efficiency gains when using activated mill tailings in underground mining. Metallurgist 67 (3), 398–408 (2023).

Acknowledgements

This research was funded by the National Natural Science Foundation of China (51474119), the Liaoning Technical University of Engineering and Technology Ordos Research Institute campus-site science and technology cooperation cultivation project (YJY-XD-2023-027), and the Liaoning Technical University of Engineering and Technology Collaborative Innovation Center for Mining Major Disaster Prevention and Environmental Restoration Open Project (CXZX-2024-01).

Author information

Authors and affiliations.

College of Mining, Liaoning Technical University, Fuxin, 123000, Liaoning, China

Yu Wen, Ziling Song, Shiyang Xia & Yifang Long

College of Environmental Science and Engineering, Liaoning Technical University, Fuxin, 123000, Liaoning, China

Ziling Song & Xiaoliang Zhao

College of Applied Technology and Economic Management, Liaoning Technical University, Fuxin, 123000, Liaoning, China

Yuhang Zhang

College of Resources and Environmental Engineering, Inner Mongolia University of Technology, Hohhot, 010000, China

Contributions

Y.W.: Methodology, Software, Validation, Investigation, Writing-original draft, Writing-review & editing, Visualization. Z.S.: Conceptualization, Formal analysis, Writing-review & editing, Supervision, Project administration, Funding acquisition. J.F.: Validation, Supervision. X.Z.: Investigation, Resources, Funding Acquisition. Y.Z.: Validation, Data curation. S.X.: Supervision, Resources. Y.L.: Resources. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Yu Wen .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Publisher's note.

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/ .

About this article

Cite this article.

Wen, Y., Song, Z., Fan, J. et al. Research on controlled mining of end slope fire-burned area in open-pit mine. Sci Rep 14 , 21152 (2024). https://doi.org/10.1038/s41598-024-72017-7

Received : 02 July 2024

Accepted : 03 September 2024

Published : 10 September 2024

DOI : https://doi.org/10.1038/s41598-024-72017-7

  • Open pit mine
  • Controlled mining
  • Coal quality
  • Slope stability
  • Boundary optimization

research paper for data mining

COMMENTS

  1. data mining Latest Research Papers

    Epidemic diseases can be extremely dangerous with its hazarding influences. They may have negative effects on economies, businesses, environment, humans, and workforce. In this paper, some of the factors that are interrelated with COVID-19 pandemic have been examined using data mining methodologies and approaches.

  2. (PDF) Data mining techniques and applications

    PDF | Data mining is a process which finds useful patterns from large amount of data. The paper discusses few of the data mining techniques, algorithms... | Find, read and cite all the research ...

  3. 345193 PDFs

    Explore the latest full-text research PDFs, articles, conference papers, preprints and more on DATA MINING. Find methods information, sources, references or conduct a literature review on DATA MINING

  4. Home

    Data Mining and Knowledge Discovery - SpringerLink

  5. Knowledge Discovery: Methods from data mining and machine learning

    Knowledge Discovery: Methods from data mining and ...

  6. Data mining

    Data mining - Latest research and news

  7. Data Mining for the Internet of Things: Literature Review and

    In this paper, we survey the data mining in 3 different views: knowledge view, technique view, and application view. In knowledge view, we review classification, clustering, association analysis, time series analysis, and outlier analysis. ... and data mining system area. Based on the survey of the current research, a suggested big data mining ...

  8. Recent Advances in Data Mining

    Feature papers represent the most advanced research with significant potential for high impact in the field. A Feature Paper should be a substantial original Article that involves several techniques or approaches, provides an outlook for future research directions and describes possible research applications. ... Data mining is the procedure of ...

  9. Trends in data mining research: A two-decade review using topic analysis

    The research direction related to practical Applications of data mining also shows a tendency to grow. The last two topics, Text Mining and Data Streams have attracted steady interest from ...

  10. Recent advances in domain-driven data mining

    Data mining research has been significantly motivated by and benefited from real-world applications in novel domains. This special issue was proposed and edited to draw attention to domain-driven data mining and disseminate research in foundations, frameworks, and applications for data-driven and actionable knowledge discovery. Along with this special issue, we also organized a related ...

  11. Statistical Analysis and Data Mining: The ASA Data Science Journal

    Statistical Analysis and Data Mining

  12. Data mining techniques and applications

    This paper reviews data mining techniques and its applications such as educational data mining (EDM), finance, commerce, life sciences and medical etc. We group existing approaches to determine how the data mining can be used in different fields. Our categorization specifically focuses on the research that has been published over the period ...

  13. PDF A comprehensive survey of data mining

    To take a holistic view of the research trends in the area of data mining, a comprehensive survey is presented in this paper. This paper presents a systematic and comprehensive survey of various data mining tasks and techniques. Further, various real-life applications of data mining are presented in this paper.

  14. Data mining articles within Scientific Reports

    Read the latest Research articles in Data mining from Scientific Reports. ... Identifying and overcoming COVID-19 vaccination impediments using Bayesian data mining techniques ... Calls for Papers ...

  15. Adaptations of data mining methodologies: a systematic literature

    The main research objective of this article is to study how data mining methodologies are applied by researchers and practitioners. To this end, we use systematic literature review (SLR) as scientific method for two reasons. Firstly, systematic review is based on trustworthy, rigorous, and auditable methodology.

  16. Data Mining: Data Mining Concepts and Techniques

    There are different process and techniques used to carry out data mining successfully. Published in: 2013 International Conference on Machine Intelligence and Research Advancement. Date of Conference: 21-23 December 2013. Date Added to IEEE Xplore: 09 October 2014. Electronic ISBN:978--7695-5013-8.

  17. Data Mining Methods and Obstacles: A Comprehensive Analysis

    Big data analytics: a literature review paper. in Advances in Data Mining. Applications and Theoretical Aspects: 14th Industrial Conference, ICDM 2014, St. Petersburg ...

  18. Data Mining in Healthcare: Applying Strategic Intelligence Techniques

    Data Mining in Healthcare: Applying Strategic Intelligence ...

  19. A comprehensive survey of data mining

    Data mining plays an important role in various human activities because it extracts the unknown useful patterns (or knowledge). Due to its capabilities, data mining become an essential task in large number of application domains such as banking, retail, medical, insurance, bioinformatics, etc. To take a holistic view of the research trends in the area of data mining, a comprehensive survey is ...

  20. Review Paper on Data Mining Techniques and Applications

    Abstract. Data mining is the process of extracting hidden and useful patterns and information from data. Data mining is a new technology that helps businesses to predict future trends and behaviors, allowing them to make proactive, knowledge driven decisions. The aim of this paper is to show the process of data mining and how it can help ...

  21. Educational Data mining and Learning Analytics: An updated survey

    Educational Data Science (EDS) is defined as the use of data gathered from educational environments/settings for solving educational problems (Romero & Ventura, 2017). Data science is a concept to unify statistics, data analysis, machine learning and their related methods. This survey is an updated and improved version of the previous one ...

  22. Data Mining Techniques in Analyzing Process Data: A Didactic

    This study analyzed the process data in the log file from one of the 2012 PISA problem-solving items using data mining techniques. The data mining methods used, including CART, gradient boosting, random forest, SVM, SOM, and k-means, yielded satisfactory results with this dataset. The three major purposes of the current study were summarized as ...

  23. A sample study on applying data mining research techniques in

    The purpose of this research is to present a sample study analyzing data gathered from an educational study using data mining techniques appropriate for processing these data. In order to achieve this aim, a "Computer Self-efficiency Scale" used in educational sciences was selected and this scale was applied in a study group.

  24. Machine Learning and Data Mining

    Nearest neighbor methods are not suitable when a description of the knowledge extracted from the data must be understandable by humans, which today is the case for many data mining applications (see Sect. 8.6). In recent years these memory-based learning methods are becoming popular, and various improved variants (for example locally weighted ...

  25. Research on controlled mining of end slope fire-burned area in open-pit

    To solve the problem of controlling mining in the open-pit mine end slope fire-burned area, applying multivariate function fitting to the roof and floor modeling of multi-coal seam open pit mines ...