Basic concepts in mining data streams

  1. Learning concept
  2. An Introduction to Big Data Concepts and Terminology
  3. Tutorial: Data Stream Mining and Its Applications
  4. Data stream mining
  5. An Overview on Mining Data Streams



Learning concept

Few online classification algorithms based on traditional inductive ensembling, such as online bagging or boosting, handle concept-drifting data streams while also performing well on noisy data. Motivated by this, this paper proposes an incremental algorithm based on Ensemble Decision Trees for Concept-drifting data streams (EDTC). Three variants of random feature selection are introduced to implement split-tests, and two thresholds specified in the Hoeffding bound inequality are used to distinguish concept drifts from noisy data. Extensive studies on synthetic and real streaming databases demonstrate that EDTC performs very well compared to several known online algorithms based on single models and on ensembles, showing that there are multiple viable approaches to learning from concept-drifting data streams under noise.

Introduction

Industry and business applications generate a tremendous amount of data streams, such as telephone call records and sensor network data. Meanwhile, with the development of Web Services technology, streaming data from the Internet span a wide range of domains, including shopping transactions and Internet search requests. In contrast to traditional data sources for data mining, these streaming data present new characteristics: they are continuous, high-volume, open-ended and concept-drifting. Learning from concept-drifting data streams is therefore a challenging and interesting problem...
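As a rough illustration of the second idea, the sketch below computes the Hoeffding bound and uses two confidence levels to separate a confirmed drift from a mere warning. The delta values and the three-way classification are illustrative assumptions, not EDTC's actual parameters:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """With probability 1 - delta, the observed mean of n samples lies
    within this epsilon of the true mean (Hoeffding's inequality)."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2 * n))

def classify_change(observed_diff: float, value_range: float, n: int,
                    delta_warn: float = 0.1, delta_drift: float = 0.001) -> str:
    """Two thresholds (hypothetical values): a loose bound flags a
    *potential* drift, a tight bound confirms it, so transient noise
    does not immediately trigger a model rebuild."""
    eps_warn = hoeffding_bound(value_range, delta_warn, n)
    eps_drift = hoeffding_bound(value_range, delta_drift, n)
    if observed_diff > eps_drift:
        return "drift"
    if observed_diff > eps_warn:
        return "warning"
    return "stable"
```

For example, with 100 observations of a statistic ranging over [0, 1], a difference of 0.5 between old and recent means clears the tight bound and is reported as a drift, while a difference of 0.05 is attributed to noise.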

An Introduction to Big Data Concepts and Terminology

Big data is a blanket term for the non-traditional strategies and technologies needed to gather, organize, process, and glean insights from large datasets. While the problem of working with data that exceeds the computing power or storage of a single computer is not new, the pervasiveness, scale, and value of this type of computing have greatly expanded in recent years. In this article, we will talk about big data on a fundamental level and define common concepts you might come across while researching the subject. We will also take a high-level look at some of the processes and technologies currently being used in this space. An exact definition of "big data" is difficult to nail down because projects, vendors, practitioners, and business professionals use it quite differently. With that in mind, generally speaking, big data is:

  • large datasets
  • the category of computing strategies and technologies that are used to handle large datasets

In this context, "large dataset" means a dataset too large to reasonably process or store with traditional tooling or on a single computer. This means that the common scale of big datasets is constantly shifting and may vary significantly from organization to organization. The basic requirements for working with big data are the same as the requirements for working with datasets of any size. However, the massive scale, the speed of ingesting and processing, and the characteristics of the data that must be dealt with at each stage of the ...

Tutorial: Data Stream Mining and Its Applications

Data streams are continuous flows of data. Examples of data streams include network traffic, sensor data, call center records and so on. Their sheer volume and speed pose a great challenge for the data mining community. Data streams exhibit several unique properties: infinite length, concept-drift, concept-evolution, feature-evolution and limited labeled data. Concept-drift occurs when the underlying concept of the data changes over time. Concept-evolution occurs when new classes evolve in the stream. Feature-evolution occurs when the feature set varies over time. Data streams also suffer from scarcity of labeled data, since it is not possible to manually label every data point in the stream. Each of these properties adds a challenge to data stream mining. Multi-step methodologies and multi-scan algorithms suitable for conventional knowledge discovery and data mining cannot be readily applied to data streams, because of well-known limitations such as bounded memory, high-speed data arrival, the need for online/timely processing, and the one-pass constraint (raw data, once processed, is forgotten). In spite of the success and extensive study of stream mining techniques, there is no single tutorial dedicated to a unified study of the new challenges introduced by evolving stream data, such as change detection, novelty detection, and feature evolution. This tutorial presents an organized picture of how to handle various data mining tech...
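The bounded-memory, one-pass constraint mentioned above can be illustrated with reservoir sampling, a classic one-pass technique that maintains a uniform random sample of a stream of unknown length in fixed memory. This is a generic sketch, not a method from the tutorial itself:

```python
import random

def reservoir_sample(stream, k, seed=None):
    """One-pass, bounded-memory uniform sample of k items from a stream
    of unknown length (Vitter's Algorithm R). Each seen item is kept
    with probability k/i after i items have arrived."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)          # fill the reservoir first
        else:
            j = rng.randint(0, i)           # uniform over [0, i]
            if j < k:
                reservoir[j] = item         # replace a random slot
    return reservoir
```

Memory stays at k items no matter how long the stream runs, and each item is inspected exactly once, which is exactly the discipline stream mining algorithms must follow.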

Data stream mining

Data Stream Mining (also known as stream learning) is the process of extracting knowledge structures from continuous, rapid data records. In many data stream mining applications, the goal is to predict the class or value of new instances in the data stream given some knowledge about the class membership or values of previous instances in the stream. Examples of data streams include computer network traffic, phone conversations, ATM transactions, web searches, and sensor data. Data stream mining can be considered a subfield of data mining and machine learning.

Software for data stream mining

  • RiverML: River is a Python library for online machine learning. It is the result of a merger between creme and scikit-multiflow. River's ambition is to be the go-to library for doing machine learning on streaming data.

Books

  • Bifet, Albert; Gavaldà, Ricard; Holmes, Geoff; Pfahringer, Bernhard (2018). Machine Learning for Data Streams with Practical Examples in MOA. Adaptive Computation and Machine Learning. MIT Press. p. 288. ISBN 9780262037792.
  • Gama, João; Gaber, Mohamed Medhat, eds. (2007). Learning from Data Streams: Processing Techniques in Sensor Networks. Springer. p. 244. ISBN 9783540736783.
  • Ganguly, Auroop R.; Gama, João; Omitaomu, Olufemi A.; Gaber, Mohamed M.; Vatsavai, Ranga R., eds. (2008). Knowledge Discovery from Sensor Data. Industrial Innovation. CRC Press. p. 215. ISBN 9781420082326.
  • Gama, ...
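River popularized a per-instance streaming interface built around learn_one and predict_one. As a self-contained illustration of that style (a from-scratch sketch modeled on River's interface, not River code), here is an online logistic regression updated by one SGD step per instance:

```python
import math

class OnlineLogisticRegression:
    """Minimal online learner with a River-style learn_one/predict_one
    interface. Instances are dicts mapping feature names to values."""

    def __init__(self, lr=0.1):
        self.lr = lr
        self.weights = {}   # feature name -> weight, grown lazily
        self.bias = 0.0

    def predict_proba_one(self, x):
        z = self.bias + sum(self.weights.get(f, 0.0) * v for f, v in x.items())
        return 1.0 / (1.0 + math.exp(-z))

    def predict_one(self, x):
        return self.predict_proba_one(x) >= 0.5

    def learn_one(self, x, y):
        # One stochastic gradient step on the log loss for this instance;
        # the model never revisits past data, as stream learning requires.
        error = self.predict_proba_one(x) - float(y)
        self.bias -= self.lr * error
        for f, v in x.items():
            self.weights[f] = self.weights.get(f, 0.0) - self.lr * error * v
        return self
```

The characteristic usage pattern is test-then-train: for each arriving instance, first predict (to evaluate), then learn from the revealed label, keeping memory constant regardless of stream length.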

An Overview on Mining Data Streams

The most challenging applications of knowledge discovery involve dynamic environments where data continuously flow at high speed and exhibit non-stationary properties. In this chapter we discuss the main challenges and issues that arise when learning from data streams, focusing on the most relevant tasks in knowledge discovery from data streams: incremental learning, cost-performance management, change detection, and novelty detection. We present illustrative algorithms for these learning tasks, and a real-world application illustrating the advantages of stream processing. The chapter ends with some open issues that emerge from this new research area.

Keywords: Data Stream • Concept Drift • Decision Node • Load Forecast • Novelty Detection
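Change detection, one of the tasks listed above, can be sketched with the Page-Hinkley test, a standard sequential method that raises an alarm when the cumulative deviation of a stream from its running mean exceeds a threshold. The parameter values below are illustrative defaults, not the chapter's:

```python
class PageHinkley:
    """Page-Hinkley test for detecting an upward shift in a stream's mean."""

    def __init__(self, delta=0.005, threshold=2.0):
        self.delta = delta          # tolerance for small fluctuations
        self.threshold = threshold  # alarm level (often written lambda)
        self.n = 0
        self.mean = 0.0             # running mean of the stream
        self.cumulative = 0.0       # sum of deviations from the mean
        self.minimum = 0.0          # smallest cumulative value seen so far

    def update(self, x):
        """Feed one observation; return True if a change is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cumulative += x - self.mean - self.delta
        self.minimum = min(self.minimum, self.cumulative)
        # Alarm when the cumulative sum climbs far above its past minimum.
        return self.cumulative - self.minimum > self.threshold
```

On a stationary stream the cumulative statistic drifts slowly downward and no alarm fires; after a genuine shift in the mean it climbs steadily and crosses the threshold within a few observations, which is the behavior a drift-aware learner uses to trigger model adaptation.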