Tuesday, 29 September 2009

Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development


Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development Summary:
Publisher: Information Science Reference | 426 pages | 2009-07-31 | ISBN: 160566748X | English | PDF | 16.25 MB


Recently, researchers have focused on challenging problems facing the development of data warehousing, knowledge discovery, and data mining applications.


Complex Data Warehousing and Knowledge Discovery for Advanced Retrieval Development: Innovative Methods and Applications provides a comprehensive analysis of current issues and trends in retrieval expansion. Containing research from leading international experts, this book presents future challenges and opportunities in the field valuable to academicians, researchers, and practitioners.


Table of Contents:


Section I: DWH Architectures & Fundamentals


The three chapters in Section I, Data Warehouse Architectures & Fundamentals, present current research trends in data warehouse architecture, storage, and implementation, all aimed at improving performance and response time.


Chapter I: The LBF R-tree: Scalable Indexing and Storage for Data Warehousing Systems


In Chapter I, the authors propose the LBF R-tree, a framework for effective indexing in multi-dimensional database environments. The proposed framework not only improves performance on common user-defined range queries, but also degrades gracefully to a linear scan of the data on pathologically large queries. Experimental results demonstrate both efficient disk access on the LBF R-tree and impressive compression ratios for data and indexes.


Chapter II: Dynamic Workload for Schema Evolution in Data Warehouses: a Performance Issue


Chapter II addresses the evolution and maintenance of the workload in data warehouse systems in response to new requirements arising from users’ personalized analysis needs. The proposed workload management system helps the administrator maintain and dynamically adapt the workload as the data warehouse schema changes, through two types of workload updates: (1) keeping existing queries consistent with the new data warehouse schema and (2) creating new queries based on the new dimension hierarchy levels.


Chapter III: Preview: Optimizing View Materialization Cost in Spatial Data Warehouses


Chapter III presents an optimization approach for implementing materialized views in spatial data warehouses. Because spatial data are larger and spatial operations more complex than their traditional relational counterparts, both the view materialization cost and the on-the-fly computation cost are often extremely high. The authors propose a new notion, called a preview, for which both the materialization and on-the-fly costs are significantly smaller than those of traditional views, so that the total cost is optimized.


Section II: Multidimensional Data and OLAP


Section II consists of three chapters discussing issues and challenges in multidimensional database and Online Analytical Processing (OLAP) environments.


Chapter IV: Decisional Annotations: Integrating and Preserving Decision-Makers’ Expertise in Multidimensional Systems


Chapter IV deals with an annotation-based decisional system. The decisional system is based on multidimensional databases, which are composed of facts and dimensions. The expertise of decision-makers is modeled, shared and stored through annotations. Every piece of multidimensional data can be associated with zero or more annotations which allow decision-makers to carry on active analysis and to collaborate with other decision-makers on a common analysis.


Chapter V: Federated Data Warehouses


Chapter V discusses Federated Data Warehouse Systems, which consist of a collection of Data Marts provided by different enterprises or public organizations. Such systems widen the knowledge base available to business analysts and so enable better-founded strategic decisions. The authors argue that integrating heterogeneous Data Marts at the logical schema level is preferable to migrating the data into a physically new system if the organizations involved remain autonomous. They present a federated architecture that provides a global multi-dimensional schema to which the Data Mart schemas are semantically mapped, repairing all heterogeneities.


Chapter VI: Built-In Indicators to Support Business Intelligence in OLAP Databases


Chapter VI describes algorithms that support business intelligence in OLAP databases by applying Data Mining methods to the multidimensional environment. These methods help end-users’ analysis in two ways. First, they identify the most interesting dimensions to expand in order to explore the data. Second, they automatically detect interesting cells among those selected by the user. The solution relies on a tight coupling between OLAP tools and statistical methods, using built-in indicators computed instantaneously while end-users explore the data cube.


Section III: DWH and OLAP Applications


The next three chapters, in Section III, DWH & OLAP Applications, present some typical applications of Data Warehouse and OLAP technology, as well as the challenges and issues faced in practice.


Chapter VII: Conceptual Data Warehouse Design Methodology for Business Process Intelligence


Chapter VII presents a conceptual framework for adopting the data warehousing technology for business process analysis, with Surgical Workflows Analysis as a challenging real-world application. Deficiencies of the conventional OLAP approach are overcome by proposing an extended multidimensional data model, which enables adequate capturing of flow-oriented data. The model supports a number of advanced properties, such as non-quantitative and heterogeneous facts, many-to-many relationships between facts and dimensions, full and partial dimension sharing, dynamic specification of new measures, and interchangeability of fact and dimension roles.


Chapter VIII: Data Warehouse Facilitating Evidence-Based Medicine


Chapter VIII, Data Warehouse Facilitating Evidence-Based Medicine, covers the deployment of a federated data warehouse approach that integrates a wide range of medical data sources and distributes evidence-based clinical knowledge to support clinical decision makers, primarily clinicians at the point of care. A real-world scenario from emergency and intensive care illustrates a possible application: evidence-based medicine merges data originating in a pharmacy database, a social insurance company database, and diverse clinical DWHs with minimal administration effort.


Chapter IX: Deploying Data Warehouses in Grids with Efficiency and Availability


Chapter IX, Deploying Data Warehouses in Grids with Efficiency and Availability, discusses the deployment of data warehouses over Grids. The authors present the Grid-NPDW architecture, which aims at providing high throughput and data availability in grid-based warehouses. High efficiency in the presence of site failures is also achieved through on-demand query scheduling, data partitioning, and replication. The chapter also describes the main components of the Grid-NPDW Scheduler and presents some experimental results for the proposed strategies.


Section IV: Data Mining Techniques


Section IV, Data Mining Techniques, consists of three chapters that discuss traditional data mining techniques such as clustering, ranking, and classification, with a focus on improving efficiency and performance.


Chapter X: MOSAIC: Agglomerative Clustering with Gabriel Graphs


Chapter X, MOSAIC: Agglomerative Clustering with Gabriel Graphs, introduces MOSAIC, a post-processing technique designed to overcome disadvantages of conventional clustering algorithms, such as their difficulty with non-convex cluster shapes. MOSAIC is an agglomerative technique that greedily merges neighboring clusters to maximize an externally given fitness function; Gabriel graphs are used to determine which clusters are neighboring, and non-convex shapes are approximated as unions of small convex clusters. The experimental results presented show that using MOSAIC leads to clusters of higher quality than running a representative clustering algorithm stand-alone.
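The neighborhood test mentioned above can be illustrated with a minimal sketch. It assumes each cluster is summarized by a single representative point (an assumption made here for illustration, not a detail given in the summary) and uses the standard Gabriel graph definition: two points are neighbors when no third point lies strictly inside the circle whose diameter is the segment joining them. The function name gabriel_edges is hypothetical.

```python
from itertools import combinations


def gabriel_edges(points):
    """Return index pairs (i, j) that are neighbors in the Gabriel graph.

    Brute-force O(n^3) check, fine for a handful of cluster representatives.
    """
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    edges = []
    for i, j in combinations(range(len(points)), 2):
        d_ij = sq_dist(points[i], points[j])
        # Point k lies strictly inside the diameter circle of (i, j)
        # exactly when d(i,k)^2 + d(j,k)^2 < d(i,j)^2.
        blocked = any(
            sq_dist(points[i], points[k]) + sq_dist(points[j], points[k]) < d_ij
            for k in range(len(points))
            if k not in (i, j)
        )
        if not blocked:
            edges.append((i, j))
    return edges


# Hypothetical cluster representatives.
representatives = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.1), (1.0, 3.0)]
print(gabriel_edges(representatives))
```

In an agglomerative scheme of the kind described, only pairs returned by such a test would be considered as merge candidates, which keeps the greedy search from joining clusters that are separated by a third one.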


Chapter XI: Ranking Gradients in Multi-Dimensional Spaces


Chapter XI investigates how to mine and rank the most interesting changes in a multi-dimensional space by applying a promising TOP-K gradient strategy. Interesting changes in customer behavior are usually discovered by gradient queries, which are particular cases of multi-dimensional data analysis on large data warehouses. The main problem, however, is that the more interesting changes should be those with more dimensions in the gradient query (the curse-of-dimensionality dilemma). In addition, the number of interesting changes should be large (the preference selection criteria).


Chapter XII: Simultaneous Feature Selection and Tuple Selection for Efficient Classification


In Chapter XII, a method is proposed that combines feature selection and tuple selection to improve classification accuracy. Although feature selection and tuple selection have each been studied in research areas such as machine learning and data mining, they have rarely been studied together. Both help the classifier to focus better. The method is based on the principle that a representative subset has a histogram similar to that of the full set. The proposed method uses this principle both to choose a subset of features and to choose a subset of tuples. The empirical tests show that the proposed method performs better than several existing feature selection methods.
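The histogram principle can be made concrete with a small sketch. The L1 distance between relative-frequency histograms used below is an illustrative choice, not necessarily the measure used in the chapter, and the function names are hypothetical.

```python
from collections import Counter


def histogram(values):
    """Relative frequency of each distinct value in a sample."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def histogram_distance(full, subset):
    """L1 distance between two value histograms (0.0 means identical)."""
    h_full, h_sub = histogram(full), histogram(subset)
    keys = set(h_full) | set(h_sub)
    return sum(abs(h_full.get(k, 0.0) - h_sub.get(k, 0.0)) for k in keys)


# Hypothetical attribute column and two candidate tuple subsets.
full_column = ["a"] * 4 + ["b"] * 4
balanced_subset = ["a", "b"]
skewed_subset = ["a", "a"]

print(histogram_distance(full_column, balanced_subset))
print(histogram_distance(full_column, skewed_subset))
```

Under this measure the balanced subset is the more representative one, since it reproduces the value distribution of the full column while the skewed subset does not.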


Section V: Advanced Mining Applications


The last four chapters, in Section V, Advanced Mining Applications, introduce innovative algorithms and applications in emerging fields of Data Mining and Knowledge Discovery, especially continuous data stream mining, which cannot be handled by traditional mining technology.


Chapter XIII: Learning Cost-Sensitive Decision Trees to Support Medical Diagnosis


In Chapter XIII, Learning Cost-Sensitive Decision Trees to Support Medical Diagnosis, the authors discuss a cost-sensitive learning method. The chapter aims to enhance the understanding of cost-sensitive learning problems in medicine and presents a strategy for learning and testing cost-sensitive decision trees while considering several types of costs associated with problems in medicine. It begins with a contextualization and a discussion of the main types of costs. It then reviews related work, discusses the evaluation of classifiers, explains a cost-sensitive decision tree strategy, and presents some experimental results.
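As a rough illustration of why costs matter in diagnosis, the sketch below applies the general minimum-expected-cost decision rule to a single prediction. It covers misclassification costs only and is not the chapter's tree-learning strategy; the class names, probabilities, and cost values are invented for the example.

```python
def min_expected_cost_class(probabilities, cost_matrix):
    """Pick the prediction with the lowest expected misclassification cost.

    probabilities: {true_class: P(true_class | patient data)}
    cost_matrix:   {(predicted_class, true_class): cost}
    """
    classes = list(probabilities)

    def expected_cost(predicted):
        return sum(
            probabilities[true] * cost_matrix[(predicted, true)]
            for true in classes
        )

    return min(classes, key=expected_cost)


# Hypothetical numbers: missing a sick patient costs ten times more
# than a false alarm.
probs = {"healthy": 0.8, "sick": 0.2}
costs = {
    ("healthy", "healthy"): 0, ("healthy", "sick"): 10,
    ("sick", "healthy"): 1, ("sick", "sick"): 0,
}
print(min_expected_cost_class(probs, costs))
```

With these numbers the rule predicts "sick" even though "healthy" is more probable, because the expected cost of predicting "healthy" (0.2 × 10) exceeds that of predicting "sick" (0.8 × 1).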


Chapter XIV: An Approximate Approach for Maintaining Recent Occurrences of Itemsets in a Sliding Window over Data Streams


Chapter XIV discusses how to capture recent trends in the data when mining frequent itemsets over data streams. A data representation method, named frequency changing point (FCP), is introduced for monitoring the recent occurrences of itemsets over a data stream, which avoids storing all the transaction data within a sliding window. The effect of old transactions on the mining result of recently frequent itemsets is diminished by applying adjusting rules to the monitoring data structure. Accordingly, the recently frequent itemsets or representative patterns are discovered approximately from the maintained structure. Experimental studies demonstrate that the proposed algorithms achieve high true positive rates and guarantee no false dismissals in the results.


Chapter XV: Protocol Identification of Encrypted Network Streams


Chapter XV proposes a simple machine learning approach for protocol identification in encrypted network streams, where the only information available for identifying the underlying protocol of a connection is the size, timing, and direction of packets. Even with so little information available from the network stream, it is possible to pinpoint activities that may be inappropriate for a workplace, institution, or research center, such as using BitTorrent, GMail, or MSN, without confusing them with other common protocols such as HTTP and SSL.


Chapter XVI: Exploring Calendar-Based Pattern Mining in Data Streams


Chapter XVI introduces calendar-based pattern mining, which aims at identifying patterns on specific calendar partitions in continuous data streams. The authors show how a data warehouse approach can be applied to leverage calendar-based pattern mining in data streams, and how the framework of the DWFIST approach can cope with the tight time constraints imposed by data streams, keep storage requirements at a manageable level and, at the same time, support calendar-based frequent itemset mining. The minimum granularity of analysis, parameters of the data warehouse (e.g. mining minimum support), and parameters of the database (e.g. extent size) provide ways to tune the load performance.
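The idea of mining per calendar partition can be sketched naively as follows. This is a small in-memory batch illustration, not the DWFIST framework: it assumes the weekday as the calendar partition, enumerates itemsets up to a fixed size, and uses a hypothetical function name and made-up transactions.

```python
from collections import Counter, defaultdict
from datetime import datetime
from itertools import combinations


def frequent_itemsets_by_weekday(transactions, min_support, max_size=2):
    """Find frequent itemsets separately within each weekday partition.

    transactions: iterable of (datetime, set_of_items) pairs.
    Returns {weekday_index: {itemset_tuple: support}}, with Monday = 0,
    keeping itemsets whose relative support inside that weekday is at
    least min_support.
    """
    counts = defaultdict(Counter)
    totals = Counter()
    for timestamp, items in transactions:
        day = timestamp.weekday()
        totals[day] += 1
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[day][itemset] += 1
    return {
        day: {
            itemset: count / totals[day]
            for itemset, count in counter.items()
            if count / totals[day] >= min_support
        }
        for day, counter in counts.items()
    }


# Hypothetical stream: two Monday transactions and one Tuesday transaction.
stream = [
    (datetime(2009, 9, 28, 9), {"milk", "bread"}),
    (datetime(2009, 9, 28, 17), {"milk"}),
    (datetime(2009, 9, 29, 9), {"beer", "chips"}),
]
result = frequent_itemsets_by_weekday(stream, min_support=0.6)
print(result)
```

A pattern such as ("beer", "chips") can be frequent on one weekday yet fall below the support threshold over the whole stream, which is the kind of result calendar partitions are meant to expose.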

