
Data Mining: An Overview
1. Introduction
Every organization accumulates enormous amounts of data from a variety of sources on a daily basis. Data mining is an iterative process of building predictive and descriptive models by uncovering previously unknown trends and patterns in vast quantities of data across the enterprise, in order to support decision making. Text mining applies the same analysis techniques to text-based documents. The knowledge gleaned from data mining and text mining can be used to inform strategic decision making.
During the last decade, a number of knowledge discovery systems have been developed that detect hidden structure in data, such as functional dependencies between attributes, and formulate it as mathematical equations or other symbolic rules. One of the most developed of these systems can learn very complex and varied equations, deals systematically with the analysis of errors in data, and assesses the statistical significance of its results. It is designed to discover empirical laws in data in the form of functional programs built from standard and user-defined functional primitives. Systems that discover dependencies in numerical data use diverse knowledge representation formalisms and search methods, yet they face similar difficulties in their approach.
Traditional document and text management tools are not sufficient to meet these needs. Document management systems work well with homogeneous collections of documents, but not with the heterogeneous mix that knowledge workers face every day.
Even the best Internet search tools suffer from poor precision and recall.
2. The architecture for Data Mining
To be applied to best effect, these advanced techniques must be fully integrated with a data warehouse and with flexible, interactive business analysis tools. Many data mining tools currently operate outside the warehouse and require additional steps for extracting, importing, and analyzing the data. In addition, when new insights require operational implementation, integration with the warehouse simplifies the application of data mining results. The resulting analytic data warehouse can be used to improve business processes across the organization, in areas such as promotional campaign management, fraud detection, and the rollout of new products. Figure 1 illustrates an architecture for advanced analysis in a large data warehouse.
Figure 1 – Integrated Data Mining Architecture
The ideal starting point is a data warehouse containing a combination of internal data tracking all customer contact, coupled with external market data about competitor activity. Background information on potential customers also provides an excellent basis for prospecting. This warehouse can be implemented in a variety of relational database systems (Sybase, Oracle, Redbrick, and so on) and should be optimized for flexible and fast data access.
An OLAP (On-Line Analytical Processing) server enables a more sophisticated end-user business model to be applied when navigating the data warehouse. Its multidimensional structures allow users to analyze the data as they want to view their business: summarizing by product line, region, and other key perspectives. The data mining server must be integrated with the data warehouse and the OLAP server so that ROI-focused business analysis is embedded directly in this infrastructure. An advanced, process-centric metadata template defines the data mining objectives for specific business issues such as campaign management, prospecting, and promotion optimization. Integration with the data warehouse enables operational decisions to be directly implemented and tracked. As the warehouse grows with new decisions and results, the organization can continually mine the best practices and apply them to future decisions.
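As a rough illustration of the multidimensional summaries described above, the following minimal Python sketch rolls up a handful of sales records by product line, by region, and by both. The records, the field order, and the `roll_up` helper are hypothetical and chosen only for illustration; a real OLAP server would typically precompute and index such aggregates rather than scan the rows each time.

```python
from collections import defaultdict

# Hypothetical fact rows: (product_line, region, sales_amount)
SALES = [
    ("shoes", "north", 120.0),
    ("shoes", "south", 80.0),
    ("hats", "north", 40.0),
    ("hats", "south", 60.0),
    ("shoes", "north", 30.0),
]

def roll_up(rows, *dims):
    """Sum the sales amount over the chosen dimensions.

    dims are column positions: 0 = product line, 1 = region.
    """
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dims)
        totals[key] += row[2]
    return dict(totals)

by_product = roll_up(SALES, 0)     # one total per product line
by_region = roll_up(SALES, 1)      # one total per region
by_both = roll_up(SALES, 0, 1)     # one total per (product line, region) cell
```

Each call corresponds to viewing the same data from a different perspective, which is the kind of navigation the multidimensional structures are meant to make cheap.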
2.1. The scope of Data Mining
Data mining derives its name from the similarities between searching for valuable business information in a large database (for example, finding linked products in gigabytes of store scanner data) and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material or intelligently probing it to find exactly where the value lies. Given databases of sufficient size and quality, data mining technology can generate new business opportunities by providing these capabilities:
2.2. Capabilities
- Automated prediction of trends and behaviors. Data mining automates the process of finding predictive information in large databases. Questions that traditionally required extensive hands-on analysis can now be answered directly from the data, and quickly. A typical example of a predictive problem is targeted marketing. Data mining uses data on past promotional mailings to identify the targets most likely to maximize return on investment in future mailings. Other predictive problems include forecasting bankruptcy and other forms of default, and identifying segments of a population likely to respond similarly to given events.
- Automated discovery of previously unknown patterns. Data mining tools sweep through databases and identify previously hidden patterns in one step. An example of pattern discovery is the analysis of retail sales data to identify seemingly unrelated products that are often purchased together. Other pattern discovery problems include detecting fraudulent credit card transactions and identifying anomalous data that could represent data entry keying errors.
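The retail example in the list above, finding products that are often purchased together, can be sketched as a simple co-occurrence count. This is a minimal illustration rather than a full association-rule miner: the basket data, the `frequent_pairs` name, and the `min_support` threshold are all assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return product pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair a canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical transaction data, one list of products per purchase
BASKETS = [
    ["bread", "milk", "batteries"],
    ["bread", "milk"],
    ["milk", "batteries", "bread"],
    ["eggs", "milk"],
]
pairs = frequent_pairs(BASKETS, min_support=2)
```

Counting every pair is quadratic in basket size, so practical tools prune candidates first; the sketch only shows the idea of support counting.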
Data mining techniques can yield the benefits of automation on existing software and hardware platforms, and can be implemented on new systems as existing platforms are upgraded and new products are developed. When data mining tools are implemented on high-performance parallel processing systems, they can analyze massive databases in minutes. Faster processing means that users can automatically experiment with more models to understand complex data. High speed makes it practical for users to analyze huge quantities of data. Larger databases, in turn, yield improved predictions. Databases can be larger in both depth and breadth:
- More columns. Analysts must often limit the number of variables they examine when doing hands-on analysis because of time constraints. Yet variables that are discarded because they seem unimportant may carry information about unknown patterns. High-performance data mining allows users to explore the full depth of a database without preselecting a subset of variables.
- More rows. Larger samples yield lower estimation errors and variance, and allow users to make inferences about small but important segments of a population.
A recent Gartner Group Advanced Technology Research Note listed data mining and artificial intelligence at the top of the five key technology areas that "will clearly have a major impact on a wide range of industries in the next 3 to 5 years." Gartner also listed parallel architectures and data mining as two of the top 10 new technologies in which companies will invest over the next 5 years. According to a recent Gartner HPC Research Note, "With the rapid progress in data capture, transport and storage, large-system users will increasingly need to implement new and innovative ways to mine the after-market value of their vast reserves of detail data, employing MPP [massively parallel processing] systems to create new sources of business advantage (0.9 probability)."
3. The most common techniques used in data mining are as follows:
- Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
- Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for classifying a data set. Specific decision tree methods include Classification and Regression Trees (CART) and Chi-Square Automatic Interaction Detection (CHAID).
- Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
- Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
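Of the techniques listed above, the nearest neighbor method is the simplest to show in a few lines. The sketch below is a minimal version that assumes numeric feature vectors, Euclidean distance, and a simple majority vote; the `knn_classify` name and the historical records are hypothetical.

```python
import math
from collections import Counter

def knn_classify(query, history, k=3):
    """Classify `query` by majority vote among its k nearest historical records.

    history is a list of (feature_vector, class_label) pairs.
    """
    nearest = sorted(history, key=lambda rec: math.dist(query, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical historical data: (age, income in thousands) -> responded to mailing?
HISTORY = [
    ((25, 30), "no"),
    ((30, 40), "no"),
    ((45, 90), "yes"),
    ((50, 85), "yes"),
    ((35, 50), "no"),
]
label = knn_classify((48, 80), HISTORY, k=3)
```

In practice the features would usually be scaled first, since an unscaled attribute with a large range dominates the distance.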
4. Text Mining Techniques
The main text mining techniques are the following:
1. Feature extraction
2. Thematic indexing
3. Clustering
4. Summarization
These four techniques are essential because they solve two main problems in applying text mining: they make textual information accessible, and they reduce the volume of text that end users must read before the information they need is found.
Feature extraction deals with finding specific items of information within a text. The target information can be general, such as types of descriptions or activities, or narrowly pattern-driven. For example, reviewing merger and acquisition stories may retrieve the names of the companies involved, the cost, the financing arrangements, and whether or not regulatory approval is required.
Thematic indexing uses knowledge about the meaning of words in a text to identify the themes covered by a document. For example, documents about aspirin and about other pain relievers might both be classified under analgesics or painkillers. Thematic indexing is often implemented using multidimensional taxonomies. A taxonomy, in the text mining sense, is a hierarchical knowledge representation scheme. This construct, sometimes called an ontology to distinguish it from navigational taxonomies such as Yahoo's, provides the means to find documents about a topic rather than documents containing specific keywords. For example, an analyst researching mobile communications should be able to search for documents on wireless protocols without having to know key phrases such as Wireless Application Protocol (WAP).
Clustering is another text mining technique with applications in business intelligence. Clustering groups similar documents according to their dominant features. In text mining and information retrieval, a weighted feature vector is frequently used to describe a document. These feature vectors contain a list of the main themes or keywords, each with a numeric weight indicating the relative importance of the theme or term to the document as a whole. Unlike data mining applications, which use a fixed set of features for all analyzed items (e.g., age, income, gender), documents are described with a small number of terms or themes chosen from thousands of possible dimensions.
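The weighted feature vectors described above can be sketched as sparse dictionaries that map terms to weights, with cosine similarity as one common way to compare two documents. This is a minimal illustration under stated assumptions: the weight is plain relative term frequency, the stopword list is a tiny hypothetical one, and the function names are invented for the example.

```python
import math
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "for"}

def feature_vector(text):
    """Map a document to a sparse {term: weight} vector using relative term frequency."""
    terms = [w for w in text.lower().split() if w not in STOPWORDS]
    counts = Counter(terms)
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}

def cosine_similarity(u, v):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

doc_a = feature_vector("wireless protocol for mobile communications")
doc_b = feature_vector("mobile wireless protocol standards")
doc_c = feature_vector("merger financing and regulatory approval")
```

Only the terms that actually occur are stored, which reflects the point that each document uses a few of the thousands of possible dimensions. Here `doc_a` and `doc_b` share most of their terms and score high, while `doc_a` and `doc_c` share none and score zero.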
There is no single best way to perform document clustering, but three approaches are commonly used: hierarchical clustering, binary clustering, and self-organizing maps. Hierarchical clustering [3] uses a set-based approach. The root of the hierarchy is the set of all documents in a collection, and the leaf nodes are sets containing individual documents. Intermediate layers, moving up from the leaf nodes, hold progressively larger sets of documents, grouped by similarity. In binary clustering, each document belongs to exactly one cluster, and clusters are created to maximize the similarity between documents within a cluster and minimize the similarity between documents in different clusters. Self-organizing maps (SOMs) use neural networks to map documents from sparse, high-dimensional spaces onto two-dimensional maps. Similar documents tend to occupy the same position in the two-dimensional grid.
The last text mining technique is summarization. The purpose of summarization is to describe the content of a document while reducing the amount of text a user must read. The main ideas of most documents can be described with as little as 20 percent of the original text, and little is lost in the summary. As with clustering, there is no single summarization algorithm. Most use morphological analysis of words to identify the most frequently used terms while eliminating words that carry little meaning, such as the articles a, an, and the. Some algorithms weight terms used in opening and closing sentences more heavily than other terms, while other approaches look for key phrases.
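The frequency-based approach to summarization described above can be sketched as a small extractive summarizer. This is a deliberately simplified illustration: it skips morphological analysis (no stemming), uses a tiny hypothetical stopword list, and scores each sentence by the summed document-wide frequency of its terms. The `summarize` function and the sample text are assumptions made for the example.

```python
import re
from collections import Counter

# Small illustrative stopword list (an assumption, not a standard resource)
STOPWORDS = {"a", "an", "the", "of", "and", "in", "to", "is", "are", "it"}

def content_words(text):
    """Lowercased alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        # A sentence scores the summed document-wide frequency of its terms.
        return sum(freq[w] for w in content_words(sentence))

    ranked = sorted(range(len(sentences)),
                    key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])
    return [sentences[i] for i in keep]

TEXT = ("Text mining reduces the text a user must read. "
        "The weather was pleasant. "
        "Summaries help a user find text quickly.")
summary = summarize(TEXT, max_sentences=1)
```

Because the score is a raw sum, longer sentences are favoured; dividing by sentence length or adding a position weight for opening and closing sentences are common refinements.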
5. Fields of application of text mining
From government agencies and legislative bodies to companies and universities, and from journalists and writers to college students, we all have to create, store, retrieve, and analyze text. Many organizations therefore face document management and text analysis tasks. Consider a few simple examples:
- Internet search engines could deliver much better results by accepting natural language queries and interpreting their meaning. If the documents found in response to a query were analyzed semantically for their relevance in the context of the original query, search precision could increase significantly: instead of returning a total of more than 10,000 documents in response to a request, the system could provide a short list of the most relevant documents.
- Call center specialists have to understand customer support questions, quickly select relevant documents from the available manuals, frequently asked question lists, and engineering notes, and retrieve the pieces of knowledge that can answer the question. An automated system that classifies the available materials and retrieves the most relevant fragments matching natural language questions could save hundreds of thousands of hours of work and greatly reduce response times. Identifying better fragments with the help of thesauri and ontologies could significantly improve the recall, or completeness, of the search.
- Lawyers, insurers, and venture capital investors often have to grasp quickly the meaning of cases, claims, and proposals. They need better-quality querying of the Web and of different databases to find and retrieve relevant documents. Their practice can benefit greatly from automated text summarization and feature extraction, in which the key points of a text are organized into a meta-database that improves future access to the knowledge contained in the documents.
- Searching medical journals for new hypotheses about the causes of a disease is an ideal task for text mining. Intelligent email routing, automated monitoring of chat rooms, and web page monitoring are all important applications.
5.1. The major challenges for text mining
Text mining is a fascinating research area that tries to solve the problem of information overload by using techniques from data mining, machine learning, natural language processing (NLP), information retrieval (IR), and knowledge management. Text mining involves the preprocessing of document collections (text categorization, information extraction, term extraction), the storage of intermediate representations, techniques for analyzing these intermediate representations (distribution analysis, clustering, trend analysis, association rules, etc.), and the visualization of results. Some of the challenges facing text mining research are:
5.2. Challenge 1: Entity extraction
Most text analysis systems rely on the accurate extraction of entities and relationships from documents. However, the accuracy of entity extraction systems in some domains reaches only 70-80%, which creates a level of noise that prevents the adoption of text mining by a wider audience. We are looking for domain-independent and language-independent NER (named entity recognition) systems that will be able to achieve an accuracy of 99-100%. Building on such a system, we are looking for domain-independent and language-independent extraction systems that will be able to achieve a precision of 98-100% and a recall of 95-100%. Since these systems should work in any domain, they must be fully autonomous and require no human intervention.
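To make the targets above concrete, precision and recall for an extraction system can be computed by comparing its output with a hand-labelled reference set. The sketch below uses the standard definitions (precision is the share of extracted items that are correct, recall is the share of correct items that were extracted); the entity names and the `precision_recall` helper are hypothetical.

```python
def precision_recall(extracted, relevant):
    """Compute precision and recall for extracted items against a reference set."""
    extracted, relevant = set(extracted), set(relevant)
    true_positives = len(extracted & relevant)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: entities a system extracted vs. a hand-labelled reference set
EXTRACTED = {"Acme Corp", "Globex", "Initech", "Paris"}
GOLD = {"Acme Corp", "Globex", "Initech", "Umbrella", "Hooli"}
precision, recall = precision_recall(EXTRACTED, GOLD)
```

In this example three of the four extracted entities are correct and three of the five reference entities were found, which is well short of the 98-100% precision and 95-100% recall the challenge calls for.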
5.3. Challenge 2: Autonomous text analysis
Today's text analytics systems are largely user-guided: they allow users to view different aspects of a corpus. We would like to have a text analysis system that is fully autonomous, that analyzes huge corpora, and that arrives at truly interesting findings which are not captured by any single document in the corpus and are not already known. The system can use the Internet to filter out findings that are already known. The "interestingness" measure, which is entirely subjective, will be defined by a committee of experts in each domain. Such systems can then be used for alerting purposes in the financial domain, the anti-terrorism domain, the biomedical domain, and many other commercial domains. The system will receive streams of documents from a variety of sources and send emails to the relevant people if an "interesting" finding is detected. Based on the systems developed for Challenges 1 and 2, we would have (which is our biggest text mining challenge) ...
6. Conclusion
Mining text in different languages is a major problem, because text mining tools should be able to work with many languages and with multilingual documents. Integrating a domain knowledge base with a text mining engine would enhance its effectiveness, especially in the information retrieval and information extraction phases. Acquiring this knowledge involves querying documents effectively and combining pieces of information from different textual sources (e.g., the World Wide Web). Discovering such hidden knowledge is an essential requirement for many companies because of its broad spectrum of applications.
7. References
1. Jochen Dörre, Peter Gerstl, Roland Seiffert (1999), Text Mining: Finding Nuggets in Mountains of Textual Data, ACM KDD 1999, San Diego, CA, USA.
2. Ah-Hwee Tan (1999), Text Mining: The State of the Art and the Challenges, Proceedings of the PAKDD'99 Workshop on Knowledge Discovery from Advanced Databases (KDAD'99), Beijing, pp. 71-76, April 1999.
3. Daniel Tkach (1998), Text Mining Technology: Turning Information into Knowledge, A White Paper from IBM.
4. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, A. Inkeri Verkamo (1997), Applying Data Mining Techniques in Text Analysis, Report C-1997-23, Department of Computer Science, University of Helsinki.
5. Mark Dixon (1997), An Overview of Document Mining Technology, http://www.geocities.com/ResearchTriangle/Thinktank/1997/mark/writings/dixm97_dm.ps
6. Arseniev, S.B. & Kiselev, M.V. (1991), The Object-Oriented Approach to the Real Time Medical System Design, Proceedings of MIE-91, In: Lecture Notes in Medical Informatics, Springer-Verlag, Berlin, V.45, pp. 508-512.
7. Falkenhainer, B.C. & Michalski, R.S. (1990), Integrating Quantitative and Qualitative Discovery in the ABACUS System, In: Y. Kodratoff, R.S. Michalski (Eds.): Machine Learning: An Artificial Intelligence Approach (Volume III), San Mateo, CA: Kaufmann, pp. 153-190.
8. Kiselev, M.V. (1994), PolyAnalyst - A Machine Discovery System Inferring Functional Programs, Proceedings of AAAI Workshop on Knowledge Discovery in Databases'94, Seattle, pp. 237-249.
9. Kiselev, M.V., Arseniev, S.B. & Flerov, E.V. (1994), PolyAnalyst - A Machine Discovery System for Intelligent Analysis of Clinical Data, ESCTAIC-4 Abstracts (European Society for Computer Technology in Anesthesiology and Intensive Care), Halkidiki, Greece, p. H6.
10. Langley, P., Simon, H.A., Bradshaw, G.L. & Zytkow, J.M. (1987), Scientific Discovery: Computational Explorations of the Creative Processes, Cambridge, MA: MIT Press.
About the Author
Mr. Chandrakant R. Satpute is the Librarian at Godavari College of Engineering, Jalgaon, Maharashtra. He brings with him 11 years of experience in teaching and librarianship. He has been associated with the Khandesh Library Association (ALK) and has published six journal articles nationally and internationally. His areas of interest are library automation and digitization.
Master in Library and Information Science
E-mail: chandanlib1@yahoo.co.in