C175


Lesson 2

Quiz 2

1) Which relationship has a degree of three?

-Ternary 

2) Which of the following entities is uniquely identified by concatenating the primary keys of the two entities it connects?

-Associative

3) Which of the following is an indicator that refers to the minimum number of entity occurrences that can be involved in a relationship?

-Modality 

4) Which of the following entities provides the resolution of a many-to-many relationship?

-Intersection

5) Which of the following images shows that modality is zero and cardinality is one in a relationship?

-

6) Which of the following entities is a type of weak entity?

-Associative 

7) Which of the following relationships is a self-referencing relationship?

-One-to-one unary relationship

8) There are two types of values of cardinality and one of the values is "many". What is the other value of cardinality?

-One

9) Which of the following elements of data modeling describes information about the entity that must be captured?

-Attribute

10) Which of the following elements of a data model illustrates an association between two entities?

-Relationship 

11) Which of the following terms is an object or an event in our environment that we want to keep track of?

-Entity

12) Which of the following is an indicator that refers to the maximum number of entity occurrences that can be involved in a relationship?

-Cardinality 

13) Which of the following data modeling terminologies is used to uniquely identify each record in a database table?

-Primary Key

14) Which of the following is NOT a delete rule that is provided by a modern relational DBMS?

-Not null 

15) Which type of entity is also called a dependent entity?

-Weak


Lesson 3

Data Modeling and Representation

Completing this lesson will help you gain the following competency:

Competency 4017.1.4: Data Modeling

The graduate demonstrates an understanding of the concepts involved in the modeling of data.

This topic highlights the following objectives:

   Identify the advantages and challenges of data as a corporate resource.
   Demonstrate the concept of a database and database management.
   Explain the relational model of data and relational databases.
   Differentiate Primary, Foreign, and Candidate keys.
   Explain how and why data is indexed.

3.1 Data In Today's Information Systems Environment

What are databases?

Using Data for Competitive Advantage

Today's computers are technological marvels. Their speeds, compactness, ease of use, price as related to capability, and, yes, their data storage capacities are truly amazing. And yet, our fundamental interest in computers is the same as that of the ancient Middle‐Eastern shepherds in their pebbles and sacks: they are the vehicles we need to store and utilize the data that is important to us in our environment.

Indeed, data has become indispensable in every kind of modern business and government organization. Data, the applications that process the data, and the computers on which the applications run are fundamental to every aspect of every kind of endeavor. When speaking of corporate resources, people used to list such items as capital, plant and equipment, inventory, personnel, and patents. Today, any such list of corporate resources must include the corporation's data. It has even been suggested that data is the most important corporate resource because it describes all of the others.

Data can provide a crucial competitive advantage for a company. We routinely speak of data and the information derived from it as competitive weapons in hotly contested industries. For example, FedEx had a significant competitive advantage when it first provided access to its package tracking data on its Web site. Then, once one company in an industry develops a new application that takes advantage of its data, the other companies in the industry are forced to match it to remain competitive. This cycle continually moves the use of data to ever‐higher levels, making it an ever more important corporate resource than before. Examples of this abound. Banks give their customers online access to their accounts. Package shipping companies provide up‐to‐the‐minute information on the whereabouts of a package. Retailers send manufacturers product sales data that the manufacturers use to adjust inventories and production cycles. Manufacturers automatically send their parts suppliers inventory data and expect the suppliers to use the data to keep a steady stream of parts flowing.

Problems in Storing and Accessing Data

But being able to store and provide efficient access to a company's data while also maintaining its accuracy so that it can be used to competitive advantage is anything but simple. In fact, several factors make it a major challenge. First and foremost, the volume or amount of data that companies have is massive and growing all the time. Walmart estimates that its data warehouse (a type of database we will explore later) alone contains hundreds of terabytes (trillions of characters) of data and is constantly growing. The number of people who want access to the data is also growing: at one time, only a select group of a company's own employees were concerned with retrieving its data, but this has changed. Now, not only do vastly more of a company's employees demand access to the company's data but also so do the company's customers and trading partners. All major banks today give their depositors Internet access to their accounts. Increasingly tightly linked "supply chains" require that companies provide other companies, such as their suppliers and customers, with access to their data. The combination of huge volumes of data and large numbers of people demanding access to it has created a major performance challenge. How do you sift through so much data for so many people and give them the data that they want in an acceptably small amount of time? How much patience would you have with an insurance company that kept you on the phone for five or ten minutes while it retrieved claim data about which you had a question? Of course, the tremendous advances in computer hardware, including data storage hardware, have helped—indeed, it would have been impossible to have gone as far as we have in information systems without them. But as the hardware continues to improve, the volumes of data and the number of people who want access to it also increase, making it a continuing struggle to provide them with acceptable response times.

Other factors that enter into data storage and retrieval include data security, data privacy, and backup and recovery. Data security involves a company protecting its data from theft, malicious destruction, deliberate attempts to make phony changes to the data (e.g. someone trying to increase his own bank account balance), and even accidental damage by the company's own employees. Data privacy implies assuring that even employees who normally have access to the company's data (much less outsiders) are given access only to the specific data they need in their work. Put another way, sensitive data such as employee salary data and personal customer data should be accessible only by employees whose job functions require it. Backup and recovery means the ability to reconstruct data if it is lost or corrupted, say in a hardware failure. The extreme case of backup and recovery is known as disaster recovery when an information system is destroyed by fire, a hurricane, or other calamity.

Another whole dimension involves maintaining the accuracy of a company's data. Historically, and in many cases even today, the same data is stored several, sometimes many, times within a company's information system. Why does this happen? For several reasons. Many companies are simply not organized to share data among multiple applications. Every time a new application is written, new data files are created to store its data. As recently as the early 1990s, I spoke to a database administration manager (more on this type of position later) in the securities industry who told me that one of the reasons he was hired was to reduce duplicate data appearing in as many as 60–70 files! Furthermore, depending on how database files are designed, data can even be duplicated within a single file. We will explore this issue much more in this book, but for now, suffice it to say that duplicate data, either in multiple files or in a single file, can cause major data accuracy problems.

Data as a Corporate Resource

Every corporate resource must be carefully managed so that the company can keep track of it, protect it, and distribute it to those people and purposes in the company that need it. Furthermore, public companies have a responsibility to their shareholders to competently manage the company's assets. Can you imagine a company's money just sort of out there somewhere without being carefully managed? In fact, the chief financial officer with a staff of accountants and financial professionals is responsible for the money, with outside accounting firms providing independent audits of it. Typically vice presidents of personnel and their staffs are responsible for the administrative functions necessary to manage employee affairs. Production managers at various levels are responsible for parts inventories, and so on. Data is no exception.

But data may just be the most difficult corporate resource to manage. In data, we have a resource of tremendous volume, billions, trillions, and more individual pieces of data, each piece of which is different from the next. And it has the characteristic that much of it is in a state of change at any one time. It's not as if we're talking about managing a company's employees. Even the largest companies have only a few hundred thousand of them, and they don't change all that frequently. Or the money a company has: sure, there is a lot of it, but it's all the same in the sense that a dollar that goes to payroll is the same kind of dollar that goes to paying a supplier for raw materials.

As far back as the early to mid‐1960s, barely ten years after the introduction of commercially viable electronic computers, some forward‐looking companies began to realize that storing each application's data separately, in simple files, was becoming problematic and would not work in the long run, for just the reasons that we've talked about: the increasing volumes of data (even way back then), the increasing demand for data access, the need for data security, privacy, backup, and recovery, and the desire to share data and cut down on data redundancy. Several things were becoming clear. The task was going to require both a new kind of software to help manage the data and progressively faster hardware to keep up with the increasing volumes of data and data access demands. And data‐management specialists would have to be developed, educated, and made responsible for managing the data as a corporate resource.

Out of this need was born a new kind of software, the database management system (DBMS), and a new category of personnel, with titles like database administrator and data management specialist. And yes, hardware has progressively gotten faster and cheaper for the performance it provides. The integration of these advances adds up to much more than the simple sum of their parts. They add up to the database environment.

The Database Environment

Back in the early 1960s, the emphasis in what was then called data processing was on programming. Data was little more than a necessary afterthought in the application development process and in running the data‐processing installation. There was a good reason for this. By today's standards, the rudimentary computers of the time had very small main memories and very simplistic operating systems. Even relatively basic application programs had to be shoehorned into main memory using low‐level programming techniques and a lot of cleverness. But then, as we progressed further into the 1960s and beyond, two things happened simultaneously that made this picture change forever. One was that main memories became progressively larger and cheaper and operating systems became much more powerful. Plus, computers progressively became faster and cheaper on a price/performance basis. All these changes had the effect of permitting the use of higher‐level programming languages that were easier for a larger number of personnel to use, allowing at least some of the emphasis to shift elsewhere. Well, nature hates a vacuum, and at the same time that all of this was happening, companies started becoming aware of the value of thinking of data as a corporate resource and using it as a competitive weapon.

The result was the development of database management systems (DBMS) software and the creation of the "database environment." Supported by ever‐improved hardware and specialized database personnel, the database environment is designed largely to correct all the problems of the non‐database environment. It encourages data sharing and the control of data redundancy with important improvements in data accuracy. It permits storage of vast volumes of data with acceptable access and response times for database queries. And it provides the tools to control data security, data privacy, and backup and recovery.

This book is a straightforward introduction to the fundamentals of database in the current information systems environment. It is designed to teach you the important concepts of the database approach and also to teach you specific skills, such as how to design relational databases, how to improve database performance, and how to retrieve data from relational databases using the SQL language. In addition, as you proceed through the book you will explore such topics as entity‐relationship diagrams, object‐oriented database, database administration, distributed database, data warehousing, Internet database issues, and others.

We start with the basics of database and take a step‐by‐step approach to exploring all the various components of the database environment. Each chapter progressively adds more to an understanding of both the technical and managerial aspects of the field. Database is a very powerful concept. Overall it provides ingenious solutions to a set of very difficult problems. As a result, it tends to be a multifaceted and complex subject that can appear difficult when one attempts to swallow it in one gulp. But database is approachable and understandable if we proceed carefully, cautiously, and progressively step by step. And this is an understanding that no one involved in information systems can afford to be without.

3.2 Introduction To Relational Database

Before the database concept was developed, all data in information systems (then generally referred to as "data processing systems") was stored in simple linear files. Some applications and their programs required data from only one file. Some applications required data from several files. Some of the more complex applications used data extracted from one file as the search argument (the item to be found) for extracting data from another file. Generally, files were created for a single application and were used only for that application. There was no sharing of files or of data among applications and, as a result, the same data often appeared redundantly in multiple files. In addition to this data redundancy among multiple files, a lack of sophistication in the design of individual files often led to data redundancy within those individual files.

As information systems continued to grow in importance, a number of the ground rules began to change. Hardware became cheaper—much cheaper relative to the computing power that it provided. Software development took on a more standardized, "structured" form. Large backlogs of new applications to be implemented built up, making the huge amount of time spent on maintaining existing programs more and more unacceptable. It became increasingly clear that the lack of a focus on data was one of the major factors in this program maintenance dilemma. Furthermore, the redundant data across multiple files and even within individual files was causing data accuracy nightmares (to be explained further in this chapter), just as companies were relying more and more on their information systems to substantially manage their businesses. As we will begin to see in this chapter, the technology that came to the rescue was the database management system.

Summarizing, the problems included:

   Data was stored in different formats in different files.
   Data was often not shared among different programs that needed it, necessitating the duplication of data in redundant files.
   Little was understood about file design, resulting in redundant data within individual files.
   Files often could not be rebuilt after damage by a software error or a hardware failure.
   Data was not secure and was vulnerable to theft or malicious mischief by people inside or outside the company.
   Programs were usually written in such a manner that if the way that the data was stored changed, the program had to be modified to continue working.
   Changes in everything from access methods to tax tables required programming changes.

This chapter will begin by presenting some basic definitions and concepts about data. Then it will describe the type of file environment that existed before database management emerged. Then it will describe the problems inherent in the file environment and show how the database concept overcame them and set the stage for a vastly improved information systems environment.

Practice: Relational Database

3.3 Data Before Database Management

Pieces of data are facts in our environment that are important to us. Usually we have many facts to describe something of interest to us. For example, let's consider the facts we might be interested in about an employee of ours named John Baker. Our company is a sales-oriented company and John Baker is one of our salespersons. We want to remember that his employee number (which we will now call his salesperson number) is 137. We are also interested in the facts that his commission percentage on the sales he makes is 10%, his home city is Detroit, his home state is Michigan, his office number is 1284, and he was hired in 1995. There are, of course, reasons that we need to keep track of these facts about John Baker, such as generating his paycheck every week. It certainly seems reasonable to collect together all of the facts about Baker that we need and to hold all of them together. Figure 4.1 shows all of these facts about John Baker presented in an organized way.


Figure 4.1: Facts about salesperson Baker

Records and Files

Since we have to generate a paycheck each week for every employee in our company, not just for Baker, we are obviously going to need a collection of facts like those in Figure 4.1 for every one of our employees. Figure 4.2 shows a portion of that collection.


Figure 4.2: Salesperson file

CONCEPTS IN ACTION 3-A MEMPHIS LIGHT, GAS AND WATER

Memphis Light, Gas and Water (MLGW) is the largest "three-service" (electricity, natural gas and water) municipal utility system in the United States. It serves over 400,000 customers in Memphis and Shelby County, TN, and has 2,600 employees. MLGW is the largest of the 159 distributors of the federal Tennessee Valley Authority's electricity output. It brings in natural gas via commercial pipelines and it supplies water from a natural aquifer beneath the city of Memphis.

Like any supplier of electricity, MLGW is particularly sensitive to electrical outages. It has developed a two-stage application system to determine the causes of outages and to dispatch crews to fix them. The first stage is the Computer-Aided Restoration of Electric Service (CARES) system, which was introduced in 1996. Beginning with call-in patterns as customers report outages, CARES uses automated data from MLGW's electric grid, wiring patterns to substations, and other information, to function as an expert system to determine the location and nature of the problem. It then feeds its conclusion to the second-stage Mobile Dispatching System (MDS), which was introduced in 1999. MDS sends a repairperson to an individual customer's location if that is all that has been affected or sends a crew to a malfunctioning or damaged piece of equipment in the grid that is affecting an entire neighborhood. There is a feedback loop in which the repairperson or crew reports back to indicate whether the problem has been fixed or a higher-level crew is required to fix it.

The CARES and MDS systems are supported by an Oracle database running on Hewlett-Packard and Compaq Alpha Unix platforms. The database includes a wide range of tables: a Customer Call table has one record per customer reporting call; an Outage table has one record per outage; a Transformer table has one record for each transformer in the grid; a Device table has records for other devices in the grid. These can also interface to the Customer Information System, which has a Customer table with one record for each of the over 400,000 customers. In addition to its operational value, CARES and other systems feed a System Reliability Monitoring database that generates reports on outages and can be queried to gain further knowledge of outage patterns for improving the grid.


Photo Courtesy of Memphis Light, Gas, and Water Division

What we have been loosely referring to as a "thing" or "object" in our environment that we want to keep track of is called an entity. Remember that this is the real physical object or event, not the facts about it. John Baker, the real, living, breathing person whom you can go over to and touch, is an entity. A collection of entities of the same type (e.g., all the company's employees) is called an entity set. An attribute is a property of, a characteristic of, or a fact that we know about an entity. His salesperson number 137, his name, his city of Detroit, his state of Michigan, his office number 1284, his commission percentage of 10, and his year of hire 1995 are all attributes of John Baker. Some attributes have unique values within an entity set. For example, the salesperson numbers are unique within the salesperson entity set, meaning each salesperson has a different salesperson number. We can use the fact that salesperson numbers are unique to distinguish among the different salespersons.

Using the structure in Figure 4.2, we can define some standard file-structure terms and relate them to the terms entity, entity set, and attribute. Each row in Figure 4.2 describes a single entity. In fact, each row contains all the facts that we know about a particular entity. The first row contains all the facts about salesperson 119, the second row contains all the facts about salesperson 137, and so on. Each row of a structure like this is called a record. The columns representing the facts are called fields. The entire structure is called a file. The file in Figure 4.2, which is about the most basic kind of file imaginable, is often called a simple file or a simple linear file (linear because it is a collection of records listed one after the other in a long line). Since the salesperson number attribute is unique, the salesperson number field values can be used to distinguish the individual records of the file. Speaking loosely at this point, the salesperson number field can be referred to as the key field or key of the file.

Tying together the two kinds of terminology that we have developed, a record of a file describes an entity, a whole file contains the descriptions of an entire entity set, and a field of a record contains an attribute of the entity described by that record. In Figure 4.2, each row is a record that describes an entity, specifically a single salesperson. The whole file, row by row or record by record, describes each salesperson in the collection of salespersons. Each column of the file represents a different attribute of salespersons. At the row or entity level, the salesperson name field for the third row of the file indicates that the third salesperson, salesperson 186, has Adams as his salesperson name attribute, i.e. he is named Adams.
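As a concrete illustration of these file-structure terms, the sketch below expresses the Salesperson file of Figure 4.2 as a table definition plus one stored record. It is only a sketch: the column names are assumed from the facts listed for salesperson Baker in Figure 4.1 and may not match the labels used in the actual figure.

   -- Each column is a field (one attribute of a salesperson); each row is a record
   -- (all the facts about one salesperson entity); the whole table is the file.
   CREATE TABLE Salesperson (
       SalespersonNumber    INTEGER PRIMARY KEY,   -- the key field: unique per record
       SalespersonName      VARCHAR(30),
       City                 VARCHAR(30),
       State                VARCHAR(20),
       OfficeNumber         INTEGER,
       CommissionPercentage INTEGER,
       YearOfHire           INTEGER
   );

   -- One record of the file: the facts about salesperson 137, Baker.
   INSERT INTO Salesperson VALUES (137, 'Baker', 'Detroit', 'Michigan', 1284, 10, 1995);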

One last terminology issue is the difference between the terms "type" and "occurrence." Let's talk about it in the context of a record. If you look at a file, like that in Figure 4.2, there are two ways to describe "a record." One, which is referred to as the record type, is a structural description of each and every record in the file. Thus, we would describe the salesperson record type as a record consisting of a salesperson number field, a salesperson name field, a city field, and so forth. This is a general description of what any of the salesperson records looks like. The other way of describing a record is referred to as a record occurrence or a record instance. A specific record of the salesperson file is a record occurrence or instance. Thus, we would say that, for example, the set of values {186, Adams, Dallas, TX, 1253, 15, 2001} is an occurrence of the salesperson record type.

Basic Concepts in Storing and Retrieving Data

Having established the idea of a file and its records, we can now, in simple terms at this point, envision a company's data as a large collection of files. The next step is to discuss how we might want to access data from these files and otherwise manipulate the data in them.

Retrieving and Manipulating Data

There are four fundamental operations that can be performed on stored data, whether it is stored in the form of a simple linear file, such as that of Figure 4.2, or in any other form. They are:

   Retrieve or Read
   Insert
   Delete
   Update

It is convenient to think of each of these operations as basically involving one record at a time, although in practice they can involve several records at once, as we will see later in the book. Retrieving or reading a record means looking at a record's contents without changing them. For example, using the Salesperson file of Figure 4.2, we might read the record for salesperson 204 because we want to find out what year she was hired. Insertion means adding a new record to the file, as when a new salesperson is hired. Deletion means deleting a record from the file, as when a salesperson leaves the company. Updating means changing one or more of a record's field values, for example if we want to increase salesperson 420's commission percentage from 10 to 15. There is clearly a distinction between retrieving or reading data and the other three operations. Retrieving data allows a user to refer to the data for some business purpose without changing it. All of the other three operations involve changing the data. Different topics in this book will focus on one or another of these operations simply because a particular one of the four operations may be more important for a particular topic than the others.
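To make the four operations concrete, here is a hedged sketch of how each would look in SQL against the Salesperson table sketched earlier (the new salesperson's values are purely illustrative; SQL itself is introduced formally later in the book).

   -- Retrieve or Read: look at salesperson 204's record to find her year of hire.
   SELECT YearOfHire FROM Salesperson WHERE SalespersonNumber = 204;

   -- Insert: add a record for a newly hired salesperson (illustrative values).
   INSERT INTO Salesperson VALUES (522, 'Franklin', 'Chicago', 'Illinois', 1300, 10, 2004);

   -- Delete: remove the record of a salesperson who has left the company.
   DELETE FROM Salesperson WHERE SalespersonNumber = 522;

   -- Update: raise salesperson 420's commission percentage from 10 to 15.
   UPDATE Salesperson SET CommissionPercentage = 15 WHERE SalespersonNumber = 420;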

One particularly important concept concerning data retrieval is that, while information systems applications come in a countless number of variations, there are fundamentally only two kinds of access to stored data that any of them require. These two ways of retrieving data are known as sequential access and direct access.

Sequential Access

The term sequential access means the retrieval of all or a portion of the records of a file one after another, in some sequence, starting from the beginning, until all the required records have been retrieved. This could mean all the records of the file, if that is the goal, or all the records up to some point, such as up to the point that a record being searched for is found. The records will be retrieved in some order and there are two possibilities for this. In "physical" sequential access, the records are retrieved one after the other, just as they are stored on the disk device (more on these devices later). In "logical" sequential access the records are retrieved in order based on the values of one or a combination of the fields.

Assuming the records of the Salesperson file of Figure 4.2 are stored on the disk in the order shown in the figure, if they are retrieved in physical sequence they will be retrieved in the order shown in the figure. However, if, for example, they are to be retrieved in logical sequence based on the Salesperson Name field, then the record for Adams would be retrieved first, followed by the record for Baker, followed by the record for Carlyle, and so on in alphabetic order. An example of an application that would require the sequential retrieval of the records of this file would be the weekly payroll processing. If the company wants to generate a payroll check for each salesperson in the order of their salesperson numbers, it can very simply retrieve the records physically sequentially, since that's the order in which they are stored on the disk. If the company wants to produce the checks in the order of the salespersons' names, it will have to perform a logical sequential retrieval based on the Salesperson Name field. It can do this either by sorting the records on the Salesperson Name field or by using an index (see below) that is built on this field.
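Expressed against the sketched Salesperson table, the distinction looks roughly like this; note that without an ordering clause most systems simply return rows in whatever order is convenient, which for a simple file typically corresponds to the physical storage order.

   -- Physical sequential retrieval: records come back in stored order,
   -- e.g. by salesperson number if that is how the file was loaded.
   SELECT * FROM Salesperson;

   -- Logical sequential retrieval based on the Salesperson Name field:
   -- Adams, then Baker, then Carlyle, and so on in alphabetic order.
   SELECT * FROM Salesperson ORDER BY SalespersonName;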

We said that sequential access could involve retrieving a portion of the records of a file. This sense of sequential retrieval usually means starting from the beginning of the file and searching every record, in sequence, until finding a particular record that is being sought. Obviously, this could take a long time for even a moderately large file and so is not a particularly desirable kind of operation, which leads to the concept of direct access.

Direct Access

The other mode of access is direct access. Direct access is the retrieval of a single record of a file or a subset of the records of a file based on one or more values of a field or a combination of fields in the file. For example, in the Salesperson file of Figure 4.2, if we need to retrieve the record for salesperson 204 to find out her year of hire, we would perform a direct access operation on the file specifying that we want the record with a value of 204 in the Salesperson Number field. How do we know that we would retrieve only one record? Because the Salesperson Number field is the unique, key field of the file, there can only be one record (or none) with any one particular value. Another possibility is that we want to retrieve the records for all the salespersons with a commission percentage of 10. The subset of the records retrieved would consist of the records for salespersons 137, 204, and 420.
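In the same sketched notation, the two direct-access examples just described would be written as follows; the first returns at most one record because Salesperson Number is the unique key, while the second returns a subset of records.

   -- Direct access to a single record via the unique key field.
   SELECT * FROM Salesperson WHERE SalespersonNumber = 204;

   -- Direct access to a subset of records: all salespersons with a 10% commission
   -- (salespersons 137, 204, and 420 in Figure 4.2).
   SELECT * FROM Salesperson WHERE CommissionPercentage = 10;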

Direct access is a crucial concept in information systems today. If you telephone a bank with a question about your account, you would not be happy having to wait on the phone while the bank's information system performs a sequential access of its customer file until it finds your record. Clearly this example calls for direct access. In fact, the vast majority of information systems operations that all companies perform today require direct access.

Both sequential access and direct access can certainly be accomplished with data stored in simple files. But simple files leave a lot to be desired. What is the concept of database and what are its advantages?

Practice: Data Management

3.4 The Database Concept

The database concept is one of the most powerful, enduring technologies in the information systems environment. It encompasses a variety of technical and managerial issues and features that are at the heart of today's information systems scene. In order to get started and begin to develop the deep understanding of database that we seek, we will focus on five issues that establish a set of basic principles of the database concept:


   The creation of a datacentric environment in which a company's data can truly be thought of as a significant corporate resource. A key feature of this environment is the ability to share data among those inside and outside of the company who require access to it.
   The ability to achieve data integration while at the same time storing data in a non-redundant fashion. This, alone, is the central, defining feature of the database approach.
   The ability to store data representing entities involved in multiple relationships without introducing data redundancy or other structural problems.
   The establishment of an environment that manages certain data control issues, such as data security, backup and recovery, and concurrency control.
   The establishment of an environment that permits a high degree of data independence. 

Data as a Manageable Resource

Broadly speaking, the information systems environment consists of several components including hardware, networks, applications software, systems software, people, and data. The relative degree of focus placed on each of these has varied over time. In particular, the amount of attention paid to data has undergone a radical transformation. In the earlier days of "data processing," most of the time and emphasis in application development was spent on the programs, as opposed to on the data and data structures. Hardware was expensive and the size of main memory was extremely limited by today's standards. Programming was a new discipline and there was much to be learned about it in order to achieve the goal of efficient processing. Standards for effective programming were unknown. In this environment, the treatment of the data was hardly the highest-priority concern.

At the same time, as more and more corporate functions at the operational, tactical, and strategic levels became dependent on information systems, data increasingly became recognized as an important corporate resource. Furthermore, the corporate community became increasingly convinced that a firm's data about its products, manufacturing processes, customers, suppliers, employees, and competitors could, with proper storage and use, give the firm a significant competitive advantage.

Money, plant and equipment, inventories, and people are all important enterprise resources and, indeed, a great deal of effort has always been expended to manage them. As corporations began to realize that data is also an important enterprise resource, it became increasingly clear that data would have to be managed in an organized way, too, Figure 4.3. What was needed was a software utility that could manage and protect data while providing controlled shared access to it so that it could fulfill its destiny as a critical corporate resource. Out of this need was born the database management system.


Figure 4.3: Corporate resources

As we look to the future and look back at the developments of the last few years, we see several phenomena that emphasize the importance of data and demand its careful management as a corporate resource. These include reengineering, electronic commerce, and enterprise resource planning (ERP) systems that have placed an even greater emphasis on data. In reengineering, data and information systems are aggressively used to redesign business processes for maximum efficiency. At the heart of every electronic commerce Web site is a database through which companies and their customers transact business. Another very important development was that of enterprise resource planning (ERP) systems, which are collections of application programs built around a central shared database. ERP systems very much embody the principles of shared data and of data as a corporate resource.

Data Integration and Data Redundancy

Data integration and data redundancy, each in their own right, are critical issues in the field of database management.

   Data integration refers to the ability to tie together pieces of related data within an information system. If a record in one file contains customer name, address, and telephone data and a record in another file contains sales data about an item that the customer has purchased, there may come a time when we want to contact the customer about the purchased item.
   Data redundancy refers to the same fact about the business environment being stored more than once within an information system.

Data integration is clearly a positive feature of a database management system. Data redundancy is a negative feature (except for performance reasons under certain circumstances that will be discussed later in this book).

In terms of the data structures used in database management systems, data integration and data redundancy are tied together and will be discussed together in this section of the book.

Data stored in an information system describes the real-world business environment. Put another way, the data is a reflection of the environment. Over the years that information systems have become increasingly sophisticated, they and the data that they contain have revolutionized the ways that we conduct virtually all aspects of business. But, as valuable as the data is, if the data is duplicated and stored multiple times within a company's information systems facilities, it can result in a nightmare of poor performance, lack of trust in the accuracy of the data, and a reduced level of competitiveness in the marketplace. Data redundancy and the problems it causes can occur within a single file or across multiple files. The problems caused by data redundancy are threefold:

   First, the redundant data takes up a great deal of extra disk space. This alone can be quite significant.
   Second, if the redundant data has to be updated, additional time is needed to do so since, if done correctly, every copy of the redundant data must be updated. This can create a major performance issue.
   Third and potentially the most significant is the potential for data integrity problems. The term data integrity refers to the accuracy of the data. Obviously, if the data in an information system is inaccurate, it and the whole information system are of limited value. The problem with redundant data, whether in a single file or across multiple files, occurs when it has to be updated (or possibly when it is first stored). If data is held redundantly and all the copies of the data record being updated are not all correctly updated to the new values, there is clearly a problem in data integrity. There is an old saying that has some applicability here, "The person with one watch always knows what time it is. The person with several watches is never quite sure," Figure 4.4.


Figure 4.4: With several watches the correct time might not be clear

Data Redundancy Among Many Files

Beginning with data redundancy across multiple files, consider the following situation involving customer names and addresses. Frequently, different departments in an enterprise in the course of their normal everyday work need the same data. For example, the sales department, the accounts receivable department, and the credit department may need customer name and address data. Often, the solution to this multiple need is redundant data. The sales department has its own stored file that, among other things, contains the customer name and address, and likewise for the accounts receivable and credit departments, Figure 4.5.


Figure 4.5: Three files with redundant data

One day customer John Jones, who currently lives at 123 Elm Street, moves to 456 Oak Street. If his address is updated in two of the files but not the third, then the company's data is inconsistent, Figure 4.6. Two of the files indicate that John Jones lives at 456 Oak Street but one file still shows him living at 123 Elm Street. The company can no longer trust its information system. How could this happen? It could have been a software or a hardware error. But more likely it was because whoever received the new information and was responsible for updating one or two of the files simply did not know of the existence of the third. As mentioned earlier, at various times in information systems history it has not been unusual in large companies for the same data to be held redundantly in sixty or seventy files! Thus, the possibility of data integrity problems is great.
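In update terms, keeping the redundant copies consistent means issuing the same change against every file that holds the address; a sketch using hypothetical table and column names makes the exposure obvious.

   -- The same fact must be changed in all three departmental files
   -- (table and column names below are hypothetical).
   UPDATE SalesCustomer  SET Address = '456 Oak Street' WHERE CustomerName = 'John Jones';
   UPDATE ARCustomer     SET Address = '456 Oak Street' WHERE CustomerName = 'John Jones';
   UPDATE CreditCustomer SET Address = '456 Oak Street' WHERE CustomerName = 'John Jones';
   -- If any one of these updates is missed, the files disagree, as in Figure 4.6.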


Figure 4.6: Three files with a data integrity problem

Multiple file redundancy begins as more a managerial issue than single file redundancy, but it also has technical components. The issue is managerial to the extent that a company's management does not encourage data sharing among departments and their applications. But it is technical when it comes to the reality of whether the company's software systems are capable of providing shared access to the data without compromising performance and data security.

Data Integration and Data Redundancy Within One File

Data redundancy in a single file results in exactly the same three problems that resulted from data redundancy in multiple files: wasted storage space, extra time on data update, and the potential for data integrity problems. To begin developing this scenario, consider Figure 4.7, which shows two files from the General Hardware Co. information system. General Hardware is a wholesaler of hardware, tools, and related items. Its customers are hardware stores, home improvement stores, and department stores, or chains of such stores. Figure 4.7a shows the Salesperson file, which has one record for each of General Hardware's salespersons. Salesperson Number is the unique identifying "key" field and as such is underlined in the figure. Clearly, there is no data redundancy in this file. There is one record for each salesperson and each individual fact about a salesperson is listed once in the salesperson's record.


Figure 4.7: General Hardware Company files

Figure 4.7b shows General Hardware's Customer file. Customer Number is the unique key field. Again, there is no data redundancy, but two questions have to be answered regarding the Salesperson Number field appearing in this file. First, why is it there? After all, it seems already to have a good home as the unique identifying field of the Salesperson file. The Salesperson Number field appears in the Customer file to record which salesperson is responsible for a given customer account. In fact, there is a one-to-many relationship between salespersons and customers. A salesperson can and generally does have several customer accounts, while each customer is serviced by only one General Hardware salesperson. The second question involves the data in the Salesperson Number field in the Customer file. For example, salesperson number 137 appears in four of the records (plus once in the first record of the Salesperson file!). Does this constitute data redundancy? The answer is no. For data to be redundant (and examples of data redundancy will be coming up shortly), the same fact about the business environment must be recorded more than once. The appearance of salesperson number 137 in the first record of the Salesperson file establishes 137 as the identifier of one of the salespersons. The appearance of salesperson number 137 in the first record of the Customer file indicates that salesperson number 137 is responsible for customer number 0121. This is a different fact about the business environment. The appearance of salesperson number 137 in the third record of the Customer file indicates that salesperson number 137 is responsible for customer number 0933. This is yet another distinct fact about the business environment. And so on through the other appearances of salesperson number 137 in the Customer file.
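A sketch of the Customer file of Figure 4.7b as a table (column names assumed, as before) makes the one-to-many arrangement explicit; in relational terminology, developed later in this material, the Salesperson Number column in Customer acts as a foreign key referring back to the Salesperson table's primary key.

   CREATE TABLE Customer (
       CustomerNumber    CHAR(4) PRIMARY KEY,   -- unique key field of the Customer file
       CustomerName      VARCHAR(40),
       SalespersonNumber INTEGER REFERENCES Salesperson(SalespersonNumber),
       HQCity            VARCHAR(30)
   );

   -- Salesperson 137 appearing in several Customer records is not redundancy:
   -- each row states a different fact about which customer that salesperson serves.
   -- (Customer names and cities below are illustrative, not taken from Figure 4.7.)
   INSERT INTO Customer VALUES ('0121', 'Main Street Hardware', 137, 'New York');
   INSERT INTO Customer VALUES ('0933', 'ABC Home Stores', 137, 'Los Angeles');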

Retrieving data from each of the files of Figure 4.7 individually is straightforward and can be done on a direct basis if the files are set up for direct access. Thus, if there is a requirement to find the name or commission percentage or year of hire of salesperson number 204, it can be satisfied by retrieving the record for salesperson number 204 in the Salesperson file. Similarly, if there is a requirement to find the name or responsible salesperson (by salesperson number!) or headquarters city of customer number 1525, we simply retrieve the record for customer number 1525 in the Customer file.

But, what if there is a requirement to find the name of the salesperson responsible for a particular customer account, say for customer number 1525? Can this requirement be satisfied by retrieving data from only one of the two files of Figure 4.7? No, it cannot! The information about which salesperson is responsible for which customers is recorded only in the Customer file and the salesperson names are recorded only in the Salesperson file. Thus, finding the salesperson name will be an exercise in data integration. In order to find the name of the salesperson responsible for a particular customer, first the record for the customer in the Customer file would have to be retrieved. Then, using the salesperson number found in that record, the correct salesperson record can be retrieved from the Salesperson file to find the salesperson name. For example, if there is a need to find the name of the salesperson responsible for customer number 1525, the first operation would be to retrieve the record for customer number 1525 in the Customer file. As shown in Figure 4.7b, this would yield salesperson number 361 as the number of the responsible salesperson. Then, accessing the record for salesperson 361 in the Salesperson file in Figure 4.7a determines that the name of the salesperson responsible for customer 1525 is Carlyle. While it's true that the data in the record in the Salesperson file and the data in the record in the Customer file have been integrated, the data integration process has been awfully laborious.
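In the sketched tables, that laborious process corresponds to two separate single-table queries whose intermediate result the user or program must carry from one to the other by hand:

   -- Step 1: find which salesperson is responsible for customer 1525.
   SELECT SalespersonNumber FROM Customer WHERE CustomerNumber = '1525';   -- returns 361

   -- Step 2: using that result, look up the salesperson's name.
   SELECT SalespersonName FROM Salesperson WHERE SalespersonNumber = 361;  -- returns Carlyle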

This kind of custom-made, multicommand, multifile access (which, by the way, could easily require more than two files, depending on the query and the files involved) is clumsy, potentially error prone, and expensive in terms of performance. While the two files have the benefit of holding data non-redundantly, what is lacking is a good level of data integration. That is, it is overly difficult to find and retrieve pieces of data in the two files that are related to each other. For example, customer number 1525 and salesperson name Carlyle in the two files in Figure 4.7 are related to each other by virtue of the fact that the two records they are in both include a reference to salesperson number 361. Yet, as shown above, ultimately finding the salesperson name Carlyle by starting with the customer number 1525 is an unacceptably laborious process.

A fair question to ask is, if we knew that data integration was important in this application environment and if we knew that there would be a frequent need to find the name of the salesperson responsible for a particular customer, why were the files structured as in Figure 4.7 in the first place? An alternative arrangement is shown in Figure 4.8. The single file in Figure 4.8 combines the data in the two files of Figure 4.7. Note that the Customer Number field values in Figure 4.8 are identical to those in the Customer file of Figure 4.7b.


Figure 4.8: General Hardware Company combined file

The file in Figure 4.8 was created by merging the salesperson data from Figure 4.7a into the records of Figure 4.7b, based on corresponding salesperson numbers. As a result, notice that the number of records in the file in Figure 4.8 is identical to the number of records in the Customer file of Figure 4.7b. This is actually a result of the "direction" of the one-to-many relationship in which each salesperson can be associated with several customers. The data was "integrated" in this merge operation. Notice, for example, that in Figure 4.7b, the record for customer number 1525 is associated with salesperson number 361. In turn, in Figure 4.7a, the record for salesperson number 361 is shown to have the name Carlyle. Those two records were merged, based on the common salesperson number, into the record for customer number 1525 in Figure 4.8. (Notice, by the way, that the Salesperson Number field appears twice in Figure 4.8 because it appeared in each of the files of Figure 4.7. The field values in each of those two fields are identical in each record in the file in Figure 4.8, which must be the case since it was on those identical values that the record merge that created the file in Figure 4.8 was based. That being the case, certainly one of the two Salesperson Number fields in the file in Figure 4.8 could be deleted without any loss of information.)

The file in Figure 4.8 is certainly well integrated. Finding the name of the salesperson who is responsible for customer number 1525 now requires a single record access of the record for customer number 1525. The salesperson name, Carlyle, is right there in that record. This appears to be the solution to the earlier multifile access problem. Unfortunately, integrating the two files caused another problem: data redundancy. Notice in Figure 4.8 that, for example, the fact that salesperson number 137 is named Baker is repeated four times, as are his commission percentage and year of hire. This is, indeed, data redundancy, as it repeats the same facts about the business environment multiple times within the one file. If a given salesperson is responsible for several customer accounts, then the data about the salesperson must appear in several records in the merged or integrated file. It would make no sense from a logical or a retrieval standpoint to specify, for example, the salesperson name, commission percentage, and year of hire for one customer that the salesperson services and not for another. This would imply a special relationship between the salesperson and that one customer that does not exist and would remove the linkage between the salesperson and his other customers. To be complete, the salesperson data must be repeated for every one of his customers.

The combined file in Figure 4.8 also illustrates what have come to be referred to as anomalies in poorly structured files. The problems arise when two different kinds of data, like salesperson and customer data in this example, are merged into one file. Look at the record in Figure 4.8 for customer number 2198, Western Hardware. The salesperson for this customer is Dickens, salesperson number 204. Look over the table and note that Western Hardware happens to be the only customer that Dickens currently has. If Western Hardware has gone out of business or General Hardware has stopped selling to it and they decide to delete the record for Western Hardware from the file, they also lose everything they know about Dickens: his commission percentage, his year of hire, even his name associated with his salesperson number, 204. This situation, which is called the deletion anomaly, occurs because salesperson data doesn't have its own file, as in Figure 4.7a. The only place in the combined file of Figure 4.8 that you can store salesperson data is in the records with the customers. If you delete a customer and that record was the only one for that salesperson, the salesperson's data is gone.

Conversely, in the insertion anomaly, General Hardware can't record data in the combined file of Figure 4.8 about a new salesperson the company just hired until she is assigned at least one customer. After all, the identifying field of the records of the combined file is Customer Number! Finally, the update anomaly notes that the redundant data of the combined file, such as Baker's commission percentage of 10 repeated four times, must be updated each place it exists when it changes (for example, if Baker is rewarded with an increase to a commission percentage of 15).
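The three anomalies can be sketched against the combined file of Figure 4.8 treated as a single table (the table name CombinedFile and its column names are assumed for illustration):

   -- Deletion anomaly: removing Western Hardware, Dickens' only customer,
   -- also destroys everything known about salesperson 204.
   DELETE FROM CombinedFile WHERE CustomerNumber = '2198';

   -- Insertion anomaly: a newly hired salesperson with no customers cannot be
   -- recorded at all, because Customer Number is the identifying field of the file.

   -- Update anomaly: Baker's commission change must be applied to every one of the
   -- four records in which his data is repeated, or the file contradicts itself.
   UPDATE CombinedFile SET CommissionPercentage = 15 WHERE SalespersonNumber = 137;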

There appears to be a very significant tradeoff in the data structures between data integration and data redundancy. The two files of Figure 4.7 are non-redundant but have poor data integration. Finding the name of the salesperson responsible for a particular customer account requires a multicommand, multifile access that can be slow and error-prone. The merged file of Figure 4.8, in which the data is very well integrated, eliminates the need for a multicommand, multifile access for this query, but is highly data redundant. Neither of these situations is acceptable. A poor level of data integration slows down the company's information systems and, perhaps, its business! Redundant data can cause data accuracy and other problems. Yet both the properties of data integration and of non-redundant data are highly desirable. And, while the above example appears to show that the two are hopelessly incompatible, over the years a few—very few—ways have been developed to achieve both goals in a single data management system. In fact, this concept is so important that it is the primary defining feature of database management systems:

A database management system is a software utility for storing and retrieving data that gives the end-user the impression that the data is well integrated even though the data can be stored with no redundancy at all.

Any data storage and retrieval system that does not have this property should not be called a database management system. Notice a couple of fine points in the above definition. It says, "data can be stored with no redundancy," indicating that non-redundant storage is feasible but not required. In certain situations, particularly involving performance issues, the database designer may choose to compromise on the issue of data redundancy. Also, it says, "that gives the end-user the impression that the data is well integrated." Depending on the approach to database management taken by the particular database management system, data can be physically integrated and stored that way on the disk or it can be integrated at the time that a data retrieval query is executed. In either case, the data will "give the end-user the impression that the data is well integrated." Both of these fine points will be explored further later in this book.

Multiple Relationships

Entities can relate to each other in unary, binary, and ternary one-to-one, one-to-many, and many-to-many relationships. Clearly, a database management system must be able to store data about the entities in a way that reflects and preserves these relationships. Furthermore, this must be accomplished in such a way that it does not compromise the fundamental properties of data integration and non-redundant data storage described above. Consider the following problems with attempting to handle multiple relationships in simple linear files, using the binary one-to-many relationship between General Hardware Company's salespersons and customers as an example.

First, the Customer file of Figure 4.7 does the job with its Salesperson Number field. The fact that, for example, salesperson number 137 is associated with four of the customers (it appears in four of the records) while, for example, customer number 1826 has only one salesperson associated with it demonstrates that the one-to-many relationship has been achieved. However, as has already been shown, the two files of this figure lack an efficient data integration mechanism; i.e., trying to link detailed salesperson data with associated customer data is laborious. (Actually, as will be seen later in this book, the structures of Figure 4.7 are quite viable in the relational DBMS environment. In that case, the relational DBMS software will handle the data integration requirement. But without that relational DBMS software, these structures are deficient in terms of data integration.) Also, the combined file of Figure 4.8 supports the one-to-many relationship but, of course, introduces data redundancy.
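As a preview of the parenthetical point above about the relational DBMS environment, a relational DBMS can integrate the two non-redundant files of Figure 4.7 at query time with a single join, giving the end-user a well-integrated answer without ever storing the salesperson data redundantly (again a sketch using the assumed column names):

   -- One query finds the name of the salesperson responsible for customer 1525
   -- by matching Salesperson Number values across the two tables.
   SELECT c.CustomerNumber, s.SalespersonNumber, s.SalespersonName
   FROM   Customer c
   JOIN   Salesperson s ON s.SalespersonNumber = c.SalespersonNumber
   WHERE  c.CustomerNumber = '1525';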

Figure 4.9 shows a "horizontal" solution to the problem. The Salesperson Number field has been removed from the Customer file. Instead, each record in the Salesperson file lists all the customers, by customer number, that the particular salesperson is responsible for. This could conceivably be implemented as one variable-length field of some sort containing all the associated customer numbers for each salesperson, or it could be implemented as a series of customer number fields. While this arrangement does represent the one-to-many relationship, it is unacceptable for two reasons. One is that the record length could be highly variable depending on how many customers a particular salesperson is responsible for. This can be tricky from a space management point of view. If a new customer is added to a salesperson's record, the new larger size of the record may preclude its being stored in the same place on the disk as it came from, but putting it somewhere else may cause performance problems in future retrievals. The other reason is that once a given salesperson record is retrieved, the person or program that retrieved it would have a difficult time going through all the associated customer numbers looking for the one desired. With simple files like these, the normal expectation is that there will be one value of each field type in each record (e.g. one salesperson number, one salesperson name, and so on). In the arrangement in Figure 4.9, the end-user or supporting software would have to deal with a list of values, i.e. of customer numbers, upon retrieving a salesperson record. This would be an unacceptably complex process.


Figure 4.9: General Hardware Company combined files: One-to-many relationship horizontal variation

Figure 4.10 shows a "vertical" solution to the problem. In a single file, each salesperson record is immediately followed by the records for all of the customers for which the salesperson is responsible. While this does preserve the one-to-many relationship, the complexities involved in a system that has to manage multiple record types in a single file make this solution unacceptable, too.


Figure 4.10: General Hardware Company combined files: One-to-many relationship vertical variation

A database management system must be able to handle all of the various unary, binary, and ternary relationships in a logical and efficient way that does not introduce data redundancy or interfere with data integration. The database management system approaches that are in use today all satisfy this requirement. In particular, the way that the relational approach to database management handles it will be explained in detail.

Data Control Issues

The people responsible for managing the data in an information systems environment must be concerned with several data control issues. This is true regardless of which database management system approach is in use. It is even true if no database management system is in use, that is, if the data is merely stored in simple files. Most prominent among these data control issues are data security, backup and recovery, and concurrency control, Figure 4.11. These are introduced here and will be covered in more depth later in this book. The reason for considering these data control issues in this discussion of the essence of the database management system concept is that such systems should certainly be expected to handle these issues for all of the data stored in the system's databases.


Figure 4.11: Three data control issues

Computer security has become a very broad topic with many facets and concerns. These include protecting the physical hardware environment, defending against hacker attacks, encrypting data transmitted over networks, educating employees on the importance of protecting the company's data, and many more. All computer security exposures potentially affect a company's data. Some exposures represent direct threats to data while others are more indirect. For example, the theft of transmitted data is a direct threat to data while a computer virus, depending on its nature, may corrupt programs and systems in such a way that the data is affected on an incidental or delayed basis. The types of direct threats to data include outright theft of the data, unauthorized exposure of the data, malicious corruption of the data, unauthorized updates of the data, and loss of the data. Protecting a company's data assets has become a responsibility that is shared by its operating systems, special security utility software, and its database management systems. All database management systems incorporate features that are designed to help protect the data in their databases.

Data can be lost or corrupted in any of a variety of ways, not just from the data security exposures just mentioned. Entire files, portions of databases, or entire databases can be lost when a disk drive suffers a massive accidental or deliberate failure. At the extreme, all of a company's data can be lost to a disaster such as a fire, a hurricane, or an earthquake. Hackers, computer viruses, or even poorly written application programs can corrupt from a few to all of the records of a file or database. Even an unintentional error in entering data into a single record can be propagated to other records that use its values as input into the creation of their values. Clearly, every company (and even every PC user!) must have more than one copy of every data file and database. Furthermore, some of the copies must be kept in different buildings, or even different cities, to prevent a catastrophe from destroying all copies of the data. The process of using this duplicate data, plus other data, special software, and even specially designed disk devices to recover lost or corrupted data is known as "backup and recovery." As a key issue in data management, backup and recovery must be considered and incorporated within the database management system environment.

In today's multi-user environments, it is quite common for two or more users to attempt to access the same data record simultaneously. If they are merely trying to read the data without updating it, this does not cause a problem. However, if two or more users are trying to update a particular record simultaneously, say a bank account balance or the number of available seats on an airline flight, they run the risk of generating what is known as a "concurrency problem." In this situation, the updates can interfere with each other in such a way that the resulting data values will be incorrect. This intolerable possibility must be guarded against and, once again, the database management system must be designed to protect its databases from such an eventuality.
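To make the concurrency idea concrete, here is a minimal sketch in SQL of how an application might hand this problem to the DBMS; the ACCOUNT table, its columns, and the account number are hypothetical, and the exact transaction syntax varies from product to product.

   -- A hedged sketch, assuming a hypothetical ACCOUNT table with an
   -- Account_Number key and a Balance column. Bundling the update in a
   -- transaction lets the DBMS's concurrency control serialize competing
   -- updates so that neither one is lost.
   START TRANSACTION;          -- some products write BEGIN or BEGIN TRANSACTION

   UPDATE ACCOUNT
      SET Balance = Balance - 100
    WHERE Account_Number = 1234;

   COMMIT;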

A fundamental premise of the database concept is that these three data control issues—data security, backup and recovery, and concurrency—must be managed by or coordinated with the database management system. This means that when a new application program is written for the database environment, the programmers can concentrate on the details of the application and not have to worry about writing code to manage these data control issues. It means that there is a good comfort level that the potential problems caused by these issues are under control since they are being managed by long-tested components of the DBMS. It means that the functions are standard for all of the data in the environment, which leads to easier management and economies of scale in assigning and training personnel to be responsible for the data. This kind of commonality of control is a hallmark of the database approach.

Data Independence

In the earlier days of "data processing," many decisions involving the way that application programs were written were made in concert with the specific file designs and the choice of file organization and access method used. The program logic itself was dependent upon the way in which the data was stored. In fact, the "data dependence" was often so strong that if for any reason the storage characteristics of the data had to be changed, the program itself had to be modified, often extensively. That was a very undesirable characteristic of the data storage and programming environments because of the time and expense involved in such efforts. In practice, storage structures sometimes have to change, to reflect improved storage techniques, application changes, attempts at sharing data, and performance tuning, to name a few reasons. Thus, it is highly desirable to have a data storage and programming environment in which as many types of changes in the data structure as possible would not require changes in the application programs that use them. This goal of "data independence" is an objective of today's database management systems.

Practice: Data Integration and Data Redundancy

3.5 DBMS Approaches

We have established a set of principles for the database concept and said that a database management system is a software utility that embodies those concepts. The next question concerns the nature of a DBMS in terms of how it organizes data and how it permits its retrieval. Considering that the database concept is such a crucial component of the information systems environment and that there must be a huge profit motive tied up with it, you might think that many people have worked on the problem over the years and come up with many different approaches to designing DBMSs. It's true that many very bright people have worked on this problem for a long time but, interestingly, you can count the number of different viable approaches that have emerged on the fingers of one hand. In particular, the central issue of providing a non-redundant data environment that also looks as though it is integrated is a very hard nut to crack. Let's just say that we're fortunate that even a small number of practical ways to solve this problem have been discovered.

Basically, there are four major DBMS approaches:

   Hierarchical
   Network
   Relational
   Object-Oriented

The hierarchical and network approaches to database are both called "navigational" approaches because of the way that programs have to "navigate" through hierarchies and networks of data to find the data they need. Both of these technologies were developed in the 1960s and, relative to the other approaches, are somewhat similar in structure. IBM's Information Management System (IMS), a DBMS based on the hierarchical approach, was released in 1969. It was followed in the early 1970s by several network-based DBMSs developed by such computer manufacturers of the time as UNIVAC, Honeywell, Burroughs, and Control Data. There was also a network-based DBMS called the Integrated Database Management System (IDMS) produced by an independent software vendor originally called Cullinane Systems, which was eventually absorbed into Computer Associates. These navigational DBMSs, which were suitable only for mainframe computers, were an elegant solution to the redundancy/integration problem at the time that they were developed. But they were complex, difficult to work with in many respects, and, as we said, required a mainframe computer. Now often called "legacy systems," some of them, interestingly, have survived to this very day for certain applications that require a lot of data and fast data response times.

CONCEPTS IN ACTION 3-B LANDAU UNIFORMS

Landau Uniforms is a premier supplier of professional apparel to the healthcare community, offering a comprehensive line of healthcare uniforms and related apparel. Headquartered in Olive Branch, MS, the company, which dates back to 1938, has continuously expanded its operations both domestically and internationally and today includes corporate apparel among its products. Landau sells its apparel through authorized dealers throughout the U.S. and abroad.

Controlling Landau's product flow in its warehouse is a sophisticated information system that is anchored in database management. Their order filling system, implemented in 2001, is called the Garment Sortation System. It begins with taking orders that are then queued in preparation for "waves" of as many as 80 orders to be filled simultaneously. Each order is assigned a bin at the end of a highly automated conveyor line. The garments for the orders are picked from the shelves and placed onto the beginning of the conveyor line. Scanning devices then automatically direct the bar-coded garments into the correct bin. When an order is completed, it is boxed and sealed. The box then goes on another conveyor where it is automatically weighed, a shipping label is printed and attached to it, and it is routed to one of several shipping docks, depending on which shipper is being used. In addition, a bill is automatically generated and sent to the customer. In fact, Landau bills its more sophisticated customers electronically using an electronic data interchange (EDI) system.

There are two underlying relational databases. The initial order processing is handled using a DB2 database running on an IBM "i" series computer. The orders are passed on to the Garment Sortation System's Oracle database running on PCs. The shipping is once again under the control of the DB2/"i" series system. The relational tables include an order table, a customer table, a style master table, and, of course, a garment table with 2.4 million records.


Photo Courtesy of Landau Uniforms

The relational database approach became commercially viable in about 1980. After several years of user experimentation, it became the preferred DBMS approach and has remained so ever since. The object-oriented approach has proven useful for a variety of niche applications. It is interesting to note that some key object-oriented database concepts have found their way into some of the mainstream relational DBMSs, and some of these products are described as taking a hybrid "object/relational" approach to database.

3.6 The Relational Database Model: Introduction

In 1970, Dr. Edgar F. (Ted) Codd of IBM published in Communications of the ACM a paper entitled "A Relational Model of Data for Large Shared Data Banks." This paper marked the beginning of the field of relational databases. During the 1970s, the relational approach to databases progressed from being a technical curiosity to a subject of serious interest in the information systems community. But it was not until the early 1980s that commercially viable relational database management systems became available. There were two basic reasons for this. One was that, while the relational database was very tempting in concept, its application in a real-world environment was elusive for performance-related reasons. The second reason was that at exactly the time that Codd's paper was published, the earlier hierarchical and network database management systems were just coming onto the commercial scene and were the focus of intense marketing efforts by the software and hardware vendors of the day. Eventually, both of these obstacles were overcome and the relational model became and remains the database model of choice.

3.7 Introduction to Relational Database Concept

Several factors converged in the early 1980s to begin turning the tide toward relational database. One was that the performance issues that held back its adoption in the 1970s began to be resolved. Another was that, after a decade of use of hierarchical and network database management systems, information systems professionals were interested in an alternative that would move toward simplifying the database design process and produce database structures that were easier to use and understand at all levels. Also, at this time there was increasing interest in a DBMS environment that would allow easier, more intuitive access to the data by an increasingly broad range of personnel. Finally, the early 1980s saw the advent of the personal computer. As software developers began trying to create all manner of applications and supporting software utilities for the PC, it quickly became clear that the existing hierarchical and network database approaches would not work in the PC environment, for two reasons. One was that these DBMSs were simply too large to store and use on the early PCs. The other was that they were too complex to be used by the very broad array of non-information-systems professionals at whom the PCs were targeted.

Today, the relational approach to database management is by far the primary database management approach used in all levels of information systems and for most application purposes, from accounting to banking to manufacturing to sales on the World Wide Web. Relational database management is represented today by such products as Microsoft Access and SQL Server, Oracle, Sybase, and IBM's DB2 and Informix. While these and other relational database systems differ in their features and implementations, they all share a common data structure philosophy and a common data access tool: Structured Query Language (SQL) (often pronounced "sequel"). This chapter will focus on the basic concepts of how data is stored and retrieved in a relational database by a relational DBMS.

3.8 The Relational Database Concept

Relational Terminology

In spite of the apparent conflict between non-redundant storage in simple, linear files and data integration, the relative simplicity of such files, or of structures that resemble them, is very desirable in a true database environment. After all, the linear file arrangement is the most basic and most commonly used data structure there is. This is precisely one of the advantages of relational database management.

CONCEPTS IN ACTION 5-A BLACK & DECKER

Black & Decker is one of the world's largest producers of electric power tools and power tool accessories; it is among the largest-selling residential lock manufacturers in the U.S., and is a major manufacturer of faucets sold in the U.S. It is also the world's largest producer of certain types of technology-based industrial fastening systems. The company's brand names include Black & Decker and DeWalt power tools, Emhart Teknologies, Kwikset locks and other home security products, and Price Pfister plumbing fixtures. Based in Towson, MD, Black & Decker has manufacturing plants in ten countries and markets its products in over 100 countries around the globe.

One of the major factors in Black & Decker's Power Tools Division's leadership position is its highly advanced, database-focused information system that assures a steady and accurate supply of raw materials to the manufacturing floor. Using Manugistics’ Demand and Supply Planning software, the system forecasts demand for Black & Decker's power tools and then generates a raw material supply plan based on the forecast and on the company's manufacturing capacity. These results are fed into SAP's Plant Planning System that takes into account suppliers’ capabilities and lead-time constraints to set up orders for the raw materials.

Both the Manugistics and SAP software use Oracle databases to keep track of all the data involved in these processes. Black & Decker runs the system, which became fully integrated in 1998, on clustered Compaq Alphas. The databases are also shared by the company's purchasing, receiving, finance, and accounting departments, assuring very high degrees of accuracy and speed throughout the company's operations and procedures. Included among the major database tables that support this information system are a material master table, a vendor master table, a bill-of-materials table (indicating which parts go into making which other parts), a routing table (indicating the work stations that the part will move through during manufacturing), planning, purchase order, customer, and other tables.


Printed by permission of Black & Decker

To begin with, consider the data structure used in relational databases. In a relational database, the data appears to be stored in what we have been referring to as simple, linear files. Following the conventions of the area of mathematics on which relational database is based, we will begin calling those simple linear files relations, although in common practice they are also referred to as "tables." In the terminology of files, each row is called a "record," while in a relation, each row is called a tuple. In files, each column is called a "field," while in a relation each column is called an attribute. In practice, in speaking about relational database, people commonly use the terms relation, table, and file synonymously. Similarly, tuple, row, and record are often used synonymously, as are attribute, column, and field, Figure 4.12. We will use an appropriate term in each particular situation during our discussion. In particular, we will use the term "relation" in this chapter and the next, in which we are talking about relational database concepts. Following common usage, we will generally use the word "table" in the more applied parts of the book, such as in the corporate database stories in each chapter and in the discussion of SQL.


Figure 4.12: Relational database terminology

It is important to note that there are technical differences between the concept of a file and the concept of a relation (which is why we say that in a relational database the data only appears to be stored in structures that look like files). The differences include:

   The columns of a relation can be arranged in any order without affecting the meaning of the data. This is not true of a file.
   Similarly, the rows of a relation can be arranged in any order, which is not true of a file.
   Every row/column position, sometimes referred to as a "cell," can have only a single value, which is not necessarily true in a file.
   No two rows of a relation are identical, which is not necessarily true in a file.

A relational database is simply a collection of relations that, as a group, contain the data describing a particular business environment.

Quiz: Relational Terminology

3.9 Primary and Candidate Keys

Primary Keys

Figure 4.13 contains two relations, the SALESPERSON relation and the CUSTOMER relation, from General Hardware Company's relational database. The SALESPERSON relation has four rows, each representing one salesperson. Also, the SALESPERSON relation has four columns, each representing a characteristic of salespersons. Similarly, the CUSTOMER relation has nine rows, each representing a customer, and four columns.


Figure 4.13: General Hardware Company relational database

A relation always has a unique primary key. A primary key (sometimes shortened in practice to just "the key") is an attribute or group of attributes whose values are unique throughout all rows of the relation. In fact, the primary key represents the characteristic of a collection of entities that uniquely identifies each one. For example, in the situation described by the relations in Figure 4.13, each salesperson has been assigned a unique salesperson number and each customer has been assigned a unique customer number. Therefore the Salesperson Number attribute is the primary key of the SALESPERSON relation and the Customer Number attribute is the primary key of the CUSTOMER relation. As in Figure 4.13, we will start marking the primary key attribute(s) with a single, solid underline.
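In SQL terms, the primary key is typically declared when the relation (table) is created. The following is only a sketch of how the SALESPERSON relation of Figure 4.13 might be declared; the underscored column names and the data types are assumptions made for illustration.

   CREATE TABLE SALESPERSON (
       Salesperson_Number    CHAR(3)     NOT NULL,   -- unique across all rows
       Salesperson_Name      VARCHAR(30),
       Commission_Percentage INTEGER,
       Year_Of_Hire          CHAR(4),
       PRIMARY KEY (Salesperson_Number)              -- the primary key attribute
   );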

The number of attributes involved in the primary key is always the minimum number of attributes that provide the uniqueness quality. For example, in the SALESPERSON relation, it would make no sense to have the combination of Salesperson Number and Salesperson Name as the primary key because Salesperson Number is unique by itself. However, consider the situation of a SALESPERSON relation that does not include a Salesperson Number attribute, but instead has a First Name attribute, a Middle Name attribute, and a Last Name attribute. The primary key might then be the combination of the First, Middle, and Last Name attributes (assuming this would always produce a unique combination of values. If it did not, then a fourth attribute could be added to the relation and to the primary key as a sequence field to produce, for example, John Alan Smith #1, John Alan Smith #2, and so forth). Some attribute or combination of attributes of a relation has to be unique and this can serve as the unique primary key, since, by definition, no two rows can be identical. In the worst case, all of the relation's attributes combined could serve as the primary key if necessary (but this situation is uncommon in practice).

Candidate Keys

If a relation has more than one attribute or minimum group of attributes that represents a way of uniquely identifying the entities, then they are each called a candidate key. (Actually, if there is only one unique attribute or minimum group of attributes it can also be called a candidate key.) For example, in a personnel relation, an employee number attribute and a Social Security Number attribute (each of which is obviously unique) would each be a candidate key of that relation. When there is more than one candidate key, one of them must be chosen to be the primary key of the relation. That is where the term "candidate key" comes from, since each one is a candidate for selection as the primary key. The decision of which candidate key to pick to be the primary key is typically based on which one will be the best for the purposes of the applications that will use the relation and the database. Sometimes the term alternate key is used to describe a candidate key that was not chosen to be the primary key of the relation, Figure 4.14.


Figure 4.14: Candidate keys become either primary or alternate keys

3.10 Foreign Keys and Binary Relationships

Foreign Keys

If, in a collection of relations that make up a relational database, an attribute or group of attributes serves as the primary key of one relation and also appears in another relation, then it is called a foreign key in that other relation. Thus Salesperson Number, which is the primary key of the SALESPERSON relation, is considered a foreign key in the CUSTOMER relation, Figure 4.15. As shown in Figure 4.15, we will start marking the foreign key attribute(s) with a dashed underline. The concept of the foreign key is crucial in relational databases, as the foreign key is the mechanism that ties relations together to represent unary, binary, and ternary relationships. We begin the discussion by considering how binary relationships are stored in relational databases. These are both the most common and the easiest to deal with. The unary and ternary relationships will come later. Recall from the discussion of the entity-relationship model that the three kinds of binary relationships among the entities in the business environment are the one-to-one, one-to-many, and many-to-many relationships. The first case is the one-to-many relationship, which is typically the most common of the three.


Figure 4.15: A foreign key

Quiz: The concept of Keys in DBMS

One-to-Many Binary Relationship

Consider the SALESPERSON and CUSTOMER relations of Figure 4.13, repeated in Figure 4.15. As one would expect in most sales-oriented companies, notice that each salesperson is responsible for several customers while each customer has a single salesperson as their point of contact with General Hardware. This one-to-many binary relationship can be represented as:

Salesperson - Customer (one-to-many)

For example, the Salesperson Number attribute of the CUSTOMER relation shows that salesperson 137 is responsible for customers 0121, 0933, 1047, and 1826. Looking at it from the point of view of the customer, the same relation shows that the only salesperson associated with customer 0121 is salesperson 137, Figure 4.16. This last point has to be true. After all, there is only one record for each customer in the CUSTOMER relation (the Customer Number attribute is unique since it is the relation's primary key) and there is only one place to put a salesperson number in it. The bottom line is that the Salesperson Number foreign key in the CUSTOMER relation effectively establishes the one-to-many relationship between salespersons and customers.


Figure 4.16: A salesperson and his four customers

By the way, notice that, in this case, the primary key of the SALESPERSON relation and the corresponding foreign key in the CUSTOMER relation both have the same attribute name, Salesperson Number. This will often be the case but it does not have to be. What is necessary is that both attributes have the same domain of values; that is, they must both have values of the same type, such as (in this case) three-digit whole numbers that are the identifiers for salespersons.

It is the presence of a salesperson number in a customer record that indicates which salesperson the customer is associated with. Fundamentally, that is why the Salesperson Number attribute is in the CUSTOMER relation and that is the essence of its being a foreign key in that relation. Note that when building a one-to-many relationship into a relational database, it will always be the case that the unique identifier of the entity on the "one side" of the relationship (Salesperson Number, in this example) will be placed as a foreign key in the relation representing the entity on the "many side" of the relationship (the CUSTOMER relation, in this example).
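A sketch of how the CUSTOMER relation might be declared so that the DBMS itself enforces this foreign key; as before, the column names and data types are illustrative assumptions.

   CREATE TABLE CUSTOMER (
       Customer_Number    CHAR(4)     NOT NULL,
       Customer_Name      VARCHAR(40),
       Salesperson_Number CHAR(3),      -- the "one side" identifier, stored as a foreign key
       HQ_City            VARCHAR(20),
       PRIMARY KEY (Customer_Number),
       FOREIGN KEY (Salesperson_Number) REFERENCES SALESPERSON (Salesperson_Number)
   );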

Here's something else about foreign keys. There are situations in which a relation doesn't have a single, unique attribute to serve as its primary key. Then, it requires a combination of two or more attributes to reach uniqueness and serve as its primary key. Sometimes one or more of the attributes in that combination can be a foreign key! Yes, when this happens, a foreign key is actually part of the relation's primary key! This was not the case in the CUSTOMER relation of Figure 4.13b. In this relation, the primary key only consists of one attribute, Customer Number, which is unique all by itself. The foreign key, Salesperson Number, is clearly not a part of the primary key.

Here is an example of a situation in which a foreign key is part of a relation's primary key. Figure 4.17 adds the CUSTOMER EMPLOYEE relation, Figure 4.17c, to the General Hardware database. Remember that General Hardware's customers are the hardware stores, home improvement stores, or chains of such stores that it supplies. Figure 4.17c, the CUSTOMER EMPLOYEE relation, lists the employees of each of General Hardware's customers. In fact, there is a one-to-many relationship between customers and customer employees. A customer (like a hardware store) has many employees but an employee, a person, works in only one store:

Customer - Customer Employee (one-to-many)

For example, Figure 4.17c shows that customer 2198 has four employees, Smith, Jones, Garcia, and Kaplan. Each of those people is assumed to work for only one customer company, customer 2198. Following the rule we developed for setting up a one-to-many relationship with a foreign key, the Customer Number attribute must appear in the CUSTOMER EMPLOYEE relation as a foreign key, and indeed it does.

Now, what about finding a legitimate primary key for the CUSTOMER EMPLOYEE relation? The assumption here is that employee numbers are unique only within a company; they are not unique across all of the customer companies. Thus, as shown in the CUSTOMER EMPLOYEE relation in Figure 4.17c, there can be an employee of customer number 0121 who is employee number 30441 in that company's employee numbering system, an employee of customer number 0933 who is employee number 30441 in that company's system, and also an employee of customer number 2198 who is also employee number 30441. That being the case, the Employee Number is not a unique attribute in this relation. Neither it nor any other single attribute of the CUSTOMER EMPLOYEE relation is unique and can serve, alone, as the relation's primary key. But the combination of Customer Number and Employee Number is unique. After all, we know that customer numbers are unique and within each customer company, employee numbers are unique. That means that, as shown in Figure 4.17c, the combination of Customer Number and Employee Number can be and is the relation's primary key. Further, that means that Customer Number is both a foreign key in the CUSTOMER EMPLOYEE relation and a part of its primary key. As shown in Figure 4.17c, we will start marking attributes that are both a foreign key and a part of the primary key with an underline consisting of a dashed line over a solid line.
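A sketch of the CUSTOMER EMPLOYEE relation along the same lines, showing a foreign key that is at the same time part of a two-attribute primary key; the non-key columns are illustrative guesses at the figure's contents, and the names and types are assumptions.

   CREATE TABLE CUSTOMER_EMPLOYEE (
       Customer_Number CHAR(4)     NOT NULL,   -- foreign key AND part of the primary key
       Employee_Number CHAR(5)     NOT NULL,   -- unique only within one customer
       Employee_Name   VARCHAR(30),
       Title           VARCHAR(20),
       PRIMARY KEY (Customer_Number, Employee_Number),
       FOREIGN KEY (Customer_Number) REFERENCES CUSTOMER (Customer_Number)
   );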


Figure 4.17: General Hardware Company relational database including the CUSTOMER EMPLOYEE relation

Many-to-Many Binary Relationship

Storing the Many-to-Many Binary Relationship

Figure 4.18 expands the General Hardware database by adding two more relations, the PRODUCT relation, Figure 4.18d, and the SALES relation, Figure 4.18e. The PRODUCT relation simply lists the products that General Hardware sells, one row per product, with Product Number as the unique identifier and thus the primary key of the relation. Each of General Hardware's salespersons can sell any or all of the company's products and each product can be sold by any or all of its salespersons. Therefore the relationship between salespersons and products is a many-to-many relationship.

Salesperson - Product (many-to-many)

So, the database will somehow have to keep track of this many-to-many relationship between salespersons and products. The way that a many-to-many relationship is represented in a relational database is by the creation of an additional relation, in this example, the SALES relation in Figure 4.18e. The SALES relation of Figure 4.18e is intended to record the lifetime sales of a particular product by a particular salesperson. Thus, there will be a single row in the relation for each applicable combination of salesperson and product (i.e., when a particular salesperson has actually sold some of the particular product). For example, the first row of the SALES relation indicates that salesperson 137 has sold product 19440. Since it is sufficient to record that fact once, the combination of the Salesperson Number and Product Number attributes always produces unique values. So, in general, the new relation created to record the many-to-many relationship will have as its primary key the combined unique identifiers of the two entities in the many-to-many relationship. That's why, in this example, the Salesperson Number and Product Number attributes both appear in the SALES relation. Each of the two is a foreign key in the SALES relation since each is the primary key of another relation in the database. The combination of these two attributes is unique, and combined they comprise the primary key of the newly created SALES relation.
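A sketch of the additional SALES relation: its primary key is the combination of the two foreign keys (the Quantity column shown here is discussed below as intersection data); names and types are again assumptions made for illustration.

   CREATE TABLE SALES (
       Salesperson_Number CHAR(3) NOT NULL,   -- foreign key to SALESPERSON
       Product_Number     CHAR(5) NOT NULL,   -- foreign key to PRODUCT
       Quantity           INTEGER,            -- describes the relationship itself
       PRIMARY KEY (Salesperson_Number, Product_Number),
       FOREIGN KEY (Salesperson_Number) REFERENCES SALESPERSON (Salesperson_Number),
       FOREIGN KEY (Product_Number)     REFERENCES PRODUCT (Product_Number)
   );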



Figure 4.18: General Hardware Company relational database including the PRODUCT and SALES relations

The new SALES relation of Figure 4.18e effectively records the many-to-many relationship between salespersons and products. This is illustrated from the "salesperson side" of the many-to-many relationship by looking at the first three rows of the SALES relation and seeing that salesperson 137 sells products 19440, 24013, and 26722. It is illustrated from the "product side" of the many-to-many relationship by scanning down the Product Number column of the SALES relation, looking for the value 19440, and seeing that product 19440 is sold by salespersons 137 and 186, Figure 4.19.


Figure 4.19: Many-to-many relationship between salespersons and products as shown in the SALES relation

Intersection Data

What about the Quantity attribute in the SALES relation? In addition to keeping track of which salespersons have sold which products, General Hardware wants to record how many of each particular product each salesperson has sold since the product was introduced or since the salesperson joined the company. So, it sounds like there has to be a "Quantity" attribute. And, an attribute describes an entity, right? Then, which entity does the Quantity attribute describe? Does it describe salespersons the way the Year of Hire does in the SALESPERSON relation? Does it describe products the way Unit Price does in the PRODUCT relation? Each salesperson has exactly one date of hire. Each product has exactly one unit price. But a salesperson doesn't have just one "quantity" associated with her because she sells many products and similarly, a product doesn't have just one "quantity" associated with it because it is sold by many salespersons.

While year of hire is clearly a characteristic of salespersons and unit price is clearly a characteristic of products, "quantity" is a characteristic of the relationship between salesperson and product. For example, the fact that salesperson 137 appears in the first row of the SALES relation of Figure 4.18e along with product 19440 indicates that he has a history of selling this product. But do we know more about his history of selling it? Yes! That first row of Figure 4.18e indicates that salesperson 137 has sold 473 units of product 19440. Quantity describes the many-to-many relationship between salespersons and products. In a sense it falls at the intersection between the two entities and is thus called "intersection data," Figure 4.20.


Figure 4.20: Intersection data that indicates that salesperson 137 has sold 473 units of product 19440

Since the many-to-many relationship has its own relation in the database and since it can have attributes, does that mean that we should think of it as a kind of entity? Yes! Many people do just that and refer to it as an "associative entity."

Additional Many-to-Many Concepts

Before leaving the subject of many-to-many relationships, there are a few more important points to make. First, will the combination of the two primary keys representing the two entities in the many-to-many relationship always serve as a unique identifier or primary key in the additional relation representing the many-to-many relationship? The answer is that this depends on the precise nature of the many-to-many relationship. For example, in the situation of the SALES relation in Figure 4.18e, the combination of the two entity identifier attributes works perfectly as the primary key, as described above. But, what if General Hardware decides it wants to keep track of each salesperson's annual sales of each product instead of their lifetime sales? Fairly obviously, a new attribute, Year, would have to be added to the SALES relation, as shown in Figure 4.21. Moreover, as demonstrated by a few sample rows of that relation, the combination of Salesperson Number and Product Number is no longer unique. For example, salesperson 137 sold many units of product 19440 in each of 1999, 2000, and 2001. The first three records of the relation all have the salesperson number, product number combination of 137, 19440. Clearly, the way to solve the problem in this instance is to add the Year attribute to the Salesperson Number and Product Number attributes to form a three-attribute unique primary key. It is quite common in practice to have to add such a "timestamp" to a relation storing a many-to-many relationship in order to attain uniqueness and have a legitimate primary key. Sometimes, as in the example in Figure 4.21, this is accomplished with a Year attribute. A Date attribute is required if the data may be stored two or more times in a year. A Time attribute is required if the data may be stored more than once in a day.
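Under the annual-sales assumption, the SALES sketch given earlier would change only in its key: the Year attribute joins the two foreign keys to form a three-attribute primary key (names and types are again illustrative).

   CREATE TABLE SALES (
       Salesperson_Number CHAR(3) NOT NULL,
       Product_Number     CHAR(5) NOT NULL,
       Year               CHAR(4) NOT NULL,   -- the "timestamp" needed for uniqueness
       Quantity           INTEGER,
       PRIMARY KEY (Salesperson_Number, Product_Number, Year)
   );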


Figure 4.21: Modified SALES relation of the General Hardware Company relational database, including a Year attribute

Next is the question of why an additional relation is necessary to represent a many-to-many relationship. For example, could the many-to-many relationship between salespersons and products be represented in either the SALESPERSON or PRODUCT relations? The answer is no! If, for instance, you tried to represent the many-to-many relationship in the SALESPERSON relation, you would have to list all of the products (by Product Number) that a particular salesperson has sold in that salesperson's record. Furthermore, you would have to carry the Quantity intersection data along with it in some way. For example, in the SALESPERSON relation, the row for salesperson 137 would have to be extended to include products 19440, 24013, and 26722, plus the associated intersection data, Figure 4.22a. Alternatively, one could envision a single additional attribute in the SALESPERSON relation into which all the related product number and intersection data for each salesperson would somehow be stuffed, Figure 4.22b (although, aside from other problems, this would violate the rule that every cell in a relation must have only a single value). In either case, it would be unworkable. Because, in general, each salesperson has been involved in selling different numbers of product types, each record of the SALESPERSON relation would be a different length. Furthermore, additions, deletions, and updates of product/quantity pairs would be a nightmare. Also, trying to access the related data from the "product side," for example looking for all of the salespersons who have sold a particular product, would be very difficult. And, incidentally, trying to make this work by putting the salesperson data into the PRODUCT relation, instead of putting the product data into the SALESPERSON relation as in Figure 4.22, would generate an identical set of problems. No, the only way that's workable is to create an additional relation to represent the many-to-many relationship. Each combination of a related salesperson and product has its own record, making the insertion, deletion, and update of related items feasible, providing a clear location for intersection data, and avoiding the issue of variable-length records.


Figure 4.22: Unacceptable ways of storing a binary many-to-many relationship

Finally, there is the question of whether an additional relation is required to represent a many-to-many relationship if there is no intersection data. For example, suppose that General Hardware wants to track which salespersons have sold which products, but has no interest in how many units of each product they have sold. The SALES relation of Figure 4.18e would then have only the Salesperson Number and Product Number attributes, Figure 4.23. Could this information be stored in some way other than with the additional SALES relation? The answer is that the additional relation is still required. Note that in the explanation above of why an additional relation is necessary in general to represent a many-to-many relationship, the intersection data played only a small role. The issues would still be there, even without intersection data.


Figure 4.23: The many-to-many SALES relation without intersection data

One-to-One Binary Relationship

After considering one-to-many and many-to-many binary relationships in relational databases, the remaining binary relationship is the one-to-one relationship. Each of General Hardware's salespersons has exactly one office and each office is occupied by exactly one salesperson, Figure 4.24.

Salesperson - Office (one-to-one)


Figure 4.24: A one-to-one binary relationship

Figure 4.25f shows the addition of the OFFICE relation to the General Hardware relational database. The SALESPERSON relation has the Office Number attribute as a foreign key so that the company can look up the record for a salesperson and see to which office she is assigned. Because this is a one-to-one relationship and each salesperson has only one office, the company can also scan down the Office Number column of the SALESPERSON relation, find a particular office number (which can only appear once, since it's a one-to-one relationship), and see which salesperson is assigned to that office. In general, this is the way that one-to-one binary relationships are built into relational databases. The unique identifier, the primary key, of one of the two entities in the one-to-one relationship is inserted into the other entity's relation as a foreign key. The question of which of the two entities is chosen as the "donor" of its primary key and which as the "recipient" is a design decision that will be taken up later in this book.
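One common way to build this into a table declaration is to make the foreign key UNIQUE, so that no office number can appear for more than one salesperson. The sketch below, with assumed names and types and with the other SALESPERSON columns omitted, is just one possible arrangement, not the only one.

   CREATE TABLE OFFICE (
       Office_Number CHAR(3) NOT NULL,
       Telephone     CHAR(12),
       Office_Size   INTEGER,
       PRIMARY KEY (Office_Number)
   );

   CREATE TABLE SALESPERSON (
       Salesperson_Number CHAR(3) NOT NULL,
       Salesperson_Name   VARCHAR(30),
       Office_Number      CHAR(3) UNIQUE,   -- foreign key; UNIQUE keeps the link one-to-one
       PRIMARY KEY (Salesperson_Number),
       FOREIGN KEY (Office_Number) REFERENCES OFFICE (Office_Number)
   );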



Figure 4.25: General Hardware Company relational database including the OFFICE relation

But there is another interesting question about this arrangement. Could the SALESPERSON and OFFICE relations of Figure 4.25 be combined into one relation? After all, a salesperson has only one office and an office has only one salesperson assigned to it. So, if an office and its unique identifier, Office Number, "belongs" to one particular salesperson, so does that office's Telephone Number and Size. Indeed, when we want to contact a salesperson, we ask for her phone number, not for "her office's phone number!" So, could we combine the SALESPERSON and OFFICE relations of Figure 4.25 into the single relation of Figure 4.26? The answer is, it's possible in some cases, but you have to be very careful about making such a decision. In the General Hardware case, how would you store an unoccupied office in the database? The relation of Figure 4.26 allows data about an office to be stored only if the office is occupied. After all, the primary key of Figure 4.26's relation is Salesperson Number! You can't have a record with office data in it and no salesperson data. A case where it might work is a database of U.S. states and their governors. Every state always has exactly one governor and anyone who is a governor must be associated with one state. There can't be a state without a governor or a governor without a state.


Figure 4.26: Combining the SALESPERSON and OFFICE relations into a single relation

At any rate, in practice, there are a variety of reasons for keeping the two relations involved in the one-to-one relationship separate. It may be that because each of the two entities involved is considered sufficiently important in its own right, this separation simply adds clarity to the database. It may be because most users at any one time seek data about only one of the two entities. It may have to do with splitting the data between different geographic sites. It can even be done for system performance in the case where the records would be unacceptably long if the data was all contained in one relation. These issues will be discussed later in this book, but it is important to have at least a basic idea of the intricacies of the one-to-one relationship at this point.

Practice: Relationships

3.11 Data Retrieval From A Relational Database

Extracting Data from a Relation

Thus far, the discussion has concentrated on how a relational database is structured. But building relations and loading them with data is only half of the story. The other half is the effort to retrieve the data in a way that is helpful and beneficial to the business organization that built the database. If the database management system did not provide any particular help in this effort, then the problem would revert to simply writing a program in some programming language to retrieve data from the relations, treating them as if they were simple, linear files. But the crucial point is that a major, defining feature of a relational DBMS is the ability to accept high-level data retrieval commands, process them against the database's relations, and return the desired data. The data retrieval mechanism is a built-in part of the DBMS and does not have to be written from scratch by every program that uses the database. As we shall soon see, this is true even to the extent of matching related records in different relations (integrating data), as in the earlier example of finding the name of the salesperson on a particular customer account. We shall address what relational retrieval might look like, first in terms of single relations and then across multiple relations.

Since a relation can be viewed as a tabular or rectangular arrangement of data values, it would seem to make sense to want to approach data retrieval horizontally, vertically, or in a combination of the two. Taking a horizontal slice of a relation implies retrieving one or more rows of the relation. In effect, that's an expression for retrieving one or more records or retrieving the data about one or more entities. Taking a vertical slice of a relation means retrieving one or more entire columns of the relation (down through all of its rows). Taken in combination, we can retrieve one or more columns of one or more rows, the minimum of which is a single column of a single row, or a single attribute value of a single record. That's as fine a sense of retrieval as we would ever want.

Using terminology from a database formalism called relational algebra and an informal, hypothetical command style for now, there are two commands called Select and Project that are capable of the kinds of horizontal and vertical manipulations just suggested. (Note: the use of the word "Select" here is not the same as its use in the SQL data retrieval language.)

The Relational Select Operator

Consider the database of Figure 4.25 and its SALESPERSON relation, Figure 4.25a. To begin with, suppose that we want to find the row or record for salesperson number 204. In a very straightforward way, the informal command might be:

Select rows from the SALESPERSON relation in which Salesperson Number = 204.

The result would be:

   Salesperson Number   Salesperson Name   Commission Percentage   Year of Hire
   204                  Dickens            10                      1998

Notice that the result of the Select operation is itself a relation, in this case consisting of only one row. The result of a relational operation will always be a relation, whether it consists of many rows with many columns or one row with one column (i.e., a single attribute value).
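For comparison, the equivalent statement in actual SQL (where, as noted above, the keyword SELECT plays a broader role than the relational Select operator) might look like the following sketch, assuming underscored versions of the attribute names:

   SELECT *                           -- a horizontal slice: whole rows
     FROM SALESPERSON
    WHERE Salesperson_Number = 204;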

In order to retrieve all of the records with a common value in a particular (nonunique) attribute, for example all salespersons with a commission percentage of 10, the command looks the same as when dealing with a unique attribute:

Select rows from the SALESPERSON relation in which Commission Percentage = 10.

But the result of the operation may include several rows:

   Salesperson Number   Salesperson Name   Commission Percentage   Year of Hire
   137                  Baker              10                      1995
   204                  Dickens            10                      1998

If the requirement is to retrieve the entire relation, the command would be:

Select all rows from the SALESPERSON relation.

The Relational Project Operator

To retrieve what we referred to earlier as a vertical slice of the relation requires the Project operator. For example, the command to retrieve the number and name of each salesperson in the file might look like:

Project the Salesperson Number and Salesperson Name over the SALESPERSON relation.

The result will be a long, narrow relation:

   Salesperson Number   Salesperson Name
   137                  Baker
   186                  Adams
   204                  Dickens
   361                  Carlyle
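In actual SQL, a projection is expressed by naming the desired columns; a minimal sketch with the same assumed column names (note that SQL keeps duplicate rows unless SELECT DISTINCT is used, which bears on the question discussed next):

   SELECT Salesperson_Number, Salesperson_Name   -- a vertical slice: whole columns
     FROM SALESPERSON;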

If we project a nonunique attribute, then a decision must be made on whether or not we want duplicates in the result (although, since the result is itself a relation, technically there should not be any duplicate rows). For example, whether:

Project the Year of Hire over the SALESPERSON relation.

produces

   Year of Hire
   1995
   2001
   1998
   2001

or (eliminating the duplicates in the identical rows) produces

   Year of Hire
   1995
   2001
   1998

would depend on exactly how this hypothetical informal command language was implemented.

Combination of the Relational Select and Project Operators

More powerful still is the combination of the Select and Project operators. Suppose we apply them serially, with the relation that results from one operation being used as the input to the next operation. For example, to retrieve the numbers and names of the salespersons working on a 10% commission, we would issue:

Select rows from the SALESPERSON relation in which Commission Percentage = 10.

Project the Salesperson Number and Salesperson Name over that result.

The first command "selects out" the rows for salespersons 137 and 204. Then the second command "projects" the salesperson numbers and names from those two rows, resulting in:

   Salesperson Number   Salesperson Name
   137                  Baker
   204                  Dickens
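In SQL the two steps collapse into a single statement that both restricts the rows and projects the columns; again only a sketch with assumed column names:

   SELECT Salesperson_Number, Salesperson_Name
     FROM SALESPERSON
    WHERE Commission_Percentage = 10;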

The following combination illustrates the ability to retrieve a single attribute value. Suppose that there is a need to find the year of hire of salesperson number 204. Since Salesperson Number is a unique attribute, only one row of the relation can possibly be involved. Since the goal is to find one attribute value in that row, the result must be just that: a single attribute value. The command is:

Select rows from the SALESPERSON relation in which Salesperson Number = 204.

Project the Year of Hire over that result.

The result is the single value:

   Year of Hire
   1998

Extracting Data Across Multiple Relations: Data Integration

Previously, the issue of data integration was broached and the concept was defined. First, the data in the Salesperson and Customer files of Figure 4.7 was shown to be non-redundant. Then it was shown that integrating data would require extracting data from one file and using that extracted data as a search argument to find the sought-after data in the other file. For example, recall that finding the name of the salesperson who was responsible for customer number 1525 required finding the salesperson number in customer 1525's record in the Customer file (i.e. salesperson number 361) and then using that salesperson number as a search argument in the Salesperson file to discover that the sought-after name was Carlyle. The alternative was the combined file of Figure 4.8 that introduced data redundancy.

A fundamental premise of the database approach is that a DBMS must be able to store data non-redundantly while also providing a data integration facility. But it seems that we may have a problem here. Since relations appear to be largely similar in structure to simple, linear files, do the lessons learned from the files of Figure 4.7 and Figure 4.8 lead to the conclusion that it is impossible to have simultaneously non-redundant data storage and data integration with relations in a relational database? In fact, one of the elegant features of relational DBMSs is that they automate the cross-relation data extraction process in such a way that it appears that the data in the relations is integrated while also remaining non-redundant. The data integration takes place at the time that a relational query is processed by the relational DBMS for solution. This is a unique feature of relational databases and is substantially different from the functional equivalents in the older navigational database systems and in some of the newer object-oriented database systems, in both of which the data integration is much more tightly built into the data structure itself. In relational algebra terms, the integration function is known as the Join command.

Now, focus on the SALESPERSON and CUSTOMER relations of Figure 4.25, which outwardly look just like the SALESPERSON and CUSTOMER files of Figure 4.7. Adding the Join operator to our hypothetical, informal command style, consider the following commands designed to find the name of the salesperson responsible for customer number 1525. Again, this was the query that seemed to be so problematic previously.

Join the SALESPERSON relation and the CUSTOMER relation, using the Salesperson Number of each as the join fields.

Select rows from that result in which Customer Number = 1525.

Project the Salesperson Name over that last result.

Obviously, the first sentence represents the use of the join command. The join operation will take advantage of the common Salesperson Number attribute, which for this purpose is called the join field, in both relations. The Salesperson Number attribute is, of course, the primary key of the SALESPERSON relation and is a foreign key in the CUSTOMER relation. Remember that the point of the foreign key is to represent a one-to-many (in this case) relationship between salespersons and customers. Some rows of the SALESPERSON relation are related to some rows of the CUSTOMER relation by virtue of having the same salesperson number. The Salesperson Number attribute serves to identify each salesperson in the SALESPERSON relation, while the Salesperson Number attribute indicates which salesperson is responsible for a particular customer in the CUSTOMER relation. Thus, the rows of the two relations that have identical Salesperson Number values are related. It is these related rows that the join operation will bring together in order to satisfy the query that was posed.

The join operation tries to find matches between the join field values of the rows in the two relations. For example, it finds a match between the Salesperson Number value of 137 in the first row of the SALESPERSON relation and the Salesperson Number value of 137 in the first, third, fourth, and seventh rows of the CUSTOMER relation. When it finds such a pair of rows, it takes all the attribute values from both rows and creates a single new row out of them in the resultant relation. In its most basic form, as shown here, the join is truly an exhaustive operation, comparing every row of one relation to every row of the other relation, looking for a match in the join fields. (Comparing every possible combination of two sets, in this case rows from the two relations, is known as taking the "Cartesian product.") So the result of the join command, the first of the three commands in the example command sequence we're executing, is:

   Salesperson Number   Salesperson Name   Commission Percentage   Year of Hire   Customer Number   Customer Name         Salesperson Number   HQ City
   137                  Baker              10                      1995           0121              Main St. Hardware     137                  New York
   137                  Baker              10                      1995           0933              ABC Home Stores       137                  Los Angeles
   137                  Baker              10                      1995           1047              Acme Hardware Store   137                  Los Angeles
   137                  Baker              10                      1995           1826              City Hardware         137                  New York
   186                  Adams              15                      2001           0839              Jane's Stores         186                  Chicago
   186                  Adams              15                      2001           2267              Central Stores        186                  New York
   204                  Dickens            10                      1998           2198              Western Hardware      204                  New York
   361                  Carlyle            20                      2001           1525              Fred's Tool Stores    361                  Atlanta
   361                  Carlyle            20                      2001           1700              XYZ Stores            361                  Washington

Notice that the first and seventh columns are identical in all of their values, row by row. They represent the Salesperson Number attributes from the SALESPERSON and CUSTOMER relations respectively. Remember that two rows from the SALESPERSON and CUSTOMER relations would not be combined together to form a row in the resultant relation unless their two join field values were identical in the first place. This leads to identical values of the two Salesperson Number attributes within each of the rows of the resultant relation. This type of join is called an "equijoin." If, as seems reasonable, one of the two identical join columns is eliminated in the process, the result is called a "natural join."

Continuing with the command sequence to eventually find the name of the salesperson responsible for customer number 1525, the next part of the command issued is:

Select rows from that result (the relation that resulted from the join) in which Customer Number = 1525.

This produces:

Salesperson Number	Salesperson Name	Commission Percentage	Year of Hire	Customer Number	Customer Name	Salesperson Number	HQ City
361	Carlyle	20	2001	1525	Fred's Tool Stores	361	Atlanta

Finally, we issue the third command

Project the Salesperson Name over that last result.

and get:

Salesperson Name
Carlyle
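
Looking ahead to the SQL introduced in Lesson 4, the entire three-command sequence can be expressed as a single statement. The sketch below is illustrative only: the table name SALESPERSON and the column name SPNUM match the CREATE INDEX example later in this lesson, but the column names SPNAME and CUSTNUM are assumptions, and the exact form of the 1525 literal would depend on the column's declared data type.

-- Join SALESPERSON and CUSTOMER on the salesperson number (the join field),
-- keep only the row for customer 1525, and project the salesperson name.
SELECT SPNAME
FROM SALESPERSON, CUSTOMER
WHERE SALESPERSON.SPNUM = CUSTOMER.SPNUM
  AND CUSTOMER.CUSTNUM = 1525;

Because the statement only describes the desired result, the DBMS is free to use the streamlined access path discussed in the next paragraph rather than building the full joined relation.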

Notice that the process could have been streamlined considerably if the relational DBMS had more "intelligence" built into it. The query dealt with only a single customer, customer 1525, and there is only one row for each customer in the CUSTOMER relation, since Customer Number is the unique key attribute. Therefore, the query needed to look at only one row in the CUSTOMER relation, the one for customer 1525. Since this row references only one salesperson, salesperson 361, it follows that, in turn, the query needed to look at only one row in the SALESPERSON relation, the one for salesperson 361.

Practice: Join Operator

3.12 Example: Good Reading Book Stores

Figure 4.27 shows the relational database for the Good Reading Book Stores example described earlier. Since publishers are in a one-to-many relationship to books, the primary key of the PUBLISHER Relation, Publisher Name, is inserted into the BOOK relation as a foreign key. There are two many-to-many relationships. One, between books and authors, keeps track of which authors wrote which books. Recall that a book can have multiple authors and a particular author may have written or partly written many books. The other many-to-many relationship, between books and customers, records which customers bought which books.


Figure 4.27: Good Reading Bookstores relational database

The WRITING relation handles the many-to-many relationship between books and authors. The primary key is the combination of Book Number and Author Number. There is no intersection data! Could there be a reason for having intersection data in this relation? If, for example, this database belonged to a publisher instead of a bookstore chain, an intersection data attribute might be Royalty Percentage, i.e., the percentage of the royalties to which a particular author is entitled for a particular book. The SALE relation takes care of the many-to-many relationship between books and customers. Certainly Book Number and Customer Number are part of the primary key of the SALE relation, but is the combination of the two the entire primary key? The answer depends on whether we assume that a given customer can buy copies of a given book on different days. If the assumption is that a customer can only buy copies of a particular book on one single day, then the combination of Book Number and Customer Number is fine as the primary key. If the assumption is that a customer may indeed buy copies of a given book on different days, then the Date attribute must be part of the primary key to achieve uniqueness.
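
To make the second assumption concrete, here is a sketch of how the SALE relation might be declared in SQL, with Date included in the primary key. The column names and data types are invented for illustration; SQL data definition itself is covered later in the course.

-- Hypothetical declaration of the SALE relation (column names are illustrative).
-- Because a customer may buy copies of the same book on different days,
-- the sale date must be part of the primary key to guarantee uniqueness.
CREATE TABLE SALE (
    BOOKNUM  CHAR(5)  NOT NULL,
    CUSTNUM  CHAR(4)  NOT NULL,
    SALEDATE DATE     NOT NULL,
    PRIMARY KEY (BOOKNUM, CUSTNUM, SALEDATE)
);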

3.13 Example: World Music Association

Figure 4.28 shows the relational database for the World Music Association example described earlier. There is a one-to-many relationship from orchestras to musicians and, in turn, a one-to-many relationship from musicians to degrees. Thus, the primary key of the ORCHESTRA relation, Orchestra Name, appears in the MUSICIAN relation as a foreign key. In turn, the primary key of the MUSICIAN relation, Musician Number, appears in the DEGREE relation as a foreign key. In fact, since the Degree attribute is unique only within a musician, the Musician Number attribute and the Degree attribute together serve as the compound primary key of the DEGREE relation. A similar situation exists between composers and compositions. The one-to-many relationship from composers to compositions requires that the primary key of the COMPOSER relation, Composer Name, appear as a foreign key in the COMPOSITION relation. Since composition names are unique only within composers, the combination of Composition Name and Composer Name serves as the compound primary key of the COMPOSITION relation.


Figure 4.28: World Music Association relational database

The many-to-many relationship between orchestras and compositions indicates which orchestras have recorded which compositions and which compositions have been recorded by which orchestras. As a many-to-many relationship, it requires that an additional relation be created. The primary key of this new RECORDING relation has three attributes: Orchestra Name, Composition Name, and Composer Name. Orchestra Name is the unique identifier of orchestras. The combination of Composition Name and Composer Name is the unique identifier of compositions. The combination of Orchestra Name, Composition Name, and Composer Name is the unique identifier of the RECORDING relation. The Year and Price attributes are intersection data in the RECORDING relation. If a particular orchestra could have recorded a particular composition multiple times in different years (although we assume that this is limited to once per year), Year must also be part of the primary key of the RECORDING relation to provide uniqueness.
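
A sketch of the RECORDING relation in SQL appears below, assuming that an orchestra records a given composition at most once, so that Year and Price remain intersection data rather than part of the key. The column names, data types, and the referenced ORCHESTRA and COMPOSITION declarations are assumptions for illustration only.

-- Hypothetical declaration of the RECORDING relation (names are illustrative).
CREATE TABLE RECORDING (
    ORCHNAME VARCHAR(40) NOT NULL,
    COMPNAME VARCHAR(40) NOT NULL,
    COMPOSER VARCHAR(40) NOT NULL,
    RECYEAR  INTEGER,                 -- intersection data
    PRICE    DECIMAL(8,2),            -- intersection data
    PRIMARY KEY (ORCHNAME, COMPNAME, COMPOSER),
    FOREIGN KEY (ORCHNAME) REFERENCES ORCHESTRA (ORCHNAME),
    FOREIGN KEY (COMPNAME, COMPOSER) REFERENCES COMPOSITION (COMPNAME, COMPOSER)
);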

Example: Lucky Rent-A-Car

Figure 4.29 shows the relational database for the Lucky Rent-A-Car example described earlier. There is a one-to-many relationship from manufacturers to cars and another one-to-many relationship from cars to maintenance events. The former requires the manufacturer primary key, Manufacturer Name, to be placed in the CAR relation as a foreign key. The latter requires the car primary key, Car Serial Number, to be placed in the MAINTENANCE relation as a foreign key. The many-to-many relationship between cars and customers requires the creation of a new relation, the RENTAL relation. Each row of the RENTAL relation records the rental of a particular car by a particular customer. Note that the combination of the Car Serial Number and Customer Number attributes is not sufficient as the primary key of the RENTAL relation. A given customer might have rented a given car more than once. Adding Rental Date to the primary key achieves the needed uniqueness.


Figure 4.29: Lucky Rent-A-Car relational database

Knowledge Check: Today's Database Environment

3.14 File Organizations And Access Methods

The Goal: Locating a Record

Depending on application requirements, we might want to retrieve the records of a file on either a sequential or a direct-access basis. Disk devices can store records in some logical sequence, if we wish, and can access records in the middle of a file. But that's still not enough to accomplish direct access. Direct access requires the combination of a direct access device and the proper accompanying software.

Say that a file consists of many thousands or even a few million records. Further, say that there is a single record that you want to retrieve and you know the value of its unique identifier, its key. The question is, how do you know where it is on the disk? The disk device may be capable of going directly into the middle of a file to pull out a record, but how does it know where that particular record is? Remember, what we're trying to avoid is having it read through the file in sequence until it finds the record being sought. It's not magic (nothing in a computer ever is) and it is important to have a basic understanding of each of the steps in working with simple files, including this step, before we talk about databases. This brings us to the subject known as "file organizations and access methods," which refers to how we store the records of a file on the disk and how we retrieve them. We refer to the way that we store the data for subsequent retrieval as the file organization. The way that we retrieve the data, based on it being stored in a particular file organization, is called the access method. (Note in passing that the terms "file organization" and "access method" are often used synonymously, but this is technically incorrect.)

What we are primarily concerned with is how to achieve direct access to the records of a file, since this is the predominant mode of file operation today. In terms of file organizations and access methods, there are basically two ways of achieving direct access. One involves the use of a tool known as an "index." The other is based on a way of storing and retrieving records known as a "hashing method." The idea is that if we know the value of a field of a record we want to retrieve, the index or hashing method will pinpoint its location in the file and tell the hardware mechanisms of the disk device where to find it.

The Index

The interesting thing about the concept of an index is that, while we are interested in it as a tool for direct access to the records in files, the principle involved is exactly the same as that of the index in the back of a book. After all, a book is a storage medium for information about some subject. And, in both books and files, we want to be able to find some portion of the contents "directly" without having to scan sequentially from the beginning of the book or file until we find it. With a book, there are really three choices for finding a particular portion of the contents. One is a sequential scan of every page starting from the beginning of the book and continuing until the desired content is found. The second is using the table of contents. The table of contents in the front of the book summarizes what is in the book by major topics, and it is written in the same order as the material in the book. To use the table of contents, you have to scan through it from the beginning and, because the items it includes are summarized and written at a pretty high level, there is a good chance that you won't find what you're looking for. Even if you do, you will typically be directed to a page in the vicinity of the topic you're looking for, not to the exact page. The third choice is to use the index at the back of the book. The index is arranged alphabetically by item. As humans, we can do a quick, efficient search through the index, using the fact that the items in it are in alphabetic order, to home in on the topic of interest. Then what? Next to the located item in the index appears a page number. Think of the page number as the address of the item you're looking for. In fact, it is a "direct pointer" to the page in the book where the material appears. You proceed directly to that page and find the material there, Figure 4.30.


Figure 4.30: The index in a book

The index in the back of a book has three key elements that are also characteristic of information systems indexes:

   The items of interest are copied over into the index but the original text is not disturbed in any way.
   The items copied over into the index are sorted (alphabetized in the index at the back of a book).
   Each item in the index is associated with a "pointer" (in a book index this is a page number) pointing to the place in the text where the item can be found.

Simple Linear Index

The indexes used in information systems come in a variety of types and styles. We will start with what is called a "simple linear index," because it is relatively easy to understand and is very close in structure to the index in the back of a book. On the right-hand side of Figure 4.31 is the Salesperson file. As before, it is in order by the unique Salesperson Number field. It is reasonable to assume that the records in this file are stored on the disk in the sequence shown in Figure 4.31. (We note in passing that retrieving the records in physical sequence, as they are stored on the disk, would also be retrieving them in logical sequence by salesperson number, since they were ordered on salesperson number when they were stored.) Figure 4.31 also shows that we have numbered the records of the file with a "Record Number" or a "Relative Record Number" ("relative" because the record number is relative to the beginning of the file). These record numbers are a handy way of referring to the records of the file and using such record numbers is considered another way of "physically" locating a record in a file, just as a cylinder and track address is a physical address.


Figure 4.31: Salesperson file on the right with index built over the Salesperson Name field, on the left

On the left-hand side of Figure 4.31 is an index built over the Salesperson Name field of the Salesperson file. Notice that the three rules for building an index in a book were observed here, too. The indexed items were copied over from the file to the index and the file was not disturbed in any way. The items in the index were sorted. Finally, each indexed item was associated with a physical address, in this case the relative record number (the equivalent of a page number in a book) of the record of the Salesperson file from which it came. The first "index record" shows Adams 3 because the record of the Salesperson file with salesperson name Adams is at relative record location 3 in the Salesperson file. Notice the similarity between this index and the index in the back of a book. Just as you can quickly find an item you are looking for in a book's index because the items are in alphabetic order, a programmed procedure could quickly find one of the salespersons' names in the index because they are in sorted order. Then, just as the item that you found in the book's index has a page number next to it telling you where to look for the detailed information you seek, the index record in the index of Figure 4.31 has the relative record number of the record of the Salesperson file that has the information, i.e. the record, that you are looking for.

Figure 4.32, with an index built over the City field, demonstrates another point about indexes. An index can be built over a field with non-unique values.


Figure 4.32: Salesperson file on the right with index built over the City field, on the left

Figure 4.33 shows the Salesperson file with an index built over the Salesperson Number field. This is an important concept known as an "indexed-sequential file." In an indexed-sequential file, the file is stored on the disk in order based on a set of field values (in this case the salesperson numbers) and an index is built over that same field. This allows both sequential and direct access by the key field, which can be an advantage when applications with different retrieval requirements share the file. The odd thing about this index is that since the Salesperson file was already in sequence by the Salesperson Number field, when the salesperson numbers were copied over into the index they were already in sorted order! Further, for the same reason, the record addresses are also in order. In fact, in Figure 4.33, the Salesperson Number field in the Salesperson file, with the list of relative record numbers next to it, appears to be identical to the index. But then, why bother having an index built over the Salesperson Number field at all? In principle, the reason is that when the search algorithm processes the salesperson numbers, they have to be in primary memory. Again in principle, it would be much more efficient to bring the smaller index into primary memory for this purpose than to bring the entire Salesperson file in just to process the Salesperson Number field.


Figure 4.33: Salesperson file on the right with index built over the Salesperson Number field, on the left

Why, in the last couple of sentences, did we keep using the phrase "in principle"? The answer to this is closely tied to the question of whether simple linear indexes are practical for use in even moderately sized information systems applications. And the answer is that they are not. One reason (and here is where the "in principle" in the last paragraph comes in) is that, even if the simple linear index is made up of just two columns, it would still be clumsy to try to move all or even parts of it into primary memory to use it in a search. At best, it would require many read operations to the disk on which the index is located. The second reason has to do with inserting new disk records. Look once again at the Salesperson file and the index in Figure 4.31. Say that a new salesperson named French is hired and assigned salesperson number 452. Her record can be inserted at the end of the Salesperson file, where it would become record number 8. But the index would have to be updated, too: an index record, French 8, would have to be inserted between the index records for Dickens and Green to maintain the crucial alphabetic or sorted sequence of the index, Figure 4.34. The problem is that there is no obvious way to accomplish that insertion unless we move all the index records from Green to Taylor down one record position. In even a moderate-size file, that would clearly be impractical!


Figure 4.34: Salesperson file with the insertion of a record for #452 French. But how can you squeeze the index record into the proper sequence?

Indeed, the simple linear index is not a good solution for indexing the records of a file. This leads us to another kind of index that is suitable for indexing even very large files, the B+-tree index.

B+-Tree Index

The B+-tree index, in its many variations (and there are many, including one called the B*-tree), is far and away the most common data-indexing system in use today. Assume that the Salesperson file now includes records for several hundred salespersons. Figure 4.35 shows one variation of how the B+-tree index works. The figure shows the salesperson records arranged in sequence by the Salesperson Number field on ten cylinders (numbered 1–10) of a disk. Above the ten cylinders is an arrangement of special index records in what is known as a "tree." There is a single index record, known as the "root," at the top, with "branches" leading down from it to other "nodes." Sometimes the lowest-level nodes are called "leaves." For the terminology, think of it as a real tree turned upside-down with the roots clumped into a single point at the top, Figure 4.36. Alternatively, you can think of it as a family tree, which normally has this same kind of top-to-bottom orientation.


Figure 4.35: Salesperson file with a B+-tree index


Figure 4.36: A real tree, upside down, with the roots clumped together into a single point

Notice the following about the index records in the tree:

   The index records contain salesperson number key values copied from the salesperson records.
   Each key value in the tree is associated with a pointer that is the address of either a lower-level index record or a cylinder containing the salesperson records.
   Each index record, at every level of the tree, contains space for the same number of key value/pointer pairs (four in this example). This index record capacity is arbitrary, but once it is set, it must be the same for every index record at every level of the index.
   Each index record is at least half full (in this example each record actually contains at least two key value/pointer pairs).

How are the key values in the index tree constructed and how are the pointers arranged? The lowest level of the tree contains the highest key value of the salesperson records on each of the 10 data cylinders. That's why there are 10 key values in the lowest level of the index tree. Each of those 10 key values has a pointer to the data cylinder from which it was copied. For example, the leftmost index record on the lowest level of the tree contains key values 140, 192, and 253, which are the highest key values on cylinders 1, 2, and 3, respectively. The root index record contains the highest key value of each of the index records at the next (which happens to be the last in this case) level down. Looking down from the root index record, notice that 253 is the highest key value of the first index record at the next level down, and so on for key values 477 and 641 in the root.

Let's say that you want to perform a direct access for the record for salesperson 361. A stored search routine would start at the root and scan its key values from left to right, looking for the first key value greater than or equal to 361, the key value for which you are searching. Starting from the left, the first key value in the root greater than or equal to 361 is 477. The routine would then follow the pointer associated with key value 477 to the second of the three index records at the next level. The search would be repeated in that index record, following the same rules. This time, key value 368 is the first one from the left that is higher than or equal to 361. The routine would then follow the pointer associated with key value 368 to cylinder 5. Additional search cues within the cylinder could then point to the track and possibly even the position on the track at which the record for salesperson 361 is to be found.

There are several additional points to note about this B+-tree arrangement:

   The tree index is small and can be kept in main memory indefinitely for a frequently accessed file.
   The file and index of Figure 4.35 fit the definition of an indexed-sequential file, because the file is stored in sequence by salesperson numbers and the index is built over the Salesperson Number field.
   The file can be retrieved in sequence by salesperson number by pointing from the end of one cylinder to the beginning of the next, as is typically done, without even using the tree index.
   B+-tree indexes can be and are routinely used to also index non-key, non-unique fields, although the tree can be deeper and/or the structures at the end of the tree can be more complicated.
   In general, the storage unit for groups of records can be the cylinder (as in the above example), but it need not be the cylinder or any other particular physical device sub-unit.

The final point to make about B+-tree indexes is that, unlike simple linear indexes, they are designed to comfortably handle the insertion of new records into the file and the deletion of records. The principle for this is based on the idea of unit splits and contractions, both at the record storage level and at the index tree level. For example, say that a new record with salesperson number 365 must be inserted. Starting from the root and following the same procedure for a record search, the computer determines that this record should be located on Cylinder 5 in order to maintain the sequence of the records based on the salesperson number key. If there is room on the track on the cylinder that it should go into to maintain the sequence, the other records can be shifted over and there is no problem. If the track it should go into is full but another track on the cylinder has been left empty as a reserve, then the set of records on the full track plus the one for 365 can be "split," with half of them staying on the original track and the other half moving to the reserve track. There would also have to be a mechanism to maintain the proper sequence of tracks within the cylinder, as the split may have thrown it off.

But suppose that cylinder 5 is completely full. Then the collection of records on the entire cylinder has to be split between cylinder 5 and an empty reserve cylinder, say cylinder 11, Figure 4.37. That's fine, except that the key value of 368 in the tree index's lowest level still points to cylinder 5 while the record with key value 368 is now on cylinder 11. Furthermore, there is no key value/pointer pair representing cylinder 11 in the tree index, at all! If the lowest-level index record containing key value 368 had room, a pointer to the new cylinder could be added and the keys in the key value/pointer pairs adjusted. But, as can be seen in Figure 4.35, there is no room in that index record.


Figure 4.37: The records of cylinder 5 plus the newly added record, divided between cylinder 5 and an empty reserve cylinder, cylinder 11

Figure 4.38 shows how this situation is handled. The index record into which the key for the new cylinder should go (the middle of the three index records at the lower level), which happens to be full, is split into two index records. The five key values (the original four plus the new one) and their associated pointers are divided, as equally as possible, between the two resulting index records. But, in Figure 4.35, there were three key values in the record at the next level up (which happens to be the root), and now there are four index records instead of the previous three at the lower level. As shown in Figure 4.38, the empty space in the root index record is used to accommodate the new fourth index record at the lower level. What would have happened if the root index record had already been full? It would have been split in half and a new root at the next level up would have been created, expanding the index tree from two levels of index records to three levels.


Figure 4.38: The B+-tree index after the cylinder 5 split

Remember the following about indexes:

   An index can be built over any field of a file, whether or not the file is in physical sequence based on that or any other field. The field need not have unique values.
   An index can be built on a single field but it can also be built on a combination of fields. For example, an index could be built on the combination of City and State in the Salesperson file.
   In addition to its direct access capability, an index can be used to retrieve the records of a file in logical sequence based on the indexed field. For example, the index in Figure 4.31 could be used to retrieve the records of the Salesperson file in sequence by salesperson name. Since the index is in sequence by salesperson name, a simple scan of the index from beginning to end lists the relative record numbers of the salesperson records in order by salesperson name.
   Many separate indexes into a file can exist simultaneously, each based on a different field or combination of fields of the file. The indexes are quite independent of each other.
   When a new record is inserted into a file, an existing record is deleted, or an indexed field is updated, all of the affected indexes must be updated.

Creating an Index with SQL

Creating an index with SQL entails naming the index, specifying the table being indexed, and specifying the column on which the index is being created. So, for example, to create index A in Figure 4.39, which is an index built on the Salesperson Number attribute of the SALESPERSON table, you would write:

CREATE INDEX A ON SALESPERSON(SPNUM);
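
An index is not limited to a single column of a single table. As the earlier list of points about indexes noted, an index can also be built on a combination of columns, and indexes are often built on foreign key columns to speed up joins. The sketches below are illustrative only: the index names are arbitrary, and the CUSTOMER column names SPNUM, HQCITY, and CUSTNAME are assumptions rather than names taken from Figure 4.39.

-- Hypothetical index on a foreign key column, to speed up joins to SALESPERSON
CREATE INDEX B ON CUSTOMER(SPNUM);

-- Hypothetical index on a combination of columns
CREATE INDEX C ON CUSTOMER(HQCITY, CUSTNAME);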



Figure 4.39: The General Hardware Company relational database with some indexes

Practice: Variations

3.15 Optional: Exercises


   Consider a hospital in which each doctor is responsible for many patients while each patient is cared for by just one doctor. Each doctor has a unique employee number, name, telephone number, and office number. Each patient has a unique patient number, name, home address, and home telephone number.
       What kind of relationship is there between doctors and patients?
       Develop sample doctor and patient data and construct two files in the style of Figure 4.5 in which to store your sample data.
       Do any fields have to be added to one or the other of the two files to record the relationship between doctors and patients? Explain.
       Merge these two files into one, in the style of Figure 4.6. Does this create any problems with the data? Explain.
   The Dynamic Chemicals Corp. keeps track of its customers and its orders. Customers typically have several outstanding orders while each order was generated by a single customer. Each customer has a unique customer number, a customer name, address, and telephone number. An order has a unique order number, a date, and a total cost.
       What kind of relationship is there between customers and orders?
       Develop sample customer and order data and construct two files in the style of Figure 4.5 in which to store your sample data.
       Do any fields have to be added to one or the other of the two files to record the relationship between customers and orders? Explain.
       Merge these two files into one, in the style of Figure 4.6. Does this create any problems with the data? Explain.
   The main relation of a motor vehicle registration bureau's relational database includes the following attributes:
   Vehicle Identification Number	License Plate Number	Owner Serial Number	Manufacturer	Model	Year	Color
   The Vehicle Identification Number is a unique number assigned to the car when it is manufactured. The License Plate Number is, in effect, a unique number assigned to the car by the government when it is registered. The Owner Serial Number is a unique identifier of each owner. Each owner can own more than one vehicle. The other attributes are not unique. What is/are the candidate key(s) of this relation? If there is more than one candidate key, choose one as the primary key and indicate which is/are the alternate key(s).
   A relation consists of attributes A, B, C, D, E, F, G, and H.
       No single attribute has unique values.
       The combination of attributes A and E is unique.
       The combination of attributes B and D is unique.
       The combination of attributes B and G is unique.
        Select a primary key for this relation and indicate any alternate keys.
   In the General Hardware Corp. relational database of Figure 4.25:
       How many foreign keys are there in each of the six relations?
       List the foreign keys in each of the six relations.
   Identify the relations that support many-to-many relationships, the primary keys of those relations, and any intersection data in the General Hardware Corp. database.
   Consider the General Hardware Corp. relational database. Using the informal relational command language described in this chapter, write commands to:
       List the product name and unit price of all of the products.
       List the employee names and titles of all the employees of customer 2198.
       Retrieve the record for office number 1284.
       Retrieve the records for customers headquartered in Los Angeles.
       Find the size of office number 1209.
       Find the name of the salesperson assigned to office number 1209.
       List the product name and quantity sold of each product sold by salesperson 361.
   Consider the General Hardware Corp. relational database and the data stored in it, as shown in Figure 4.25. Find the answer to each of the following queries (written in the informal relational command language described in this chapter).
       Select rows from the CUSTOMER EMPLOYEE relation in which Customer Number = 2198.
       Select rows from the CUSTOMER EMPLOYEE relation in which Customer Number = 2198. Project Employee Number and Employee Name over that result.
       Select rows from the PRODUCT relation in which Product Number = 21765.
       Select rows from the PRODUCT relation in which Product Number = 21765. Project Unit Price over that result.
       Join the SALESPERSON and CUSTOMER relations using the Salesperson Number attribute of each as the join fields. Select rows from that result in which Salesperson Name = Baker. Project Customer Name over that result.
       Join the PRODUCT relation and the SALES relation using the Product Number attribute of each as the join fields. Select rows in which Product Name = Pliers. Project Salesperson Number and Quantity over that result.
    For each part of Exercise 8, describe in words what the query is trying to accomplish.

3.16 Optional: Minicases


   Worldwide, vacation cruises on increasingly larger ships have been steadily growing in popularity. People like the all‐inclusive price for food, room, and entertainment, the variety of shipboard activities, and the ability to unpack just once and still visit several different places. The first of the two minicases used throughout this book is the story of Happy Cruise Lines. Happy Cruise Lines has several ships and operates (begins its cruises) from a number of ports. It has a variety of vacation cruise itineraries, each involving several ports of call. The company wants to keep track of both its past and future cruises and of the passengers who sailed on the former and are booked on the latter. Actually, you can think of a cruise line as simply a somewhat specialized instance of any passenger transportation company, including airlines, trains, and buses. Beyond that, a cruise line is, after all, a business and like any other business of any kind it must be concerned about its finances, employees, equipment, and so forth.
       Using this introductory description of (and hints about) Happy Cruise Lines, make a list of the things in Happy Cruise Lines' business environment about which you think the company would want to maintain data. Do some or all of these qualify as "corporate resources?" Explain.
       Develop some ideas about how the data you identified in part a above can be used by Happy Cruise Lines to gain a competitive advantage over other cruise lines.
   Sports are universally enjoyed around the globe. Whether the sport is a team or individual sport, whether a person is a participant or a spectator, and whether the sport is played at the amateur or professional level, one way or another this kind of activity can be enjoyed by people of all ages and interests. Furthermore, professional sports today are a big business involving very large sums of money. And so, the second of the two minicases to be used throughout this book is the story of the professional Super Baseball League. Like any sports league, the Super Baseball League wants to maintain information about its teams, coaches, players, and equipment, among other things. If you are not particularly familiar with baseball or simply prefer another sport, bear in mind that most of the issues that will come up in this minicase easily translate to any team sport at the amateur, college, or professional levels. After all, all team sports have teams, coaches, players, fans, equipment, and so forth. When specialized equipment or other baseball‐specific items come up, we will explain them.
       Using this introductory description of (and hints about) the Super Baseball League, list the things in the Super Baseball League's business environment about which you think the league would want to maintain data. Do some or all of these qualify as "corporate resources," where the term is broadened to include the resources of a sports league? Explain.
       Develop some ideas about how the data that you identified in part a above can be used by the Super Baseball League to gain a competitive advantage over other sports leagues for the fans' interest and entertainment dollars (Euros, pesos, yen, etc.)
   Answer the following questions based on the following Happy Cruise Lines' data.
   (a) Ship table
   Ship Number	Ship Name	Year Built	Weight (Tons)
   005	Sea Joy	1999	80,000
   009	Ocean IV	2003	75,000
   012	Prince Al	2004	90,000
   020	Queen Shirley	1999	80,000
   (b) Crew Member table
   Sailor Number	Sailor Name	Ship Number	Home Country	Job Title
   00536	John Smith	009	USA	Purser
   00732	Ling Chang	012	China	Engineer
   06988	Maria Gonzalez	020	Mexico	Purser
   16490	Prashant Kumar	005	India	Navigator
   18535	Alan Jones	009	UK	Cruise Director
   20254	Jane Adams	012	USA	Captain
   23981	Rene Lopez	020	Philippines	Captain
   27467	Fred Jones	020	UK	Waiter
   27941	Alain DuMont	009	France	Captain
   28184	Susan Moore	009	Canada	Wine Steward
   31775	James Collins	012	USA	Waiter
   32856	Sarah McLachlan	012	Ireland	Cabin Steward
   	Happy Cruise Lines' data	
       Regarding the Happy Cruise Lines Crew Member file.
           Describe the file's record type.
           Show a record occurrence.
           Describe the set or range of values that the Ship Number field can take.
           Describe the set or range of values that the Home Country field can take.
        Assume that the records of the Crew Member file are physically stored in the order shown.
           Retrieve all of the records of the file physically sequentially.
           Retrieve all of the records of the file logically sequentially based on the Sailor Name field.
           Retrieve all of the records of the file logically sequentially based on the Sailor Number field.
           Retrieve all of the records of the file logically sequentially based on the Ship Number field.
           Perform a direct retrieval of the records with a Sailor Number field value of 27467.
           Perform a direct retrieval of the records with a Ship Number field value of 020.
           Perform a direct retrieval of the records with a Job Title field value of Captain.
       The value 009 appears as a ship number once in the Ship file and four times in the Crew Member file. Does this constitute data redundancy? Explain.
       Merge the Ship and Crew Member files based on the common ship number field (in a manner similar to Figure 4.8 for the General Hardware database). Is the merged file an improvement over the two separate files in terms of:
           Data redundancy? Explain.
           Data integration? Explain.
       Explain why the Ship Number field is in the Crew Member file.
       Explain why ship number 012 appears three times in the Crew Member file.
       How many files must be accessed to find:
           The year that ship number 012 was built?
           The home country of sailor number 27941?
           The name of the ship on which sailor number 18535 is employed?
       Describe the procedure for finding the weight of the ship on which sailor number 00536 is employed.
       What is the mechanism for recording the one-to-many relationship between crew members and ships in the Happy Cruise Lines database above?
   Answer the following questions based on the following Super Baseball League data.
   (a) TEAM file.
   Team Number	Team Name	City	Manager
   137	Eagles	Orlando	Smith
   275	Cowboys	San Jose	Jones
   294	Statesmen	Springfield	Edwards
   368	Pandas	El Paso	Adams
   422	Sharks	Jackson	Vega
   (b) PLAYER file.
   Player Number	Player Name	Age	Position	Team Number
   1209	Steve Marks	24	Catcher	294
   1254	Roscoe Gomez	19	Pitcher	422
   1536	Mark Norton	32	First Baseman	368
   1953	Alan Randall	24	Pitcher	137
   2753	John Harbor	22	Shortstop	294
   2843	John Yancy	27	Center Fielder	137
   3002	Stuart Clark	20	Catcher	422
   3274	Lefty Smith	31	Third Baseman	137
   3388	Kevin Taylor	25	Shortstop	294
   3740	Juan Vidora	25	Catcher	368
   	Super Baseball League data	
        Regarding the Super Baseball League Player file shown above.
           Describe the file's record type.
           Show a record occurrence.
           Describe the set or range of values that the Player Number field can take.
       Assume that the records of the Player file are physically stored in the order shown.
           Retrieve all of the records of the file physically sequentially.
           Retrieve all of the records of the file logically sequentially based on the Player Name field.
           Retrieve all of the records of the file logically sequentially based on the Player Number field.
           Retrieve all of the records of the file logically sequentially based on the Team Number field.
           Perform a direct retrieval of the records with a Player Number field value of 3834.
           Perform a direct retrieval of the records with a Team Number field value of 20.
           Perform a direct retrieval of the records with an Age field value of 24.
       The value 294 appears as a team number once in the Team file and three times in the Player file. Does this constitute data redundancy? Explain.
       Merge the Team and Player files based on the common Team Number field (in a manner similar to Figure 4.8 for the General Hardware database). Is the merged file an improvement over the two separate tables in terms of:
           Data redundancy? Explain.
           Data integration? Explain.
       Explain why the Team Number field is in the Player file.
       Explain why team number 422 appears twice in the Player file.
       How many files must be accessed to find:
           The age of player number 1953?
           The name of the team on which player number 2288 plays?
           The number of the team on which player number 2288 plays?
       Describe the procedure for finding the name of the city in which player number 3002 is based.
       What is the mechanism for recording the one-to-many relationship between players and teams in the Super Baseball League database, above?
   Consider the following relational database for Happy Cruise Lines. It keeps track of ships, cruises, ports, and passengers. A "cruise" is a particular sailing of a ship on a particular date. For example, the seven-day journey of the ship Pride of Tampa that leaves on June 13, 2009, is a cruise. Note the following facts about this environment.
       Both ship number and ship name are unique in the SHIP Relation.
       A ship goes on many cruises over time. A cruise is associated with a single ship.
       A port is identified by the combination of port name and country.
       As indicated by the VISIT Relation, a cruise includes visits to several ports, and a port is typically included in several cruises.
       Both Passenger Number and Social Security Number are unique in the PASSENGER Relation. A particular person has a single Passenger Number that is used for all of the cruises that she takes.
       The VOYAGE Relation indicates that a person can take many cruises and a cruise, of course, has many passengers.
   SHIP Relation
   Ship Number	Ship Name	Ship Builder	Launch Date	Gross Weight
   CRUISE Relation
   Cruise Number	Start Date	End Date	Cruise Director	Ship Number
   PORT Relation
   Port Name	Country	Number of Docks	Port Manager
   VISIT Relation
   Cruise Number	Port Name	Country	Arrival Date	Departure Date
   PASSENGER Relation
   Passenger Number	Passenger Name	Social Security Number	Home Address	Telephone Number
   VOYAGE Relation
   Passenger Number	Cruise Number	Stateroom Number	Fare
   	Happy Cruise Lines	
       Identify the candidate keys of each relation.
       Identify the primary key and any alternate keys of each relation.
       How many foreign keys does each relation have?
       Identify the foreign keys of each relation.
       Indicate any instances in which a foreign key serves as part of the primary key of the relation in which it is a foreign key. Why does each of those relations require a multi-attribute primary key?
       Identify the relations that support many-to-many relationships, the primary keys of those relations, and any intersection data.
       Using the informal relational command language described in this chapter, write commands to:
           Retrieve the record for passenger number 473942.
           Retrieve the record for the port of Nassau in the Bahamas.
           List all of the ships built by General Shipbuilding, Inc.
           List the port name and number of docks of every port in Mexico.
           List the name and number of every ship.
           Who was the cruise director on cruise number 38232?
           What was the gross weight of the ship used for cruise number 39482?
           List the home address of every passenger on cruise number 17543.

3.17 Summary

Recognition of the commercial importance of data, of storing it, and of retrieving it can be traced back to ancient times. As trade routes lengthened and cities grew larger, data became increasingly important. Eventually, the importance of data led to the development of electromechanical calculating devices and then to modern electronic computers, complete with magnetic and optical disk‐based data storage media.

While the use of data has given many companies a competitive advantage in their industries, the storage and retrieval of today's vast amounts of data holds many challenges. These include speedy retrieval of data when many people try to access the data at the same time, maintaining the accuracy of the data, the issue of data security, and the ability to recover the data if it is lost.

The recognition that data is a critical corporate resource and that managing data is a complex task has led to the development and continuing refinement of specialized software known as database management systems, the subject of this book. There are five major components in the database concept. One is the development of a datacentric environment that promotes the idea of data being a significant corporate resource and encourages the sharing of data. Another, which is really the central premise of database management, is the ability to achieve data integration while at the same time storing data in a non-redundant fashion. The third, which at the structural level is actually closely related to the integration/redundancy paradigm, is the ability to store data representing entities involved in multiple relationships without introducing redundancy. Another component is the presence of a set of data controls that address such issues as data security, backup and recovery, and concurrency control. The final component is that of data independence, the ability to modify data structures without having to modify programs that access them.

There are basically four approaches to database management: the early hierarchical and network approaches, the current standard relational approach, and the specialty object-oriented approach, many features of which are incorporated into today's expanded relational database management systems.

The relational approach to database management is by far the primary database management approach used in all levels of information systems applications today. The basic structural component of a relational database is the relation, which appears to be a simple linear file but has some technical differences.

Every relation has a unique primary key consisting of one or more attributes that have unique values in that relation. Multiple such attributes or combinations of attributes that have the uniqueness property are called candidate keys. The candidate keys that are not chosen to be the one primary key are called alternate keys. If the primary key of one relation in the database also appears in another relation of the database, it is called a foreign key in that second relation. Foreign keys tie relations together in the sense that they implement relationships between the entities represented by the relations. A one-to-many relationship is implemented by adding the primary key on the "one side" of the relationship to the relation representing the "many side" of the relationship. Many-to-many relationships are implemented by constructing an additional relation that includes the primary keys of the two entities in the many-to-many relationship. Additional attributes that describe the many-to-many relationship are called intersection data.

Three basic relational algebra commands permit data retrieval from a relational database. The Select command retrieves one or more rows of a relation. The Project command retrieves one or more columns of a relation. The Join command accomplishes data integration by tying together relations that have a common primary key/foreign key pair. These three commands can be used in combination to retrieve the specific data required in a particular query.

Glossary

Attribute-An attribute is a property of, a characteristic of, or a fact that we know about an entity.

Candidate key -If a relation has more than one attribute or minimum group of attributes that represents a way of uniquely identifying the entities, then they are each called a candidate key.

Computer security-It includes protecting the physical hardware environment, defending against hacker attacks, encrypting data transmitted over networks, educating employees on the importance of protecting the company's data, and many more.

Concurrency problem-If two or more users are trying to update a particular record simultaneously, they run the risk of generating what is known as a "concurrency problem."

Data integration-It refers to the ability to tie together pieces of related data within an information system.

Data redundancy-It refers to the same fact about the business environment being stored more than once within an information system.

Data retrieval-It involves the fetching of desired data from a database.

Direct access-It is a retrieval of a single record of a file or a subset of the records of a file based on one or more values of a field or a combination of fields in the file.

Entity-A "thing" or "object" in our environment that we want to keep track of is called an entity.

Entity set-A collection of entities of the same type (e.g., all the company's employees) is called an entity set.

Equijoin-An equijoin combines two or more tables based on a column that is common to the tables.

Fields-The columns representing the facts are called fields.

File-The entire structure of records (rows) and their fields (columns) is called a file.

Foreign key-If, in a collection of relations that make up a relational database, an attribute or group of attributes serves as the primary key of one relation and also appears in another relation, then it is called a foreign key in that other relation.

JOIN operator-It specifies how to relate tables in the query.

Logical sequential access-In "logical" sequential access, the records are retrieved in order based on the values of one or a combination of the fields.

Natural Join- The natural join matches each row in a table against each row in another table based on common values found in columns sharing a common name and data type.

Physical sequential access-In "physical" sequential access, the records are retrieved one after the other, just as they are stored on the disk device.

Primary key-It is an attribute or group of attributes whose values are unique throughout all rows of the relation.

Record-A record is a collection of related data items.

Reengineering-In reengineering, data and information systems are aggressively used to redesign business processes for maximum efficiency.

Relational algebra-It is a formal system for manipulating relations.

Retrieve or Read-To refer to the data for some business purpose without changing it.

Sequential access-It means the retrieval of all or a portion of the records of a file one after another, in some sequence, starting from the beginning, until all the required records have been retrieved.

Simple linear file-It is a collection of records listed one after the other in a long line.

Tuple -Rows are referred to as tuples.

Quiz

1) Which of the following keys is an attribute or group of attributes whose values are unique throughout all rows of the relation?

-Primary key

2) Which of the following allows you to refer to the data for some business purpose without changing it?

-RETRIEVE

3) Which type of join combines two or more tables based on a column that is common to the tables?

-Equi join 

4) Which of the following commands allows you to add new records to a table?

-INSERT

5) Which of the following commands allows you to eliminate records from a table?

-DELETE

6) Which command is used to change one or more of a record's field values?

-UPDATE

7) Which of the following is referred to as a field?

-Column

8) Which type of join matches each row in a table against each row in another table based on common values found in columns sharing a common name and data type?

-Natural Join

9) Which of the following is NOT included in the four major DBMS approaches?

-Inverted File Model

10) Which of the following involves the fetching of desired data from a database?

-Data Retrieval

11) Which key establishes a relationship that has more than one attribute or minimum group of attributes that represents a way of uniquely identifying the entities?

-Candidate Key

12) Which of the following is referred to as a tuple?

-Row

13) Which of the following will prevent the risk of updating a particular record simultaneously?

-Concurrency Control

14) Which of the following refers to the ability to tie together pieces of related data within an information system?

-Data Integration

15) Which of the following refers to the same fact about the business environment being stored more than once within an information system?

-Data Redundancy


Lesson 4

SQL Basics Completing this lesson will help you gain the following competency:

Competency 4017.1.3 The Fundamentals of SQL The graduate demonstrates an understanding of SQL concepts.

This topic highlights the following objectives:

   Define declarative language.
   Use the SQL SELECT command.
   Use SQL SELECT statements with comparison operators.
   Use SQL SELECT statements to demonstrate filtering of results.
   Demonstrate the use of SQL SELECT commands that join relational tables.
   Describe a strategy for writing SQL SELECT statements.

4.1 Introduction

There are two aspects of data management: data definition and data manipulation. Data definition, which is operationalized with a data definition language (DDL), involves instructing the DBMS software on what tables will be in the database, what attributes will be in the tables, which attributes will be indexed, and so forth. Data manipulation refers to the four basic operations that can and must be performed on data stored in any DBMS (or in any other data storage arrangement, for that matter): data retrieval, data update, insertion of new records, and deletion of existing records. Data manipulation requires a special language with which users can communicate data manipulation commands to the DBMS. Indeed, as a class, these are known as data manipulation languages (DMLs).
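
Jumping slightly ahead, the distinction can be illustrated with a short sketch. The statements below are not the actual General Hardware definitions: only the SALESPERSON table name and the SPNUM column appear elsewhere in this course, and the remaining column names, data types, and values are invented for illustration.

-- Data definition (DDL): describe a table to the DBMS
CREATE TABLE SALESPERSON (
    SPNUM    CHAR(3)     NOT NULL,
    SPNAME   VARCHAR(30),
    COMMPCT  INTEGER,
    YEARHIRE INTEGER,
    PRIMARY KEY (SPNUM)
);

-- Data manipulation (DML): the four basic operations
INSERT INTO SALESPERSON VALUES ('528', 'Johnson', 10, 2005);  -- insert a new record
SELECT SPNAME FROM SALESPERSON WHERE SPNUM = '528';           -- retrieve data
UPDATE SALESPERSON SET COMMPCT = 15 WHERE SPNUM = '528';      -- update existing data
DELETE FROM SALESPERSON WHERE SPNUM = '528';                  -- delete a record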

A standard language for data management in relational databases, known as Structured Query Language or SQL, was developed in the early 1980s. SQL incorporates both DDL and DML features. It was derived from an early IBM research project in relational databases called "System R." SQL has long since been declared a standard by the American National Standards Institute (ANSI) and by the International Standards Organization (ISO). Indeed, several versions of the standards have been issued over the years. Using the standards, many manufacturers have produced versions of SQL that are all quite similar, at least at the level at which we will look at SQL in this book. These SQL versions are found in such mainstream DBMSs as DB2, Oracle, MS Access, Informix, and others. SQL in its various implementations is used very heavily in practice today by companies and organizations of every description, Advance Auto Parts being one of countless examples.

SQL is a comprehensive database management language. The most interesting aspect of SQL and the aspect that we want to explore in this chapter is its rich data retrieval capability. The other SQL data manipulation features, as well as the SQL data definition features, will be considered in the database design chapters that come later in this book.

Knowledge Check: Structured Query Language

4.2 Data Retrieval With The SQL SELECT Command

Introduction to the SQL SELECT Command

Data retrieval in SQL is accomplished with the SELECT command. There are a few fundamental ideas about the SELECT command that you should understand before looking into the details of using it. The first point is that the SQL SELECT command is not the same thing as the relational algebra Select operator. It's a bit unfortunate that the same word is used to mean two different things, but that's the way it is. The fact is that the SQL SELECT command is capable of performing relational Select, Project, and Join operations singly or in combination.
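
As a small preview (reusing the illustrative SALESPERSON column names assumed earlier in this lesson, not actual table definitions), a single SELECT statement can perform a relational Select and a relational Project at the same time:

-- Project (choose columns) and Select (choose rows) in one statement
SELECT SPNAME, YEARHIRE
FROM SALESPERSON
WHERE COMMPCT = 10;

A statement that lists two tables and matches their join fields would add the relational Join operation as well, as later parts of this lesson demonstrate.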

CONCEPTS IN ACTION 4-A ADVANCE AUTO PARTS Advance Auto Parts is the second largest retailer of automotive parts and accessories in the U. S. The company was founded in 1932 with three stores in Roanoke, VA, where it is still headquartered today. In the 1980s, with fewer than 175 stores, the company developed an expansion plan that brought it to over 350 stores by the end of 1993. It has rapidly accelerated its expansion since then and, with mergers and acquisitions, now has more than 2,400 stores and over 32,000 employees throughout the United States. Advance Auto Parts sells over 250,000 automotive components. Its innovative "Parts Delivered Quickly" (PDQ) system, which was introduced in 1982, allows its customers access to this inventory within 24 hours.

One of Advance Auto Parts' key database applications, its Electronic Parts Catalog, gives the company an important competitive advantage. Introduced in the early 1990s and continually upgraded since then, this system allows store personnel to look up products they sell based on the customer's vehicle type. The system's records include part descriptions, images, and drawings. Once identified, store personnel pull an item from the store's shelves if it's in stock. If it's not in stock, then using the system they send out a real-time request for the part to the home office to check on the part's warehouse availability. Within minutes the part is picked at a regional warehouse and it's on its way. In addition to its in-store use, the system is used by the company's purchasing and other departments.

The system runs on an IBM mid-range system at company headquarters and is built on the SQL Server DBMS. Parts catalog data, in the form of updates, is downloaded weekly from this system to a small server located in each store. Additional data retrieval at headquarters is accomplished with SQL. The 35-table database includes a Parts table with 2.5 million rows that accounts not only for all of the items in inventory but for different brands of the same item. There is also a Vehicle table with 31,000 records. These two lead to a 45-million-record Parts Application table that describes which parts can be used in which vehicles.


Photo Courtesy of Advance Auto Parts

SQL SELECT commands are considered, for the most part, to be "declarative" rather than "procedural" in nature. This means that you specify what data you are looking for rather than provide a logical sequence of steps that guide the system in how to find the data. Indeed, as we will see later in this chapter, the relational DBMS analyzes the declarative SQL SELECT statement and creates an access path, a plan for what steps to take to respond to the query. The exception to this, and the reason for the qualifier "for the most part" at the beginning of this paragraph, is that a feature of the SELECT command known as "subqueries" permits the user to specify a certain amount of logical control over the data retrieval process.
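Most relational DBMSs let you display the access plan they choose for a declarative query. The keyword and the format of the output vary by product (for example, EXPLAIN in MySQL and PostgreSQL, EXPLAIN PLAN in Oracle), so the following is only an illustrative sketch:

EXPLAIN
SELECT SPNAME
FROM SALESPERSON
WHERE SPNUM=137;
-- The DBMS reports the steps it intends to take (index lookup, full table scan, and so on)
-- instead of simply returning the query result.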

Another point is that SQL SELECT commands can be run in either a "query" or an "embedded" mode. In the query mode, the user types the command at a workstation and presses the Enter key. The command goes directly to the relational DBMS, which evaluates the query and processes it against the database. The result is then returned to the user at the workstation. Commands entered this way can normally also be stored and retrieved at a later time for repetitive use. In the embedded mode, the SELECT command is embedded within the lines of a higher-level language program and functions as an input or "read" statement for the program. When the program is run and the program logic reaches the SELECT command, the program executes the SELECT. The SELECT command is sent to the DBMS, which, as in the query-mode case, processes it against the database and returns the results, this time to the program that issued it. The program can then use and further process the returned data. The only tricky part to this is that traditional higher-level language programs are designed to retrieve one record at a time. The result of a relational retrieval command is, itself, a relation. A relation that consists of a single row can resemble a record, but a relation of several rows resembles, if anything, several records. In the embedded mode, the program that issued the SQL SELECT command and receives the resulting relation back must treat the rows of the relation as a list of records and process them one at a time.
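The row-at-a-time processing described above is typically handled with a cursor. The sketch below uses generic embedded-SQL conventions; the cursor name CUSTCURSOR and the colon-prefixed host variables are illustrative only, and the exact syntax varies with the DBMS and the host programming language.

DECLARE CUSTCURSOR CURSOR FOR
    SELECT CUSTNUM, CUSTNAME
    FROM CUSTOMER
    WHERE SPNUM=137;

OPEN CUSTCURSOR;
FETCH CUSTCURSOR INTO :CUSTNUM, :CUSTNAME;   -- the program loops here, retrieving one row per FETCH
CLOSE CUSTCURSOR;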

SQL SELECT commands can be issued against either the actual, physical database tables or against a "logical view" of one table or of several joined tables. Good business practice dictates that in the commercial environment, SQL SELECT commands should be issued against such logical views rather than directly against the base tables. As we will see later in this book, this is a simple but effective security precaution.
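As a small sketch of this practice (the view name NY_CUSTOMERS is invented here for illustration), a view can expose only the rows and columns a particular group of users is permitted to see, and their SELECT commands are then written against the view rather than the base table:

CREATE VIEW NY_CUSTOMERS AS
    SELECT CUSTNUM, CUSTNAME, HQCITY
    FROM CUSTOMER
    WHERE HQCITY='New York';

SELECT CUSTNAME
FROM NY_CUSTOMERS
WHERE CUSTNUM>1500;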

Finally, the SQL SELECT command has a broad array of features and options and we will only cover some of them at this introductory level. But what is also very important is that our discussion of the SELECT command and the features that we will cover will work in all of the major SQL implementations, such as Oracle, MS Access, SQL Server, DB2, Informix, and so on, possibly with minor syntax variations in some cases. Practice: SQL command
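One familiar example of such a minor syntax variation, shown here purely as an illustration, is the way different products limit a result to the first few rows:

SELECT SPNUM, SPNAME FROM SALESPERSON LIMIT 5;    -- MySQL, PostgreSQL, SQLite
SELECT TOP 5 SPNUM, SPNAME FROM SALESPERSON;      -- SQL Server, MS Access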

4.3 Basic Functions

The Basic SELECT Format

In the simplest SELECT command, we will indicate from which table of the database we want to retrieve data, which rows of that table we are interested in, and which attributes of those rows we want to retrieve. The basic format of such a SELECT statement is:

SELECT <attributes to be retrieved>

FROM <table(s) involved>

WHERE <predicates identifying the rows to be included>

We will illustrate the SQL SELECT command with the General Hardware Co. database of Figure 3.1, which is derived from the General Hardware entity-relationship diagram of Figure 2.9. If you have not yet covered the database design chapters in this book, just keep in mind that some of the columns are present to tie together related data from different tables. For example, the SPNUM column in the CUSTOMER table is present to tie together related salespersons and customers.

Figure 3.1: The General Hardware Company relational database

As is traditional with SQL, the SQL statements will be shown in all capital letters, except for data values taken from the tables. Note that the attribute names in Figure 3.1 have been abbreviated for convenience and set in capital letters to make them easily recognizable in the SQL statements. Also, spaces in the names have been removed.

Using the General Hardware database, an example of a simple query that demonstrates the basic SELECT format is: "Find the commission percentage and year of hire of salesperson number 186." The SQL statement to accomplish this would be:

SELECT COMMPERCT, YEARHIRE
FROM SALESPERSON
WHERE SPNUM=186;

How is this command constructed? The desired attributes are listed in the SELECT clause, the required table is listed in the FROM clause, and the restriction or predicate indicating which row(s) are involved is shown in the WHERE clause in the form of an equation. Notice that SELECT statements always end with a single semicolon (;) at the very end of the entire statement. The result of this statement is:

COMMPERCT YEARHIRE
15 2001

As is evident from this query, an attribute like SPNUM that is used to search for the required rows, also known as a "search argument," does not have to appear in the query result, as long as its absence does not make the result ambiguous, confusing, or meaningless. To retrieve the entire record for salesperson 186, the statement would change to:

SELECT *
FROM SALESPERSON
WHERE SPNUM=186;

resulting in:

SPNUM SPNAME COMMPERCT YEARHIRE OFFNUM
186 Adams 15 2001 1253

The "*" in the SELECT clause indicates that all attributes of the selected row are to be retrieved. Notice that this retrieval of an entire row of the table is, in fact, a relational Select operation! A relational Select operation can retrieve one or more rows of a table, depending, in this simple case, on whether the search argument is a unique or non-unique attribute. The search argument is non-unique in the following query: "List the salesperson numbers and salesperson names of those salespersons who have a commission percentage of 10."

SELECT SPNUM, SPNAME
FROM SALESPERSON
WHERE COMMPERCT=10;

which results in:

SPNUM SPNAME
137 Baker
204 Dickens

The SQL SELECT statement can also be used to accomplish a relational Project operation. This is a vertical slice through a table involving all rows and some attributes. Since all of the rows are included in the Project operation, there is no need for a WHERE clause to limit which rows of the table are included. For example, "List the salesperson number and salesperson name of all of the salespersons."
SELECT SPNUM, SPNAME FROM SALESPERSON; results in: SPNUM SPNAME 137 Baker 186 Adams 204 Dickens 361 Carlyle To retrieve an entire table, that is to design an SQL SELECT statement that places no restrictions on either the rows or the attributes, you would issue: SELECT * FROM SALESPERSON; and have as the result: SPNUM SPNAME COMMPERCT YEARHIRE OFFNUM 137 Baker 10 1995 1284 186 Adams 15 2001 1253 204 Dickens 10 1998 1209 361 Carlyle 20 2001 1227 Practice: Use of the SELECT statement Comparisons In addition to equal (=), the standard comparison operators, greater than (>), less than (<), greater than or equal to (>=), less than or equal to (<=), and not equal to (<>) can be used in the WHERE clause. "List the salesperson numbers, salesperson names, and commission percentages of the salespersons whose commission percentage is less than 12." SELECT SPNUM, SPNAME, COMMPERCT FROM SALESPERSON WHERE COMMPERCT<12; This results in: SPNUM SPNAME COMMPERCT 137 Baker 10 204 Dickens 10 As another example: "List the customer numbers and headquarters cities of the customers that have a customer number of at least 1700." SELECT CUSTNUM, HQCITY FROM CUSTOMER WHERE CUSTNUM>=1700; results in: CUSTNUM HQCITY 1700 Washington 1826 New York 2198 New York 2267 New York Practice: Use of the standard comparison operators ANDs and ORs Frequently, there is a need to specify more than one limiting condition on a table's rows in a query. Sometimes, for a row to be included in the result it must satisfy more than one condition. This requires the Boolean AND operator. Sometimes a row can be included if it satisfies one of two or more conditions. This requires the Boolean OR operator. AND An example in which two conditions must be satisfied is: "List the customer numbers, customer names, and headquarters cities of the customers that are headquartered in New York and that have a customer number higher than 1500." SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE HQCITY=‘New York’ AND CUSTNUM>1500; resulting in: CUSTNUM CUSTNAME HQCITY 1826 City Hardware New York 2198 Western Hardware New York 2267 Central Stores New York Notice that customer number 0121, which is headquartered in New York, was not included in the results because it failed to satisfy the condition of having a customer number greater than 1500. With the AND operator, it had to satisfy both conditions to be included in the result. OR To look at the OR operator, let's change the last query to: "List the customer numbers, customer names, and headquarters cities of the customers that are headquartered in New York or that have a customer number higher than 1500." SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE HQCITY=‘New York’ OR CUSTNUM>1500; results in: CUSTNUM CUSTNAME HQCITY 0121 Main St. Hardware New York 1525 Fred's Tool Stores Atlanta 1700 XYZ Stores Washington 1826 City Hardware New York 2198 Western Hardware New York 2267 Central Stores New York Notice that the OR operator really means one or the other or both. Customer 0121 is included because it is headquartered in New York. Customers 1525 and 1700 are included because they have customer numbers higher than 1500. Customers 1826, 2198, and 2267 are included because they satisfy both conditions. Both AND and OR What if both AND and OR are specified in the same WHERE clause? AND is said to be "higher in precedence" than OR, and so all ANDs are considered before any ORs are considered. 
The following query, which has to be worded very carefully, illustrates this point: "List the customer numbers, customer names, and headquarters cities of the customers that are headquartered in New York or that satisfy the two conditions of having a customer number higher than 1500 and being headquartered in Atlanta." SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE HQCITY=‘New York’ OR CUSTNUM>1500 AND HQCITY=‘Atlanta’; The result of this query is: CUSTNUM CUSTNAME HQCITY 0121 Main St. Hardware New York 1525 Fred's Tool Stores Atlanta 1826 City Hardware New York 2198 Western Hardware New York 2267 Central Stores New York Notice that since the AND is considered first, one way for a row to qualify for the result is if its customer number is greater than 1500 and its headquarters city is Atlanta. With the AND taken first, it's that combination or the headquarters city has to be New York. Considering the OR operator first would change the whole complexion of the statement. The best way to deal with this, especially if there are several ANDs and ORs in a WHERE clause, is by using parentheses. The rule is that anything in parentheses is done first. If the parentheses are nested, then whatever is in the innermost parentheses is done first and then the system works from there towards the outermost parentheses. Thus, a "safer" way to write the last SQL statement would be: SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE HQCITY=‘New York’ OR (CUSTNUM>1500 AND HQCITY=‘Atlanta’); If you really wanted the OR to be considered first, you could force it by writing the query as: SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE (HQCITY=‘New York’ OR CUSTNUM>1500) AND HQCITY=‘Atlanta’; This would mean that, with the AND outside of the parentheses, both of two conditions have to be met for a row to qualify for the results. One condition is that the headquarters city is New York or the customer number is greater than 1500. The other condition is that the headquarters city is Atlanta. Since for a given row, the headquarters city can't be both Atlanta and New York, the situation looks grim. But, in fact, customer number 1525 qualifies. Its customer number is greater than 1500, which satisfies the OR of the first of the two conditions, and its headquarters city is Atlanta, which satisfies the second condition. Thus, both conditions are met for this and only this row. Practice: Use of the ANDs and ORs operator BETWEEN, IN, and LIKE BETWEEN, IN, and LIKE are three useful operators. BETWEEN allows you to specify a range of numeric values in a search. IN allows you to specify a list of character strings to be included in a search. LIKE allows you to specify partial character strings in a "wildcard" sense. BETWEEN Suppose that you want to find the customer records for those customers whose customer numbers are between 1000 and 1700 inclusive (meaning that both 1000 and 1700, as well as all numbers in between them, are included). Using the AND operator, you could specify this as: SELECT * FROM CUSTOMER WHERE (CUSTNUM>=1000 AND CUSTNUM<=1700); Or, you could use the BETWEEN operator and specify it as: SELECT * FROM CUSTOMER WHERE CUSTNUM BETWEEN 1000 AND 1700; With either way of specifying it, the result would be: CUSTNUM CUSTNAME SPNUM HQCITY 1047 Acme Hardware Store 137 Los Angeles 1525 Fred's Tool Stores 361 Atlanta 1700 XYZ Stores 361 Washington IN Suppose that you want to find the customer records for those customers headquartered in Atlanta, Chicago, or Washington. 
Using the OR operator, you could specify this as: SELECT * FROM CUSTOMER WHERE (HQCITY=‘Atlanta’ OR HQCITY=‘Chicago’ OR HQCITY=‘Washington’); Or, you could use the IN operator and specify it as: SELECT * FROM CUSTOMER WHERE HQCITY IN (‘Atlanta’, ‘Chicago’, ‘Washington’); With either way of specifying it, the result would be: CUSTNUM CUSTNAME SPNUM HQCITY 0839 Jane's Stores 186 Chicago 1525 Fred's Tool Stores 361 Atlanta 1700 XYZ Stores 361 Washington LIKE Suppose that you want to find the customer records for those customers whose names begin with the letter "A". You can accomplish this with the LIKE operator and the "%" character used as a wildcard to represent any string of characters. Thus, ‘A%’ means the letter "A" followed by any string of characters, which is the same thing as saying ‘any word that begins with "A".’ SELECT * FROM CUSTOMER WHERE CUSTNAME LIKE ‘A%’; The result would be: CUSTNUM CUSTNAME SPNUM HQCITY 0933 ABC Home Stores 137 Los Angeles 1047 Acme Hardware Store 137 Los Angeles Note that, unlike BETWEEN and IN, there is no easy alternative way in SQL of accomplishing what LIKE can do. In a different kind of example, suppose that you want to find the customer records for those customers whose names have the letter "a" as the second letter of their names. Could you specify ‘%a%’? No, because the ‘%a’ portion of it would mean any number of letters followed by "a", which is not what you want. In order to make sure that there is just one character followed by "a", which is the same thing as saying that "a" is the second letter, you would specify ‘_a%’. The "_" wildcard character means that there will be exactly one letter (any one letter) followed by the letter "a". The "%", as we already know, means that any string of characters can follow afterwards. SELECT * FROM CUSTOMER WHERE CUSTNAME LIKE ‘_a%’; The result would be: CUSTNUM CUSTNAME SPNUM HQCITY 0121 Main St. Hardware 137 New York 0839 Jane's Stores 186 Chicago Notice that both the words "Main" and "Jane's" have "a" as their second letter. Also notice that, for example, customer number 2267 was not included in the result. Its name, "Central Stores", has an "a" in it but it is not the second letter of the name. Again, the single "_" character in the operator LIKE ‘_a%’ specifies that there will be one character followed by "a". If the operator had been LIKE ‘%a%’, then Central Stores would have been included in the result. Filtering the Results of an SQL Query Two ways to modify the results of an SQL SELECT command are by the use of DISTINCT and the use of ORDER BY. It is important to remember that these two devices do not affect what data is retrieved from the database but rather how the data is presented to the user. DISTINCT There are circumstances in which the result of a SQL query may contain duplicate items and this duplication is undesirable. Consider the following query: "Which cities serve as headquarters cities for General Hardware customers?" This could be taken as a simple relational Project that takes the HQCITY column of the CUSTOMER table as its result. The SQL command would be: SELECT HQCITY FROM CUSTOMER; which results in: HQCITY New York Chicago Los Angeles Los Angeles Atlanta Washington New York New York New York Technically, this is the correct result, but why is it necessary to list New York four times or Los Angeles twice? Not only is it unnecessary to list them more than once, but doing so produces unacceptable clutter. 
Based on the way the query was stated, the result should have each city listed once. The DISTINCT operator is used to eliminate duplicate rows in a query result. Reformulating the SELECT statement as: SELECT DISTINCT HQCITY FROM CUSTOMER; results in: HQCITY New York Chicago Los Angeles Atlanta Washington ORDER BY The ORDER BY clause simply takes the results of a SQL query and orders them by one or more specified attributes. Consider the following query: "Find the customer numbers, customer names, and headquarters cities of those customers with customer numbers greater than 1000. List the results in alphabetic order by headquarters cities." SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE CUSTNUM>1000 ORDER BY HQCITY; This results in: CUSTNUM CUSTNAME HQCITY 1525 Fred's Tool Stores Atlanta 1047 Acme Hardware Store Los Angeles 1826 City Hardware New York 2198 Western Hardware New York 2267 Central Stores New York 1700 XYZ Stores Washington If you wanted to have the customer names within the same city alphabetized, you would write: SELECT CUSTNUM, CUSTNAME, HQCITY FROM CUSTOMER WHERE CUSTNUM>1000 ORDER BY HQCITY, CUSTNAME; This results in: CUSTNUM CUSTNAME HQCITY 1525 Fred's Tool Stores Atlanta 1047 Acme Hardware Store Los Angeles 2267 Central Stores New York 1826 City Hardware New York 2198 Western Hardware New York 1700 XYZ Stores Washington The default order for ORDER BY is ascending. The clause can include the term ASC at the end to make ascending explicit or it can include DESC for descending order. Practice: Use of BETWEEN, IN, LIKE, DISTINCT, and ORDER BY 4.4 Built-In Functions A number of so-called "built-in functions" give the SQL SELECT command additional capabilities. They involve the ability to perform calculations based on attribute values or to count the number of rows that satisfy stated criteria. AVG and SUM Recall that the SALES table shows the lifetime quantity of particular products sold by particular salespersons. For example, the first row indicates that Salesperson 137 has sold 473 units of Product Number 19440 dating back to when she joined the company or when the product was introduced. Consider the following query: "Find the average number of units of the different products that Salesperson 137 has sold (i.e., the average of the quantity values in the first three records of the SALES table)."Using the AVG operator, you would write: SELECT AVG(QUANTITY) FROM SALES WHERE SPNUM=137; and the result would be: AVG(QUANTITY) 443.67 To find the total number of units of all products that she has sold, you would use the SUM operator and write: SELECT SUM(QUANTITY) FROM SALES WHERE SPNUM=137; and the result would be: SUM(QUANTITY) 1331 4.5 The Join Up to this point, all the SELECT features we have looked at have been shown in the context of retrieving data from a single table. The time has come to look at how the SQL SELECT command can integrate data from two or more tables or "join" them. There are two specifications to make in the SELECT statement to make a join work. One is that the tables to be joined must be listed in the FROM clause. The other is that the join attributes in the tables being joined must be declared and matched to each other in the WHERE clause. And there is one more point. Since two or more tables are involved in a SELECT statement that involves a join, there is the possibility that the same attribute name can appear in more than one of the tables. 
When this happens, these attribute names must be "qualified" with a table name when used in the SELECT statement. All of this is best illustrated in an example. Consider the following query, which we discussed earlier in this book: "Find the name of the salesperson responsible for Customer Number 1525." The SELECT statement to satisfy this query is: SELECT SPNAME FROM SALESPERSON, CUSTOMER WHERE SALESPERSON.SPNUM=CUSTOMER.SPNUM AND CUSTNUM=1525; and the result is: SPNAME Carlyle Let's take a careful look at this last SELECT statement. Notice that the two tables involved in the join, SALESPERSON and CUSTOMER, are listed in the FROM clause. Also notice that the first line of the WHERE clause: SALESPERSON.SPNUM=CUSTOMER.SPNUM links the two join attributes: the SPNUM attribute of the SALESPERSON table (SALESPERSON.SPNUM) and the SPNUM attribute of the CUSTOMER table (CUSTOMER.SPNUM). The notational device of having the table name "." the attribute name is known as "qualifying" the attribute name. As we said earlier, this qualification is necessary when the same attribute name is used in two or more tables in a SELECT statement. By the way, notice in the SELECT statement that the attributes SPNAME and CUSTNUM don't have to be qualified because each appears in only one of the tables included in the SELECT statement. Here is an example of a join involving three tables, assuming for the moment that salesperson names are unique: "List the names of the products of which salesperson Adams has sold more than 2000 units." The salesperson name data appears only in the SALESPERSON table and the product name data appears only in the PRODUCT table. The SALES table shows the linkage between the two, including the quantities sold. And so the SELECT statement will be: SELECT PRODNAME FROM SALESPERSON, PRODUCT, SALES WHERE SALESPERSON.SPNUM=SALES.SPNUM AND SALES.PRODNUM=PRODUCT.PRODNUM AND SPNAME=‘Adams’ AND QUANTITY>2000; which results in: PRODNAME Hammer Saw Practice: SQL Concept 4.6 Subqueries A variation on the way that the SELECT statement works is when one SELECT statement is "nested" within another in a format known as a subquery. This can go on through several levels of SELECT statements, with each successive SELECT statement contained in a pair of parentheses. The execution rule is that the innermost SELECT statement is executed first and its results are then provided as input to the SELECT statement at the next level up. This procedure can be an alternative to the join. Furthermore, there are certain circumstances in which this procedure must be used. These latter circumstances are common enough and important enough to include in this treatment of the SQL SELECT command. Subqueries as Alternatives to Joins Let's reconsider the first join example given above: "Find the name of the salesperson responsible for Customer Number 1525." If you methodically weave through the database tables to solve this, as we discussed earlier in the book, you start at the CUSTOMER table, find the record for Customer Number 1525 and discover in that record that the salesperson responsible for this customer is Salesperson Number 361. You then take that information to the SALESPERSON table where you look up the record for Salesperson Number 361 and discover in it that the salesperson's name is Carlyle. 
Using a subquery, this logic can be built into an SQL statement as:

SELECT SPNAME
FROM SALESPERSON
WHERE SPNUM=
    (SELECT SPNUM
    FROM CUSTOMER
    WHERE CUSTNUM=1525);

and the result will again be:

SPNAME
Carlyle

Follow the way that the description given above of methodically solving the problem is reconstructed as a SELECT statement with a subquery. Since the innermost SELECT (the indented one), which constitutes the subquery, is considered first, the CUSTOMER table is queried first, the record for Customer Number 1525 is found, and 361 is returned as the SPNUM result. How do we know that only one salesperson number will be found as the result of the query? Because CUSTNUM is a unique attribute, Customer Number 1525 can only appear in one record, and that one record only has room for one salesperson number! Moving along, Salesperson Number 361 is then fed to the outer SELECT statement. This, in effect, makes the main query, that is the outer SELECT, look like:

SELECT SPNAME
FROM SALESPERSON
WHERE SPNUM=361;

and this results in:

SPNAME
Carlyle

Notice, by the way, that in the SELECT statement there is only one semicolon, at the end of the entire statement, including the subquery.

When a Subquery is Required

There is a very interesting circumstance in which a subquery is required. This situation is best explained with an example up front. Consider the following query: "Which salespersons with salesperson numbers greater than 200 have the lowest commission percentage?" (We'll identify salespersons by their salesperson number.) This seems like a perfectly reasonable request, and yet it turns out to be deceptively difficult. The reason is that the query really has two very different parts. First, the system has to determine what the lowest commission percentage is for salespersons with salesperson numbers greater than 200. Then, it has to see which of these salespersons has that lowest percentage. It's really tempting to try to satisfy this type of query with an SQL SELECT statement like:

SELECT SPNUM, MIN(COMMPERCT)
FROM SALESPERSON
WHERE SPNUM>200;

or, perhaps:

SELECT SPNUM
FROM SALESPERSON
WHERE SPNUM>200
AND COMMPERCT=MIN(COMMPERCT);

But these will not work! It's like asking SQL to perform two separate operations and somehow apply one to the other in the correct sequence. This turns out to be asking too much. But there is a way to do it, and it involves subqueries. In fact, what we will do is ask the system to determine the minimum commission percentage first, in a subquery, and then use that information in the main query to determine which salespersons have it:

SELECT SPNUM
FROM SALESPERSON
WHERE SPNUM>200
AND COMMPERCT=
    (SELECT MIN(COMMPERCT)
    FROM SALESPERSON
    WHERE SPNUM>200);

which results in:

SPNUM
204

The minimum commission percentage across all of the salespersons with salesperson numbers greater than 200 is determined first in the subquery and the result is 10. The main query then, in effect, looks like:

SELECT SPNUM
FROM SALESPERSON
WHERE SPNUM>200
AND COMMPERCT=10;

which yields the result of salesperson number 204, as shown. Actually, this is a very interesting example of a required subquery. What makes it really interesting is why the predicate, SPNUM>200, appears in both the main query and the subquery. Clearly it has to be in the subquery because you must first find the lowest commission percentage among the salespersons with salesperson numbers greater than 200. But then why does it have to be in the main query, too?
The answer is that the only thing that the subquery returns to the main query is a single number, specifically a commission percentage. No memory is passed on to the main query of how the subquery arrived at that value. If you remove SPNUM>200 from the main query, so that it now looks like:

SELECT SPNUM
FROM SALESPERSON
WHERE COMMPERCT=
    (SELECT MIN(COMMPERCT)
    FROM SALESPERSON
    WHERE SPNUM>200);

you would find every salesperson, with any salesperson number, whose commission percentage is equal to the lowest commission percentage of the salespersons with salesperson numbers greater than 200. (Of course, if for some reason you do want to find all of the salespersons, regardless of salesperson number, who have the same commission percentage as the salesperson who has the lowest commission percentage of the salespersons with salesperson numbers greater than 200, then this last SELECT statement is exactly what you should write!)

Practice: Subqueries

A Strategy for Writing SQL SELECT Commands

Before we go on to some more examples, it will be helpful to think about developing a strategy for writing SQL SELECT statements. The following is an ordered list of steps.

   1. Determine what the result of the query is to be and write the needed attributes and functions in the SELECT clause. This may seem an obvious instruction, but it will really pay to think this through carefully before going on. In fact, it is at this very first step that you must determine whether the query will require a GROUP BY clause or a subquery. If either of these is required, you should start outlining the overall SELECT statement by writing the GROUP BY clause or the nested SELECT for the subquery further down the page (or screen).
   2. Determine which tables of the database will be needed for the query and write their names in the FROM clause. Include only those tables that are really necessary for the query. Sometimes this can be tricky. For example, you might need an attribute that is the primary key of a table and be tempted immediately to include that table in the FROM clause. However, it could be that the attribute in question is a foreign key in another table that is already in the FROM clause for other reasons. It is then unnecessary to include the table in which it is the primary key unless, of course, other attributes from that table are needed, too.
   3. Begin constructing the WHERE clause by equating the join attributes from the tables that are in the FROM clause. Once this job is out of the way, you can begin considering the row limitations that must be stated in the WHERE clause.
   4. Continue filling in the details of the WHERE clause, the GROUP BY clause, and any subqueries.

One final piece of advice: If you are new to writing SQL SELECT commands but you have a programming background, you may be tempted to avoid setting up joins and try writing subqueries instead. Resist this temptation, for two reasons! One is that joins are an essential part of the relational database concept. Embrace them; don't be afraid of them. The other is that writing multiple levels of nested subqueries can be extremely error prone and difficult to debug.

4.7 Example: Good Reading Bookstores

The best way to gain confidence in understanding SQL SELECT statements is to write some! And there are some further refinements of the SQL SELECT that we have yet to present. We will use the same three example databases that appeared in previous chapters but, as with the General Hardware database, we will shorten the attribute names.
We will state a variety of queries and then give the SELECT statements that will satisfy them, plus commentary as appropriate. You should try to write the SELECT statements yourself before looking at our solutions! Figure 3.2 is the Good Reading Bookstores relational database. Here is a list of queries for Good Reading Bookstores. Figure 3.2: Good reading Bookstores Relational database "Find the book number, book name, and number of pages of all the books published by London Publishing Ltd. List the results in order by book name." This query obviously requires the PUBNAME attribute but it does not require the PUBLISHER table. All of the information needed is in the BOOK table, including the PUBNAME attribute, which is there as a foreign key. The SELECT statement is: SELECT BOOKNUM, BOOKNAME, PAGES FROM BOOK WHERE PUBNAME=‘London Publishing Ltd.’ ORDER BY BOOKNAME; "How many books of at least 400 pages does Good Reading Bookstores carry that were published by publishers based in Paris, France?" This is a straightforward join between the PUBLISHER and BOOK tables that uses the built-in function COUNT. All of the attribute names are unique between the two tables, except for PUBNAME, which must be qualified with a table name every time it is used. Notice that ‘Good Reading Bookstores’ does not appear as a condition in the SELECT statement, although it was mentioned in the query. The entire database is about Good Reading Bookstores and no other! There is no BOOKSTORE CHAIN table in the database and there is no STORENAME or CHAINNAME attribute in any of the tables. SELECT COUNT(*) FROM PUBLISHER, BOOK WHERE PUBLISHER.PUBNAME=BOOK.PUBNAME AND CITY=‘Paris’ AND COUNTRY=‘France’ AND PAGES>=400; "List the publishers in Belgium, Brazil, and Singapore that publish books written by authors who were born before 1920." Sometimes a relatively simple-sounding query can be fairly involved. This query actually requires four tables of the database! To begin with, we need the PUBLISHER table because that's the only place that a publisher's country is stored. But we also need the AUTHOR table because that's where author birth years are stored. The only way to tie the PUBLISHER table to the AUTHOR table is to connect PUBLISHER to BOOK, then to connect BOOK to WRITING, and finally to connect WRITING to AUTHOR. With simple, one-attribute keys such as those in these tables, the number of joins will be one fewer than the number of tables. The FROM clause below shows four tables and the first three lines of the WHERE clause show the three joins. Also, notice that since a publisher may have published more than one book with the stated specifications, DISTINCT is required to prevent the same publisher name from appearing several, perhaps many, times in the result. Finally, since we want to include publishers in three specific countries, we list the three countries as Belgium, Brazil, and Singapore. But, in the SELECT statement, we have to indicate that for a record to be included in the result, the value of the COUNTRY attribute must be Belgium, Brazil or Singapore. SELECT DISTINCT PUBNAME FROM PUBLISHER, BOOK, WRITING, AUTHOR WHERE PUBLISHER.PUBNAME=BOOK.PUBNAME AND BOOK.BOOKNUM=WRITING.BOOKNUM AND WRITING.AUTHORNUM=AUTHOR.AUTHORNUM AND COUNTRY IN (‘Belgium’, ‘Brazil’, ‘Singapore’) AND YEARBORN < 1920; "How many books did each publisher in Oslo, Norway; Nairobi, Kenya; and Auckland, New Zealand, publish in 2001?" The keyword here is "each." 
This query requires a separate total for each publisher that satisfies the conditions. This is a job for the GROUP BY clause. We want to group together the records for each publisher and count the number of records in each group. Each line of the result must include both a publisher name and a count of the number of records that satisfy the conditions. This SELECT statement requires both a join and a GROUP BY. Notice the seeming complexity but really the unambiguous beauty of the ANDs and ORs structure regarding the cities and countries.

SELECT PUBNAME, CITY, COUNTRY, COUNT(*)
FROM PUBLISHER, BOOK
WHERE PUBLISHER.PUBNAME=BOOK.PUBNAME
AND ((CITY=‘Oslo’ AND COUNTRY=‘Norway’)
OR (CITY=‘Nairobi’ AND COUNTRY=‘Kenya’)
OR (CITY=‘Auckland’ AND COUNTRY=‘New Zealand’))
AND PUBYEAR=2001
GROUP BY PUBNAME, CITY, COUNTRY;

"Which publisher published the book that has the earliest publication year among all the books that Good Reading Bookstores carries?" All that is called for in this query is the name of the publisher, not the name of the book. This is a case that requires a subquery. First the system has to determine the earliest publication year, then it has to see which books have that earliest publication year. Once you know the books, their records in the BOOK table give you the publisher names. Since more than one publisher may have published a book in that earliest year, there could be more than one publisher name in the result. And, since a particular publisher could have published more than one book in that earliest year, DISTINCT is required to avoid having that publisher's name listed more than once.

SELECT DISTINCT PUBNAME
FROM BOOK
WHERE PUBYEAR=
    (SELECT MIN(PUBYEAR)
    FROM BOOK);

4.8 Example: World Music Association

Figure 3.3 is the World Music Association relational database. Here is a list of queries for the World Music Association.

Figure 3.3: World Music Association relational database

"What is the total annual salary cost for all the violinists in the Berlin Symphony Orchestra?"

SELECT SUM(ANNSALARY)
FROM MUSICIAN
WHERE ORCHNAME=‘Berlin Symphony Orchestra’
AND INSTRUMENT=‘Violin’;

"Make a single list, in alphabetic order, of all of the universities attended by the cellists in India."

SELECT DISTINCT UNIVERSITY
FROM ORCHESTRA, MUSICIAN, DEGREE
WHERE ORCHESTRA.ORCHNAME=MUSICIAN.ORCHNAME
AND MUSICIAN.MUSNUM=DEGREE.MUSNUM
AND INSTRUMENT=‘Cello’
AND COUNTRY=‘India’
ORDER BY UNIVERSITY;

"What is the total annual salary cost for all of the violinists of each orchestra in Canada? Include in the result only those orchestras whose total annual salary for its violinists is in excess of $150,000." Since this query requires a separate total for each orchestra, the SELECT statement must rely on the GROUP BY clause. Since the condition that the total must be over 150,000 is based on figures calculated by the SUM built-in function, it must be placed in a HAVING clause rather than in the WHERE clause.

SELECT ORCHNAME, SUM(ANNSALARY)
FROM ORCHESTRA, MUSICIAN
WHERE ORCHESTRA.ORCHNAME=MUSICIAN.ORCHNAME
AND COUNTRY=‘Canada’
AND INSTRUMENT=‘Violin’
GROUP BY ORCHNAME
HAVING SUM(ANNSALARY)>150000;

"What is the name of the most highly paid pianist?" It should be clear that a subquery is required. First the system has to determine what the top salary of pianists is and then it has to find out which pianists have that salary.
SELECT MUSNAME
FROM MUSICIAN
WHERE INSTRUMENT=‘Piano’
AND ANNSALARY=
    (SELECT MAX(ANNSALARY)
    FROM MUSICIAN
    WHERE INSTRUMENT=‘Piano’);

This is another example in which a predicate, INSTRUMENT=‘Piano’ in this case, appears in both the main query and the subquery. Clearly it has to be in the subquery because you must first find out how much money the highest-paid pianist makes. But then why does it have to be in the main query, too? The answer is that the only thing that the subquery returns to the main query is a single number, specifically a salary value. No memory is passed on to the main query of how the subquery arrived at that value. If you remove INSTRUMENT=‘Piano’ from the main query so that it now looks like:

SELECT MUSNAME
FROM MUSICIAN
WHERE ANNSALARY=
    (SELECT MAX(ANNSALARY)
    FROM MUSICIAN
    WHERE INSTRUMENT=‘Piano’);

you would find every musician, regardless of the instrument played, whose salary is equal to that of the highest-paid pianist. Of course, if for some reason you do want to find all of the musicians, regardless of the instrument they play, who have the same salary as the highest-paid pianist, then this last SELECT statement is exactly what you should write.

"What is the name of the most highly paid pianist in any orchestra in Australia?" This is the same idea as the last query but involves two tables, both of which must be joined in both the main query and the subquery. The reasoning for this is the same as in the last query. The salary of the most highly paid pianist in Australia must be determined first in the subquery. Then that result must be used in the main query, where it must be compared only to the salaries of Australian pianists.

SELECT MUSNAME
FROM MUSICIAN, ORCHESTRA
WHERE MUSICIAN.ORCHNAME=ORCHESTRA.ORCHNAME
AND INSTRUMENT=‘Piano’
AND COUNTRY=‘Australia’
AND ANNSALARY=
    (SELECT MAX(ANNSALARY)
    FROM MUSICIAN, ORCHESTRA
    WHERE MUSICIAN.ORCHNAME=ORCHESTRA.ORCHNAME
    AND INSTRUMENT=‘Piano’
    AND COUNTRY=‘Australia’);

4.9 Interactive Online Practice: SQL Tutorial

Hi! If you are ready to roll up your sleeves and get your hands dirty with some more SQL practice, go to w3schools.com: Section: SQL Basics and follow the SQL Basic Tutorial. In each detailed chapter you will find helpful lessons on SQL queries, commands, and operators, as well as opportunities to practice and run your own SQL code.

4 Glossary

* in the SELECT clause-It indicates that all attributes of the selected row are to be retrieved.

AND operator-It displays a record only if all of the specified conditions are true.

AVG() function-It returns the average value of a numeric column.

BETWEEN operator-It allows you to specify a range of numeric values in a search.

Data definition-Operationalized with a data definition language (DDL), it involves instructing the DBMS software on what tables will be in the database, what attributes will be in the tables, which attributes will be indexed, and so forth.

Data management-There are two aspects of data management: data definition and data manipulation.

DISTINCT operator-It is used to eliminate duplicate rows in a query result.

IN operator-It allows you to specify a list of character strings to be included in a search.

JOIN clause-It is used to combine rows from more than one table, based on a common field between them.

LIKE operator-It allows you to specify partial character strings in a "wildcard" sense.

OR operator-It displays a record if either the first condition OR the second condition is true.

ORDER BY clause-It simply takes the results of a SQL query and orders them by one or more specified attributes.

SELECT command-The SQL command with which data retrieval is accomplished.

Structured Query Language (SQL)-A standard language for data management in relational databases; it incorporates both data definition and data manipulation features.

Subquery-A SELECT statement "nested" within another SELECT statement; the nested (inner) statement is executed first and its result is provided as input to the outer statement.


Quiz 4

1) Which operator allows you to retrieve a record if both the first condition and the second condition are true?

-AND

2) Which of the following functions is used to return the greatest value within a specified range of values?

-MAX

3) Which of the following operators allows you to specify a list of character strings to be included in a search?

-IN

4) Which operator allows you to retrieve a record if either of the first condition or the second condition is true?

-OR

5) Which of the following functions returns the number of records in a table?

-COUNT

6) Which of the following functions is used to return the least value within a specified range of values?

-MIN

7) Which function returns the total of all the values in a specified column?

-SUM

8) Which of the following operators allows you to specify partial character strings in a "wildcard" sense?

-LIKE

9) Which command allows you to retrieve the desired data from a database?

-SELECT

10) Which of the following functions returns the mean value of a numeric column?

-AVG

11) Which of the following in the SELECT clause indicates that all attributes of the selected row are to be retrieved?

-*

12) Which of the following is used to eliminate duplicate rows in a query result?

-DISTINCT

13) Which of the following operators allows you to specify a range of numeric values in a search?

-BETWEEN

14) Which of the following is used to sort the result-set by one or more columns?

-ORDER BY

Lesson 5

Normalization Completing this lesson will help you gain the following competency:

Competency 4017.1.5: Normalization The graduate demonstrates appropriate strategies to normalize data.

This topic highlights the following objectives:

   Explain what data is and how it is stored.
   Identify First Normal form.
   Identify normalization techniques.

5.1 The Data Normalization Process

Data normalization was the earliest formalized database design technique and at one time was the starting point for logical database design. Today, with the popularity of the Entity-Relationship model and other such diagramming tools and the ability to convert its diagrams to database structures, data normalization is used more as a check on database structures produced from E-R diagrams than as a full-scale database design technique. That's one of the reasons for learning about data normalization. Another reason is that the data normalization process is another way of demonstrating and learning about such important topics as data redundancy, foreign keys, and other ideas that are so central to a solid understanding of database management.

Data normalization is a methodology for organizing attributes into tables so that redundancy among the non-key attributes is eliminated. Each of the resultant tables deals with a single data focus, which is just another way of saying that each resultant table will describe a single entity type or a single many-to-many relationship. Furthermore, foreign keys will appear exactly where they are needed. In other words, the output of the data normalization process is a properly structured relational database.


Figure 5.1: The Lucky Rent-A-Car relational database Introduction to the Data Normalization Technique

The input required by the data normalization process has two parts. One is a list of all the attributes that must be incorporated into the database: that is, all of the attributes in all of the entities involved in the business environment under discussion plus all of the intersection data attributes in all of the many-to-many relationships between these entities. The other input, informally, is a list of all of the defining associations among the attributes. Formally, these defining associations are known as functional dependencies. And what are defining associations or functional dependencies? They are a means of expressing that the value of one particular attribute is associated with a specific single value of another attribute. If we know that one of these attributes has a particular value, then the other attribute must have some other value. For example, for a particular Salesperson Number, 137, there is exactly one Salesperson Name, Baker, associated with it. Why is this true? In this example, a Salesperson Number uniquely identifies a salesperson and, after all, a person can have only one name! And this is true for every person! Informally, we might say that Salesperson Number defines Salesperson Name. If I give you a Salesperson Number, you can give me back the one and only name that goes with it. (It's a little like the concept of independent and dependent variables in mathematics. Take a value of the independent variable, plug it into the formula and you get back the specific value of the dependent variable associated with that independent variable.) These defining associations are commonly written with a right-pointing arrow like this:

Salesperson Number → Salesperson Name

In the more formal terms of functional dependencies, Salesperson Number, in general the attribute on the left side, is referred to as the determinant. Why? Because its value determines the value of the attribute on the right side. Conversely, we also say that the attribute on the right is functionally dependent on the attribute on the left.
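Functional dependencies are statements about the business environment, but they can also be checked against existing data. As a small sketch using the abbreviated General Hardware names from Lesson 4, the following query returns a row only if some salesperson number is associated with more than one salesperson name, that is, only if the dependency Salesperson Number → Salesperson Name is violated:

SELECT SPNUM
FROM SALESPERSON
GROUP BY SPNUM
HAVING COUNT(DISTINCT SPNAME)>1;   -- an empty result is consistent with the stated dependency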

Data normalization is best explained with an example and this is a good place to start one. In order to demonstrate the main points of the data normalization process, we will modify part of the General Hardware Co. business environment and focus on the salesperson and product entities. Let's assume that salespersons are organized into departments and each department has a manager who is not herself a salesperson. Then the list of attributes we will consider is shown in Figure 5.2. The list of defining associations or functional dependencies is shown in Figure 5.3.


Figure 5.2: List of attributes for salespersons and products


Figure 5.3: List of defining associations (functional dependencies) for the attributes of salespersons and products

Notice a couple of fine points about the list of defining associations in Figure 5.3. The last association:

Salesperson Number, Product Number → Quantity

shows that the combination of two or more attributes may possibly define another attribute. That is, the combination of a particular Salesperson Number and a particular Product Number defines or specifies a particular Quantity. Put another way, in this business context, we know how many units of a particular product a particular salesperson has sold. Another point, which will be important in demonstrating one step of the data normalization process, is that Manager Name is defined, independently, by two different attributes: Salesperson Number and Department Number:

Salesperson Number → Manager Name
Department Number → Manager Name

Both these defining associations are true! If I identify a salesperson by his Salesperson Number, you can tell me who his manager is. Also, if I state a department number, you can tell me who the manager of the department is. How did we wind up with two different ways to define the same attribute? Very easily! It simply means that during the systems analysis process, both these equally true defining associations were discovered and noted. By the way, the fact that I know the department that a salesperson works in:

Salesperson Number → Department Number

(and that each of these two attributes independently define Manager Name) will also be an issue in the data normalization process. More about this later. Steps in the Data Normalization Process

The data normalization process is known as a "decomposition process." Basically, we are going to line up all the attributes that will be included in the relational database and start subdividing them into groups that will eventually form the database's tables. Thus, we are going to "decompose" the original list of all of the attributes into subgroups. To do this, we are going to step through a number of normal forms. First, we will demonstrate what unnormalized data looks like. After all, if data can exist in several different normal forms, then there should be the possibility that data is in none of the normal forms, too! Then we will basically work through the three main normal forms in order:

   First Normal Form
   Second Normal Form
   Third Normal Form

There are certain "exception conditions" that have also been described as normal forms. These include the Boyce-Codd Normal Form, Fourth Normal Form, and Fifth Normal Form. They are less common in practice and will not be covered here.

Here are three additional points to remember:

   Once the attributes are arranged in third normal form (and if none of the exception conditions are present), the group of tables that they comprise is, in fact, a well-structured relational database with no data redundancy.
   A group of tables is said to be in a particular normal form if every table in the group is in that normal form.
   The data normalization process is progressive. If a group of tables is in second normal form it is also in first normal form. If they are in third normal form they are also in second normal form.

Unnormalized Data

Figure 5.4 shows the salesperson and product-related attributes listed in Figure 5.2 arranged in a table with sample data. The salesperson and product data is taken from the General Hardware Co. relational database of Figure 4.25, with the addition of Department Number and Manager Name data. Note that salespersons 137, 204, and 361 are all in department number 73 and their manager is Scott. Salesperson 186 is in department number 59 and his manager is Lopez.


Figure 5.4: The salesperson and product attributes, unnormalized with sample data

The table in Figure 5.4 is unnormalized. The table has four records, one for each salesperson. But, since each salesperson has sold several products and there is only one record for each salesperson, several attributes of each record must have multiple values. For example, the record for salesperson 137 has three product numbers, 19440, 24013, and 26722, in its Product Number attribute, because salesperson 137 has sold all three of those products. Having such multivalued attributes is not permitted in first normal form, and so this table is unnormalized. First Normal Form

The table in Figure 5.5 is the first normal form representation of the data. The attributes under consideration have been listed out in one table and a primary key has been established. As the sample data of Figure 5.6 shows, the number of records has been increased (over the unnormalized representation) so that every attribute of every record has just one value. The multivalued attributes of Figure 5.4 have been eliminated. Indeed, the definition of first normal form is a table in which every attribute value is atomic, that is, no attribute is multivalued.


Figure 5.5: The salesperson and product attributes in first normal form


Figure 5.6: The salesperson and product attributes in first normal form with sample data

The combination of the Salesperson Number and Product Number attributes constitutes the primary key of this table. What makes this combination of attributes a legitimate primary key? First of all, the business context tells us that the combination of the two provides unique identifiers for the records of the table and that there is no single attribute that will do the job. That, of course, is how we have been approaching primary keys all along. Secondly, in terms of data normalization, according to the list of defining associations or functional dependencies of Figure 5.3, every attribute in the table is either part of the primary key or is defined by one or both attributes of the primary key. Salesperson Name, Commission Percentage, Year of Hire, Department Number, and Manager Name are each defined by Salesperson Number. Product Name and Unit Price are each defined by Product Number. Quantity is defined by the combination of Salesperson Number and Product Number.

Are these two different ways of approaching the primary key selection equivalent? Yes! If the combination of a particular Salesperson Number and a particular Product Number is unique, then it identifies exactly one record of the table. And, if it identifies exactly one record of the table, then that record shows the single value of each of the non-key attributes that is associated with the unique combination of the key attributes.

But that is the same thing as saying that each of the non-key attributes is defined by or is functionally dependent on the primary key! For example, consider the first record of the table in Figure 5.6:

Salesperson Number 137, Product Number 19440, Salesperson Name Baker, Commission Percentage 10, Year of Hire 1995, Department Number 73, Manager Name Scott, Product Name Hammer, Unit Price 17.50, Quantity 473

The combination of Salesperson Number 137 and Product Number 19440 is unique. There is only one record in the table that can have that combination of Salesperson Number and Product Number values. Therefore, if someone specifies those values, the only Salesperson Name that can be associated with them is Baker, the only Commission Percentage is 10, and so forth. But that has the same effect as the concept of functional dependency. Since Salesperson Name is functionally dependent on Salesperson Number, given a particular Salesperson Number, say 137, there can be only one Salesperson Name associated with it, Baker. Since Commission Percentage is functionally dependent on Salesperson Number, given a particular Salesperson Number, say 137, there can be only one Commission Percentage associated with it, 10. And so forth.

First normal form is merely a starting point in the normalization process. As can immediately be seen from Figure 5.6, there is a great deal of data redundancy in first normal form. There are three records involving salesperson 137 (the first three records) and so there are three places in which his name is listed as Baker, his commission percentage is listed as 10, and so on. Similarly, there are two records involving product 19440 (the first and fifth records) and this product's name is listed twice as Hammer and its unit price is listed twice as 17.50. Intuitively, the reason for this is that attributes of two different kinds of entities, salespersons and products, have been mixed together in one table. Second Normal Form

Since data normalization is a decomposition process, the next step will be to decompose the table of Figure 5.5 into smaller tables to eliminate some of its data redundancy. And, since we have established that at least some of the redundancy is due to mixing together attributes about salespersons and attributes about products, it seems reasonable to want to separate them out at this stage. Informally, what we are going to do is to look at each of the non-key attributes of the table in Figure 5.5 and, on the basis of the defining associations of Figure 5.3, decide which attributes of the key are really needed to define it. For example, Salesperson Name really only needs Salesperson Number to define it; it does not need Product Number. Product Name needs only Product Number to define it; it does not need Salesperson Number. Quantity indeed needs both attributes, according to the last defining association of Figure 5.3.

More formally, second normal form, which is what we are heading for, does not allow partial functional dependencies. That is, in a table in second normal form, every non-key attribute must be fully functionally dependent on the entire key of that table. In plain language, a non-key attribute cannot depend on only part of the key, in the way that Salesperson Name, Product Name, and most of the other non-key attributes of Figure 5.5 do.

Figure 5.7 shows the salesperson and product attributes arranged in second normal form. There is a SALESPERSON Table in which Salesperson Number is the sole primary key attribute. Every non-key attribute of the table is fully defined just by Salesperson Number, as can be verified in Figure 5.3. Similarly, the PRODUCT Table has Product Number as its sole primary key attribute and the non-key attributes of the table are dependent just on it. The QUANTITY Table has the combination of Salesperson Number and Product Number as its primary key because its non-key attribute, Quantity, requires both of them together to define it, as indicated in the last defining association of Figure 5.3.
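Although SQL data definition commands are not introduced until Section 5.3, a minimal sketch of how the second normal form structure of Figure 5.7 might be declared is shown here. The abbreviated column names and the data types are illustrative assumptions, not part of the figure:

CREATE TABLE SALESPERSON
(SPNUM CHAR(3) PRIMARY KEY,
SPNAME CHAR(12),
COMMPERCT DECIMAL(3,0),
YEARHIRE CHAR(4),
DEPTNUM CHAR(3),
MGRNAME CHAR(12));

CREATE TABLE PRODUCT
(PRODNUM CHAR(5) PRIMARY KEY,
PRODNAME CHAR(12),
UNITPRICE DECIMAL(6,2));

CREATE TABLE QUANTITY
(SPNUM CHAR(3),
PRODNUM CHAR(5),
QUANTITY INTEGER,
PRIMARY KEY (SPNUM, PRODNUM));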


Figure 5.7: The salesperson and product attributes in second normal form

Figure 5.8 shows the sample salesperson and product data arranged in the second normal form structure of Figure 5.7. Indeed, much of the data redundancy visible in Figure 5.6 has been eliminated. Now, only once is salesperson 137's name listed as Baker, his commission percentage listed as 10, and so forth. Only once is product 19440's name listed as Hammer and its unit price listed as 17.50.


Figure 5.8: The salesperson and product attributes in second normal form with sample data

Second normal form is thus a great improvement over first normal form. But, has all of the redundancy been eliminated? In general, that depends on the particular list of attributes and defining associations. It is possible, and in practice it is often the case, that second normal form is completely free of data redundancy. In such a case, the second normal form representation is identical to the third normal form representation.

A close look at the sample data of Figure 5.8 reveals that the second normal form structure of Figure 5.7 has not eliminated all the data redundancy. At the right-hand end of the SALESPERSON Table, the fact that Scott is the manager of department 73 is repeated three times and this certainly constitutes redundant data. How could this have happened? Aren't all the non-key attributes fully functionally dependent on Salesperson Number? They are, but that is not the nature of the problem. It's true that Salesperson Number defines both Department Number and Manager Name and that's reasonable. If I'm focusing in on a particular salesperson, I should know what department she is in and what her manager's name is. But, as indicated in the next-to-last defining association of Figure 5.3, one of those two attributes defines the other: given a department number, I can tell you who the manager of that department is. In the SALESPERSON Table, one of the non-key attributes, Department Number, defines another one of the non-key attributes, Manager Name. This is what is causing the problem.

Third Normal Form

In third normal form, non-key attributes are not allowed to define other non-key attributes. Stated more formally, third normal form does not allow transitive dependencies in which one non-key attribute is functionally dependent on another.

Again, there is one example of this in the second normal form representation in Figure 5.7. In the SALESPERSON table, Department Number and Manager Name are both non-key attributes and, as shown in the next-to-last association in Figure 5.3, Department Number defines Manager Name. Figure 5.9 shows the third normal form representation of the attributes. Note that the SALESPERSON Table of Figure 5.7 has been further decomposed into the SALESPERSON and DEPARTMENT Tables of Figure 5.9. The Department Number and Department Manager attributes, which were the problem, were split off to form the DEPARTMENT Table, but a copy of the Department Number attribute (the primary key attribute of the new DEPARTMENT Table) was left behind in the SALESPERSON Table. If this had not been done, there no longer would have been a way to indicate which department each salesperson is in.


Figure 5.9: The salesperson and product attributes in third normal form

The sample data for the third normal form structure of Figure 5.9 is shown in Figure 5.10. Now, the fact that Scott is the manager of department 73 is shown only once, in the second record of the DEPARTMENT Table. Notice that the Department Number attribute in the SALESPERSON Table continues to indicate which department a salesperson is in.
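As an informal check that the decomposition loses no information, a salesperson's manager can still be recovered by joining the two tables of Figure 5.9 on Department Number. This is only a sketch; the abbreviated column names (SPNUM, SPNAME, DEPTNUM, MGRNAME) are assumptions in the spirit of the SQL examples later in this lesson:

SELECT S.SPNUM, S.SPNAME, D.DEPTNUM, D.MGRNAME
FROM SALESPERSON S, DEPARTMENT D
WHERE S.DEPTNUM = D.DEPTNUM;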

There are several important points to note about the third normal form structure of Figure 5.9:

   It is completely free of data redundancy.
   All foreign keys appear where needed to logically tie together related tables.
   It is the same structure that would have been derived from a properly drawn entity-relationship diagram of the same business environment.

Finally, there is one exception to the rule that in third normal form, non-key attributes are not allowed to define other non-key attributes. The rule does not hold if the defining non-key attribute is a candidate key of the table. Let's say, just for the sake of argument here, that the Salesperson Name attribute is unique. That makes Salesperson Name a candidate key in Figure 5.9's SALESPERSON Table. But, if Salesperson Name is unique, then it must define Commission Percentage, Year of Hire, and Department Number just as the unique Salesperson Number attribute does. Since it was not chosen to be the primary key of the table, Salesperson Name is technically a non-key attribute that defines other non-key attributes. Yet it does not appear from the sample data of Figure 5.10 to be causing any data redundancy problems. Since it was a candidate key, its defining other non-key attributes is not a problem.


Figure 5.10: The salesperson and product attributes in third normal form with sample data

Example: General Hardware Co.

If the entire General Hardware Co. example, including the newly added Department Number and Manager Name attributes, were organized for the data normalization process, the list of defining associations or functional dependencies of Figure 5.3 would be expanded to look like Figure 5.11. Several additional interesting functional dependencies in this expanded list are worth pointing out. First, although Salesperson Number is a determinant, defining several other attributes, it is in turn functionally dependent on another attribute, Customer Number:

Customer Number → Salesperson Number

As we have already established, this functional dependency makes perfect sense. Given a particular customer, I can tell you who the salesperson is who is responsible for that customer. This is part of the one-to-many relationship between salespersons and customers. The fact that, in the reverse direction, a particular salesperson has several customers associated with him makes no difference in this functional dependency analysis. Also, the fact that Salesperson Number is itself a determinant, defining several other attributes, does not matter. Next:

Customer Number, Employee Number → Employee Name
Customer Number, Employee Number → Title

Remember that in the General Hardware business environment, employee numbers are unique only within a customer company. Thus, this functional dependency correctly shows that the combination of the Customer Number and Employee Number attributes is required to define the Employee Name and Title attributes.


Figure 5.11: List of defining associations (functional dependencies) for the attributes of the General Hardware Company example

Figure 5.12 shows the General Hardware Co. attributes, including the added Department Number and Manager Name attributes, arranged in first normal form. Moving to second normal form would produce essentially the database structure in Figure 5.13; the added Department Number and Manager Name attributes are then split out into a separate DEPARTMENT table in moving from second to third normal form, as previously shown.


Figure 5.12: The General Hardware Company attributes in first normal form


Figure 5.13: The General Hardware Company relational database

Example: Good Reading Bookstores

In the General Hardware Co. example, the reason that the table representing the many-to-many relationship between salespersons and products,

Salesperson Number | Product Number | Quantity


fell out so easily in the data normalization process was because of the presence of the functional dependency needed to define the intersection data attribute, Quantity:

Salesperson Number, Product Number → Quantity

A new twist in the Good Reading Bookstores example is the presence of the many-to-many relationship between the book and author entities with no intersection data. This is shown in the WRITING Table of Figure 5.14. The issue is how to show this in a functional dependencies list. There are a couple of possibilities. One is to show the two attributes defining "null":

Book Number, Author Number → null

The other is to show paired "multivalued dependencies" in which the attribute on the left determines a list of attribute values on the right, instead of the usual single attribute value on the right. A double-headed arrow is used for this purpose:

Book Number →→ Author Number
Author Number →→ Book Number

These literally say that given a book number, a list of authors of the book can be produced and that given an author number, a list of the books that an author has written or co-written can be produced. In either of the two possibilities shown, the null and the paired multivalued dependencies, the notation in the functional dependency list can be used as a signal to split the attributes off into a separate table in moving from first to second normal form.

The other interesting point in the Good Reading Bookstores example involves the many-to-many relationship of the SALE Table in Figure 5.14. Recall that Date and Price were intersection data attributes that, because of the requirements of the company, had to be part of the primary key of the table. This would be handled very simply and naturally with a functional dependency that looks like this:

Book Number, Customer Number, Date, Price → Quantity

The complete list of functional dependencies is shown in Figure 5.15. First normal form for the Good Reading Bookstores example would consist of the list of its attributes with the following attributes in the primary key:

   Publisher Name
   Author Number
   Book Number
   Customer Number
   Date.


Figure 5.14: The Good Reading Bookstores relational database


Figure 5.15: List of defining associations (functional dependencies) for the attributes of the Good Reading Bookstores example

Moving from first to second normal form, including incorporating the rule described above for the many-to-many relationship with no intersection data, would directly yield the tables of Figure 5.14. As there are no instances of a non-key attribute defining another non-key attribute, this arrangement is already in third normal form.

Example: World Music Association

The World Music Association example is straightforward in terms of data normalization. The complete list of functional dependencies is shown in Figure 5.16. Since degree is unique only within a musician and composition name is unique only within a composer, note that three of the functional dependencies are:

Musician Number, Degree → University
Musician Number, Degree → Year
Composition Name, Composer Name → Year

The primary key attributes in first normal form are:

   Orchestra Name
   Musician Number
   Degree
   Composer Name
   Composition Name


Figure 5.16: List of defining associations (functional dependencies) for the attributes of the World Music Association example

With this in mind, proceeding from first to second normal form will produce the tables in Figure 5.17. These are free of data redundancy and are, indeed, also in third normal form.


Figure 5.17: The World Music Association relational database

Example: Lucky Rent-A-Car

Figure 5.18 lists the Lucky Rent-A-Car functional dependencies. The primary key attributes in first normal form are:

   Manufacturer Name
   Car Serial Number
   Repair Number
   Customer Number
   Rental Date

Once again, the conversion from first to second normal form results in a redundancy-free structure, Figure 5.1, that is already in third normal form.


Figure 5.18: List of defining associations (functional dependencies) for the attributes of the Lucky Rent-A-Car example

Practice: Normal forms

5.2 Testing Tables Converted From E-R Diagrams With Data Normalization

As we said earlier, logical database design is generally performed today by converting entity-relationship diagrams to relational tables and then checking those tables against the data normalization technique rules. Since we already know that the databases in Figures 5.13, 5.14, 5.17, and 5.1 (for the four example business environments we've been working with) are in third normal form, there really isn't much to check. As one example, consider the General Hardware Co. database of Figure 5.13.

The basic idea in checking the structural worthiness of relational tables with the data normalization rules is to:

   Check to see if there are any partial functional dependencies. That is, check whether any non-key attributes are dependent on or are defined by only part of the table's primary key.
   Check to see if there are any transitive dependencies. That is, check whether any non-key attributes are dependent on or are defined by any other non-key attributes (other than candidate keys).

Both of these can be verified by the business environment's list of defining associations or functional dependencies.

In the SALESPERSON Table of Figure 5.13, there is only one attribute, Salesperson Number, in the primary key. Therefore, there cannot be any partial functional dependencies. By their very definition, partial functional dependencies require the presence of more than one attribute in the primary key, so that a non-key attribute can be dependent on only part of the key! As for transitive dependencies, are any non-key attributes determined by any other non-key attributes? No! And, even if Salesperson Name is assumed to be a unique attribute and therefore it defines Commission Percentage and Year of Hire, this would be an allowable exception because Salesperson Name, being unique, would be a candidate key. The same analysis can be made for the other General Hardware tables with single-attribute primary keys: the CUSTOMER, PRODUCT, and OFFICE tables of Figure 5.13.

Figure 5.13's CUSTOMER EMPLOYEE Table has a two-attribute primary key because Employee Number is unique only within a customer. But then, by the very same logic, the non-key attributes Employee Name and Title must be dependent on the entire key, because that is the only way to uniquely identify who we are talking about when we want to know a person's name or title. Analyzing this further, Employee Name cannot be dependent on Employee Number alone because it is not a unique attribute. Functional dependency requires uniqueness from the determining side. And, obviously, Employee Name cannot be dependent on Customer Number alone. A customer company has lots of employees, not just one. Therefore, Employee Name and Title must be dependent on the entire primary key and the rule about no partial functional dependencies is satisfied. Since the non-key attributes Employee Name and Title do not define each other, the rule about no transitive dependencies is also satisfied, and thus the table is clearly in third normal form.

In the SALES Table of Figure 5.13, there is a two-attribute primary key and only one non-key attribute. This table exists to represent the many-to-many relationship between salespersons and products. The non-key attributes, just Quantity in this case, constitute intersection data. By the definition of intersection data, these non-key attributes must be dependent on the entire primary key. In any case, there would be a line in the functional dependency list indicating that Quantity is dependent on the combination of the two key attributes. Thus, there are no partial functional dependencies in this table. Interestingly, since there is only one non-key attribute, transitive dependencies cannot exist. After all, there must be at least two non-key attributes in a table for one non-key attribute to be dependent on another.

5.3 Building The Data Structure With SQL

SQL has data definition commands that allow you to take the database structure you just learned how to design with the logical database design techniques and implement it for use with a relational DBMS. This process begins with the creation of "base tables." These are the actual physical tables in which the data will be stored on the disk. The command that creates base tables and tells the system what attributes will be in them is called the CREATE TABLE command. Using the CREATE TABLE command, you can also specify which attribute is the primary key. As an example, here is the command to create the General Hardware Company SALESPERSON table we have been working with, as shown in Figure 5.13. (Note that the syntax of these commands varies somewhat among the various relational DBMS products on the market. The commands shown in this chapter, which are based on the ORACLE DBMS, are designed to give you a general idea of the command structures. You should check the specific syntax required by the DBMS you are using.)

CREATE TABLE SALESPERSON
(SPNUM CHAR(3) PRIMARY KEY,
SPNAME CHAR(12),
COMMPERCT DECIMAL(3,0),
YEARHIRE CHAR(4),
OFFNUM CHAR(3));


Notice that the CREATE TABLE command names the table SALESPERSON and lists the attributes in it (with abbreviated attribute names that we have created for brevity). Each attribute is given an attribute type and length. So SPNUM, the Salesperson Number, is specified as CHAR(3). It is three characters long (yes, it's a number, but it's not subject to calculations so it's more convenient to specify it as a character attribute). On the other hand, COMMPERCT, the Commission Percentage, is specified as DECIMAL(3,0), meaning that it is a three-position number with no decimal positions. Thus it could be a whole number from 0–999, although we know that it will always be a whole number from 0–100 since it represents a commission percentage. Finally, the command indicates that SPNUM will be the primary key of the table.
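The same command can also declare a table whose primary key is the combination of two attributes, such as the SALES table of Figure 5.13. The following is only a sketch: the abbreviated column name PRODNUM and the FOREIGN KEY clauses are illustrative assumptions, and they presume that SALESPERSON and PRODUCT base tables with matching primary keys have already been created.

CREATE TABLE SALES
(SPNUM CHAR(3),
PRODNUM CHAR(5),
QUANTITY INTEGER,
PRIMARY KEY (SPNUM, PRODNUM),
FOREIGN KEY (SPNUM) REFERENCES SALESPERSON (SPNUM),
FOREIGN KEY (PRODNUM) REFERENCES PRODUCT (PRODNUM));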

If a table in the database has to be discarded, the command is the DROP TABLE command.

DROP TABLE SALESPERSON;


A logical view (sometimes just called a "view") is derived from one or more base tables. A view may consist of a subset of the columns of a single table, a subset of the rows of a single table, or both. It can also be the join of two or more base tables. The creation of a view in SQL does not entail the physical duplication of data in a base table into a new table. Instead, the view is a mapping onto the base table(s). It's literally a "view" of some part of the physical, stored data. Views are built using the CREATE VIEW command. Within this command, you specify the base table(s) on which the view is to be based and the attributes and rows of the table(s) that are to be included in the view. Interestingly, these specifications are made within the CREATE VIEW command using the SELECT statement, which is also used for data retrieval.

For example, to give someone access to only the Salesperson Number, Salesperson Name, and Year of Hire attributes of the SALESPERSON table, you would specify:

CREATE VIEW EMPLOYEE AS SELECT SPNUM, SPNAME, YEARHIRE FROM SALESPERSON;


The name of the view is EMPLOYEE, which can then be used in other SQL commands as if it were a table name. People using EMPLOYEE as a table name would have access to the Salesperson Number, Salesperson Name, and Year of Hire attributes of the SALESPERSON table but would not have access to the Commission Percentage or Office Number attributes (in fact, they would not even know that these two attributes exist!).
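Once created, a view is queried just as if it were a table. For example, a hypothetical retrieval against the EMPLOYEE view (the year value is made up for illustration) might be:

SELECT SPNUM, SPNAME
FROM EMPLOYEE
WHERE YEARHIRE = '1995';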

Views can be discarded using the DROP VIEW command:

DROP VIEW EMPLOYEE;

Practice: Building the data structure with SQL

5.4 Manipulating The Data With SQL

Once the tables have been created, the focus changes to the standard data manipulation operations of updating existing data, inserting new rows in tables, and deleting existing rows in tables. The commands are UPDATE, INSERT, and DELETE. In the UPDATE command, you have to identify which row(s) of a table are to be updated based on data values within those rows. Then you have to specify which columns are to be updated and what the new data values of those columns in those rows will be. For example, consider the SALESPERSON table in Figure 5.10. If salesperson 204's commission percentage has to be changed from the current 10 percent to 12 percent, the command would be:

UPDATE SALESPERSON
SET COMMPERCT = 12
WHERE SPNUM = '204';


Notice that the command first specifies the table to be updated in the UPDATE clause, then specifies the new data in the SET clause, then specifies the affected row(s) in the WHERE clause.

In the INSERT command, you have to specify a row of data to enter into a table. To add a new salesperson into the SALESPERSON table, whose salesperson number is 489, name is Quinlan, commission percentage is 15, year of hire is 2011, and office number is 59, the command would be:

INSERT INTO SALESPERSON VALUES ('489','Quinlan',15,'2011','59');


In the DELETE command you have to specify which row(s) of a table are to be deleted based on data values within those rows. To delete the row for salesperson 186 the command would be:

DELETE FROM SALESPERSON
WHERE SPNUM = '186';

Practice: Manipulating the data with SQL

5.5 Optional: Minicases


   Super Baseball League. The Super Baseball League wants to keep track of information about its players, its teams, and the minor league teams (which we will call minor league "clubs" to avoid using the word "team" twice). Minor league clubs are not part of the Super Baseball League but players train in them with the hope of eventually advancing to a team in the Super Baseball League. The intent in this problem is to keep track only of the current team on which a player plays in the Super Baseball League. However, the minor league club data must be historic and include all of the minor league clubs for which a player has played. Team names, minor league club names, manager names, and stadium names are assumed to be unique, as, of course, is player number.
   Design a well-structured relational database for this Super Baseball League environment using the data normalization technique. Progress from first to second normal form and then from second to third normal form, justifying your design decisions at each step based on the rules of data normalization. The attributes and functional dependencies in this environment are as follows:
   Attributes:
       Player Number
       Player Name
       Player Age
       Team Name
       Manager Name
       Stadium Name
       Minor League Club Name
       Minor League Club City
       Minor League Club Owner
       Minor League Club Year Founded
       Start Date
       End Date
       Batting Average
   Functional Dependencies:
       Player Number → Player Name
       Player Number → Player Age
       Player Number → Team Name
       Player Number → Manager Name
       Player Number → Stadium Name
       Minor League Club Name → Minor League Club City
       Minor League Club Name → Minor League Club Owner
       Minor League Club Name → Minor League Club Year Founded
       Team Name → Manager Name
       Team Name → Stadium Name
       Player Number, Minor League Club Name → Start Date, End Date, Batting Average

TEAM Relation: Team Number, Team Name, City, Manager
COACH Relation: Team Number, Coach Name, Coach Telephone
WORK EXPERIENCE Relation: Team Number, Coach Name, Experience Type, Years Of Experience
BATS Relation: Team Number, Serial Number, Manufacturer
PLAYER Relation: Player Number, Player Name, Player Age
AFFILIATION Relation: Player Number, Team Number, Years, Batting Average


5.6 Summary

Logical database design is the process of creating a database structure that is free of data redundancy and that promotes data integration. There are two techniques for logical database design. One technique involves taking the entity-relationship diagram that describes the business environment and going through a series of steps to convert it to a well-structured relational database structure. The other technique is the data normalization technique. Furthermore, the data normalization technique can be used to check the results of the E-R diagram conversion for errors.

SQL is both a data definition language and a data manipulation language. Included in the basic data definition commands are CREATE TABLE, DROP TABLE, CREATE VIEW, and DROP VIEW. Included in the basic data manipulation commands are UPDATE, INSERT, and DELETE.

5 Glossary

CREATE TABLE command-The command that creates base tables and tells the system what attributes will be in them is called the CREATE TABLE command.

CREATE VIEW command-Views are built using the CREATE VIEW command. Within this command, you specify the base table(s) on which the view is to be based and the attributes and rows of the table(s) that are to be included in the view.

DELETE command-In the DELETE command, you have to specify which row(s) of a table are to be deleted based on data values within those rows.

DROP TABLE command-It is used to discard a table from a database.

DROP VIEW command-Views can be discarded using the DROP VIEW command.

First normal form-First normal form exists when all the columns in a table are atomic, i.e., only a single value is allowed in each column.

INSERT command-It is used to add a new record in a table.

Normalization-It is the process of organizing the fields and tables of a relational database to minimize redundancy (duplication) and dependency.

Second normal form-Second normal form (2NF) is the second step in normalizing a database. 2NF builds on the first normal form (1NF). In second normal form, all non-key attributes must be fully functionally dependent on the entire key of that table.

Third normal form-In third normal form, non-key attributes are not allowed to define other non-key attributes.

UPDATE command-In the UPDATE command, you have to identify which row(s) of a table are to be updated based on data values within those rows. Then you have to specify which columns are to be updated and what the new data values of those columns in those rows will be.

Quiz

1) Which of the following syntaxes is used to create a view?

-CREATE VIEW view_name AS
-SELECT column_name(s)
-FROM table_name
-WHERE condition 

2) What is the correct syntax of the INSERT command?

-INSERT INTO table_name VALUES (value1,value2,value3,...); 

3) Which of the following is used to organize attributes into tables so that redundancy among the non-key attributes is eliminated?

-Data normalization 

4) Which of the following queries is NOT syntactically correct?

-DROP INTO table_name 
-VALUES (value1,value2,value3,...); 

5) What is the functionality of the DROP TABLE command?

-Discard a table from a database. 

6) In which of the following normal forms should every non-key attribute be fully functionally dependent on the entire key of a table?

-Second

7) Which of the following commands can be used to discard a view?

-DROP VIEW

8) What is the correct syntax of the CREATE TABLE command?

-CREATE TABLE table_name(
-column_name1 data_type(size),
-column_name2 data_type(size),...); 

9) In which of the following normal forms are non-key attributes NOT allowed to define other non-key attributes?

-Third

10) Which of the following statements is true about the third normal form structure?

-It is completely free of data redundancy.

11) Which of the following is called a decomposition process?

-Data Normalization


Lesson 6

Business Intelligence Completing this lesson will help you gain the following competency:

Competency 4017.1.6: Business Intelligence The graduate demonstrates an understanding of the concepts involved in business intelligence and analytical processing.

Your learning objectives for this lesson are listed below:

   Define Business Intelligence.
   Explain the concepts of data modeling and analytics in a Business Intelligence context.
   Explain the concept of Extraction/Transformation/Loading (ETL).
   Identify the six basic data mining activities.

6.1 Why a Business Intelligence Program?

What drives the desire for instituting a BI program? And more to the point, what are the primary drivers for business analytics and how can the benefits be effectively communicated to the important organizational stakeholders?

A straightforward approach for considering the value of a BI program looks at business processes, their source of derived or generated value, performance measures, and where the absence of knowledge impedes the complete achievement of business success. By categorizing dimensions of business value, we can then evaluate the ways that BI can contribute to increased performance along those dimensions. A high-level overview suggests at least these dimensions of value:

   Financial value associated with increased profitability, whether derived from lowered costs or increased revenues;
   Productivity value associated with increased throughput with decreased workloads, diminished time for executing end-to-end processes (such as manufacturing or operational workflows), and increasing the percentage of high quality products or outcomes;
   Trust value, such as greater customer, employee, or supplier satisfaction, as well as increasing confidence in forecasting, maintaining consistent operational and management reports, reductions in time spent in "analysis paralysis," and better results from decisions; and
   Risk value associated with improved visibility into credit exposure, confidence in capital and asset investments, and auditable compliance with jurisdictional and industry standards and regulations.

Of course, there are many other benefits to building a BI practice within an organization. Some benefits are focused on the ability to answer what might be considered the most basic questions about how a company does business. For example, it is surprising how few senior managers within a company can answer simple questions about their business, such as:

   How many employees do you have?
   For each product, how many were sold over the last 12 months within each geographic region?
   Who are your 10 best customers?
   What is the value of any particular customer?
   Who are your 20 best suppliers?

What is even more interesting is that in some organizations, not only can we not answer these questions, there may not even be a framework in which someone can ask them. There is a critical point to be made here: starting a well-conceived and comprehensive BI program that engages both the technical practitioners and the business consumers will not just provide the physical tools for answering these kinds of questions. More importantly, it should be a catalyst for changing the way team members think about doing business and about how information is critical for closing the loop: taking action and measuring the outcomes of those actions.

For example, before we can determine who the 20 best customers are, we must be able to articulate the difference between a "good" and a "bad" customer, identify a collection of metrics used to measure goodness, determine what data sets need to be collected for measurement, establish and integrate the methods for collecting and aggregating the data used for measuring, establish the processes required for conducting the measurement, ensure the quality of that data so as not to draw faulty conclusions, package the results into a reasonable report, and find a method to quickly and effectively disseminate the results of this process.

Although a lot of this process may be automated using off-the-shelf technology, the most important part (i.e., asking the right question) needs input from individuals with expertise and a stake in the result. It again reinforces the need for supporting the growing utility of multipurposed data. Correspondingly, there is a growing community of individuals willing to consider data as a corporate asset that can be manipulated in different ways to derive benefit along the key value drivers.

6.2 Taking Advantage of the Information Asset

Anybody involved in the BI process is concerned about the ability to take advantage of information in a way that can improve the way the organization operates. Yet the ultimate goal of BI is powered by the ability to manage access and availability of necessary information to assess business needs, identify candidate data sources, and effectively manage the flow of information into a framework suited for reporting and analysis needs.

Although a significant amount of money has been invested in attempts at building and launching BI frameworks and applications, most of that money has been spent on infrastructure, whereas very little has been invested in managing and exploiting the company's data as a valuable corporate asset. And in some arenas, the concept of what constitutes "business intelligence" is so poorly defined that the business users' expectations are set based on what they are told by their incumbent contractors or the continuous parade of tool vendors.

Getting the most advantage from a data analysis means evolving good business processes in lock-step with establishing good practices for data management. The ability to design and build analytical platforms must dovetail with assessing business process performance expectations, determining the information requirements to address business needs, and establishing clear success criteria to ensure that the actual implementation meets business user expectations.

On the other hand, there are a number of organizations that have started to view their data as a corporate asset and to realize that properly collecting, aggregating, and analyzing their data opens an opportunity to discover bits of knowledge that can both improve operational processing and provide better insight into customer profiles and behavior. In these environments, there is a clear agreement on the management, governance, and technical disciplines designed to exploit actionable knowledge discovered from the company's information asset that constitute the BI program.

One of the objectives of this book is to discuss the fundamental concepts for taking advantage of the information asset. That merges the information management capabilities, the technical descriptions, how each is manifested in the business environment, along with some of the aspects involved in managing the technical, political, and personal issues that can pose a challenge to a successful implementation.

6.3 Business Intelligence and Program Success

The fact that you are reading this book implies that you are somehow involved in some aspect of BI. You may be just curious and looking to learn more, or you may be actively involved in some phase of a BI activity: the discovery phase, justification, analysis of requirements for design, creation, management, maintenance, or development of a BI program. And it may also be likely that somewhere within your corporate senior management hierarchy, there are individuals who have been convinced of the value of starting a BI program. Unfortunately, the disparate perceptions of what "business intelligence" means and how (or perhaps even "if") the knowledge derived through BI is to be employed can contribute to delays, stalls, and in some cases, overall program failure.

As with any technology investment, when we look at organizations that have started implementing reporting engines, developing data warehouses, or have purchased large-scale data mining software suites without any program management, change agents, or business goals, we see high expectations and many disappointments related to the failure in the way that data warehouse projects are conceived, designed, architected, managed, and implemented, for any, if not all, of these reasons:

   The amorphous understanding of what BI methods and products could do resulted in an absence of a proper perception of the value proposition on behalf of the business sponsor.
   The absence of clear measures of success masked the value of specific milestones and deliverables.
   A communications gap between the implementers and the end users prevented the integration of information requirements into the system development life cycle.
   The scope of the project was not fully understood, causing delays in delivering to the business sponsor.
   Insufficient technical training prevented developers from getting software products to do what the vendors said they do.
   Attempting to incorporate many inconsistent data sources failed because of variance in formats, structures, and semantics.
   Poor understanding of technology infrastructure led to poor planning and scheduling.
   Business users were unable to trust results due to poor data quality.
   The lack of a clear statement of success criteria, along with a lack of ways to measure program success, led to a perception of failure.

The goal of this book is to provide a high-level overview of the technical (and some political) concepts for which a savvy manager must have awareness when involved in a BI or information exploitation project in order to make that project successful. The material is intended to cast interesting technology in an operational business framework while providing the introductory technical background and highlighting important topics such as:

   Management issues
   Managing change
   Technical issues
   Performance issues
   Complexity

This book will describe the basic architectural components of a BI environment, beginning with traditional topics, such as business process modeling and data modeling, and moving on to more modern topics, such as business rule systems, data profiling, information compliance and data quality, data warehousing, and data mining. My hope is that this will be a valuable introduction to the technology, management issues, and terminology of the BI industry. But first, let's settle on a definition of business intelligence.

6.4 Business Intelligence Defined

The Data Warehousing Institute, a provider of education and training in the data warehouse and BI industry, defines business intelligence as:

The processes, technologies, and tools needed to turn data into information, information into knowledge, and knowledge into plans that drive profitable business action. Business intelligence encompasses data warehousing, business analytic tools, and content/knowledge management.

This is a great working definition, especially because it completely captures the idea that there is a hierarchy imposed on the different scopes of intelligence. In addition, this definition also exposes two critical notions:

   A BI program encompasses more than just a collection of software products and visualization tools. The value of BI comes from the processes for delivering actionable knowledge to the end users, the processes for acting upon that knowledge, and the right people willing to take action. This means that without the processes and the right people, the tools are of little value.
   The value of BI is realized in the context of profitable business action. This means that if knowledge that can be used for profitable action is ignored, the practice is of little value.

Unfortunately, the words data and information are frequently used interchangeably. At the risk of clashing with any individual's understanding of the terms data, information, and knowledge, for the purposes of this book we will use these conceptual definitions:

   Data is a collection of raw value elements or facts used for calculating, reasoning, or measuring. Data may be collected, stored, or processed but not put into a context from which any meaning can be inferred.
   Information is the result of collecting and organizing data in a way that establishes relationships between data items, which thereby provides context and meaning.
   Knowledge is the concept of understanding information based on recognized patterns in a way that provides insight to information.

Turning Data Into Information

The process of turning data into information can be summarized as the process of determining what data is to be collected and managed and in what context. A good example is the process of designing a database that models a real-world set of entities, such as parties, which is a frequently used term that refers to people and organizations, along with the roles taken on by those parties.

Out of context, the individual bits of data, such as names and birth dates, are of little value. Having established which bits of data are to be used to configure a description of a party, as well as creating instances and populating those instances with the related data values, we have created the context for those pieces of data and turned them into a piece of information.

We might say that this aspect of BI involves the infrastructure of managing and presenting data, which incorporates the hardware platforms, relational or other types of database systems, and associated software tools. This aspect also incorporates query and reporting tools that provide access to the data. Last, this part of the process cannot be done without experts in the area of data management integrating and coordinating this technology.

The Flash Of Insight: Turning Information Into Knowledge

Sometimes it happens when you wake up in the middle of the night or perhaps while you are stuck in traffic or even while you're daydreaming in the shower. I am referring to that flash of insight that almost magically appears that provides you with the answer to that particularly nasty problem over which you've been agonizing. I like to compare the concept of turning information into knowledge to that flash of insight. We accumulate piles of information, which are then analyzed in many different ways until some critical bits of knowledge are created. What makes that knowledge critical is that it can be used to form a plan of action for solving some business problem.

We can say that this aspect of BI involves the analytical components, such as online analytical processing (OLAP), data quality, data profiling, business rule analysis, predictive analysis, and other types of data mining. Again, acquiring tools to perform these functions is of little value in the absence of professionals who understand how to use them and how to get the right kinds of results.

Turning Knowledge Into Actionable Plans

This last aspect is probably the most important, because it is in this context that any real value is derived. If you are using BI for micromarketing, finding the right customer for your product is irrelevant if you do not have a plan to contact that customer. If you are using BI for fraud detection, finding the pattern of fraud is of little value if your organization does not do something to prevent that fraudulent behavior. Being able to take action based on the intelligence that we have learned is the key point of any BI strategy. It is through these actions that a senior management sponsor can see the true return on investment for his or her information technology (IT) spending. A BI program provides benefits that increase business efficiency, increase sales, provide better customer targeting, reduce customer service costs, identify fraud, and generally increase profits while reducing costs. Because of this, we might say that when implemented properly, BI is one IT area that can be a profit center instead of the traditional cost center.

6.5 Actionable Intelligence

That last point is so critical that it is worth mentioning a second time: Discovered knowledge is of little value if there is no value-producing action that can be taken as a consequence of gaining that knowledge. This means that the business-technology partnership must work together not just to act on discovered intelligence but to do so in a timely fashion so as to derive the greatest benefit. This reinforces the sentiment that BI can succeed as the result of cooperation between technical developers and their business clients.

6.6 Data Modeling and Analytics

Practically all modern business applications that employ a data subsystem represent their data sets using data models. A data model is a discrete structured data representation of a real-world set of entities related to one another. Each entity (most often represented using a table) carries a set of characteristic attributes (described as data elements). Yet over time, our understanding of the ways that real-world objects and events are captured within a structured representation must adapt to the context in which the data sets are used and the ways the information satisfies the needs of the business processes.

Transaction processing systems need to have visibility into all aspects of a limited number of data instances at a time (such as a single customer buying a particular product). Analytical systems may need to support rapid aggregation to quickly respond to ad hoc queries relating to many data instances (such as "how many customers bought this specific product?"). In other words, there is bound to be a significant difference between how we use data in an operational/tactical manner (i.e., to "run the business") and the ways we use data in a strategic manner (i.e., to "improve the business").

The traditional modeling technique for operational systems revolves around the entity-relationship model. Unfortunately, reporting and analytical applications are generally less well-suited to utilizing data structured in the entity-relational form (although there are emerging environments that make this less true). The alternative is to restructure transaction or event data using what is called "dimensional modeling," which is better organized to provide rapid responses to different types of queries and analyses.

Transaction Processing And Data Modeling

It is valuable to consider the history of database modeling to understand the drivers for creating a segregated system and data environment used for analytical purposes. Early application systems focused solely on execution of the specific transactions.

All the aspects of a data object or transaction (such as a bank account or a store purchase) would likely have been stored in a single entry in a data file, with all the aspects of the data instance embedded within a single row in that file. Eventually, those files evolved into database tables, and each row became known as a record. To illustrate the simplicity of that model, a sales record might capture the buyer's name, address, the time of the transaction, and then a list (perhaps separated by commas) of the items and quantities of each product that was purchased (see Table 6.1).

In terms of running the business, this may have been sufficient, but it was difficult to manage. The transactions did not execute in real time, but rather were collected up and run in batch at a particular time of day. Reporting on these transactions was more of a challenge. Application code was written to handle any data manipulation necessary to extract any kind of information from these systems. Additionally, because the buyer information was collected for each transaction, there was a significant amount of repeated and redundant data being unnecessarily stored, with the possibility of many errors creeping into the system.

Table 6.1 A simple sales record format

Name | AccountNum | Address | City | State | Order
David Loshin | 018776 | 123 Main Street | Springfield Heights | NY | 1 sprocket 10-X12, 3 widgets 10-Y39, 1 Vertical Wedge 11-8773, 2 Monc. 12-Y6554
James Banding | 021745 | 84 Causington Way | Springfield | NY | 4 5/8 widgets 10-Y33, 1 Horizontal Splunge 11-H6473, 1 cantiv. 19-K754, 2 sprocket 10-X12
SprockCorp | 014145 | 10244 Washington Hwy | Springfield | NY | 42 sprocket 10-X12, 42 sprocket holder 10-X12
Shelbyville Engineering, Inc | 013189 | 1477 Shelbyville Tpk | Shelbyville | NY | 13 7/8 widgets 10-Y34, 1 Diag. Corker 17-D1273, 11 cantiv. 19-K754, 2 sprocket holder 10-X12a
Roger Simmons | 016290 | 1022 Elm St. | Springfield Hghts | NY | 12 Widget chains 10-Y72, 4 3/4 glod. 17-G511, 10 widget 10-Y39
Dave Lotion | 018777 | 123 Main Street | Springfield Hghts | NY | 2 sprocket 10-X12, 1 widget 10-Y39, 5 Vertical Wedge 11-8773, 2 3/4 glod. 17-G511

In the early 1980s, a number of practitioners and researchers (most notably, E.F. Codd and Chris Date) explored the concept of a relational database, in which the way that information was modeled was viewed in the context of representing entities within separate tables and relating those entities within a business process context between tables using some form of cross-table linkage. In our (simplified) relational view of our sales database, we would have broken up our table into a customer table, a product table, a sales table, and a sales detail table, where each record in the sales table would represent a purchase transaction by a related customer and each item and quantity purchased would be in a related record in the sales detail (see Figure 6.1).

Using an entity-relationship modeling approach simplifies the development of transaction processing applications. Corresponding business processes can easily map to a sequence of operations to be executed on a collection of related tables, and then grouped together as a single unit of work. The result of executing the group of operations is to reflect the effects of the business transaction inside the data model. Another essential goal of the relational model is the identification and elimination of redundancy within a database. This process, called normalization, analyzes tables to find instances of replicated data within one table (such as the names and addresses in our old sales table) that can be extracted into a separate table that can be linked relationally through a foreign key.
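To make that decomposition concrete, the four tables might be declared roughly as follows. This is only a sketch: the table names, column names, and data types are assumptions made for illustration, not definitions taken from Figure 6.1.

CREATE TABLE CUSTOMER
(ACCOUNTNUM CHAR(6) PRIMARY KEY,
NAME VARCHAR(40),
ADDRESS VARCHAR(60),
CITY VARCHAR(30),
STATE CHAR(2));

CREATE TABLE PRODUCT
(PRODUCTID CHAR(8) PRIMARY KEY,
DESCRIPTION VARCHAR(40));

CREATE TABLE SALES
(SALESID INTEGER PRIMARY KEY,
ACCOUNTNUM CHAR(6) REFERENCES CUSTOMER (ACCOUNTNUM),
SALESTIME TIMESTAMP);

CREATE TABLE SALES_DETAIL
(SALESID INTEGER REFERENCES SALES (SALESID),
PRODUCTID CHAR(8) REFERENCES PRODUCT (PRODUCTID),
QUANTITY INTEGER,
PRIMARY KEY (SALESID, PRODUCTID));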

Although the entity-relationship model significantly helps in the design of operational systems, the fragmentation of the information across the relational entities is, to some extent, confusing (consider the hallway-length entity-relationship diagrams that decorate the Database Administration [DBA] department's walls). In addition, the kinds of analytical extractions that are useful for BI applications are constrained by the representation of data in the pure relational model, turning what we might think would be an intuitive extraction into a set of poorly performing queries.


Figure 6.1 Sample entity-relationship diagram for the sales database

Dimensional Models

The challenge in using the standard entity-relationship model for reporting and analysis lies in the interconnectedness of the entities and the corresponding complexity in accumulating the information necessary for hierarchical aggregations. The alternative dimensional modeling technique captures the basic unit of representation as a single multikeyed entry in a slender fact table, with each key exploiting the relational model to refer to the different dimensions associated with those facts.

A maintained table of facts, each of which is related to a set of dimensions, is a much more efficient representation for data in a data warehouse and allows for information to be represented in a way that is more suitable to high-performance access. This is because aggregations and extractions of data subject to particular dimensional constraints can be created quickly and efficiently.

The representation of a dimensional model is straightforward in that each row in the fact table represents a unique observable transaction or event within a specific business context. For example, a sales transaction would be represented as a specific fact record capturing all details of the event:
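The original figure illustrating such a fact record is not reproduced here, but a rough sketch of the fact table it implies might look like the following. The key and measure names are hypothetical, and the six dimension keys anticipate the Time, Customer, Item, Promotion, Location, and Clerk dimensions discussed below:

CREATE TABLE SALES_FACT
(TIME_KEY INTEGER,
CUSTOMER_KEY INTEGER,
ITEM_KEY INTEGER,
PROMOTION_KEY INTEGER,
LOCATION_KEY INTEGER,
CLERK_KEY INTEGER,
QUANTITY INTEGER,
UNIT_PRICE DECIMAL(8,2),
PRIMARY KEY (TIME_KEY, CUSTOMER_KEY, ITEM_KEY, PROMOTION_KEY, LOCATION_KEY, CLERK_KEY));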


This representation captures both entity data and quantifiable data. The entity data items, such as customer or location, are not the actual values but instead are references (or foreign keys) to the dimension tables. The quantifiable items (such as quantity or unit price) are specific pieces of information relevant to the fact and are captured in the fact record. This data is typically numeric so that it is amenable to aggregate functions (sum, max, min, etc.). Each fact represents the total quantity of a product sold to a specific customer at a particular point-of-sale location at a particular point in time. A rough visualization of this sales fact table model is shown in Figure 6.2.

This model contains six dimensions: Time, Customer, Item, Promotion, Location, and Clerk. When you look at the picture of the model in Figure 6.2, you can see that the relationships between the fact table and the dimensions resemble a star, which is why this model layout is referred to as a star schema.

This model easily adapts to support aggregation by essentially sorting and accumulating within the set of dimensions under investigation. So for example, to derive information about the sales of any particular item by sales location, you would sort the fact table records by sales location and then by product. At that point you could group the total counts and amounts by product within each sales location. With the right set of indexes introduced, reporting these aggregates is simply done using SQL queries. The fact table is related to dimensions in a star schema. Each entry in a dimension represents a description of the individual entities within that dimension.
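Returning to the aggregation example above (sales of an item by location), a sketch of the corresponding query, using the hypothetical SALES_FACT table and dimension tables from the earlier sketch (the dimension column names here are likewise assumptions), might be:

SELECT L.LOCATION_NAME, I.ITEM_CATEGORY, SUM(F.QUANTITY) AS TOTAL_QUANTITY
FROM SALES_FACT F, LOCATION_DIM L, ITEM_DIM I
WHERE F.LOCATION_KEY = L.LOCATION_KEY
AND F.ITEM_KEY = I.ITEM_KEY
GROUP BY L.LOCATION_NAME, I.ITEM_CATEGORY;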


Figure 6.2 An example star schema for dimensional analysis.

For example, in our sales example, the item dimension contains information associated with each item that could be sold, with descriptions that describe the attributes and characteristics of the item, such as item category, description, manufacturer, SKU number, size, unit of measure, package type, and package size, among a myriad of other relevant pieces of reference information.


Figure 6.3: Good Reading Bookstores data warehouse star schema design with snowflake feature

The design in Figure 6.3 also has a feature that makes it a "snowflake" design: one of the dimension tables, BOOK, leads to yet another dimension table, PUBLISHER. Consistent with the rest of the star schema, the snowflake relationship is one-to-many, "inward" towards the center of the star. A publisher publishes many books but a book is associated with only one publisher.

Using the Dimensional Model for Business Intelligence

This dimensional model has become the de facto standard for representing and managing data in a data warehouse, for a number of reasons, such as:

   Simplicity. There is a certain elegance in the predictability of the model, since it simplifies the process of satisfying requests for reporting via the variety of knowledge delivery and reporting tools. In fact, there is a generic process for extracting information that relies on the star schema: Create a join between the fact table and the desired dimensions and then group by dimension. By virtue of the key relationship between the fact table and the dimensions, this join is basically a single pass through the fact table.
   Lack of bias. Even with multiple dimensions, there is no inherent bias lent to any individual dimension. This means that as data consumers change their activity or behavior associated with the kinds of analyses or reports they desire, no specific action need be taken to rebalance the data to improve performance. In other words, the performance characteristics are not related to the data layout.
   Extensibility. Because the dimensional model is easily modified, changes to the model can be handled gracefully without disrupting the operation of the data warehouse. For example, adding new values to a dimension (such as adding a new customer or a new item) involves simply adding new rows to that dimension. This can be aligned with a master data registry or repository, either of which enables greater consistency across systems. Adding a new dimension is done by creating the new dimension table and modifying the key values in the fact table to incorporate the references back to the new dimension. Adding new attributes to dimension values is done by altering the tables and adding the new attribute values (see the sketch after this list).
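For instance, using the hypothetical ITEM_DIM dimension table from the earlier sketches, adding a new descriptive attribute and a new dimension row might look like this (all names and values are made up for illustration):

-- add a new descriptive attribute to the item dimension
ALTER TABLE ITEM_DIM ADD PACKAGE_TYPE VARCHAR(20);

-- add a new dimension value (a new item) as a new row
INSERT INTO ITEM_DIM (ITEM_KEY, DESCRIPTION, ITEM_CATEGORY, PACKAGE_TYPE)
VALUES (10233, 'sprocket 10-X12', 'Sprockets', 'Box');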

Dimensional modeling is not limited to customer and sales data. The same approach can be taken to model any collection of transactions or events, such as transportation and logistics, telecommunications transactions (call detail data), insurance claims, web log transactions, social media postings, nonprofit donations, medical encounter data, and so on.

There are variations on the star schema that involve breaking out additional dimension information associated with a preexisting dimension, which allows for additional hierarchical grouping. An example would be customer categories applied at the highest level, with the individual customers enumerated within a subdimension of that category dimension. In any event, the general star schema is a powerful representational abstraction that is ubiquitous in building models for capturing data to support reporting and analytics.

Practice: Schemas

6.7 The Data Warehouse

We have used the term "data warehouse" as a catch-all phrase describing the location from which our reporting and analytics will be served. Basically, a data warehouse is the primary source of information that feeds the analytical processing within an organization.

There are conflicting ideas about the formal definition of a data warehouse, but there is general consensus on some fundamental aspects.

   A data warehouse is a centralized repository of information.
   A data warehouse is organized around the relevant subject areas important to the organization.
   A data warehouse provides a platform for different consumers (both human and automated) to submit queries about enterprise information.
   A data warehouse is used for analysis and not for transaction processing.
   The data in a data warehouse is nonvolatile.
   A data warehouse is the target location for integrating data from multiple sources, both internal and external to an enterprise.

A data warehouse is usually constructed using a dimensional model. Information is loaded into the data warehouse after a number of preprocessing steps. Initially, the BI consumers will have been engaged to provide their requirements, after which the candidate data sources will have been selected. The quality of those data sets can be assessed through data profiling. At the same time, the data analysts and modelers will design and build the dimensional models for the analytical platform.

Once the candidate sources have been selected, the data is extracted, cleansed, potentially transformed, and then prepared to be loaded into the warehouse model. This may incorporate business rules as well. That data is subsequently reformulated into dimensional form and loaded into the target warehouse. These processes compose what is referred to as the data warehouse's back end.

Once the data is in the warehouse, it may be used for any of the reporting and analysis purposes. Certain tools may draw their input directly from the data warehouse or from data marts that are extracted for specific purposes. The data warehouse can also act as a data source for algorithmic analytics performed on specialty analytical frameworks, as well as provide an "anchor point" for collecting and storing additional analytical results from these algorithms.

6.8 OLAP and CUBES

Online analytical processing is different from the typical operational or transaction processing systems. There are many proposed definitions of OLAP, most of which describe what OLAP is used for. The most frequently used terms are "multidimensional" and "slice-and-dice." Online analytical processing tools provide a means for presenting data sourced from a data warehouse or data mart in a way that allows the data consumer to view comparative metrics across multiple dimensions. In addition, these metrics are summarized in a way that allows the data consumer to drill down (which means to expose greater detail) on any particular aspect of the set of facts.

The data to be analyzed in an OLAP environment are arranged in a way that enables visibility along any of the dimensions. Usually this is described as a cube, although the organization is intended to allow the analyst to fix some set of dimensions and then see aggregates associated with the other dimensional hierarchies. Let's resume our sales analysis, and consider a sales fact table that records every sales transaction, including date, time, location, customer, item, quantity, price per product, sales clerk, sales promotion, and total sales. We might configure an OLAP cube with these dimensions:

   Customer
   Sales Location
   Product
   Time
   Clerk
   Sales Promotion

Within each of these dimensions is a hierarchical structure, such as time periods (hour, day, week, month, quarter, year), sales locations (point of sale, store, city, county, state, region), and item categories (for example, a specialized product such as shampoo is contained within the hair-care product class, which in turn is contained within the beauty aids product class). The OLAP environment provides an aggregate view of data variables across the dimensions and across each dimension's hierarchy. This might mean an aggregate function applied to any individual column across all the data related to each dimension (such as "total dollar sales by time period by sales location" or "average price by region by customer class"). For example, the data analyst can explore the total sales of beauty aid products within the Western region and then drill down along another dimension, such as the product dimension (total sales of hair-care products within the Western region) or the region dimension (total sales of beauty aids in California).
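
To make the drill-down idea concrete, the short Python sketch below aggregates a few invented sales facts first at the product-class level within each region and then drills down one level of the product hierarchy inside the Western region. It uses plain dictionaries rather than an actual OLAP engine, and every name and number in it is hypothetical.

    # Hypothetical sales facts: (region, product_class, product, dollars)
    facts = [
        ("Western", "Beauty Aids", "Hair Care", 120.0),
        ("Western", "Beauty Aids", "Skin Care",  80.0),
        ("Western", "Grocery",     "Snacks",     45.0),
        ("Eastern", "Beauty Aids", "Hair Care",  60.0),
    ]

    def total_sales(rows, keys):
        """Aggregate total dollars grouped by the requested dimension levels."""
        out = {}
        for region, pclass, product, dollars in rows:
            dims = {"region": region, "product_class": pclass, "product": product}
            group = tuple(dims[k] for k in keys)
            out[group] = out.get(group, 0.0) + dollars
        return out

    # Summary level: total sales by product class within each region.
    print(total_sales(facts, ["region", "product_class"]))

    # Drill down along the product hierarchy within the Western region only.
    western = [f for f in facts if f[0] == "Western"]
    print(total_sales(western, ["product_class", "product"]))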

Because of the cube structure, there is an ability to rotate the perception of the data to provide different views into the data using alternate base dimensions. This conceptual ability to pivot or rotate the data provides the "slice" part; the ability to drill down on any particular aggregation provides the "dice" part.

The value of an OLAP tool is derived from the ability to quickly analyze the data from multiple points of view, and so OLAP tools are designed to precalculate the aggregations and store them directly in the OLAP databases. Although this design enables fast access, it means that there must be a significant amount of preparation of the data for the OLAP presentation as well as a potentially large storage space, because the number of cells within the cube is determined by both the number of dimensions and the size of each dimension.

For example, an OLAP cube with two dimensions, customer (1000 values) and sales locations (100 entries), would need 100,000 cells. Add a third dimension, product (with 200 entries), and you suddenly need 20 million cells. Add a fourth dimension, time (52 weeks), and your space requirement jumps to 1.04 billion cells! Not only that, computational requirements grow in this same manner, because those aggregations need to be calculated and stored. No wonder many vendors rely on parallel machine architectures to support the OLAP data environment.

Practice: Fundamental aspects of a data warehouse

6.9 Improving Data Accessibility

Drivers for Data Integration

The need to balance increased accessibility against the performance issues relating to data latency effectively frames the disciplines associated with data integration. Essentially, we are beginning to redefine the concept of data integration; it is not limited to extracting data sets from internal sources and loading them into a data warehouse, but focuses on effectively facilitating the delivery of information to the right places within the appropriate time. Data integration goes beyond ETL, data replication, and change data capture, although these remain key components of the integration fabric.

Some of the key factors in this redefinition include:

   Volume. Petabyte and exabyte data volumes are becoming the norm, not the exception;
   Performance. Emerging demands for high-performance data integration capabilities are based on the need for linear scalability, the use of parallelism, and high-bandwidth data channels;
   Lineage. Or rather, the exposure of lineage, namely transparency when desired, and opacity otherwise;
   Speed of delivery. Reduced latency for accessing data from across nonuniform, heterogeneous platforms;
   Semantic consistency. Collaborative methods for managing and utilizing metadata shared across communities of interest;
   Quality. Embedded controls to ensure quality of the data; and
   Security. Guarantees of data protection no matter where data lives.

6.10 Extraction/Transformation/Loading

A basic concept for populating a data warehouse is that data sets from multiple sources are collected and then added to a data repository from which analytical applications can source their input data. This sounds straightforward, but actually can become quite complex. Although the data warehouse data model may have been designed very carefully with the BI clients' needs in mind, the data sets that are being used to source the warehouse typically have their own peculiarities. Yet not only do these data sets need to be migrated into the data warehouse, they will need to be integrated with other data sets either before or during the data warehouse population process.

This extract/transform/load (commonly abbreviated to ETL) process is the sequence of applications that extract data sets from the various sources, bring them to a data staging area, apply a sequence of processes to prepare the data for migration into the data warehouse, and actually load them. Here is the general theme of an ETL process.

   Get the data from the source location.
   Map the data from its original form into a data model that is suitable for manipulation at the staging area.
   Validate and clean the data.
   Apply any transformations to the data that are required before the data sets are loaded into the repository.
   Map the data from its staging area model to its loading model.
   Move the data set to the repository.
   Load the data into the warehouse.
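
A minimal Python sketch of these steps is shown below, assuming a hypothetical CSV export as the source system and an SQLite table standing in for the warehouse; the file name, column names, and cleansing rules are all invented for illustration.

    # Toy ETL run: extract from a CSV export, stage and clean in memory, load into SQLite.
    import csv
    import sqlite3

    def extract(path):
        # Get the data from the source location.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Validate, clean, and reshape the staged records for the warehouse model.
        staged = []
        for r in rows:
            try:
                qty   = int(r["quantity"])
                price = float(r["unit_price"])
            except (KeyError, ValueError):
                continue                      # reject records that fail validation
            staged.append({
                "item": r.get("item", "").strip().upper() or "UNKNOWN",
                "quantity": qty,
                "revenue": qty * price,       # derived field computed during staging
            })
        return staged

    def load(rows, db_path):
        # Move the prepared data set to the repository and load it.
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS sales_fact (item TEXT, quantity INTEGER, revenue REAL)")
        conn.executemany("INSERT INTO sales_fact VALUES (:item, :quantity, :revenue)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        load(transform(extract("daily_sales_export.csv")), "warehouse.db")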

Staging Architecture

The first part of the ETL process is to assemble the infrastructure needed for aggregating the raw data sets, applying the transformations, and preparing the data to be forwarded to the data warehouse. This is typically a combination of a hardware platform and appropriate management software that we refer to as the staging area. The architecture of a staging process can be seen in Figure 6.4.


Figure 6.4 Staging data in preparation for loading into an analytical environment.

Note that the staging architecture must take into account the order of execution of the individual ETL stages, including scheduling data extractions, the frequency of repository refresh, the kinds of transformations that are to be applied, the collection of data for forwarding to the warehouse, and the actual warehouse population.

Extraction

Extraction essentially boils down to two questions:

   What data should be extracted?
   How should that data be extracted?

Realize that the answer to the first question essentially depends on what the BI clients expect to see factored into their analytical applications, and it will have been identified as a result of the data requirements analysis process. But it is a deeper question, because the data that we want to flow into the repository is likely to be a subset of some existing set of tables. In other words, for each data set extracted we may only want to grab particular columns of interest, and we may want to use the source system's ability to select and join data before it flows into the staging area. A lot of extracted data is reformulated or restructured into forms that can either be easily manipulated at the staging area or forwarded directly to the warehouse.

How data should be extracted may depend on the scale of the project, the number (and disparity) of data sources, and how far into the implementation the developers are. Extraction can be as simple as a collection of SQL queries or the use of adapters that connect to the different originating sources, or complex enough to require specially designed programs written in a proprietary programming language. There are tools available to help automate the process, although their quality (and corresponding price) varies widely.

Automated extraction tools generally provide some kind of definition interface for specifying the source of the data to be extracted and a destination for the extract, and they can work in one of two major ways, both of which involve code generation techniques. The first is to generate a program that executes on the platform where the data is sourced and initiates a transfer of the data to the staging area. The other is to generate an extraction program that runs on the staging platform and pulls the data from the source down to the staging area.

Transformation

What is discovered during data profiling is put to use as part of the ETL process to help map source data into a form suitable for the target repository; this mapping includes the following tasks.

   Data type conversion. This includes parsing strings representing integer and numeric values and transforming them into the proper representational form for the target machine, and converting physical value representations from one platform to another (EBCDIC to ASCII being the best example).
   Data cleansing. The rules we can uncover through the profiling process can be applied, along with directed actions that can be used to correct data that is known to be incorrect and where the corrections can be automated. This component also covers data-duplicate analysis and elimination and merge/purge.
   Integration. This includes exploiting the discovery of table and foreign keys for representing linkage between different tables, along with the generation of alternate (i.e., artificial) keys that are independent of any systemic business rules, mapping keys from one system to another, archiving data domains and codes that are mapped into those data domains, and maintaining the metadata (including full descriptions of code values and master key-lookup tables).
   Referential integrity checking. In relation to the foreign key relationships exposed through profiling or as documented through interaction with subject matter experts, this component checks that any referential integrity constraints are not violated and highlights any nonunique (supposed) key fields and any detected orphan foreign keys.
   Derivations. Any transformations based on business rules, new calculations, string manipulations, and such that need to be applied as the data moves from source to target are applied during the transformation stage. For example, a new "revenue" field might be constructed and populated as a function of "unit price" and "quantity sold."
   Denormalization and renormalization. Frequently data that is in normalized form when it comes from the source system needs to be broken out into a denormalized form when dimensions are created in repository data tables. Conversely, data sourced from join extractions may be denormalized and may need to be renormalized before it is forwarded to the warehouse.
   Aggregation. Any aggregate information that is used for populating summaries or any cube dimensions can be performed at the staging area.
   Audit information. As a matter of reference for integrity checking, it is always useful to calculate some auditing information, such as row counts, table counts, column counts, and other tests, to make sure that what you have is what you wanted. In addition, some data augmentation can be done to attach provenance information, including source, time and date of extraction, and time and date of transformation.
   Null conversion. Because nulls can appear in different forms, ranging from system nulls to explicit strings representing different kinds of nulls, it is useful to have some kind of null conversion that standardizes the different null representations coming from disparate systems.
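
As a small illustration of a few of these tasks (data type conversion, null conversion, and a derived "revenue" field), here is a hedged Python sketch; the source field names and null markers are hypothetical.

    # Sketch of field-level transformations; field names and null markers are invented.
    NULL_MARKERS = {"", "N/A", "NULL", "?", None}   # disparate null representations

    def to_decimal(value):
        # Data type conversion: parse a string into a numeric value, or None if unparseable.
        try:
            return float(str(value).replace(",", ""))
        except ValueError:
            return None

    def transform_record(src):
        unit_price = to_decimal(src.get("UNIT_PRICE"))
        quantity   = to_decimal(src.get("QTY_SOLD"))
        return {
            # Null conversion: collapse the different null markers into a single form.
            "customer": None if src.get("CUST_NAME") in NULL_MARKERS else src["CUST_NAME"].strip(),
            "unit_price": unit_price,
            "quantity": quantity,
            # Derivation: a new "revenue" field computed from unit price and quantity sold.
            "revenue": unit_price * quantity if unit_price is not None and quantity is not None else None,
        }

    print(transform_record({"CUST_NAME": "N/A", "UNIT_PRICE": "1,250.00", "QTY_SOLD": "3"}))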

Loading

The loading component of ETL is centered on moving the transformed data into the data warehouse. The critical issues include the following.

   Target dependencies, such as where and on how many machines the repository lives, and the specifics of loading data into that platform.
   Refresh volume and frequency, such as whether the data warehouse is to be loaded on an incremental basis, whether data is forwarded to the repository as a result of triggered transaction events, or whether all the data is periodically loaded into the warehouse in the form of a full refresh.

ETL Scalability

There are two flavors of operations that are addressed during the ETL process. One involves processing that is limited to all data instances within a single data set, and the other involves the resolution of issues involving more than one data set. For example, the merge/purge operation compares pairs of records taken from different data sets to determine if they represent the same entity and are therefore candidates for merging. In the most naive method, this process requires each instance from one data set to be compared with all the instances from the other set; as more data sets are added to the mix, the complexity of this process increases geometrically.
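
A naive version of that pairwise comparison might look like the following Python sketch. The matching rule (identical normalized name and postal code) is invented for illustration; real merge/purge logic is considerably more sophisticated, which is exactly why scalability becomes a concern.

    # Naive merge/purge: compare every record in one data set against every record in another.
    def looks_like_same_entity(a, b):
        # Hypothetical matching rule: identical normalized name and postal code.
        return (a["name"].strip().lower() == b["name"].strip().lower()
                and a["zip"] == b["zip"])

    def merge_candidates(set_a, set_b):
        # O(len(set_a) * len(set_b)) comparisons; each added data set multiplies the work again.
        return [(a, b) for a in set_a for b in set_b if looks_like_same_entity(a, b)]

    crm     = [{"name": "Jane Doe ", "zip": "80202"}, {"name": "Bob Ray", "zip": "97201"}]
    billing = [{"name": "jane doe",  "zip": "80202"}]
    print(merge_candidates(crm, billing))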

The story is basically this: The more data sets that are being integrated, the greater the amount of work that needs to be done for the integration to complete. This creates two requirements: (1) more efficient methods must be applied to perform the integration, and (2) the process must be scalable as both the size and the number of data sets increase.

Extract, Load, and Transform (ELT)

Increased data volumes pose a problem for the traditional ETL approach in that first accumulating the mounds of data into a staging area creates a bursty demand for resources. When the data sets are being extracted and transformed, the storage and computational needs may be high (or, actually, very high), but during the interim periods those resources might be largely unused. This is undesirable from both the performance and utilization standpoints.

A different approach seeks to take advantage of the performance characteristics of the analytical platforms themselves by bypassing the staging area. In other words, the data sets are extracted from the sources, loaded into the target, and the transformations are applied at the target. This modified approach, Extract, Load, and Transform (ELT), is beneficial with massive data sets because it eliminates the demand for the staging platform (and its corresponding costs to manage).

However, once the data is loaded into the target system, you may be limited by the target's capabilities for executing the transformations. For example, dedicated transformation applications can be much more flexible when dealing with data streamed directly out of files. In the ELT approach, you may have to use the RDBMS's native methods for applying transformations; these may not be as complete, and may run more slowly, than custom-designed transformation applications.

Practice: Approaches used in data warehouse designing and implementation

6.11 Six Basic Data Mining Activities

It is important to differentiate between the tasks employed for mining data, the methods by which these tasks are performed, and the techniques that use these activities for uncovering business opportunities. We can essentially boil the most significant methods down to a set of six tasks.

Clustering and Segmentation

Clustering is the task of taking a large collection of entities and dividing that collection into smaller groups of entities that exhibit some similarity (Figure 6.5). The difference between clustering and classification is that during the clustering task, the classes are not defined beforehand. Rather, it is the process of evaluating the classes after the clustering has completed that drives the determination or definition of each class.

Clustering is useful when you are not really sure what you are looking for but want to perform some kind of segmentation. For example, you might want to evaluate health data based on particular diseases along with other variables to see if there are any correlations that can be inferred through the clustering process. Clustering can be used in concert with other data mining tasks as a way of identifying a business problem area to be further explored. An example is performing a market segmentation based on product sales as a prelude for looking at why sales are low within a particular market segment.


Figure 6.5 Example of clustering

The clustering process collects data instances into groups such that each group is clearly different from all the others and the members of each group are recognizably similar. The records in the set are organized based on similarity; since there are no predefined specifications for classification, the algorithms essentially "select" the attributes (or "variables") used to determine similarity. Someone with business context knowledge might be called upon to interpret the results and determine whether there is any specific meaning associated with the clustering; sometimes this results in culling out variables that do not carry meaning or have no relevance, in which case the clustering can be repeated without the culled variables.
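
The text above does not prescribe a particular algorithm; as one common choice, the Python sketch below applies k-means (assuming scikit-learn is available) to a handful of invented two-variable customer records and prints the resulting group assignments for an analyst to interpret.

    # One common clustering approach (k-means); the data and the choice of two clusters are illustrative.
    from sklearn.cluster import KMeans

    # Hypothetical records: (annual visits, average spend per visit) for a handful of customers.
    customers = [[2, 15.0], [3, 18.0], [40, 5.0], [38, 6.5], [4, 17.0], [42, 4.0]]

    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(customers)
    print(labels)   # group membership; a business analyst then interprets what each group means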

Classification

The world is divided into two groups of people: those who classify the world into two groups, and those who do not. But seriously, our natural tendency is to assign things into groups based on some set of similar characteristics. For example, we break up groups of customers by demographic and/or psychographic profiles (e.g., marketing to the lucrative 18- to 34-year-olds), or divide products into product classes, and so on.

Once you have completed a segmentation process, the values of identified dependent variables can be used for classification. Classification is the process of organizing data into predefined classes (Figure 6.6). Those classes may be described using attributes selected by the analyst, or may actually be based on the results of a clustering model. During a classification process, the class definitions and a training data set of previously classified objects are presented to the application, which then attempts to build a model that can be used to accurately classify new records. For example, a classification model can be used to rate public companies as good, medium, or poor investments, assign meta-tags to news articles based on their content, or assign customers to defined market segments.
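
As a sketch of this idea, the following Python snippet trains a simple decision tree (one possible method, assuming scikit-learn is available) on a tiny, invented training set of previously classified records and then classifies two new records.

    # Sketch of classification: train on previously classified records, then classify new ones.
    from sklearn.tree import DecisionTreeClassifier

    # Training set: (income in $K, existing debt in $K) with a known class for each record.
    features = [[35, 20], [90, 5], [60, 40], [120, 10], [28, 30], [75, 8]]
    classes  = ["poor", "good", "poor", "good", "poor", "good"]

    model = DecisionTreeClassifier(random_state=0).fit(features, classes)
    print(model.predict([[80, 12], [30, 25]]))   # classify new, previously unseen records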


Figure 6.6 Classification of records based on defined characteristics.

Estimation

Estimation is a process of assigning some continuously valued numeric value to an object. For example, credit risk assessment is not necessarily a yes/no question; it is more likely to be some kind of scoring that assesses a propensity to default on a loan. Estimation can be used as part of the classification process (such as using an estimation model to guess a person's annual salary as part of a market segmentation process).

One value of estimation is that, because a value is being assigned to some continuous variable, the resulting assignments can be ranked by score. So, for example, an estimation process may assign a value to the variable "probability of purchasing a time-share vacation package" and then rank the candidates by that estimated score, identifying the highest-scoring candidates as the best prospects.

Estimation is used frequently to infer some propensity to take an action, or as a way to establish a reasonable guess at an otherwise indeterminable value. An example is customer lifetime value, in which we seek to build a model reflecting the future value of the relationship with a customer.

Prediction

The subtle difference between prediction and the previous two tasks is that prediction is the attempt to classify objects according to some expected future behavior. Classification and estimation can be used for the purposes of prediction by using historical data, where the classification is already known, to build a model (this is called training). That model can then be applied to new data to predict future behavior.

You must be careful when using training sets for prediction. There may be an inherent bias in the data that leads you to draw inferences or conclusions that merely reflect that bias. Use different data sets for training and testing, and test, test, test!

Affinity Grouping

Affinity grouping is a process of evaluating relationships or associations between data elements that demonstrate some kind of affinity between objects. For example, affinity grouping might be used to determine the likelihood that people who buy one product will be willing to try a different product. This kind of analysis is useful for marketing campaigns when trying to cross-sell or up-sell a customer on additional or better products. This can also be used as a way to create product packages that have appeal to large market segments. For example, fast-food restaurants may select certain product components to go into a meal packaged for a particular group of people (e.g., the "kid's meal") and targeted at the population who is most likely to purchase that package (e.g., children between the ages of 9 and 14).

Description

The last of the tasks is description, which is the process of trying to characterize what has been discovered or trying to explain the results of the data mining process. Being able to describe a behavior or a business rule is another step toward an effective intelligence program that can identify knowledge, articulate it, and then evaluate actions that can be taken. In fact, we might say that the description of discovered knowledge can be incorporated into the metadata associated with that data set.

Practice: Data Mining activities

Glossary 6

Business Intelligence-The transformation of raw data into useful information.

Classification-Predictive - maps data into predefined groups or classes

Clustering-Descriptive - groups similar data together into clusters

Extraction/Transformation/Loading-ETL is the process of extracting raw data and then transforming it and loading it into a target to be used for business intelligence.

Summarization Rules-Descriptive - maps data into subsets with associated simple descriptions or generalizations.


QUIZ 6

1) Which of the following key factors is based on the need for linear scalability?

-Performance

2) Which of the following conceptual definitions is the concept of understanding information based on recognized patterns in a way that provides insight to information?

-Knowledge

3) Which of the following identifies the groups whose members are similar?

-Clustering

4) Which of the following includes exploiting the discovery of table and foreign keys for representing linkage between different tables?

-Integration 

5) Which of the following conceptual definitions is a collection of raw value elements or facts used for calculating, reasoning, or measuring?

-Data

6) Which of the following is used for populating summaries or any cube dimensions that can be performed at the staging area?

-Aggregation

7) Which of the following data mining activities is the process of organizing data into predefined classes?

-Classification

8) Which of the following fundamental aspects is true for a data warehouse?

-A data warehouse is organized around the relevant subject areas important to the organization. 

9) Which of the following fundamental aspects is NOT true for a data warehouse?

-The data in a data warehouse is volatile. 

10) Which of the following is the sequence of applications that extract data sets from the various sources and bring them to a data staging area?

-Extract/Transform/Load 

11) Which of the following is NOT a reason that the dimensional model has become the standard for representing and managing data in a data warehouse?

-Integration 

12) Which of the following conceptual definitions is the result of collecting and organizing data in a way that establishes relationships between data items?

-Information

13) Which of the following data mining activities is a process of assigning some continuously valued numeric value to an object?

-Estimation


Lesson 7

Storage Technology

Completing this lesson will help you gain the following competency:

Competency 4017.1.7: Storage Technology The graduate demonstrates a fundamental understanding of storage technologies.

Your learning objectives for this lesson are listed below:

   Explain the components of RAID, including software and hardware implementations.
   Identify and describe RAID levels.
   Identify the components of an Intelligent Storage system.
   Define cloud computing.
   Explain the characteristics of cloud computing.
   Outline the benefits and drawbacks of cloud computing.


We would like to direct you to Information Storage and Management: Storing, Managing, and Protecting Digital Information in Classic, Virtualized, and Cloud Environments 2e one more time for some important content for this lesson, specifically pages 51-62 of chapter 3 ("Data Protection: RAID"), pages 72-92 of chapter 4 ("Intelligent Storage Systems"), and pages 313-330 of chapter 13 ("Cloud Computing").

Knowledge Check: Storage technology

Practice: NIST cloud characteristics

Practice: Components of Intelligent Storage System


Glossary 7

Cloud Computing-The use of a network of remotely hosted servers on the internet to store and manage data.

Intelligent Storage System-The use of four components to retrieve data: a front end, cache, back end, and physical disks. It is intelligent because data can be retrieved from the cache if it has already been requested.

RAID-Redundant Array of Independent Disks - a term used to describe the use of multiple disk drives to store data for performance and redundancy.

QUIZ 7