
    William H. HAMER, et al., Plaintiffs-Appellants, v. CITY OF ATLANTA, et al., Defendants-Appellees. UNITED STATES of America, Plaintiffs, v. CITY OF ATLANTA, et al., Defendants.
    Nos. 86-8607, 86-8788.
    United States Court of Appeals, Eleventh Circuit.
    May 23, 1989.
    
      Antonio L. Thomas, Thomas & Dotson, Atlanta, Ga., for plaintiffs-appellants.
    Marva Jones Brooks, W. Roy Mays, III, Deborah McIver Floyd, Atlanta, Ga., for City of Atlanta et al.
    Anthony J. McGinley, Carter & Ansley, Atlanta, Ga., for D. Stewart, G. Rioux, R.G. Bickford, B. Poss and L. Miller.
    Before RONEY, Chief Judge, CLARK, Circuit Judge, and MORGAN, Senior Circuit Judge.
   CLARK, Circuit Judge:

This is an appeal from an order issued by the United States District Court for the Northern District of Georgia determining that a written examination given by the City of Atlanta for the purpose of promoting candidates in the Bureau of Fire Services from the rank of firefighter to fire lieutenant was properly validated. The appellants fault the decision of the district court in two respects: (1) the district court failed to recognize that the validation study on which the City of Atlanta relied was flawed; and (2) the district court erred in not determining whether alternative selection procedures would have had less of an adverse impact on the racial composition of those promoted.

The issue in this lawsuit is a narrow one. Although the City conceded that the promotions had an adverse impact on race, the appellants do not assert that the examination was racially biased. Instead, they attack the validation of the test. As will be discussed infra, if business necessity requires employees to perform certain specified skills, an employer is not guilty of violating Title VII of the Civil Rights Act if promotions are made pursuant to testing applicants for those skills by a procedure that is properly validated. 29 C.F.R. § 1607.2(C) states:

Nothing in these guidelines is intended or should be interpreted as discouraging the use of a selection procedure for the purpose of determining qualifications or for the purpose of selection on the basis of relative qualifications, if the selection procedure had been validated in accord with these guidelines for each such purpose for which it is to be used.

The Guidelines do consider the subject of adverse impact under the heading of “Fairness,” which is set forth in 29 C.F.R. § 1607.14(B)(8) and is discussed infra at pages 1531-1532. We affirm.

I. HISTORY

The history of this case properly begins on September 18, 1975, when black firefighters filed a complaint in the United States District Court for the Northern District of Georgia, alleging that certain employment practices of the City of Atlanta were unlawfully discriminatory and violated Title VII of the Civil Rights Act of 1964, 42 U.S.C. § 2000e and 42 U.S.C. §§ 1981, 1983 and 1985 (district court civil action C75-1809, in this court 86-8607). The district court allowed the International Association of Firefighters to intervene on behalf of white firefighters employed by the Bureau of Fire Services of the City of Atlanta. Thereafter, on December 12, 1975, the United States of America, as plaintiff, filed its own action against the City of Atlanta, (district court civil action C75-2315, in this court 86-8788). The district court consolidated the two actions and eventually all parties to this case entered into a consent order resolving all allegations of discrimination and reverse discrimination. In this consent order, the City of Atlanta agreed not to discriminate against any person in its Bureau of Fire Services on account of race. The consent order also included a provision stipulating that the district court retained jurisdiction over this matter and reserved the right to initiate, on its own motion, appropriate proceedings in the event a question arose as to any party’s compliance with the provisions of the agreement. The consent order was approved by the district court on November 9, 1979. (R. 5 at 119).

The consent decree implicitly ratified City of Atlanta Ordinance Sections 11-3041 to 11-3048 (Defs.Exh. 1) which provided in part: “A candidate shall first take a written examination of knowledge that has been validated for content by a professional tester consulted in accordance with Title VII of the 1964 Civil Rights Act, as amended (42 U.S.C. § 2000e et seq).” The general purpose of the ordinance “is to establish a promotional system for the bureau that provides for the selection of superior officers based solely on merit and qualifications.” (§ 11-3041). The ordinance makes clear that a written examination is the sole component in making promotions to the rank of lieutenant. Promotion to captain, however, includes both a written examination and an oral examination.

In the summer of 1979, the City of Atlanta entered into a relationship with McCann Associates, Inc. of Huntingdon Valley, Pennsylvania, a professional test-developing firm. After preliminary studies and consultations, it was decided in 1980 that McCann was to provide the Bureau of Fire Services with a written, multiple-choice examination, to be used in promoting individuals to two company officer ranks — fire lieutenant and fire captain. The test would be based on a thorough job analysis, be reliable and valid, and would conform to the requirements of the Uniform Guidelines for Employee Selection Procedures. 29 C.F.R. § 1607.

After completion of a job analysis, McCann developed a pool of 250 questions from which three alternative examinations could be drawn. In order to demonstrate the job relatedness of the potential exams, McCann subjected them to a criterion related validity study using the concurrent model. As part of the validation process, McCann asked shift commanders and battalion chiefs in the Bureau of Fire Services to give job performance ratings to 78 existing fire captains, using criteria developed by McCann. Forty-two captains were white and 36 were black. McCann trained the shift commanders and battalion chiefs in the rating procedure and the ratings for the 78 fire captains were taken in January, 1981. In March 1981, the pool of 250 questions was administered to the fire captains in the form of a written examination. Prior to taking the exam, McCann provided the fire captains with a test study guide and, after scoring the 78 exams, McCann compared each individual’s test score with the performance rating given that individual by his supervisors.

The Uniform Guidelines for Employee Selection Procedures (“Guidelines”) require a sufficiently high correlation between test scores and performance ratings in order to establish the criterion related validity of a promotion exam. In other words, an employer desiring to use a certain exam must first show that those performing well on the exam are also given high job performance ratings and those performing poorly are given lower job performance ratings. If a statistically significant relationship is indicated, then the examination is considered to be valid because a high grade on the examination is predictive of satisfactory performances.

When McCann compared the 78 test scores with their corresponding performance ratings, the correlation proved insufficient to meet the requirements of the Guidelines. Accordingly, none of the three alternative exams were validated. After analyzing the information that had been gathered up to that point, McCann Associates decided that the reason for the poor correlation was the relative inexperience of the shift commanders and battalion chiefs. Most of the supervisors had been promoted to their current positions only nine months previously. Not only did they lack experience at their own job, they had only had nine months to observe their subordinates. McCann concluded that this period of time was insufficient for the supervisors to adequately rate the performance of the fire captains. McCann and the City of Atlanta decided to have the supervisors rate the fire captains once again and compare the new ratings with the original test scores from March 1981. In June 1982, after fifteen months had passed since the exam had been given and after twenty-four months of supervision of those being rated had accrued, the rating was repeated, using the same training and techniques. Eleven new fire captains participated by taking the 250 question exam, raising the number of subjects to 89, 49 of whom were white, 40 of whom were black. When McCann compared the new ratings with the test scores, they found a correlation that was sufficient to satisfy the Guidelines for each of the three alternative forms of the exam.

After McCann reported the successful validation of the examination, the City of Atlanta decided to have McCann test the candidates in order to rank them for promotion to fire lieutenant. (R. 28-440). In October of 1984 the Bureau of Fire Services administered a written, multiple-choice examination for the purpose of promoting ten persons to the rank of fire lieutenant in the Bureau. Of the 270 firemen who took the test, 156 were black, 113 were white, and one was of “other” racial origin. (See Def.Exh. 2.) The persons who achieved the ten highest scores consisted of nine whites and one oriental. Because this result manifested a statistical adverse racial impact against black persons, the City engaged Dr. John Veres and Dr. Chester Palmer of the University of Auburn at Montgomery, Alabama, to determine whether the examination was validated in accordance with the Uniform Guidelines for Employee Selection Procedures (1978), 29 C.F.R. § 1607 (1982).

Although satisfied that the test examination was properly validated, the City informally requested the district court to conduct a hearing for the purpose of determining that the examination was properly validated according to law. Following this request, the court issued an order on May 1, 1986, directing any person to show cause why the examination should not be validated. The City Attorney was directed to serve a copy of the order upon all applicants who took the examination.

At hearings held by the district court on May 15, 1986 and July 22, 23 and 24, 1986, a group of applicants, through counsel, appeared and contested the validation of the examination. This appeal ensued. During the course of those hearings, the City of Atlanta assumed the burden of demonstrating that the examination was validated in accordance with the appropriate standards. The City offered testimony and documentary evidence through William Howeth, Vice President of McCann Associates, Inc., and Drs. John G. Veres, III and Chester I. Palmer, Jr., both of Auburn University. The appellants offered the testimony of Dr. Stephen Cole, a research director employed by Research Design Associates in Decatur, Georgia, and who also held a teaching position at the Emory University School of Law. Dr. Cole supported the appellants’ assertion that the examination was flawed and had not been properly validated.

After receiving all of the evidence offered by the parties in this case, Judge Charles A. Moye, Jr., ruled that the examination was validated in accordance with the Uniform Guidelines for Employee Selection Procedures. An order to that effect was entered on July 25, 1986, but the court delayed the effective date until August 10, 1986, to allow appellants time to file an appeal. The appellants moved the district court to enjoin the City of Atlanta from making any promotions based on the examination pending their appeal of the district court decision. This motion was denied on August 7, 1986. The appellants then moved this court to stay the district court’s July 25, 1986 order, pending this appeal. This motion was also denied on August 18, 1986, and this appeal proceeded.

II. VALIDATION OF EMPLOYMENT EXAMS

In Title VII cases involving the use of written examinations for hiring or promotion, a prima facie case is presented by a statistical demonstration that the test in question has an adverse racial impact. Upon a showing of such impact, the burden shifts to the employer to prove that the test is job related. Albemarle Paper Co. v. Moody, 422 U.S. 405, 425, 95 S.Ct. 2362, 2375, 45 L.Ed.2d 280, 301 (1975); Fisher v. Procter & Gamble Manufacturing Co., 613 F.2d 527, 544 (5th Cir.1980), cert. denied, 449 U.S. 1115, 101 S.Ct. 929, 66 L.Ed.2d 845 (1981); Scott v. City of Anniston, 597 F.2d 897, 901 (5th Cir.1979), cert. denied, 446 U.S. 917, 100 S.Ct. 1850, 64 L.Ed.2d 271 (1980). If the employer establishes the job relatedness of the test, the burden then shifts back to the challenging party to demonstrate that alternative methods of selection would have a lesser adverse impact. Dothard v. Rawlinson, 433 U.S. 321, 329, 97 S.Ct. 2720, 2725, 53 L.Ed.2d 786, 797 (1977); Albemarle Paper Co., 422 U.S. at 425, 95 S.Ct. at 2375.

The Equal Employment Opportunity Commission has issued “Uniform Guidelines on Employee Selection Procedures” to assist in determining whether employment tests are job related. 29 C.F.R. § 1607. The Guidelines “are designed to assist employers, labor organizations, employment agencies, and licensing and certification boards to comply with requirements of Federal law prohibiting employment practices which discriminate on grounds of race, color, religion, sex, and national origin. They are designed to provide a framework for determining the proper use of tests and other selection procedures.” 29 C.F.R. § 1607.1(B). The Supreme Court has held that the Guidelines are “entitled to great deference.” Griggs v. Duke Power Company, 401 U.S. 424, 433-34, 91 S.Ct. 849, 854-55, 28 L.Ed.2d 158, 165-66 (1971). See also Watkins v. Scott Paper Co., 530 F.2d 1159, 1186-87 (5th Cir.), cert. denied, 429 U.S. 861, 97 S.Ct. 163, 50 L.Ed.2d 139 (1976).

The Guidelines refer to three procedures or “validity studies,” whereby sufficient job-relatedness may be demonstrated: criterion-related validity, construct validity, and content validity. 29 C.F.R. § 1607.5. The Supreme Court has described these methods of validation:

Professional standards developed by the American Psychological Association in its Standards for Educational and Psychological Tests and Manuals (1966), accept three basic methods of validation: “empirical” or “criterion” validity (demonstrated by identifying criteria that indicate successful job performance and then correlating test scores and the criteria so identified); “construct” validity (demonstrated by examinations structured to measure the degree to which job applicants have identifiable characteristics that have been determined to be important in successful job performance); and “content” validity (demonstrated by tests whose content closely approximates tasks to be performed on the job by the applicant). These standards have been relied upon by the Equal Employment Opportunity Commission in fashioning its Guidelines on Employee Selection Procedures, 29 CFR pt. 1607 (1975), and have been judicially noted in cases where validation of employment tests has been in issue. See e.g., Albemarle Paper Co. v. Moody, 422 U.S. 405, 431, 95 S.Ct. 2362, 2378, 45 L.Ed.2d 280, 304 (1975); Douglas v. Hampton, 168 U.S.App.D.C., [62] at 70, 512 F.2d, [976] at 984 [(D.C.Cir.1975)]; Vulcan Society v. Civil Service Comm’n, 490 F.2d 387, 394 (CA2 1973).

Washington v. Davis, 426 U.S. 229, 247 n. 13, 96 S.Ct. 2040, 2051, 48 L.Ed.2d 597, 612 (1976).

In this case, the City of Atlanta chose to attempt validation of its exam through the “criterion-related” method. As the issues in this case hinge upon a clear understanding of this technique, a closer analysis is necessary.

At the heart of criterion-related validity is the statistical correlation between performance on the test and objective measures or “criterions” of performance on the job. This is measured in one of two ways. In a “predictive” study, all applicants for a position are given the examination. Those applicants selected for the position are allowed to work at the job for a period of time and their job performance is then measured. Their preemployment test scores are then compared to their job performance ratings. In a second method, known as “concurrent” validation, the test is administered to existing employees and their scores are compared to their job performance. It is this method that was used by McCann Associates in the preparation of the allegedly unlawful examination.

To prove that a test is criterion-related, a proponent of an exam must show two elements of correlation. These elements are “practical significance” and “statistical significance.” Practical significance is the degree to which test scores relate to job performance and is measured by a “correlation coefficient.” Statistical significance is a measure of the confidence that can be placed on the practical significance; that is, it expresses the probability that a particular correlation coefficient occurred by chance. In Ensley Branch of NAACP v. Seibels, 616 F.2d 812 (5th Cir.), cert. denied, 449 U.S. 1061, 101 S.Ct. 783, 66 L.Ed.2d 603 (1980), the former Fifth Circuit felt compelled to state that “explanation of a few statistical concepts is in order.” That explanation bears repeating here:

Statistically, the degree of correlation between two variables (e.g., entrance exam scores and subsequent school grades) is expressed as a “correlation coefficient” on a scale running from +1.0 to -1.0. A perfect positive correlation (e.g., entrance exam scores exactly predict subsequent school grades, with the higher exam scores predicting the best grades) would be expressed as +1.0, and a perfect negative correlation (e.g., entrance exam scores exactly predict subsequent school grades, except in reverse, with the lower exam scores predicting the best grades) would be expressed as -1.0. Where the two variables had absolutely no relationship to each other, the correlation coefficient would be .0. The closer a correlation coefficient is to either +1.0 or -1.0, the “higher the magnitude” of the correlation; and the closer it is to .0, the “lower the magnitude.” Mueller, Schuessler & Castner, Statistical Reasoning in Sociology, 2d Ed., at p. 315. Because a purely random drawing of a sample is liable to produce a correlation coefficient which is somewhat off an absolute .0, the concept of statistical significance becomes relevant. The concept is tied to the statistical theory of probability and is dependent upon the number of people in the sample. Generally, if a correlation coefficient is so low that, on the basis of the sample size involved, more than 1 in 20 random drawings could be expected to produce a correlation at least as great, that correlation coefficient is considered not to be statistically significant, or simply to be the same as a correlation coefficient of .0. On the other hand, if the obtained coefficient could be expected to reoccur no more than once in 20 random drawings, it is considered statistically significant, the statistical indication for which is p<.05. A correlation coefficient of the obtained magnitude which could not be expected to occur by chance more than once in 100 random drawings is expressed as p<.01. Mueller, et al., pp. 394, et seq.

Ensley Branch of NAACP, 616 F.2d at 817 n. 13.
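The two concepts in the quoted passage can be illustrated with a short sketch. The Python below is not drawn from the record: the function names, the toy exam scores and grades, and the use of a permutation test to stand in for the "random drawings" the passage describes are all assumptions made for illustration only.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient, on the scale from -1.0 to +1.0."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def permutation_p_value(xs, ys, draws=2000, seed=0):
    """Share of random re-pairings whose |r| is at least the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(draws):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    return hits / draws


# Hypothetical exam scores and later grades (not data from this case).
exam = [52, 60, 61, 67, 70, 74, 78, 81, 85, 90]
grades = [2.1, 2.4, 2.0, 2.9, 3.0, 2.8, 3.3, 3.1, 3.6, 3.8]

r = pearson_r(exam, grades)
p = permutation_p_value(exam, grades)
print(f"r = {r:.2f}, permutation p = {p:.4f}")
print("significant at .05" if p < 0.05 else "not significant at .05")
```

In this sketch the coefficient plays the role of "practical significance," while the permutation share approximates how often chance alone would produce a correlation at least as large.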

III. STANDARDS OF REVIEW

Appellants take issue with the district court regarding certain findings of fact and the application of a rule of law. Generally, we apply a clearly erroneous standard of review for a factual determination by a district court. Fed.R.Civ.P. 52(a). This standard has been applied previously in a Title VII case questioning the validity of an employment examination. Ensley Branch of NAACP, 616 F.2d at 818. “A finding is ‘clearly erroneous’ when although there is evidence to support it, the reviewing court on the entire evidence is left with the definite and firm conviction that a mistake has been committed.” United States v. United States Gypsum Co., 333 U.S. 364, 395, 68 S.Ct. 525, 542, 92 L.Ed. 746, 766 (1948). Because the district court’s determination regarding the validity of the examination in this case was a factual one, we examine it for clear error.

The appellants further claim that the district court misapplied a rule of law regarding the use of alternative selection procedures. This question of law is subject to plenary review. Bailey v. Carnival Cruise Lines, Inc., 774 F.2d 1577, 1578 (11th Cir.1985).

IV. THE VALIDITY STUDY ISSUE

Appellants contend that the validity study conducted by McCann Associates, Inc. was flawed in two respects: “1. The inconsistency between supervisory ratings and the fire company officer test” and “2. The inconsistency of ratings given by shift commanders and those given by battalion chiefs.” (App.Br. pp. 14 and 16.) Furthermore, appellants criticize the fifteen-month delay between the administration of the exam and the supervisory ratings. The district court found that the evidence submitted by appellants did not support their contentions.

Dr. Stephen Cole, appearing on behalf of the appellants, testified at the district court hearing that he had observed inconsistencies between the test scores of the participating fire captains and the ratings given those officers by their respective shift commanders and battalion chiefs. Dr. Cole stated:

The point here is that there is very little correspondence with those captains between the test and the supervisory ratings. For those of correlation .33, there is very little association between how well you did on the test and how well the supervisors evaluated your job performance. Of the top 25 candidates identified by supervisors, only seven of them would be identified in the top 25 if the test alone was used. (R. 22-180).

Dr. Cole also pointed out that the person who ranked first on the written exam was ranked 31st by the supervisors; the person who ranked second on the written exam was ranked 68th by the supervisors; and the person who ranked 20th on the written exam was ranked 79th by the supervisors. Dr. Cole’s conclusion was that “... clearly when used as the sole criterion for promotion, the test does not act as a good predictor.” (R. 22-183).

Dr. Cole is not an Industrial Psychologist and this case apparently offered him his first opportunity to analyze a Criterion Measure Performance Rating (CMPR) system of testing for job skills and knowledge. His testimony that the ratings of the subject for job skills did not correlate with their test scores is seriously flawed. Dr. Cole relied on his (plaintiffs) Exhibit 13 which listed (not by name) the 89 subjects, the ratings given by their supervisors, and their test scores. His testimony to the district court concerned only 25 of the subjects which he picked out because their test scores differed greatly from their ratings. For example, the subject who was ranked at the bottom, 89th, was 5th from the top on test scoring. Cole used him and others to contend that the study results were skewed.

It is elementary statistical knowledge that the smaller the sample the more unreliable the result. McCann Associates admitted concern that having only 89 subjects was not as large a group as would be preferred. Dr. Cole shrank the group from 89 to 25 in order to reach a conclusion that the study did not accurately predict that the test would result in demonstrating which firemen would make the best fire lieutenants. The district court correctly rejected Dr. Cole’s unscientific approach to the issue.

One of the City’s experts was Dr. Chester Palmer, a professor of mathematics at Auburn University who had experience in the field of statistical analyses of employee testing for Title VII purposes and had clients such as the United States Navy, United States Air Force, and the personnel Departments of the States of Alabama and Georgia. Dr. Palmer concluded that the McCann study and tests met professional standards.

Dr. Palmer conducted five separate analyses of the validity of the CMPR system used by McCann. Dr. Palmer ran these analyses in an attempt to equalize— that is, to take out the extraordinary influences caused by certain factors, such as a very lenient supervisor/rater who scored everyone rather high or such as the captain (ratee) who was rated 89th by the raters but scored fifth on the test. (This person was rated by only one supervisor rather than the usual two or three, which may account in part for the deviation.) Dr. Palmer’s term for such persons was “an outlier.” Others might use the term “sport” or “deviant.” Dr. Palmer discussed the importance of such factors in making an analysis:

One of the problems with this kind of study with small “n” is that one person can swing that correlation a large amount. So I tried just to see what would happen if I went back and I ran those charts of the correlations again but I left him out. Okay. Now, I must emphasize I don’t have any justifiable reason for leaving him out other than the fact that it looks strange. Okay. That is, I’m not saying that the right thing to do is to leave him out. I can’t say that but when you see one person who is that wild, it only makes sense to see what would happen otherwise.
The answer turns out to be quite different. Now, we are back to the back of the chart. In the back of the chart is three more of the correlation tables. You can see in the title that these are labeled in the third line of the title, “88 subjects, one outlier removed.” The first one is what the correlations would be for the combined group if you did not include that individual. And we see that now if you read down under the total column we now have two of them that are almost .4. We’ve got one that’s 393 and one that’s 398.
So that essentially this one individual will change the correlation from .4 to .3. That’s what happens in small groups when there’s one person who is sort of off the wall.
Q. I haven’t found the .4 and the .3?
A. I’m sorry. The .3 is not in this table. It’s from the previous one. But if you read under the column that says “total,” if you read down for example to the one that says, “criterion standard,” 393, that’s what McCann would have gotten if they hadn’t had that one individual, if they just had the other 88 people, and the others are what the other ones would have been. They are all higher than they were before. They are all significant at the one percent level.
Now, the next two charts are the same thing as they were before. One set for the blacks, one set for the whites. It happens that this one individual is black so that the chart for the whites is going to be the same as it was before. They are the same people. But the chart for the blacks is different, and if you now look under “total” you see that all of a sudden they have gone way up.
For example, for the McCann standardization, the one I have labeled “crit std” it has now gone all way up to .45. It is now significant past the one percent level. What’s happening is that the only reason that that isn’t significant for blacks separately is this one person. Now, again, I have to emphasize I don’t think you can just throw him away because I don’t have a legitimate reason to say that’s an error of some kind. But the fact is that actually the correlation would be higher for blacks than for whites if it hadn’t been for one person. And that’s something that I think kind of important in interpreting results.

R. 23 at 337-39.
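Dr. Palmer's point that a single person can swing a correlation in a small group can be reproduced with a minimal sketch. The twelve score and rating pairs below are synthetic and hypothetical, not the Exhibit 13 data; the last entry is deliberately constructed as an outlier with the highest test score and the lowest rating.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Synthetic data (hypothetical): the final pair is the constructed outlier.
test_scores = [55, 58, 62, 65, 68, 70, 73, 76, 80, 84, 88, 95]
ratings = [3.0, 2.6, 3.4, 3.1, 3.9, 3.5, 4.2, 3.8, 4.5, 4.1, 4.8, 1.0]

r_all = pearson_r(test_scores, ratings)
r_trimmed = pearson_r(test_scores[:-1], ratings[:-1])  # outlier left out

print(f"all 12 subjects:     r = {r_all:.2f}")
print(f"outlier removed (11): r = {r_trimmed:.2f}")
```

As in the testimony, the comparison only shows how sensitive the coefficient is to one case; it does not by itself justify discarding that case.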

The second issue raised by the appellants deals with the supervisory ratings given by the shift commanders and the battalion chiefs. Dr. Cole criticized these ratings on the ground that inconsistencies existed between the ratings given particular subjects by their respective shift commanders and those given by their battalion chiefs. Some fire captains were ranked at a particular level by their shift commander and ranked considerably lower by their battalion chief. (R. 22-164). Dr. Cole concluded that the supervisors were using the rating system differently, despite McCann’s efforts to achieve uniformity.

The City of Atlanta responded to this criticism by pointing out that McCann had employed a standardization procedure to correct the inconsistencies between the ratings. Dr. Palmer testified that McCann had made adjustments to the ratings so that the rankings given by each supervisor were uniform. McCann examined each supervisor’s rankings to determine what that shift commander or battalion chief considered to be average job performance. Those rankings were then statistically adjusted to conform with a uniform standard of average job performance based on the rankings of all supervisors. The rankings of those individuals who the supervisor felt rated above or below average were similarly adjusted to reflect the collective average. Dr. Palmer agreed that standardization was the proper procedure in this instance, but stated:

The next question is how should you standardize and the problem I have is that there’s no professionally recognized formula for how to do this. There are different ways you could set up the way the numbers work. I don’t have any particular objection to what McCann did but I would be very unhappy if the study yielded a significant correlation their way and it didn’t come out any other way because I’m not sure their way is the best. So I went back and did it a bunch of different ways, ... (R. 22-297).
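One common way to put lenient and strict raters on the same footing is to convert each rater's scores to z-scores around that rater's own average. The sketch below assumes that approach purely for illustration. The record does not set out McCann's actual formula, Dr. Palmer testified that no single formula is professionally recognized, and the rater names and scores here are hypothetical.

```python
from statistics import mean, pstdev


def standardize_by_rater(raw):
    """Re-express each rater's scores as z-scores around that rater's own mean."""
    adjusted = {}
    for rater, scores in raw.items():
        values = list(scores.values())
        mu = mean(values)
        sigma = pstdev(values)
        adjusted[rater] = {
            # A rater who gave everyone the same score carries no ranking
            # information, so those scores are mapped to 0.0.
            ratee: (score - mu) / sigma if sigma else 0.0
            for ratee, score in scores.items()
        }
    return adjusted


# Hypothetical raw ratings: one lenient rater, one strict rater.
raw_ratings = {
    "lenient_rater": {"A": 9, "B": 8, "C": 10},
    "strict_rater": {"D": 4, "E": 3, "F": 5},
}

z = standardize_by_rater(raw_ratings)
for rater, scores in z.items():
    print(rater, {ratee: round(value, 2) for ratee, value in scores.items()})
```

After the adjustment, the person each rater considered average sits at zero regardless of how generous that rater was, which is the kind of uniform baseline the testimony describes.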

To be precise, Dr. Palmer performed five different standardization procedures. Dr. Palmer testified that each of the five standardization procedures yielded virtually identical results. He stated:

Now, I’m pleased to say after all this— and we can look at the numbers, if you like, but it didn’t make any difference. That’s what it comes right down to. Actually, it made very little difference which of these you used. Including the raw score, which I thought was amazing but I ran all my analyses, you know, when I did the correlations to see if things were significant, I ran all my analyses five times. Once for each of those five ways of giving the number to see if it made any difference how you gave the number, and it really didn’t. (R. 22-303).

Dr. Palmer expressed confidence in the validity of the supervisory ratings and their correlation with the test scores, stating:

That’s one reason that I have faith that the numbers — the correlations reported by McCann Associates are reasonable because when I did some of these things that were totally different I came out with almost the same thing. That’s a good reason to believe that the numbers are reasonable, that they haven’t just concocted a method that gives big numbers. (R. 22-303).

In sum, the City of Atlanta maintains that the Guideline requirements regarding the establishment of validity have been met; therefore the reliability of the criteria measurements is irrelevant. Mr. William F. Howeth testified that the exam known as Form A was found to have a correlation coefficient of + .33. (R. 20-38). The statistical significance was demonstrated to be p<.01, meaning there was less than one chance out of one hundred that the correlation coefficient was the product of chance. (R. 20-39). This value falls within the range set forth in the Guidelines of .05 or less. 29 C.F.R. § 1607.14(B)(5). Dr. John Veres testified that the correlation coefficient obtained on the study was particularly significant in view of the factors working against McCann. He concluded that “... with respect to the guidelines and what I would construe to be the consensus of people in our profession, the criterion related validity study passes muster.” (R. 20-94).
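The relationship between the reported coefficient and its significance level can be checked roughly with the Fisher z-transformation. This sketch assumes that the + .33 coefficient was computed on the 89 subjects of the second rating study, and it uses a normal approximation rather than whatever procedure McCann actually applied; the function name is hypothetical.

```python
import math


def fisher_z_p_value(r, n):
    """Approximate two-sided p-value for the hypothesis that the true correlation is zero."""
    z = math.atanh(r) * math.sqrt(n - 3)  # Fisher z statistic
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail area


# Assumed inputs: r = .33 as testified, n = 89 subjects in the second study.
p = fisher_z_p_value(0.33, 89)
print(f"approximate p = {p:.4f}")
print("below .01" if p < 0.01 else "not below .01")
```

Under those assumptions the approximation falls below .01, which is consistent with the significance level reported in the testimony.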

The final criticism raised by the appellants is the possibility that the fifteen-month delay between the administration of the exam and the supervisory ratings may have led to “contamination.” McCann allowed any fire captain participating in the validity study to learn his score on the written examination by submitting a self-addressed envelope along with his answer sheet at the completion of the test. The exam was given in March 1981. The test scores were not revealed to anyone other than the individual who had taken the exam. Appellants suggest that the shift commanders and battalion chiefs may have learned the test scores of the fire captains under their supervision. When the supervisors ranked these fire captains in June 1982, appellants maintain that knowledge of these scores may have led the shift commanders and battalion chiefs to rate individuals in accordance with their performance on the exam. Appellants point out that the Guidelines specifically warn against this form of contamination. “Proper safeguards should be taken to insure the scores on selection procedures do not enter into any judgments of employee adequacy that are to be used as criterion measures.” 29 C.F.R. § 1607.14(B)(3). Beyond raising this possibility, the appellants offer no evidence of any actual contamination.

In the course of cross-examination, Dr. Palmer testified that while there were potential problems with the fifteen-month delay, McCann had taken steps to avoid the possibility of contamination. He stated: “There is a possibility that the supervisors who were doing the rating might have known the scores of those people that are being rated. That is something you would normally try to prevent.” (R. 23-407). Documentary evidence submitted by the City of Atlanta indicated that McCann considered the possibility of contamination and rejected it based on their analysis of the circumstances leading to the second set of performance ratings in June 1982. In addition, McCann engaged an independent psychologist to make a complete statistical study of the ratings data to determine if any evidence of contamination existed. The psychologist concluded: “I find no evidence of contamination in the second set of ratings and I believe it is unlikely that such contamination occurred.”

As mentioned in the beginning of this opinion, the Guidelines require consideration of fairness when a test is validated through a criterion-related study. The applicable section states:

(b) Investigation of fairness. Where a selection procedure results in an adverse impact on a race, sex, or ethnic group identified in accordance with the classifications set forth in section 4 above and that group is a significant factor in the relevant labor market, the user generally should investigate the possible existence of unfairness for that group if it is technically feasible to do so. The greater the severity of the adverse impact on a group, the greater the need to investigate the possible existence of unfairness.

29 C.F.R. § 1607.14(B)(8)(b). Dr. Palmer discussed his analysis with respect to the fairness doctrine:

Q. Dr. Palmer, did you perform any analyses on whether the examination was fair to black applicants in this case?
A. Yes, I did. I should perhaps begin by explaining what “fairness” means in this context. The accepted definition of “fairness” in a criterion context like this is not whether groups differ in their test scores or not but whether, for example, if you have a black with a score of 75 and a white with a score of 75 you would expect them to be equally good officers; that is, are the predictions roughly the same for blacks and whites.
Obviously, if you expected a black who made a 75 to be as good as a white who made an 80, there would be some unfairness there because you are preferring the person with an 80 on the basis of his test score. And you can really only do that if it’s justified by your expectation of performance. And so there’s a standard way of testing this. Actually, McCann Associates did this for their criterion measure and it appears in the report. I arranged the computations slightly differently but I did it for all five of these and they formed, the results of this formed, my final exhibit which I’m sure pleases everyone, which is exhibit 18, and I should perhaps say something else before we go into the numbers, too. Both the Guidelines and the standards require investigation of test fairness when it is technically feasible. Now, I believe that it is arguable whether or not it was technically feasible under these circumstances, the groups are so small (we are talking about groups of 40 and 49) that I would probably be sympathetic to an argument that it wasn’t really technically feasible.
On the other hand, since it was easy to do I did it anyway and these are the results. And I sort of think this is arguable whether one needs to do this but as I said McCann Associates did it and since it was easy to do I did it rather than argue that it might not be feasible because I think that’s a judgment call.
What we have here is five sets, each of which contains three pages.
If the number for race in the last column under where it says “probability greater value of ‘t,’ ” if that number were below .05 it would say that race was statistically important, significantly important. We see here that it is not, that the coefficient is .20, that it’s well within the range that could be expected by chance.
* * * * * *
The test for that is again on the bottom line. It’s the variable called “interact,” which stands for interaction, and again we can look under the last column where it says “probability of getting a value that large by chance,” we see that it’s .65 which is fine. That’s actual — that means that the effect is actually smaller than we would expect to happen by accident.
* * * * * *
Q. What is the conclusion that you would draw from your examination of defendant’s exhibit 18 collectively?
A. There is no evidence of unfairness on a racial basis for this test as compared to the criterion measure.

R. 23 at 339-42.
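The kind of fairness test Dr. Palmer describes can be sketched as a regression of the criterion rating on the test score, a group indicator, and their product. A significant coefficient on the group indicator would suggest different intercepts for the two groups, and a significant coefficient on the interaction term would suggest different slopes. The sketch below is an illustration only: the function names and the data are made up, the label "interact" is borrowed from the testimony, and nothing here reproduces exhibit 18 or Dr. Palmer's actual computations. It reports t statistics rather than p-values; as a rough rule of thumb for samples of this size, an absolute t value above about 2 corresponds to significance at the .05 level.

```python
import math


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Partial pivoting: use the row with the largest entry in this column.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fairness_t_stats(scores, group, ratings):
    """Fit rating = b0 + b1*score + b2*group + b3*score*group by least squares.

    Returns the t statistic (coefficient / standard error) for each term.
    `group` is a 0/1 indicator of group membership.
    """
    names = ("intercept", "score", "race", "interact")
    rows = [[1.0, s, g, s * g] for s, g in zip(scores, group)]
    k = len(names)
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ratings)) for i in range(k)]
    beta = solve(xtx, xty)
    resid = [y - sum(b * x for b, x in zip(beta, r)) for r, y in zip(rows, ratings)]
    sigma2 = sum(e * e for e in resid) / (len(rows) - k)
    # Diagonal of the inverse of X'X, one unit vector at a time.
    inv_diag = [
        solve(xtx, [1.0 if i == j else 0.0 for i in range(k)])[j] for j in range(k)
    ]
    return {
        name: b / math.sqrt(sigma2 * d)
        for name, b, d in zip(names, beta, inv_diag)
    }


# SYNTHETIC data: both groups have the same score-to-rating relationship.
scores = [60, 60, 70, 70, 80, 80, 90, 90] * 2
group = [0] * 8 + [1] * 8
ratings = [0.1 * s + e for s, e in zip(scores, [0.5, -0.5] * 8)]
t_stats = fairness_t_stats(scores, group, ratings)
print(t_stats)
```

In this synthetic example the score predicts the rating identically in both groups, so the "race" and "interact" t statistics are essentially zero, which is the pattern the testimony describes as showing no evidence of unfairness.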

Dr. Palmer concluded that McCann Associates had met the requirements of the Guidelines both because the test accurately measured the skills and talents needed by fire lieutenants, and because the fairness doctrine requirements had been fulfilled.

We have discussed the fairness doctrine although the issue was not raised by appellants nor discussed by Dr. Cole in his testimony. It is helpful to note here the comment of the district court judge at the conclusion of the evidence:

The Court: Well, I’ll do it this way. I don’t know that I can just mentally go through the various things that are challenged but I will make subsidiary rulings as requested. But I will find the test valid and my general basis is this. I recognize the technical challenges to the test, that is, I understand what the challenges are.
To me it is important that there is not the slightest suggestion in any of the challenge that there is any racial bias contained in any of the questions. There’s not been the slightest challenge —and it may be inappropriate in this type of case — but there isn’t the slightest challenge as to the job relatedness of the items despite the admitted obvious and severe adverse racial impact which is recognized by the court.
Now, differing from the police case, there was in that case what appeared to the court to be an admission, concession, universal agreement that the test instrument did not measure a substantial proportion of the job content. There is no such question here and that is the difference as I see it, and particularly with respect to what I gather is the suggestion here that maybe additional instruments should be used.

R. 19 at 2.

After receiving all the evidence in this case, the district court ruled that the examination had been properly validated. The district court judge, with the agreement of all parties, made a general ruling that the test was valid and offered to make subsidiary rulings as requested. No such requests were made. As a result, this court is placed somewhat at a disadvantage when asked to rule on the correctness of the decision below, due to the lack of meaningful insight into the basis for the trial court’s holding. In an earlier case involving the validity of a promotion exam, this court stated: “Had the court made numbered separate findings of fact and conclusions of law, we might not be in such perplexity. The discursive mode it adopted, while permissible under the rules, leaves the reader frequently in the dark as to the legal assumptions that may underlie any particular fact assertion.” Nash v. Consolidated City of Jacksonville, 763 F.2d 1393, 1398 (11th Cir.1985), petition for cert. filed (May 27, 1988).

The finding of validity by the district court must be examined by this court under a standard of clear error. Ensley Branch of NAACP, 616 F.2d at 818. We must therefore consider whether there was sufficient evidence on which the trial court could base its decision that the examination had been properly validated. Part of this inquiry requires an examination of the weight that is to be given to the opinions of the respective expert witnesses in this case. “... [O]ne of the most generally accepted rules of all jurisprudence, state and federal, civil and criminal, is that the questions of credibility and weight of expert opinion testimony are for the trier of fact....” Mims v. United States, 375 F.2d 135, 140 (5th Cir.1967). “Credibility choices and the resolution of conflicting testimony are for the trial court, if not clearly erroneous.” Middleton v. Dan River, Inc., 834 F.2d 903, 910 (11th Cir.1987), citing United States v. Reddoch, 467 F.2d 897, 898 (5th Cir.1972). In Anderson v. Bessemer City, 470 U.S. 564, 575, 105 S.Ct. 1504, 1513, 84 L.Ed.2d 518, 529-30 (1985), the Supreme Court stated: “When a trial judge’s finding is based on his decision to credit the testimony of one of two or more witnesses, each of whom has told a coherent and facially plausible story that is not contradicted by extrinsic evidence, that finding, if not internally inconsistent, can virtually never be clear error.”

At the district court hearing, Dr. Cole, the lone witness for the appellants, testified under cross-examination that he had not taken a course with test construction as its sole focus; that he had not taught courses focusing solely or primarily upon psychometrics, test construction, or test validation; that he had never published in the areas of psychometrics or test construction; and that he had never been in charge of construction of a promotional examination similar to that involved in this case. (R. 22-188-98). He further testified that his Ph.D. was in the area of human experimental psychology (R. 22-190), and his major publication was in the area of sleep research. Dr. Veres and Dr. Palmer, witnesses for the City of Atlanta, were accepted by the appellants and the district court as experts. (R. 20-65 and 22-264).

The district court, upon consideration of the credentials of the witnesses, their testimony, and the evidence presented, determined that the experts testifying on behalf of the City of Atlanta were more persuasive and ruled that the test had been properly validated. “Job-relatedness can be established through the testimony of expert witnesses supported by a validation study.” Nash v. Consolidated City of Jacksonville, 837 F.2d 1534 at 1537-38 (11th Cir.1988) (citations omitted). “A finding is clearly erroneous when although there is evidence to support it, the reviewing court on the entire evidence is left with the definite and firm conviction that a mistake has been committed.” United States v. United States Gypsum Co., 333 U.S. at 395, 68 S.Ct. at 542. Upon a review of the “entire evidence,” we are not left with this level of doubt regarding the decision of the district court. Because we lack a “definite and firm conviction that a mistake has been committed,” we reject the appellants’ objections.

V. THE ALTERNATIVE SELECTION PROCEDURES ISSUE

The appellants also allege that the district court erred in not requiring the City of Atlanta to review supplemental or alternative measures to lessen the adverse racial impact of the examination. Appellants cite Giles v. Ireland, 742 F.2d 1366 (11th Cir.1984), in support of this argument. Reliance on Giles in this instance is misplaced as Giles clearly adopts the ruling in Albemarle Paper Company v. Moody, 422 U.S. 405, 95 S.Ct. 2362, 45 L.Ed.2d 280 (1975). While the district court is required to determine “whether adequate alternatives with a lesser adverse impact would serve the employer’s needs,” Giles, 742 F.2d at 1374, Albemarle clearly places the burden of demonstrating the usefulness of alternatives on the complaining party:

If an employer does then meet the burden of proving that its tests are “job related,” it remains open to the complaining party to show that other tests or selection devices, without a similarly undesirable racial effect, would also serve the employer’s legitimate interest....

Albemarle, 422 U.S. at 425, 95 S.Ct. at 2375.

Appellants point out that Mr. William Howeth of McCann Associates admitted on cross-examination that an oral component of the examination would be desirable. (R. 20-53). They fail to point out, however, that Mr. Howeth’s complete testimony was that while an oral interview would be desirable, it was not necessary due to the positive results of the validation study performed on the written exam. When asked if it was his opinion that another component needed to be added to the exam in order for it to comply with the appropriate federal guidelines, he replied: “Clearly not ... because the instrument that was used is a valid instrument.” (R. 20-36). Appellants presented no evidence demonstrating that an oral interview or any other alternative procedure would lessen adverse racial impact while serving the “employer’s legitimate interest.”

The burden of establishing the effectiveness of alternative selection procedures lies with the complaining party. We cannot fault the district court for not requiring such a showing by the City of Atlanta. Because the district court properly placed the burden on the appellants and the appellants failed to carry that burden, we must reject the appellants’ claims regarding the use of alternative selection procedures in this case.

VI. BUSINESS NECESSITY

The business necessity test is part of the employment discrimination law of this circuit. Business necessity is closely akin to job relatedness and the terms are often interchanged. Job relatedness is used in analyzing the questions or subject matter contained in a test or criteria used by an employer in making hiring or promotional decisions. Business necessity is larger in scope and analyzes whether there is a business reason that makes necessary the use by an employer of a test or criteria in hiring or promotional decision making.

The doctrine in this circuit originated in Pettway v. American Cast Iron Pipe Company, 494 F.2d 211, 244-45 (5th Cir.1974), cert. denied, 439 U.S. 1115, 99 S.Ct. 1020, 59 L.Ed.2d 74 (1979). In Pettway the company had required an applicant to pass a test as a prerequisite to hiring. The district court eliminated that requirement. The district court did not eliminate the employer’s requirement that admittance to an apprentice program required a high school education. Our court struck the requirement of a high school diploma or equivalent criterion because it did not measure the skills necessary for the course work required by the apprenticeship. In Pettway, we said:

The test is whether there exists an overriding legitimate business purpose such that the practice is necessary to the safe and efficient operation of the business. Thus, the business purpose must be sufficiently compelling to override any racial impact....

494 F.2d at 245.

The most applicable circuit case to the one under consideration is Nash, 837 F.2d 1534. There the issue involved the promotion of firefighters to the position of lieutenant in that city’s fire department. The City of Jacksonville failed to promote appellant Nash because of his low score on a promotion examination. Nash showed an adverse impact and the panel in Nash discussed the problems related to those we confront. However, in Nash the City failed to establish that the test used for promotion purposes was validated under the Guidelines, 29 C.F.R. § 1607. The panel found that the City had failed to prove that the test established job relatedness and said the following about business necessity:

Thus while the City has a business necessity in ensuring that it promotes only qualified firefighters to lieutenant, it has no business purpose at all for using an improper means to do so. The City presented no evidence on why a racially discriminatory test is a business necessity.

837 F.2d at 1539. This statement followed a finding by the panel that the City’s expert had admitted that 27 of the 97 questions on the 1981 examination had an adverse impact on black applicants.

At no point in this case, in the testimony or the briefs, is there any reference to whether the test given was a business necessity. It is apparently assumed by all of the parties and the district court that once job relatedness was shown, a separate showing of business necessity was unnecessary. This may well stem from the consent decree and the City Ordinance previously quoted. The consent decree contains the following paragraph:

X. That the City shall, as a goal, seek and use good faith efforts to recruit and hire applicants so that, within three (3) years from the entry of this decree, the representation of white and black firefighter recruits, among those hired during this time period, shall approximate the representation among those hired in the years 1970-1975, this Court having previously found no evidence of discrimination during this time period. To this end, the City agrees to take the following actions as part of its hiring procedures.
A. To make a clear policy statement of non-discrimination toward all races as part of its recruitment efforts, which statement shall convey the following, or language substantially similar to the following:
“Allegations have been made in the past that the City has discriminated against whites and blacks. The City wishes to make clear that it has no policy of discrimination for either black or white applicants, and that all are encouraged to apply for the position of firefighter.”

Such policy statement shall also specify that residency within the City of Atlanta is not a condition of employment.

B. To make affirmative recruitment efforts throughout the surrounding five-county area, in a substantially uniform manner, using racially balanced teams of recruiters, such efforts to exclude advertisements visible only in neighborhoods dominated by one racial group.
C. To utilize a professionally developed test.

R. 5 at Tab 119, p. 4-5. (Emphasis added.)

It is the function of the district court and ourselves to be guided by the consent decree in this case and the City Ordinance which was adopted concurrently with the entry of the consent decree. The City Ordinance provides for a written examination for promotion to lieutenant conditioned upon the examination being validated pursuant to the Title VII proceedings. Since the district court and this panel find that the test was properly validated, and since the appellants have as their only appellate contention that the test was not properly validated, we conclude that there is no error in the district court not having considered this factor. Our own review of the validation report of McCann Associates and the job analysis of fire lieutenants and fire captains convinces us that a test of this nature is a business necessity. It is a matter of common sense that the most qualified firefighters should be promoted to the positions of lieutenant and captain. While only a few skills are needed to put out a grass fire or a burning automobile, fire departments in many instances are called upon to save human life, preserve property, and extinguish fires caused by chemicals and other incendiary causes.

In performing the job analysis and in preparing the test questions, McCann Associates considered a variety of knowledges, skills and abilities required of those in command. The following is a condensation of some of those knowledges, skills and abilities:

A. Fire chemistry and physics knowledge. This analysis pointed out the necessity of the officer in charge of a crew of firefighters being able to identify the cause or source of a fire in order to determine the methodology to be used in fighting the fire, as well as an understanding of the chemical and physical properties of fire, smoke, heat, etc., in order to protect the public.

B. and C. Fire attack knowledge and fire extinguishment knowledge. Included is the necessity of initial evaluation of the fire scene, rescue considerations and performance of firefighters and equipment, as well as choosing the appropriate fire extinguishment methodology and use of special equipment.

D. Building construction knowledge.

E. Knowledge of local conditions which takes into consideration that the officer in charge must know a variety of ways of reaching the location of the fire, coordinating the route to be taken with those being taken by other fire companies responding to the call, locations of water supplies, hydrant locations, etc.

F. G. and H. Knowledge of administrative procedures of the fire department to insure adherence to policies and procedures, as well as an ability to supervise subordinates while maintaining a high level of morale.

There were 12 other categories of skills and abilities used by McCann. The study shows the vital importance of promoting only qualified persons to the rank of officer positions in the Atlanta Fire Bureau and the business necessity of the test.

VII. CONCLUSION

This is a difficult case both factually and technically, perplexing to the City of Atlanta, the appellants, and the courts in no small measure. By the terms of a consent order entered into in 1979, the City of Atlanta agreed not to discriminate against any person in the Bureau of Fire Services. See also Afro-American Patrolmen’s League v. Atlanta, 817 F.2d 719 (11th Cir.1987) (consent decree whereby race would play no part in the promotion process in Atlanta Bureau of Police Services). In an attempt to avoid the influence of race in the promotion of firefighters to fire lieutenants, the City wished to base advancement on the results of a scored written exam. Unfortunately the exam, though properly validated, had a serious adverse racial impact.

The job of a fire lieutenant is a difficult one, requiring a high degree of skill, knowledge and experience, as indicated by the job analysis conducted by McCann. Unlike a production line worker or salesperson, a fire lieutenant’s job performance is difficult to quantify. But the human and economic costs that are involved in the poor performance of this job are enormous. The risks of death, injury and property damage are greatly magnified when firefighting leadership is not of the highest possible calibre. The Court of Appeals for the Tenth Circuit, recognizing this reality with regard to airline pilots has held that “when the job clearly requires a high degree of skill and the economic and human risks involved in hiring an unqualified applicant are great, the employer bears a correspondingly lighter burden to show that his employment criteria are job-related.” Spurlock v. United Airlines, Inc., 475 F.2d 216, 219 (10th Cir.1972). Accord, Walker v. Jefferson County Home, 726 F.2d 1554, 1558 (11th Cir.1984). It is clearly imperative that the best qualified candidates should be singled out for advancement to a position as important as fire lieutenant. We cannot fault the City of Atlanta for doing precisely that, without regard to race.

Since it has been demonstrated that the decision of the district court recognizing the validity of the promotion examination in this case was not clearly erroneous, and the appellants failed to carry their burden of demonstrating that alternative selection procedures would have less adverse impact, the district court finding that the examination has been properly validated is

AFFIRMED. 
      
      . On May 15, 1986, at the hearing, Commissioner of Public Safety George Napper, Jr. testified that there were four vacancies for fire lieutenant and that the promotions would go to those receiving the highest scores. There was no testimony as to the racial composition of the additional four persons.
     
      
      . These efforts included developing a rating system whereby the rater uses a separate rating sheet for rating each of nine performance dimensions. The rater rates all of his subordinates at one time for a dimension and then starts over and rates all of his subordinates on the next dimension, etc. The purpose of this system was to encourage ratings based on a comparison of how each subject performs a specific dimension in relation to the performance of all his fellow subjects. It discourages rating a subject on the basis of his overall performance and focuses on the subject’s strengths and weaknesses.
      Performance was not ranked numerically to avoid the tendency of some raters to associate a number value with the value of a subject’s performance. The rating process used a ruler-like scale with five areas of proficiency for each particular dimension described in performance terms. This allowed the rater to distinguish among his subordinates those who possessed the dimension being rated in greater or less degree.
      McCann trained the raters prior to their participation and emphasized that the ratings would have no employment consequences for any of the subjects. The raters were urged to use a rating standard based on an "average" drawn from all the fire captains they had ever known. The importance of rating the subjects on personal observation rather than potential was stressed.
     
      
      . According to his testimony, Dr. Palmer duplicated the procedure used by McCann, performed an analysis utilizing unadjusted raw data, and selected three procedures of his own design to insure that McCann’s standardization had not yielded inaccurate results.
     
      
      . This figure was computed using a statistical technique known as a Pearson Product Moment Correlation.
     
      
      . The results of this independent evaluation are reported in Defendants’ Exhibit 4A: The Validation of a Written Test for Fire Company Officers (Lieutenant and Captain), pp. 42-43, McCann Associates, Inc. (1983).
     
      
      . The police case to which the district court makes reference is found on appeal in Afro-American Patrolmen’s League v. Atlanta, 817 F.2d 719 (11th Cir.1987). A panel of this court affirmed this same district judge’s holding in that case that the City was in civil contempt for violating the 1980 consent decree which governed Atlanta’s police department and the civil rights case comparable to this proceeding which started in 1979.
     