
    815 F.2d 84
    Alison PALMER, et al., Appellants v. George P. SHULTZ, as Secretary of State. Marguerite COOPER, et al., Appellants v. George P. SHULTZ, as Secretary of State.
    Nos. 85-6101, 85-6102.
    United States Court of Appeals, District of Columbia Circuit.
    Argued Sept. 25, 1986.
    Decided March 24, 1987.
    As Amended March 24, 1987.
    
      Bruce J. Terris, with whom Ellen Kabcenell Wayne, Washington, D.C., was on brief for appellants.
    Stuart Henry Newberger, Asst. U.S. Atty., with whom Joseph E. diGenova, U.S. Atty., Royce C. Lamberth, R. Craig Lawrence and Diane M. Sullivan, Asst. U.S. Attys., Washington, D.C., were on brief for appellee.
    Bettina M. Lawton, Washington, D.C., was on brief, for amicus curiae, Women’s Bar Association of the District of Columbia, urging reversal.
    Before WALD, Chief Judge, BORK, Circuit Judge, and HAROLD GREENE, District Judge.
    
      
       Of the United States District Court for the District of Columbia, sitting by designation pursuant to 28 U.S.C. § 292(a).
    
   Opinion for the Court filed by Chief Judge WALD.

WALD, Chief Judge:

In this action, a class of women plaintiffs allege various forms of unlawful employment discrimination in the Foreign Service from 1976 to 1983. After a trial, the District Court found that no unlawful discrimination had occurred. See 616 F.Supp. 1540 (D.D.C.1985). This appeal followed. The record, however, discloses that the District Court’s decision was premised on errors of law and that several of its critical findings of fact were clearly erroneous. Consequently, we reverse, and remand for further proceedings in accordance with this opinion.

I. Background Information

A. The Foreign Service and Its Employment Practices

The Foreign Service is our nation’s professional diplomatic corps. Members of the Service represent the interests of this nation abroad and assist the Secretary of State in the formulation of foreign policy at home. See 22 U.S.C. § 3904(1)-(2). The organization of Foreign Service personnel draws on the model of the United States military as well as the United States civil service. See S.Rep. No. 913, 96th Cong., 2d Sess. 2 (1980), U.S.Code Cong. & Admin. News 1980, p. 4419. For example, the Foreign Service is a “rank-in-person” system: members of the Service have an individualized rank which is independent of the rank of the particular job they happen to hold at any given time. H.R.Rep. No. 992, pt. 1, 96th Cong., 2d Sess. 3 (1980).

The Foreign Service also copies the military in its “up or out” personnel system. Individuals must serve a probationary period of up to five years before they can receive a career appointment in the Service. 22 U.S.C. § 3946. If at the end of that period an individual has not received a career appointment, he or she must leave the Service. Id. § 3949. (Although according to the Foreign Service Act of 1980, the term “Foreign Service Officer” refers only to members of the Service with career appointments, and those serving under a limited, probationary appointment are called “career candidates,” the parties to this lawsuit use the term “Foreign Service Officer,” or “FSO,” to refer to those serving under both career and limited appointments. To avoid confusion, we will do likewise.)

The Foreign Service assigns its officers to one of four areas of functional specialization, known as “cones”: political, economic, administrative, and consular. Officers in the political and economic cones deal with, respectively, political and economic dimensions to foreign relations and foreign policy. Officers in the administrative cone “are responsible for the support operations of U.S. embassies and consulates.” 616 F.Supp. at 1544 (¶ 5). Officers in the consular cone “work closely with the public providing assistance to American travelers and residents abroad, issuing visas [and dealing with] other immigration related issues.” Id. (¶ 6). As the District Court expressly found, the State Department does not encourage FSOs to change cones, and “[o]fficers are expected to serve the major portion of their time in the Service” in the cones to which they were initially assigned. Id. (¶¶ 10, 14). Some officers, however, do switch cones. Senior FSOs who have demonstrated leadership ability may transfer into a “prestigious” program direction cone. Id. at 1554 (¶ 104). Other FSOs are occasionally given temporary assignments to other cones or to some “inter-functional” positions. Id. at 1550 (¶ 70).

Most FSOs applying to the Foreign Service at junior entry levels must take a written examination. Beginning in 1975, the examinations have tested applicants for aptitude in all four functional areas, and the Foreign Service has used the results of these examinations to determine a new FSO’s initial cone assignment. Id. at 1545 (¶ 15). A relatively small number of individuals have entered the Service laterally as mid-level FSOs. These lateral entrants bypassed the examination process and “selected, in advance, the functional field in which they wished to compete and were evaluated only for that specific cone.” Id. (¶ 17).

Once in the Foreign Service, individuals change specific jobs frequently; the State Department has a policy of assigning individuals to positions for a set period of time, generally two to three years. See id. at 1550 (¶ 71); H.Rep. No. 96-992, pt. 1, 96th Cong., 1st Sess. 3 (1980). Since 1975, job assignments in the Foreign Service have been made pursuant to an Open Assignment Policy, in which all members of the Service receive a list of vacant positions and submit “a bid list” indicating their preferences. These bid lists are compiled into a “bid book” from which assignment panels make their selections, after considering the interests and preferences of the bureau in which each position is located. Id. at 1550 (¶¶ 73, 74). As previously indicated, some FSOs receive “out-of-cone” assignments pursuant to this process but in the main, job transfers are made inside the cones of initial assignment. In addition, FSOs do not necessarily receive a job position with a rank corresponding to the individual’s personal rank. Positions that have a higher rank than the individual are known as “stretch” assignments. Positions with a lower rank than the individual’s are “down-stretch” assignments. Pursuant to the Open Assignment Policy, individuals do not receive stretch or down-stretch assignments unless they bid for them, but as with any other assignment, individuals do not receive these assignments simply because they bid for them. Id. at 1551 (¶ 77).

The Foreign Service prepares annual written evaluations of its officers’ job performance. In addition to rating the actual past performances of FSOs, the evaluations rate the potential of the FSOs’ future job performance. 616 F.Supp. at 1549. The State Department also gives out Honor Awards in recognition of outstanding achievement. In descending order of prestige are the Distinguished Honor Award, the Superior Honor Award, and the Meritorious Honor Award. See Plaintiffs’ Post-Trial Brief at 112-13.

Except for Senior members, salaries in the Foreign Service are based on a schedule established by the President which consists of nine salary classes. 22 U.S.C. § 3963. The Secretary of State assigns all Foreign Service Officers to a particular salary class. Id. § 3964. By statute, except in limited circumstances, a career candidate for appointment as a Foreign Service Officer may not be initially assigned to a salary class higher than class 4 (class 1 being the highest). Id. § 3947. Usually career candidates are placed initially in class 7 or class 8. Promotions from one salary class to another are made by the Secretary of State after receiving recommendations and rankings submitted by selection boards which evaluate the members of each class. Foreign Service Officers do not compete for promotions until the transition from class 6 to class 5; until then, they are promoted at the end of an established time period if they perform their duties satisfactorily. See Joint Appendix (“J.A.”) at 117-121; Defendant’s Post-Trial Brief at 96.

B. The History of This Litigation

This class action began over ten years ago when appellants filed their complaint alleging that widespread discrimination against women in the Foreign Service violated Title VII of the Civil Rights Act of 1964, as amended in 1972 to cover employment discrimination in the federal government. See 42 U.S.C. § 2000e-16. The parties subsequently resolved by consent decree all claims relating to admission into the Foreign Service. The appellants’ claims of discriminatory personnel actions against women already in the Foreign Service proceeded to trial in the District Court. The parties agreed to try initially only the issue of liability, leaving appropriate remedies to a subsequent phase of the proceedings, if necessary. After trial on the liability issue, the District Court concluded that appellants “failed to show by a preponderance of the evidence any sexual discrimination by the State Department.” 616 F. Supp. at 1561. The court entered a final judgment for the Secretary of State, dismissing the complaint. Id.

This appeal followed from the District Court's failure to find sex discrimination in seven different types of personnel practices. First, the appellants claim that from 1976 to 1983, the Foreign Service discriminated against women in the initial cone assignments of entering FSOs; the State Department assigned proportionally fewer women than men to the political cone and proportionately more women than men to the consular cone. This disparity was allegedly caused by the differing scores of women and men on the Foreign Service entrance examinations, producing a disparate impact on women and men candidates in violation of Title VII. Second, women were given proportionally fewer out-of-cone assignments to the program direction cone and proportionally more out-of-cone assignments to the consular cone. Third, women were given proportionally fewer “stretch” assignments and proportionally more “downstretch” assignments than men in the same class. Fourth, women received a disproportionately low number of appointments as Deputy Chief of Mission, the position just below that of Ambassador. Fifth, in its evaluation reports, the State Department gave lower future potential ratings to women than men despite equivalent ratings for their past performance. Sixth, women received a disproportionately low number of Foreign Service Honor Awards. And seventh, the State Department promoted women from class 5 to class 4 at a lower rate than it promoted men.

With respect to each of these seven personnel practices, the appellants offered data showing a disparity between men and women, along with a statistical analysis designed to demonstrate the improbability that a disparity of that scale could result from chance. The data and analysis, they allege, provide a strong basis for inferring that this disparity was the product of unlawful discrimination. In addition, the appellants introduced nonstatistical evidence pertaining generally to the existence of a prejudicial attitude towards women in the Foreign Service from 1976 to 1983. The District Court, however, rejected the inference of unlawful discrimination in each of the seven areas.

In discounting the probative force of appellants’ statistics, the District Court said that their statistical studies rested on faulty data, or flawed methodology, or omitted a crucial variable that would explain the disparity between men and women in a nondiscriminatory way. The District Court also said that some of the statistical evidence focused on too narrow a segment of Foreign Service personnel practices. As we shall explain, the District Court’s treatment of the appellants’ evidence was in some instances contrary to law and in other respects clearly erroneous as a matter of fact.

II. Title VII Claims: Two Different Theories

Under Title VII a plaintiff can rely on either of two different theories to support a claim of unlawful sex discrimination. A “disparate treatment” claim alleges that the defendant intentionally based an employment decision on the sex of the plaintiffs. See, e.g., International Brotherhood of Teamsters v. United States, 431 U.S. 324, 335 & n. 15, 97 S.Ct. 1843, 1854 & n. 15, 52 L.Ed.2d 396 (1977). Disparate treatment claims can involve an isolated incident of discrimination against a single individual, or, as in this case, allegations of a “pattern or practice” of discrimination affecting an entire class of individuals. Id. A “disparate impact” claim alleges that the defendant based an employment decision on a criterion that although “facially neutral” nevertheless impermissibly disadvantaged individuals of one sex more than the other. Id. at 336 n. 15, 97 S.Ct. at 1854 n. 15. This case is a “classic” example of a disparate impact claim in which plaintiffs allege that the defendant based employment decisions on the results of a test for which members of one sex on average received lower scores than members of the other sex. See B. Schlei & P. Grossman, Employment Discrimination Law at 13 (1983-84 Supp.); see also Griggs v. Duke Power Co., 401 U.S. 424, 91 S.Ct. 849, 28 L.Ed.2d 158 (1971) (the original disparate impact case).

Because these two theories are distinct, we must consider them separately. Appellants’ only disparate impact claim concerns the initial cone assignments; the other six claims involve disparate treatment and we will consider them first.

III. Legal Principles Applying to Pattern or Practice Disparate Treatment Claims

In a typical sex discrimination pattern or practice disparate treatment case, plaintiffs allege the existence of a disparity between men and women in selection rates for a particular job or job benefit and further allege that this disparity was caused by an unlawful bias against members of the disadvantaged sex, usually women. To prevail in their claim, plaintiffs must prove, by a preponderance of the evidence, that these allegations are true. Proof of the disparity itself is based upon a comparison of the proportion of those women eligible for selection who were actually selected with the corresponding proportion of eligible men who were actually selected. Plaintiffs establish a disparity disfavoring women if the evidence demonstrates that the selection rate for eligible women was less than the selection rate for eligible men. Sometimes, the disparity is expressed as the difference between the number of women actually selected and the number of women one would expect to have been selected, assuming equality in the selection rates for men and women. (If one knows the number of women eligible and the selection rate for men, one can determine, using algebra, the expected number of successful women.)
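The algebra mentioned in the parenthetical can be sketched in a few lines. This is only an illustration: the function name and every count below are hypothetical and do not come from the record.

```python
def expected_women_selected(women_eligible, men_eligible, men_selected):
    """Number of women one would expect to be selected if women were
    selected at the same rate as men (all inputs are hypothetical)."""
    men_rate = men_selected / men_eligible
    return women_eligible * men_rate

# Hypothetical pool: 800 eligible men (160 selected), 200 eligible women (30 selected).
women_selected = 30
expected = expected_women_selected(women_eligible=200, men_eligible=800, men_selected=160)
shortfall = expected - women_selected

print(f"expected women selected: {expected:.1f}")
print(f"shortfall relative to equal selection rates: {shortfall:.1f}")
```

The shortfall is the disparity expressed in the second way the text describes: the gap between the expected and the actual number of women selected.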

Proof that the observed disparity was caused by an unlawful bias against women need not be direct. Circumstantial evidence that the disparity, more likely than not, was a product of unlawful discrimination will suffice to prove a pattern or practice disparate treatment case. See Teamsters, 431 U.S. at 335 n. 15, 97 S.Ct. at 1854 n. 15. Indeed, this circumstantial evidence may itself be entirely statistical in nature. See, e.g., Segar v. Smith, 738 F.2d 1249, 1278-79 (D.C.Cir.1984), cert. denied sub. nom. Meese v. Segar, 471 U.S. 1115, 105 S.Ct. 2357, 86 L.Ed.2d 258 (1985). In this case, appellants rely to a great extent on statistical evidence to prove their claims of disparate treatment. We find it necessary, therefore, to discuss how statistical analysis of an observed disparity can raise an inference of unlawful discrimination.

A. Raising An Inference of Discrimination With Statistical Evidence

A disparity between the selection rates of men and women for a particular job or job benefit has one of three possible causes. See D. Baldus & J. Cole, Statistical Proof of Discrimination 291 (1980). First, the disparity may be a product of an unlawful discriminatory animus; this is what plaintiffs are attempting to prove. Second, the disparity may have a legitimate and nondiscriminatory cause. For example, prior experience of a certain type may be an important factor in making certain employment decisions, and if it happened to be true that women on the average have less of this experience than men, one would expect that women would be selected less frequently. Third, the disparity may simply be a product of chance. Even if we may properly assume that, as a general rule, women and men on average are equally qualified to be selected for a particular job or job benefit, for any particular group of men and women who happen to constitute the actual pool of eligible candidates at the time the selections are made, there may be some deviation from this general rule because the actual qualifications of men and women differ from individual to individual and any particular pool of eligible candidates constitutes an inherently random collection of individuals. Thus, even if selections were made entirely on the basis of qualification, without a trace of discriminatory bias, random deviations in the selection rates for men and women may result.

A statistical analysis of a disparity in selection rates can reveal the probability that the disparity is merely a random deviation from perfectly equal selection rates. Statistics, however, cannot entirely rule out the possibility that chance caused the disparity. Nor can statistics determine, if chance is an unlikely explanation, whether the more probable cause was intentional discrimination or a legitimate nondiscriminatory factor in the selection process. See id. at 290-92.

Title VII nevertheless provides that if the disparity between selection rates for men and women is sufficiently large so that the probability that the disparities resulted from chance is sufficiently small, then a court will infer from the numbers alone that, more likely than not, the disparity was a product of unlawful discrimination, unless the defendant can introduce evidence of a nondiscriminatory explanation for the disparity or can rebut the inference of discrimination in some other way. See Hazelwood School District v. United States, 433 U.S. 299, 307-08, 97 S.Ct. 2736, 2741, 53 L.Ed.2d 768 (1977) (“Where gross statistical disparities can be shown, they alone in a proper case constitute prima facie proof of a pattern or practice of discrimination.”); see also Segar, 738 F.2d at 1278 (“[W]hen a plaintiff’s methodology focuses on the appropriate labor pool and generates evidence of [a disparity] at a statistically significant level,” this evidence alone will be “sufficient to support an inference of discrimination.”).

The preliminary question for a court, then, is at what point the disparity in selection rates is sufficiently large, or the probability that chance was the cause sufficiently low, for the numbers alone to establish a legitimate inference of discrimination. Although this question is crucial in Title VII litigation, the answers given by courts have been regrettably imprecise. The Supreme Court has twice stated that “[a]s a general rule for ... large samples, if the difference between the expected value and the observed number is greater than two or three standard deviations, then the hypothesis that [the disparity] was random would be suspect to a social scientist.” Castaneda v. Partida, 430 U.S. 482, 497 n. 17, 97 S.Ct. 1272, 1281, n. 17, 51 L.Ed.2d 498 (1977); see also Hazelwood, 433 U.S. at 309 n. 14, 97 S.Ct. at 2742 n. 14 (quoting Castaneda). But many lower courts and commentators have noted that the difference between two and three standard deviations is considerable and that, therefore, the Supreme Court's statement falls short of establishing an exact legal threshold at which statistical evidence, standing alone, establishes an inference of discrimination. See, e.g., Segar, 738 F.2d at 1283 n. 28.
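The text does not spell out how a difference between expected and observed selections is converted into a number of standard deviations. A minimal sketch, assuming a binomial model of the kind described in the Castaneda footnote and using purely hypothetical counts, might look like this:

```python
import math

def standard_deviations(observed, eligible, rate):
    """Signed disparity in standard-deviation units under a binomial model.

    Assumption (not taken from the record): each of `eligible` candidates is
    selected independently with probability `rate`. All names and numbers
    here are illustrative only.
    """
    expected = eligible * rate
    sd = math.sqrt(eligible * rate * (1.0 - rate))
    return (observed - expected) / sd

# Hypothetical: 200 eligible women, a 20% benchmark selection rate, 30 women selected.
z = standard_deviations(observed=30, eligible=200, rate=0.20)
print(f"disparity = {z:.2f} standard deviations")
```

A negative value indicates that fewer women were selected than the benchmark rate would predict.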

This court, using different terminology, has stated that statistical evidence meeting “the .05 level of significance ... [is] certainly sufficient to support an inference of discrimination.” Segar, 738 F.2d at 1283. “[T]he .05 level,” the Segar opinion explained, “indicates that the odds are one in 20 that the result could have occurred by chance.” Id. at 1282. (This statement is somewhat imprecise and has predictably led to confusion, as we discuss infra.) The Segar court justified the consistency of its statement with the statements of the Supreme Court by observing that “[a] level of two standard deviations corresponds to statistical significance at the .05 level.” Id. at 1283 n. 28. In this case, the District Court cited Segar in its Conclusions of Law, stating: “The Court adopts the .05 level for establishing that a [statistical] study is statistically significant.” 616 F.Supp. at 1559 (¶ 14). But the District Court then went on to say that “[t]he .05 level generally corresponds to 1.65 standard deviations.” Id.

How can a 5% probability of randomness correspond both to a measurement of two standard deviations and a measurement of 1.65 standard deviations, one may reasonably ask? There is a legitimate answer: it depends on whether one is using a “one-tailed” or a “two-tailed” test of statistical significance. A disparity measuring 1.65 standard deviations corresponds to a 5% probability of randomness under a one-tailed test. A disparity measuring two standard deviations (to be more precise, 1.96 standard deviations) corresponds to a 5% probability of randomness under a two-tailed test.

This difference between one-tailed and two-tailed tests obviously requires further explanation. It also presages the obvious question, given the substantial differences in result, of which test is the more appropriate one to use in Title VII cases. Neither this court’s opinion in Segar nor the District Court’s opinion in this case discusses the difference between “one-tailed” and “two-tailed” approaches. The Supreme Court has given us no explicit guidance on this issue. And, unfortunately, neither side to this litigation has devoted more than a single footnote each to this difficult but important issue. See Appellants’ Reply Brief at 32 n. 38; Appellee’s Brief at 62 n. 73. For obvious reasons we, too, confront this issue with some trepidation. But appellants’ and appellee’s evidence on the underpromotion of women from FSO class 5 to class 4 measures 1.88 and 1.76 standard deviations, respectively. (The difference results from the use of some different data. See 616 F.Supp. at 1557 (¶ 130).) Whether one adopts the appellants’ or the appellee’s number as the better evidence, it falls between 1.65 and 1.96 standard deviations. Therefore, if one tests the statistical significance of this number using the Segar standard of a 5% probability of randomness, the outcome turns on whether one uses a one-tailed or two-tailed test. Under a one-tailed test, the number is statistically significant (because it is larger than 1.65 standard deviations, which corresponds to a 5% probability of randomness under a one-tailed test) and therefore by itself establishes a prima facie case of disparate treatment. Under a two-tailed test, the number does not quite reach the statistically significant threshold (because it is smaller than 1.96 standard deviations, which corresponds to a 5% probability of randomness using a two-tailed test) and therefore by itself does not raise an inference of discrimination.
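The way the outcome turns on the choice of test can be checked numerically. The sketch below assumes a standard normal distribution and uses Python's standard-library `statistics.NormalDist`; the helper names are illustrative, and only the 1.88 and 1.76 figures and the 5% level come from the text.

```python
from statistics import NormalDist

ALPHA = 0.05                      # the 5% level referred to in Segar
STD_NORMAL = NormalDist()         # mean 0, standard deviation 1

def p_value(z, tails):
    """Probability of a chance disparity at least as large as z.

    tails=1 counts disparities in one direction only; tails=2 counts both.
    """
    upper_tail = 1.0 - STD_NORMAL.cdf(abs(z))
    return tails * upper_tail

def is_significant(z, tails, alpha=ALPHA):
    return p_value(z, tails) < alpha

for label, z in (("appellants", 1.88), ("appellee", 1.76)):
    print(f"{label}: z = {z}, "
          f"one-tailed p = {p_value(z, 1):.3f} (significant: {is_significant(z, 1)}), "
          f"two-tailed p = {p_value(z, 2):.3f} (significant: {is_significant(z, 2)})")
```

Under these assumptions both figures clear the 5% level on a one-tailed test and neither does on a two-tailed test, which is the tension the paragraph describes.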

Given the unavoidability of embarking upon a journey into the statistical maze, we begin with the terms “one-tailed” and “two-tailed”; they refer to the “tails” or ends of the bell-shaped curve, which represents in graph form a “random normal distribution.” E.g., W. Curtis, Statistical Concepts for Attorneys 72-73 (1983); see Diagram 1 copied from id. In these random distributions, the area under any segment of the bell curve measures the probability of that range of results occurring randomly. Id. Furthermore, the percentage area underneath the bell curve within one standard deviation (σ) distance from the mean (μ) of a normal distribution is always the same for all normal distributions (regardless of the specific value of σ or μ, or the units in which these terms are measured). Thus, the probability of a result randomly occurring that measures within one standard deviation of the mean of the distribution (either greater or lesser than the mean) is the same for all normal distributions: 68.26%. Id. Indeed, this relationship holds true for any distance from the mean, measured in numbers of standard deviations. For example, the probability of a result occurring within two standard deviations from the mean is 95.44% and the probability of a result occurring within three standard deviations is 99.73%. See Diagram 1. Thus, for all normal distributions, the probability of randomness is directly associated with a measurement in numbers of standard deviations.
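The areas quoted above can be reproduced with a short calculation. This is a sketch using Python's standard-library `statistics.NormalDist`; the function name and the second distribution's mean and standard deviation are arbitrary choices made only to show that the result does not depend on them.

```python
from statistics import NormalDist

def prob_within(k, mu=0.0, sigma=1.0):
    """Area under a normal curve within k standard deviations of its mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

for k in (1, 2, 3):
    standard = prob_within(k)
    rescaled = prob_within(k, mu=100.0, sigma=15.0)   # arbitrary mean and spread
    print(f"within {k} sd: {standard:.2%} (standard), {rescaled:.2%} (rescaled)")
```

The computed values agree with the figures in the text up to rounding in the last digit, and they are the same for both distributions.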

But for every deviation from the mean of a normal distribution, measured in a certain number of standard deviations, there are two distinct ways of referring to the probability of that result occurring randomly. For example, if fewer women than expected were selected for a particular job, and this disparity measured 2.17 standard deviations, we can ascertain the probability that women by chance would be underselected to this extent or greater. This probability corresponds to the area between 2.17 standard deviations and the end of the bell curve representing the most extreme underselection of women. Standard statistical tables reveal that this probability is only 1.5%. See B. Lindgren & D. Berry, Elementary Statistics 479 (1981).

We can speak of the probability measurement associated with 2.17 standard deviations in another way, however. Although the observed disparity between the actual and expected number of women in this example was an underselection of women, there is a corresponding possibility that women might randomly be overselected such that the difference between the expected number of women selected and the number of women selected due to this random overselection also measures 2.17 standard deviations. The probability of a random deviation from the expected number of women selected with a magnitude of 2.17 standard deviations or larger, resulting from either an underselection or overselection of women, corresponds to the area under the bell curve between 2.17 standard deviations and both extremes of the curve: 3%.
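Both ways of measuring the probability for the 2.17 example can be computed directly. The following is a minimal sketch assuming a standard normal distribution; the function name is illustrative.

```python
from statistics import NormalDist

def tail_probabilities(z):
    """Return (one-tail area, two-tail area) beyond |z| standard deviations."""
    one_tail = 1.0 - NormalDist().cdf(abs(z))
    return one_tail, 2.0 * one_tail

one_tail, two_tail = tail_probabilities(2.17)
print(f"underselection of 2.17 sd or more:      {one_tail:.1%}")
print(f"deviation of 2.17 sd in either direction: {two_tail:.1%}")
```

Because the bell curve is symmetric, the two-tail area is simply twice the one-tail area.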

The difference between “one-tailed” and “two-tailed” tests of statistical significance stems from these two different ways of measuring probability. If one decides (as the Segar court did) to reject the hypothesis that an observed disparity from an expected result occurred randomly only if the observed disparity falls within the range of the 5% most extreme possible disparities, one must still decide whether the 5% range should be entirely within only one of the tails of the bell curve, or instead should be divided with half of the range in each tail. Five percent of the total bell curve can be found either in the range from 1.65 standard deviations from the mean to one extreme end of the bell curve or in the area from 1.96 standard deviations to both extreme ends of the bell curve. Compare Diagrams 2 and 3, copied from V. Cangelosi, P. Taylor & P. Rice, Basic Statistics 173-74 (1979). For this reason, a 5% probability of randomness corresponds to 1.65 or 1.96 standard deviations, depending upon whether one uses a one-tailed or a two-tailed test. (Similarly, 1.65 standard deviations correspond to a 10% probability of randomness under a two-tailed test; and 1.96 standard deviations correspond to a 2.5% probability of randomness under a one-tailed test.)
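The 1.65 and 1.96 cutoffs follow from inverting the normal curve at the chosen level. A short sketch, again assuming a standard normal distribution and an illustrative function name:

```python
from statistics import NormalDist

def critical_value(alpha, tails):
    """Number of standard deviations beyond which `alpha` of the curve lies.

    tails=1 places all of alpha in one tail; tails=2 splits it evenly.
    """
    return NormalDist().inv_cdf(1.0 - alpha / tails)

one_tailed_cut = critical_value(0.05, tails=1)
two_tailed_cut = critical_value(0.05, tails=2)
print(f"5% level, one-tailed: {one_tailed_cut:.3f} standard deviations")
print(f"5% level, two-tailed: {two_tailed_cut:.3f} standard deviations")
```

The one-tailed value is closer to 1.645, which the text rounds to 1.65.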

We are now, hopefully, in a position to address whether in a Title VII case, a court should use a one-tailed or two-tailed test to determine whether statistical evidence alone should raise an inference of unlawful discrimination, recognizing that there is a difference of opinion among courts and commentators on the issue. Compare, e.g., EEOC v. Federal Reserve Bank of Richmond, 698 F.2d 633 (4th Cir.1983), rev’d on other grounds sub. nom. Cooper v. Federal Reserve Bank of Richmond, 467 U.S. 867, 104 S.Ct. 2794, 81 L.Ed.2d 718 (1984), with Little v. Master-Bilt Products, Inc., 506 F.Supp. 319 (N.D.Miss.1980). Indeed, one leading treatise on the role of statistical evidence in Title VII litigation has shifted its position between the publication of the main text and the publication of a supplement. In the main text of their book, Baldus and Cole write:

[S]tatistical texts frequently recommend the use of a one-tailed test when the only question of interest is the likelihood of a difference in one direction, e.g., when only a positive disparity between two numbers is of interest. This practice supports the use of a one-tailed test in discrimination cases, since the issue is always whether one group is favored over another. A defendant will argue, however, that both minority and majority groups [or men and women] are protected from discrimination and it is therefore inequitable to disregard the probability of outcomes that may favor either group. Since there is no clear answer to this question, the most desirable approach is an awareness of the conceptual and practical differences between the two types of tests and a consistent use of the same type of test in similar cases whenever practical. We have used two-tailed tests throughout this book.

D. Baldus & J. Cole, Statistical Proof of Discrimination 307-08 (1980) (footnote omitted). In the most recent supplement, however, the authors criticize as “unnecessarily strict” the Fourth Circuit’s decision in EEOC v. Federal Reserve Bank of Richmond to require a two-tailed approach unless “independent evidence indicates the presence of discrimination of the type being challenged.” D. Baldus & J. Cole, Statistical Proof of Discrimination 129 (1986 Cumulative Supp.) (footnote omitted). Baldus and Cole then state a preference for a legal rule that would allow a one-tailed test “if the possibility of intentional discrimination favoring the protected group represented by plaintiff [e.g., women in this case] can be ruled out as defying logic, i.e., the available evidence excluding the statistic in question gives strong support to the conclusion that the system is either nondiscriminatory or disadvantageous to the plaintiff’s group.” Id. at 129-30. In a footnote to this passage, the authors continue:

The logic underlying this statement is that if one can be certain that there was no discrimination in favor of plaintiff’s group, then any disproportionate impact would simply be interpreted as being a chance outcome in an equitable process.

Id. at 130 n. 38.

Although the latest position adopted by Baldus and Cole makes some sense, we reject its applicability to the present case. We note that some of appellants’ claims of unlawful discrimination involved complaints that women were overselected for particular kinds of jobs, e.g., consular cone and downstretch assignments. Appellants undoubtedly have the right under Title VII to object to the State Department’s selection of FSOs for these positions on the basis of sex. Such claims of discriminatory overselection, however, require a two-tailed statistical analysis. Appellants may view consular assignments as inferior to political assignments, but another class of women plaintiffs could certainly bring a Title VII claim if women were intentionally underassigned to the consular cone. Consequently, statistically significant deviations in either direction from an equality in selection rates would constitute a prima facie case of unlawful discrimination. Indeed, appellants’ own statistical expert testified that a two-tailed test was necessary in evaluating the disparity between men and women in assignments to the consular cone because the hypothesis to be tested is whether cone assignments are made without regard to sex. See Transcript (Tr.) at 1081.

We also think a two-tailed test of statistical significance should be applied to all of appellants’ discrimination claims in this case. First, Baldus and Cole originally noted the importance of consistency in evaluating statistical evidence. Second, although we by no means intend entirely to foreclose the use of one-tailed tests, we think that generally two-tailed tests are more appropriate in Title VII cases. After all, the hypothesis to be tested in any disparate treatment claim should generally be that the selection process treated men and women equally, not that the selection process treated women at least as well as or better than men. Two-tailed tests are used where the hypothesis to be rejected is that certain proportions are equal and not that one proportion is equal to or greater than the other proportion. See Curtis, supra, at 119-22, 133-37.

Moreover, even if a disparity in only one direction is at issue in a particular Title VII case (e.g., only the underpromotion and not the overpromotion of women) we think that the more appropriate assessment of the probability that the contested disparity resulted from chance requires a recognition that a random disparity of equal magnitude, but in the opposite direction, is equally as likely. For example, if plaintiffs in a Title VII case come into court simply with evidence that women were underselected for a particular job, and that this disparity measured 1.75 standard deviations, it is perfectly true that the probability of women being underselected to this extent or more by chance is only 4%. Under a one-tailed test of statistical significance, employing the 5% level, as this court did in Segar, this evidence alone would establish a prima facie case of disparate treatment.

But for a disparity measuring 1.75 standard deviations it is equally true that the probability of a random deviation of this magnitude or larger, either underselecting or overselecting women, is 8%. In other words, disparities of this magnitude will be consistent with the hypothesis that the selection process did not treat men and women differently in 8% of the cases. Even if in the case before the court the disparity disfavors women and not men, how can the court ignore the possibility that the case might still be one of the 8% cases in which a fair selection process would by chance produce disparities in this magnitude or greater? Thus, we think a court should generally adopt a two-tailed approach to evaluating the probability that the contested disparity resulted by chance. Furthermore, although an 8% probability is pretty low, we do not think that it is low enough to establish by itself an inference of unlawful discriminatory animus. We think that statistical evidence must meet the 5% level referred to in Segar for it alone to establish a prima facie case under Title VII. Taken together, as we have said, a two-tailed test and a 5% probability of randomness require statistical evidence measuring 1.96 standard deviations. Consequently, if plaintiffs come into court relying only on evidence that the underselection of women for a particular job measured 1.75 standard deviations, it seems improper for a court to establish an inference of disparate treatment on the basis of this evidence alone.
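The arithmetic behind these percentages can be checked with a short sketch. It assumes, as the discussion above does implicitly, that the disparity is measured against a standard normal distribution; the function names `one_tailed_p` and `two_tailed_p` are illustrative and do not come from the record.

```python
import math

def one_tailed_p(z: float) -> float:
    """Chance of a random deviation of z standard deviations or more in ONE direction."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2))

def two_tailed_p(z: float) -> float:
    """Chance of a random deviation of z standard deviations or more in EITHER direction."""
    return math.erfc(abs(z) / math.sqrt(2))

# 1.75 is the example disparity discussed above; 1.96 is the two-tailed 5% threshold.
for z in (1.75, 1.96):
    print(f"{z:.2f} SD: one-tailed {one_tailed_p(z):.3f}, two-tailed {two_tailed_p(z):.3f}")
```

On these assumptions a 1.75 standard deviation disparity falls just under 5% on a one-tailed view and near 8% on a two-tailed view, while 1.96 standard deviations is the point at which the two-tailed probability reaches 5%.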

Of course, plaintiffs in Title VII pattern and practice cases need not rely on statistical evidence alone. Because the ultimate issue in a disparate treatment case is whether the disparity resulted from unlawful discriminatory animus, plaintiffs may introduce any additional evidence which is probative on this issue. Thus, plaintiffs are in no way foreclosed from establishing an inference of discrimination simply because the contested disparity falls short of the 1.96 standard deviations mark when analyzed statistically. Obviously, to use an extreme example, if an employer admits under cross-examination that assignments for a certain position were based in large part on sex, it matters not that the observed underselection of women measures only 1.75 standard deviations. When plaintiffs in a Title VII pattern or practice case rely on evidence in addition to the evidence of the disparity itself, the issue for the trier of fact in determining whether the plaintiffs have established a prima facie case must be whether the totality of plaintiffs’ evidence (again including the evidence of the disparity itself) demonstrates that, more likely than not, the disparity resulted from an unlawful discriminatory animus— just as the issue after all the relevant evidence has been introduced by both sides remains whether in light of the totality of the evidence, plaintiffs have shown that, more likely than not, the disparity resulted from discrimination.

B. The Applicability of Title VII to Any Personnel Action

A plaintiff may bring a Title VII claim for alleged discrimination with respect to any employment decision by an agency of the federal government. The statute itself states that “all personnel actions affecting employees or applicants for employment ... shall be made free from any discrimination based on ... sex.” 42 U.S.C. § 2000e-16. In the Foreign Service Act of 1980, Congress reiterated this requirement specifically for Foreign Service employment practices. 22 U.S.C. § 3905. Moreover, in the 1980 Act, Congress specifically defined a “personnel action,” which must be free from sex discrimination, to encompass “(A) any appointment, promotion, assignment (including assignment to any position or salary class), award of performance pay or special differential, within-class salary increase, separation, or performance evaluation and (B) any decision, recommendation, examination, or ranking provided for under this chapter which relates to any action referred to in subparagraph (A).” Id. This language could hardly be more inclusive.

From this statutory language, two legal principles necessarily follow. First, appellants in this case may bring a disparate treatment claim regarding discrimination in any type of personnel decision regardless of whether or not that discrimination has an effect on other, arguably more important, personnel decisions. Thus, if the State Department has intentionally discriminated against women in certain types of assignment decisions, the State Department has violated 42 U.S.C. § 2000e-16 even if the State Department can prove that the unlawful discrimination in assignments did not adversely affect the opportunities of women for promotion in the Foreign Service.

It is beyond dispute that the State Department may not discriminate against women in making any kind of employment decision, and if the State Department breaches this requirement, appellants have a cause of action to vindicate their statutory rights. We note, as further support of our interpretation of 42 U.S.C. § 2000e-16, that the Supreme Court last Term interpreted an analogous Title VII provision applying to private employers to encompass a claim of sex discrimination for sexual harassment even if the sexual harassment caused no tangible or economic loss. Meritor Savings Bank, FSB v. Vinson, 477 U.S. 57, 106 S.Ct. 2399, 91 L.Ed.2d 49 (1986). The provision of Title VII involved in Vinson makes it “an unlawful employment practice for an employer ... to discriminate against any individual with respect to his compensation, terms, conditions, or privileges of employment, because of such individual’s ... sex.” 42 U.S.C. § 2000e-2(a)(1). The language of 42 U.S.C. § 2000e-16, involved here, is even broader, covering “all personnel actions” based on sex, regardless of whether the personnel action affects promotions or causes other tangible or economic loss.

Second, and relatedly, if plaintiffs in a Title VII case claim discrimination in certain kinds of employment decisions, it is no defense that the government did not discriminate against women in other kinds of employment decisions. For example, if the State Department intentionally underselected women for appointment as Deputy Chiefs of Mission (DCM), the State Department has violated 42 U.S.C. § 2000e-16 even if the State Department can prove that it did not discriminate against women in assignments to five other “high visibility” positions. Appellants need not allege or prove discrimination in assignments to other “high visibility” positions in order to maintain a cause of action with respect to discrimination in DCM assignments. As the Supreme Court has stated: “Of course, Title VII provides for equal opportunity to compete for any job.” Teamsters, 431 U.S. at 338 n. 18, 97 S.Ct. at 1856 n. 18 (emphasis in original).

Although under 42 U.S.C. § 2000e-16 appellants must not be required to prove discrimination in employment decisions other than the ones they are specifically contesting, the government is correct in arguing that evidence of nondiscrimination in those other employment decisions may be probative of whether intentional discrimination actually occurred in the contested employment decisions. For example, if an employer can demonstrate that it did not discriminate against women at several steps of a promotional ladder, that evidence, in some circumstances, may reasonably suggest that the employer did not discriminate in the step at issue either.

But courts must be especially careful in judging the relevance of this kind of evidence lest they contravene the legal rule that under 42 U.S.C. § 2000e-16 plaintiffs need not prove discrimination in personnel actions other than those specifically at issue. The evidence supporting an inference of unlawful discrimination in certain employment decisions may be sufficiently strong that evidence of nondiscrimination in other employment decisions cannot rebut this inference. Thus, in some cases the strength of appellants’ prima facie case is so great that even if they were to agree to a stipulation that sex discrimination did not occur in other employment decisions, their evidence as to the employment decisions specifically at issue would still prove that, more likely than not, unlawful discrimination occurred.

When all the evidence raising and rebutting the inference of discrimination is statistical, according the proper deference to each legal principle is a delicate task indeed. If Title VII plaintiffs are able to muster only the most marginal inference of discrimination in only one type of job decision (e.g., the underselection of women in one promotional class measures only 1.98 standard deviations), then an inference of discrimination may be undercut by the fact that women are demonstrably not underselected in other similar job decisions. But even here courts must be wary. Evidence that the underselection of women in another similar job decision measures just below the 1.96 threshold, while not sufficient to prove discrimination, is not compelling evidence that the employer did not discriminate in this other employment decision.

Thus, when plaintiffs in a Title VII case introduce statistical evidence of an extreme disparity in the selection rates for men and women for a certain type of job, the fact that these plaintiffs have insufficient evidence to establish an inference of discrimination regarding other employment decisions should not block an inference of discrimination on the specific type of employment decision at issue. For example, if Title VII plaintiffs present evidence that the underselection of women for a particular type of job assignment measures above 3.0 standard deviations, this evidence necessarily raises an inference of discrimination in these assignments regardless of the statistical evidence concerning other assignments. The likelihood that this disparity in the selection rate for men and women is merely a random deviation in a selection process that treated men and women equally is simply too low (1-in-500 using a two-tailed approach) for statistical evidence regarding other assignment decisions to rebut this evidence. In these circumstances, the Title VII defendant must present evidence directly relating to the type of assignment at issue to explain the evident disparity in a legitimate, nondiscriminatory fashion. For a district court to reject plaintiffs’ claim of discrimination in such a case on the grounds that plaintiffs failed to raise an inference of discrimination in other job assignments would effectively amount to a requirement that plaintiffs prove discrimination in employment decisions other than those specifically at issue. And, as we have said, such a requirement would directly conflict with the express provisions of 42 U.S.C. § 2000e-16.

C. Rebutting the Inference of Disparate Treatment

As we have discussed, under Title VII courts will initially infer that a disparity between men and women in selection rates for a particular job or job assignment results from unlawful discrimination if the disparity is large enough: i.e., measures at least 1.96 standard deviations. But defendants in Title VII cases must be offered an opportunity to rebut this inference by showing that the disparity, albeit nonrandom in cause, resulted from some legitimate, nondiscriminatory factor. Similarly, defendants must be allowed to rebut the inference of discrimination by, alternatively, challenging the statistical calculations upon which the inference of discrimination is based. For example, the statistics may rely on faulty data, flawed computations, or improper methodologies. A recent Supreme Court opinion provides courts with some guidance on how to treat attempts to attack an inference of discrimination based on statistical evidence alone. See Bazemore v. Friday, — U.S.-, 106 S.Ct. 3000, 92 L.Ed.2d 315 (1986).

In Bazemore, the United States District Court for the Eastern District of North Carolina was presented with statistical evidence that black employees of the North Carolina Agricultural Extension Service received substantially lower salaries than white employees working in the same job positions. The District Court determined that “the statistical evidence of plaintiffs standing alone and without further explanation probably suffices to make out a prima facie showing of discrimination in salaries.” Civil Action No. 2879, Mem. Op. at 47 (August 22, 1982). The defendants in Bazemore, however, argued that plaintiffs’ statistics failed to account for several factors, any of which would provide a legitimate, nondiscriminatory explanation for the salary disparities. Id. at 48. The District Court agreed with the defendants, holding that because defendants had demonstrated that these other factors might have caused the salary disparities, defendants successfully rebutted plaintiffs’ inference of disparate treatment:

Having thoroughly considered all of the evidence bearing on the salary issue and the contentions of the parties based thereon, the court has concluded that if it be assumed that plaintiffs made out a prima facie case on this issue, it has only been by virtue of the plaintiffs’ statistical evidence ...; that because of their failure to include many of the vital factors necessary to be considered in fixing salaries the probative force of these statistics has been so substantially undermined that they cannot sustain a finding of purposeful discrimination in salaries ...; that the defendants have not only “articulated” plausible reasons for the seeming salary disparities, but have satisfied the court of the validity of their explanations.... It follows that plaintiffs have failed to establish by a preponderance of the evidence that the Extension Service has discriminated against black employees in the matter of salaries.

Id. at 54-55 (citation and footnotes omitted).

The Fourth Circuit affirmed this determination by the District Court in Bazemore. See 751 F.2d 662 (1984). The appellate court referred specifically to two flaws in the plaintiffs' statistics as grounds on which the District Court could legitimately rely in ruling for the defendant. “In the first place,” the Fourth Circuit stated, the plaintiffs’ statistics “contained salary figures which reflect the effect of pre-Act discrimination, a consideration not actionable under Title VII but permissible [only] to show the general background of the case, or intent, or to support an inference that such discrimination continued.” 751 F.2d at 672 (footnote omitted). Second, the appellate court noted that plaintiffs’ statistical study of salaries did not take into account “across-the-board and percentage pay increases which varied from county to county.” Id. The court stated that “[t]he across-the-board and percentage pay increases granted by the various counties in varying amounts, as well as simply paying higher salaries, are bound to have an effect on the salaries of the agents in the various counties.” Thus, the appellate court held that “the district court was not required to accept [the plaintiffs’ statistics] as proof [of discrimination] by a preponderance of the evidence.” Id. The court went on to say that “appropriate” statistics “should ... include all measurable variables thought to have an effect on salary level.” Id.

The Supreme Court reversed. In a unanimous opinion for the Court, Justice Brennan responded to the Fourth Circuit’s “plainly incorrect” approach to statistical evidence:

Importantly, it is clear that a [statistical] analysis that includes less than “all measurable variables” may serve to prove a plaintiff's case. A plaintiff in a Title VII suit need not prove discrimination with scientific certainty; rather his or her burden is to prove discrimination by a preponderance of the evidence.

106 S.Ct. at 3009. Thus, imperfections in the data on which the analysis depends, or the omission of possible explanatory factors from a plaintiff’s statistical study, is not necessarily fatal to an inference of discrimination. “While the omission of variables from a regression analysis may render the analysis less probative than it otherwise might be,” the Justices held, “as long as the court may fairly conclude, in light of all the evidence, that it is more likely than not that impermissible discrimination exists, the plaintiff is entitled to prevail.” Id.

Elsewhere in the opinion, Justice Brennan makes plain that the determination by the District Court whether discrimination exists or not “is subject to the clearly erroneous standard of appellate review.” Id. at 3008. While the Supreme Court remanded the case to the Fourth Circuit to definitely determine whether “based on the entire evidence in the record,” the District Court’s decision had been clearly erroneous, the Justices did declare, “we think that consideration of the evidence makes a strong case for finding the District Court clearly erroneous.” Id. at 3010-11 (footnote omitted). Rather than viewing the inclusion of “pre-Act” salaries in the statistical study as rendering the study fatally flawed, the Supreme Court stated that “evidence of pre-Act discrimination is quite probative.” 106 S.Ct. at 3010 n. 13. Similarly, the Supreme Court rejected the assumption made by both the District Court and the Fourth Circuit that county-to-county variations in certain pay increases undermined plaintiffs' statistical conclusions: “Absent a disproportionate concentration of blacks in such counties, it is difficult, if not impossible, to understand how the fact that some counties contribute less to salaries than others could explain disparities between black and white salaries.” Id. at 3010.

Thus, Bazemore instructs lower courts to be cautious about dismissing plaintiffs’ statistical studies as not probative simply because defendant offers some nondiscriminatory explanation for the disparities shown. Implicit in the Bazemore holding is the principle that a mere conjecture or assertion on the defendant’s part that some missing factor would explain the existing disparities between men and women generally cannot defeat the inference of discrimination created by plaintiffs’ statistics. To be sure, as the Supreme Court acknowledged in Bazemore, there may be a few instances in which the relevance of a factor to the selection process is so obvious that the defendants, by merely pointing out its omission, can defeat the inference of discrimination created by the plaintiffs’ statistics. See 106 S.Ct. at 3009 n. 10. The logic of Bazemore, however, dictates that in most cases a defendant cannot rebut statistical evidence by mere conjectures or assertions, without introducing evidence to support the contention that the missing factor can explain the disparities as a product of a legitimate, nondiscriminatory selection criterion.

This court, even before Bazemore, had explicitly endorsed the same principle, most recently in a situation where the government attempted to rebut the inference of discrimination arising from evidence that blacks in the Drug Enforcement Agency were paid less and promoted less rapidly than whites. The government argued that blacks were less likely than whites to have an extra year of “specialized experience” over and above minimal qualifications. We rejected the argument because the DEA failed to introduce any evidence to substantiate its assertion:

Since DEA has presented no admissible evidence that black agents are more likely than white agents to lack a second year of requisite experience, plaintiffs’ failure to account for this variable does not dilute the force of their statistical analysis; ... absent any reason to conclude that the omitted factor correlates with race, the omission of this variable will not affect the validity of the race coefficient in the plaintiffs’ regression analysis.

Segar, 738 F.2d at 1277. We think the lessons of both Bazemore and Segar apply to this case.
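The point common to Bazemore and Segar, that an omitted factor matters only if it is unevenly distributed between the groups being compared, can be illustrated with a small sketch. Every figure below is hypothetical and invented for illustration; nothing in it is drawn from the record in either case. Salaries are built from a base amount, a fixed gap between two groups, and a bonus for an omitted factor; the raw gap in group means is then what an analysis ignoring that factor would report.

```python
from statistics import mean

def build_salaries(groups, factor, true_gap=-2.0, factor_effect=3.0, base=10.0):
    """Hypothetical salaries: base pay, plus a gap for group 1, plus a bonus for the factor."""
    return [base + true_gap * g + factor_effect * f for g, f in zip(groups, factor)]

def raw_gap(groups, salaries):
    """Mean salary of group 1 minus mean salary of group 0, ignoring the omitted factor."""
    group1 = [s for g, s in zip(groups, salaries) if g == 1]
    group0 = [s for g, s in zip(groups, salaries) if g == 0]
    return mean(group1) - mean(group0)

groups = [0, 0, 0, 0, 1, 1, 1, 1]
balanced = [0, 1, 0, 1, 0, 1, 0, 1]      # omitted factor equally common in both groups
concentrated = [1, 1, 1, 0, 0, 0, 0, 1]  # omitted factor concentrated in group 0

gap_balanced = raw_gap(groups, build_salaries(groups, balanced))
gap_concentrated = raw_gap(groups, build_salaries(groups, concentrated))
print(gap_balanced, gap_concentrated)
```

When the omitted factor is spread evenly, the raw gap matches the gap built into the data; only when the factor is concentrated in one group does its omission distort the comparison. That is why a bare assertion that some factor was left out, without evidence of how it is distributed, does not by itself undermine the analysis.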

IV. A Review of the Disparate Treatment Claims in This Case

Having discussed the applicable legal principles, we now address the specific disparate treatment claims at issue in this case. Supreme Court precedent has made plain the appropriate standard for reviewing a district court’s determination that employment decisions were not the product of an unlawful discriminatory animus. We can reverse this factual finding only if it is clearly erroneous in light of all the evidence in the record or if it rests on legal error. See Bazemore v. Friday, — U.S. -, 106 S.Ct. 3000, 92 L.Ed.2d 315 (1986); Anderson v. City of Bessemer City, 470 U.S. 564, 105 S.Ct. 1504, 84 L.Ed.2d 518 (1985); Pullman-Standard v. Swint, 456 U.S. 273, 102 S.Ct. 1781, 72 L.Ed.2d 66 (1982).

A. Promotions and Evaluations

The Secretary of State argues that appellants’ claim of “class-wide promotion discrimination lie[s] at the heart of this case.” Appellee’s Brief at 58. We agree.

Appellants claim that the State Department discriminated against women in promoting FSOs from class 5 to class 4 from 1976 to 1983. According to the government's own evidence, fewer women than expected were actually promoted to class 4 during that time period, given the number of promotion-eligible women in class 5. The government’s own statistical analysis, whose methodology the District Court found to be more accurate than appellants’, concluded that the discrepancy between the actual and expected number of women promoted measured 1.76 standard deviations. See 616 F.Supp. at 1557; Defendant’s Exhibit 8A at 14 (Table 1, Model 2). As the District Court noted, this measurement means that the probability of an underpromotion of women this large or larger (a one-tailed inquiry) occurring randomly measures slightly less than 4%. 616 F.Supp. at 1557. As we have discussed, under a one-tailed test this number meets the 5% level set forth in Segar. But the corresponding probability of a random deviation from the expected number of women, either favoring or disfavoring women (a two-tailed inquiry), with a magnitude this large or larger is slightly less than 8%. See Defendant’s Exhibit 8A at 14 (Table 1, Model 2). Thus under a two-tailed test, this number fails to meet the 5% level.
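For readers unfamiliar with how a gap between actual and expected selections is converted into standard deviations, the following sketch shows one common approach. It assumes a simple binomial model, which is not necessarily the model used in Defendant’s Exhibit 8A, and the counts (400 promotions, a 20% share of women in the eligible pool, 66 women promoted) are hypothetical figures chosen for round arithmetic, not numbers from the record.

```python
import math

def disparity_in_sd(women_selected: int, total_selected: int, women_share: float) -> float:
    """Actual minus expected number of women selected, in binomial standard deviations."""
    expected = total_selected * women_share
    sd = math.sqrt(total_selected * women_share * (1.0 - women_share))
    return (women_selected - expected) / sd

def meets_five_percent(z: float, two_tailed: bool) -> bool:
    """Whether a disparity of z standard deviations falls below the 5% level."""
    one_tail = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    p = 2.0 * one_tail if two_tailed else one_tail
    return p < 0.05

# Hypothetical counts: 80 women expected, 66 selected, standard deviation of 8.
z = disparity_in_sd(women_selected=66, total_selected=400, women_share=0.20)
print(z, meets_five_percent(z, two_tailed=False), meets_five_percent(z, two_tailed=True))
```

With these hypothetical counts the shortfall is 1.75 standard deviations, which satisfies the 5% level on a one-tailed test but not on a two-tailed test, the same pattern as the 1.76 figure discussed above.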

For the reasons set forth in Part III. A., we do not think this evidence alone is sufficient to prove an intent to discriminate against women. Appellants at trial, however, relied on additional evidence to prove a discriminatory motive. Appellants first point to evidence in the record of a general prejudicial attitude against women within the Foreign Service during this time period and argue that this evidence supports the proposition that the discrepancy between the actual and expected number of women promoted to class 4 results from a prejudicial attitude against women that violates Title VII.

This evidence includes statements made upon cross-examination by the defense witness, Benjamin Reid, who was Undersecretary of State for Management from 1977-1981. Reid testified that the Foreign Service, as a result of traditionally being “white, male, and Ivy League,” had “set ways of doing things” and that although during his tenure the Foreign Service “had come a long way,” it nevertheless “still had a long way to go” at the time he left in correcting these biased attitudes. Tr. at 3279-80. Similarly, the appellants introduced into evidence a report written in 1977 by a committee within the State Department asserting that “both attitudinal resistance to equal employment opportunity and discriminatory behavior are still widespread in the Department.” Plaintiffs’ Exhibit 29 at 6. The appellants also introduced into evidence a report published in 1984 by the Women’s Research and Education Institute of the Congressional Caucus for Women’s Issues, which stated that “ ‘what some identify as traditional elitist attitudes have [worked] to limit severely employment opportunities for women and minorities [in the Foreign Service].’ ” Plaintiffs’ Exhibit 88 at 10 (quoting a 1981 report prepared by the U.S. Commission on Civil Rights).

More specifically, as proof that the underpromotion of women FSOs from class 5 to class 4 resulted from a prejudicial attitude against women, the appellants relied upon evidence that the State Department believed that women FSOs had less potential for advancement than men FSOs even though men and women FSOs performed their duties with the same skill. A random sample of the evaluation reports for over 400 FSOs in classes 5 and 6 revealed that although “there was no significant difference in the performance ratings of men and women, ... the disparity between men and women [in their potential ratings] measured 2.49 standard deviations.” 616 F.Supp. at 1549 (¶ 62) (emphasis added). As the District Court noted, this measurement means the likelihood of women being randomly underrated to this degree or greater (a one-tailed inquiry) is only about 7 times in 1,000. Id. Correspondingly, the likelihood of women randomly being either underrated or overrated to this degree or greater is 14 times in 1,000. Either way the odds are very small indeed.

The relevance of this evidence to whether the underpromotion of women from class 5 to class 4 resulted from a discriminatory attitude against women is obvious. As the State Department itself asserted and the District Court expressly found, competitive promotion decisions in the Foreign Service were based primarily on an “assessment of the officer’s potential to perform at the next higher level.” 616 F.Supp. at 1555 (¶ 114); Defendant’s Post-Trial Brief at 92. If a biased attitude towards women was causing the State Department to underrate the potential of class 5 women FSOs in their evaluation reports, even though these women were on average performing equally as well as their male counterparts, one might well expect that this same biased attitude would be at work in the promotion decision itself.

The District Court, however, never considered the evidence of a discriminatory attitude about the potential of women derived from the evaluations in deciding whether appellants had proved, by a preponderance of all the evidence, discriminatory intent in the decisions pertaining to promotions from class 5 to class 4. Rather, the District Court offered the following grounds for rejecting the evidence relating to the evaluation reports:

In view of the finding that female FSO’s are promoted equally with male and given the same job opportunities, the Court finds that plaintiffs’ analysis of the disparity on potential ratings does not establish that the [evaluation reports] of female FSO’s are discriminatory in any fashion.

616 F.Supp. at 1560 (¶ 25).

In our view this reasoning puts the cart before the horse. The District Court cannot determine that the State Department did not discriminate against women in promotions from class 5 to class 4 until it considers whether or not all the evidence demonstrates a biased attitude towards women and their capabilities. It cannot reject relevant evidence of discriminatory intent on the basis of a conclusion that no discrimination occurred without reference to the relevant evidence. To rule otherwise would convert Title VII into a Catch-22: in order to establish a promotional disparate treatment claim, a plaintiff must prove discriminatory intent; but she cannot offer proof of discriminatory intent in the form of disparate ratings between men and women as to their potential unless she has already established a promotional disparate treatment claim. We hold that appellants were entitled, as a matter of law, to have the District Court consider evidence in the ratings of a discriminatory attitude about the potential of women when evaluating appellants’ disparate treatment claim concerning promotions from class 5 to class 4. Conversely, it was an error of law for the District Court to “reason” backwards and dismiss appellants’ claim that the disparity in potential ratings was a violation of Title VII on the grounds that the court had already determined that the State Department did not discriminate against women in promoting FSOs from class 5 to class 4.

Thus, we reverse both the District Court’s decision that the State Department did not discriminate against women in evaluating the potential of FSOs and its decision that there was no discrimination shown in promoting FSOs from class 5 to class 4. Following the command of Pullman-Standard v. Swint, 456 U.S. 273, 291-92, 102 S.Ct. 1781, 1791-92, 72 L.Ed.2d 66 (1982), we remand the case for further factfinding where the record permits more than one resolution of a factual issue. With respect to the question of whether the State Department discriminatorily underpromoted women from class 5 to class 4 from 1976 to 1983, we cannot say that the totality of the evidence compels an affirmative or a negative answer.

Upon remand the District Court must consider whether, on the basis of the existing record, the evidence pertaining to the disparity in potential ratings, together with the nonstatistical evidence of a generally hostile attitude against women in the Foreign Service and the statistical evidence of the disparity in class 5 to class 4 promotions, is sufficient proof that, more likely than not, the underpromotion of women from class 5 to class 4 was based on discrimination. The evidence in the record cutting the other way is the failure of the appellants’ statistical evidence to make out even a prima facie case that the State Department discriminated against women at other grades of the promotional process. Of course, as we have pointed out, appellants need not prove discrimination in these other promotion decisions in order to prevail in their disparate treatment claim concerning promotions from class 5 to class 4. Indeed, it is quite plausible that a discriminatory attitude about women and their potential for further advancement might affect promotions only at a mid-level step— like the transition from class 5 to class 4. First of all, as we discussed in Part I. A, supra, the promotions in the junior ranks (classes 7 and 8) were noncompetitive. Second, the Secretary’s own statistical analysis showed that fewer women than one would expect were actually promoted from class 6 to class 5, although his study indicated that this disparity was just as likely to be a random deviation in a nondiscriminatory system as a symptom of discrimination. See Defendant’s Exhibit 8A, Table 1, Model 2. Finally, one might surmise that those women who survive a discriminatory bias in critical mid-level promotion decisions have demonstrated such superior skill and aptitude that they would encounter less resistance to advancement in upper level positions. 
Despite all these considerations, the District Court is entitled to determine for itself on remand whether the government’s evidence of nondiscrimination at other promotional levels is sufficient to outweigh the appellants’ evidence, which as we have said includes three distinct elements: the disparity itself measuring 1.76 standard deviations, testimony and documented evidence of a general bias against women in the State Department, and the specific evidence as to discriminatory attitudes about the potential of women FSOs for future advancement, revealed in the evaluation reports of class 5 and 6 FSOs.

With respect to the evaluation reports, we note that the District Court committed a further error of law. In discussing the appellants’ statistical analysis of the potential ratings for men and women, the court stated that:

The methodology utilized by plaintiffs’ expert ... fails to allow for one vital characteristic, that being female FSO’s have less time in class than males. This inexperience would account for the lower potential ratings when compared with males who have more time in class.... While the actual performance of males and females may not be reflected by this inexperience, a subjective judgment on the potential capacity of an FSO may certainly be affected by such inexperience resulting from less time in class.

616 F.Supp. at 1549 (¶ 65).

There was, in fact, no evidence whatsoever introduced at trial on which the District Court could rely to base its assumption that despite equivalence in actual performance officers with less experience would be viewed as having lower potential than those with more experience. See Appellants’ Brief at 42. Moreover, the District Court’s assumption is counterintuitive: if officers with less experience managed to perform at the same level as officers with more experience, one would expect that the less experienced officers would be seen as quick learners with more, not less, potential. In any event, the District Court was not entitled to rely on mere conjecture to undercut the probative force of appellants’ statistics. See, supra, Part III.C. On remand, in deciding whether appellants’ evidence concerning the evaluation reports demonstrated a bias against women, the District Court shall not rely upon any unsupported hypotheses, such as the relatively lower number of years experience of women in grade.

We note further that, even if the rating evidence proves insufficient to prove a discriminatory motive in promotions, appellants are entitled, as a matter of law, to bring an independent claim of disparate treatment with respect to the evaluation reports themselves. As we have seen, the Foreign Service Act of 1980 specifically includes any “evaluation” as a “personnel action” that must be free from discrimination. In light of this express statutory language, we cannot but read the words “all personnel actions” in 42 U.S.C. § 2000e-16 as encompassing such a claim. Thus, under Title VII, the State Department may not discriminate against women in their evaluations regardless of any demonstrated effect the evaluations ultimately can be shown to have on promotion opportunities. We need not now consider what remedy might be appropriate for discriminatory evaluations; the parties bifurcated the issues of liability and remedies.

To recapitulate, insofar as the District Court required appellants to prove discrimination in promotions in order to prove discrimination in evaluation reports, the District Court erred as a matter of law in two significant respects. First, the District Court unreasonably rejected a major portion of appellants’ evidence that the promotion decisions at issue were infected with a discriminatory motive. Second, the District Court deprived appellants of their right under Title VII to bring a disparate treatment claim as to evaluations, regardless of how those evaluations might affect other employment decisions. Consequently, we remand to the District Court both the issue of whether the State Department discriminated against women in its decisions concerning promotions from class 5 to class 4 and the issue of whether it discriminated in its evaluations of the future “potential” of women FSOs.

B. Assignments

Appellants brought disparate treatment claims with respect to various types of Foreign Service assignment decisions. We consider first appellants’ claim that the State Department discriminated against women in “out-of-cone” assignments by overassigning women to positions in the consular cone and by underassigning women to the “prestigious” program direction cone. 616 F.Supp. at 1553-54.

1. Out-of-cone assignments

The District Court found that appellants’ evidence disclosed the following facts about out-of-cone assignments to the consular cone:

a) Between 1976 and 1983, 40.4 percent of all out-of-cone assignments received by women in the political cone were to consular positions, while only 15.5 percent of the out-of-cone assignments received by men in the political cone were to consular positions. This difference [measures 5.84 standard deviations and therefore the probability of a disparity of this magnitude or greater (either overselecting or underselecting women) resulting by chance is less than one in one hundred million].
b) For the same time period, the plaintiffs’ statistics show 22.9 percent of all out-of-cone assignments received by women in the economic cone were to consular positions, while only 11.6 percent of all out-of-cone assignments received by men in that cone were to consular positions. This difference measures 2.68 standard deviations [which means the probability of women being randomly overassigned or underassigned to this degree or greater is 0.74 percent].
c) During the same time period, plaintiffs’ analysis indicated that 50.8 percent of all out-of-cone assignments received by women in the administrative cone were to the consular cone while only 33.2 percent of all out-of-cone assignments received by men were to the consular cone. This difference measures 2.62 standard deviations [which means that the probability of a disparity of this magnitude or greater resulting by chance is 0.88 percent].

616 F.Supp. at 1553-54 (¶ 101). Appellants contended that these extreme disparities resulted from the prevalent belief in the Foreign Service that women were especially suited for consular work. The government, in contrast, argued that the disparities resulted from the fact that women on the whole preferred consular assignments, and the Foreign Service merely honored these preferences. The District Court accepted the government's explanation of the disparities:

The [plaintiffs’ statistical] analysis does not account for the unique feature of the FSO’s bidding, or requesting, their assignments pursuant to the Open Assignment Policy. A more accurate analysis would measure the requests by the FSO’s, as the observations made by plaintiffs’ expert may result as much from the function of requesting different assignments as the assignment of FSO’s.

Id. at 1554 (¶ 101). On this basis, the District Court found appellants' statistical evidence “unconvincing” and concluded that appellants had failed to prove sex discrimination in out-of-cone assignments to the consular cone. Id. at 1560 (¶ 22).

It is true, as the District Court pointed out, that assignments are made in part pursuant to the bid lists submitted by members of the Foreign Service. But as the District Court acknowledged, bid lists were only one element of the assignment process, and the selection boards based their assignment decisions in larger measure on the perceived needs of the bureaus to which the assignments were made. See, supra, Part I.A. Moreover, the Secretary submitted no evidence showing that more women than men preferred out-of-cone assignments to the consular cone. Appellants’ Brief at 55. The Secretary, on appeal, concedes as much.

The Secretary, however, would have us affirm the District Court’s decision on the grounds that “an analysis which ignores ‘preference’ ... is simply not probative on this issue.” Appellee’s Brief at 55. This argument, however, is precluded by the Supreme Court’s Bazemore decision. According to Bazemore, appellants’ statistical evidence concerning out-of-cone assignments to the consular cone is probative of discrimination despite the fact that it did not include individual preferences as a possible explanatory factor. There was no basis in the record on which the District Court could assume that women indicated preferences for consular work more frequently than men did. Consequently, the District Court contravened the dictates of Bazemore by refusing to credit the appellants’ statistical evidence. Under Bazemore and Segar, the District Court is not entitled to dismiss plaintiffs’ statistical evidence on mere conjecture.

As a result of this legal error, “unless the record permits only one resolution of the factual issue,” we must remand the issue of out-of-cone assignments to the District Court. Pullman-Standard, 456 U.S. at 292, 102 S.Ct. at 1792. Given the strength of appellants’ statistics on this issue, and given the fact that the Secretary offered only an unsupported hypothesis to rebut the inference of discrimination generated by the statistics, we might legitimately conclude that the evidence permits only one answer to the question whether the overassignment of women to the consular cone resulted from an unlawful prejudice towards women. Nevertheless, because we have already determined that the District Court must conduct further factfinding on other issues in this case, and ever mindful of the Supreme Court’s injunction that appellate courts not usurp the fact-finding function of district courts, we conclude the better course is to allow the District Court to reconsider, on the basis of the existing record, its determination of this issue in light of Bazemore.

With respect to out-of-cone assignments to the program direction cone, the District Court found that appellants’ evidence showed that “38.5 percent of all out-of-cone assignments received by men in the political cone were to senior program direction cone positions, while only 14.6 percent of the out-of-cone assignments received by women in the political cone were to program direction cone positions.” 616 F.Supp. at 1554 (¶ 105a). The District Court further found that this underselection of women measured 4.46 standard deviations, id., which means that either an underselection or an overselection of women of this magnitude or larger would occur randomly less than one time in 100,000.

Appellants’ evidence also demonstrated that “12.4 percent of the out-of-cone assignments received by men in the consular cone were to program direction positions, while only 6.6 percent of the out-of-cone assignments received by women in the consular cone were to program direction positions.” 616 F.Supp. at 1554 (¶ 105b). This underselection of women measured 2.23 standard deviations, id., which means that the probability of women being randomly either underselected or overselected to this degree or greater is about 2.6%.

The appellants argued that this underassignment of women to program direction cone positions from the political and consular cones resulted from the discriminatory belief within the Foreign Service that women were unsuitable for prestigious leadership-track positions. It is unclear from the District Court’s opinion why the District Court rejected this argument, and found, to the contrary, that the State Department did not discriminate against women in assignments from the political and consular cones to the program direction cone. The District Court did observe that “Defendant’s expert produced an analysis indicating that, as to those men and women who did attain transfer to the Program Direction cone, there was no disparity in the amount of time spent in class before attaining the transfer.” 616 F.Supp. at 1554 (¶ 106). Although the District Court found this evidence to “indicate[] that females are not discriminated against in their attainment of conversion to the Program Direction cone,” it concluded, accurately, that this evidence could not be “dispositive” because “it measures the time in class and service of those who actually attain the Program Direction cone, and plaintiffs complain of a disparity in the number of men and women who are given out-of-cone assignments to positions which carry the program direction skill code and would thus qualify them for transfer to the Program Direction cone itself.” Id. (¶ 107). The issue was not whether those women who were able to transfer to the program direction cone did so with the same speed as their male counterparts; rather, the issue was whether proportionally fewer women than men were able to transfer to program direction positions at all.

Despite the District Court’s concession that appellee’s rebuttal evidence could not be “dispositive,” it offered no other basis for rejecting appellants’ claim of discrimination in out-of-cone assignments to the program direction cone positions. Specifically, it did not mention individual preference as a possible nondiscriminatory explanation for the disparity between men and women in their selection rates for these positions, probably because there was absolutely no evidence in the record indicating that women preferred assignment to the “prestigious” program direction cone less than men.

Thus, we conclude that the District Court failed to articulate any sufficient grounds for rejecting appellants’ proof of discrimination in out-of-cone assignments to the program direction cone. The sole basis offered by the government was properly found by the court to be insufficient. It cited no other basis in the record for its decision, and we can find none. Therefore, we reverse and remand the issue for reconsideration, on the basis of the existing record. The inference of discrimination raised by the significant disparities between men and women given out-of-cone assignments to these “prestigious” positions is thus far unrebutted. Unless the District Court can find a valid basis supported in the record for rejecting the inference of discrimination, it must rule in favor of the appellants on this claim.

2. Stretch and Downstretch Assignments

The appellants also claim that the State Department discriminated against women in “stretch” and “down-stretch” assignments. The evidence that appellants introduced at trial in support of this claim included the following statistics. First, between 1976 and 1981, “32.2% of the women in Class 4 were given downstretch assignments, while only 17.6% of the men in that class were given down-stretch assignments.” 616 F.Supp. at 1552 (¶ 92). As the District Court noted, this disparity measures 6.72 standard deviations, id., and the chance of women being randomly overassigned or underassigned to this degree or greater is less than one in ten billion. See D.B. Owens, Handbook of Statistical Tables 13 (1962) (Plaintiffs’ Exhibit 168).

Second, “20.8% of the women in Class 5 received down-stretch assignments, while only 14.2% of the men received them. This difference measures 4.04 standard deviations.” 616 F.Supp. at 1552-53 (¶ 92). The probability of a random overselection or underselection of women of this magnitude or larger is about 1 in 20,000. See Plaintiffs’ Exhibit 168 at 13.

Third, 19.9% of the women in class 7 received down-stretch assignments, whereas only 14.3% of the men in class 7 did. This disparity measured 2.39 standard deviations, which corresponds to a (two-tailed) probability value of about 1.7%. See Plaintiffs’ Exhibit 57; Elementary Statistics, supra n. 8, at 479.

Fourth, with respect to stretch assignments, only 19.1% of women in class 4 received stretches, whereas 28.4% of the men in class 4 did. This underselection of women measured 3.74 standard deviations, which means that the probability of either an underselection or overselection of women of this magnitude or larger resulting from chance is about one in 5,000. See Plaintiffs’ Exhibits 57, 168.

Fifth, only 31.6% of women in class 5 received stretch assignments, whereas 37.7% of the men in class 5 did. This disparity measured 2.79 standard deviations, which corresponds to a (two-tailed) probability value of 0.52%. See Plaintiffs' Exhibit 57; Elementary Statistics, supra, n. 8, at 479.
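The probability values quoted for these five disparities can be approximately reproduced from the standard normal distribution. The following Python sketch is an illustration only and forms no part of the record. It assumes, as the two-tailed figures above imply, that each disparity is treated as a standard normal deviate; the function name `two_tailed_p` is hypothetical.

```python
import math

def two_tailed_p(z: float) -> float:
    """Probability that a standard normal deviate lies at least |z| from zero."""
    # For a standard normal Z, P(|Z| >= z) = erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))

# Disparities, in standard deviations, as quoted in the text above.
disparities = {
    "class 4 downstretch": 6.72,
    "class 5 downstretch": 4.04,
    "class 7 downstretch": 2.39,
    "class 4 stretch": 3.74,
    "class 5 stretch": 2.79,
}

for label, z in disparities.items():
    p = two_tailed_p(z)
    print(f"{label}: {z} s.d. -> p = {p:.2e} (about 1 in {1 / p:,.0f})")
```

Because the calculation is two-tailed, each result is the chance of a deviation of that size in either direction, matching the "overselection or underselection" phrasing used above.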

The appellants argued that this overassignment of women to downstretch positions and underassignment of women to stretch positions resulted from unlawful sexist attitudes in the Foreign Service. As additional evidence to support their contention, the appellants pointed to a 1977 report prepared within the State Department, which stated that stretch assignments “are not commonly given to those in EEO categories,” meaning women and minorities. Plaintiffs’ Exhibit 29 at 6. The District Court nonetheless rejected the appellants’ claim, offering several reasons for its decision. These reasons, however, do not support the District Court’s decision. All but one are erroneous as a matter of law, and the other is a clearly erroneous finding of fact.

First, the District Court stated that appellants had failed to show that the overassignment of women to downstretch positions and underassignment of women to stretch positions adversely affected the opportunities of these women for promotion. See 616 F.Supp. at 1553 (¶ 94). Once again, we repeat that appellants are entitled under 42 U.S.C. § 2000e-16 to bring a claim of sex discrimination with respect to “all personnel actions,” including any category of assignments, regardless of how these assignments relate to other personnel actions, like promotion decisions. By relying on this determination, the District Court contravened the express provisions of Title VII.

Second, the District Court concluded that appellants’ statistical evidence was “of little value in persuading that discrimination existed in assigning stretch and down-stretches” because, in part:

Plaintiffs’ expert, by analyzing the situation class by class, appears to ignore cross-class competition for any given assignment. For example, an officer vying for a Class 4 stretch position may compete against officers from at least Classes 6, 5, 4, and 3.

616 F.Supp. at 1553 (¶¶ 96, 98).

While it is absolutely true that officers in any given class will be competing against officers from other classes, it is also absolutely irrelevant to the point of appellants’ evidence. Appellants are trying to demonstrate, for example, that women in class 5 are less likely than men in class 5 to stretch into assignments labelled class 4 or higher, and that this disparity results from a widespread prejudice within the Foreign Service that women are less able than men despite their equivalent rank. Given this purpose, it is entirely irrelevant that officers from other classes may compete with men and women in class 5 for those assignments that are stretches for officers in class 5. Appellants are not interested in comparing how well the men and women in class 5 compete against officers in another class. They are only interested, and properly so, in how similarly situated men and women compete against each other.

It was an error of law for the District Court to reject the probative value of appellants’ statistical evidence because of this irrelevant factor of “cross-class competition.” Certainly, the Supreme Court’s decision in Bazemore stands for the proposition that the “missing factor” identified by the District Court as a reason for discounting statistical proof of disparate treatment must at least be relevant to the point of the statistics. In Bazemore itself, the Supreme Court noted that “certain conclusions of the District Court are inexplicable in light of the record.” 106 S.Ct. 3011 n. 15. For instance,

the District Court complained about the inclusion of the County Chairman in the petitioners’ regression analysis, fearing that the fact that they were disproportionately white would skew the salary statistics to show whites earning more than blacks. Yet, because the regressions controlled for job title, adding County Chairman as a variable in the regression would simply mean that the salaries of white County Chairmen would be compared with those of nonwhite County Chairmen.

Id. In this case, the District Court’s reliance on the omission of “cross-class competition” as a basis for rejecting appellants’ evidence of discrimination in stretch and downstretch assignments is similarly “inexplicable.”

Third, the District Court found appellants’ statistics concerning stretch and downstretches to be “flawed” in another respect. The data from which the statistical analysis was made was tabulated in terms of the total number of years each FSO served in a stretch or a downstretch assignment rather than in terms of the number of such assignments. The District Court found that this methodology “does not accurately reflect the number of assignments given out by the Foreign Service.” 616 F.Supp. at 1553 (¶ 95). The appellants contend, however, that the data they used were the only available data, and the Secretary does not dispute this contention. See Appellants’ Brief at 49; Appellee’s Brief at 53. Moreover, the Secretary has introduced no evidence tending to show that the imperfections of the data caused the disparities produced by the statistical analysis. See Appellants’ Brief at 50. Thus, the government once again relies on mere conjecture to rebut appellants’ statistics. Finally, and perhaps most important, the appellants received their data from the State Department’s employment records, and the reason why the data were tabulated in terms of number-of-years rather than number-of-assignments was that the State Department’s employment records were tabulated in this form. Id. In these circumstances, as this court has previously stated, “plaintiffs cannot be legitimately faulted for gaps in their statistical analysis when the information necessary to close those gaps was possessed only by defendant[ ].” Trout v. Lehman, 702 F.2d 1094, 1102 (D.C.Cir.1983) (quoting 517 F.Supp. 873, 883 (D.C.1981)), vacated on other grounds, 465 U.S. 1056, 104 S.Ct. 1404, 79 L.Ed.2d 732 (1984); see also Segar, 738 F.2d at 1276 (“Both the policies underlying Title VII and general principles of evidence suggest that the burden of production of such evidence must rest with the defendant.”). 
Therefore, insofar as the District Court relied on this reason to reject the probative value of appellants’ statistics, we find its decision in conflict with the precedents of this circuit.

Finally, the District Court found that “Plaintiffs’ analysis did not allow for the preference of the individual FSO.” 616 F.Supp. at 1553 (¶ 97). The District Court's reliance on this “preference” argument in the context of stretch and downstretch assignments differs significantly from its role in the context of out-of-cone assignments. To recall, the District Court had no evidence for believing that women more than men would prefer out-of-cone assignments to the consular cone and that this preference — rather than a discriminatory treatment of women — best explained the disparities in out-of-cone assignments. Here, in contrast, there is some evidence that women preferred downstretch assignments more than men did. As the District Court states, the record contains “testimony that down-stretch assignments are requested for various reasons, including the desire to gain an assignment with a spouse who is also a State Department employee.” Id. If this testimony were indeed “extensive,” as the District Court characterized it, we would conclude that the District Court’s decision that the State Department did not discriminate in stretch and downstretches was not clearly erroneous. But we can find in the record only two instances in which a woman FSO subordinated her own career in favor of her husband’s Foreign Service career — and in one of these instances, the witness testified that her decision in this instance was part of an alternating practice she and her husband agreed to of trading-off less desirable assignments. Compare Appellee’s Brief at 53-54 n. 58 with Tr. 876, 1765, 2150. These two (or more accurately, one and a half) isolated instances do not amount to “extensive” testimony. Alone they do not establish a sufficient basis for undermining the probative weight of appellants’ statistics. We must recall that some of the disparities between men and women in downstretch assignments were especially extreme, measuring 6.72 and 4.04 standard deviations. 
Given these kinds of numbers, it takes more than a few isolated examples of individual decisions by women to seek downstretches for the District Court not to conclude that, more likely than not, the disparities resulted from unlawful discrimination. Therefore, from our review of the totality of the evidence presented on the issue of discrimination in stretch and downstretch assignments, we must conclude that the District Court’s finding of no discrimination was clearly erroneous. We reverse the District Court’s decision on this issue of liability and remand for appropriate proceedings on the question of remedies.

3. Deputy Chief of Mission Assignments

Appellants also claim that the State Department discriminated against women in selecting Deputy Chiefs of Mission. The Deputy Chief of Mission (DCM) is the second in command, directly below the Ambassador, at each American embassy. As the District Court found, appellants introduced evidence showing that only “nine women were appointed DCM between 1972 and 1983, out of a total of 586 appointments.” 616 F.Supp. at 1552 (¶ 88). The District Court then noted:

Plaintiffs’ expert calculated that the expected number of women appointed during that period, based on the number of women in the grade levels from which DCM’s were chosen, is 26.8. The difference between the actual and expected number of women measures 3.54 standard deviations.

Id. The probability of a disparity this large or larger, either favoring or disfavoring women for the DCM position, resulting by chance in a selection process that did not differentiate between men and women, is about one in 2,500. Given this extremely low probability, this evidence, standing alone, raises a strong inference of disparate treatment.
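The reported disparity can be roughly checked with a simple binomial model. The sketch below is an illustration under stated assumptions, not a reconstruction of the expert's method, which is not described here. It assumes that each of the 586 appointments independently went to a woman with probability 26.8/586; the variable names are hypothetical.

```python
import math

appointments = 586      # total DCM appointments, 1972-1983, per the District Court
expected_women = 26.8   # expected number of women, per plaintiffs' expert
actual_women = 9        # women actually appointed

# Assumed model: every appointment is an independent draw with the same
# probability p of selecting a woman.
p = expected_women / appointments
std_dev = math.sqrt(appointments * p * (1.0 - p))

# Shortfall of actual from expected, measured in standard deviations.
z = (expected_women - actual_women) / std_dev
two_tailed = math.erfc(z / math.sqrt(2.0))

print(f"shortfall = {z:.2f} standard deviations")
print(f"two-tailed probability = {two_tailed:.5f} (about 1 in {1 / two_tailed:,.0f})")
```

Under these assumptions the shortfall comes out near 3.5 standard deviations, close to the 3.54 reported; the small difference presumably reflects details of the expert's calculation that this simplified model omits.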

The District Court offered several reasons for concluding that the State Department did not discriminate against women in DCM assignments. All of these reasons are erroneous as a matter of law. First, the District Court found this evidence “unconvincing” because appellants were unable to show “statistically significant disparit[ies]” in the selection rates for five other “high visibility positions.” 616 F.Supp. at 1560 (¶ 20). (The other “high visibility” positions were: Deputy Assistant Secretary, Office Director, Country Director, Principal Officer, and Executive Director.)

Once more, we remind that under 42 U.S.C. § 2000e-16 appellants are not required to prove sex discrimination in assignments to six different types of jobs in order to establish discrimination in assignments to a single position. We have, however, also said that evidence of nondiscrimination in some jobs may be probative of whether discrimination occurred in selections for another kind of job. Adherence to both these legal rules may be difficult at times. But in this case it is clear that the District Court contravened the first of these two legal rules. Here, appellants introduced evidence showing that the underselection of women for DCM positions was so extreme that the chance of women being randomly underselected or overselected to this degree or greater was only one in 2,500 times. Not even a stipulation that the State Department did not discriminate against women in assignments to five other kinds of “high visibility” positions could defeat the inference of disparate treatment raised by this evidence. A defendant must produce other evidence directly relating to the job at issue to rebut this inference of discrimination. In this case, the District Court rejected appellants’ strong inference of disparate treatment in part because appellants did not generate an inference of discrimination in five other types of assignments. This was legal error.

Second, the District Court stated:

Plaintiffs’ analysis of the number of women ... in DCM positions failed to adequately consider the bottom-entry nature of the Foreign Service. It failed to allow for the time necessary for the large number of female FSO’s presently in the service to advance to the higher ranks.

616 F.Supp. at 1560 (¶ 19). It is not clear what the District Court meant by this statement. As we have seen, the District Court elsewhere acknowledged that appellants’ statistical analysis was “based on the number of women in the grade levels from which DCM’s were chosen.” Id. at 1552 (¶ 88). Thus, according to the District Court itself, the appellants properly limited their study to the relevant applicant pool and therefore controlled for the fact that not many women in the Foreign Service had reached a position in which they were eligible for appointment as Deputy Chief of Mission. What else, then, could the District Court have meant by saying that appellants “failed to allow for the time necessary for the large number of female FSO’s presently in the Service to advance to the higher ranks”? We can only surmise that the District Court meant that when more women reached these higher ranks, more women would be appointed to DCM positions. But even if that is what the District Court meant, then it once again committed legal error. The fact that in absolute numbers, more women will be appointed to DCM positions is irrelevant to the present discrimination claim at issue. Appellants claim that even after accounting for the small number of women eligible for selection to a DCM position, women have been proportionally underselected when compared to the number of eligible men selected and that this underselection has no legitimate, nondiscriminatory explanation. Appellants are entitled to a consideration of this claim regardless of whether the reason for the currently small number of eligible women is the “bottom-entry nature of the Foreign Service.” It seems as if the District Court lost sight of the relevant legal question under Title VII, and the issue must be remanded for reconsideration in accordance with a proper conception of the law.

Third, the District Court found that “[plaintiffs’] statistical analysis is of little significance in that it encompasses the period 1972 through 1983, while the relevant time period for this case is 1976 to 1983.” 616 F.Supp. at 1552 (¶ 89). This determination is directly contrary to the precise holding of the Bazemore decision. As discussed in Part III. B., the Supreme Court found that the inclusion of pre-Act data in a statistical study did not undercut the probative value of that study. Previously, the Supreme Court has held that evidence of discrimination by the defendant for years that are time-barred is equivalent to evidence of pre-Act discrimination. United Air Lines, Inc. v. Evans, 431 U.S. 553, 558, 97 S.Ct. 1885, 1889, 52 L.Ed.2d 571 (1977). In this case, evidence of discrimination from the years 1972-1976 is not directly actionable because it is time-barred, but its inclusion in appellants’ study certainly cannot render the entire study of “little significance.” See Rossini v. Ogilvy & Mather, Inc., 798 F.2d 590, 604 n. 5 (2d Cir.1986). Moreover, in this case, the State Department does not contend that reviewing the data from 1972-1976 would reveal disparities of a different magnitude. See Appellants’ Brief at 57; Appellee’s Brief at 58 n. 68.

Thus, the three reasons the District Court gave for rejecting appellants’ strong inference of disparate treatment in DCM assignments are inadequate as a matter of law. On appeal, the Secretary suggests an alternative nondiscriminatory explanation for the underselection of women to this position: more women might have been appointed Ambassador instead. Appellee’s Brief at 57. We note that the District Court made no such finding and the only evidence in the record to which the Secretary directs us is a statement by a single witness that perhaps this fact might explain the underselection of women for DCM positions. Tr. at 1766. We think that the proper course under Pullman-Standard is to remand the issue to the District Court for further factfinding, on the basis of the existing record.

C. The Superior Honor Award

The appellants also claim that the State Department discriminated against women in granting the Superior Honor Award to Foreign Service Officers. As the District Court found, appellants presented the following evidence:

4.8% of the award recipients were females, although 10.1% of the Class 1 through 5 FSO's during the time period were females. These results indicate that twice as many women would be expected to receive the Superior Honor Award as actually received it. The difference measures 3.1 standard deviations.

616 F.Supp. at 1548 (¶ 48). The chances are only one in 500 that a deviation of this magnitude or larger, either favoring or disfavoring women, would occur randomly if the process of granting Superior Honor Awards treated men and women equally. Elementary Statistics, supra n. 8, at 481.
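The one-in-500 figure can be checked against the standard normal distribution. The following is only a minimal sketch and is not part of the record: the function name `two_tailed_probability` is hypothetical, and the sketch assumes, as the cited texts do for large samples, that the test statistic is approximately normally distributed.

```python
import math


def two_tailed_probability(z: float) -> float:
    """Probability that a standard normal variable lies at least |z|
    standard deviations from its mean, in either direction."""
    # For a standard normal Z, P(|Z| >= z) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# 3.1 standard deviations is the disparity reported for the award.
p = two_tailed_probability(3.1)
print(f"two-tailed probability: {p:.4f}")   # about 0.0019
print(f"roughly one in {1 / p:.0f}")        # on the order of one in 500
```

The two-tailed form is used because the text speaks of a deviation "either favoring or disfavoring women."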

Once again, the reasons that the District Court gave for rejecting appellants’ discrimination claim are contrary to law. First, the District Court stated that appellants failed to show how “the failure of women to receive the Superior Honor Award affected the opportunity for promotion.” Id. (¶ 49). Appellants, however, are entitled to bring a sex discrimination claim under 42 U.S.C. § 2000e-16 with respect to personnel decisions involving awards regardless of how these decisions affect promotions. As we have seen, the Foreign Service Act of 1980 specifically includes “any ... award of performance pay or special differential” as among the personnel actions that must be free from sex discrimination, and we do not construe “all personnel actions” in 42 U.S.C. § 2000e-16 to have a lesser scope.

Second, the District Court rejected appellants’ claim involving the Superior Honor Award as “unconvincing” because the appellants were unable to produce equivalent evidence with respect to other State Department Honor Awards. But as with the evidence concerning the DCM assignments, appellants’ evidence concerning the Superior Honor Award is sufficiently strong to withstand even a stipulation that the State Department did not discriminate against women in granting other types of Honor Awards. To rebut the inference of discrimination here, the State Department was required to present evidence explaining the extreme disparity between the numbers of men and women receiving the Superior Honor Award.

Third, the District Court discredited appellants’ evidence because the District Court thought that appellants’ statistical “analysis was based on a faulty assumption that all female FSO’s were equally qualified for the Superior Honor Award.” 616 F.Supp. at 1548 (¶ 50). But appellants’ evidence assumes nothing of the sort. The District Court apparently thought that appellants made this “faulty assumption” because, in the court’s own words, appellants made “no showing ... of what portion of female FSO’s were qualified for the Superior Honor Award.” Id. But the statement reveals a fundamental misunderstanding of the role of relevant statistical evidence in a Title VII case. Appellants do not suggest that one FSO is as qualified to receive an award as another. These awards are obviously based on merit and are supposed to be given to only the outstanding FSOs. Appellants merely assume that the ranks of men and women FSOs would produce these outstanding individuals at (roughly) equal rates, and the State Department offered no reason for rejecting this assumption. Appellants’ statistical analysis is based on the contention that if the State Department awarded this prize without bias against women, the percentage of eligible women receiving the award would be the same as the percentage of eligible men receiving the award (and thus the male/female ratio among award recipients would be the same as the male/female ratio in the pool of eligible candidates). Appellants properly limited their analysis to only FSOs in classes 1 through 5, because only FSOs in those classes received this award.
Given that appellants limited their statistical analysis to the relevant pool, and the analysis revealed an underselection of women measuring 3.1 standard deviations, the inference of disparate treatment generated by this evidence is entitled to stand unless and until the government presents a credible nondiscriminatory explanation of why men in classes 1 through 5 more frequently received the Superior Honor Award than women in the same classes. See, supra, n. 6. By stating that appellants had established no basis for comparing actual awards with expected awards, and in believing that appellants assumed all female FSOs equally qualified for the award, the District Court revealed a failure to understand the way in which statistics can prove discrimination in a Title VII case. Therefore, we reverse for legal error.

Moreover, because the State Department did not offer any explanation for the disparity between men and women in receiving the Superior Honor Award, we must order the District Court to uphold appellants’ claim of discrimination on this issue. We need not address what kind of remedy might be appropriate, as only issues of liability are properly before the court at this time.

V. Initial Cone Assignments: The Claim Involving the Disparate Impact Theory

Appellants characterize their claim concerning initial cone assignments as both a disparate treatment and a disparate impact claim. This characterization, unfortunately, lacks a certain degree of clarity and may indicate some confusion on the appellants’ part. Perhaps this confusion stems from the fact that the initial cone assignments involve two distinct groups of FSOs: those who took entrance exams and those who did not. See, supra, Part I & n. 2. It appears that appellants wish to bring a disparate treatment claim on behalf of both these groups and a disparate impact claim on behalf of the exam-takers. The appellants introduced statistical evidence of a disparity in initial cone assignments for which the pool was both the exam-takers and the nonexam-takers. Appellants' Brief at 22. This study was based on data supplied by the State Department. Id. The appellants also introduced statistical evidence of a disparity in the initial cone assignments for the exam-takers alone. Id. at 24. This study, by contrast, was based on data supplied by the Educational Testing Service (ETS) which administers the Foreign Service entrance exams and monitored the test results. Id. (The appellants apparently did not introduce any evidence regarding the nonexam-takers alone.) We do not believe, however, that in this case the appellants can pursue both a disparate treatment and a disparate impact claim with respect to the exam-takers’ initial cone assignments. We will explain our reasons for this conclusion.

To apply the disparate treatment theory to the evidence concerning exam-takers, the appellants must allege and prove that the observed, nonrandom disparities were caused by intentional discrimination against women. To apply the disparate impact theory, the appellants must allege and prove that the disparities were caused by a “facially neutral” selection criterion that disadvantaged women more than men. Here, the appellants point to the political functional field portion of the Foreign Service Entrance Examinations. They have introduced evidence that from 1975 to 1980 men received higher scores than women on this test and that statistical analysis rejects the hypothesis that this disparity was a random sample of the deviation that would normally occur if men and women tested equally. See 616 F.Supp. at 1546 (¶ 27).

Of course, the appellants might have presented alternative claims: e.g., the disparity in initial cone assignments was caused either by discriminatory intent, or by the results of the entrance examinations. Nothing in Title VII or the Federal Rules of Civil Procedure prevents appellants from pursuing alternative claims or theories, even if they are mutually inconsistent. But in this case appellants seem to argue only that the results of the entrance examinations caused the disparity in initial cone assignments; they make no explicit charge of discriminatory intent. Indeed, appellants introduced an additional regression analysis study (also based on the ETS data) which showed that the test scores were the one and only factor that explained the disparity in initial cone assignments. At trial, appellants’ expert witness, who had conducted the statistical study, testified that with respect to “the exam takers, the reason you see this pattern [of disparity in initial cone assignments] is because of their test scores.” Tr. at 3402. The appellants argued to the District Court that this evidence demonstrates that “[t]he adverse impact of the functional field test causes the disparities in cone assignment observed by Dr. Siskin [the expert witness]. . . . [T]est scores on the functional field test were determinative of cone assignments.” Plaintiffs’ Post-Trial Brief at 33. They repeat this argument on appeal. Appellants’ Brief at 35. Because appellants have specifically identified the examinations, and not intent, as causing the disparity in initial cone assignments of the exam-takers, we will treat their claim concerning this disparity as relying solely on the disparate impact theory.

Once over that initial hurdle, the resolution of appellants’ disparate impact claim seems straightforward. The only basis which the District Court gave for rejecting appellants’ statistical evidence that correlated test scores with initial cone assignments was that these statistics were “flawed and inconclusive.” 616 F.Supp. at 1561 (¶ 28).

Plaintiffs’ analysis of exam takers is flawed and inconclusive in establishing disparate impact in cone assignments. It was established that the expert's determination of total FSO hires for the year 1981 was incorrect. Plaintiffs’ expert at times had difficulty identifying the cone at hire of the FSO’s and chose to delete those officers from the analysis, along with any FSO’s not assigned to the four major cones. Though the expert disclaimed the significance of those actions, the Court is not persuaded.

Id. at 1546 (¶ 29). Unfortunately, this finding of fact is itself flawed. Although the District Court is correct in saying that there was some confusion about the correct data for 1981 in some of appellants’ statistics, this confusion did not involve the specific statistical studies relevant to the disparate impact claim involving the entrance examination: the data which were supplied by ETS. There was no dispute about the accuracy of this data. The confusion over the 1981 numbers arises from data supplied by the State Department’s employment records. The State Department data were used in appellants’ statistical studies involving both exam-takers and nonexam-takers and this evidence was unnecessary for the disparate impact claim involving exam-takers only.

Because the ETS data on which the disparate impact claim relies do not include the “flaw” referred to by the District Court, this finding of fact must be reversed as clearly erroneous. Indeed, the State Department makes no attempt to support this finding of fact. Instead, the State Department suggests that preference, and not the results from the functional field portions of the entrance examinations, explains the disparity in the initial cone assignments of male and female exam-takers. It is not at all clear from the opinion that the District Court adopted this argument. The District Court refers to the existence of a study that the State Department introduced in support of this argument, but makes no evaluation of the study. 616 F.Supp. at 1546 (¶ 30). We think it appropriate that the District Court, rather than an appellate court, evaluate this evidence in the first instance. We note that the government’s study involves only the years 1973 and 1974, when exam-takers were tested in only one functional field and were allowed to select the field in which they wished to be tested. But we are not prepared to say that the results of that study, whatever they might be, are entirely irrelevant for the years 1975 and after, when all exam-takers were tested in all four functional fields. Apparently, preference played some role in initial cone assignments for some FSOs in the period after 1975. See 616 F.Supp. at 1545 (¶ 16). On remand, therefore, the District Court must determine whether, on the basis of the existing record, the apparent disparity in initial cone assignments for the exam-takers was, more likely than not, caused by the disparity in test scores for male and female FSOs or, as the State Department contends, by different assignment preferences between male and female FSOs.

Notably, the one obvious defense that the State Department never raised was that there was a legitimate “business” necessity for the test. Indeed, the District Court specifically found that “[d]efendant did not rely on a showing that the political functional field test was job related.” 616 F.Supp. at 1546 (¶ 31). Thus, if the District Court concludes that the examination caused the disparity in initial cone assignments, the District Court must conclude that the test violated Title VII. See, e.g., Albemarle Paper Co. v. Moody, 422 U.S. 405, 95 S.Ct. 2362, 45 L.Ed.2d 280 (1975). Consequently, we reverse the decision of the District Court and remand for the appropriate factfinding.

Conclusion

We have reviewed the District Court’s decision in this case in detail and have concluded that it committed a number of legal errors and made several clearly erroneous findings of fact. Consequently, we reverse the judgment of the District Court and remand this action for further proceedings not inconsistent with this opinion. With respect to a number of the appellants’ claims, we have held that the determination of liability under Title VII requires further factfinding by the District Court, to be conducted on the basis of the existing record. See C. Wright & A. Miller, Federal Practice and Procedure § 2577 (1971). We offer no views at this point on any issues relating to the remedies phase of this litigation.

It is so ordered. 
      
      . Before 1975, the Foreign Service tested each applicant in only one of the four functional areas, and required the applicant to select the cone in which he or she wished to be tested. Defendant’s Post-Trial Brief at 40. From 1975 to 1979, applicants were admitted into the Service on the basis of general test scores alone; the results of the functional field cone tests were used to make initial cone assignments. Since 1980, admission has depended upon overall performance on the functional field tests, but applicants must achieve a certain cut-off score on the particular cone test in order to be eligible for appointment to that cone. Id. at 43.
     
      
      . Another relatively small group have entered the junior ranks of the Foreign Service without going through the examination process. Before 1984, minority applicants who entered the Foreign Service through the Affirmative Action Junior Officer Program were not required to take the entrance examinations. Similarly, the Mustang Program, which allows State Department employees not in the Foreign Service to become members of the Service, has not used the examination. Individuals who have entered the Service pursuant to these programs have received initial cone assignments based on their background and experience.
     
      
      . The decision to grant a career candidate tenure as a Foreign Service Officer is made independently of the promotion process. Tenure decisions are made by the Secretary of State pursuant to 22 U.S.C. § 3946, which provides that the Secretary’s decisions shall be based on the recommendations of special tenure boards. See Defendant's Post-Trial Brief at 100-02; see also Daniels v. Wick, 812 F.2d 729 (D.C.Cir.1987) (holding that § 3946 provides the only means for receiving tenure under the Foreign Service Act of 1980).
     
      
      . The Junior Applicant Consent Decree settled all claims involving entry-level decisions into the junior ranks of the Foreign Service. The Mid-Level Applicant Consent Decree settled all issues of lateral entry into the Foreign Service.
     
      
      . The appellants have not appealed all issues raised at trial.
     
      
      . As the quotation from Segar reflects, the statistical analysis must focus "on the appropriate labor pool” in order to properly establish a prima facie case of discrimination. If a statistical analysis of selection rates is premised on a faulty calculation of the number of men and women who are eligible for selection, as a result, for example, of a misunderstanding of the eligibility criteria, the statistical conclusions lose much of their probative force. If, for instance, to be eligible for a promotion from assistant professor to professor at a particular university a person must have seven years experience and a Ph.D. degree, a statistical study which defines the number of women and men eligible for this promotion as those with seven years experience, overlooking the requirement of a Ph.D. degree, might lead to skewed results, for there might well be some reason why more female than male assistant professors had not achieved a Ph.D. degree after seven years of teaching. "In order to ensure that a plaintiff’s methodology has eliminated the common nondiscriminatory explanation of a lack of qualifications, this circuit has developed a requirement that statistical evidence of disparities account for the minimum objective qualifications for the position at issue." Segar v. Smith, 738 F.2d 1249, 1274 (1984), cert. denied sub nom. Meese v. Segar, 471 U.S. 1115, 105 S.Ct. 2357, 86 L.Ed.2d 258 (1985) (emphasis in original) (citations omitted). Conversely, as long as a plaintiff’s statistical analysis has properly defined the pool of eligible candidates, by accounting for "minimum objective qualifications," the burden then shifts to the defendant to introduce evidence of a legitimate, nondiscriminatory explanation if the analysis reveals a statistically significant disparity. Id. at 1276.
In this case, there is no dispute that appellants properly accounted for the minimum objective qualifications for the various positions and benefits that are the subject of their disparate treatment claim. Instead, this case involves disputes about whether the government succeeded in demonstrating the existence of other, legitimate factors that would explain the apparent disparities in selection rates, or whether it demonstrated other methodological flaws or inadequacies in appellants’ statistics. For a discussion of the legal principles involved in evaluating attempts to rebut plaintiffs’ statistics, see, infra, Part III. B.
     
      
      . The "standard deviation” is a unit of measurement that allows statisticians to measure all types of disparities in common terms. Technically, a "standard deviation” is defined as "a measure of spread, dispersion, or variability of a group of numbers equal to the square root of the variance of that group of numbers.” D. Baldus & J. Cole, Statistical Proof of Discrimination 359 (1980) (emphasis in original). The "variance" of the group of numbers is computed by subtracting the “mean,” or average, of all the numbers from each number, "squaring the resulting difference, and computing the mean of these squared differences." Id. at 361.
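      The definition quoted above can be followed step by step. This is a minimal sketch only: the function names and the sample numbers are illustrative assumptions and do not come from the record.

```python
import math


def variance(values):
    """Mean of the squared differences between each number and the mean."""
    mean = sum(values) / len(values)
    # Subtract the mean from each number and square the resulting difference.
    squared_differences = [(v - mean) ** 2 for v in values]
    # The variance is the mean of these squared differences.
    return sum(squared_differences) / len(squared_differences)


def standard_deviation(values):
    """Square root of the variance of the group of numbers."""
    return math.sqrt(variance(values))


sample = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative numbers; their mean is 5
print(variance(sample))                # 4.0
print(standard_deviation(sample))      # 2.0
```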
     
      
      . The discussion of statistics in this portion of the opinion relies on the following sources: D. Baldus & J. Cole, Statistical Proof of Discrimination (1980 & 1986 Supp.); W. Curtis, Statistical Concepts for Attorneys (1983); W. Dixon & F. Massey, Jr., Introduction to Statistical Analysis (4th ed. 1983); B. Lindgren & D. Berry, Elementary Statistics (1981) [hereinafter cited as Elementary Statistics]; R. Wehmhoefer, Statistics in Litigation (1985).
      We are not expert statisticians and we discuss statistics only insofar as necessary to give a comprehensible explanation of our view of the proper application of Title VII law to the facts of this case. Nor do we pretend to cover all of the issues that relate to the use of statistics in a Title VII case. For example, we note that there are various methods for deriving a "test statistic" measured in numbers of “standard deviations": the z-test, the t-test, etc. We have no opinion on the choice of these methodologies as this case does not call them into question. Similarly, we are aware that our discussion of statistics requires sufficiently “large” samples in order to be accurate; we have avoided the “small sample problem” because apparently none of the claims on appeal here involves small samples.
     
      
      . In any event, given the language of the Supreme Court in Castenada and Hazelwood, we do not believe that we can allow the threshold at which statistical evidence alone raises an inference of discrimination to be lower than 1.96 standard deviations, whether one views this number as signifying a 5% probability of randomness using a two-tailed approach or a 2.5% probability of randomness using a one-tailed approach. If plaintiffs in Title VII cases are ever to be allowed to establish a prima facie case by evidence of disparity measuring lower than 1.96 standard deviations, this decision under the current law must be made by the Supreme Court (or Congress). Cf. Meier, Sacks & Zabell, "What Happened in Hazelwood," reprinted in, M. DeGroot, S. Fienberg & J. Kadane, Statistics and the Law 15 (1986) (adopting 1.96 standard deviations as the threshold for Title VII cases even under the assumption that one should use a one-tailed test in Title VII litigation).
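      The two readings of the 1.96 threshold can be reproduced from the standard normal distribution. This is a sketch under the large-sample normal assumption; the helper name `tail_probabilities` is hypothetical and appears in none of the cited sources.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def tail_probabilities(z: float):
    """Return (one-tailed, two-tailed) probabilities of a disparity of at
    least z standard deviations arising by chance."""
    one_tailed = 1.0 - STANDARD_NORMAL.cdf(z)
    return one_tailed, 2.0 * one_tailed


one_tailed, two_tailed = tail_probabilities(1.96)
print(f"one-tailed: {one_tailed:.3f}")   # about 0.025, i.e. 2.5%
print(f"two-tailed: {two_tailed:.3f}")   # about 0.050, i.e. 5%

# Going the other way: the number of standard deviations that leaves
# 5% in the two tails combined.
threshold = STANDARD_NORMAL.inv_cdf(1.0 - 0.05 / 2.0)
print(f"threshold: {threshold:.2f}")     # about 1.96
```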
     
      
      . In this respect, we follow the approach to statistical evidence adopted in Craik v. Minnesota State University Bd., 731 F.2d 465, 476 n. 13 (8th Cir.1984):
      Statistical evidence showing less marked discrepancies [than two standard deviations] will not alone establish something other than chance is causing the result, but we shall consider it in conjunction with all the other relevant evidence in determining whether the discrepancies were due to unlawful discrimination.
      This approach follows Baldus and Cole in viewing disparities between 1.65 and 1.96 standard deviations as falling into an "intermediate" zone. See Baldus & Cole (Supp.) at 131-32. Numbers in this intermediate range go some of the way toward establishing a prima facie case of discrimination, but they cannot make the distance on their own. But cf., Meier, Sacks & Zabell, supra n. 9, at 12 (the appropriate intermediate zone falls between 1.96 and 2.33 standard deviations).
     
      
      . 22 U.S.C. § 3905 states explicitly that “all personnel actions ... shall be made in accordance with merit principles," which excludes sex or race as a permissible criterion for a job action. See H.R.Rep. No. 992, pt. 1, 96th Cong., 2d Sess. 8 (1980). Furthermore, this section goes on to direct the Secretary of State to "prescribe such rules as may be necessary to ensure that members of the Service, as well as applicants for appointments in the Service ... are free from discrimination on the basis of ... sex." 22 U.S.C. § 3905(b). The statute also states that this section does not extinguish any rights under Title VII. Id. § 3905(e).
     
      
      . Because the Supreme Court was sharply divided on a separate issue in the Bazemore case, the Supreme Court's unanimous opinion on this issue comes in the unusual form of a concurring opinion. The Court issued a short per curiam opinion stating:
      We hold, for the reasons stated in the opinion of Justice BRENNAN, ... the Court of Appeals erred in disregarding petitioners' statistical analysis because it reflected pre-Title VII salary disparities, and in holding that petitioners’ regressions were unacceptable as evidence of discrimination.
      106 S.Ct. at 3002. As Justice Brennan’s opinion reflects the reasoning of the unanimous Court, we have dispensed with the conventional practice of citing to it as a concurring opinion.
     
      
      . As the Supreme Court said in Bazemore, “[w]hether, in fact, [plaintiffs’ statistics will] carry the plaintiffs’ ultimate burden will depend in a given case on the factual context of each case in light of all the evidence presented by both the plaintiff and the defendant.” This statement contemplates that defendants generally must introduce evidence to support their attack on plaintiffs’ statistics. Mere conjectures and assertions usually will not suffice.
      We note also that leading commentators support this corollary to the Bazemore rule. Baldus and Cole emphasize that "when otherwise relevant evidence is challenged on methodological grounds, the burden should normally be on the challenger (a) to present credible evidence that the statistical proof is defective and (b) to present a plausible explanation of how the asserted flaw is likely to bias the results against his or her position.” D. Baldus & J. Cole, Statistical Proof of Discrimination at vii (1986 Supp.).
     
      
      . Other opinions of this court are in accord. See Trout v. Lehman, 702 F.2d 1094, 1102 (D.C. Cir.1983), vacated on other grounds, 465 U.S. 1056, 104 S.Ct. 1404, 79 L.Ed.2d 732 (1984); DeMedina v. Reinhart, 686 F.2d 997, 1008 (D.C. Cir.1982).
     
      
      . This statistical evidence was further supported by additional statistics demonstrating a disparity in the potential ratings between men and women who achieved exactly the same performance rating. For example, men with performance ratings of "6” received, on average, higher potential ratings than the women who received performance ratings of "6.” This disparity measured 2.55 standard deviations. 616 F.Supp. at 1549 (¶ 63).
     
      
      . The District Court actually said, "This difference produces a standard deviation of 5.84, and therefore is likely to be the product of chance less than once in 1,000,000.” 616 F.Supp. at 1553. While it is true that a disparity measuring 5.84 standard deviations corresponds to a probability value of less than one in a million, whether using a one-tailed or a two-tailed test, our reading of the standard statistical tables tells us that the probability of a random disparity measuring 5.84 standard deviations is much smaller than even the District Court indicated. The one-tailed probability value associated with a statistic measuring 5.8 standard deviations is 3.3157 X 10-9, or about 3 in one billion. The corresponding two-tailed probability value is twice that or about 6 in one billion, which is less than one in one hundred million. See Plaintiffs’ Exhibit 168 at 13.
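      The tail probabilities quoted above can be recomputed directly rather than read from a table. This is a hedged sketch, assuming a standard normal test statistic; `one_tailed_probability` is a hypothetical helper name. The complementary error function is used because it evaluates the upper tail directly, without subtracting two nearly equal numbers.

```python
import math


def one_tailed_probability(z: float) -> float:
    """Upper-tail probability P(Z >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


for z in (5.8, 5.84):
    one_tailed = one_tailed_probability(z)
    two_tailed = 2.0 * one_tailed
    print(f"{z} standard deviations: "
          f"one-tailed {one_tailed:.4e}, two-tailed {two_tailed:.4e}")

# At 5.8 standard deviations the one-tailed value is about 3.3e-9 and the
# two-tailed value about 6.6e-9, both below one in one hundred million (1e-8).
```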
     
      
      . The 0.74% probability mentioned in text reflects a two-tailed approach. The District Court, again, apparently used a one-tailed approach. The District Court stated that the (one-tailed) probability was 5 in 1000, or 0.5%, but our reading of the standard tables reveals a slightly lower one-tailed probability of 0.37%. See Elementary Statistics, supra n. 8, at 479.
     
      
      . Again, the 0.88% probability reflects a two-tailed approach. A one-tailed probability value for 2.62 standard deviations is 0.44%. Elementary Statistics, supra n. 8, at 479.
     
      
      . The State Department's approach here is remarkably similar to the defendant’s rejected approach in Bazemore:
      
      Respondents’ strategy at trial was to declare simply that many factors go into making up an individual employee’s salary; they made no attempt that we are aware of — statistical or otherwise — to demonstrate that when these factors were properly organized and accounted for there was no significant disparity between the salaries of blacks and whites.
      106 S.Ct. at 3010-11 n. 14. Similarly, here the State Department presented no evidence at all that preference would explain the disparities related to sex.
     
      
      . On the contrary, the Supreme Court found this evidence "quite probative.”
     
      
      . We have no occasion to rule today that with respect to a particular disparity (like initial cone assignments) a disparate treatment claim and a disparate impact claim are mutually inconsistent. As this court has previously recognized, a disparate treatment claim can turn into a disparate impact claim if a defendant rebuts an allegation of discriminatory intent by claiming that a facially neutral selection criterion caused a disparity in selections. See Segar, 738 F.2d at 1270. Indeed, it may even be possible to claim that both discriminatory intent and a facially neutral, although disadvantageous, selection criterion simultaneously caused a particular disparity, each contributing to the end result. Fortunately, we need not decide any of these complex questions about partial causality, since appellants themselves state that after accounting for the difference between men and women in their test scores, there is no statistically significant disparity between men and women in their initial cone assignments. See Plaintiffs’ Post-Trial Brief at 22; infra, n. 22.
     
      
      . This study considered the effect of the following variables on initial cone assignments: level of educational attainment, major field of study, functional test scores, and sex. See 616 F.Supp. at 1546 (¶ 28). The study found that neither level of educational attainment nor major field of study explained disparities in initial cone assignments, and that when controlling for functional test scores, women were not underassigned to the political cone or overassigned to the consular cone to a statistically significant degree. See Tr. at 1076-82.
     
      
      . Appellants’ confusion over the difference between a disparate treatment and a disparate impact claim is illustrated by the following assertion in their brief: “[Plaintiffs’ expert] found that test scores substantially correlate with or explain cone assignments. . . . Thus, there can be no doubt that plaintiffs have established a disparate treatment [claim] in cone assignment." Plaintiffs’ Post-Trial Brief at 22. As discussed in text, this evidence supports a disparate impact, and not a disparate treatment, claim. Appellants at times, incorrectly, suggest that they can maintain a disparate treatment claim simply by demonstrating a disparity in initial cone assignments. See, e.g., Appellants' Brief at 22. But, as discussed in text, a disparate treatment claim must prove both a disparity and discriminatory intent, even if proof of intent is circumstantial and the disparity itself raises an inference of intent. See, e.g., Teamsters, 431 U.S. at 335 n. 15, 97 S.Ct. at 1854 n. 15.
     
      
      . Because we have concluded that appellants have properly presented only a disparate impact claim regarding the initial cone assignments of the exam-takers, the only remaining disparate treatment claim involves the initial cone assignments of those who did not take the entrance examinations. As we have mentioned, however, the appellants presented no independent statistical evidence to show that the State Department intentionally discriminated against women in this group of nonexam-takers. The data which included this group also included the exam-takers, but as any study based on this data is drastically overinclusive with respect to the nonexam-takers, we do not believe this evidence can create even a prima facie case of discrimination. Consequently, we affirm the District Court's decision insofar as appellants failed to prove disparate treatment in the initial cone assignments of the nonexam-taker group.
     
      
      . See, supra, n. 1.
     
      
      . We note, however, that the statistical analysis on which the appellants' disparate impact claim was based covered only those applicants who took the examinations between 1975 and 1980 and were subsequently hired between 1976 and 1983. Apparently, there was not sufficient data from those who took the entrance examinations after 1980 and who were thereafter hired in the relevant time period, for a meaningful statistical analysis to be conducted about the effect of these examinations. Therefore, the determination of liability under the disparate impact theory can extend only to those who took the examinations between 1975 and 1980.
     