
    Gary TURPIN, Plaintiff-Appellant, Betty Turpin, Individually and as Parent and Natural Guardian of Brandy Turpin, a Minor, Plaintiff, v. MERRELL DOW PHARMACEUTICALS, INC., Defendant-Appellee.
    No. 90-5690.
    United States Court of Appeals, Sixth Circuit.
    Argued Feb. 14, 1992.
    Decided March 11, 1992.
    Peter Perlman (briefed), Lexington, Ky., and Barry J. Nace (argued and briefed), Paulson, Nace, Norwind & Sellinger, Washington, D.C., for plaintiff-appellant.
    Frank W. Woodside, III (argued and briefed), Stephen M. Rosenberger (briefed), Dinsmore & Shohl, Cincinnati, Ohio, for defendant-appellee.
    Before: MERRITT, Chief Judge; KENNEDY and JONES, Circuit Judges.
   MERRITT, Chief Judge.

For a judicial system founded on the premise that justice and consistency are related ideas, the inconsistent results reached by courts and juries nationwide on the question of causation in Bendectin birth defect cases are of serious concern. In this Bendectin causation case, Judge Eugene Siler concluded that the evidence adduced on summary judgment was insufficient to allow a rational jury to find that Bendectin, a drug given to pregnant women to counteract the nausea of “morning sickness,” caused the minor plaintiff’s birth defects. 736 F.Supp. 737.

The general issue for review here is whether the trial judge erred by withdrawing the case from the jury and by granting summary judgment for the defendant pharmaceutical company. The more specific issues are, first, whether a court should judge for itself the validity of the reasoning process by which various competing qualified experts have reached their conclusions or should instead leave that question for the jury; and second, whether the evidence in this case, if so reviewed, is sufficient to withstand the defendant’s motion for summary judgment.

We agree with Judge Siler that, although judges should respect scientific opinion and recognize their own limited scientific knowledge, nevertheless courts have a duty to inspect the reasoning of qualified scientific experts to determine whether a case should go to the jury. Based on the record before us, we also agree with Judge Siler that whether Bendectin caused the minor plaintiff’s birth defects is not known and is not capable of being proved to the requisite degree of legal probability based on the scientific evidence currently available. Taken in the light most favorable to the plaintiffs, the scientific evidence that provides the foundation for the expert opinion on causation in this case is not sufficient to allow a jury to find that it is more probable than not that Bendectin caused the minor plaintiff’s injury. Therefore the case should not go to a jury.

We will first summarize the Bendectin causation issue and the case law that has developed during the past twelve years. We will then analyze the evidence in greater detail and show why it does not meet the legal test of causation.

I. Overview

The nausea of morning sickness affects many pregnant women and, although the causes are not completely understood, in extreme cases may cause permanent injury to the sufferer’s unborn child. Merrell Dow manufactured and marketed Bendectin as an anti-nauseant prescription for morning sickness from 1956 until 1983 when it took the drug off the market despite continued approval from the Food and Drug Administration. Estimates indicate that Bendectin was prescribed from 1957 until 1982 to over 30 million women worldwide and to more than 17.5 million women in the United States. These women commonly took Bendectin during the first trimester of pregnancy.

Approximately seven weeks after becoming pregnant, Betty Turpin ingested Bendectin to combat morning sickness. The initial development of the fetus’s fingers and toes occurs some four to eight weeks after conception. Seven months after Ms. Turpin first took the drug, her child, Brandy Turpin, the infant plaintiff in this case, was born with “limb reduction defects”: severely deformed hands and feet, specifically fused joints and shortened or missing fingers and toes. Ms. Turpin took no other drugs during the course of her pregnancy, nor can her child’s deformities be traced to any known genetic disorders.

Causation here is a matter of trying to measure probabilities. It requires a complex series of inferences drawn from scientific experiment and observation and statistical comparisons. For example, the plaintiffs rely primarily on animal experiments from which an inference is drawn that since chemical compounds in Bendectin, if administered at certain levels, cause birth defects in animals, they may cause similar defects in humans. The plaintiffs draw a further inference that Bendectin caused the birth defects in this particular case. These inferences are necessary because physicians who treated Brandy Turpin and other similarly situated children cannot diagnose the cause of these anomalies.

The defendant, too, reasons from the results of scientific studies to a particularized conclusion with respect to these plaintiffs. Merrell Dow relies primarily on statistical studies that purport to show that the incidence of certain birth defects is no higher with women who used Bendectin than with those who did not or, in the alternative, that where statistical associations indicating a possible causal relationship exist, they would not lead a reasonable expert to infer that Bendectin causes birth defects.

The causation proof in Bendectin birth defect cases is offered by expert witnesses who speak in terms of population groups and statistical samples rather than specific individuals. The expert witnesses on each side are often the same, from case to case, and even when different the scientific conclusions and theories are based on the same or similar statistical studies and scientific experiments. The cases are variations on a theme, somewhat like an orchestra which travels to different music halls, substituting musicians from time to time but playing essentially the same repertoire.

A brief survey of the reported Bendectin cases illustrates the inconsistency of courts that have dealt with the scientific problem of causation. We find only one reported case finally upholding a finding of causation. In Oxendine v. Merrell Dow Pharmaceuticals, Inc., 506 A.2d 1100 (D.C.App. 1986), aff'd in part on appeal after remand, 563 A.2d 330 (D.C.App.1989), cert. denied, 493 U.S. 1074, 110 S.Ct. 1121, 107 L.Ed.2d 1028 (1990), the appellate court reversed the trial court’s grant of a judgment n.o.v. and motion for new trial to the defendant and reinstated the jury’s $750,000 verdict for the plaintiffs. On the other hand, in four other reported cases, juries returned verdicts for the defense which were allowed to stand. Wilson v. Merrell Dow Pharmaceuticals, 893 F.2d 1149 (10th Cir.1990) (affirming judgment for the defendant and noting also that the plaintiffs’ motion for judgment n.o.v. was correctly denied by the district judge); Will v. Richardson-Merrell, Inc., 647 F.Supp. 544 (S.D.Ga.1986) (denying plaintiffs’ motion for judgment n.o.v.); In re Richardson-Merrell, Inc. “Bendectin” Products Liability Litigation, 624 F.Supp. 1212 (S.D.Ohio 1985), aff'd, 857 F.2d 290 (6th Cir.1988) (denying plaintiffs’ motion for judgment n.o.v. in an order addressing 818 of 844 consolidated multidistrict cases in the largest of all Bendectin cases); and Cosgrove v. Merrell Dow Pharmaceuticals, Inc., 117 Idaho 470, 788 P.2d 1293 (1990) (affirming jury’s finding that Bendectin was not the proximate cause of child’s injuries).

Four federal circuits have held that plaintiffs failed as a matter of law to establish causation of birth defects. The Fifth Circuit, without ruling specifically on the admissibility of the plaintiffs’ expert testimony, reversed a jury verdict for the plaintiffs and granted judgment n.o.v. to the defendant because adequate proof of causation was lacking. Brock v. Merrell Dow Pharmaceuticals, Inc., 874 F.2d 307, reh’g. denied, 884 F.2d 166 (5th Cir.1989), cert. denied, 494 U.S. 1046, 110 S.Ct. 1511, 108 L.Ed.2d 646 (1990), limited by Christopherson v. Allied-Signal Corp., 902 F.2d 362, 367 (5th Cir.1990), rev’d on reh’g on other grounds, 939 F.2d 1106 (5th Cir.1991) (en banc). Another circuit, the Ninth, affirmed a grant of summary judgment for the defendant after holding that the plaintiffs’ reanalyses of Merrell Dow’s epidemiological studies were unreliable for lack of peer review. Daubert v. Merrell Dow Pharmaceuticals, Inc., 951 F.2d 1128 (9th Cir.1991). Two other circuits reached the same result by ruling inadmissible the plaintiffs’ expert testimony on grounds that it was not the type “reasonably relied upon” by qualified experts in the specific fields of study. Richardson v. Richardson-Merrell, Inc., 857 F.2d 823 (D.C.Cir.1988), cert. denied, 493 U.S. 882, 110 S.Ct. 218, 107 L.Ed.2d 171 (1989) (reversing jury verdict for the plaintiff; in the face of the defendant’s epidemiological evidence, an insufficient foundation existed for the plaintiffs’ animal and chemical studies); Lynch v. Merrell-Nat’l Labs., 830 F.2d 1190 (1st Cir.1987) (holding that the plaintiff’s in vivo and in vitro studies were inadmissible; therefore, insufficient evidence existed to avoid summary judgment for the defendant); see also Ealy v. Richardson-Merrell, Inc., 897 F.2d 1159 (D.C.Cir.), cert. denied, — U.S. —, 111 S.Ct. 370, 112 L.Ed.2d 332 (1990) (reversing jury verdict for the plaintiff for $20 million in compensatory damages and punitive damages of $75 million, and granting judgment n.o.v. to the defendant after concluding that the plaintiff’s evidence was inadmissible under Richardson), and Ambrosini v. Richardson-Merrell, Inc., No. 86-278 (D.D.C. June 30, 1989) (relying on Richardson in granting judgment for the defendants).

Four District Court cases nationwide have granted summary judgment to the defendant for various reasons. Lee v. Richardson-Merrell, Inc., 772 F.Supp. 1027 (W.D.Tenn.1991) (relying on Richardson, Brock, and Judge Siler’s opinion in this case); Cadarian v. Merrell Dow Pharmaceuticals, Inc., 745 F.Supp. 409 (E.D.Mich.1989) (holding that an inadequate foundation existed for expert’s opinion); Hull v. Merrell Dow Pharmaceuticals, Inc., 700 F.Supp. 28 (S.D.Fla.1988) (finding that the body of scientific literature established Bendectin’s safety and that the infant plaintiff’s mother took the drug too late in her pregnancy to affect the fetus); and Monahan v. Merrell-Nat’l Labs., No. 83-3108-WD, 1987 WL 90269 (D.Mass. Dec. 18, 1987) (finding that summary judgment for the defendant was required under the First Circuit’s earlier holding in Lynch).

In contrast, other courts have either denied or reversed on appeal grants of summary judgment for the defendant in eight cases. In DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941 (3rd Cir.1990), the Third Circuit reversed the trial court’s grant of summary judgment to Merrell Dow and remanded for the District Court to analyze the reasonableness of an expert witness’s epidemiological opinion of causation under Federal Rule of Evidence 702. See also Longmore v. Merrell-Dow Pharmaceuticals, Inc., 737 F.Supp. 1117 (D.Idaho 1990) (expressly declining to adopt Richardson, Lynch, and Brock approaches); In re Bendectin Products Liability Litigation, 732 F.Supp. 744 (E.D.Mich.1990) (holding that collateral estoppel did not bar the plaintiffs on the issue of causation and that, in the face of experts’ disagreements on necessity of epidemiological proof, the court could not reject other types of evidence); and DePyper v. Navarro, No. 116390 (Mich.Ct.App. May 9, 1991) (holding that the trial court erred under state law by not inquiring whether experts in field generally accepted the methodology of the plaintiffs’ expert). For other denials of summary judgment or denials of the defendant’s motions for directed verdict, see Hagen v. Richardson-Merrell, Inc., 697 F.Supp. 334 (N.D.Ill.1988) (denying summary judgment on causation but granting summary judgment on punitive damages); Mangels v. Richardson-Merrell, Inc., No. R-83-3272 (D.Md. Aug. 17, 1987) (summary judgment denied because a triable issue of fact existed for jury resolution); and Lanzilotti v. Merrell Dow Pharmaceuticals, Inc., No. 82-0183, 1986 WL 7832 (E.D.Pa. July 10, 1986) (denying the defendant’s motion for a directed verdict).

The fundamental reasons for the inconsistency of the legal system in handling Bendectin claims appear to be first, the difficulty of scientists and hence of judges, lawyers and jurors in knowing what reasonable inferences of causation to draw from animal experiments and epidemiological studies; and second, the uncertainty of judges about how far they should enter the scientific thicket of conflicting inferences in order to determine whether the basis of a scientific opinion concerning causation is sufficiently plausible to allow a jury to ground a verdict on it. There are two important questions here: How hard should judges look at the reasonableness of scientific theories and inferences before they decide whether there is enough to the case for it to go to the jury? If we apply a “hard look” doctrine, as we are inclined to do in scientific cases based primarily on expert testimony, what exactly are the general scientific experiments and studies capable of showing about whether Bendectin causes birth defects in a particular case?

We believe that close judicial analysis of such technical and specialized matter is necessary not only because of the likelihood of juror misunderstanding, but also because expert witnesses are not necessarily always unbiased scientists. They are paid by one side for their testimony. Although there is no suggestion of unethical scientific conduct in the present case, the potential for exaggeration and fraud on the court is present and may be impossible to discover without close inspection and careful consideration of the record. As Judge Leventhal observed in the context of administrative law, in some circumstances there exists a “combination of danger signals” requiring enhanced “judicial vigilance to enforce the Rule of Law.” Greater Boston Television Corp. v. F.C.C., 444 F.2d 841, 851-52 (D.C.Cir.1970). In such situations, “a court does not depart from its proper function when it undertakes a study of the record, hopefully perceptive, even as to evidence on technical and specialized matters....” Id. at 850. We find that this case presents a scenario justifying the type of judicial review recommended by Judge Leventhal.

II.

In this legal context, we review the evidence and arguments offered by the parties on summary judgment. In determining and applying the correct standards of proof on summary judgment in scientific cases, we look to the rules of sufficiency of the evidence to decide whether juries should be allowed to hear the evidence as well as the rules of admissibility of expert testimony that shape the facts and opinions to be considered. This case, we believe, should be decided on the rules of the sufficiency of evidence of causation on summary judgment, as Judge Siler held below in the alternative.

In the instant case, the plaintiffs claim that their infant daughter’s birth defects were caused by the mother’s use of Bendectin during her pregnancy and specifically by one ingredient, doxylamine succinate. The plaintiffs’ case relies on animal studies and attacks the defendant’s epidemiological studies. The defendant’s case relies on the epidemiological studies and attacks the animal studies.

The plaintiffs offered expert opinions from ten witnesses in eight scientific fields to assess whether Bendectin is “teratogenic,” i.e., capable of causing birth defects. These opinions were based on in vitro and in vivo animal studies, and reassessment of the defendant’s epidemiological studies derived from study of humans. In support of its motion for summary judgment, the defendant relies primarily on 35 human epidemiological studies supporting a finding that the use of Bendectin does not cause birth defects. Some of these studies were conducted by scientists under contract with the defendant. Others were independent.

A. Defendant’s Proof — Epidemiological Studies

Both sides appear to accept the fact that limb defects generally appear in less than one in 1,000 live births. The defendant’s proof consists in large measure of the 35 extant studies published in medical and scientific journals on the statistical relationship between the use of Bendectin and the incidence of various forms of birth defects in babies, none of which conclude that a causal connection exists. For an extended explanation of the complex statistical methodology used in such epidemiology studies, including the use of such terms of art as the “null hypothesis,” “significance testing,” “P value,” “relative risk” and “confidence interval,” see Part I.B of the Third Circuit’s recent Bendectin opinion, DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941, 945-49 (3rd Cir.1990).

The following examples illustrate the nature and findings of these 35 studies, with six studies representative of the group:

1. The San Francisco Study: Two University of California researchers studied effects of six anti-nauseant drugs on 11,481 pregnancies in the San Francisco area over seven years. Bendectin was prescribed in only 628 of these cases. Birth defects were monitored at three ages: one month, one year, and five years (though limb defects in particular were not isolated and reported). The average rate of all types of birth defects for Bendectin cases at the age of one month was 0.8 in every 100 births. This average rate is less than the average rate of the birth defects found in mothers who did not use Bendectin. The comparative rates at the one and five-year periods were similar. Although no specific relative risk was assigned to Bendectin use, the authors concluded that Bendectin, when taken at a recommended dosage level, was not teratogenic. Lucille Milkovich & Bea Van den Berg, An Evaluation of the Teratogenicity of Certain Antinauseant Drugs, 125 Am. J. Obstetrics & Gynecology 244, 245-48 (1976).

2. The Boston and Harvard Study: Six doctors from Boston University and the Harvard School of Public Health evaluated a group study of 50,282 mothers and children for Bendectin’s possible effect on birth defects. Of these mothers, 1,169 took Bendectin during their first four months of pregnancy, with 79 births resulting in various birth defects. Limb defects were not specifically reported. The 49,113 mothers who were not exposed to Bendectin gave birth to 3,169 infants who had various birth defects. Overall, 4.7 percent of those mothers exposed to Bendectin gave birth to deformed infants, while 4.5 percent of non-exposed mothers did. Thus, the relative risk of birth defects for Bendectin use was 1.07 at a 95 percent confidence level. The confidence interval was not listed. Taking this into account, the authors concluded that no “statistically significant” association existed, providing no evidence that Bendectin’s components, including doxylamine succinate, were harmful to the fetus. Samuel Shapiro et al., Antenatal Exposure to Doxylamine Succinate and Dicyclomine Hydrochloride (Bendectin) in Relation to Congenital Malformations, Perinatal Mortality Rate, Birth Weight, and Intelligence Quotient Score, 128 Am. J. Obstetrics & Gynecology 480, 481-84 (1977).

3. The Atlanta Study: Over 280,000 births were monitored in the Atlanta area over a ten-year period for maternal exposure to various drugs including Bendectin. This population base was twenty times larger than that in the San Francisco study. Of 1,231 birth defects cases, 117, or 9.5 percent, of the mothers took Bendectin. Of 129 children born with limb defects, 14 (10.9 percent) had mothers who took Bendectin. The study calculated a relative risk of 1.18 at a 95 percent confidence interval between 0.65 and 2.13. However, for one subgroup of limb defects known as the “amniotic band complex,” a higher relative risk, 3.88, was reported. Therefore, the risk of a Bendectin-exposed mother giving birth to a child with this specific condition was almost four times greater than that which would occur in a population of non-exposed mothers. Two other forms of birth defects (herniated-brain and esophageal defects) had relative risks of 1.84 and 2.47, with 95 percent confidence levels between 0.63 and 5.37 and between 0.84 and 4.89, respectively.

For most birth defects, including limb defects, the authors determined that no “statistically significant” associations could be traced. As for the three above-mentioned possible associations, the authors noted possible confounding factors that might weaken the power of such associations, so that “the data [did] not suggest that Bendectin is causally associated” with birth defects. Jose Cordero et al., Is Bendectin a Teratogen?, 245 JAMA 2307 (1981).

4. Pyloric Valve Defects: The highest associations found between Bendectin use and birth defects were focused not on limb defects but on pyloric stenosis (abnormal constriction of the stomach’s pyloric valve). In one Yale School of Medicine study, a relative risk of 1.40 was detected between mothers using Bendectin (1,427 cases) and non-users (3,001) for birth deformities. When the survey exclusively focused on infants with pyloric valve defects, six mothers taking Bendectin gave birth to children with this defect, as opposed to 29 mothers who did not use the drug. At a 95 percent confidence interval between 1.75 and 10.75, the relative risk of this stomach valve defect was 4.33. “Thus, more than one in 10 cases of pyloric stenosis may be due to maternal use of Bendectin,” although no direct causal relationship could be ascertained. Brenda Eskenazi & Michael Bracken, Bendectin (Debendox) As a Risk Factor for Pyloric Stenosis, 144 Am.J. Obstetrics & Gynecology 919, 921-24 (1982). The same study noted that one child with a limb reduction defect was born to a mother taking Bendectin, while five defects occurred in children of women not taking the drug. Although a relative risk of 4.19 resulted for limb defects, this was regarded as “nonsignificant” by Eskenazi and Bracken, id. at 923, possibly due to the smallness of the group studied and the wide range of the confidence interval.

A later study conducted by Boston University Medical Center of 13,346 births around Puget Sound tended to support Eskenazi’s and Bracken’s findings. In 3,385 cases involving Bendectin use, 13 babies were born with pyloric valve defects, while 13 women out of 9,511 not exposed to the drug gave birth to affected children. These findings yielded a 2.5 relative risk ratio at a 95 percent confidence interval between 1.2 and 5.2. Despite this positive association, however, the Boston University researchers noted “the absence of any apparent biologic basis” to support the connection between Bendectin use and development of purely internal deformities like pyloric stenosis. The researchers also acknowledged that severity of the mothers’ nausea could have confounded the results. Pamela Aselton et al., Pyloric Stenosis and Maternal Bendectin Exposure, 120 Am.J.Epidemiology 251, 252-56 (1984). An earlier study by two members of the same group focused on limb disorders in the same geographic region. Of 5,255 women studied, two infants out of 1,364 born to women who used the drug had limb defects, while four infants in the 3,841 Bendectin-free mothers had defects. The study yielded a relative risk of 1.4 and a 95 percent confidence interval between 0.26 and 7.71. A related study disclosed that of 2,255 infants born to Bendectin users, two children were born with malformed limbs, compared with six similar limb defects in babies born to 4,582 non-users. From this, a relative risk of 0.9 at a 95 percent confidence level between 0.29 and 2.98 was derived. The researchers concluded this was “evidence against a strong association.” Pamela Aselton & Herschel Jick, Additional Follow-Up of Congenital Limb Disorders in Relation to Bendectin Use, 250 JAMA 33 (1983).
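The relative risk and confidence interval figures quoted in these studies can be illustrated with a short calculation. The sketch below is not taken from any of the cited studies; it assumes the common large-sample (log-scale) approximation for the interval, and the function name `relative_risk` is hypothetical. It uses only the counts reported above for the earlier Puget Sound limb-disorder study (two affected infants among 1,364 exposed, four among 3,841 unexposed).

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk(exposed_cases, exposed_total,
                  unexposed_cases, unexposed_total, confidence=0.95):
    """Return (relative risk, lower bound, upper bound) for cohort counts.

    ASSUMPTION: the interval uses the standard large-sample approximation
    on the log scale; the studies themselves may have used other methods.
    """
    risk_exposed = exposed_cases / exposed_total
    risk_unexposed = unexposed_cases / unexposed_total
    rr = risk_exposed / risk_unexposed
    # Approximate standard error of ln(RR).
    se = sqrt(1 / exposed_cases - 1 / exposed_total
              + 1 / unexposed_cases - 1 / unexposed_total)
    # Two-sided critical value (about 1.96 for 95 percent).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)


# Counts quoted in the opinion: 2 of 1,364 exposed, 4 of 3,841 unexposed.
rr, lower, upper = relative_risk(2, 1364, 4, 3841)
print(f"RR = {rr:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the result is a relative risk of about 1.4 with an interval running from roughly 0.26 to roughly 7.7, close to the reported figures. Because the interval includes 1.0, the association is not "statistically significant" at the 95 percent level, and the width of the interval reflects how few limb defects were observed.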

5. The Sydney Study: University of Sydney researchers compared pregnancy histories for mothers of 155 children born with limb reduction defects with those for the mothers of 274 control group children; 26 percent of the 429 mothers in both groups used Bendectin during the first trimester of pregnancy. The relative risk resulting was 1.1, with a 95 percent confidence interval between 0.8 and 1.5. The Australian researchers concluded that “[o]n these figures, there is no evidence that women who take [Bendectin] ... are more likely to bear a limb-deficient child than women who do not take this drug.” Janet McCredie et al., The Innocent Bystander, 140 Med.J.Austl. 525, 526-27 (1984).

6. The National Institute of Health Study: In the most recent Bendectin study, two National Institute of Health researchers evaluated 31,564 births in Northern California. Of those women 2,771 (nine percent) had used Bendectin. For 58 categories of defects studied (limb defects, however, were not specifically monitored), 135 defects occurred in cases of Bendectin exposure, while 1,439 defects occurred in non-exposed cases. Relative risks were greatest in three categories: lung defects (4.6), microcephaly, i.e., small head size (3.1), and cataracts (3.7). The 95 percent confidence intervals varied widely for these three categories, ranging from 1.9 to 10.9 for lung defects, 1.8 to 15.6 for microcephaly, and 1.2 to 24.3 for cataracts.

Despite these findings, the authors surmised that the three statistically significant defect groups were possibly spurious, as they were “exactly the number ... that would have been expected by chance.” How this determination was made, however, is not specified by the authors. The authors concluded that no increase in overall rates of defects existed after Bendectin use and that the three associations “are unlikely to be causal.” Patricia Shiono & Mark Klebanoff, Bendectin and Human Congenital Malformations, 40 Teratology 151, 152-55 (1989).
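One plausible way to arrive at a figure like the authors' "expected by chance" number, offered here only as an illustration and not as the authors' stated method, is to assume that each of the 58 defect categories was tested independently at the conventional 5 percent significance level. Both the 5 percent threshold and the independence of the tests are assumptions of this sketch.

```python
from math import comb

categories = 58   # defect categories examined, per the study as described above
alpha = 0.05      # ASSUMPTION: conventional 5 percent significance level per test

# Expected number of "significant" results if the drug had no effect at all.
expected_by_chance = categories * alpha


def prob_at_least(k, n, p):
    """Binomial probability of k or more successes in n independent trials."""
    return 1 - sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k))


# ASSUMPTION: the 58 tests are independent of one another.
p_three_or_more = prob_at_least(3, categories, alpha)

print(f"Expected chance findings: {expected_by_chance:.1f}")
print(f"Probability of three or more: {p_three_or_more:.2f}")
```

On these assumptions about three significant associations would be expected from chance alone, and seeing three or more would be an unremarkable outcome. Whether the authors reasoned this way is not stated in the record.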

In addition to the 35 epidemiological studies, the defendant also offers as evidence the fact that no one has detected a decrease in the incidence of birth defects after Bendectin was removed from the market in 1983. Dr. Lamm so testified for the defendant based on a number of studies.

B. Plaintiffs’ Challenge to the Epidemiological Proof

The plaintiffs claim that the defendant’s 35 studies are based on samples which are too small to prove the absence of causation in light of the infrequency of instances of birth defects in general; that they do not adequately isolate limb reduction defects from other birth defects; that they do not control for many confounding factors such as smoking and the use of other drugs; that they impose a much higher level of scientific certainty of association (95 percent) than required by the preponderance of the evidence standard of proof (i.e., 51 percent); and that some of the studies can be read to show some statistically significant association if a much lower level of certainty is used. In essence the plaintiffs argue that the 35 statistical studies do not prove or disprove anything concerning the relationship between Bendectin and limb reduction defects.

At least two expert witnesses for the plaintiffs attack the persuasive force of the defendant’s statistical comparisons of the incidence of Bendectin-related birth defects. Dr. Glasser criticized the defendant’s use of studies by Aselton, Jick, Cordero and Eskenazi as not correctly considering other birth defects, such as heart and pyloric valve defects or cleft palates, in assessing Bendectin’s capacity for limb birth defects. In his affidavit, Dr. Glasser also criticized the Cordero, Eskenazi and McCredie studies for incorrectly inferring that no association existed between Bendectin use and infant limb reduction. Dr. Swan, similarly, rejected these studies’ sole reliance on a relative risk of 1.0 within a 95 percent confidence interval as a basis for concluding that Bendectin does not cause birth defects in humans. Dr. Swan further claimed that several of the studies were conducted using insufficient populations or control groups, so that scientists wrongly calculated exposures to the drug. Dr. Swan viewed these and other factors as confounding the validity and power of such reports; however, both Drs. Glasser and Swan relied on these studies as the basis for their own recalculations, using a lower confidence interval that is claimed to derive a higher relative risk. Both experts concluded from their own reassessments that to a reasonable degree of epidemiological certainty, there is some association between Bendectin and limb reduction defects.

Although we agree with the defendant that its epidemiological studies and Dr. Lamm’s testimony constitute evidence on which a jury might ground a defendant’s verdict, we agree with the plaintiffs’ experts that this evidence is by no means conclusive. The defendant’s claim overstates the persuasive power of these statistical studies. An analysis of this evidence demonstrates that it is possible that Bendectin causes birth defects even though these studies do not detect a significant association.

Limb reduction defects occur in such a small percentage of both Bendectin and non-Bendectin live births — as noted, these occur in less than one in every 1,000 — that it would take a carefully controlled comparison of a very large number of births to instill confidence in the predictive power of the outcome. Also, many of the defendant’s studies apparently do not control for many factors that may be crucial for scientists to accord great weight to the studies, such as the stage of pregnancy during which the mother took Bendectin, the other drugs the mother may have taken, or other harmful conditions, natural or otherwise, that may have been part of the mother’s environment.
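The point about sample size can be made concrete with a rough calculation. The sketch below is illustrative only: it assumes the textbook two-proportion sample-size formula, a two-sided 5 percent significance level, 80 percent power, and a hypothetical doubling of risk. None of those parameters comes from the record, and the function name `births_per_group` is invented for this example. The only figure taken from the opinion is the baseline rate of about one limb defect per 1,000 live births.

```python
from math import ceil, sqrt
from statistics import NormalDist


def births_per_group(baseline_risk, relative_risk, alpha=0.05, power=0.80):
    """Approximate births needed in EACH group (exposed and unexposed).

    ASSUMPTIONS: equal group sizes, a two-sided test at level alpha, and the
    standard normal-approximation formula for comparing two proportions.
    """
    p1 = baseline_risk
    p2 = baseline_risk * relative_risk
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)


# Baseline of 1 in 1,000 (from the opinion); the doubling of risk is hypothetical.
print(births_per_group(0.001, 2.0))
```

On these assumptions, a study would need on the order of twenty thousand or more births in each group to detect reliably even a doubling of the limb-defect rate, and considerably more to detect a smaller increase. Several of the studies summarized above included far fewer Bendectin-exposed births than that.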

The Bendectin epidemiological studies are examples of a large number of studies of birth defects that demonstrate significant scientific uncertainty concerning their causes. A recent epidemiological study observes that “the etiology of congenital anomalies is poorly understood, with an estimated 60 [percent] of all human birth defects having no known cause.” Patricia Olshan et al., Paternal Occupation and Congenital Anomalies in Offspring, 20 Am.J. of Indus.Med. 447 (1991) (exploratory study finding a correlation between birth defects and the occupation of the father, but reasons for result yet unknown). The science of epidemiology is currently unable to identify the causes of many birth defects or to exclude from consideration many possible causes, including Bendectin and a host of other outside agents and environmental factors.

C. Plaintiffs’ Proof — Animal Studies

The cartilage cells that later become the bones of fingers and toes begin to form in the human embryo during the fourth through eighth weeks of pregnancy. The plaintiffs’ theory is that chemical compounds in Bendectin interfere with the formation of these cartilage cells, or chondrogenesis, and that this causal relationship is shown by animal experiments. The plaintiffs’ proof includes experiments with animal cells and embryos, known as in vitro studies, performed by developmental biologists to observe possible toxic effects on animal tissue when tested in petri dishes. Other animal experiments, in vivo studies, consisted of tests performed on animals such as rabbits, chickens, monkeys, rats and dogs to determine if Bendectin’s ingredients created birth defects at various dosage levels. In these experiments, doxylamine succinate, an ingredient of Bendectin, was injected into animal cells that produce or grow into cartilage that becomes the bones in which limb defects may occur. The plaintiffs’ scientific hypothesis based on these studies is this: Because doxylamine succinate interferes with cartilage cell formation in animal cells and test animals, Bendectin is “capable” of causing similar limb defects in humans. The following examples illustrate the nature and findings of these animal cell studies:

1. In vitro studies: As a developmental biologist, Dr. Newman stated that he performs experiments on embryonic cells in petri dishes to determine how those cells develop and create tissue. Due to their similarities to the human embryo, chicken and mice or rat embryos are most frequently used in these studies. Limb-forming cells are removed from the embryo — for chickens, wing and leg formation cells are used — and are isolated in a dish, where selected cells are treated with a suspected teratogen. Changes in cell differentiation between the control group of untreated cells and the exposed group are observed and recorded.

Dr. Newman pointed to experiments on rat cells with substances similar to doxyalamine succinate. In these in vitro tests, various defects including limb reduction were observed. Other in vitro tests performed by National Institute of Health experts and relied upon by Dr. Newman found that doxyalamine succinate interfered with cartilage development in mice and chicken limb cells. In one experiment, the addition of 10 micrograms of Bendectin to an animal cell culture reduced one of the components of cartilage cells, proteoglycan, by 30 percent. Similarly, 50 micrograms of Bendectin per milliliter of a culture reduced proteoglycan production by 50 percent, thus suggesting a strong teratogenic effect in the animal cells tested.

Like the other scientists who testify concerning animal experiments, Dr. Newman can only testify that these chemical compounds connected with Bendectin are “capable of causing” limb defects in humans, not that they do cause such defects.

2. In vivo studies: Dr. Gross, a pathologist and veterinary medical expert with the Environmental Protection Agency, described the nature of the in vivo studies proffered by the plaintiffs. In these experiments, suspected teratogens are administered to pregnant female animals. Shortly before their birth, the infants are removed from the mother and studied for defects. Dr. Gross examined a variety of these studies, including Bendectin experiments performed by the defendant on rats and rabbits. In one such study, doxyalamine succinate was given daily to female rabbits at three varying dosage levels. No defects were observed at the two lower levels; however, 40 percent of the litters born to females at the highest dosage had some congenital defects observed. As dosages were increased even higher, “outright death” of animal infants occurred. Dr. Gross rejected several of Merrell Dow’s other studies as being confounded by the presence of defects in the non-exposed control group, while other studies indicated defects that were assessed by Dr. Gross as being compatible with teratogenicity, although not all of the observed defects were limb-related. Dr. Gross stated that testing animals at levels higher than human dosages and then extrapolating the results to humans was a generally accepted practice. Based on his review of the studies, Dr. Gross gave his opinion that doxyalamine succinate in Bendectin has the “capacity” to interfere with human cell development at normal dosages but could not testify that it does cause such defects.

A recognized text on teratology states the customary scientific view that “it has become axiomatic in experimental teratology that agents capable of causing any adverse biological effects can usually also be shown to be embryotoxic under the right conditions of dosage, developmental stage, and species susceptibility,” and that “virtually all drugs and a great range of chemicals can indeed be shown to be embryotoxic under appropriate laboratory conditions.” James Wilson, Current Status of Teratology, in Handbook of Teratology 60 (J. Wilson & C. Fraser, eds. 1977). The author concludes that to “eliminate drugs and chemicals because they can be shown to be embryotoxic at high dosage would be unacceptable” because to do so “would eliminate most drugs and many useful chemicals upon which modern society depends heavily.” Id.

The weakness of the plaintiffs’ case results from the care with which reputable scientists use animal experiments to predict causation in humans. This weakness arises from the fact that different species of animals react differently to the same stimuli for reasons not entirely understood. Immune systems, nervous systems, and metabolisms (i.e., physical processing of chemical compounds) may differ greatly between species. No doubt there may be other animal experiments which, to cite one example, because of the extreme toxicity of the substance tested, would permit a reasonable jury to find that it is more probable than not that the substance causes a similar harm to humans. But Bendectin is not such a case.

The decisive weakness in the plaintiffs’ animal studies is that the factual and theoretical bases articulated for the scientific opinions stated will not support a finding that Bendectin more probably than not caused the birth defect here. On summary judgment, under the doctrine of Celotex Corp. v. Catrett, 477 U.S. 317, 106 S.Ct. 2548, 91 L.Ed.2d 265 (1986), and Anderson v. Liberty Lobby, Inc., 477 U.S. 242, 106 S.Ct. 2505, 91 L.Ed.2d 202 (1986), the expert evidence must show the elements required for a finding of causation. Here, except for Dr. Palmer’s testimony discussed below, the plaintiffs’ experts stop short of testifying that Bendectin more probably than not caused the birth defects in babies. They stop short because they have no factual or theoretical basis for a stronger hypothesis. They testify that the animal studies show that Bendectin is “capable of causing,” “could cause” or its effects are “consistent with causing” birth defects, not that it probably causes birth defects in general or that it did in this case. In short, they testify to a possibility rather than a probability.

Dr. Palmer, a medical doctor, is the only witness who testified in his affidavit that Bendectin caused Brandy Turpin’s defects. He stated:

It is my opinion ... that [animal in vivo and in vitro studies, and epidemiological and other human data] shows that Bendectin and specifically its component, doxyalamine succinate, has teratogenic properties.... I have also examined the medical records pertaining to Brandy Turpin and it is my opinion that Bendectin did cause the limb defects from which she suffers.

We cannot find, however, that this testimony is anything more than a personal belief or opinion. The grounds for his opinion are subject to the same criticism as the animal studies and epidemiological reanalyses submitted by the plaintiffs’ other experts: the evidence cited in support of his conclusion is insufficient to meet the plaintiffs’ burden of proof. Dr. Palmer does not testify on the basis of the collective view of his scientific discipline, nor does he take issue with his peers and explain the grounds for his differences. Indeed, no understandable scientific basis is stated. Personal opinion, not science, is testifying here. Dr. Palmer’s own expressed skepticism as to the value of extrapolating human conclusions from animal studies further confounds the issue. Upon analysis, we conclude that Dr. Palmer’s conclusions go far beyond the known facts that form the premise for the conclusion stated. This conclusion so overstates its predicate that we hold that it cannot legitimately form the basis for a jury verdict. Beyond that, Dr. Palmer’s opinion testimony, to the extent that it is personal opinion as described above, is inadmissible. Fed.R.Evid. 703; see also Viterbo v. Dow Chem. Co., 826 F.2d 420, 423-24 (5th Cir.1987) (physician’s unsupported personal opinion of causation held inadmissible), and Calhoun v. Honda Motor Co., 738 F.2d 126, 131-32 (6th Cir.1984) (expert testimony must be based on the evidence, so as to be removed from the realm of guesswork and speculation).

We do not mean to intimate that animal studies lack scientific merit or power when it comes to predicting outcomes in humans. Animal studies often comprise the backbone of evidence indicating biological hazards, and their legal value has been recognized by federal courts and agencies. See, e.g., International Union, UAW v. Johnson Controls, Inc., — U.S. -, 111 S.Ct. 1196, 1215, 113 L.Ed.2d 158 (White J., concurring) (citing Industrial Union Dep’t v. American Petroleum Institute, 448 U.S. 607, 657 n. 64, 100 S.Ct. 2844, 2871 n. 64, 65 L.Ed.2d 1010 (1980)); Environmental Defense Fund, Inc. v. EPA, 548 F.2d 998, 1006-07 (D.C.Cir.1976); Proposed Guidelines for Assessing Female Reproductive Risk, 53 Fed.Reg. 24,834, 24,836-39 (1988) (discussing the use of animal studies to identify and assess reproductive hazards for human females); Proposed Guidelines for Assessing Male Reproductive Risk, 53 Fed.Reg. 24,850, 24,853-60 (1988) (discussing the use of animal studies to identify and assess reproductive hazards for human males).

Here, the record’s explanation of the animal studies is simply inadequate. Although the animal studies themselves may have been scientifically performed, the exact nature of these tests is explained only in general terms. The record fails to make clear why the varying doses of Bendectin or doxyalamine succinate given to the rats, rabbits and in vitro animal cells would permit a jury to conclude that Bendectin more probably than not causes limb defects in children born to mothers who ingested the drug at prescribed doses during pregnancy. The analytical gap between the evidence presented and the inferences to be drawn on the ultimate issue of human birth defects is too wide. Under such circumstances, a jury should not be asked to speculate on the issue of causation.

Accordingly, the judgment of the District Court is AFFIRMED. 
      
      . Several statistical terms of art are essential to an understanding of the meaning of these epidemiological studies. The risk of injuries from a suspected cause is expressed as relative risk. To calculate relative risk, the number of occurrences of a given birth defect in an exposed group is divided by the number of occurrences in the control, or unaffected group. If the given defect occurs with equal frequency between the exposed and control groups, the relative risk would be 1.0. A relative risk of 1.0 is considered inconclusive, in that a researcher cannot state that a suspected agent does or does not cause a defect (i.e., the “null hypothesis” or “no association”). See generally Allen Mitchell et al., Adverse Drug Effects and Drug Surveillance, in Pediatric Pharmacology 68-69 (Sumner Yaffe ed. 1980). A relative risk of less than 1.0 suggests that a suspected agent does not cause a birth defect. A relative risk greater than 1.0 suggests that the substance may cause a given birth defect.
      
        To gauge the reliability and credibility of their reports when repeated randomly, statisticians use a device known as the confidence interval. The confidence interval is not a “burden of proof” in the legal sense; rather, it is a common sense mechanism upon which statisticians rely to confirm their findings and to lend persuasive power within their profession. The confidence interval has two components: a percentage, and an interval or range. The percentage part is established by the statistician in advance of performing the studies. Frequently this percentage is set at 95 percent, although that value is somewhat arbitrary and 85 or 90 percent figures are also used. The interval, on the other hand, represents a range of possible values at high and low ends of a scale of relative risk. See, e.g., Kenneth Rothman, Modern Epidemiology 119 (1986). At a 95 percent interval the true relative risk value will be between the high and low ends of the confidence interval 95 percent of the time. See Neil Cohen, Confidence in Probability: Burdens of Persuasion in a World of Imperfect Knowledge, 60 N.Y.U.L.Rev. 385, 398-400 (1985) [hereinafter Confidence in Probability ], for examples of confidence intervals and their use.
      To better understand confidence intervals, it may be helpful to picture a line, marked at hundredths intervals and extending from zero to infinity. The marking at 1.0 represents a relative risk of 1.0, the “null value.” If a confidence interval of “95 percent between 0.8 and 3.10” is cited, this means that random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10. Because this confidence interval includes relative risk values both less than and exceeding 1.0, the null value, a researcher cannot state that the results are statistically significant. David Kaye, Is Proof of Statistical Significance Relevant?, 61 Wash.L.Rev. 1333, 1343-44 (1986), cited in DeLuca v. Merrell Dow Pharmaceuticals, Inc., 911 F.2d 941, 948 (3rd Cir.1990). For an example of such a study, see Jose Cordero et al., Is Bendectin a Teratogen?, 245 JAMA 2307 (1981), cited infra. Similarly, it is possible that a range may be entirely below 1.0, meaning that the agent does not cause birth defects. If, however, the confidence interval spans a range entirely above 1.0 — e.g., from 1.75 to 10.75 — then this interval would be statistically significant and would show a greater likelihood that the suspected agent did cause the studied defect. For an example of such an interval, see Brenda Eskenazi & Michael Bracken, Bendectin (Debendox) As a Risk Factor for Pyloric Stenosis, 144 Am. J. Obstetrics & Gynecology 919 (1982), cited infra.
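      The arithmetic described above can be illustrated with a minimal Python sketch. It is an illustration only, under stated assumptions: the function names `relative_risk` and `read_interval` are hypothetical, the counts are invented and come from no Bendectin study, and the sketch compares rates (occurrences divided by group size), which matches the count ratio described above only when the exposed and control groups are the same size.

```python
def relative_risk(exposed_cases, exposed_total, control_cases, control_total):
    """Rate of the defect in the exposed group divided by the rate in the control group."""
    exposed_rate = exposed_cases / exposed_total
    control_rate = control_cases / control_total
    return exposed_rate / control_rate


def read_interval(low, high, null_value=1.0):
    """Say where a confidence interval sits relative to the null value of 1.0."""
    if low > null_value:
        return "entirely above the null value"
    if high < null_value:
        return "entirely below the null value"
    return "includes the null value"


# Hypothetical counts: the defect is equally frequent in both groups.
print(relative_risk(10, 1000, 10, 1000))

# The two intervals used as illustrations in the text above.
print(read_interval(0.8, 3.10))
print(read_interval(1.75, 10.75))
```

      An interval reported as including the null value corresponds to a result that is not statistically significant in the sense described above; one lying entirely above it corresponds to a statistically significant association.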
      
      The sample size for any study also has an effect, both on the confidence interval and the "power" of the study. Power is the study’s probability of detecting a difference in outcomes between exposed and nonexposed groups. See Office of Technology Assessment, Report No. OTA-BA-266, Reproductive Health Hazards in the Workplace 166 (1985). The higher the study’s power, the stronger are its conclusions and findings regarding its outcome. If a sample population is small, however, the power of the study will likely be less. The information behind the study is less, and the confidence interval will likely span a wider range for a smaller sample group than for a larger one. The power is less for the smaller group than for the larger group, even though the confidence interval may still be set at 95 percent, because the smaller study's predictive value is lessened by a wide confidence interval range. A statistician could accordingly describe the probability of choosing an expected outcome in the larger study as being greater than in the smaller study. Cohen, Confidence in Probability, at 398-99.
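      The effect of sample size on the width of a confidence interval can be sketched in Python as follows. This is a hedged illustration, not a method drawn from the record: it assumes a common large-sample approximation (a normal approximation on the logarithm of the relative risk, with 1.96 as the multiplier for a 95 percent interval), the function name `approximate_interval` is hypothetical, and the counts are invented.

```python
import math


def approximate_interval(exposed_cases, exposed_total,
                         control_cases, control_total, z=1.96):
    """Approximate confidence interval for a relative risk (log-scale normal approximation)."""
    rr = (exposed_cases / exposed_total) / (control_cases / control_total)
    # Standard error of ln(relative risk) under the assumed approximation.
    se = math.sqrt(1 / exposed_cases - 1 / exposed_total
                   + 1 / control_cases - 1 / control_total)
    low = math.exp(math.log(rr) - z * se)
    high = math.exp(math.log(rr) + z * se)
    return low, high


# Same defect rate (1 percent) in both studies; only the sample size differs.
small_study = approximate_interval(10, 1000, 10, 1000)
large_study = approximate_interval(1000, 100000, 1000, 100000)

print("small study:", small_study)
print("large study:", large_study)
```

      Under these assumptions the smaller study yields the wider interval around the same relative risk, which is the relationship the preceding paragraph describes.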
      As can be seen, in many aspects, the concepts of confidence intervals, sampling sizes, population and power are mutually interdependent. For an overview of how epidemiology is used in risk assessment, see Proposed Guidelines for Assessing Female Reproductive Risk, 53 Fed. Reg. 24,834, 24,840-41 (1988).
     
      
      . While scientists use confidence intervals as a common-sense device to give professional weight to their results, such confidence intervals are not the same as the preponderance of the evidence standard of proof. That standard requires proving one’s case by the greater weight of the evidence. Where the weight is equally divided between the plaintiff and the defendant, the party bearing the burden of proof must lose. Reduced to a percentage, this requires proof of one’s case to at least 51 percent of the evidence.
     
      
      . As stated in his affidavit: “It is my opinion ... specifically with respect to my experience with limb development and knowledge of the impact of the antihistamine Bendectin on cell function, that Bendectin is capable of interfering with the development of the limb in the human being and causing birth defects in the developing limb ... and that Bendectin is a human teratogen.” Dr. Newman states further that, although he is not a medical doctor, his review of the medical files of Brandy Turpin and her mother indicates the infant plaintiff’s defects are “consistent with the effects of teratogen,” with Bendectin being “capable” of being such a teratogen. Aff., Dr. Newman, Turpin v. Merrell Dow, No. 84-105 (E.D.Ky. Mar. 27, 1990), at 8, 10.
     
      
      . Several animal studies of cortisone, for example, found that it causes severe cleft palate birth defects in several animal species, but it does not cause this effect in humans. Alfred Bongiovanni & Arthur McPadden, Steroids During Pregnancy and Possible Fetal Consequences, 11 Fertility & Sterility 181, 184-45 (1960) (“With doses more closely resembling those employed in medical practice,” doses of cortisone led to frequent cleft palate defects in several species, especially mice, while surveys of human pregnancies demonstrated that almost none of this defect could be attributed to cortisone treatments. “It would appear, therefore, that in spite of the awesome results reported in other species, the human fetus is rarely injured by maternal treatment with corticoids.”). See also Beverley Murphy & Charlotte Branchard, Fetal Metabolism of Cortisol, 5 Current Topics in Experimental Endocrinology 197, 221 (1983) (injections of cortisone will precipitate labor in pregnant sheep, cows, goats, rabbits, and rats; this does not occur in humans), and Elwyn Grimes, Jamil Fayez & Gerald Miller, Cushing’s Syndrome and Pregnancy, 42 Obstetrics & Gynecology 550, 558 (1973) (accepting findings of the Bongiovanni and McPadden report).
     