
    John F. Carroll et al., Petitioners, v Juan U. Ortiz, as Director of the Department of Personnel of the City of New York, et al., Respondents.
    Supreme Court, Special Term, New York County,
    December 22, 1983
    APPEARANCES OF COUNSEL
    
  Gordon, Schechtman &amp; Gordon (Murray A. Gordon, Kenneth E. Gordon and Richard Imbrogno of counsel), for petitioners. Frederick A. O. Schwarz, Jr., Corporation Counsel (Seth J. Cummins, Judith A. Levitt, Thomas A. Crane and Michael C. Harwood of counsel), for respondents.
   OPINION OF THE COURT

Arnold Guy Fraiman, J.

By this proceeding brought pursuant to CPLR article 78, petitioners, 307 sergeants in the New York City Police Department, seek an order invalidating part III of examination No. 1613 for promotion to the rank of lieutenant in the police department. Examination No. 1613 consisted of three parts: the administrative test, the technical knowledge test and part III, the interactive test. The administrative and technical knowledge tests were written tests administered in June, 1982. Only candidates who passed both tests were eligible to take the interactive test, which was oral. A candidate had to pass all three parts of the examination to be placed on the lieutenants eligibility list. The interactive test was given on 11 separate days over a two and one-half week period in October, 1982.

All of the petitioners passed the first two parts of examination No. 1613 and were among 1,043 candidates who took the interactive test. Originally, a score of 4.0 out of a possible 7.0 was deemed a passing grade for part III, but this was subsequently reduced to 3.8. Petitioners all initially received grades below 3.8 and accordingly were not placed on the lieutenants eligibility list. However, as a result of a recomputation of scores necessitated by a mathematical adjustment in the method of scoring, and an administrative appeal procedure conducted in the summer of 1983, approximately 90 of the petitioners were ultimately credited with passing scores.

The interactive test consisted of a simulated meeting between a candidate, who was asked to assume the role of “Lieutenant Talbot”, a lieutenant recently assigned to a hypothetical precinct in the city, and an actor who played the role of “Sergeant Warner”, the precinct’s patrol supervisor. Thirty-four different actors or role players selected from the police department, and the Housing Authority and Transit Authority police departments, played the role of Sergeant Warner over the course of the test. The meeting between the candidate and Sergeant Warner lasted for 30 minutes and was video taped. Immediately prior to assuming the role of Lieutenant Talbot at the meeting, the candidate was furnished with a packet of materials, which he was given 45 minutes to review. The materials consisted of memoranda and reports dealing with 10 separate issues or problems in the precinct, which Lieutenant Talbot’s commanding officer had asked him to discuss with Sergeant Warner and to take action on.

The interactive test was designed to measure the candidate’s ability to perform 32 different “task statements”, each representing a particular behavior or activity purportedly required for the job of police lieutenant. The task statements were classified into five separate skill groups or clusters which were identified as “directing others”, “monitoring”, “coaching”, “training”, and “resolving conflicts”. Some or all of the skills were required for the handling of each of the 10 problems the candidate, as Lieutenant Talbot, was confronted with.

Among the written instructions given each candidate prior to his meeting with Sergeant Warner was that he “should keep in mind that the responsibilities of a Lieutenant include monitoring performance, directing others, training (including giving instructions), giving guidance to subordinates and resolving conflicts.” In addition, the first “memorandum” in the packet which was provided each candidate was addressed to Lieutenant Talbot from his precinct commander, and advised him that he was to monitor Sergeant Warner’s performance, provide specific direction to him, assist him in directing subordinates, as well as resolving conflicts among them, and see that he (Sergeant Warner) received adequate training and guidance. Thus, prior to the interview session the candidates were specifically made aware of the skills which the interactive test was designed to measure.

Each candidate’s performance was graded separately by two raters or assessors who viewed the candidate’s video tape. A total of 34 assessors rated the candidates who took the interactive test. To facilitate their grading, the assessors were furnished with a task cluster checklist. This required them to assign a grade ranging from plus three to minus three to checklist statements which were subdivided into the five clusters or skill groups. Thus, under “directing others”, there were seven checklist statements to be graded; under “coaching” an additional seven checklist statements were listed; “training” contained five checklist statements; “monitoring” had eight checklist statements; and “resolving conflicts” had five. In all, 32 checklist statements were graded. A typical checklist statement (this one taken from the “directing others” cluster) was as follows: “With respect to Item 9, the memo from Captain Caster [the Precinct commander] regarding Operations Order 33, did the participant inform Sgt. Warner that a staff meeting should be held concerning the implementation of this order in the Precinct? (To what extent was Warner given specific instructions on how to implement/communicate this order in the Precinct?)”

The assessors were instructed that a grade of plus three should be given when the memo was discussed and handled extremely well; plus two, when the memo was discussed and handled well; plus one, when the memo was discussed and handled satisfactorily or slightly better than satisfactorily; minus one, when the memo was discussed but handled slightly less than satisfactorily; minus two, when the memo was discussed but handled poorly; and minus three, when the memo was not discussed at all.

The grades for the checklist statements in each of the five clusters were separately totaled and then converted by means of a conversion table to a rating between one and seven for each cluster. The two assessors then compared their ratings for each cluster. Where there was a difference of more than one point on any cluster, the assessors discussed and reviewed their checklists and notes concerning the items within that cluster. They then independently rerated the candidate on the cluster or clusters on which their scores differed by more than one point. The candidate’s final over-all score was then computed by averaging the scores of the two assessors for each cluster, and then combining them after weighing them according to the number of checklist statements within each cluster. As indicated, the passing grade was 3.8.
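The scoring arithmetic described above can be restated, purely as an illustration, in a short sketch. The cluster names and checklist counts are taken from the opinion; the conversion from raw checklist totals to a 1-to-7 rating is hypothetical, since the actual conversion table is not reproduced in the record, and a simple linear rescaling stands in for it.

```python
# Illustrative sketch of the interactive-test scoring described in the
# opinion. Cluster names and checklist counts are from the record; the
# conversion table is NOT, so a linear rescaling is assumed here.

CLUSTER_SIZES = {
    "directing others": 7,
    "coaching": 7,
    "training": 5,
    "monitoring": 8,
    "resolving conflicts": 5,
}  # 32 checklist statements in all

def cluster_rating(checklist_grades):
    """Map a cluster's raw checklist grades (each -3..+3) to a 1-7 rating.

    A linear rescaling is assumed in place of the actual conversion table.
    """
    n = len(checklist_grades)
    total = sum(checklist_grades)              # ranges from -3n to +3n
    return 1 + 6 * (total + 3 * n) / (6 * n)   # linear map onto [1, 7]

def final_score(assessor_a, assessor_b):
    """Average the two assessors' cluster ratings, then weight each
    cluster by its number of checklist statements, per the opinion."""
    total_statements = sum(CLUSTER_SIZES.values())
    score = 0.0
    for cluster, n in CLUSTER_SIZES.items():
        avg = (assessor_a[cluster] + assessor_b[cluster]) / 2
        score += avg * n / total_statements
    return score
```

On this sketch, a candidate whom both assessors rated 4.0 on every cluster would receive a final score of 4.0, above the 3.8 passing grade.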

Petitioners seek to invalidate the interactive test on the ground that it did not meet the requirements of section 6 of article V of the New York State Constitution. That section provides in relevant part that “[appointments and promotions in the civil service of the state and all of the civil divisions thereof * * * shall be made according to merit and fitness to be ascertained, as far as practicable, by examination which, as far as practicable, shall be competitive”. Specifically, petitioners contend that the interactive test was not competitive on three principal grounds. First, it is argued that the performances of the different actors who played the role of Sergeant Warner were so uneven that the identity of the role player had a statistically significant effect upon the grades of the candidates. Second, a similar argument is made with respect to the assessors. And third, it is contended that candidates who took the interactive test during the second and third weeks in which it was given had an advantage over those who took it in the first week because they were able to ascertain the nature and format of the test from the candidates who preceded them. Petitioners raise a number of other objections to the examination, mainly concerning the manner of its scoring, but the court has considered these and found them to be without merit and will limit its discussion to the three issues referred to above. Before doing so, some discussion must be had concerning the background of the interactive test.

Prior to examination No. 1613, examinations for promotion in the police department were traditionally in written form and consisted of multiple choice or similar type questions which could be graded on a completely objective basis. However, in 1980 the United States District Court held that a police department entrance level examination given the previous year, which was in the traditional form, unjustifiably discriminated against blacks and Hispanics in violation of title VII of the Civil Rights Act of 1964 (US Code, tit 42, § 2000e-2). (Guardians Assn. v Civil Serv. Comm., 484 F Supp 785, mod 630 F2d 79, cert den 452 US 940.)

In affirming the lower court’s decision, the Court of Appeals for the Second Circuit faulted the city for its failure to seek outside assistance from a firm specializing in test preparation in developing the challenged examination. It noted that “employment testing is a task of sufficient difficulty to suggest that an employer dispenses with expert assistance at his peril.” (Guardians Assn. v Civil Serv. Comm., 630 F2d 79, 96, supra.) Even more significantly, it concluded that the written multiple choice examination which had been employed failed to adequately test human relations skills, one of the four abilities it sought to test (the others were remembering details, filling out forms, and applying general principles to specific facts). The court conceded the difficulty in assessing this somewhat abstract skill, but held, nonetheless, that if it constituted a significant aspect of the job it could still be tested, with the caveat that “[t]estmakers will be well advised to obtain highly qualified assistance in constructing this portion of an exam.” (Supra, p 97.) It then made the following suggestion: “One desirable approach would be to confront applicants with simulated real life situations and assess the appropriateness of their volunteered response. See Firefighters Institute for Racial Equality v. City of St. Louis, 616 F.2d 350 (8th Cir. 1980). That technique is normally too costly for large numbers of applicants, but might have usefulness as a testing device to be used toward the end of the overall selection procedure, after an initially large group of applicants has been narrowed down by the results of a written exam and a background check.” (Supra, p 97.)

In an effort to comply with the recommendations and holdings of the Federal court, the city, contrary to its usual procedure of preparing its own examinations, retained Assessment Designs, Inc. (ADI) to develop and prepare examination No. 1613. ADI had extensive experience in the area of personnel evaluation and research and much of its work had been done for law enforcement agencies. Its task was to design an examination which would meet the requirements of both section 6 of article V of the New York State Constitution and title VII of the Civil Rights Act. To fulfill its mandate ADI made an in-depth analysis of the job of police lieutenant, to insure that the examination would actually test for what police lieutenants do. Its analysis consisted, first, of interviewing, at considerable length, 46 incumbent lieutenants and five captains concerning the work of a lieutenant. In addition, the patrol guide, operations orders, legal bulletins, policy memoranda and other department materials were reviewed for a further understanding of the duties and responsibilities of a lieutenant. All of the information thus obtained was consolidated into a so-called task analysis questionnaire which listed 134 “task statements” (e.g., “exchanges information with dispatcher”; “resolves conflicts among subordinates”; “monitors inventory of supplies and equipment”), and 60 “knowledges” which ADI found were most commonly noted as performed or known by lieutenants. The questionnaire was sent to the approximately 744 incumbent lieutenants. Five hundred and four responses were received. The lieutenants completing the questionnaire were asked to rate each task or knowledge statement on a scale of 1 to 5 for its importance, and on a similar scale for how frequently the task occurred or the knowledge was used. In addition, the tasks and knowledges were rated for three other factors: (1) available time; (2) length of time; and (3) source of training.
With respect to available time, the questionees were asked as to each listed task whether they would normally have sufficient time to seek assistance from another source if they were not sure how to do it. With respect to the length of time factor, they were asked to indicate how long it would take to learn the task if one did not know how to perform it the first day on the job. And with respect to the source of training factor, questionees were asked the source of knowledge for each task (e.g., reading, practice on the job, formal training class, etc.).

From the responses received, ADI prepared a comprehensive job analysis report which identified a number of task statements involving interactive relationships with others that a police lieutenant most often is called upon to employ. It was to test these task statements, which have been subsumed into the five clusters or skill groups referred to above, i.e., directing others, monitoring, coaching, training, and resolving conflicts, that the interactive test was designed.

ADI determined that these human relations skills could best be tested by placing the candidate in a simulated real-life situation requiring the employment of such skills, and evaluating his responses. While this is the first instance in which an oral test was given as a part of a police promotion examination, the use of oral examinations is not precluded by section 6 of article V of the New York Constitution. (Matter of Sloat v Board of Examiners, 274 NY 367; Matter of Fink v Finegan, 270 NY 356.) The determination of whether it is appropriate to employ an oral test depends upon whether it has a reasonable and rational basis, that is, whether the position which is being tested for requires interactive skills not easily measured by a written test. (Matter of Dixon v Bahou, 67 AD2d 767, 768.) The instant case clearly meets this criterion. The skills tested by the interactive test are commonly performed by lieutenants in the police department, as disclosed by ADI’s job analysis report, and the test formulation appropriately tested these skills. Moreover, each of the skills is used in interrelationships with others and cannot be effectively evaluated by a written examination.

We turn now to a consideration of the three specific defects cited by petitioners with respect to the manner in which the test was given and how it was graded. It is petitioners’ contention that singly or collectively the three alleged defects resulted in the interactive test not being a competitive examination. First, petitioners argue that there was a wide divergence in behavior by the various role players acting as Sergeant Warner and that as a result, there was a statistically significant disparity (i.e., one not due to random chance) in grades between candidates, depending upon which of the 34 actors the candidates were confronted with.

In support of this contention, petitioners point to instances in which role players failed to respond in a uniform fashion to inquiries by candidates; to instances in which role players were exceptionally reticent or unresponsive; and to instances in which role players displayed varying degrees of familiarity with the issues discussed with the candidates. According to petitioners’ experts, these differences among the 34 role players resulted in a statistically significant variance in the candidates’ mean score for each of the five skill groups or clusters and this alleged variance accounted for between 5% and 6% of the total variance in such scores.

Respondents’ expert concedes that 4% of the variance in the candidates’ final scores, as distinguished from the variance in the mean scores for each cluster, was attributable to the role players. However, the presence of a statistically significant variance, not attributable to the candidates themselves, does not, standing alone, require a determination that the examination should be nullified because it was not competitive. Section 6 of article V of the New York Constitution requires that a civil service examination be competitive “as far as practicable”. With 1,043 candidates taking the interactive test, culled from an initial pool of 2,153, it is clear that numerous role players had to be used, thus precluding the degree of uniformity which would otherwise be attainable from the use of a single role player for all 1,043 candidates. In an effort to minimize the inevitable variance resulting from the use of multiple role players, a detailed role play guide was issued to each actor. This contained suggested responses to all questions which were anticipated with respect to each of the 10 problems the candidates were to discuss. In addition, the role players were given two days of extensive training which included the viewing of video tapes of a hypothetical interactive exercise, and practice sessions in which the actors actually assumed the role of Sergeant Warner with another actor playing the role of Lieutenant Talbot.

The court is satisfied that these measures insured that the role players’ collective behavior was as uniform as practicable under the circumstances. In an article 78 proceeding, absent a finding of arbitrariness or capriciousness, the court may not substitute its judgment for that of respondents’ experts. (Matter of Bruno v LeBow, 95 AD2d 731, 732.) Here, respondents’ experts have concluded that even using petitioners’ statistics, the percentage of role-player induced variance is well within the acceptable range for an interactive exercise.

Petitioners next contend that the interactive test should be nullified because there were statistically significant differences among the average scores given by the assessors for each of the five task group clusters. It is alleged that this ranged between 7% and 9% of the total variation in the candidates’ test scores. Petitioners also allege that an analysis of the ratings assigned for each of the 32 checklist statements by the two assessors who graded each candidate discloses that they disagreed on 52.2% of all such ratings. It is urged that this percentage indicates such a low degree of inter-rater reliability that it constitutes a further basis for invalidating the test.

Respondents concede that the variation in assessors’ scores for each of the task group clusters is statistically significant, but they compute the percentages of total variance attributable to the assessors as being between one quarter and one half less than petitioners’ percentages. Further, they note with some merit the inappropriateness of evaluating results on the cluster level when promotion is based upon the candidate’s over-all score. The variance due to the assessors for over-all scores was 6.7% according to respondents.

Regardless of whose statistics are adopted, or whether the cluster score or final score is used, it is the opinion of respondents’ experts that, while statistically significant, the variance percentages are not of such practical significance as to nullify the examination on the ground of noncompetitiveness. The court concurs in this opinion. Grading an interactive exercise obviously cannot be done with the same precision as grading a completely objective written test containing only multiple-choice questions. (Matter of Sloat v Board of Examiners, 274 NY 367, 373, supra.) The assessors received extensive training for their assignment. The methodology employed in scoring the test, which was an assessment center technique fully in accord with existing approved testing methods, was designed to minimize assessor variance. The court believes these measures achieved their objective. Nor have petitioners demonstrated that the variances they rely upon were excessive for this type of examination.

Comment must also briefly be made concerning petitioners’ claim of an alleged lack of inter-rater reliability, as reflected in the 52.2% rate of disagreements on checklist scores by the two members of the assessor team assigned to grade each candidate. This statistic is seen in its proper light only when it is noted that overall, the assessor teams were in agreement within one unit in their ratings on over 90% of the checklist statements. Stated another way, they were in complete accord in their ratings on about 48% of the statements and within one unit of each other on another 42.8% of the statements. The court believes that, contrary to petitioners’ contention, this reflects a high degree of inter-rater reliability.
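The court's arithmetic on inter-rater agreement can be restated in a few lines. The disagreement rate and the within-one-unit rate are the figures recited in the opinion; exact agreement is simply the complement of the disagreement rate petitioners cite.

```python
# Restating the inter-rater agreement figures from the opinion.
disagreement_rate = 52.2                       # petitioners' figure: any difference at all
exact_agreement = 100.0 - disagreement_rate    # 47.8, the "about 48%" in the opinion
within_one_unit = 42.8                         # ratings differing by exactly one unit
agreement_within_one = exact_agreement + within_one_unit
print(round(agreement_within_one, 1))          # 90.6 -- the "over 90%" the court relies on
```

Viewed this way, the 52.2% "disagreement" figure counts every difference, however slight, while over nine tenths of all paired ratings fell within one unit of each other.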

Finally, it should be noted that any individual candidate who believes that he was not graded fairly by the assessors may have the scoring of his test reviewed by the court in a so-called Acosta hearing. (Matter of Acosta v Lang, 13 NY2d 1079.) While the court is cognizant that such a hearing is designed only to review the grading of an individual candidate’s test performance, rather than pass upon the validity of the examination itself, such a procedure does provide an opportunity for the rectification of specific grading errors. As indicated, the court does not believe that the degree of unreliability evidenced by petitioners reached a level sufficient to warrant the invalidation of the entire test.

The third basis alleged for invalidating the interactive test is the claim that candidates taking the test in the second and third week in which it was given had an advantage over those who took it in the first week, thus rendering the test noncompetitive. It is alleged that the advantage enjoyed by those who took the test after the first week derived from an absence of security which enabled them to learn of the format and content of the test. This allegedly permitted them to study and rehearse the 10 problems or scenarios they were confronted with, and thereby attain higher scores than the candidates who were tested during the first week. Specifically, it is alleged that 58.9% of the candidates who took the test in the first week received a passing grade, as compared to 64.8% of the candidates in the second and third weeks combined who received passing grades. It is further alleged that an analysis of the test scores disclosed that in the first week a candidate was significantly more likely to receive a very high or a very low score than was a candidate taking the test in the second or third week. That is, there was a much smaller spread in the test scores of the candidates who took the test during the second and third weeks, their scores being bunched much more closely to the mean than the scores of the first-week candidates.

The interactive test was given on four days during the week of October 11, 1982, five days during the following week, and two days during the week of October 25. Eleven testing days were required because only 10 candidates at a time could be tested in view of the limited availability of video tape facilities and role players. After completing the test, candidates were not permitted to take any of the written materials or notes out of the test rooms, but no restrictions were placed upon their discussing the test with other candidates.

In the absence of such a restriction, it must be assumed that many of the candidates who took the examination after the first day discussed it with those who had already taken it and were advised of its format and some or all of the substantive problems which were to be considered. However, the effect such foreknowledge had on the grades of the candidates who were its beneficiaries was insignificant. The daily mean scores of the candidates did not vary significantly over the 11 days on which the test was administered. With respect to the pass rate, while, as indicated, it was 58.9% for the first-week candidates and 64.8% for the second- and third-week candidates combined, this is not a statistically significant variance. Statistical significance is generally recognized where there is a probability of 5% or less that the difference is due to random chance. Petitioners concede that there is a probability of 5% to 7% that the variance between the percentage of passing grades for the first week’s candidates and the passing grades for the second and third week’s candidates was due to random chance. Further, when an analysis is made which compares, separately, the pass rate for each of the three weeks in which the test was given, which would appear to be more consistent with proper statistical practice than a comparison of week one to weeks two and three combined, the probability that the variances are due to random chance rises to 16%, according to petitioners’ own expert.
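The significance reasoning above can be sketched with a standard two-proportion z-test. The pass rates are those recited in the opinion; the weekly group sizes are hypothetical, since the opinion gives only the 1,043 total, spread over 4 first-week and 7 later testing days (roughly 95 candidates per day).

```python
# Sketch of the two-proportion test behind the "random chance" discussion.
# Pass rates are from the opinion; the group sizes (380 and 663) are an
# ASSUMED even split of the 1,043 candidates across the 11 testing days.
from math import sqrt, erf

def two_proportion_p_value(p1, n1, p2, n2):
    """Two-sided p-value for the difference between two sample proportions,
    using the pooled-variance normal approximation."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p2 - p1) / se
    # two-sided tail probability from the normal CDF
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

# Hypothetical split: week one vs. weeks two and three combined
p = two_proportion_p_value(0.589, 380, 0.648, 663)
```

Under these assumed group sizes the p-value lands in the 5% to 7% band petitioners concede, above the conventional 5% threshold, which is why the difference in pass rates is not statistically significant.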

This absence of statistical significance in the variance in the daily mean scores of candidates and in the passing grade rates is consistent with respondents’ contention that advance coaching or studying for this type of test is not fruitful. The interactive test was designed to measure a candidate’s ability to interact with a subordinate. It was not designed to test his technical knowledge or administrative skills, which was the purpose of the other two parts of examination No. 1613. Thus, whether the technical information conveyed by the candidate to Sergeant Warner was accurate, or whether the administrative procedures he employed were in accordance with police department regulations, was totally irrelevant in determining his score. Indeed, the assessors had no police training and would not have known whether the information or procedures were correct. Accordingly, advance knowledge and concomitant study and coaching with respect to the 10 substantive problems on the test would have been unrewarding, as in fact, the statistics indicate.

With respect to the bunching phenomenon observed by petitioners in connection with the candidates who took the test in the second and third weeks, even assuming its statistical significance, the court fails to see its relevance in determining the validity of the test, since it has no relationship to the candidates’ mean scores or passing rates.

Finally, the court does not believe there was any viable alternative to the manner in which the interactive test was administered. It obviously could not have been given on a single day. To have varied the test from day to day would have created more problems than it would have solved. And an instruction to candidates that they were not to discuss the test could not have been enforced.

In conclusion, the court finds that the interactive test was as competitive as practicable within the meaning of section 6 of article V of the New York Constitution, and was designed in compliance with the mandate of the Federal court in Guardians Assn. v Civil Serv. Comm. (484 F Supp 785, mod 630 F2d 79, supra). The petition to invalidate it is dismissed. 
      
       As indicated, this checklist statement was concerned with one aspect of the candidate’s handling of item No. 9 of the 10 issues or problems he had been instructed to discuss with Sergeant Warner. At least two and as many as six checklist statements related to each of the 10 issues or problems.
     