Response to Bem’s Comments

James Alcock

James Alcock responds to Daryl Bem’s comments on his critique.

Note: This post is a response to Daryl Bem’s comments, which can be found here. Alcock’s original article may be viewed here.

Outrage and ad hominem condescension. Bem’s response brings to mind
an old adage from the legal world: “If the facts are against you,
argue the law; if the law is against you, argue the facts; if the facts
and
the law are against you, yell like hell.” And yell like hell
he does. Rather than deal with much of the substance of my critique,
he directs his attack at my abilities. However, the issue is neither
my intelligence nor his, neither my character nor his. It is about the data
and the way in which they were gathered and analyzed.

Not only
does he shoot the messenger; much of his defence rests on an appeal
to authority – in this case, the authority of the reviewers at the
Journal of Personality and Social Psychology
who recommended that
his article be published. While I do not understand why they decided
as they did, their decision cannot make a silk purse out of a sow’s
ear; it does not make the serious flaws in this research go away.
Bem also employs the tired defence that it is not enough for a critic
to find flaws in a study; the critic must also show that the obvious
flaws could themselves account for the observed results. This is of
course simply nonsense. The burden of proof is on the individual who
presents the data, and significant flaws in the research militate against
confidence that the researcher did not make other undetected and unreported
errors as well. This is all the more concerning when one is claiming
evidence for phenomena that contradict well-established knowledge in
physics, neurology and psychology.

In response
to his various points:

    1. He accuses me of
      an imaginative rewriting of parapsychological history. However, in essence,
      I simply pointed out that none of the much-touted supposed parapsychological
      “breakthroughs” of the past have succeeded in establishing the reality
      of psychic phenomena so far as the larger scientific community is concerned.
      If he believes otherwise, then it is his imagination and not mine that
      is at play.
    2. Bem chooses not
      to address most of my criticisms about his procedure, but instead misleadingly
      states that my major criticism is the “selection and deployment of
      the pictorial stimuli used in six of the nine experiments.”
      My major methodological criticism actually has to do with the chaotic,
      careless nature of his procedures; his “selection and deployment of
      stimuli” only reflects the more general problem, and is not the key
      problem itself. Whether gay participants were provided with homosexual
      pictures is not in and of itself at issue. However, his response actually
      adds further obfuscation. Let me explain: Although the presumed impact
      of erotic stimuli is central to several of his experiments, he made
      no effort to assess whether any of the erotic stimuli are actually erotic.
      Instead, begging the question in the true philosophical sense of the
      term, he simply assumed that because females “psychically” scored
      above chance with the erotic pictures, while males did not, this gender
      difference must have come about because the males did not find the pictures
      to be erotic enough. He states, in regard to my critique of Experiment
      1:
        “Male
        and female raters differed markedly in their ratings of negative and
        erotic pictures. Male raters rated every one of the negative pictures
        as less negative and less arousing than did the female raters, and they
        gave more positive ratings than the female raters to the most explicit
        erotic pictures. Possibly reflecting this sex difference, female participants
        showed significant psi effects with negative and erotic stimuli in my
        earliest experiment but male participants did not. Accordingly, I decided
        to introduce different sets of pictures for men and women in subsequent
        experiments, choosing more extreme and more arousing pictures for the
        men.”

      This
      would seem to suggest that he directly assessed the erotic nature of
      the stimuli, that his participants actually rated the pictures. They
      did not. His conclusion about gender differences in erotic response
      is based on the ratings provided with the IAPS set of pictures that
      he used, ratings that were available to him before he began his
      experiments. He apparently had no concern about those gender differences
      until after he had collected his data, at which time, having
      found that males did not show what he refers to as “significant psi
      effects” (actually, just significant deviations from the chance response
      rate), he decided to modify his stimuli set for males for Experiment
      6, and in other subsequent experiments including the strangely-numbered
      Experiment 1. However, in his original article, there is nothing
      in his description of Experiment 1 that discusses male and female reactions
      to the stimuli, apart from a reference to Experiment 5 (which was run
      before Experiment 1!). It is in Experiment 6 that he makes clear where
      the “ratings” came from. He states (in his original article):

        “…
        we decided to use sets of negative and erotic pictures that were different
        for men and women. As noted above, women showed a significant psi effect
        on the negative trials in Experiment 5, but men did not. … The
        ratings supplied with the IAPS pictures reveal that male raters rated
        every one of the negative pictures in the set as less negative and less
        arousing than did female raters.
        … So, for this replication, we
        supplemented the IAPS pictures for men with stronger and more explicit
        negative and erotic images obtained from Internet sites.” (My
        underlining)

      So,
      after relying on a standardized set of pictures, he now chose who-knows-what
      pictures from the internet to “supplement” the IAPS pictures (how
      many internet pictures were involved, he does not say). Again, let me
      make clear that it is not the picture set per se that is my concern;
      it is the arbitrary way in which pictures were chosen and switched about,
      which reflects the systemic experimental carelessness.

    3. Bem points out,
      correctly, that I was confused with regard to the procedure in Experiment
      1. I was initially mistaken in interpreting the chance rate for erotic
      pictures as being 33% for some of the participants, although I did indeed
      figure out that his primary analysis was with regard to the hit rate on
      the erotic targets for all 100 participants, with a chance rate of 50%.
      Upon review, I can assure the reader that none of this has any bearing
      on my overall conclusions.
      Compare
      the relative clarity of his present description to the description in
      the original article. (Keep in mind that by “session,” he
      is referring to the trials conducted with a single participant.)

      His
      present response to me:

        “There
        were 100 sessions in this experiment, and on each of 36 trials, participants
        saw images of two curtains side-by-side on the computer screen. They
        were told that a picture would be behind one of the curtains and a blank
        wall would be behind the other. Their task on each trial was to click
        on the curtain they felt concealed the picture. After they made their
        selection, the selected curtain opened, revealing either a picture or
        a blank wall. … On randomly selected trials, the picture was erotic;
        on other trials, it was nonerotic, and the participant had no (non-psi)
        way of knowing which kind of picture would be used on any given trial.
        Because there were two alternatives on each trial—left curtain or
        right curtain—the probability that the participant would correctly
        select the location of the picture by chance was always 50%.”

      His
      original article:

        “For
        this purpose, 40 of the sessions comprised 12 trials using erotic pictures,
        12 trials using negative pictures, and 12 trials using neutral pictures.
        The sequencing of the pictures and their left/right positions were randomly
        determined by the programming language’s internal random function.
        The remaining 60 sessions comprised 18 trials using erotic pictures
        and 18 trials using nonerotic positive pictures with both high and low
        arousal ratings. These included eight pictures featuring couples in
        romantic but nonerotic situations (e.g., a romantic kiss, a bride and
        groom at their wedding). The sequencing of the pictures on these trials
        was randomly determined by a randomizing algorithm … Although it is
        always desirable to have as many trials as possible in an experiment,
        there are practical constraints limiting the number of critical trials
        that can be included in this and several other experiments reported
        in this article. In particular, on all the experiments using highly
        arousing erotic or negative stimuli a relatively large number of nonarousing
        trials must be included to permit the participant’s arousal level
        to “settle down” between critical trials. This requires including
        many trials that do not contribute directly to the effect being tested…”

      However,
      thinking again about his procedure, another problem becomes apparent.
      There was a relatively large data set for the erotic pictures, since
      all participants were exposed to them, but the data set for each of
      the various categories of non-erotic images was considerably smaller,
      which makes it less likely that one would detect small but significant
      effects in those data even were they to exist. Thus, the procedure is
      weighted against detecting significant above-chance rates for the non-erotic
      stimuli, because the smaller sample sizes provide less statistical power.
      So, while he makes much of having found significant “psi” effects
      only with the erotic stimuli (notwithstanding the statistical problems
      associated with the multiple testing), this difference might well be
      due only to differences in the power of the tests, reflecting the differences
      in sample size.

    4. Bem deals with my
      critique of his statistical analysis in a purely ad hominem manner.
      He completely fails to respond to one of my most serious criticisms,
      about his use of one-tailed tests. However, he sneers at my concerns
      about the multiple t-tests that he relies upon, and argues that:
        “…multiple
        t tests demonstrated that participants did no better than chance on
        any of the subcategories of nonerotic pictures. It is here that Alcock
        first complains about my performing multiple tests without adjusting
        the significance level for the number of tests performed. In this case,
        Alcock is almost right. Suppose that in testing each of the four subcategories
        of nonerotic pictures, I had found that one of them (e.g., romantic
        pictures) showed a significant precognitive effect. Because this finding
        would have emerged post hoc, only after I had first performed separate
        tests on four different picture types, I would have had to adjust the
        significance level to be less significant. If I did not, I would be
        illegitimately capitalizing on the likelihood that at least one of the
        four tests would have yielded a positive result just by chance. But
        there was no psi effect on any of the subcategories of nonerotic pictures.”

      Bem
      thus justifies the decision not to control for multiple testing on
      the basis of an examination of the data themselves!
      Because
      he found no psi effect for the non-erotic categories, he argues that
      there was no problem with conducting the several tests without controlling
      for multiple testing!! One simply cannot make one’s choice of statistical
      procedure after looking at the data. One must adjust the statistical
      criterion in advance to reflect the number of tests to be done, taking
      into account not only the t-tests for the non-erotic images, but the
      test for the erotic images as well. Since he obviously did not do so,
      he consequently employed the wrong criterion value, and that, combined
      with the use of one-tailed testing, greatly magnifies the likelihood
      of finding statistical significance when none exists.

    5. With regard to his
      comments about the Priming studies, the reader should note that, contrary
      to Bem’s complaint, I did not object to the use of complex data analyses
      involving two transformations and outlier cut-off criteria. I simply
      said that because of the complexity of the analyses, one cannot fairly
      judge his interpretations of the data without seeing the data and the
      analyses that were done.
      It
      is interesting that Bem states that:

        “At
        least one expert in priming experiments has also argued that one should
        always perform several analyses using different transformations and
        different cut-off criteria to ensure that the priming effects hold up
        across these variations. That is precisely what I did.”

      That
      may be just fine, but the cautious observer will reserve judgment in
      the absence of detailed information about those various analyses, and
      the transformations and cut-off criteria upon which they relied.
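
As an aside for statistically minded readers, the power asymmetry described above can be sketched in a few lines of Python. The trial counts and the assumed 53% true hit rate below are purely illustrative assumptions, not figures taken from Bem's experiments:

```python
# Illustrative sketch: how the power of a one-tailed test of a hit rate
# against the 50% chance rate depends on the number of trials pooled
# into a category. All numbers here are hypothetical.
import math
from statistics import NormalDist

def power_one_tailed(n, p_true, p_null=0.5, alpha=0.05):
    """Approximate power of a one-tailed test of proportion,
    using the normal approximation to the binomial."""
    z_crit = NormalDist().inv_cdf(1 - alpha)        # ~1.645 for alpha = .05
    se_null = math.sqrt(p_null * (1 - p_null) / n)
    threshold = p_null + z_crit * se_null           # hit rate needed for "significance"
    se_true = math.sqrt(p_true * (1 - p_true) / n)
    return 1.0 - NormalDist().cdf((threshold - p_true) / se_true)

# A large pooled sample (as for the erotic trials) versus a much smaller
# one (as for any single non-erotic subcategory) -- counts are made up.
power_large = power_one_tailed(n=3600, p_true=0.53)
power_small = power_one_tailed(n=600, p_true=0.53)
print(f"power with 3600 trials: {power_large:.2f}")  # high
print(f"power with  600 trials: {power_small:.2f}")  # much lower
```

Under these made-up numbers, the same small effect that would be detected almost every time in the large pool would be missed more often than not in the small one.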
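
The multiple-testing point also lends itself to a short worked example. The figure of five tests below is an illustrative assumption (one test per picture category), not a tally of Bem's actual analyses:

```python
# Sketch of the multiple-testing problem: with several uncorrected tests
# each run at alpha = .05, the chance of at least one spurious
# "significant" result under the null is well above 5%.
alpha = 0.05
n_tests = 5  # illustrative count, not Bem's exact number of tests

# Family-wise error rate if each test is run at the unadjusted level:
fwer = 1 - (1 - alpha) ** n_tests
print(f"FWER with {n_tests} uncorrected tests: {fwer:.3f}")

# The Bonferroni remedy fixes the criterion *before* looking at the data:
alpha_bonferroni = alpha / n_tests
fwer_corrected = 1 - (1 - alpha_bonferroni) ** n_tests
print(f"FWER with Bonferroni-adjusted level: {fwer_corrected:.3f}")
```

Five uncorrected tests give roughly a 23% chance of a spurious "significant" result, and relying on one-tailed tests compounds the problem, since a one-tailed .05 criterion passes results that a two-tailed test would rate only p &lt; .10.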

    For all
    his sound and fury, this seriously flawed set of experiments is still
    a seriously flawed set of experiments. If Bem wants the scientific world
    to pay attention to his claims of psi, he must first produce meaningful
    data from a well-designed, well-executed and well-analyzed experiment.
    Neither excuses for careless research, nor angry defences of it, will
    achieve this; he must simply do it right.

James Alcock

James E. Alcock, PhD, is professor of psychology at York University, Toronto, Canada. He is a fellow of the Canadian Psychological Association and a member of the Executive Council of the Committee for Skeptical Inquiry and the Editorial Board of the Skeptical Inquirer. Alcock has written extensively about parapsychology and anomalous experience and has for many decades taught a psychology course focusing on these topics. His most recent book is Belief: What It Means to Believe and Why Our Convictions Are So Compelling (Prometheus Books, 2018).