## Wednesday, March 8, 2017

Please  help.  This is a real-life case of likely false conviction where your input can help.  A man is spending life in jail without parole for a murder he likely did not commit.

### Background:

• In 1969, Jane Mixer, a law student, was murdered.  The case went cold.
• The case was reopened 33 years later when crime-scene evidence was submitted to DNA analysis.
• The DNA yielded two matches; both matches were from samples that were analyzed in the same lab and at the same time as the crime-scene DNA.  All three samples were analyzed in late 2001 and early 2002.
• One match was to John Ruelas.  Mr. Ruelas was 4 in 1969 and was excluded as a suspect.
• The other match was to Gary Leiterman.  Mr. Leiterman was 26 at the time.  He was convicted in 2005 and is serving life without parole.  His appeal was denied in 2007.
• There is no doubt that Mr. Leiterman's DNA was deposited on the crime scene sample.  The match is 176-trillion-to-1.
• The question is whether the DNA was deposited at the crime scene in 1969 or if there was a cross-contamination event in the lab in 2002.

### A Very Easy and Helpful PowerPoint:

• This case comes from John Wixted, a psychologist at UCSD
• He has made a detailed and convincing presentation.  Click here for The Power Point from John's website.
• John has helped to persuade the Innocence Clinic at the University of Michigan to investigate the Leiterman case.
• John and I are convinced this is an injustice.  We are working pro bono.

### Our Job:

• Our job is to make an educated assessment of Mr. Leiterman's guilt or innocence.  It would greatly help the Innocence Clinic to assess whether there is sufficient evidence to appeal.
• The jury heard that DNA is trillion-to-1 accurate and that there was only a very tiny chance of cross contamination.   Yet, we know these are the wrong conditional probabilities to compute.
• Consider the two hypotheses above that Leiterman's DNA was deposited at the crime scene or, alternatively, that it was deposited in the lab through cross contamination.  Conditional on the match, compute posterior probabilities.

### My Analysis:

I have done my own analyses and typeset them.  But reasoning is tricky, and I would like some backup.  It is just too important to mess up.  Can you try your own analysis?  Then we can decide what is best.

You will need more information.  I used  the following specifications.  Write me if you want more:

• John and I assumed 2.5M people are possible suspects in 1969.  It is a good guess based on a population estimate of the Detroit metro area.
• The lab processes 12,000 samples a year.  The time period the DNA overlapped can be assumed to be 6 months, that is 6,000 other samples could be cross contaminated with Mixer or Leiterman.
• The known rate of DNA cross-contamination is 1-in-1500.  That is, each time they do a mouth swab from one person, they end up with two or more DNA profiles with probability of 1/1500. We assume this rate holds for unknowable cross-contamination such as that in processing a crime scene.
• The probability of getting usable DNA from a 33-year-old sample is 1/2.
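
One way to turn these specifications into posterior probabilities is sketched below in R.  It is a deliberately simplified illustration and not Jeff's typeset analysis.  It assumes that each of the 2.5M possible suspects is equally likely a priori, that a contamination event draws its source uniformly from the 6,000 overlapping samples, and it ignores the Ruelas match entirely.  All variable names are made up for this sketch.

```r
# Numbers taken from the specifications above
n_suspects <- 2.5e6     # possible suspects in 1969
n_lab      <- 6000      # other samples in the lab during the 6-month overlap
p_contam   <- 1 / 1500  # cross-contamination rate
p_usable   <- 1 / 2     # chance of usable DNA from a 33-year-old sample

# ASSUMPTION: uniform prior over possible suspects
prior_scene <- 1 / n_suspects
prior_lab   <- 1 - prior_scene

# Probability of the observed match under each hypothesis
like_scene <- p_usable          # deposited in 1969 and still usable
like_lab   <- p_contam / n_lab  # ASSUMPTION: contamination source is uniform over lab samples

# Bayes' rule, conditional on the match
evidence   <- prior_scene * like_scene + prior_lab * like_lab
post_scene <- prior_scene * like_scene / evidence
post_lab   <- prior_lab * like_lab / evidence
print(c(scene = post_scene, lab = post_lab))
```

Changing any of the assumed quantities, especially the number of suspects or the contamination model, moves the answer, which is why independent analyses are worth having.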

Jeff's answer is at GitHub, https://github.com/rouderj/leiterman

Thank you,
Jeff Rouder
John Wixted

## Tuesday, January 3, 2017

### Why Is It So Hard To Organize My Lab?

It is clear I need to pay more attention to the organization of my lab.   Organization is a challenge to me, it causes much apprehension, and seems to be a chronic need in all aspects of my life.  Let's focus on the lab.

### Parameters:

1. Minimizing mistakes.  There is no upside in analyzing the wrong data set, using the wrong parameters, including the wrong figure, or reporting the wrong statistics.  These mistakes are in my view unacceptable in science.  Minimizing them is the highest priority.

2.  Knowing what we did.  Some time in the future, way in the future, we or someone else will visit what we did.  Can we figure out what happened?  I'd like to plan on the time scale of decades rather than months or years.

3.  Planning for human fallibility.  Some people think science is for those who are meticulous.  Then count me out.  I am messy, careless, and chronically clueless.  A good organization anticipates human mistakes.

4. Easy to learn.  I collaborate with a lot of people.  The organization structure should be fairly intuitive and self-explanatory.

### What we do:

1. Data acquisition and curation.  I think we have this wired.  We use a born-open data model where data are collected, logged, versioned, and uploaded nightly to GitHub automatically.  We also automatically populate local mysql tables including information on subjects and sessions, and have additional tables for experiments, experimenters, computers, and IRB info.  We even have an adverse-events table to record and address any flaws in the organizational system.  The basic unit of organization is the dataset, and it works well.

2. Outputs.  We have the usual outputs: papers, talks, grant proposals, dissertations, etc.  Some are collaborative; some are individual; some are important; some go nowhere.  The basic unit here is pretty obvious---we know exactly where each paper, talk, dissertation, etc., begins and ends.

3. Value-added endeavors.  A value-added endeavor (VAE) is a small unit of intellectual contribution.  It could be a proof, a simulation, a specific analysis, or (on occasion) a verbal argument.  VAEs, as important as they are, are ill-defined in size and scope.  And it is sometimes unclear (perhaps arbitrary) where one ends and another begins.

### The Current System, The Good:

Perhaps the strongest element of my lab's organization is that we use really good tools for open and high-integrity science.  Pretty much everything is script based, and scripts are in many ways self-documenting, especially when compared to menu-driven alternatives.  Our analyses are done in R, our papers in LaTeX and Markdown, and the two are integrated with RMarkdown and Knitr.  Moreover, we use a local git server and curate all development in repositories.

### The Current System, The Bad and Ugly:

We use projects as our basic organization unit.  Projects are basically repositories on our local git server.  They contain ad-hoc organizations of files.  But what a project encompasses and how it is organized is ad-hoc, disordered, unstandardized, and idiosyncratic.   Here are the issues:

1. There is no natural relation between the three things we do (acquire and curate data, produce outputs, and produce VAEs) and projects.  One VAE might serve several different papers; likewise, one dataset might serve several different papers.  Papers and talks encompass several different experiments (usually) and VAEs.

2. Projects have no systematic relations to VAEs, outputs or datasets.  This is why I am unhappy.  Does a project mean one paper?  Does it mean one analysis?  One development?  A collection of related papers?  A paper and all talks and the supporting dissertation?  We have done all of these.

### Help

What do you do?  Are there good standards?  What should be the basic organization unit?  Stay with project?  I am thinking about a strict output model where every output is a repository as the main organizing unit.  The problem is what-to-do about VAEs that span several outputs.  Say I have an analysis or graph that is common for a paper, a dissertation, and a talk.  I don't think I want this VAE repeated in three places.  I don't want symbolic links or hard codings because it makes it difficult to publicly archive.  That is why projects were so handy.   VAEs themselves are too small and too ill-defined to be organizing units.  Ideas?

## Friday, October 28, 2016

### A Probability Riddle

Some flu strains can jump from people to birds, and, perhaps vice-versa.

Suppose $$A$$ is the event that there is a flu outbreak in a certain community, say in the next month, and let $$P(A)$$ denote the probability of this event occurring.    Suppose $$B$$ is the event that there is a flu outbreak among chickens in the same community in the same time frame, with $$P(B)$$ being the probability of this event.

Now let's focus in on the relative flu risk to humans from chickens.  Let's define this risk as
$R_h=\frac{P(A|B)}{P(A)}.$
If the flu strain jumps from chickens to people, then the conditional probability, $$P(A|B)$$, may well be higher than the base rate, $$P(A)$$, and the risk to people will be greater than 1.0.

Now, if you are one of those animal-lover types, you might worry about the relative flu risk to chickens from people.  It is:
$R_c=\frac{P(B|A)}{P(B)}$

At this point, you might have the intuition that there is no good reason to think $$R_h$$ would be the same value as $$R_c$$.  You might think that the relative risk is a function of say the virology and biology of chickens, people, and viruses.

And you would be wrong.  While it may be that chickens and people have different base rates and different conditions, it must be that $$R_h=R_c$$.  It is a matter of math rather than biology or virology.

To see the math, let's start with Bayes' rule, which follows directly from the Law of Conditional Probability:
$P(A|B) = \frac{P(B|A)P(A)}{P(B)}.$

We can divide both sides by $$P(A)$$, arriving at
$\frac{P(A|B)}{P(A)} = \frac{P(B|A)}{P(B)} .$

Now, note that the left-hand side is the risk to people and the right-hand side is the risk to chickens.

I find the fact that these risk ratios are preserved to be a bit counterintuitive.  It is part of what makes conditional probability hard.
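
A quick numerical check can make the identity feel less mysterious.  The sketch below uses made-up probabilities (the three numbers are purely hypothetical) and computes both relative risks from the same joint probability.

```r
# Hypothetical probabilities, chosen only for illustration
p_a  <- 0.10  # P(A): flu outbreak among people
p_b  <- 0.20  # P(B): flu outbreak among chickens
p_ab <- 0.06  # P(A and B): both outbreaks occur

p_a_given_b <- p_ab / p_b  # P(A|B)
p_b_given_a <- p_ab / p_a  # P(B|A)

r_h <- p_a_given_b / p_a   # relative risk to humans from chickens
r_c <- p_b_given_a / p_b   # relative risk to chickens from humans

print(c(R_h = r_h, R_c = r_c))
print(all.equal(r_h, r_c))
```

Both ratios reduce to $$P(A \cap B)/(P(A)P(B))$$, so they agree no matter which values are plugged in.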

## Sunday, April 10, 2016

### This Summer's Challenge: Share Your Data

"It would take me weeks of going through my data and coordinating them, documenting them, and cleaning them if I were to share them." anonymous senior faculty member

"Subject 7 didn't show. There is an empty file. Normally the program would label the next person Subject 8 and we would just exclude Subject 7 in analysis. But now that we are automatically posting data, what should I do? Should I delete the empty file so the next person is Subject 7?" anonymous student in my lab

"Why? Data from a bad study is, by definition, no good." @PsychScienctists, in response to my statement that all data should be curated and available.

All three of the above quotes illustrate a common way of thinking about data. Our data reflect something about us. When we share them, we are sharing something deep and meaningful about ourselves. Our data may be viewed as statements about our competence, our organizational skills, our meticulousness, our creativity, and our lab culture. Even the student in my lab feels this pressure. This student is worried that our shared data won't be viewed as sufficiently systematic because we have no data for Subject 7. Maybe we want to present a better image.

#### The Data-Are-The-Data Mindset

I don't subscribe to the Judge-Me-By-My-Data mindset. Instead, I think of data as follows:
• Scientific data are precious resources collected for the common good.
• We should think in terms of stewardship rather than ownership. Be good stewards.
• Data are neither good nor bad, nor are they neat nor messy. They just are.
• We should judge each other by the authenticity of our data.

### Mistake-Free Data Stewardship through Born-Open Data

To be good stewards and to ensure authentic data, we upload everything, automatically, every night. Nobody has to remember anything, nobody makes decisions---it all just happens. Data are uploaded to GitHub where everyone can see them. In fact, I don't even use locally stored data for analysis; I point my analyses to the copy on GitHub. We upload data from well-thought-out experiments. We upload data from poorly-thought-out, bust experiments. We upload pilot data. We upload incomplete data. If we collected it, it is uploaded. We have an accurate record of what happened in the lab, and you all are welcome to look in at our GitHub account. I call this approach born-open data, and have an in-press paper coming out about it. We have been doing born-open data for about a year.

So far, the main difference I have noticed is an increase in quality control with no energy or time spent to maintain this quality. Nothing ever gets messed up, and there is no after-the-fact reconstruction of what had happened. There is only one master copy of data---the one on GitHub. Analysis code points to the GitHub version. We never analyze the wrong or incomplete data. And it is trivially easy to share our analyses among lab members and others. In fact, we can build the analyses right into our papers with Knitr and Markdown. Computers are so much more meticulous than we will ever be. They never take a night off!
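
Pointing analysis code at the GitHub copy can be as simple as reading from the raw-file URL.  The sketch below is a minimal illustration; the account, repository, and file names are hypothetical placeholders rather than the lab's actual layout, and the helper functions are made up for this example.

```r
# Build the raw-file URL for a file stored in a GitHub repository
github_raw <- function(user, repo, path, branch = "master") {
  paste("https://raw.githubusercontent.com", user, repo, branch, path, sep = "/")
}

# HYPOTHETICAL location of a data file; substitute a real one
data_url <- github_raw("someLab", "data", "exp1/exp1.dat")

# Read the single master copy directly; no local copy to get out of date
read_master <- function(url) read.table(url, header = TRUE)

# dat <- read_master(data_url)  # uncomment once data_url points at a real file
print(data_url)
```

Because every analysis reads from the same URL, there is no question about which copy of the data was analyzed.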

#### This Summer's Challenge: Automatic Data Curation

I'd like to propose a challenge: Set up your own automatic data curation system for new data that you collect. Work with your IT people. Set up the scripts. Hopefully, when next Fall rolls around, you too are practicing born-open data!

## Tuesday, April 5, 2016

### The Bayesian Guarantee And Optional Stopping.

Frequentist intuitions run so deep in us that we often mistakenly interpret Bayesian statistics in frequentist terms. Optional stopping has always been a case in point.  Bayesian quantities, when interpreted correctly, are not affected by optional stopping.  This fact is guaranteed by Bayes' Theorem.  Previously, I have shown how this guarantee works for Bayes factors.  Here, let's consider the simple case of estimating an effect size.

For demonstration purposes, let's generate data from a normal with unknown mean, $$\mu$$, but known variance of 1.  I am going to use a whacky optional stopping rule that favors sample means near .5 over others.  Here is how it works:

I. As each observation comes in, compute the running sample mean.

II. Compute a probability of stopping that is dependent on the sample mean according to the figure below.  The probability favors stopping for sample means near .5.

III. Flip a coin with sides labeled "STOP" and "GO ON" with that probability.

IV. Do what the coin says (up to a maximum of 50 observations, then stop no matter what).
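
The four steps above can be sketched for a single sequence of observations as follows.  The stopping curve is the one used in the R code at the end of this post (a logistic function of the squared distance from .6); the function names `stop_prob` and `one_run` are made up for this illustration, and the loop is a plain, unvectorized version rather than the sampler used for the figures.

```r
# Probability of stopping as a function of the running sample mean
stop_prob <- function(ybar) 1 - plogis((ybar - .6)^2, 0, .2)

# One replicate of the optional stopping rule
one_run <- function(mu, topN = 50) {
  y <- rnorm(topN, mu, 1)           # data with known variance of 1
  ybar <- cumsum(y) / seq_along(y)  # step I: running sample mean
  for (n in seq_len(topN)) {
    p <- stop_prob(ybar[n])         # step II: stopping probability
    coin_says_stop <- runif(1) < p  # step III: flip the coin
    if (coin_says_stop || n == topN)  # step IV: obey, or stop at the maximum
      return(list(ybar = ybar[n], N = n))
  }
}

set.seed(1)
run <- one_run(mu = 0)
print(run)
```

Calling `one_run` repeatedly with a fixed `mu` gives the frequentist view of the rule; drawing `mu` from the prior on each call gives the Bayesian view.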

The result of this rule is a bias toward sample means near .5.   I ran a simulation with a true mean of zero for ten thousand replicates (blue histogram below).  The key property is a biasing of the observed sample means higher than the true value of zero.   Bayesian estimation seems biased too.    The green histogram shows the posterior means when the prior on $$\mu$$ is a normal with mean of zero and a standard deviation of .5.  The bias is less, but that just reflects the details of the situation where the true value, zero, is also favored by the prior.

So it might seem I have proved the opposite of my point---namely that optional stopping affects Bayesian estimation.

Nope.  The above case offers a frequentist interpretation, and that interpretation entered when we examined the behavior on a true value, the value zero.  Bayesians don't interpret analyses conditional on unknown "truths".

### The Bayesian Guarantee

Bayes' Theorem provides a guarantee.  If you start with your prior and observed data, then Bayes' Theorem guarantees that the posterior is the optimal set of probability statements about the parameter at hand.  It is a bit subtle to see this in simulation because one needs to condition on data rather than on some unknown truth.

Here is how a Bayesian uses simulation to show the Bayesian Guarantee.

I. On each replicate, sample a different true value from the prior.  In my case, I just draw from a normal centered at zero with standard deviation of .5 since that is my prior on effect size for this  post.  Then, on each replicate, simulate data from that truth value for that replicate.  I have chosen data of 25 observations (from a normal with variance of 1).  A histogram of the sample mean across these varying true values is provided below, left panel.   I ran the simulation for 100,000 replicates.

II. The histogram is that of data (sample means) we expect under our prior.  We need to condition on data, so let's condition on an observed sample mean of .3.  I have highlighted a small bin between .25 and .35 with red.  Observations fall in this bin about 6% of the time.

III.  Look at all the true values that generated those sample means in the bin with .3.  These true values are shown in the yellow histogram.  This histogram is the target of Bayes' Theorem, that is, we can use Bayes' Theorem to describe this distribution without going through the simulations.   I have computed the posterior distribution for a sample mean of .3 and 25 observations under my prior, and plotted it as the line.  Notice the correspondence.  This correspondence is the simulation showing that Bayes' Theorem works.  It works, by the way, for every bin, though I have just shown it for the one centered on .3.

TAKE HOME I: Bayes' Theorem tells you the distribution of true values given your prior and the data.
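
The posterior drawn as the line can be written down directly.  The sketch below uses the standard conjugate update for a normal mean with known variance of 1, which is the same computation as the `v` and `c` quantities in the R code at the end of this post; the function name `posterior` is made up for this illustration.  It also computes how often a sample mean should land in the red bin under the prior.

```r
# Conjugate update: prior mu ~ Normal(m0, v0), N observations with variance 1
posterior <- function(ybar, N, m0 = 0, v0 = .5^2) {
  v <- 1 / (N + 1 / v0)           # posterior variance
  m <- v * (N * ybar + m0 / v0)   # posterior mean
  list(mean = m, sd = sqrt(v))
}

post <- posterior(ybar = .3, N = 25)
print(post)

# Under the prior, the sample mean of 25 observations is Normal(0, v0 + 1/N)
marg_sd <- sqrt(.5^2 + 1 / 25)
p_bin <- pnorm(.35, 0, marg_sd) - pnorm(.25, 0, marg_sd)
print(p_bin)
```

The posterior mean is pulled from .3 toward the prior mean of zero, and `p_bin` should land close to the roughly 6% quoted in step II.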

### Is The Bayesian Guarantee Affected By Optional Stopping?

So, we come to the crux move.  Let's simulate the whacky optional stopping rule that favors sample means near .5.  Once again, we start with the prior, and for each replicate we choose a different truth value as a sample from the prior.  Then we simulate data using optional stopping, and the resulting sample means are shown in the histogram on the left.  Optional stopping has affected these data dramatically.  No matter, we choose our bin, again around .3, and plot the true values that led to these sample means.  These true values are shown as the yellow histogram on the right.  They are far more spread out than in the previous simulation without optional stopping, primarily because stopping often occurred with fewer than 25 observations.  Now, is this spread predicted?  Yes.  On each replication we obtain a posterior distribution, and these vary from replication to replication because the sample size is random.  I averaged these posteriors (as I should), and the result is the line that corresponds well to the histogram.

TAKE HOME II: Bayes' Theorem tells you where the true values are given your prior and the data, and it doesn't matter how the data were sampled!

And this should be good news.

*****************

R code

set.seed(123)

m0=0
v0=.5^2

runMean=function(y) cumsum(y)/(1:length(y))
minIndex=function(y) order(y)[1]

mySampler=function(t.mu,topN)
{
M=length(t.mu)
mean=rep(t.mu,topN)
y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
ybar=t(apply(y,1,runMean))
prob=plogis((ybar-.6)^2,0,.2)
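another=matrix(nrow=M,ncol=topN,rbinom(M*topN,1,prob))
another[,topN]=0 #force a stop at topN when no earlier stop occurred (otherwise minIndex returns 1)
stop=apply(another,1,minIndex)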
return(list("ybar"=ybar[cbind(1:M,stop)],"N"=stop))
}

goodSampler=function(t.mu,topN){
M=length(t.mu)
mean=rep(t.mu,topN)
y=matrix(nrow=M,ncol=topN,rnorm(M*topN,mean,1))
return(apply(y,1,mean))}

M=10000

png('freqResults.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))
t.mu=rep(0,M)
out=mySampler(t.mu,50)
ybar=out$ybar
N=out$N
v=1/(N+1/v0)
c=(N*ybar+m0/v0)
hist(ybar,col='lightblue',main="",xlab="Sample Mean",breaks=50,xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
abline(v=mean(ybar),lwd=3,lty=2)
hist(v*c,col='lightgreen',main="",xlab="Posterior Mean",xlim=c(-1,1.25),prob=T,ylim=c(0,2.6))
abline(v=mean(v*c),lwd=3,lty=2)
dev.off()

###############################
set.seed(456)
png('bayesGuarantee.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))

M=100000
N=25
t.mu=rnorm(M,m0,sqrt(v0))
ybar=goodSampler(t.mu,N)
myBreak=seq(-2.45,2.45,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="")
mtext(side=3,adj=.5,line=0,cex=1.3,"Sample Mean Across Prior")

v=1/(N+1/v0)
c=(N*mid+m0/v0)

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',
ylim=c(0,2.2),main="",xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")
dev.off()

########################
set.seed(790)
png('moneyShot.png',width=960,height=480)
par(mfrow=c(1,2),cex=1.3,mar=c(4,4,2,1),mgp=c(2,1,0))

M=100000
t.mu=rnorm(M,m0,sqrt(v0))
out=mySampler(t.mu,50)
ybar=out$ybar
N=out$N
myBreak=seq(-5.95,5.95,.1)
bars=hist(ybar,breaks=myBreak,plot=F)

mid=.3
good=(ybar >(mid-.05) & ybar<(mid+.05))
myCol=rep("white",length(myBreak))
myCol[round(bars$mids,2)==0.3]='red'
plot(bars,col=myCol,xlab="Sample Mean",main="",xlim=c(-4,3))

v=1/(N[good]+1/v0)
c=(N[good]*ybar[good]+m0/v0)

hist(t.mu[good],prob=T,xlab=expression(paste("Parameter ",mu)),col='yellow',main="",
ylim=c(0,2.2),xlim=c(-1.75,1.75))
myES=seq(-2,2,.01)
post=1:length(myES)
for (i in 1:length(myES))
post[i]=mean(dnorm(myES[i],c*v,sqrt(v)))
lines(myES,post,lwd=2)
mtext(side=3,adj=.5,line=0,cex=1.3,"True values for sample means around .3")

dev.off()

######################
#stop probability

png(file="probStop.png",width=480,height=480)
par(cex=1.3,mar=c(4,4,1,1),mgp=c(2,1,0))
ybar=seq(-2,3,.01)
prob=plogis((ybar-.6)^2,0,.2)
plot(ybar,1-prob,typ='l',lwd=2,ylab="Stopping Probability",xlab="Sample Mean",ylim=c(0,.55))
mtext("Optional Stopping Depends on Sample Mean",side=3,adj=.5,line=-1,cex=1.3)
dev.off()

## Monday, March 28, 2016

### The Effect-Size Puzzler, The Answer

I wrote the Effect-Size Puzzler because it seemed to me that people have reduced the concept of effect size to a few formulas on a spreadsheet.  It is a useful concept that deserves a bit more thought.

The example I provided is the simplest case I can think of that is germane to experimental psychologists.  We ask 25 people to perform 50 trials in each of 2 conditions, and ask what is the effect size of the condition effect.  Think Stroop if you need a context.

The answer, by the way, is $$+\infty$$.  I'll get to it.

### The good news about effect sizes

Effect sizes have revolutionized how we compare and understand experimental results.  Nobody knows whether a 3% change in error rate is big or small or comparable across experiments; everybody knows what an effect size of .3 means.  And our understanding is not merely associative or mnemonic; we can draw a picture like the one below and talk about overlap and difference.  It is this common meaning and portability that licenses a modern emphasis on estimation.  Sorry estimators, I think you are stuck with standardized effect sizes.

Below is a graph from Many Labs 3 that makes the point.  Here, the studies have vastly different designs and dependent measures.  Yet, they can all be characterized in unison with effect size.

Even for the simplest experiment above, there is a lot of confusion.  Jake Westfall provides 5 different possibilities and claims that perhaps 4 of these 5 are reasonable at least under certain circumstances.  The following comments were provided on Twitter and Facebook: Daniel Lakens makes recommendations as to which one we shall consider the preferred effect size measure.  Tal Yarkoni and Uli Schimmack wonder about the appropriateness of effect size in within-subject designs and prefer unstandardized effects (see Jan Vanhove's blog).  Rickard Carlsson prefers effect sizes in physical units where possible, say in milliseconds in my Effect Size Puzzler.   Sanjay Srivastava needs the goals and contexts first before weighing in.  If I got this wrong, please let me know.

From an experimental perspective, The Effect Size Puzzler is as simple as it gets.  Surely we can do better than to abandon the concept of standardized effect sizes or to be mired in arbitrary choices.

### Modeling: the only way out

Psychologists often think of statistics as procedures, which, in my view, is the most direct path to statistical malpractice.  Instead, statistical reasoning follows from statistical models.  And if we had a few guidelines and a model, then standardized effect sizes are well defined and useful.  Showing off the power of model thinking rather than procedure thinking is why I came up with the puzzler.

### Effect-size guidelines

#1:  Effect size is how large the true condition effect is relative to the true amount of variability in this effect across the population.

#2:  Measures of true effect and true amount of variability are only defined in statistical models.  They don't really exist except within the context of a model.  The model is important.  It needs to be stated.

#3: The true effect size should not be tied to the number of participants nor the number of trials per participant.  True effect sizes characterize a state of nature independent of our design.

### The Puzzler Model

I generated the data to be realistic.  They had the right amount of skew and offset, and the tails fell like real RTs do.   Here is a graph of the generating model for the fastest and slowest individuals:

All data had a lower shift of .3s (see green arrow), because we typically trim these out as being too fast for a choice RT task.  The scale was influenced by both an overall participant effect and a condition effect, and the influence was multiplicative.  So faster participants had smaller effects; slower participants had bigger effects.  This pattern too is typical of RT data.   The best way to describe these data is in terms of percent-scale change.  The effect was to change the scale by 10.5%, and this amount was held constant across all people.  And because it was held constant, that is, there was no variability in the effect,  the standardized effect size in this case is infinitely large.

Now, let's go explore the data.  I am going to skip over all the exploratory stuff that would lead me to the following transform, Y = log(RT-.3), and just apply it.  Here is a view of the transformed generating model:

So, let's put plain-old vanilla normal models on Y.  First, let's take care of replicates.
$Y_{ijk} \sim \mbox{Normal} (\mu_{ij},\sigma^2)$
where $$i$$ indexes individuals, $$j=1,2$$ indexes conditions, and $$k$$ indexes replicates.  Now, let's model $$\mu_{ij}$$.  A general formulation is
$\mu_{ij} = \alpha_i+x_j\beta_i,$
where $$x_j$$ is a dummy code of 0 for Condition 1 and 1 for Condition 2.  The term $$\beta_i$$ is the ith individual's effect.  We can model it as
$\beta_i \sim \mbox{Normal}(\beta_0,\delta^2)$
where $$\beta_0$$ is the mean effect across people and $$\delta^2$$ is the variation of the effect across people.

With this model, the true effect size is $d_t = \frac{\beta_0}{\delta}.$ Here, by true, I just mean that it is a parameter rather than a sample statistic.  And that's it, and there is not much more to say in my opinion.   In my simulations the true value of each individual's effect was .1.  So the mean, $$\beta_0$$, is .1 and the standard deviation, $$\delta$$, is, well, zero.  Consequently, the true standardized effect size is $$d_t=+\infty$$.   I can't justify any other standardized measure that captures the above principles.

### Analysis

Could a good analyst have found this infinite value?  That is a fair question. The plot below shows individuals' effects, and I have ordered them from smallest to largest.  A key question is whether these are spread out more than expected from within-cell sample noise alone.  If these individual sample effects are more spread out, then there is evidence for true individual variation in $$\beta_i$$.  If these stay as clustered as predicted by sample noise alone, then there is evidence that people's effects do not vary.  The solid line is the prediction with within-cell noise alone.   It is pretty darn good.  (The dashed line is the null that people have the same, zero-valued true effect).  I also computed a one-way random-effects F statistic to see if there is a common effect or many individual effects.  It was one effect, F(24,2450) = 1.03.  Seems like one effect.
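
The logic of this check can be tried on simulated data.  The sketch below draws data on the transformed scale from the normal model above with $$\beta_0=.1$$ and $$\delta=0$$; the within-cell standard deviation of .4 and the participant baselines are assumed values chosen for illustration, not the puzzler's actual generating values.  It compares the spread of the sample effects with what within-cell noise alone predicts, and it uses the participant-by-condition interaction from `anova` as one way to obtain an F statistic with 24 and 2450 degrees of freedom, which may not be exactly the computation used for the number reported above.

```r
set.seed(2016)
I <- 25; K <- 50           # people and trials per condition, as in the puzzler
beta0 <- .1; delta <- 0    # mean effect and its SD across people
sigma <- .4                # ASSUMED within-cell SD on the transformed scale
alpha <- rnorm(I, -1, .2)  # ASSUMED participant baselines
beta  <- rnorm(I, beta0, delta)

dat <- expand.grid(rep = 1:K, cond = 1:2, id = 1:I)
x <- dat$cond - 1          # dummy code: 0 for Condition 1, 1 for Condition 2
dat$y <- rnorm(nrow(dat), alpha[dat$id] + x * beta[dat$id], sigma)

# Sample effect for each person and its spread
cell <- with(dat, tapply(y, list(id, cond), mean))
d <- cell[, 2] - cell[, 1]
print(c(observed_sd = sd(d), noise_only_sd = sigma * sqrt(2 / K)))

# Do effects vary across people more than within-cell noise predicts?
dat$id <- factor(dat$id); dat$cond <- factor(dat$cond)
tab <- anova(lm(y ~ id * cond, data = dat))
print(tab["id:cond", ])
```

With `delta` set to zero, the observed spread should sit near the noise-only value; raising `delta` shows how quickly the interaction F picks up true individual variation.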

These one-effect results should be heeded.  It is a structural element that I would not want to miss in any data set.   We should hold plausible the idea that the standardized effect size is exceedingly high as the variation across people seems very small if not zero.

To estimate effect sizes, we need a hierarchical model.  You can use Mplus, AMOS, LME4, WinBugs, JAGS, or whatever you wish.  Because I am an old dog and don't learn new tricks easily, I will do what I always do and program these models from scratch.

I used the general model above in the Bayesian context.  The key specification is the prior on $$\delta^2$$.   In the log-normal, the variance is a shape parameter, and it is somewhere around $$.4^2$$.  Effects across people are usually about 1/5th of this say $$.08^2$$.  To capture variances in this range, I would use a  $$\delta^2 \sim \mbox{Inverse Gamma(.1,.01)}$$ prior for general estimation.  This is a flexible prior tuned for the 10 to 100 millisecond range for variation in effects across people.  The following plot shows the resulting estimates of individual effects as a function of the sample effect values.
The noteworthy feature is the lack of variation in model estimates of individuals' effects!  This type of pattern, where variation in model estimates is attenuated compared to sample statistics, is called shrinkage, and it occurs because the hierarchical models don't chase within-cell sample noise.  Here the shrinkage is nearly complete, leading again to the conclusion that there is no real variation across people, or an infinitely large standardized effect size.  For the record, the estimated effect size here is 5.24, which, in effect size units, is getting quite large!

The final step for me is comparing this variable effect model to a model with no variation, say $$\beta_i = \beta_0$$ for all people.  I would do this comparison with Bayes factor.  But, I am out of energy and you are out of patience, so we will save it for another post.

### Back To Jake Westfall

Jake Westfall promotes a design-free version of Cohen's d where one forgets that the design is within-subject and uses an all-sources-summed-and-mashed-together variance measure.  He does this to stay true to Cohen's formulae.  I think it is a conceptual mistake.

I love within-subject designs precisely because one can separate variability due to people, variability within a cell, and variability in the effect across people.  In between-subject designs, you have no choice but to mash all this variability together due to the limitations of the design.   Within-subject designs are superior, so why go backwards and mash the sources of variance together when you don't have to?  This advice strikes me as crazy.  To Jake's credit, he recognizes that the effect-size measures promoted here are useful, but doesn't want us to call them Cohen's d.  Fine, we can just call them Rouder's within-subject totally-appropriate standardized effect-size measures.  Just don't forget the hierarchical shrinkage when you use it!

## Thursday, March 24, 2016

### The Effect-Size Puzzler

Effect sizes are bandied about as useful summaries of the data.  Most people think they are straightforward and obvious.  So if you think so, perhaps you won't mind a bit of a challenge?  Let's call it "The Effect-Size Puzzler," in homage to NPR's CarTalk.  I'll buy the first US winner a nice Mizzou sweatshirt (see here).  Standardized effect size please.

I have created a data set with 25 people each observing 50 trials in 2 conditions.  It's from a priming experiment.  It looks about like real data.  Here is the download.

The three columns are:

• id (participant: 1...25)
• cond (condition: 1,2)
• rt (response time in seconds).

There are a total of 2500 rows.

I think it will take you just a few moments to load it and tabulate your effect size for the condition effect.  Have fun.  Write your answer in a comment or write me an email.

I'll provide the correct answer in a blog next week.

HINT: If you wish to get rid of the skew and stabilize the variances, try the transform y=log(rt-.3)