The last post talked about making sure undergrads get the big picture of their questions. This is essential, but it is not the end. All too often analysis is left for the end and there is no exploring. Ideally, students learn their analytical techniques right from the start. One of my grad students, Tyler Larsen, is doing a great job of this.
Here is something he sent Cara Jefferson who is working on a project that is an extension of his. The details will be different for everyone, so this might not mean a lot to others, but there are things to get out of it.
First, Tyler encourages graphic exploration of the data.
Second, he does not give everything in detail to Cara but instead gives his scripts and encourages her to think about how her data differ from his and modify the analyses. It is too easy to turn off your brain if you are given everything. But he makes it clear he is available for help.
Third, Tyler suggests ways to explore the data, things she might look for. Cara is fairly new on the project so this is a really good idea. But he doesn’t tell her exactly how to do it.
Fourth, Tyler’s questions lead to the important questions of the research project.
Fifth, Tyler makes it clear that science has variation and we have to worry when it is caused by things like date of experiment and helps her see how to look for that.
Sixth, Tyler gives Cara the R code necessary to start the work. The sooner students get comfortable with R, the better, and getting code for one’s project is a good way to start. He gave her the R code in an R format, but I appended it to this post.
What you see below is all the stuff Tyler thought hard about and gave Cara. It is an wonderful example of a careful grad student mentor working with an excellent undergrad.
Cara,
It is valuable when you’ve collected a bunch of data to take the time to play with it and look for patterns. Obviously we will ultimately do statistics to test any hypotheses we ultimately want to test, but in the meantime it is worthwhile to graph things every which way to Sunday and see what pops out.
Here is a non-exhaustive list of questions to answer about your data. Some of these may be easy to tackle in Excel on the spreadsheet itself, while some others will require some R. Consult the R script (attached) for guidance. The first part of it restructures the data, calculates growth rates, and further annotates it with a few columns that will be useful for pulling out specific subsets to graph. If you’ve set up your data correctly you should be able to run it as is (though you’ll have to specify the ‘mainpath’, ‘todaysdate’, and ‘filename’ variables). If you have any trouble running the first part please let me know and I’ll help you sort it out. The second part makes graphs of various sorts and is more fluid. You may find the code as written helpful but it was written for my experiments, not yours, and so some simple modifications will be necessary. See if you can figure them out. Think carefully about what specific question you are hoping a graph will answer. It can be helpful to draw on paper what the final graph would look like if your hypotheses were correct, just to work out what should be on the x axis, y axis, etc.
- In Excel, take a look at the growth curves themselves. Plot at least a few of them as scatter or line graphs. Do they take the shape we talked about, with a flat beginning, an exponential rise, a plateau, and then a slow decline? Take note of any you see that don’t look like this and try to think about why.
- Look at the final OD values reached by different strains. This is the yield. It isn’t as informative as the growth rate (it’s very sensitive to the specific growth conditions), but it may differ between strains.
- Plot the data from the negative control wells. The media only wells should not grow at all (ie just a flat line at around 0). If they do the media was contaminated and that plate is likely not usable. Ideally the Dicty negative controls (the wells with Dicty but not Burk) should not grow either but past experience tells me this is wishful thinking and many often do show some growth, presumably from KP that survived the antibiotic treatment. Take note of how long it took these wells to grow – if it’s later than the Burk strains (as it often is), it probably isn’t a concern. If it’s earlier it is something we need to look more carefully at, lest we accidentally measure the wrong strain’s growth rate.
- Run the first part of the script to calculate growth rates. Note that growth curves with aberrant shapes are hard to calculate growth rate from. The script will notice most of these and refuse to calculate growth rates for them, but it does miss some of them. If you get a number that seems really weird, go back and plot that well’s growth curve. Chances are it’ll be screwy and you’ll know it’s not to be trusted.
- In Excel or R, convert all of the growth rates into doubling times. Doubling time = ln(2)/µ, and is usually expressed in growth per hour. Remember our time points are 15 minutes apart.
- Now compare growth rates or doubling times between strains. Do this first just by eye on the spreadsheet, and then try to produce a graph to answer these questions (one graph may be able to address more than one question at a time).
- How do the ancestor strains’ growth rates compare? (No Dicty present)
- How do evolved strains compare to ancestor strains? (No Dicty present) Are they faster or slower? Does it vary by strain? Why?
- How does adding Dicty change growth rates of ancestor strains? Is it faster or slower? Does it vary by strain? Why?
- How does adding Dicty change growth rates of evolved strains? Are evolved Burk strains more or less sensitive to the effects of adding Dicty? Does it vary by strain? Why?
- How consistent are your results between days? Do you get the same answers? If not, why not?
If you get stuck on any of the initial part (setting up data, calculating growth rates, etc), come find me and I’ll help you. For the interpretation part I encourage you to take a sincere whack at it yourself before you seek help. We’ll go over it once you’ve had the chance to give it some thought.
And he gave her the R code to use:
##Here is a script for analyzing growth rates from the Tecan
##Useful link: https://www.r-bloggers.com/analyzing-microbial-growth-with-r/
##This version of the script was modified on 03/21/18 improve usability and incorporate some statistics
##############################################################################################################
#Install packages – this should only be necessary once per computer.
#install.packages(“ggplot2”)
install.packages(“reshape2”)
#install.packages(“dplyr”)
#install.packages(“readr”)
#install.packages(“devtools”)
#library(devtools)
#install_github(“dcangst/fitr”)
rm(list=ls()) # clear memory and start fresh
#Load packages into memory (will need to be downloaded and installed first)
library(ggplot2) #For plotting
library(reshape2) #For converting data between wide and long forms
library(dplyr) #For grouping and manipulating data
library(readr) #For importing CSV files
library(fitr) #For the actual growth rate analysis
###############################################################################################################
#Set some strings for file naming
mainpath<-“C:/Users/Tyler/Google Drive/School/QSLab/ExperimentalEvolution/Platereaderdata/”
todaysdate<-“042018”
filename<-“combineddata”
##Import data. Data should have columns Date, Well, Bstrain, Bline, Dstrain, Dline, Dtreatment, and Notes (case sensitive)
simpledata <- read_csv(paste(mainpath,”combineddata_042018.csv”,sep=””))
backup<-simpledata #make a backup
#convert the data to long form. The id argument defines which columns are categories identifying the
#data points. Everything else is the data. Variable.name and value.name arguments name the two new columns in the resultant data
meltdata<-melt(simpledata, id=c(“Date”,”Well”,”Bstrain”,”Bline”,”Dstrain”,”Dline”,”Dtreatment”,”Notes”),variable.name=”Time”,value.name=”OD600″)
meltdata$Time<-as.numeric(meltdata$Time) #Convert time from a factor into a number
meltdata$OD600<-as.numeric(meltdata$OD600) #Convert OD600 from a character into a number
meltdata$Bstrain[which(meltdata$Bstrain==”None”)]<-NA
#Save data
write.csv(meltdata, paste(mainpath,filename,”_”,todaysdate,”_”,”melted.csv”,sep=””))
#Read in the cleaned up data
mydata <- meltdata
mydata<-mydata[which(mydata$Bstrain != “NA”),] #Remove blanks
mydata$ID<-paste(mydata$Date,mydata$Well,mydata$Bstrain,mydata$Bline,mydata$Dstrain,mydata$Dline,mydata$Dtreatment, sep=”_”) #add ID column that is a string of all of the columns(except Notes). This is intended to give each line a unique name.
#Read the help for the fitr script
#?d_gcfit
#Run the script, finding mu max values for each, and saving everything into an object called ‘rates’
rates <- d_gcfit(data=mydata, #The growth curve data in long form
w_size=6, #How many data points the script should use for each line fitting
od_name=”OD600″,#The column name containing the absorbance data (y axis)
time_name=”Time”,#The column name containing the time data (x axis)
trafo=”log”,
logBase=2,
RsqCutoff=0.95, #This is the cutoff of how accurate the regression line must be without being rejected
growthCheck=”none”,
parallel=FALSE,
progress=”text”,
.inform=TRUE)
#rates #view the rates object
fitsforexport<-rates$bestfits #save the line fits (but not the rest of the script’s outputs) into a subobject called fitsforexport
#Copy all of the identifying columns and merge them into the fits
mydata2<-mydata[which(mydata$Time == 1),]
mergedata<-merge(mydata2,fitsforexport,by.x=c(“ID”),by.y=c(“ID”))
write.csv(mergedata, paste(mainpath,filename,”_”,todaysdate,”_”,”fit.csv”,sep=””))
#Remove unnecessary columns
drops<-c(“X1″,”Time”,”OD600″,”trafo”,”logBase”,”growth”,”comment”,”numP”,”nTime”,”minT”,”maxT”,”intercept”,”adj.r.sq”,”dt”,”maxOD”,”minOD”)
mergedata<-mergedata[ , !(names(mergedata) %in% drops)]
#Remove lines that the script could not find clean fits for (ie rsq>.95) or lines in which there was no growth
mergedata<-mergedata[which(mergedata$mumax != “NA”),]
#Add a column, Bclade, which specifies which clade/category each Burk strain belongs to (will be useful for stats comparing between clades)
mergedata$Bclade<-mergedata$Bstrain
mergedata$Bclade[which(mergedata$Bclade ==”KP”)]<-NA
mergedata$Bclade[which(mergedata$Bclade %in% c(“Bf”,”BD66″,”soil99″))]<-“nonsymbiont”
mergedata$Bclade[which(mergedata$Bclade %in% c(“bQS70″,”bQS159″,”bQS161”))]<-“B1”
mergedata$Bclade[which(mergedata$Bclade %in% c(“bQS11″,”bQS21″,”bQS69”))]<-“B2”
#Add a column, Bevolved, which specifies if a Burk strain is evolved or not
mergedata$Bevolved<-FALSE
mergedata$Bevolved[which(mergedata$Bline %in% c(“E1″,”E2″,”E3”))]<-TRUE
#Add a column, Devolved, which specifies if a Dicty strain is evolved or not
mergedata$Devolved<-NA
mergedata$Devolved[which(mergedata$Dline %in% c(“E1″,”E2″,”E3″))]<-TRUE
mergedata$Devolved[which(mergedata$Dline ==”Anc”)]<-FALSE
#add a new column, Matched, which indicates whether a B and D strain belong to one another
mergedata$Matched<-FALSE
mergedata$Matched[which(mergedata$Bstrain==”Bf” & mergedata$Dstrain==”QS6″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”BD66″ & mergedata$Dstrain==”QS9″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”soil99″ & mergedata$Dstrain==”QS18″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”bQS70″ & mergedata$Dstrain==”QS70″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”bQS159″ & mergedata$Dstrain==”QS159″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”bQS161″ & mergedata$Dstrain==”QS161″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”bQS11″ & mergedata$Dstrain==”QS11″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”bQS21″ & mergedata$Dstrain==”QS21″)]<-TRUE
mergedata$Matched[which(mergedata$Bstrain==”bQS69″ & mergedata$Dstrain==”QS69″)]<-TRUE
mergedata$Matched[is.na(mergedata$Dstrain)]<-NA
#add a new column, Dictystatus, which indicates whether a Dicty strain is evolved or ancestral (ie collapses E1, E2, and E3)
mergedata$Dictystatus<-“None”
mergedata$Dictystatus[which(mergedata$Devolved ==FALSE)]<-“Anc”
mergedata$Dictystatus[which(mergedata$Devolved ==TRUE)]<-“Evolved”
#Save final dataset
write.csv(mergedata, paste(mainpath,filename,”_”,todaysdate,”_”,”annotated.csv”,sep=””))
finaldata<-mergedata
#finaldata <- read_csv(paste(mainpath,filename,”_”,todaysdate,”_”,”annotated.csv”,sep=””))
###############################################################################################################
####Plotting for runs including Dicty##########################################################################
###############################################################################################################
matcheddata<-finaldata[which(finaldata$Matched %in% c(TRUE, NA)),]
matcheddata<-matcheddata[which(matcheddata$Bstrain!=”KP”),]
#matcheddata<-matcheddata[which(matcheddata$Date != “102217”),] #remove the 102217 run, which was HK and probably shouldn’t have gone in there in the first place
matcheddata<-matcheddata[which(!(matcheddata$Date %in% c(“102217″,”30818″,”22818″,”30918”))),]
#Set the order of the points correctly
matcheddata$Bstrain<-factor(matcheddata$Bstrain, levels=c(“Bf”,”BD66″,”soil99″,”bQS70″,”bQS159″,”bQS161″,”bQS11″,”bQS21″,”bQS69″))
matcheddata$Dline[is.na(matcheddata$Dline)]<-“None”
matcheddata$Dline<-factor(matcheddata$Dline, levels=c(“None”,”Anc”,”E1″,”E2″,”E3″))
matcheddata$Dictystatus<-factor(matcheddata$Dictystatus, levels=c(“None”,”Anc”,”Evolved”))
matcheddata$Bclade<-factor(matcheddata$Bclade, levels=c(“nonsymbiont”,”B1″,”B2″))
matcheddata$Doublingtime<-(0.69314718056/matcheddata$mumax)/4
#Boxplot for growth rate
plot1<-ggplot(data=matcheddata, aes(x=Dline, y=mumax, fill=Dictystatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=”both”) +
theme(legend.position=’none’) +
labs(y=”Specific growth rate (mu)”,x=””) +
ggtitle(“Growth rate by line”)
plot1
#Boxplot for growth rate (collapsed by strain)
plot2<-ggplot(data=matcheddata, aes(x=Dictystatus, y=mumax, fill=Dictystatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=”both”) +
theme(legend.position=’none’) +
labs(y=”Specific growth rate (mu)”,x=””) +
ggtitle(“Growth rate by strain”)
plot2
#Boxplot for growth rate (collapsed by clade)
plot3<-ggplot(data=matcheddata, aes(x=Dictystatus, y=mumax, fill=Dictystatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bclade, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Specific growth rate (mu)”,x=””) +
ggtitle(“Growth rate by clade”)
plot3
##################
#Do some math to change it into fold changes
#Calculate as fold change
grouped1<-group_by(matcheddata, Dstrain, Dline, Bstrain)
stats1<-summarise(grouped1, N=length(mumax),Average=mean(mumax),StDev=sd(mumax))
stats1$ID<-paste(stats1$Bstrain,stats1$Dstrain,stats1$Dline, sep=”_”)
matcheddata$ID<-paste(matcheddata$Bstrain,”NA”,”None”, sep=”_”)
mergedata<-merge(matcheddata,stats1[,c(5,7)],by.x=c(“ID”),by.y=c(“ID”))
mergedata$Foldchange<-mergedata$mumax/mergedata$Average
#Boxplot for growth rate (scaled)
plot4<-ggplot(data=mergedata, aes(x=Dline, y=Foldchange, fill=Dictystatus)) +
geom_boxplot() +
geom_hline(yintercept=1) +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=”both”) +
theme(legend.position=’none’) +
labs(y=”Fold change of specific growth rate (mu)”, x=””) +
ggtitle(“Growth rate (scaled) by line”)
plot4
#Boxplot for growth rate (scaled) (collapsed by strain)
plot5<-ggplot(data=mergedata, aes(x=Dictystatus, y=Foldchange, fill=Dictystatus)) +
geom_boxplot() +
geom_hline(yintercept=1) +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=”both”) +
theme(legend.position=’none’) +
labs(y=”Fold change of specific growth rate (mu)”, x=””) +
ggtitle(“Growth rate (scaled) by strain”)
plot5
#Boxplot for growth rate (scaled) (collapsed by clade)
plot6<-ggplot(data=mergedata, aes(x=Dictystatus, y=Foldchange, fill=Dictystatus)) +
geom_boxplot() +
geom_hline(yintercept=1) +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bclade, switch=”both”) +
theme(legend.position=’none’) +
labs(y=”Fold change of specific growth rate (mu)”, x=””) +
ggtitle(“Growth rate (scaled) by clade”)
plot6
#############
############################################################################
#Doin some stats
########
#STATS STUFF:
#Use non parametric test to see if Anc differ from evolved
wilcox.test(mergedata$Foldchange[which(mergedata$Dictystatus==”Anc”)], mergedata$Foldchange[which(mergedata$Dictystatus==”Evolved”)])
#They do.
teststats<-aov(mumax~Dictystatus*Bclade,data=matcheddata)
summary(teststats)
#Status (whether D is evolved or not), clade, and status*clade are all quite significant
#Try again with fold change
teststats<-aov(Foldchange~Dictystatus*Bclade,data=mergedata)
summary(teststats)
#Status (whether D is evolved or not) and status*clade are quite significant
#################################
#Boxplot for doubling time
plot4<-ggplot(data=matcheddata, aes(x=Dline, y=Doublingtime, fill=Dictystatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=’both’) +
ylim(0,15) +
theme(legend.position=’none’) +
labs(y=”Doubling time (hours)”) +
ggtitle(“Doubling time by line”)
plot4
#Boxplot for doubling time (collapsed by strain)
plot5<-ggplot(data=matcheddata, aes(x=Dictystatus, y=Doublingtime, fill=Dictystatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=’both’) +
ylim(0,15) +
theme(legend.position=’none’) +
labs(y=”Doubling time (hours)”) +
ggtitle(“Doubling time by strain”)
plot5
#Boxplot for doubling time (collapsed by clade)
plot6<-ggplot(data=matcheddata, aes(x=Dictystatus, y=Doublingtime, fill=Dictystatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#999999″,”#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bclade, switch=’both’) +
ylim(0,15) +
theme(legend.position=’none’) +
labs(y=”Doubling time (hours)”) +
ggtitle(“Doubling time by clade”)
plot6
###############################################################################################################
####Plotting for runs without Dicty##########################################################################
###############################################################################################################
#Remove lines with Dicty present
matcheddata2<-finaldata[which(is.na(finaldata$Dstrain)),]
matcheddata2<-matcheddata2[which(matcheddata2$Date %in% c(“30918″,”31618″,”30818”)),]
#Set the order of the points correctly
matcheddata2$Bstrain<-factor(matcheddata2$Bstrain, levels=c(“KP”,”Bf”,”BD66″,”soil99″,”bQS70″,”bQS159″,”bQS161″,”bQS11″,”bQS21″,”bQS69″))
matcheddata2$Bline[is.na(matcheddata2$Bline)]<-“Anc”
matcheddata2$Bclade[is.na(matcheddata2$Bclade)]<-“KP”
matcheddata2$Bclade<-factor(matcheddata2$Bclade, levels=c(“KP”,”nonsymbiont”,”B1″,”B2″))
matcheddata2$Doublingtime<-(0.69314718056/matcheddata2$mumax)/4
matcheddata2$Bstatus<-“None”
matcheddata2$Bstatus[which(matcheddata2$Bevolved ==FALSE)]<-“Anc”
matcheddata2$Bstatus[which(matcheddata2$Bevolved ==TRUE)]<-“Evolved”
#Boxplot for growth rate
plot7<-ggplot(data=matcheddata2, aes(x=Bline, y=mumax, fill=Bline)) +
geom_boxplot() +
scale_fill_manual(values=c(“#F8666D”,”#00BFC4″,”#00BFC4″,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Specific growth rate (mu)”, x=””) +
ggtitle(“Growth rate (Dictyostelium absent) by line”)
plot7
#Boxplot for growth rate (collapsed by strain)
plot8<-ggplot(data=matcheddata2, aes(x=Bstatus, y=mumax, fill=Bstatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Specific growth rate (mu)”, x=””) +
ggtitle(“Growth rate (Dictyostelium absent) by strain”)
plot8
#Boxplot for growth rate (collapsed by clade)
plot9<-ggplot(data=matcheddata2, aes(x=Bstatus, y=mumax, fill=Bstatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bclade, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Specific growth rate (mu)”, x=””) +
ggtitle(“Growth rate (Dictyostelium absent) by clade”)
plot9
#Boxplot for doubling time
plot10<-ggplot(data=matcheddata2, aes(x=Bline, y=Doublingtime, fill=Bline)) +
geom_boxplot() +
scale_fill_manual(values=c(“#F8666D”,”#00BFC4″,”#00BFC4″,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Doubling time (hours)”, x=””) +
ggtitle(“Doubling time (Dictyostelium absent) by line”)
plot10
#Boxplot for doubling time (collapsed by strain)
plot11<-ggplot(data=matcheddata2, aes(x=Bstatus, y=Doublingtime, fill=Bstatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bstrain, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Doubling time (hours)”, x=””) +
ggtitle(“Doubling time (Dictyostelium absent) by strain”)
plot11
#Boxplot for doubling time (collapsed by clade)
plot12<-ggplot(data=matcheddata2, aes(x=Bstatus, y=Doublingtime, fill=Bstatus)) +
geom_boxplot() +
scale_fill_manual(values=c(“#F8666D”,”#00BFC4″)) +
geom_point() +
facet_grid(.~Bclade, switch=’both’) +
theme(legend.position=’none’) +
labs(y=”Doubling time (hours)”, x=””) +
ggtitle(“Doubling time (Dictyostelium absent) by clade”)
plot12