
Monday, December 16, 2019

Some Recommended Podcasts and Episodes on AI and Machine Learning

Something I have been interested in for some time now is the convergence of big data and genomics, along with the convergence of causal inference and machine learning.

I am a big fan of the Talking Biotech Podcast which allows me to keep up with some of the latest issues and research in biotechnology and medicine. A recent episode related to AI and machine learning covered a lot of topics that resonated with me.

There was excellent discussion of the human element involved in this work, the importance of data prep and feature engineering (the 80% of the work that has to happen before ML/AI can do its job), and the challenges of non-standard 'omics' data, as well as the potential biases that researchers and developers can inadvertently introduce in this process. There was much more, including applications of machine learning and AI in this space and the best ways to stay up to speed on fast-changing technologies without having to be a heads-down programmer.

I've been in a data science role since 2008 and have transitioned from SAS to R to Python. I've been able to keep up within the domain of causal inference to the extent possible, and I keep up with broader trends I am interested in via podcasts like Talking Biotech. Below is a curated list of my favorites related to data science, with a few of my favorite episodes highlighted.


1) Casual Inference - This is my new favorite podcast by two biostatisticians covering epidemiology/biostatistics/causal inference - and keeping it casual.

Fairness in Machine Learning with Sherri Rose | Episode 03 - http://casualinfer.libsyn.com/fairness-in-machine-learning-with-sherri-rose-episode-03

This episode was the inspiration for my post: When Wicked Problems Meet Biased Data.





#093 Evolutionary Programming -

#266 - Can we trust scientific discoveries made using machine learning?

The Bioinformatics Chat: #37 Causality and potential outcomes with Irineo Cabreros - https://bioinformatics.chat/potential-outcomes

EconTalk: Andrew Gelman - Social Science, Small Samples, and the Garden of Forking Paths - https://www.econtalk.org/andrew-gelman-on-social-science-small-samples-and-the-garden-of-the-forking-paths/

EconTalk: James Heckman - Facts, Evidence, and the State of Econometrics - https://www.econtalk.org/james-heckman-on-facts-evidence-and-the-state-of-econometrics/


Tuesday, June 6, 2017

Professional Science Master's Degree Programs in Biotechnology and Management

As an undergraduate I always had an interest in biotechnology and molecular genetics. However, lab work did not particularly appeal to me. I also recognized early on that science does not occur in a vacuum; it's subject to social, political, economic, and financial forces. This drew me to the field of economics, specifically public choice theory.

When it came time for graduate school I was still torn. I wasn't interested in an MBA and didn't have the background to work in a lab or do field work in genetic research. But I really liked economics. The combination of mathematically precise theories (microeconomics/game theory) and empirically sound methods (econometrics) provided a powerful framework for applied problem solving.

I had two advisers make recommendations that got me thinking outside the box. One suggested that ultimately I would find a niche combining economics and genetics. The other suggested I look at programs like the Bioscience Management program being offered at the time at George Mason University (now Bioinformatics Management). While there were not a lot of programs like that at the time, the Agriculture Department at Western Kentucky University provided enough flexibility in their master's program to include courses in biostatistics, genetics, and applied economics. I was able to work on research projects analyzing consumer perceptions of biotechnology and biotech trait resistance management using tools from econometrics, game theory, and population genetics. Additionally, I took courses in applied economics and finance from both the Department of Agriculture and the College of Business, where I was exposed to tools related to investment analysis, options pricing, and the analysis and valuation of biotech companies, as well as the impacts of technological change and biotechnology on food and economic development.

With this combination of quantitative training and applied work, I have been able to leverage SAS, R, and Python to solve challenging problems across a number of professional analytics and consulting roles.

Today there are a large number of Professional Science Master's programs with curricula similar to the programs I contemplated more than 10 years ago.

According to the National Professional Science Master's Association:

"Professional Science Master's (PSMs) are designed for students who are seeking a graduate degree in science or mathematics and understand the need for developing workplace skills valued by top employers. A perfect fit for professionals because it allows you to pursue advanced training and excel in science or math without a Ph.D., while simultaneously developing highly-valued business skills....PSM programs consist of two years of coursework along with a professional component that includes business, communications and/or regulatory affairs."

In 2012 there was an article in Science detailing these degrees, along with some salary data that seemed attractive. According to the article, the first program was officially offered in 1997, reaching 140 programs by 2009 and more than 247 at the time of printing.

This commentary from the article corroborates how I feel about my experience:

“There is a tendency for students to buy into the line that if you don't get a Ph.D., you're not a serious professional, that you're wasting your mind,” she says. After spending a decade talking with PSM students and graduates, she is certain that’s not true. “There is so much potential for growth and satisfaction with a PSM degree. You can become a person you didn’t even know you wanted to be.”

Below are some programs that look interesting to me and that students considering this option should check out (there is a program locator you can find here). Many of these programs are a mash-up of biology/biotech and applied economics and business degrees.

George Mason University- PSM Bioinformatics Management

University of Illinois - Agricultural Production

Cornell- MPS Agriculture and Life Sciences

Washington State University - PSM Molecular Biosciences

Middle Tennessee State University - PSM Biotechnology

California State - MS Biotechnology/MBA

Johns Hopkins - MBA/MS Biotechnology

Rice - PSM Bioscience and Health Policy

North Carolina State University - MBA (Biosciences Mgt Concentration)

Purdue/Kelley - MS-MBA (not a heavy science emphasis, but a very cool degree regardless from great schools)

See also:
Analytical Translators
Why Study Agricultural/Applied Economics

Sunday, February 12, 2017

Molecular Genetics and Economics

A really interesting article in JEP:

A slice:

"In fact, the costs of comprehensively genotyping human subjects have fallen to the point where major funding bodies, even in the social sciences, are beginning to incorporate genetic and biological markers into major social surveys. The National Longitudinal Study of Adolescent Health, the Wisconsin Longitudinal Study, and the Health and Retirement Survey have launched, or are in the process of launching, datasets with comprehensively genotyped subjects…These samples contain, or will soon contain, data on hundreds of thousands of genetic markers for each individual in the sample as well as, in most cases, basic economic variables. How, if at all, should economists use and combine molecular genetic and economic data? What challenges arise when analyzing genetically informative data?"


Link:

https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3306008/


Reference:
Beauchamp JP, Cesarini D, Johannesson M, et al. Molecular Genetics and Economics. Journal of Economic Perspectives. 2011;25(4):57-82.

Friday, May 8, 2015

Mendelian Instruments (Applied Econometrics meets Bioinformatics)

Recently I defended the use of quasi-experimental methods in wellness studies, and a while back I speculated that genomic data might be useful in a quasi-experimental setting, but wasn't sure how:

"If causality is the goal, then merge 'big data' from the gym app with biometrics and the SNP profiles and employ some quasi-experimental methodology to investigate causality."

Then this morning at Marginal Revolution I ran across a link to a blog post that mentioned exploiting Mendelian variation as instruments in a study related to alcohol consumption.

This piece gives a nice intro I think:

Stat Med. 2008 Apr 15;27(8):1133-63. Mendelian randomization: using genes as instruments for making causal inferences in epidemiology.

Lawlor DA1, Harbord RM, Sterne JA, Timpson N, Davey Smith G

Link: http://www.ncbi.nlm.nih.gov/pubmed/17886233

“Observational epidemiological studies suffer from many potential biases, from confounding and from reverse causation, and this limits their ability to robustly identify causal associations. Several high-profile situations exist in which randomized controlled trials of precisely the same intervention that has been examined in observational studies have produced markedly different findings. In other observational sciences, the use of instrumental variable (IV) approaches has been one approach to strengthening causal inferences in non-experimental situations. The use of germline genetic variants that proxy for environmentally modifiable exposures as instruments for these exposures is one form of IV analysis that can be implemented within observational epidemiological studies. The method has been referred to as 'Mendelian randomization', and can be considered as analogous to randomized controlled trials. This paper outlines Mendelian randomization, draws parallels with IV methods, provides examples of implementation of the approach and discusses limitations of the approach and some methods for dealing with these.”
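The instrumental variable logic behind Mendelian randomization can be sketched in a short simulation. This is a toy example in Python (all data and effect sizes are made up, not from any study): a genetic variant G serves as an instrument for an exposure X that is confounded with the outcome Y by an unmeasured factor U, and the simple Wald/ratio IV estimator recovers the causal effect where naive OLS does not.

```python
# Toy Mendelian randomization via the Wald/ratio IV estimator.
# All data are simulated; G is a hypothetical variant, U an unmeasured confounder.
import numpy as np

rng = np.random.default_rng(42)
n = 10_000

G = rng.binomial(2, 0.3, n)                  # genotype: 0/1/2 copies of an allele
U = rng.normal(size=n)                       # unmeasured confounder
X = 0.5 * G + 1.0 * U + rng.normal(size=n)   # exposure depends on G and U
Y = 0.8 * X + 1.0 * U + rng.normal(size=n)   # true causal effect of X on Y is 0.8

def slope(x, y):
    """OLS slope of y on x: cov(x, y) / var(x)."""
    c = np.cov(x, y)
    return c[0, 1] / c[0, 0]

beta_ols = slope(X, Y)               # biased upward by the confounder U
beta_iv = slope(G, Y) / slope(G, X)  # Wald estimator: reduced form / first stage

print(f"OLS: {beta_ols:.2f}, IV: {beta_iv:.2f} (truth: 0.80)")
```

Because G is (by assumption) randomized at conception and affects Y only through X, the ratio of the G-Y and G-X slopes isolates the causal effect, which is exactly the "analogous to randomized controlled trials" idea in the abstract above.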

Thursday, November 8, 2012

BISC Presentation: An Introduction to Social Network Analysis with Applications



RESEARCH SYMPOSIUM
“Strengthening Collaborations through Interactive Posters”
Friday, November 9, 2012
1:00 – 3:00 pm
Snell 2102 and 2113

Abstract

An introduction to Social Network Analysis tools with applications in viral marketing, social media analytics, epidemiology, homeland security, bioinformatics, student persistence, and technology diffusion.

Suggested Citation

Matt Bogard. "Social Network Analysis: An introduction with applications from literature." WKU Bioinformatics and Information Science Center. Jan. 2012.
Available at: http://works.bepress.com/matt_bogard/23

Saturday, October 13, 2012

BMC Proceedings: A comparison of random forests, boosting and support vector machines for genomic selection

A very cool combination of machine learning/quantitative genetics/bioinformatics

"Genomic selection (GS) involves estimating breeding values using molecular markers spanning the entire genome. Accurate prediction of genomic breeding values (GEBVs) presents a central challenge to contemporary plant and animal breeders. The existence of a wide array of marker-based approaches for predicting breeding values makes it essential to evaluate and compare their relative predictive performances to identify approaches able to accurately predict breeding values. We evaluated the predictive accuracy of random forests (RF), stochastic gradient boosting (boosting) and support vector machines (SVMs) for predicting genomic breeding values using dense SNP markers and explored the utility of RF for ranking the predictive importance of markers for pre-screening markers or discovering chromosomal locations of QTLs."


http://www.biomedcentral.com/1753-6561/5/S3/S11
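To get a feel for this kind of comparison, here is a hedged sketch, in Python/scikit-learn rather than the paper's setup, using simulated SNP genotypes and phenotypes (not the study's data): random forests, boosting, and support vector regression compared by cross-validated R², with RF variable importances used to rank markers.

```python
# Sketch of a genomic selection model comparison on simulated data.
# Marker matrix, trait architecture, and sample sizes are illustrative only.
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_lines, n_snps = 300, 500
X = rng.integers(0, 3, size=(n_lines, n_snps)).astype(float)  # SNPs coded 0/1/2

# only a handful of markers are truly linked to QTLs
qtl_idx = rng.choice(n_snps, size=10, replace=False)
effects = rng.normal(0, 1, size=10)
y = X[:, qtl_idx] @ effects + rng.normal(0, 1, size=n_lines)  # phenotype

models = {
    "RF": RandomForestRegressor(n_estimators=200, random_state=0),
    "Boosting": GradientBoostingRegressor(random_state=0),
    "SVM": SVR(kernel="rbf", C=10.0),
}
scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: mean CV R^2 = {scores[name]:.2f}")

# RF variable importances can pre-screen markers, as the paper explores
rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(rf.feature_importances_)[::-1][:10]
print("top-ranked markers:", sorted(top.tolist()))
```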

Saturday, August 13, 2011

QTL Analysis in R

See also: Part 1: QTL Analysis and Quantitative Genetics Part 2: Statistical Methods for QTL Analysis

The 'qtl' package in R allows you to implement QTL analysis using the methods I've previously discussed. The code below is adapted from Broman's documentation "A Brief Tour of R/qtl." ( http://www.rqtl.org/tutorials/rqtltour.pdf ) My code (which follows) is an even briefer tour, relating specifically to the general topics in QTL analysis that I have previously discussed in Part 1 and 2 of this 3 part series of posts. The data set is a simulated backcross data set. The 'summary' function provides some basic descriptive data, such as the number of individuals in the crosses, the number of mapped chromosomes, the number of phenotypes, and number of mapped markers.

Ex:
Backcross

No. individuals: 400

No. phenotypes: 4
Percent phenotyped: 100 100 100 100

No. chromosomes: 19
Autosomes: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19

Total markers: 91
No. markers: 7 5 3 4 10 5 4 4 6 4 4 5 11 3 3 4 4 2 3
Percent genotyped: 72.7
Genotypes (%): AA:50.1 AB:49.9

The plot function (see output below) provides 1) a genetic map with markers 2) phenotype distributions 3) missing data visualizations:


Recall, from the end of my discussion in Part 1 and my discussion in Part 2 I stated:

For maximum likelihood estimation, the probability density of yi can be stated as:

f(yi) = Pr(yi | Zi = Hk), the probability of phenotype yi given genotype Hk
      = (1/(√(2π)σ)) exp[ -(1/(2σ²)) (yi - Xiβ - Hkγ)² ]

The hypothesis H0: γ = 0 can be tested using the likelihood ratio test:

λ = -2(L0 - L1)

where L0 represents the likelihood under the restricted model. This is equivalent to the general notion presented in Broman (1997) and my previous post:

LOD = log10[ Likelihood(effect occurs by QTL linkage) / Likelihood(effect occurs by chance) ]
The calc.genoprob, scanone, and sim.geno functions allow you to make these calculations with options for expectation maximization, Haley-Knott regression, or multiple imputation. 'summary' and 'max' functions allow you to identify the locations on the chromosomes where the LOD score is maximized and infer the potential location of a QTL. The plot function allows you to plot this data for specified chromosomes.




R Code:
# ------------------------------------------------------------------
# |PROGRAM NAME: EX_QTL_R
# |DATE: 8-13-11
# |CREATED BY: MATT BOGARD 
# |PROJECT FILE: 
# |----------------------------------------------------------------
# | PURPOSE: DEMONSTRATE STATISTICAL METHODS FOR QTL ANALYSIS IN R 
# | 
# |------------------------------------------------------------------
# | REFERENCE: R code for "A brief tour of R/qtl"
# | Karl W Broman, kbroman.wisc.edu http://www.rqtl.org
# |-----------------------------------------------------------------
 
 
library(qtl) # load the qtl package

ls()
 
############################################################
# exploring backcross data
############################################################
 
 
data(fake.bc) # load simulated backcross data (example data in the qtl package)
ls()

summary(fake.bc) # summary of fake.bc, an object of class 'cross'

# you can also get this information with specific function calls:
nind(fake.bc) # number of individuals
nphe(fake.bc) # number of phenotypes
nchr(fake.bc) # number of chromosomes
totmar(fake.bc) # total number of markers
nmar(fake.bc) # number of markers per chromosome
 
############################################################
# plotting maps, genotypes, phenotypes, and markers
###########################################################
 
 
plot(fake.bc) # plots the data: genetic map, phenotype distributions, missing genotypes
 
# you can also call for the plots individually:
plot.missing(fake.bc) # just plot missing genotypes
plot.map(fake.bc) # just plot the genetic map
plot.pheno(fake.bc, pheno.col=1) # just plot phenotype 1 'phe 1'
plot.map(fake.bc, chr=c(1, 4, 6, 7, 15), show.marker.names=TRUE) # specific chromosomes and marker names
plot.missing(fake.bc, reorder=TRUE) # order missing genotypes by phenotype value

fake.bc <- drop.nullmarkers(fake.bc) # drop markers that have no genotype data
totmar(fake.bc) # total markers remaining
 
 
##################################################################
# specifying and estimating the likelihood used for QTL mapping
#################################################################
 
# From Broman:
# The function calc.genoprob calculates QTL genotype probabilities, conditional
# on the available marker data. These are needed for most of the QTL mapping 
# functions. The argument step indicates the step size (in cM) at which the
# probabilities are calculated, and determines the step size at which later 
# LOD scores are calculated.
 
 
fake.bc <- calc.genoprob(fake.bc, step=1, error.prob=0.01)
 
# function scanone performs single-QTL genome scan with a normal model. 
# methods include: maximum likelihood via the EM algorithm 
# and Haley-Knott regression 
 
out.em <- scanone(fake.bc)
out.hk <- scanone(fake.bc, method="hk")
 
# multiple imputation method using sim.geno utilizing the joint 
# genotype distribution, given the observed marker data.
 
fake.bc <- sim.geno(fake.bc, step=2, n.draws=16, error.prob=0.01)
out.imp <- scanone(fake.bc, method="imp")
 
# get the maximum LOD score on each chromosome 
# can also specify a threshold for LOD
 
summary(out.em)
summary(out.em, threshold=3)
summary(out.hk, threshold=3)
summary(out.imp, threshold=3)

# max() (i.e., max.scanone) returns just the highest peak from scanone output
max(out.em) # based on expectation maximization
max(out.hk) # based on Haley-Knott regression
max(out.imp) # based on multiple imputation
 
##################################################################
# plot LOD scores by chromosome for QTL mapping
#################################################################
 
 
plot(out.em, chr=c(2,5)) # just the EM method
plot(out.em, out.hk, out.imp, chr=c(2,5)) # all three methods
plot(out.em, chr=c(2)) # zoom in on a specified chromosome

Wednesday, July 27, 2011

Statistical Methods for QTL Analysis

See also QTL Analysis and Quantitative Genetics and QTL Analysis in R.

In a previous post I gave a general overview of QTL mapping and analysis, and gave the motivation for the use of maximum likelihood to identify the approximate location of a QTL based on RFLP (marker) variations. Below are more details on the statistical methods that can be used in this process.


Analysis of Variance and Marker Regression

At each marker (RFLP) locus, compare backcross phenotype distributions for groups that differ according to their marker genotypes, as depicted in Broman (2001):


For markers a and c, we see that phenotype distributions differ for genotypes aa and aa’ but not for cc and cc’. Again this indicates that a QTL may be linked to the RFLP genotypes at locus “a” but not "c".

The difference between the phenotype distributions for two marker genotypes can be assessed using a t-statistic. For more than two genotypes, analysis of variance may be used. The ANOVA approach permits flexible experimental designs, allowing for the incorporation of covariates, treatment effects, and environmental effects.
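A minimal sketch of this single-marker test on simulated data (Python/scipy here rather than dedicated QTL software; marker "a" is simulated as linked to a QTL, marker "c" as unlinked):

```python
# Single-marker analysis sketch: compare phenotype distributions between
# marker genotype classes with a t-test (2 classes) or one-way ANOVA (>2).
# All genotypes and phenotypes below are simulated for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200

# marker "a" is linked to a QTL: genotype shifts the phenotype mean
geno_a = rng.integers(0, 2, n)                    # 0 = aa, 1 = aa'
pheno = 10 + 2.0 * geno_a + rng.normal(0, 1, n)

# marker "c" is unlinked: genotype is unrelated to the phenotype
geno_c = rng.integers(0, 2, n)

t_a, p_a = stats.ttest_ind(pheno[geno_a == 0], pheno[geno_a == 1])
t_c, p_c = stats.ttest_ind(pheno[geno_c == 0], pheno[geno_c == 1])
print(f"marker a: p = {p_a:.2g} (evidence of a linked QTL)")
print(f"marker c: p = {p_c:.2g} (no evidence)")

# with three genotype classes (e.g., an F2), one-way ANOVA generalizes this
geno3 = rng.integers(0, 3, n)
pheno3 = 10 + 1.0 * geno3 + rng.normal(0, 1, n)
f_stat, p_anova = stats.f_oneway(*(pheno3[geno3 == g] for g in (0, 1, 2)))
```

The same comparison is repeated marker by marker across the genome; this ANOVA-style marker analysis is the simplest relative of the likelihood methods discussed next.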

The model for marker regression, following the notation in Hu and Xu (2009), can be specified as follows:

yi= Xiβ + Ziγ +εi

such that yi is the phenotype of the ith individual, β is a vector of control effects, Xi is a design vector, γ is a vector of QTL effects, and Zi is a genotype indicator vector.

Zi = H1 for A1A1, H2 for A1A2, H3 for A2A2.

Tests of hypotheses related to QTL effects take the form H0: γ = 0; hence we test the null hypothesis of no QTL effect associated with genotype Zi.

Maximum Likelihood

For maximum likelihood estimation, the probability density of yi can be stated as:

f(yi) = Pr(yi | Zi = Hk), the probability of phenotype yi given genotype Hk
      = (1/(√(2π)σ)) exp[ -(1/(2σ²)) (yi - Xiβ - Hkγ)² ]

The log likelihood function can then be specified as L(θ) = Σ ln f(yi).
The hypothesis H0: γ =0 can be tested using the likelihood ratio test:
λ = -2(L0-L1)
where L0 represents the likelihood under a restricted model. This is equivalent to the general notion presented in Broman (1997) and my previous post:

LOD = log10[ Likelihood(effect occurs by QTL linkage) / Likelihood(effect occurs by chance) ]
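The likelihood ratio test above can be worked numerically. In this normal model the ML fit is just least squares, so the sketch below (simulated data; all values illustrative) computes L0 and L1, λ = -2(L0 - L1), and converts λ to a LOD score via LOD = λ / (2 ln 10):

```python
# Worked sketch of the likelihood ratio test H0: gamma = 0 at a single marker.
# The normal-model ML fit is OLS; data are simulated for illustration.
import numpy as np

rng = np.random.default_rng(7)
n = 250
Z = rng.integers(0, 2, n).astype(float)   # genotype indicator
X = np.column_stack([np.ones(n)])         # controls: intercept only
y = 5 + 1.5 * Z + rng.normal(0, 1, n)     # true gamma = 1.5

def max_loglik(design, y):
    """Profile log-likelihood of the normal model at the ML estimates."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    sigma2 = resid @ resid / len(y)       # ML (not unbiased) variance
    return -0.5 * len(y) * (np.log(2 * np.pi * sigma2) + 1)

L0 = max_loglik(X, y)                         # restricted model: gamma = 0
L1 = max_loglik(np.column_stack([X, Z]), y)   # full model with QTL effect
lam = -2 * (L0 - L1)                          # ~ chi-square(1) under H0
lod = lam / (2 * np.log(10))                  # log10 likelihood ratio
print(f"lambda = {lam:.1f}, LOD = {lod:.1f}")
```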

References:

Jones, N., H. Ougham, and H. Thomas. Markers and mapping: we are all geneticists now. New Phytol. 137:165-177. 1997.

Broman, K.W. Review of statistical methods for QTL mapping in experimental crosses. Lab Anim (NY). 2001 Jul-Aug;30(7):44-52.

Hu, Z. and S. Xu (2009). PROC QTL - A SAS Procedure for Mapping Quantitative Trait Loci. International Journal of Plant Genomics 2009:141234. doi:10.1155/2009/141234.

QTL Analysis and Quantitative Genetics

See also Statistical Methods for QTL Analysis and QTL Analysis in R

Restriction Fragment Length Polymorphisms (RFLPs) and Restriction Enzymes

Restriction enzymes target specific DNA base sequences, producing staggered cuts and variable-length DNA fragments ('RFLPs').


RFLPs establish fixed landmarks in the genome. Subjecting DNA to restriction enzymes and gel electrophoresis produces gel patterns that can be described in terms of genotype or allelic differences between RFLPs at a given locus, and these follow the rules of Mendelian inheritance. (Recall that a 'locus' is the location of a DNA sequence on a chromosome; a variant of a DNA sequence at a given locus is an 'allele.')

Backcrossing, Segregation, and Recombination
P1 AABB x P2 aabb --> F1 AaBb

F1 x P2 --> ABab (parental)
            Abab (recombinant)
            aBab (recombinant)
            abab (parental)

With unlinked genes and independent segregation, recombinant genotypes Abab and aBab will result 50% of the time. With linked genes, crossing over will result in recombinants < 50% of the time.

The probability of crossing over is proportional to the distance between loci, therefore, crossover rates can be used to create genetic maps.
Loci    % Crossover (Recombinants)
a,b     7
b,c     19
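The crossover rates above can be turned into a rough genetic map. The posts don't specify a map function, so this sketch assumes Haldane's (one standard choice), which corrects recombination fractions for undetected double crossovers:

```python
# Sketch of building a simple genetic map from recombination fractions.
# Map distance in centimorgans (cM) ~ percent recombinants for tightly
# linked loci; Haldane's map function accounts for double crossovers.
import math

r_ab, r_bc = 0.07, 0.19   # recombination fractions from the table above

def haldane_cM(r):
    """Haldane map function: d = -50 * ln(1 - 2r), in centimorgans."""
    return -50 * math.log(1 - 2 * r)

d_ab, d_bc = haldane_cM(r_ab), haldane_cM(r_bc)
print(f"a-b: {d_ab:.1f} cM, b-c: {d_bc:.1f} cM")

# if the order is a-b-c, map distances are additive
d_ac = d_ab + d_bc
# inverse Haldane gives the expected a-c recombination fraction
r_ac = 0.5 * (1 - math.exp(-2 * d_ac / 100))
print(f"a-c: {d_ac:.1f} cM, expected recombination fraction {r_ac:.1%}")
```

Note that the expected a-c recombination fraction comes out below the naive sum 7% + 19% = 26%, because double crossovers between a and c cancel out and go undetected.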

Mapping Population
Use P1,P2,F1 to create a population with segregating QTL and RFLP profiles and phenotypic variation. RFLPs that are tightly linked to QTLs will segregate together, can serve as markers for that trait.

Central Dogma of QTL Marker Analysis

DNA: Δ mapped locus position --> Δ RFLP allele --> Δ phenotype --> likelihood that the RFLP is linked to a QTL

Here a QTL refers to a bundle of genes associated with some quantitative trait of interest. As we move across a segment of DNA, we note changes in RFLP genotype alleles and associated changes in the phenotype of interest. These data can then be assessed to determine the likelihood that an RFLP is linked to a QTL.

For example, suppose at locus "a" there are two observed allelic differences in the RFLP profile, denoted aa and aa'. If we notice distinct differences in the phenotypic distribution for plants with alleles aa vs. aa', it may be inferred that the RFLP at locus "a" is linked to the QTL of interest. If at another locus we observe two allelic differences in the RFLP profile, cc and cc', and find little difference in the associated phenotypes, we may conclude that the RFLP at locus "c" is not linked to the QTL of interest. Intermediate differences between phenotypes would imply that the RFLP at that locus is only loosely linked to the QTL of interest. The following schematic from Jones et al. (1997) illustrates these differences:

The map position of the QTL can be determined by the method of maximum likelihood from the distribution of likelihood values determined at each locus above. At each locus we compute the following:

Likelihood(effect occurs by QTL linkage) / Likelihood(effect occurs by chance)

Depicted as L1/L0 below:

The key to using markers to map QTLs is to associate RFLP patterns with QTLs. Observing changes in RFLP profiles among plant DNA from a segregating population at various positions in the genome, and associating them with changes in phenotype, allows us to find a statistical relationship between RFLPs and QTLs. Based on the observations above, the evidence strongly suggests that a QTL may be found near the RFLP locus "a".

Reference: Jones, N., H. Ougham, and H. Thomas. 1997. Markers and mapping:We are all geneticists now. New Phytol. 137:165–177

Friday, September 17, 2010

Mathematical Themes in Economics, Machine Learning and Bioinformatics

By Matt Bogard

Abstract

Graduate students in economics are often introduced to some very useful mathematical tools that many outside the discipline may not associate with training in economics. This essay looks at some of these tools and concepts, including constrained optimization, separating hyperplanes, supporting hyperplanes, and ‘duality.’ Applications of these tools are explored including topics from machine learning and bioinformatics.

Download Full Text