Analytics at Wharton

Research Spotlight

Wharton Professor’s SCEPTRE Tool Helps Link Genetics to Disease Risk

It’s not often the case that statisticians have an advanced understanding of genomics, but it’s exactly this combination of expertise that has Eugene Katsevich, assistant professor of statistics and data science at the Wharton School, making waves with his team’s new software package, SCEPTRE.

At its core, SCEPTRE is a free, easy-to-use statistical tool that biologists can utilize to uncover correlations between certain genetic variants and their impact on disease risk. Understanding these correlations can help scientists treat a number of diseases more quickly and more effectively. To fully understand SCEPTRE’s place in the field of biology, though, it’s important to first lay some genetic groundwork.

Eugene Katsevich, Assistant Professor of Statistics and Data Science

SCEPTRE’s main topic of interest is the human genome, a string of 3.2 billion genetic letters (A, C, G, and T), 99% of which is shared by every human. The genes in our body, which can be viewed as instructional manuals for key biological molecules called proteins, are simply segments of this long string.

The remaining 1% of the genetic letters in the human genome are different from person to person and are known in the field as genetic variants. As Katsevich explains, genetic variants are the difference between “me having brown hair and someone else having blonde hair,” but the impact they have on our bodies extends well beyond aesthetics.

“If, at a given genetic variant, instead of having an A you have a T, this might lead to one or more nearby genes having abnormal levels of activity.” It’s this alteration in gene activity that can lead to increased, or in some scenarios, decreased, disease risk.

Ever since 2003, when scientists successfully mapped out the human genome, the genetics community has conducted what are called Genome-Wide Association Studies (GWAS) to better understand the effect of certain genetic variants on disease risk. Scientists can recruit a group of subjects who are known to have a given disease and compare their genetic variants to those of healthy control subjects. The differences they find can help deduce which genetic variants are most strongly associated with the disease. According to Katsevich, the scientific community now has robust statistical associations between hundreds of thousands of genetic variants and hundreds of diseases.
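The case-versus-control comparison described above can be illustrated for a single variant. The following is a minimal sketch, not the method of any particular GWAS or of SCEPTRE: it assumes a simple 2x2 table of allele counts (all counts nonzero in each row and column) and applies a standard Pearson chi-square test with one degree of freedom. The function names and the counts are hypothetical and chosen only for illustration.

```python
import math


def allele_chi_square(case_alt, case_ref, control_alt, control_ref):
    """Pearson chi-square statistic for a 2x2 table of allele counts.

    Rows are cases and controls; columns are the alternate and
    reference alleles at one variant. Assumes no row or column sums to zero.
    """
    table = [[case_alt, case_ref], [control_alt, control_ref]]
    row_sums = [sum(row) for row in table]
    col_sums = [table[0][j] + table[1][j] for j in range(2)]
    total = sum(row_sums)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected if the allele were unrelated to disease status.
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    return chi2


def chi_square_p_value_1df(chi2):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Made-up counts for one hypothetical variant: the alternate allele
# appears more often among cases than among controls.
chi2 = allele_chi_square(case_alt=60, case_ref=40, control_alt=40, control_ref=60)
p_value = chi_square_p_value_1df(chi2)
print(f"chi-square = {chi2:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the variant's frequency differs between the two groups. A real study repeats a comparison like this across a very large number of variants, so it must also correct for multiple testing and for confounders such as ancestry, which this sketch ignores.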

The picture this paints, however, isn’t quite complete. “Genome-wide association studies kind of link the beginning to the end,” says Katsevich. “What’s important is what happens in the middle – the genes.” For every association GWAS have made between a genetic variant and risk for a given disease, there is a gene in the middle whose abnormal activity – driven by the genetic variant – is the cause of that disease risk. If scientists can identify the gene, they can develop pharmaceuticals to target that gene specifically, increasing the effectiveness of medication.

As it stands, one of the most promising ways to go about linking genetic variants to the genes they affect is by using a genetics tool called CRISPR. CRISPR lets scientists introduce certain genetic variants directly into a population of cells in a lab and record what happens. “If you take this idea and scale it up to thousands of different genetic variants and tens of thousands of genes, in potentially hundreds of thousands of cells, you get what’s called a single-cell CRISPR screen,” says Katsevich.

This is where his software package, SCEPTRE, comes into play. Single-cell CRISPR screens are massive, noisy data sets, and the process of untangling the noise to find true genetic correlations, as opposed to false positives, is a challenging statistical and computational undertaking for the biologists who generate these data.

Now, anyone in the world can download SCEPTRE onto their computer for free, load their data into R, a programming language, and sit back as the tool works its magic to untangle the messy CRISPR screen data. “A tool like SCEPTRE didn’t exist before because single-cell CRISPR screens are a fairly new biotechnology,” says Katsevich, who conceived of SCEPTRE as a postdoctoral researcher at Carnegie Mellon University. “It’s quite complicated and not a lot of statisticians are even aware that this technology exists. I just happened to be at this intersection between statistics and genomics where I could identify this need.”

As SCEPTRE crunches the data, it provides the dual benefit of reducing the number of false positive correlations that the noisy CRISPR screen data might suggest, while simultaneously increasing the number of true positive associations. Knowing which genetic associations to pursue further is of paramount interest to biologists, as they don’t want to spend precious time and money chasing false leads.

The home page of SCEPTRE's website.

At his side throughout the SCEPTRE journey and crucial to its success is Tim Barry, a postdoctoral researcher in the statistics department at the University of Pennsylvania. Barry serves as the lead developer of SCEPTRE and was co-advised by Katsevich and Kathryn Roeder while he studied at Carnegie Mellon University. Roeder, the UPMC Professor of Statistics and Life Sciences at CMU, was also Katsevich’s academic co-advisor during his time there.

Tim Barry

With funding from Analytics at Wharton, Katsevich was able to sponsor Barry’s summer-long visit to Penn’s campus. The opportunity to work side-by-side during that time led to what Katsevich describes as a “quantum leap forward” for SCEPTRE’s functionality.

Their efforts have clearly paid off. SCEPTRE’s positive reputation quickly spread around the genomics community, and specifically caught the eye of John Morris, Neville Sanjana, and Tuuli Lappalainen at the New York Genome Center.

Their team wanted to study the relationships between genetic variants and a variety of blood traits, such as red blood cell count, white blood cell count, platelet count, and more, and they needed SCEPTRE’s help. They had taken about 500 genetic variants that previous GWAS indicated were of interest and run a CRISPR screen on them, but they had been struggling to parse the data.

With the help of SCEPTRE, the team was able to identify a total of 91 variants with statistically strong connections to 124 genes, and they found minimal false positive correlations. These findings were significant in the community, so much so that the influential Science magazine ran a story highlighting their work and its findings.

As far as Katsevich is concerned, this type of progress is just the beginning for SCEPTRE.

“CRISPR screen datasets are getting bigger and bigger every year, and they pose more and more of a computational challenge. Therefore, it’s very important to move beyond this paradigm where we’re just computing everything on our laptop. We need to be able to deploy SCEPTRE on computer clusters and clouds. We plan to develop the infrastructure to make it easy and accessible for biologists to do that with their data, and their computer clusters.”

With the requisite support, SCEPTRE could mark the dawn of a new era for biologists, geneticists, and pharmaceutical companies. “We are going to be able to get insights into human diseases and how we might diagnose and treat them of a kind we have never seen before.”

To keep up with SCEPTRE’s latest developments, follow Katsevich on Twitter and visit his lab website.