data science

AI Tool Using Single-Cell Data Has Promise for Optimally Matching Cancer Drugs to Patients

Posted on by Dr. Monica M. Bertagnolli

data flows around a central cancer cell
Credit: Donny Bliss/NIH, NicoElNino/Adobe Stock

Precision oncology, in which doctors choose cancer treatment options based on the underlying molecular or genetic signature of individual tumors, has come a long way. The Food and Drug Administration has approved a growing number of tests that look for specific genetic changes that drive cancer growth to match patients to targeted treatments. The NCI-MATCH trial, supported by the National Cancer Institute, in which participants with advanced or rare cancer had their tumors sequenced in search of genetic changes that matched them to a treatment, has also suggested benefits for guiding treatment through genetic sequencing. But there remains a need to better predict treatment responses for people with cancer.

A promising approach is to analyze a tumor’s RNA in addition to its DNA. The idea is to not only better understand underlying genetic changes, but also learn how those changes impact gene activity as measured by RNA sequencing data. A recent study introduces an artificial intelligence (AI)-driven tool, dubbed PERCEPTION (PERsonalized single-Cell Expression-based Planning for Treatments In ONcology), developed by an NIH-led team to do just this [1]. This proof-of-concept study, published in Nature Cancer, shows that it’s possible to fine-tune predictions of a patient’s treatment responses from bulk RNA data by zeroing in on what’s happening inside single cells.

One of the challenges in relying on bulk data from tumor samples is they typically include mixtures of like cells known as clones. Because different clones may respond differently to specific drugs, averaging what’s happening in cells across a particular patient’s tumor may not provide a clear picture of how that cancer will respond to a drug. Being able to capture gene activity patterns all the way down to the single-cell level might be a better way to target clones with specific alterations and therefore see better drug responses, but so far, single-cell gene expression data haven’t been widely available.

To explore the potential of single-cell RNA data, a team led by Eytan Ruppin and Alejandro Schäffer at NCI’s Cancer Data Science Laboratory at the NIH Clinical Center in Bethesda, MD, and Sanju Sinha, now at Sanford Burnham Prebys in San Diego, used a technique called transfer learning to train an AI model to predict drug responses. They first used existing bulk RNA sequencing data and then fine-tuned those models using single-cell RNA sequencing data from cell lines and large-scale drug screens. All told, they built AI models for 44 drugs approved by the FDA.

They found that PERCEPTION predicts the success of targeted treatments against cell lines with an accuracy reflected by an AUC score of about 0.8. AUC measures how well a model can distinguish between drug-sensitive and drug-resistant cell lines, with 0.5 being no better than a random guess and 1.0 being perfect accuracy. While there’s room for improvement, the findings show that PERCEPTION works better than earlier methods. The results also extended to single drugs and combination treatments in cultured cells and in cells isolated from patient tumors.
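To make the AUC scale concrete, here is a minimal sketch in Python. It treats AUC as the probability that a randomly chosen drug-sensitive cell line receives a higher predicted score than a randomly chosen drug-resistant one. The function name and the scores are hypothetical illustrations, not data or code from the PERCEPTION study.

```python
def auc(scores_sensitive, scores_resistant):
    """Fraction of (sensitive, resistant) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for s in scores_sensitive:
        for r in scores_resistant:
            if s > r:
                wins += 1.0
            elif s == r:
                wins += 0.5
    return wins / (len(scores_sensitive) * len(scores_resistant))


# Hypothetical predicted-sensitivity scores (higher = predicted more sensitive).
sensitive = [0.9, 0.8, 0.6, 0.4]
resistant = [0.7, 0.5, 0.3, 0.2]

example_auc = auc(sensitive, resistant)
print(example_auc)  # 13 of 16 pairs are ranked correctly
```

With these toy scores, 13 of the 16 possible pairs are ordered correctly, which gives an AUC of about 0.8. A model that scored every cell line identically would land at 0.5.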

But would the tool accurately predict responses to treatments for patients? To find out, the researchers used their models to predict treatment responses based on clinical trial data for 41 patients with multiple myeloma treated with a combination of four drugs and 33 patients with breast cancer treated with a combination of two drugs. Their findings showed that the model could successfully predict treatment responses in patients, again with an AUC score of about 0.8. 

Interestingly, their research shows that having just one clone in a tumor that is resistant to a particular drug is enough to thwart a response to that drug. As a result, the clone with the worst response in a tumor will best explain a person’s overall treatment response. Further study revealed that the model could also predict the development of resistance to treatment in published data from 24 people treated with targeted therapies for non-small cell lung cancer.
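The "worst clone" idea can be illustrated with a short Python sketch. It assumes each clone in a tumor already has a predicted sensitivity score and summarizes the patient by the minimum rather than the average. The function name, clone labels, and scores are hypothetical, and this is not the study's actual implementation.

```python
def patient_response(clone_scores):
    """Summarize a tumor by its least responsive clone."""
    if not clone_scores:
        raise ValueError("need at least one clone score")
    return min(clone_scores.values())


# Hypothetical per-clone predicted sensitivity (higher = more sensitive).
tumor = {"clone_A": 0.91, "clone_B": 0.85, "clone_C": 0.12}

mean_score = sum(tumor.values()) / len(tumor)  # averaging hides the resistant clone
worst_score = patient_response(tumor)          # the minimum exposes it

print(round(mean_score, 2), worst_score)
```

In this toy tumor, the average looks fairly favorable even though one clone is predicted to resist the drug. Taking the minimum reflects the finding that a single resistant clone can be enough to thwart a response.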

The researchers note that the accuracy of their technique will only improve as single-cell RNA sequencing data becomes more widely available for more patients with additional cancer types. To aid in this endeavor, they’ve developed a research website and guide to enable other researchers to use PERCEPTION to build AI models that predict treatment responses. Their hope is, as these findings suggest, that single-cell RNA sequencing data could one day help doctors more precisely match patients to their optimal cancer treatments.

Reference:

[1] Sinha S, et al. PERCEPTION predicts patient response and resistance to treatment using single-cell transcriptomics of their tumors. Nature Cancer. DOI 10.1038/s43018-024-00756-7 (2024).

NIH Support: National Cancer Institute


National Library of Medicine Helps Lead the Way in AI Research

Posted on by Patricia Flatley Brennan, R.N., Ph.D., National Library of Medicine

NIH, National Library of Medicine. The earth surrounded by a ring of data
Credit: National Library of Medicine, NIH

Did you know that the NIH’s National Library of Medicine (NLM) has been serving science and society since 1836? From its humble beginning as a small collection of books in the library of the U.S. Army Surgeon General’s office, NLM has grown not only to become the world’s largest biomedical library, but a leader in biomedical informatics and computational health data science research.

Think of NLM as a door through which you pass to connect with health data, literature, medical and scientific information, expertise, and sophisticated mathematical models or images that describe a clinical problem. This intersection of information, people, and technology allows NLM to foster discovery. NLM does so by ensuring that scientists, clinicians, librarians, patients, and the public have access to biomedical information 24 hours a day, 7 days a week.

The NLM also supports two research efforts: the Division of Extramural Programs (EP) and Intramural Research Program (IRP). Both programs are accelerating advances in biomedical informatics, data science, computational biology, and computational health. One of EP’s notable investments is focused on advancing artificial intelligence (AI) methods and reimagining how health care is delivered with the power of AI.

How to teach machines, showing four different piles of pills.
Credit: National Library of Medicine, NIH

With support from NLM, Corey Lester and his colleagues at the University of Michigan College of Pharmacy, Ann Arbor, MI, are using AI to assist in pill verification, a standard procedure in pharmacies across the land. They want to help pharmacists avoid dangerous and costly dispensing errors. To do so, Lester is using AI to develop a real-time computer vision model. It views pills inside of a medication bottle, accurately identifies them, and determines whether they are the correct contents.

The IRP develops and applies computational methods and approaches to a broad range of information problems in biology, biomedicine, and human health. The IRP also offers intramural training opportunities and supports other training aimed at pre-baccalaureate to postdoctoral students and professionals.

The NLM principal investigators use biological data to advance computer algorithms and uncover relationships between any level of biological organization and health conditions. They also use computational health sciences to focus on clinical information processing, analyze clinical data, assess clinical outcomes, and set health data standards.

Four chest x-rays
Credit: National Library of Medicine, NIH

NLM investigator Sameer Antani is collaborating with researchers in other NIH institutes to explore how AI can help us understand oral cancer, echocardiography, and pediatric tuberculosis. His research also is examining how images can be mined for data to predict the causes and outcomes of conditions. Examples of Antani’s work can be found in mobile radiology vehicles, which allow professionals to take chest X-rays (right) and screen for HIV and tuberculosis using software containing algorithms developed in his lab.

For AI to have its full impact, more algorithms and approaches that harness the power of data are needed. That’s why NLM supports hundreds of other intramural and extramural scientists who are addressing challenging health and biomedical problems. The NLM-funded research is focused on how AI can help people stay healthy through early disease detection, disease management, and clinical and treatment decision-making—all leading to the ultimate goal of helping people live healthier and happier lives.

The NLM is proud to lead the way in the use of AI to accelerate discovery and transform health care. Want to learn more? Follow me on Twitter. Or, you can follow my blog, NLM Musings from the Mezzanine and receive periodic NLM research updates.

I would like to thank Valerie Florance, Acting Scientific Director of NLM IRP, and Richard Palmer, Acting Director of NLM Division of EP, for their assistance with this post.

Links:

National Library of Medicine (National Library of Medicine/NIH)

Video: Using Machine Intelligence to Prevent Medication Dispensing Errors (NLM Funding Spotlight)

Video: Sameer Antani and Artificial Intelligence (NLM)

NLM Division of Extramural Programs (NLM)

NLM Intramural Research Program (NLM)

NLM Intramural Training Opportunities (NLM)

Principal Investigators (NLM)

NLM Musings from the Mezzanine (NLM)

Note: Dr. Lawrence Tabak, who performs the duties of the NIH Director, has asked the heads of NIH’s Institutes and Centers (ICs) to contribute occasional guest posts to the blog to highlight some of the interesting science that they support and conduct. This is the 20th in the series of NIH IC guest posts that will run until a new permanent NIH director is in place.


Understanding Long-Term COVID-19 Symptoms and Enhancing Recovery

Posted on by Walter J. Koroshetz, M.D., National Institute of Neurological Disorders and Stroke

RECOVER: Researching COVID to Enhance Recovery. An Initiative Funded by the National Institutes of Health

We are in the third year of the COVID-19 pandemic, and across the world, most restrictions have lifted, and society is trying to get back to “normal.” But for many people—potentially millions globally—there is no getting back to normal just yet.

They are still living with the long-term effects of a COVID-19 infection, known as the post-acute sequelae of SARS-CoV-2 infection (PASC), including Long COVID. These people continue to experience debilitating fatigue, shortness of breath, pain, difficulty sleeping, racing heart rate, exercise intolerance, gastrointestinal and other symptoms, as well as cognitive problems that make it difficult to perform at work or school.

This is a public health issue that is in desperate need of answers. Research is essential to address the many puzzling aspects of Long COVID and guide us to effective responses that protect the nation’s long-term health.

For the past two years, NIH’s National Heart, Lung, and Blood Institute (NHLBI), the National Institute of Allergy and Infectious Diseases (NIAID), and my National Institute of Neurological Disorders and Stroke (NINDS), along with several other NIH institutes and the Office of the NIH Director, have been leading NIH’s Researching COVID to Enhance Recovery (RECOVER) initiative, a national research program to understand PASC.

The initiative studies core questions such as why COVID-19 infections can have lingering effects, why new symptoms may develop, and what impact SARS-CoV-2, the virus that causes COVID-19, has on other diseases and conditions. Answering these fundamental questions will help to determine the underlying biologic basis of Long COVID. The answers will also help to tell us who is at risk for Long COVID and identify therapies to prevent or treat the condition.

The RECOVER initiative’s wide scope of research is also unprecedented. It is needed because Long COVID is so complex, and history indicates that similar post-infectious conditions have defied definitive explanation or effective treatment. Indeed, those experiencing Long COVID report varying symptoms, making it highly unlikely that a single therapy will work for everyone, underscoring the need to pursue multiple therapeutic strategies.

To understand Long COVID fully, hundreds of RECOVER investigators are recruiting more than 17,000 adults (including pregnant people) and more than 18,000 children to take part in cohort studies. Hundreds of enrolling sites have been set up across the country. An autopsy research cohort will also provide further insight into how COVID-19 affects the body’s organs and tissues.

In addition, researchers will analyze electronic health records from millions of people to understand how Long COVID and its symptoms change over time. The RECOVER initiative is also utilizing consistent research protocols across all the study sites. The protocols have been carefully developed with input from patients and advocates, and they are designed to allow for consistent data collection, improve data sharing, and help to accelerate the pace of research.

From the very beginning, people suffering from Long COVID have been our partners in RECOVER. Patients and advocates have contributed important perspectives and provided valuable input into the master protocols and research plans.

Now, with RECOVER underway, individuals with Long COVID, their caregivers, and community members continue to serve a critical role in the Initiative. The National Community Engagement Group (NCEG) has been established to make certain that RECOVER meets the needs of all people affected by Long COVID. The RECOVER Patient and Community Engagement Strategy outlines all the approaches that RECOVER is using to engage with and gather input from individuals impacted by Long COVID.

The NIH recently made more than 40 awards to improve understanding of the underlying biology and pathology of Long COVID. There have already been several important findings published by RECOVER scientists.

For example, in a recent study published in the journal Lancet Digital Health, RECOVER investigators used machine learning to comb through electronic health records to look for signals that may predict whether someone has Long COVID [1]. As new findings, tools, and technologies continue to emerge that help advance our knowledge of the condition, the RECOVER Research Review (R3) Seminar Series will provide researchers and our partners with a forum for up-to-date information about Long COVID research.

It is important to note that post-viral conditions are not a new concept. Many, but not all, of the symptoms reported in Long COVID, including fatigue, post-exertional malaise, chronic musculoskeletal pain, sleep disorders, postural orthostatic tachycardia syndrome (POTS), and cognitive issues, overlap with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS).

ME/CFS is a serious disease that can occur following infection and make people profoundly sick for decades. Like Long COVID, ME/CFS is a heterogeneous condition that does not affect everybody in the same way, and the knowledge gained through research on Long COVID may also positively impact the understanding, treatment, and prevention of POTS, ME/CFS, and other chronic diseases.

Unlike other post-viral conditions, people who experience Long COVID were all infected by the same virus—albeit different variants—at a similar point in time. This creates a unique opportunity for RECOVER researchers to study post-viral conditions in real-time.

The opportunity enables scientists to study many people simultaneously while they are still infected to monitor their progress and recovery, and to try to understand why some individuals develop ongoing symptoms. A better understanding of the transition from acute to chronic disease may offer an opportunity to intervene, identify who is at risk of the transition, and develop therapies for people who experience symptoms long after the acute infection has resolved.

The RECOVER initiative will soon announce clinical trials that draw on data from clinicians and patients in which symptom clusters were identified that can be targeted by various interventions. These trials will investigate both therapies that are indicated for other, non-COVID conditions and novel treatments for Long COVID.

Through extensive collaboration across the multiple NIH institutes and offices that contribute to the RECOVER effort, our hope is that critical answers will emerge soon. These answers will help us to recognize the full range of outcomes and needs resulting from PASC and, most important, enable many people to make a full recovery from COVID-19. We are indebted to the over 10,000 subjects who have already enrolled in RECOVER. Their contributions and the hard work of the RECOVER investigators offer hope for the future to the millions still suffering from the pandemic.

Reference:

[1] Identifying who has long COVID in the USA: a machine learning approach using N3C data. Pfaff ER, Girvin AT, Bennett TD, Bhatia A, Brooks IM, Deer RR, Dekermanjian JP, Jolley SE, Kahn MG, Kostka K, McMurry JA, Moffitt R, Walden A, Chute CG, Haendel MA; N3C Consortium. Lancet Digit Health. 2022 Jul;4(7):e532-e541.

Links:

COVID-19 Research (NIH)

Long COVID (NIH)

RECOVER: Researching COVID to Enhance Recovery (NIH)

“NIH builds large nationwide study population of tens of thousands to support research on long-term effects of COVID-19,” NIH News Release, September 15, 2021.

Director’s Messages (National Institute of Neurological Disorders and Stroke/NIH)

Note: Dr. Lawrence Tabak, who performs the duties of the NIH Director, has asked the heads of NIH’s Institutes and Centers (ICs) to contribute occasional guest posts to the blog to highlight some of the interesting science that they support and conduct. This is the 18th in the series of NIH IC guest posts that will run until a new permanent NIH director is in place.


Genome Data Help to Track COVID-19 Superspreading Event

Posted on by Dr. Francis Collins

Boston skyline
Credit: iStock/Chaay_Tee

When it comes to COVID-19, anyone, even without symptoms, can be a “superspreader” capable of unknowingly infecting a large number of people and causing a community outbreak. That’s why it is so important right now to wear masks when out in public and avoid large gatherings, especially those held indoors, where a superspreader can readily infect others with SARS-CoV-2, the virus responsible for COVID-19.

Driving home this point is a new NIH-funded study on the effects of just one superspreader event in the Boston area: an international biotech conference held in February, before the public health risks of COVID-19 had been fully realized [1]. Almost a hundred people were infected. But it didn’t end there.

In the study, the researchers sequenced close to 800 viral genomes, including cases from across the first wave of the epidemic in the Boston area. Using the fact that the viral genome changes in very subtle ways over time, they found that SARS-CoV-2 was actually introduced independently to the region more than 80 times, primarily from Europe and other parts of the United States. But the data also suggest that a single superspreading event at the biotech conference led to the infection of almost 20,000 people in the area, not to mention additional COVID-19 cases in other states and around the world.

The findings, posted on medRxiv as a pre-print, come from Bronwyn MacInnis and Pardis Sabeti at the Broad Institute of MIT and Harvard in Cambridge, MA, and their many close colleagues at Massachusetts General Hospital, the Massachusetts Department of Public Health, and the Boston Health Care for the Homeless Program. The initial focus of MacInnis, Sabeti, and their Broad colleagues has been on developing genome data and tools for surveillance of viruses and other infectious diseases, including viral outbreaks in West Africa such as Lassa fever and Ebola virus disease.

Closer to home, they’d expected to focus their attention on West Nile virus and tick-borne diseases. But, when the COVID-19 outbreak erupted, they were ready to pivot quickly to assist several Centers for Disease Control and Prevention (CDC) and state labs in the northeastern United States to use genomic tools to investigate local outbreaks.

It’s been clear from the beginning of the pandemic that COVID-19 cases often arise in clusters, linked to gatherings in places such as cruise ships, nursing homes, and homeless shelters. But, as the Broad Institute team and their colleagues realized, it’s difficult to see how extensively a virus spreads from such places into the wider community based on case counts alone.

Contact tracing certainly helps to track community spread of the virus. This surveillance strategy depends on quick, efficient identification of an infected individual. It follows up with the identification of all who’ve recently been in close contact with that person, allowing the contacts to self-quarantine and break the chain of transmission.

But contact tracing has its limitations. It’s not always possible to identify all the people that an infected person has been in recent contact with. Genome data, however, are particularly useful after the fact for connecting those dots to get a big picture view of viral transmission.

Here’s how it works: as SARS-CoV-2 spreads, the virus sometimes picks up a new mutation. Those tiny spelling changes in the viral genome usually have no effect on how the virus causes disease, but they do serve as distinct genomic fingerprints. Using those fingerprints to guide the way, researchers can trace the path the virus took through a community and beyond, identifying connections among cases that would be untrackable otherwise.
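The fingerprint idea can be sketched in a few lines of Python. Each case is represented by the set of mutations found in its viral genome. Cases that share a marker mutation are candidates for the same transmission chain, and cases with identical mutation sets form a tight cluster. The mutation names, case IDs, and function names below are hypothetical placeholders, and real analyses rely on full phylogenetic methods rather than simple set matching.

```python
from collections import defaultdict


def group_by_fingerprint(cases):
    """Map each distinct mutation set (fingerprint) to the case IDs that carry it."""
    groups = defaultdict(list)
    for case_id, mutations in cases.items():
        groups[frozenset(mutations)].append(case_id)
    return dict(groups)


def cases_carrying(cases, marker):
    """Return the sorted IDs of cases whose genome includes the marker mutation."""
    return sorted(cid for cid, muts in cases.items() if marker in muts)


# Hypothetical cases; "mutA" stands in for a marker first seen at one event.
cases = {
    "case1": {"mutA"},
    "case2": {"mutA", "mutB"},  # descendant: marker plus a newer mutation
    "case3": {"mutC"},          # unrelated introduction
    "case4": {"mutA", "mutB"},
}

linked = cases_carrying(cases, "mutA")
clusters = group_by_fingerprint(cases)
print(linked)
```

Here three of the four toy cases carry the marker, so they would be flagged as possibly connected to the same source even if contact tracing had found no link between them.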

With this in mind, MacInnis and Sabeti’s team set out to help Boston’s public health officials understand just how the epidemic escalated so quickly in the Boston area, and just how much the February conference had contributed to community transmission of the virus. They also investigated other case clusters in the area, including within a skilled nursing facility, homeless shelters, and at Massachusetts General Hospital itself, to understand the spread of COVID-19 in these settings.

Based on contact tracing, officials had already connected approximately 90 cases of COVID-19 to the biotech conference, 28 of which were included in the original 772 viral genomes in this dataset. Based on the distinct genomic fingerprint carried by the 28 genomes, the researchers went on to discover that more than one-third of Boston area cases without any known link to the conference could indeed be traced back to the event.

When the researchers applied this proportion to the number of cases recorded in the region during the study, they extrapolated that the superspreader event led to nearly 20,000 cases in the Boston area. In contrast, the genome data show that cases linked to another superspreader event, which took place within a skilled nursing facility, had much less of an impact on the surrounding community, although that event was devastating to the residents.
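The extrapolation step amounts to scaling a proportion observed in the sequenced sample up to all recorded cases. The Python sketch below shows that arithmetic under the simplifying assumption that sequenced genomes are representative of all cases. The function name and the input numbers are hypothetical and are not the study's figures or its actual statistical method.

```python
def extrapolate_cases(linked_genomes, sequenced_genomes, recorded_cases):
    """Scale the share of sequenced genomes carrying the event's fingerprint
    to the total number of recorded cases."""
    if sequenced_genomes <= 0:
        raise ValueError("sequenced_genomes must be positive")
    fraction = linked_genomes / sequenced_genomes
    return fraction * recorded_cases


# Hypothetical inputs: 50 of 200 sequenced genomes carry the fingerprint,
# and 4,000 cases were recorded in the region.
estimate = extrapolate_cases(linked_genomes=50, sequenced_genomes=200, recorded_cases=4000)
print(estimate)
```

With these toy inputs, one quarter of the sequenced genomes carry the fingerprint, so the estimate is one quarter of all recorded cases. Any bias in which cases get sequenced would carry over into such an estimate.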

The analysis also uncovered some unexpected connections. The dataset showed that SARS-CoV-2 was brought to clients and staff at the Boston Health Care for the Homeless Program at least seven times. Remarkably, two of those introductions also traced back to the biotech conference. Researchers also found infections in Chelsea, Revere, and Everett, which were some of the hardest hit communities in the Boston area, that were connected to the original superspreading event.

There was some reassuring news about how precautions in hospitals are working. The researchers examined cases diagnosed among patients at Massachusetts General Hospital that had raised concerns that the virus might have spread from one patient to another within the hospital. But the genome data show that those cases, in fact, weren’t part of the same transmission chain. The patients may have contracted the virus before they were hospitalized. Or it’s possible that staff may have inadvertently brought the virus into the hospital. But there was no patient-to-patient transmission.

Massachusetts is one of the states in which the COVID-19 pandemic had a particularly severe early impact. As such, these results present broadly applicable lessons for other states and urban areas about how the virus spreads. The findings highlight the value of genomic surveillance, along with standard contact tracing, for better understanding of viral transmission in our communities and improved prevention of future outbreaks.

Reference:

[1] Phylogenetic analysis of SARS-CoV-2 in the Boston area highlights the role of recurrent importation and superspreading events. Lemieux J. et al. medRxiv. August 25, 2020.

Links:

Coronavirus (COVID-19) (NIH)

Bronwyn MacInnis (Broad Institute of MIT and Harvard, Cambridge, MA)

Sabeti Lab (Broad Institute of MIT and Harvard)

NIH Support: National Institute of Allergy and Infectious Diseases; National Human Genome Research Institute; National Institute of General Medical Sciences


Genome Data Help Track Community Spread of COVID-19

Posted on by Dr. Francis Collins

RNA Virus
Credit: iStock/vchal

Contact tracing, a term that’s been in the news lately, is a crucial tool for controlling the spread of SARS-CoV-2, the novel coronavirus that causes COVID-19. It depends on quick, efficient identification of an infected individual, followed by identification of all who’ve recently been in close contact with that person so the contacts can self-quarantine to break the chain of transmission.

Properly carried out, contact tracing can be extremely effective. It can also be extremely challenging when battling a stealth virus like SARS-CoV-2, especially when the virus is spreading rapidly.

But there are some innovative ways to enhance contact tracing. In a new study, published in the journal Nature Medicine [1], researchers in Australia demonstrate one of them: assembling genomic data about the virus to assist contact tracing efforts. This so-called genomic surveillance builds on the idea that when the virus is passed from person to person over a few months, it can acquire random variations in the sequence of its genetic material. These unique variations serve as distinctive genomic “fingerprints.”

When COVID-19 starts circulating in a community, researchers can fingerprint the genomes of SARS-CoV-2 obtained from newly infected people. This timely information helps to tell whether that particular virus has been spreading locally for a while or has just arrived from another part of the world. It can also show where the viral subtype has been spreading through a community or, best of all, when it has stopped circulating.

The recent study was led by Vitali Sintchenko at the University of Sydney. His team worked in parallel with contact tracers at the Ministry of Health in New South Wales (NSW), Australia’s most populous state, to contain the initial SARS-CoV-2 outbreak from late January through March 2020.

The team performed genomic surveillance, using sequencing data obtained within about five days, to understand local transmission patterns. They also wanted to compare what they learned from genomic surveillance to predictions made by a sophisticated computer model of how the virus might spread amongst Australia’s approximately 24 million citizens.

Of the 1,617 known cases in Sydney over the three-month study period, researchers sequenced viral genomes from 209 (13 percent) of them. By comparing those sequences to others circulating overseas, they found a lot of sequence diversity, indicating that the novel coronavirus had been introduced to Sydney many times from many places all over the world.

They then used the sequencing data to better understand how the virus was spreading through the local community. Their analysis found that the 209 cases under study included 27 distinct genomic fingerprints. Based on the close similarity of their genomic fingerprints, a significant share of the COVID-19 cases appeared to have stemmed from the direct spread of the virus among people in specific places or facilities.

What was most striking was that the genomic evidence helped to provide information that contact tracers otherwise would have lacked. For instance, the genomic data allowed the researchers to identify previously unsuspected links between certain cases of COVID-19. It also helped to confirm other links that were otherwise unclear.

All told, researchers used the genomic evidence to cluster almost 40 percent of COVID-19 cases (81 of 209) for which the community-based data alone couldn’t identify a known contact source for the infection. That included 26 cases in which an individual who’d recently arrived in Australia from overseas spread the infection to others who hadn’t traveled. The genomic information also helped to identify likely sources in the community for another 15 locally acquired cases that weren’t known based on community data.
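The percentages quoted in this post follow directly from the counts given in the text, as this short Python check shows. The variable names are illustrative.

```python
# Counts taken from the text above.
known_cases = 1617  # known cases in Sydney during the study period
sequenced = 209     # cases with sequenced viral genomes
clustered = 81      # cases clustered using genomic evidence

share_sequenced = 100 * sequenced / known_cases
share_clustered = 100 * clustered / sequenced

print(round(share_sequenced), round(share_clustered, 1))
```

The first share rounds to 13 percent, and the second is just under 39 percent, consistent with "almost 40 percent."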

The researchers compared their genome surveillance data to SARS-CoV-2’s expected spread as modeled in a computer simulation based on travel to and from Australia over the time period in question. Because the study involved just 13 percent of all known COVID-19 cases in Sydney from late January through March, it’s not surprising that the genomic data presents an incomplete picture, detecting only a portion of the possible chains of transmission expected in the simulation model.

Nevertheless, the findings demonstrate the value of genomic data for tracking the virus and pinpointing exactly where in the community it is spreading. This can help to fill in important gaps in the community-based data that contact tracers often use. Even more exciting, by combining traditional contact tracing, genomic surveillance, and mathematical modeling with other emerging tools at our disposal, it may be possible to get a clearer picture of the movement of SARS-CoV-2 and put more targeted public health measures in place to slow and eventually stop its deadly spread.

Reference:

[1] Revealing COVID-19 transmission in Australia by SARS-CoV-2 genome sequencing and agent-based modeling. Rockett RJ, Arnott A, Lam C, et al. Nat Med. 2020 July 9. [Published online ahead of print]

Links:

Coronavirus (COVID-19) (NIH)

Vitali Sintchenko (University of Sydney, Australia)


Crowdsourcing 600 Years of Human History

Posted on by Dr. Francis Collins

Family Tree

Caption: A 6,000-person family tree, showing individuals spanning seven generations (green) and their marital links (red).
Credit: Columbia University, New York City

You may have worked on constructing your family tree, perhaps listing your ancestry back to your great-grandparents. Or with so many public records now available online, you may have even uncovered enough information to discover some unexpected long-lost relatives. Or maybe you’ve even submitted a DNA sample to one of the commercial sources to see what you could learn about your ancestry. But just how big can a family tree grow using today’s genealogical tools?

A recent paper offers a truly eye-opening answer. With permission to download the publicly available, online profiles of 86 million genealogy hobbyists, most of European descent, the researchers assembled more than 5 million family trees. The largest totaled more than 13 million people! By merging each tree from the crowd-sourced and public data, including the relatively modest 6,000-person seedling shown above, the researchers were able to go back 11 generations on average to the 15th century and the days of Christopher Columbus. Doubly exciting, these large datasets offer a powerful new resource to study human health, having already provided some novel insights into our family structures, genes, and longevity.