Milestone 2 Process Book

Basic Info

Overview and Motivation

Cancer, a diverse class of diseases, is typically characterized by uncontrollable cell growth. This simplification is often misleading, as in reality the disease is extremely complex and is responsible for nearly a sixth of deaths worldwide (Bray). Roughly 1 in 2 men and 1 in 3 women will themselves develop cancer over the course of their lives, and it goes without saying that nearly everyone in the world will be touched by cancer in some capacity throughout their lifetime.

Despite cancer's massive prevalence and the enormous research effort devoted to it, the age-adjusted mortality* rate for cancer patients has remained stagnant since 1930 (Azra Raza). This means that in over 90 years, the life expectancy of cancer patients has remained largely unchanged. Despite frequent news of “breakthroughs in treatment” and “game changing discoveries”, the cancer treatment complex composed of clinicians, researchers, and pharmaceutical companies remains largely stumped in combatting and curing cancer. Massive amounts of funding and resources are allocated towards cancer research each year, but the return on investment remains unfathomably low.

*age-adjusted mortality: a mortality rate that controls for the effects of differences in population age distributions


Related Work

Azra Raza is an oncologist and professor of medicine at Columbia University. Several of her works, including her book The First Cell, inspired our team to pursue this project. While extremely optimistic about the future of the war against cancer, Raza is frustrated with how little progress has been made over the years.

The notion that progress is low is surprising given the amount of resources invested into cancer research and drug development each year. Our team wanted to visualize cancer diagnosis and mortality rates to observe whether or not Raza's frustration is justified.


Questions

Questions: What questions are you trying to answer? How did these questions evolve throughout the project? What new questions did you consider in the course of your analysis?
We aim to answer the following questions for each type of cancer:

These questions evolved throughout the project predominantly on the basis of the data we are working with. Like many large, publicly available datasets, ours has areas of bias due to flaws in data collection and reporting. For example, despite the data having over 200 variables that describe each cancer case, there is no variable that describes the stage at which the cancer was diagnosed. As a result, a visualization that conveys this information is not possible.

In addition, the most common race among diagnoses in each year, for every cancer type, is white. That is, our data shows that the large majority of cancer cases in each year are white patients. This imbalance is again likely due to imperfections in the process by which medical organizations report data to the government.

Over the course of analyzing the data, we thought it would be interesting to examine the correlation between the age of diagnosis and survival time of a patient. As a result, we created a scatter plot that contains a random subset of at most 100 cases from our data of over 5 million. We believe that this plot complements the other ones well, in that it is the only one where data from specific, stand-alone cases are depicted.
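
As a rough illustration of the sampling step described above, the sketch below draws at most 100 cases uniformly at random without replacement. It is written in JavaScript to match the visualization code and is not our actual processing script; the function name `sampleCases` and the record fields (`id`, `ageAtDiagnosis`, `survivalMonths`) are hypothetical.

```javascript
// Draw up to `maxCases` records uniformly at random, without replacement.
// `cases` stands in for the parsed case records; field names are hypothetical.
function sampleCases(cases, maxCases = 100, random = Math.random) {
  const pool = cases.slice(); // copy so the caller's array is not reordered
  const n = Math.min(maxCases, pool.length);
  // Partial Fisher-Yates shuffle: only the first n positions need to be settled.
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, n);
}

// Usage with made-up records (not real patient data)
const demoCases = Array.from({ length: 250 }, (_, id) => ({
  id,
  ageAtDiagnosis: 40 + (id % 45),
  survivalMonths: id % 120,
}));
const plottedCases = sampleCases(demoCases);
console.log(plottedCases.length); // 100
```

Because the cap is applied with `Math.min`, a cancer type with fewer than 100 cases simply contributes all of its cases.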


Data

Acquiring and parsing data has proven to be an extremely large part of making this project successful. The Centers for Disease Control and Prevention (CDC) has set up databases so that “researchers can access and analyze high-quality population-based cancer incidence data on the entire United States population. De-identified cancer incidence data are available to researchers for free in public use databases.” (CDC). The data available to the public include cancer incidence and population data for all 50 states, the District of Columbia, and Puerto Rico, providing information on more than 31 million cancer cases.

The databases include data by demographic characteristics (for example, age, sex, and race) and tumor characteristics (for example, year of diagnosis, primary tumor site, histology, behavior, and stage at diagnosis) (United States Cancer Statistics). The current data comes from the 2020 National Program of Cancer Registries (NPCR) and Surveillance, Epidemiology, and End Results (SEER) program submissions, which include cancer cases diagnosed from January 1, 2001 through December 31, 2018. The data is accessible through the SEERStat software.

Our team has been approved to access the data and download the SEERStat software required to do so. The data we have downloaded thus far consists of over 5 million individual cases of cancer, each with over 200 variables that describe each case (age, race, cancer site, treatment, etc.). Many of the variables that we will be using need some adjusting from the raw data. For example, the type of cancer that each patient had is in the form of a code. A table online gives the conversion from each code to a specific cancer type. We used R to efficiently translate each code into its corresponding cancer type.
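
The code-to-name translation is essentially a table lookup. The sketch below shows the idea in JavaScript (we did this step in R); the codes `S01` and `S02`, the field name `siteCode`, and the function `labelCases` are placeholders for illustration, not values from the real conversion table.

```javascript
// Hypothetical lookup table: the real codes and labels come from the published
// conversion table, which is not reproduced here.
const SITE_LABELS = new Map([
  ["S01", "Lung"],
  ["S02", "Small intestine"],
]);

// Return a copy of each case with a readable `cancerType` field added.
// Codes missing from the table are labeled "Unknown" rather than dropped,
// so gaps in the table stay visible downstream.
function labelCases(cases, labels = SITE_LABELS) {
  return cases.map((c) => ({
    ...c,
    cancerType: labels.get(c.siteCode) ?? "Unknown",
  }));
}

// Usage with made-up records
const labeledCases = labelCases([{ siteCode: "S01" }, { siteCode: "S99" }]);
console.log(labeledCases.map((c) => c.cancerType)); // [ 'Lung', 'Unknown' ]
```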

Feeding D3 a table of over 5 million rows would be a flawed approach for what we are trying to achieve. It would lead to extremely slow runtimes and would ultimately get in the way of the information we are trying to convey to the users of our visualizations. To that end, we used R to do most of the computationally intensive data processing. For example, our visualization allows users to look at the proportion of new diagnoses that occurred in different racial groups for a cancer and year of their choice. To lighten the workload of JavaScript, almost all of the computation necessary to display this data is done ahead of time. The 5 million rows of individual cancer cases were condensed to a file that contains only the information needed to create the visualization itself.

Raw data - 5M+ rows
R script (other/data_processing.R)
Processed data - 704 rows
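
To make the condensing step concrete, here is a minimal sketch of the kind of aggregation involved: one row per case goes in, and one row per (cancer type, year, race) with its share of that cancer and year's diagnoses comes out. The actual work was done in the R script; this JavaScript version, the function name `summarizeByRace`, and the field names are assumptions for illustration only.

```javascript
// Collapse one-row-per-case data into one row per (cancerType, year, race),
// with the share of that cancer/year's diagnoses falling in each race group.
// Field names are assumed, not taken from the real data files.
function summarizeByRace(cases) {
  const counts = new Map(); // key: [cancerType, year, race] -> case count
  const totals = new Map(); // key: [cancerType, year]       -> case count
  for (const { cancerType, year, race } of cases) {
    const group = JSON.stringify([cancerType, year]);
    const cell = JSON.stringify([cancerType, year, race]);
    counts.set(cell, (counts.get(cell) ?? 0) + 1);
    totals.set(group, (totals.get(group) ?? 0) + 1);
  }
  return [...counts].map(([cell, count]) => {
    const [cancerType, year, race] = JSON.parse(cell);
    const total = totals.get(JSON.stringify([cancerType, year]));
    return { cancerType, year, race, count, proportion: count / total };
  });
}

// Usage with made-up records (not real patient data)
const demoSummary = summarizeByRace([
  { cancerType: "Lung", year: 2010, race: "White" },
  { cancerType: "Lung", year: 2010, race: "White" },
  { cancerType: "Lung", year: 2010, race: "White" },
  { cancerType: "Lung", year: 2010, race: "Black" },
  { cancerType: "Lung", year: 2011, race: "White" },
]);
console.log(demoSummary.length); // 3
```

The output size depends only on the number of distinct cancer, year, and race combinations, not on the number of cases, which is why millions of rows can shrink to a few hundred.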


Exploratory Data Analysis

It was important to get a sense of what our data looked like before deciding on which visualizations to pursue. The four plots below were helpful in deciding what made sense to pursue, and issues to be aware of.

Specifically, it is important to note the large number of gonad cancer cases and relatively small number of small intestine cancer cases, as well as the rising number of diagnosed cases per year. While this would suggest that cancer cases are rising overall in the United States, there are other possible explanations to consider, such as improved reporting from doctors, or improved cancer detection techniques.

The plot in the bottom left shows that white people represent the majority of reported cancer cases. This too can be interpreted to mean many different things, but it is definitely interesting to note.

Lastly, the plot in the bottom right showing survival over time is the most surprising. It would have been more reassuring to see survival rates increasing over time, but we are actually seeing just the opposite.

Overall, these initial plots reassure our team of our goal and we believe the visualizations in the proposal will be valuable and informative.

Exploratory Data Analysis: What visualizations did you use to look at your data initially? What insights did you gain? How did these insights inform your design?


Design Evolution

Design Evolution: What are the different visualizations you considered? Justify the design decisions you made using the perceptual and design principles you learned in the course. Did you deviate from your proposal?

Throughout the course of our preliminary research, it was our goal to create a centralized, informative webpage that holds a lot of information in one condensed, organized area. In order to achieve this, we placed an interactive human body at the center to indicate the main navigation hub. The human body will cause all of its surroundings to change, providing a visually-pleasing transition into the next cancer-type data set.

After obtaining the data set as described earlier, we gained a better sense of the information available to us and wanted to present a substantial amount of it. To achieve this, we avoid repeating common elements when displaying the data; the graphs should all be unique and eye-catching in their own way. Below is the original design made from PowerPoint slides:



After turning in our project proposal, we stuck closely to our plan and implemented the visualizations as sketched, leaving certain areas to be cleaned up later. As can be observed below in the current implementation, we have included many features introduced previously in this class, including:

More specifically, a scatter plot is used to place dots on top of the human figure found in the center of the screen. The user is able to click the dots that are on top of each organ of interest. For example, a user interested in lung cancer data would press the circle above the lung as well as a year from the slide bar above, and the pie chart in the bottom left hand corner of the screen is updated to reflect the corresponding data. The current state of the project sets our team up well for implementing the remaining visualizations.
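
The click-and-slider interaction described above amounts to a filter over the preprocessed table. The sketch below shows that filter as a plain function; the name `pieSlicesFor` and the row shape (`cancerType`, `year`, `race`, `proportion`) are assumptions, and the real page may organize this differently.

```javascript
// Pick the pie-chart slices for the organ and year the user selected.
// `summaryRows` is assumed to hold one row per (cancerType, year, race).
function pieSlicesFor(summaryRows, cancerType, year) {
  return summaryRows
    .filter((r) => r.cancerType === cancerType && r.year === year)
    .map((r) => ({ label: r.race, value: r.proportion }))
    .sort((a, b) => b.value - a.value); // largest slice first
}

// Usage with made-up rows (not real data)
const demoRows = [
  { cancerType: "Lung", year: 2010, race: "Black", proportion: 0.25 },
  { cancerType: "Lung", year: 2010, race: "White", proportion: 0.75 },
  { cancerType: "Lung", year: 2011, race: "White", proportion: 1 },
];
const demoSlices = pieSlicesFor(demoRows, "Lung", 2010);
console.log(demoSlices.map((s) => s.label)); // [ 'White', 'Black' ]
```

In the page itself, a click handler on an organ's dot would call a function like this with the current slider year and hand the result to the pie layout, for example `d3.pie().value(d => d.value)`.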

Our initial plan contained a plot that showed geographical information of where the cancer cases were coming from across the United States. Our team later realized that this plot would not be possible to create, since our data lacked any geographical information. This gave us the opportunity to choose different pieces of data that would convey useful information to the user. We decided to make a scatter plot showing the correlation between the age at diagnosis and survival time. Another unique feature of this plot is that there are no summary statistics being used to convey the data. Each dot on the plot is representative of a single patient.

Current Version of Webpage:



Implementation

Evaluation

We learned a lot about how cancer survival rates, treatments, and demographics have changed over time. We had the data, but we did not know these trends until we put the data into interactive visualizations. Our group has learned a lot about different cancer trends and about the room for improvement in the treatment of different types of cancer.

Questions answered:

Overall, the main aspect that needs to be improved is the aesthetic of the visualization. We should work on improving the color scheme and making the visualizations appear more modern looking. User testing will help us find additional improvements that we can make.




The link to the YouTube screencast is in the README.md file.