D3

Abstract

Hello! Welcome to our site. We are a team of students at University College London who took the module BASC2002: Quantitative Methods: Data Science and Visualisation. Our research explored which factors were associated with poverty in London, as measured by income. We specifically looked at green spaces, physical activity of kids in schools, drug use and life expectancy. We found that green space was weakly negatively associated with income per borough; that life expectancy was moderately to strongly positively associated with income per borough; that there was no clear association between drug use and income per borough; and that physical activity was weakly positively associated with income per borough. These results support the claim in some of the literature that policies aiming to help low-income individuals must be designed at the local level. Moreover, this research informs policy makers on how best to tailor policies to improve the lives of low-income individuals in London.

DATA

Basic information

All of our data came from the London Data Store (LDS), an official and local body for Greater London which consists of the Mayor of London, currently Sadiq Khan (Labour) and the London Assembly ("About-London Datastore," n.d.), all of whom are elected by the people. Most data is in the form of data tables in the CSV format, PDF files and Excel files, with data entries for each of the 32 London boroughs and the City of London. The individual datasets, while all provided by the London Data Store, were measured by different authorities:

Income, which corresponds to the average income of taxpayers in each borough in 2014/2015, comes from the Greater London Authority: find the raw data here

Physical activity of children in schools (abbreviated to "active kids"), which corresponds to the percentage of kids that had access to at least 2 hours of physical activity in schools in 2007/2008, comes from the Annual Survey of School Sport Partnerships, carried out on behalf of the Department for Children, Schools and Families by TNS Social Research: find the raw data here

Green Spaces, which corresponds to the percentage area of green space per borough in 2005, comes from the Department for Communities and Local Government: find the raw data here

Life Expectancy, which corresponds to the mean life expectancy at age 65 in London boroughs in 2012/2013, comes from the Office for National Statistics: find the raw data here

Drug use, which corresponds to the number of people in contact with drug action teams, comes from the National Treatment Agency for Substance Misuse (now Public Health England): find the raw data here

Is the London Data Store a reliable source?

Being published by an official, reputable and accountable source gives the impression that the data is reliable. However, we remain aware that data collected under a party-political administration (as is currently the case) may have a political agenda underlying it. For example, Sadiq Khan is very outspoken about his wish to clean London’s air and so could manipulate the data to make air quality seem worse than it actually is. However, we feel this would be difficult to do, as the London Assembly is elected to examine the decisions and actions of the Mayor, whom it holds publicly and democratically accountable.
Furthermore, it is possible that some of the data, such as the income data, are estimates or based on surveys. We kept these potential issues in mind as a team. However, we believe any discrepancies in the data would not be too problematic, and that the data would still paint an accurate picture of incomes per borough relative to each other.

Why did we choose these particular datasets?

We assume that the factors investigated are representative of wider themes. For example, we assume that area of green space is a valid indicator of environmental quality; that physical activity of children in schools is a valid indicator of both physical activity and quality of education; and that life expectancy and drug use are valid indicators of health. While we acknowledge that none of the wider themes can be reduced to these indicators alone, we believe the indicators chosen reveal relevant and important information about these areas.

What problems were encountered regarding the data and data cleaning?

The majority of our issues arose in the data cleaning process. Due to the internationalisation settings on some of our laptops, converting Excel files to CSV was difficult and we lost some data during the conversion. To work around this, we copy-pasted the data from Excel into Google Sheets and exported it as a CSV from there. After that, it could easily be uploaded and manipulated in Python via Azure Notebooks.
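As a hedged illustration of this kind of cleaning step, the sketch below loads a small CSV with pandas and coerces a numeric column. The column name, borough names and values are invented for the example, and the decimal-comma problem is only one common symptom of locale differences; it may not be the exact issue we ran into.

```python
import io

import pandas as pd

# Hypothetical CSV export; borough names, column name and values are illustrative only.
raw_csv = io.StringIO('''borough,green_space_pct
Borough A,"38,5"
Borough B,22.1
Borough C,n/a
''')

df = pd.read_csv(raw_csv)  # "n/a" is among the strings pandas treats as missing by default

# Swap decimal commas for points, then coerce to float; anything unparseable becomes NaN.
as_text = df["green_space_pct"].astype(str).str.replace(",", ".", regex=False)
df["green_space_pct"] = pd.to_numeric(as_text, errors="coerce")

# List the rows with missing values instead of silently dropping them.
missing = df[df["green_space_pct"].isna()]
print(df.dtypes)
print(missing)
```

Listing the rows that end up as NaN makes it easier to spot data lost in conversion before any analysis is run.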

Due to the City of London's status as a "principal division" (rather than a borough) of London, its data was occasionally not included in our chosen datasets. This was the case with our "Active Kids" and "Life Expectancy" datasets. As such, we had to decide how to handle the missing values. For the former, we looked at all the schools in the City of London and estimated the value. For the latter, we did not estimate a value for the statistical analysis, since we had no reliable way to find this data. However, for the purposes of the choropleth map visualisation, we took the average of all the other values to allow the visualisation to work.
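The two treatments of a missing value described above can be sketched with pandas as follows. This is a minimal illustration, not our notebook code: the borough names (other than the City of London), the column name and the numbers are made up.

```python
import pandas as pd

# Illustrative values only; these are not the project's data.
life = pd.DataFrame({
    "borough": ["Borough A", "Borough B", "Borough C", "City of London"],
    "life_expectancy_65": [19.0, 21.0, 20.0, None],
})

# For the statistical analysis: leave out the area with no recorded value.
stats_df = life.dropna(subset=["life_expectancy_65"])

# For the map: fill the gap with the mean of the other areas so every shape has a value,
# and flag the filled row so it is not mistaken for a measurement.
map_df = life.copy()
map_df["imputed"] = map_df["life_expectancy_65"].isna()
fill_value = map_df["life_expectancy_65"].mean()  # mean() skips missing values
map_df["life_expectancy_65"] = map_df["life_expectancy_65"].fillna(fill_value)

print(stats_df)
print(map_df)
```

Keeping an `imputed` flag (a name chosen here for illustration) is one way to make sure the filled value is used only for display.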

Ethical considerations

Anonymity is an ethical consideration whenever working with big data, and in data analysis it is important to be aware of the potential identification of the people who provide the data. The nature of the data and the data source means anonymity should not be an issue here. Because the data is available for boroughs with large populations, anonymity is well secured: no population is small enough to allow cross-checking with other data sets. However, this also poses a problem, in that such anonymity forces us to make assumptions in our data analysis: it would be impossible to prove causation, only correlation.
Furthermore, another consideration when utilising big data is that many of the safeguards and considerations in traditional social research are not present in the creation of big data, particularly with regard to privacy and informed consent. Privacy is recognised as a human right under numerous declarations and treaties. In the UK, the European Convention on Human Rights, which has statutes concerning privacy rights, has been implemented through the Human Rights Act 1998, with protection of personal data provided by the Data Protection Act 1998 (Your right to respect for private and family life, n.d.; Data protection - GOV.UK, n.d.). However, authorities that collect big data are often able to skirt the protections imposed. This is of particular concern with the household income, physically active children, and drug treatment rate datasets, which concern the livelihoods of people and depend on recording information that some could consider sensitive and private.

Methods

This section outlines the methods we used in our research. For the reasons why we used these methods, please refer to the visualisation section, in which we outline what we decided to do with the data, the conclusions we could draw from each of the visualisations, and why these led us to develop our work as we did.

Data and Statistics

Methods and steps of analysis:

We had 4 different variables (drug use, life expectancy, green spaces and active kids), each of which we wanted to compare with average income in a London borough. We wanted to see if there was a correlation between the wealth of the borough, as estimated by the average income, and the variable (for example, whether average income correlated with life expectancy).

We used Azure notebooks and the Python pandas package to clean the data. Click on the image below to see exactly how we did it (and then click "greenspaces.ipynb"):

We used Python to find the Spearman and Pearson correlation coefficients, which give us a numerical value for the correlation.

The Pearson and Spearman correlation coefficients can tell us how strong the correlation is, and also give us an indication of whether the relationship is linear.
The Pearson correlation coefficient shows the extent to which there is a linear relationship between the variables. A coefficient of 1 indicates a perfect positive correlation, -1 a perfect negative correlation, and 0 no linear correlation. If the variables increase together, but not consistently, the coefficient will be between 0 and 1. (“A comparison of the Pearson and Spearman correlation methods,” n.d.)
The Pearson correlation coefficient is calculated by dividing the covariance of the 2 variables by their standard deviations multiplied together (Boslaugh, 2012).

The covariance is the unstandardized measurement for the linear relationship between two variables. The standard deviation is used to standardise this measurement to produce a value between 1 and -1 for linear relationships.
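The calculation described above (covariance divided by the product of the standard deviations) can be written out in a few lines of plain Python. This is a minimal sketch for illustration, with made-up numbers; the function name `pearson` is our own label, and in practice a library routine would normally be used instead.

```python
import math


def pearson(xs, ys):
    """Pearson coefficient: sample covariance divided by the product of sample standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return covariance / (sd_x * sd_y)


# Made-up values: y rises (or falls) in exact proportion to x.
x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]
falling = [10, 8, 6, 4, 2]

print(pearson(x, rising))   # close to 1: perfect positive linear relationship
print(pearson(x, falling))  # close to -1: perfect negative linear relationship
```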
The Spearman Correlation shows how well the relationship between the variables can be described using a monotonic function. A monotonic function is a function that is either entirely non-increasing or entirely non-decreasing, for example:

In Figure 1, the graph of the relationship between X and Y never has a negative gradient.
The Spearman Correlation Coefficient can show us if there is a strong relationship between the variables, even if that relationship is non-linear. For a strong positive correlation between variables, even if the correlation is not consistent, the Spearman Correlation Coefficient will give a value close to 1.
The Spearman Correlation Coefficient is calculated in the same way as the Pearson, except that the values are first converted to ranks. In our investigation this just meant each borough was given a number from 1 to 33 (as there are 32 London boroughs plus the City of London) for income and for each variable (e.g. Life Expectancy), and then the Correlation Coefficient was calculated from these rankings.
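A small sketch of this rank-then-correlate idea is shown below. The data are invented (a perfectly monotonic but curved relationship), the helper names are ours, and tied values are given the average of their ranks, which is a common convention rather than something specified above.

```python
import math


def average_ranks(values):
    """Rank values from 1 upwards; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman(xs, ys):
    """Spearman coefficient: Pearson coefficient of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up data: y always increases with x, but not in a straight line.
x = [1, 2, 3, 4, 5]
y = [1, 8, 27, 64, 125]

print(spearman(x, y))  # close to 1, because the ordering is preserved exactly
print(pearson(x, y))   # below 1, because the relationship is not linear
```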

Evaluation of the Pearson and Spearman Correlation Coefficients

The Pearson Correlation Coefficient only produces a meaningful measure of the strength of correlation if the variables are linearly correlated; this is why we calculated the Spearman coefficient as well.
Furthermore, the Pearson Correlation Coefficient is particularly sensitive to outliers (“Pearson Product-Moment Correlation - Guidelines to interpretation of the coefficient, detecting outliers and the type of variables needed.,” n.d.). The Spearman Coefficient is more robust in this respect; however, we will still consider this sensitivity when interpreting our results and, if appropriate, remove extreme outliers from the dataset.
The Spearman, like the Pearson Correlation Coefficient, does not account for any relationship that is non-monotonic. For example, the relationship shown in Figure 2 would produce Spearman and Pearson coefficients equal to 0.
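To make the non-monotonic case concrete, the sketch below uses a symmetric U-shaped relationship (y = x squared on values centred on zero). This is our own illustrative choice and is not necessarily the curve shown in Figure 2; the helper names are also ours.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank from 1 upwards; each value gets the mean rank of all entries equal to it."""
    ordered = sorted(values)
    return [ordered.index(v) + (ordered.count(v) + 1) / 2 for v in values]


# y depends entirely on x, but falls and then rises.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [value ** 2 for value in x]

pearson_r = pearson(x, y)
spearman_r = pearson(average_ranks(x), average_ranks(y))
print(pearson_r, spearman_r)  # both are zero (up to rounding) despite the clear pattern
```

A coefficient near zero therefore means "no linear (or monotonic) association", not "no relationship at all", which is one reason we also inspected scatter plots.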

The Spearman coefficient is more robust to extreme outliers than the Pearson coefficient because it limits the value of the outlier to its rank. However, it can’t tell us whether the relationship between the variables is linear or otherwise. Furthermore, the ranking washes out some of the correlation, as we can no longer tell how much Y increased as a result of X. We can interpret the Spearman Correlation Coefficient to be how sure we are that there is a correlation, and the Pearson Correlation Coefficient to represent how linear the relationship is.
We will address these limitations by using both methods on our data and comparing the results between them.
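The effect of a single extreme value on the two coefficients can be sketched as follows. The numbers are invented, and the simple ranking helper assumes there are no tied values.

```python
import math


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def ranks(values):
    """Ranks from 1 upwards; this simple version assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]


x = list(range(1, 11))
y_with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # the last point is an extreme outlier

pearson_r = pearson(x, y_with_outlier)
spearman_r = pearson(ranks(x), ranks(y_with_outlier))

print(pearson_r)   # pulled well below 1 by the single extreme value
print(spearman_r)  # stays close to 1: the outlier only counts as "rank 10"
```

Comparing the two coefficients side by side in this way is the same check we apply to our own results: a large gap between them is a prompt to look for outliers or a non-linear pattern.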

Weak, Medium and Strong Correlations

We will be interpreting our Pearson and Spearman Correlation Coefficients based on the bands for weak to strong correlations given on the website statstutor (Resources for 03. Teach Yourself Worksheets > Pearsons Correlation Coefficient from statstutor, n.d.; Resources for 03. Teach Yourself Worksheets > Spearmans Correlation Coefficient from statstutor, n.d.). According to this, a coefficient from 0 to 0.19 corresponds to a very weak association, a coefficient from 0.20 to 0.39 to a weak association, a coefficient from 0.40 to 0.59 to a moderate association, a coefficient from 0.60 to 0.79 to a strong association, and a coefficient from 0.80 to 1 to a very strong association. These bands apply to the absolute value of the coefficient: positive values correspond to a positive association, negative values to a negative association.
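The bands above can be expressed as a small helper, sketched here and applied to the coefficients reported in our Statistical Analysis section. The function name is our own, and treating values that fall between two published bands (for example 0.195) as belonging to the lower band is an assumption of this sketch.

```python
def describe_association(r):
    """Label a correlation coefficient using the statstutor bands quoted above."""
    size = abs(r)
    if size < 0.20:
        strength = "very weak"
    elif size < 0.40:
        strength = "weak"
    elif size < 0.60:
        strength = "moderate"
    elif size < 0.80:
        strength = "strong"
    else:
        strength = "very strong"
    if r > 0:
        direction = "positive"
    elif r < 0:
        direction = "negative"
    else:
        return "no association"
    return f"{strength} {direction}"


# (Pearson, Spearman) values as reported on this site, each measured against income.
reported = {
    "green spaces": (-0.2238, -0.0788),
    "life expectancy": (0.631, 0.486),
    "active kids": (0.249, 0.294),
    "drug use": (0.061, -0.037),
}

for variable, (pearson_r, spearman_r) in reported.items():
    print(variable, "|", describe_association(pearson_r), "|", describe_association(spearman_r))
```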

Visualisation Methods

How did we create the visualisations we use?
(click here to see the results of these methods)

Bar Charts

We created bar charts (using Python in Azure Notebooks) for each variable, which allowed us to check that our values had been loaded and cleaned correctly (for example, the bar chart wouldn’t work until there were no N/A values and all the values were stored as integers or floats). Click the picture to explore our methods further and don't forget to click on "greenspaces.ipynb"!

Scatter Plots

We created scatter plots (also using Python, click on the images and "greenspaces.ipynb" for more information) for each variable vs. income, for example:

We also made ranked scatter plots. For this we first ranked the data - we ranked the London Boroughs in terms of income and in terms of the variable, so that we could plot them against each other for a scatter plot and regression line, for example:

The advantage of this was that if there was a non-linear relationship between the variables, we would still see a correlation in the ranked scatter plot, whereas we might not in the original scatter plot. This doesn’t apply to the example above, though, where you can see a linear correlation in the original scatter plot.

Maps

Given the comparative nature of our project, we knew that typical visualisation methods would not suffice. By contrasting multiple datasets, such as drug use or life expectancy, against the income data, we were looking for any visual correlations between the two. However, having to portray and compare variables from two different sets of data at once meant that typical choropleth maps were not a valid solution, since they only use colour intensity to display one set of data.

Approach 1 – 3D Choropleth Map

Naturally, the first solution was to improve on choropleth maps by visualising the additional set of data through extruding the geographical shape of each borough upwards. The datasets were mapped in a linear manner and set within a height limit. The visualisation was implemented using a JavaScript library called Three.js, which runs the extruded 3D map in the browser and lets the user interact with the map by turning it around. In the example below, both the colour and the height represent income. To clarify how the visualisation works: both dark red and the greatest height indicate the maximum in the dataset used.
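The linear mapping into a height limit can be sketched as below. The sketch is written in Python for consistency with our analysis notebooks rather than in JavaScript, so it is not the code behind the Three.js map; the function name, the height range of 1 to 10 and the income figures are all illustrative assumptions.

```python
def linear_scale(values, out_min, out_max):
    """Map values linearly so the smallest becomes out_min and the largest becomes out_max."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values equal: place everything in the middle of the output range.
        return [(out_min + out_max) / 2 for _ in values]
    span = highest - lowest
    return [out_min + (v - lowest) / span * (out_max - out_min) for v in values]


# Made-up incomes for three boroughs, mapped to extrusion heights between 1 and 10 units.
incomes = [20000, 35000, 50000]
heights = linear_scale(incomes, 1.0, 10.0)
print(heights)
```

Keeping a non-zero minimum height means the borough with the lowest value is still visible as a raised shape rather than lying flat.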

Limitations 1

This visualisation allowed us to compare several variables at once and to see the extremes in the data sets rather easily. However, it lacked clarity, as boroughs would overlap and hinder visibility. In general, the rather complicated shapes of the boroughs proved to be one of the major issues with this approach and called for a better solution. The demo can be seen at the bottom of the page.

Approach 2 - Dorling Cartogram

After experimenting in 3D, we decided to look at alternative visualisation methods, one of which was the Dorling cartogram. This proved to be a great solution, as it was capable of displaying two variables at once in a visually clear manner and proved to be very exciting to interact with. The following cartogram visualisation was created using a JavaScript library called D3.js.

A Dorling Cartogram, being a force directed graph, enabled us to maintain geographic accuracy for easier visual clarity, which proved very important, as it helps contrast rich areas with poor ones.
In this particular visualisation, the income data is always visualised by colour across all 5 categories, and the second dataset is portrayed by the changing radii of the circles. The datasets were mapped in a linear manner and, again, set within a minimum and maximum diameter limit.
Given the nature of D3.js, we are also able to interact with the visualization by hovering our mouse over it and seeing all of the values for a specific borough.
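The core idea of a Dorling cartogram, circles that start near each area's centre and are pushed apart until they no longer overlap, can be sketched as follows. This is a deliberately simplified Python illustration and not the D3.js force simulation behind our visualisation: it only resolves collisions, does not pull circles back towards their original positions, and uses invented coordinates and radii.

```python
import math


def resolve_overlaps(circles, iterations=200, tolerance=1e-9):
    """Push overlapping circles apart along the line joining their centres."""
    placed = [dict(c) for c in circles]  # work on copies, leave the input untouched
    for _ in range(iterations):
        moved = False
        for i in range(len(placed)):
            for j in range(i + 1, len(placed)):
                a, b = placed[i], placed[j]
                dx, dy = b["x"] - a["x"], b["y"] - a["y"]
                distance = math.hypot(dx, dy)
                overlap = a["r"] + b["r"] - distance
                if overlap <= tolerance:
                    continue
                # Unit vector from a to b; use an arbitrary direction if the centres coincide.
                ux, uy = (dx / distance, dy / distance) if distance > 0 else (1.0, 0.0)
                a["x"] -= ux * overlap / 2
                a["y"] -= uy * overlap / 2
                b["x"] += ux * overlap / 2
                b["y"] += uy * overlap / 2
                moved = True
        if not moved:
            break
    return placed


# Hypothetical centres; the radii would come from linearly scaling the second variable.
start = [
    {"name": "Borough A", "x": 0.0, "y": 0.0, "r": 1.0},
    {"name": "Borough B", "x": 1.0, "y": 0.0, "r": 1.0},
    {"name": "Borough C", "x": 2.0, "y": 0.0, "r": 1.0},
]
result = resolve_overlaps(start)
for circle in result:
    print(circle["name"], round(circle["x"], 3), round(circle["y"], 3))
```

Because the circles only move as far as needed to stop overlapping, neighbours stay neighbours, which is what keeps the rough geography recognisable.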

Limitations 2

However, one of the main issues is the lack of a scale for the radii (colour is accounted for in the bottom left corner). Our datasets all had different units, which limited our ability to implement a dynamic scale that accounts for all the data. The units are nevertheless accounted for by the fact that we can inspect each borough by hovering the mouse over it and seeing the borough-specific values.

VISUALISATIONS

D3.js and three.js Visualisations

Categories:

Reimagining visualising income

A Dorling Cartogram

Following our analysis, we thought about how we could present the data and the association between the variables in a better way, and relate it to the physical space they are in. We therefore wondered how best to create a map that would show this. All maps put forward differing representations of various variables, meaning that some aspects will be distorted. We decided that a Dorling cartogram was the best way to present our data and its associations. These cartograms distort the physical boundaries between boroughs. However, by doing so they allow us to focus on how the variables change from borough to borough as well as how they relate to income.

Comparing Income Against Other Data

The Dorling cartogram presented here shows the variables investigated. When one clicks on a variable, the diameter of each circle, which represents a borough, is proportional to the value of that variable in that borough. So, for example, a borough with a bigger circle for drug use will have a higher value of drug use. In addition, the colour of each circle represents the level of income in that borough (see the description of the colours on the left hand side of the picture). If one hovers the mouse over a circle, the values of all variables for that borough will also show.

Statistical Analysis

We calculated the Pearson and Spearman correlation coefficients, to get a numerical value to represent the extent of the correlation.

For the relationship between income and green spaces, the Pearson correlation value was -0.2238 and the Spearman correlation value was -0.0788.

For the relationship between income and life expectancy, the Pearson correlation value was 0.631 and the Spearman correlation value was 0.486.

For the relationship between income and physical activity of kids in schools, the Pearson correlation value was 0.249 and the Spearman correlation value was 0.294.

For the relationship between income and drug use, the Pearson correlation value was 0.061 and the Spearman correlation value was -0.037.

Discussion of the Results

Our first hypothesis is that higher levels of poverty in a borough will correlate with lower levels of green space. The Pearson and Spearman correlation values correspond to a weak and a very weak negative relationship, respectively. This means that there is some evidence that a higher income in a borough is weakly to very weakly associated with less green space in that borough. This does not support our hypothesis that higher levels of poverty, or equivalently lower levels of income, correspond to lower levels of green space.

Our second hypothesis is that higher levels of poverty in a borough will correlate with lower levels of physical activity of kids in schools. The Pearson and Spearman correlation values both correspond to a weak positive relationship. This means that there is some evidence that a higher income in a borough is weakly associated with higher physical activity of kids in schools in that borough. This does support our hypothesis that higher levels of poverty, or equivalently lower levels of income, correspond to lower levels of physical activity of kids in schools.

Our third hypothesis is that higher levels of poverty in a borough will correlate with lower life expectancy. The Pearson and Spearman correlation values correspond to a strong and a moderate positive relationship, respectively. This means that there is some evidence that a higher income in a borough is moderately to strongly associated with higher life expectancy in that borough. This does support our hypothesis that higher levels of poverty, or equivalently lower levels of income, correspond to lower life expectancy.

Our fourth hypothesis is that higher levels of poverty in a borough will correlate with higher levels of drug use. The Pearson and Spearman correlation values correspond to a very weak positive and a very weak negative relationship, respectively. In other words, the two coefficients are both close to zero and disagree on the direction of the association between income and drug use. This lack of consistency provides inconclusive evidence for the existence of a relationship between the two variables. This does not provide evidence to support our hypothesis that higher levels of poverty, or equivalently lower levels of income, correspond to higher levels of drug use.

The Story the Data Tells

Research in the literature shows that poverty, or in other words lack of income, is correlated with many aspects of life, such as health problems and environmental pollution. However, research also shows that what relates to poverty changes from place to place. Our research has attempted to explore whether income per borough in London, as a measure of poverty, is associated with 4 variables that are considered to be relevant indicators of education, environment, physical exercise and health: green spaces, life expectancy, drug use and physical activity in schools.

When it comes to the relationship between income and green spaces, the statistical measures find a weak to very weak relationship between these two variables. Interestingly, though, they find a negative relationship, meaning that lower income boroughs seem to be weakly correlated with more green space. This is particularly interesting as countries like the USA are famous in research for having the opposite relationship (see the literature review section). This provides further evidence for the rationale behind this study: that associations between poverty and other aspects of life need to be studied at the local level rather than generalised.

When it comes to the relationship between income and life expectancy, the statistical measures find a moderate to strong positive relationship between income in each borough and life expectancy, meaning that boroughs with higher income tend to have higher life expectancy. This is what we expected, as it is likely that individuals with higher income will have more access to health services and healthy products throughout their life, and will therefore live longer on average.

When it comes to the relationship between income and drug use, the statistical methods suggest no clear relationship between the two variables. This is especially interesting since both this variable and life expectancy can be considered indicators of health, yet they are differently associated with income. This seems to suggest that studies of this kind should be very careful when exploring indicators of more overarching areas and should focus on specific aspects of life as much as possible.

When it comes to the relationship between income and physical activity of kids in schools, the statistical methods suggest a weak positive relationship between the two variables, meaning that boroughs with higher levels of income seem to be weakly associated with higher levels of physical activity in children. This is what we expected, as it is likely that individuals with higher income will be able to put their children in schools with more funding, which are more likely to be able to provide these services.

Overall, there are some very interesting conclusions that can be drawn from this research, which add to the current literature and can inform policy makers on how best to act. First, the relationships between these variables and income in London clearly differ from those in other places such as the USA. This adds evidence to the claim that motivated this research: that associations with income can only be properly ascertained through case-study analysis. Further, the results here suggest that even indicators of the same aspect can have different relationships with income. Overall, the relationship between income and other aspects relevant to the lives of individuals seems to be very complex, and research should be as specific as possible and avoid generalisations.

Further studies should keep focusing on specific indicators of various aspects relevant to human life. This specific research seems to suggest that, in order to improve the lives of low income individuals in London, policy makers should focus on policies that lead to an increase in life expectancy as well as in the physical activity of kids in schools. A focus on green spaces and drug use, while it might improve the lives of all, is not worth pursuing specifically in low income areas.

Limitations

We have assumed that our datasets are representative of certain aspects of quality of life. For example, we are assuming that contact with drug action teams acts as an indicator of health and social policy. However, when we make assumptions such as this, there is the possibility that a dataset may not actually be representative of a facet of quality of life, and may run opposite to other metrics within that facet. An aforementioned example is the contact with drug action teams and life expectancy datasets, both of which are assumed to be indicators of the success of health and social policy. The correlations do not align with one another: the former indicates an inconclusive relationship and the latter a strong positive relationship. This shows the variance that can occur within a single facet of quality of life and the importance of taking multiple indicators within a facet to ensure a comprehensive analysis.

Secondly, there are discrepancies between the age of our data sets, with the income dataset’s latest entry being the 2012/13 fiscal year, the drug action team dataset’s latest entry being the 2008/09 fiscal year, and the land use/green spaces dataset’s latest entry being 2005. This lack of recent data could negatively impact the applicability of our findings. For example if there had been a dramatic change in the wealth of London boroughs in relation to each other between 2005 and 2012/13, the relationship found would be less relevant. However this is unlikely to have happened in 7 years so it is not a major problem. Future studies should attempt either to use more recent data or collect raw data themselves.

Some datasets are accompanied by more specific issues. For example, the life expectancy data represents average life expectancy at 65. The life expectancy of people at that age is likely to largely depend on how healthy their life has been so far, which is not likely to depend on which borough they live in now. However, a high correlation is still likely, as wealthier people (who live in wealthier boroughs) are likely to have better access to good healthcare and tend to lead healthier lifestyles. This means though our data shows a high correlation between the wealth of a borough and the longevity of its population, this may not provide enough information to policymakers concerning healthcare before the age of 65. However, healthcare is increasingly important in elderly years, 65 and above, as chronic diseases become more prevalent. This may help policymakers improve healthcare for seniors aged 65 and over, which is increasingly important in an age where the population is ageing.

Another issue concerns the drug action team dataset, which is presented as the percentage of the borough population that comes into contact with drug action teams, a proxy for those who seek drug treatment. This only covers part of the story and doesn’t indicate the quality or thoroughness of care, which are arguably more important. It also doesn’t indicate how many people are actually drug users, which could provide insight into the relationship between income and drug use, and could be more interesting to study and understand than contact with drug action teams.

Finally, there is a limitation that applies to all projects like this, that correlation does not equal causality. The presence of a correlation between income and a given dataset does not mean that one variable directly causes another, or vice versa. While it may be tempting to assume so, there isn’t enough information provided to reliably tell us that one variable causes another, which could result in us drawing spurious conclusions if we were to assume causation. However, this does not necessarily undermine our conclusions as the correlations found can still be used to specifically target low income individuals and improve their lives in comparison to other income groups.

Despite these limitations, it is important to recognise that there are still valuable conclusions that we can draw from our project, as shown by the conclusion section. We have studied these variables closely and used a wide variety of techniques to analyse the datasets we found, reducing the potential impact of these limitations. Future research projects, and importantly longer and more resource-heavy ones with the ability to collect raw data, should take these limitations into account to advance the study of poverty in London and the factors associated with it. For example, they could use several factors as indicators of health, and potentially create a composite weighted index to compare to income, potentially creating a Human Development Index for London boroughs.

THE TEAM

Daisy

India

José

Thomas

Justas

Vision and Motivation

Aims and Hypotheses

Literature Review