All of our data came from the London Datastore (LDS), an official local body for Greater London which
consists of the Mayor of London, currently Sadiq Khan (Labour), and the London Assembly ("About-London
Datastore," n.d.), all of whom are elected by the people. Most of the data comes as data tables in CSV
format, PDF files and Excel files, with entries for each of the 32 London boroughs and the City of London.
The individual datasets, while all provided by the London Datastore, were measured by different authorities:
Income, which corresponds to the average income of taxpayers in each borough in 2014/2015, comes from the
Greater London Authority: find the raw data here
Physical activity of children in schools (abbreviated to "active kids"), which corresponds to the percentage of
children who had access to at least 2 hours of physical activity in schools in 2007/2008, comes from the Annual
Survey of School Sport Partnerships, carried out on behalf of the Department for Children, Schools and Families
by TNS Social Research: find the raw data here
Green Spaces, which corresponds to the percentage area of green space per borough in 2005, comes from the
Department for Communities and Local Government: find the raw data here
Life Expectancy, which corresponds to the mean life expectancy at age 65 in London boroughs in 2012/2013,
comes from the Office for National Statistics: find the raw data here
Drug use, which corresponds to the number of people in contact with drug action teams, comes from the National
Treatment Agency for Substance Misuse (now Public Health England): find the raw data here
Is the London Datastore a reliable source?
Publication by an official, reputable and accountable source gives the impression that the data is
reliable. However, we remain aware that data collected under a party-political administration (as is currently
the case) may have a political agenda underlying it. For example, Sadiq Khan is very outspoken about his wish to
clean London’s air and so could have an incentive to make air quality seem worse than it actually is. However,
we feel this would be difficult to do, as the London Assembly are elected to examine decisions and actions taken
by the Mayor, whom they hold publicly and democratically accountable.
Furthermore, it is possible that some of the data, such as the income data, are estimates or based on surveys.
We therefore kept these potential issues in mind as a team. However, we believe that any discrepancies in the
data would not be too problematic and that the data would still paint an accurate picture of incomes per borough
relative to each other.
Why did we choose these particular datasets?
We assume that the factors investigated are representative of wider themes. For example, we assume that area
of green space is a valid indicator of environmental quality; physical activity of children in schools is a
valid indicator of both physical activity and quality of education; life expectancy and drug use are valid
indicators of health. While we acknowledge that none of the wider themes can be reduced to these indicators
alone, we believe the indicators chosen reveal relevant and important information concerning these areas.
What problems were encountered regarding the data and data cleaning?
The majority of our issues arose in the data cleaning process. Due to the internationalisation (regional)
settings on some of our laptops, converting Excel files to CSV was difficult and we lost some data during the
conversion. To work around this, we copy-pasted the data from Excel into Google Sheets and exported it as a CSV
from there. After that it could easily be uploaded and manipulated in Python via Azure Notebooks.
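As an illustration of the kind of conversion problem described above, the following is a minimal sketch and not
the workflow we actually used. It assumes the data loss came from a locale-specific export that uses semicolons
as delimiters and commas as decimal marks; that cause is an assumption, and the sample text, the borough names
and the helper `read_locale_csv` are hypothetical.

```python
import csv
import io

# Hypothetical sample of a locale-affected export (semicolon delimiter,
# decimal comma). The names and numbers are invented for illustration.
RAW_EXPORT = "Borough;Mean income\nBorough A;41,5\nBorough B;38,25\n"


def read_locale_csv(text, delimiter=";", decimal=","):
    """Parse delimited text, converting decimal-comma numbers to floats."""
    rows = []
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    for record in reader:
        cleaned = {}
        for key, value in record.items():
            candidate = value.replace(decimal, ".")
            try:
                cleaned[key] = float(candidate)
            except ValueError:
                # Keep non-numeric text, such as borough names, unchanged.
                cleaned[key] = value
        rows.append(cleaned)
    return rows


rows = read_locale_csv(RAW_EXPORT)
```

Stating the delimiter and decimal mark explicitly, rather than relying on each laptop's regional defaults, is one
way to make such a conversion behave the same on every machine.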
Due to the City of London's status as a "principal division" (rather than a borough) of London, its data was
occasionally not included in our chosen datasets. This was the case with our "Active Kids" and "Life Expectancy"
datasets. As such, we had to estimate what the missing values may have been. For the former we looked at all the
schools in the City of London and estimated this value. For the latter we did not estimate a value for the
statistical analysis, since we had no reliable way to obtain this data. However, for the purposes of the
choropleth map visualisation, we took the average of all the other values to allow the visualisation to work.
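The two treatments of the missing life expectancy value can be sketched as follows. This is a minimal
illustration under stated assumptions, not our notebook code: the figures are invented, and the helper names
`fill_missing_with_mean` and `drop_missing` are hypothetical.

```python
from statistics import mean

# Invented values for illustration only; None marks the missing entry.
life_expectancy = {
    "Borough A": 20.0,
    "Borough B": 22.0,
    "City of London": None,
}


def fill_missing_with_mean(values):
    """Replace missing entries with the mean of the known entries (map only)."""
    known = [v for v in values.values() if v is not None]
    fill = mean(known)
    return {name: (fill if v is None else v) for name, v in values.items()}


def drop_missing(values):
    """Exclude missing entries entirely (statistical analysis)."""
    return {name: v for name, v in values.items() if v is not None}


for_map = fill_missing_with_mean(life_expectancy)
for_stats = drop_missing(life_expectancy)
```

Keeping the two versions separate means the filled-in value only affects how the map is drawn and never enters
the statistical analysis.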
Ethical considerations
Anonymity is an ethical consideration present whenever working with big data. In data analysis it is important
to be aware of the potential identification of the people who provide the data. The nature of the data and data
source means anonymity should not be an issue here. Because the data is available only for boroughs with large
populations, anonymity is well secured: no population is small enough to allow cross-checking with other
datasets. However, this also poses a problem, in that such anonymity forces us to make assumptions in our data
analysis: it would be impossible to prove causation, only correlation.
Another consideration when using big data is that many of the safeguards and considerations in traditional
social research are not present in the creation of big data, particularly with regard to privacy and informed
consent. Privacy is recognised as a human right under numerous declarations and treaties. In the UK, the
European Convention on Human Rights, which has provisions concerning privacy rights, has been implemented
through the Human Rights Act 1998, with protection of personal data provided by the Data Protection Act 1998
(Your right to respect for private and family life, n.d.; Data protection - GOV.UK, n.d.). However, authorities
that collect big data are often able to skirt the protections imposed. This is of particular concern with the
household income, physically active children, and drug treatment rate datasets, which concern the livelihoods of
people and depend on recording information that some could consider sensitive and private.