A company I am working with is fond of Amazon Web Services (AWS) and the entire suite of tools therein. I thought it would be useful to set myself a task to force myself to learn some of the technologies that I am least familiar with and share what I have learned so far.

The Task

Using Price Paid data from the U.K. government website, going back to 1995, I set myself the task of looking for trends in house price sales using only AWS tools.

Limitations

As with any task using third-party data, it is essential to know and understand the limitations of what can be analysed. To that end, I made the following notes:

  1. Only "fair market" prices are included in the government files, meaning any property purchases through the 'Right to buy' scheme, gifts, under a court order, or other exclusions are not available for assessment.
  2. Although the data provides the type of property sold, it does not include the number of bedrooms, bathrooms, floor space, or other metrics needed to perform a thorough analysis.

Even so, the dataset is large and useful for statistical analysis.

Getting Started With AWS

Of course, the first thing to do is to set up an account. This is a painless process but does require adding a credit card from the outset, even if no charges accrue. It's worth noting that AWS, just like other cloud providers, offers a generous free tier to try out the systems. Once logged in, the first thing to see is the management console.

AWS Management Console

Once logged in, it's important to remember to set the geographical location in the top-right corner to an AWS region that supports every service intended for use. In my case, I originally set the region to London but soon realised AWS QuickSight was only available in Ireland (within the EMEA block), which did not have access to the AWS Athena resources I had prepared in London. Instead, I had to recreate the Athena resources in Ireland, too. See the AWS Regional Table documentation for more details.

Uploading The Data To S3

Downloading the entire dataset from the gov.uk website yields a 4.3GB CSV file, correct at the time of writing. The first ten lines of the file look like this:

"{A42E2F04-2538-4A25-94C5-49E29C6C8FA8}","18500","1995-01-31 00:00","TQ1 1RY","F","N","L","VILLA PARADISO","FLAT 10","HIGHER WARBERRY ROAD","TORQUAY","TORQUAY","TORBAY","TORBAY","A","A"
"{1BA349E3-2579-40D6-999E-49E2A25D2284}","73450","1995-10-09 00:00","L26 7XJ","D","Y","F","6","","CATKIN ROAD","LIVERPOOL","LIVERPOOL","KNOWSLEY","MERSEYSIDE","A","A"
"{E5B50DCB-BC7A-4E54-B167-49E2A6B4148B}","59000","1995-03-31 00:00","BH12 2AE","D","N","F","28","","ALDER ROAD","POOLE","POOLE","POOLE","POOLE","A","A"
"{81E50116-D675-4B7F-9F8D-49E2B5D43271}","31000","1995-12-04 00:00","IP13 0DR","D","Y","F","NONSUCH COTTAGE","","THE STREET","HACHESTON","WOODBRIDGE","SUFFOLK COASTAL","SUFFOLK","A","A"
"{B97455B9-75CB-40BB-A615-42C53683E143}","95000","1995-09-22 00:00","WS14 0BE","D","N","F","FOX COVER COTTAGE","","HALL LANE","LICHFIELD","LICHFIELD","LICHFIELD","STAFFORDSHIRE","A","A"
"{F0D1E8DA-C00D-467A-A41C-42C5378DB6E0}","45450","1995-02-28 00:00","S42 5GA","S","Y","F","109","","ELVASTON ROAD","NORTH WINGFIELD","CHESTERFIELD","NORTH EAST DERBYSHIRE","DERBYSHIRE","A","A"
"{7DAC48DA-D479-4922-86B0-42C5580DFC67}","96000","1995-10-27 00:00","KT17 2DU","S","N","F","82","","KINGSTON ROAD","EPSOM","EPSOM","EPSOM AND EWELL","SURREY","A","A"
"{10E5F080-7AF3-4982-AAEF-42C55DC955FC}","30000","1995-11-28 00:00","WS10 9LD","S","N","F","66","","HILL STREET","WEDNESBURY","WEDNESBURY","WALSALL","WEST MIDLANDS","A","A"
"{B365B080-3670-4955-80F8-42C55F081143}","425000","1995-03-31 00:00","KT11 1HP","D","N","F","18","","BRUNSWICK GROVE","COBHAM","COBHAM","ELMBRIDGE","SURREY","A","A"
"{7648EFE5-67A0-4728-B5DE-42C565DA1D28}","89995","1995-06-30 00:00","WF6 2TT","D","Y","F","8","","SHAW AVENUE","NORMANTON","NORMANTON","WAKEFIELD","WEST YORKSHIRE","A","A"

It's worth making some notes about the data:

  • There is no header record naming each column. Instead, the column definitions are documented online.
  • Double-quotes surround each field value.
  • The date field is in an irregular format of "YYYY-MM-DD HH:mm", with the time element always set to 00:00 (midnight).

Even though a significant part of any data analysis project is the data preparation, in keeping with the goals of the task, I uploaded this file to AWS S3 without modification.

S3 stands for Simple Storage Service, and there is little more to know than that it's a place to store files, which are arranged in buckets (top-level containers, loosely analogous to folders).

Uploading files through the web interface is as simple as creating a bucket and uploading the file with the main call-to-action buttons:

AWS S3 Interface
AWS S3 Interface: elegantsoftware-landregistry bucket

Loading The Data Into Athena

AWS Athena is a tool to query data without managing databases:

Amazon Athena is a fast, cost-effective, interactive query service that makes it easy to analyze petabytes of data in S3 with no data warehouses or clusters to manage.

Initially, the interface presents little more than an empty query editor, so it can be daunting to understand how it works:

AWS Athena Interface

However, with some SQL knowledge, it's easy to understand how to get things going. Firstly, we create a new database:

CREATE database land_registry;

Secondly, we create a table from our data in S3:

CREATE EXTERNAL TABLE IF NOT EXISTS land_registry.sales (
  `id` string,
  `price` int,
  `date` string,
  `postcode` string,
  `property_type` string,
  `old_new` string,
  `freehold_leasehold` string,
  `paon` string,
  `saon` string,
  `street` string,
  `locality` string,
  `town` string,
  `district` string,
  `county` string,
  `ppd_category` string,
  `record_status` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  'separatorChar' = ',',
   'quoteChar' = '"',
   'escapeChar' = '\\'
)
STORED AS TEXTFILE
LOCATION 's3://elegantsoftware-landregistrydata/'
TBLPROPERTIES ('has_encrypted_data'='false')

Some notes on this query:

  1. As far as I can tell, all tables in Athena must be created with the EXTERNAL keyword.
  2. Most fields in the government dataset are string fields.
  3. The date field cannot be parsed as a date because its format is a hybrid of the commonly seen DATE and TIMESTAMP field types, without conforming sufficiently to either.
  4. There are various serialiser-deserialiser (SerDe) classes available for processing different file types, including CSV, TSV, JSON, Parquet, Avro, and more. We need to use OpenCSVSerde because the field values were enclosed in double-quotes. See relevant AWS Athena docs for more.
  5. The LOCATION keyword specifies the bucket location in S3, using the custom s3:// protocol, which can be found on the S3 bucket page earlier.

With the table created, we can query the data.
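
As a quick sanity check that Athena can read the file from S3, a simple row count does the job (a minimal sketch; the exact count will depend on the version of the file downloaded):

-- Count every record Athena can read from the CSV in S3
SELECT COUNT(*) AS total_rows
FROM land_registry.sales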

Basic Queries

There are some wild data records in the file that take some understanding to interpret. For example, let's start by querying the most expensive purchases of all time (since 1995):

SELECT price, date, saon, street, county, postcode, property_type, old_new, freehold_leasehold, ppd_category, record_status
FROM sales
ORDER BY price DESC 
LIMIT 10
The most expensive property purchases in the data set

This shows the most expensive purchase was for an incredible £594.3 million in 2017. The acquisition appears to be of a medical centre in Marylebone, London.

For my task, it would be sensible to remove "other" property types like this from the output:

SELECT price, date, saon, street, county, postcode, property_type, old_new, freehold_leasehold, ppd_category, record_status
FROM sales
WHERE property_type != 'O'
ORDER BY price DESC 
LIMIT 10
The most expensive property purchases in the data set, excluding "other" property types

Unsurprisingly, the most expensive purchases were for properties in Knightsbridge and other expensive parts of London.

On the other end of the scale, we can see just how many sales have gone through for the lowest sums of money:

SELECT count(id) as count, price
FROM sales
GROUP BY price
ORDER BY price ASC 
LIMIT 10
The frequency of each of the cheapest property purchases in the data set

For one reason or another, there have been 96 property purchases with a sale price of just £1 since 1995 that the Land Registry deems to have been made at fair market value.
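
To see those records in more detail, a query along the same lines as the ones above works (a sketch; the columns listed are simply the ones most useful for eyeballing the results):

-- List the earliest £1 transactions with enough address detail to investigate them
SELECT price, date, paon, street, town, county, property_type
FROM sales
WHERE price = 1
ORDER BY date ASC
LIMIT 10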

Creating Views For Display

I wanted to display on a graph the average price paid for a property, subject to the following constraints:

  • Prices paid should be broken down and averaged by the year and month of purchase
  • Prices paid should not include "other" property types
  • Only prices paid since 2005 should be included

I set up the following query:

SELECT count(id) as count, DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y-%m') as year_month, AVG(price) as average_price, property_type as property_type
FROM land_registry.sales
WHERE property_type != 'O'
AND DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y-%m') >= '2005-00'
GROUP BY DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y-%m'), property_type
ORDER BY DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y-%m') ASC

As the date field is a string in the table, I needed to use the date_parse function, specifying how to interpret the non-standard format of the date field. The date_format function then picks out only the year and month for querying.

Unfortunately, although SQL queries are written in the order SELECT, FROM, WHERE, GROUP BY, ORDER BY, they are logically evaluated in a different order, with clauses such as WHERE and GROUP BY processed before SELECT. This means those clauses cannot reference a SELECT alias, so the date_parse and date_format expressions have to be repeated throughout the query.
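
As an aside, one way to avoid the repetition is to compute the parsed value once in a WITH clause, which Athena's Presto-based engine supports (a sketch of the same query restructured; I kept the flat version above as it matches how I built the query up):

-- Parse the date once, then filter, group and order on the derived year_month column
WITH parsed AS (
  SELECT id,
         price,
         property_type,
         DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y-%m') AS year_month
  FROM land_registry.sales
  WHERE property_type != 'O'
)
SELECT count(id) AS count, year_month, AVG(price) AS average_price, property_type
FROM parsed
WHERE year_month >= '2005-00'
GROUP BY year_month, property_type
ORDER BY year_month ASC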

With the date processing in place, the original query yields the following results:

Property purchases grouped by year, month and property type

The first row shows there were 11871 flats/maisonettes sold in January 2005, with an average price paid of a little over £172,000, and the rest of the rows show data for the other property types through to May 2019.

Pressing the grey "Create"->"Create view from query" button in the query editor will save this output as a view. Alternatively, prepending the entire query above with the following will also create or replace a view:

CREATE OR REPLACE VIEW "sales_average_prices" AS
... query ...

Graphing The Data

I used AWS QuickSight to graph the output of this view. Amazon describes this tool as:

Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization.

Creating a new analysis from a data set, to use QuickSight's terms, is a straightforward process. QuickSight can pull data from various sources, of which I chose AWS Athena:

AWS QuickSight Interface

The instructions for the Athena source are easy to follow: name the data set, select the database, and select the recently created view as the source.

After setting up the data set, a visualisation can be created by selecting the line graph option in the bottom-left and selecting the x-axis, value and colour options from the fields list.

AWS QuickSight Visualisation

From here, we can see the general upward trend of all property prices, with detached houses setting the top prices paid, on average, and terraced houses and semi-detached houses roughly level at the bottom end.

Of course, many more visualisations can be created from additional Athena views loaded into QuickSight. For example, the following graphs show the average prices paid and the total number of sales, by year, for all transactions in the SS11 postcode area since 2006.

AWS QuickSight Visualisations for SS11
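
Behind those graphs sits another Athena view, built in much the same way as the earlier one (a sketch; the view name, the yearly grouping and the SS11 postcode filter are illustrative of what I set up rather than the exact definition):

-- Average price and sales count per year for the SS11 postcode area, excluding "other" property types
CREATE OR REPLACE VIEW "sales_ss11_by_year" AS
SELECT count(id) AS count,
       DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y') AS year,
       AVG(price) AS average_price
FROM land_registry.sales
WHERE property_type != 'O'
AND postcode LIKE 'SS11 %'
AND DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y') >= '2006'
GROUP BY DATE_FORMAT(date_parse(date, '%Y-%m-%d %H:%i'), '%Y')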

Comparing the prices paid by property type in SS11 with the rest of the U.K., it is interesting to note that flats are the cheapest property type in SS11 but the second most expensive property type across the U.K. as a whole.

Improvements

The main improvements I would put in place for real projects would be:

  1. Partition the Athena tables when creating them (for example, by year of sale) so that queries scan less data; Athena has no traditional indexes to add.
  2. Use a tool like AWS Glue to transform the initial data into a more performant format, such as Parquet, before querying it with AWS Athena (see the sketch after this list).
  3. Find out some real requirements for the data being analysed.
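
For the second point, Athena itself can do a one-off conversion using a CREATE TABLE AS SELECT (CTAS) statement (a sketch; the target bucket and table name are made up for illustration, and a Glue job would be the better fit for a repeatable pipeline):

-- Copy the CSV-backed table into a Parquet-backed table in a separate S3 location
CREATE TABLE land_registry.sales_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://elegantsoftware-landregistry-parquet/'
) AS
SELECT *
FROM land_registry.sales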

Conclusion

I am quite impressed with the set of tools offered by AWS for this purpose. The interfaces and processes have been fast and intuitive, and so far, everything has been free as well! I'd recommend these tools to anyone wishing to do cloud-based data analysis at their organisation.

This post contains HM Land Registry data © Crown copyright and database right 2019. This data is licensed under the Open Government Licence v3.0.