By Marcel Coetzee
Whether you are a real estate investor, hedge fund manager, or simply curious about the property market in the UK, analysing publicly available data is a good start to understanding trends.
The Price Paid Data dataset contains information on all sold property sales in England and Wales since 1995. This isn’t a trivial dataset – coming in at a whopping 4.3GB, your first instinct might be to throw it into a database or data warehouse.
What if I told you that you don’t need to?
In fact, your dusty bargain laptop is a perfectly fine candidate to process this data!
We will use the latest and greatest tools: DuckDB and Rill.
DuckDB is an open source SQL query engine designed to be fast and efficient, making it ideal for data processing on laptops with limited resources. Users can quickly and easily explore and process large datasets, without having to worry about their computer’s performance or spinning up a dedicated database/data warehouse. Think SQLite’s big brother for analytics.
Rill, in turn, uses DuckDB behind the scenes without the user having to worry about any details. It’s a dashboarding solution with impressive visuals and responsiveness. Did I mention that it is open source, as well?
You can find the code for this article here, but you will probably learn more by just following along. It’s that easy.
Requirements:
- Mac or Linux with at least 8GB memory. Sorry, Windows users 🙁
Get Rill and Initialise your Project
At the time of writing, all you need to do is issue this command in your terminal:
curl -s https://cdn.rilldata.com/install.sh | bash
Then, run:
rill init –project uk-property-prices
A new folder should have been created:
Go into this directory and start the Rill server:
cd uk-property-prices/
Rill start
Finally, navigate to http://localhost:9009
Download the UK Property Data into a Model
On the UK government’s website, look for the link to the single file CSV:
Copy this link, and back in the Rill dashboard, click on the + next to “Sources” and add the link.e link to the csv:
Remember to save this data to a model with a name that makes sense, like all_property_prices_raw.
Go grab a coffee. This is going to take a while, considering the size of this dataset.
After it has downloaded, click on “create model”. You should be presented with your first model:
The columns aren’t what we’d expect them to be based on the definitions. Let’s amend them to be the right column names and data types. In the SQL block, type:
SELECT cast(column02 AS date) AS ‘Date of Transfer’,
column01 AS Price,
column03 AS Postcode,
CASE column04
- WHEN ‘D’ THEN ‘Detached’
- WHEN ‘S’ THEN ‘Semi-Detached’
- WHEN ‘T’ THEN ‘Terraced’
- WHEN ‘F’ THEN ‘Flats/Maisonettes’
- WHEN ‘O’ THEN ‘Other’
END AS ‘Property Type’,
CASE column05
WHEN ‘Y’ THEN ‘Newly built property’
WHEN ‘N’ THEN ‘Established residential building’
END AS ‘Newly Built’,
CASE column06
WHEN ‘F’ THEN ‘Freehold’
WHEN ‘L’ THEN ‘Leasehold’
END AS Duration,
column09 AS Street,
column10 AS Locality,
column11 AS Town,
column12 AS District,
column13 AS County
FROM all_property_prices_raw;
Then click “Create Dashboard”:
The dashboard is looking pretty cool already.
But we can do better. Click on “edit metrics” and define some custom ones, along with the descriptions that we got from the Dataset’s description page:
Our final product:
You can see the effects of covid in both price volatility and the sharp drop in the number of properties sold.
Worryingly, this trend seems to still be in place as we speak.
That concludes our foray into easy (and cheap) citizen data analytics!
May the duck be with you!
*Looking for more content? Head this way!