Whether you’re just getting started with your first AWS data-driven project or have an impressive track record as a data connoisseur in the AWS ecosystem, AWS Glue is a must, because it is a managed service that:
- Helps you save time and resources.
- Allows for faster data integration.
- Automates a serious amount of the manual effort spent on building, maintaining, and running ETL jobs.
Even with AWS recently releasing its promising and exciting new Autoscaling feature in Glue 3.0, there’s still plenty of room for improvement in your existing Glue ecosystem.
So with the help of our Solutions Engineer, Gabriel Eisenberg, let’s unpack why AWS Glue works best when workload optimisation is thrown into the mix.
What is AWS Glue and What Does it Do?
According to the AWS website,
“AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for analytics. Glue is ‘serverless’ – you don’t need to provision or manage any resources and you only pay for resources when Glue is actively running.”
To understand why it matters, it’s helpful to know what data integration is. Data integration is essentially the process of collecting data from different sources and arranging it in databases for future use.
In order to get the best out of your analytics, machine learning and application development endeavours, it’s essential that your data is of good quality and in an easily accessible, central location.
After your data goes through a number of processes, it gets loaded and organised into a data warehouse, data lake or database.
And that’s where AWS Glue shines.
AWS Glue makes it easy to comb through, merge and analyse data by cutting the time these processes take. That means you get to leverage your data in minutes rather than over extended periods.
Gabriel Eisenberg explains:
“When you’re doing big data processing and working in a big data space, say it’s hundreds of GB of data a day, you won’t be able to get away with simple or traditional data processing techniques. So you need to start using tools that are more equipped for big data processing.”
“With AWS Glue, everything is serverless. So AWS handles the underlying infrastructure for you. You just need to set up the parameters of your Glue job and it will spin up everything. Glue handles a lot of the complexities for you and adds nice functionality that is primed for ETL workloads. It’s just a lot easier to get going with the state of processing.”
It’s Great For Analytics, Too!
One of the most common areas for data engineers to find themselves using AWS Glue is in the process of preparing their data for analytics tools.
Research by Beroe, Inc. has revealed that the global business intelligence market is estimated to reach £22.83 billion by 2022, with key drivers behind this trend including big data analytics and demand for data-as-a-service.
“It works in any case where you’re going to be doing big data processing. For example, if you are doing feature generation, you want to run huge sets of data through a performant pipeline and get things done in a reasonable amount of time,” says Eisenberg.
The use cases vary across a wide range of industries, most of which are growing increasingly reliant on data to drive their decision-making and can make effective use of these analytical tools.
“Cases like ingesting a whole bunch of data from a range of IoT sensors. It’s a gigantic scale of data and you need to process it all before depositing it in a data warehouse. Or to process them for use in a data lake. Or to have it ready for consumption in some way or another… With such a huge mass of data, it’s difficult to process it in traditional ways.”
Data engineers make effective use of Glue to aid them in the integration of data across a broad spectrum of industries. It allows them to generate insightful reports, build better machine learning models, and create other useful applications.
The Road to Optimisation – “When You Were There, What Did You Miss?”
AWS Glue is priced on a pay-per-use model: you pay an hourly rate per data processing unit (DPU), billed by the second, so the more DPUs you use, the higher the cost of running your process will be.
So it’s essential that organisations take care to plan and monitor in order to find the best possible approaches to optimise their running costs.
And while AWS Glue 3.0’s new Autoscaling feature is promising, there is plenty of room for fine-tuning and improvement in Glue 2.0 and 3.0 ecosystems alike – especially for risk-averse Glue connoisseurs, and especially when it comes to alleviating the extra costs associated with demand.
So to begin, you need to identify the right opportunities to make adjustments.
“You’ve built the pipeline and it’s sitting in production. It’s being used. But what did you maybe miss? What could you have done to optimise it to save costs? To make the pipeline better and more performant?”
“What you could do is adjust your number of workers. However, the problem with adjusting it in this way is knowing how many to use.”
“Let’s take a Glue job with 10 workers that runs for 4 hours, as an example. I need the job to complete in a much quicker time. So I try 100 workers and suddenly, my Glue costs start rising. Perhaps I only needed 60 DPUs for a balance between cost and performance, but I’m now running 100 and overspending for diminishing performance returns.”
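The trade-off in this example can be sketched with some back-of-the-envelope arithmetic. The snippet below assumes the commonly cited $0.44 per DPU-hour rate and G.1X workers at 1 DPU each – both are assumptions for illustration, so check the pricing for your own region and worker type:

```python
# Back-of-the-envelope Glue cost arithmetic.
# Assumes $0.44 per DPU-hour and G.1X workers (1 DPU each) -- both
# illustrative; actual rates vary by region and worker type.

DPU_HOUR_RATE = 0.44  # USD, assumed


def job_cost(workers: int, hours: float, dpus_per_worker: int = 1) -> float:
    """Cost of a single Glue job run (hourly rate, billed per second)."""
    return workers * dpus_per_worker * hours * DPU_HOUR_RATE


baseline = job_cost(workers=10, hours=4.0)    # 40 DPU-hours -> $17.60
# If scaling were perfectly linear, 100 workers would finish in 0.4 h
# for the same total cost -- but Spark jobs rarely scale linearly.
ideal = job_cost(workers=100, hours=0.4)      # still 40 DPU-hours
# In practice the job might only speed up to, say, 1 hour (illustrative):
realistic = job_cost(workers=100, hours=1.0)  # 100 DPU-hours -> $44.00
```

The point of the arithmetic: jumping from 10 to 100 workers only keeps cost flat if runtime drops tenfold, which diminishing returns make unlikely – hence the search for a sweet spot like the 60 DPUs in the example.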
The importance of monitoring and fine-tuning becomes increasingly apparent as you begin to determine areas of demand and where to balance load.
“Perhaps I scale it down. Maybe I scale it up. At the same time, I’m trying to monitor things as well… If you’re passing over 50% memory or CPU in use, it might mean that you need more workers. And if you’re reaching some sort of threshold, it means you need more workers to process the job more comfortably.”
5 Factors To Consider For Optimising Costs
A huge burden in this optimisation process is that it is highly manual and time-consuming for data engineers.
So our teams are looking at a number of key parameters to help simplify the fine-tuning process and make these tedious optimisations much easier.
“The problem of optimisation without Autoscaling is that it’s very manual. You have to look at performance graphs, you have to look at how long your job takes. You have to look at costs. There’s a range of outcomes to consider and you make adjustments to one or two variables to try and reach a point where you’re in that happy balance between costs and time.”
“So what one could do is record details about jobs. Is memory usage getting too high? Are some workers underutilised?”
Collecting and analysing the stats and outcomes of your jobs alleviates a lot of the guesswork in your optimisation strategy.
Eisenberg lists these factors as:
- CPU load (minimum, maximum, average, standard deviation).
- RAM load (minimum, maximum, average, standard deviation).
- Number of workers/DPUs.
- Worker type.
- Job duration.
“While it might differ from organisation to organisation, I would recommend to first set up a process to extract the metadata for these factors for each job run and place it in a database like DynamoDB or in storage like S3.
“That way, we can begin collecting metadata about our jobs and we can access it when we are ready to optimise our workloads.”
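The collection step described above can be sketched as a small flattening function. The record shape below mirrors the job-run fields returned by boto3’s `glue.get_job_runs` (`Id`, `WorkerType`, `NumberOfWorkers`, `ExecutionTime`); the CPU/RAM statistics would come from the job’s CloudWatch metrics and are passed in separately here, so the sketch stays self-contained:

```python
# Sketch: flatten one Glue job run into a metadata row ready for
# DynamoDB (table.put_item) or S3. The `run` dict mirrors an entry from
# boto3's glue.get_job_runs() response; cpu/ram stats are assumed to be
# pre-computed from CloudWatch metrics.

def extract_run_metadata(run: dict, cpu_stats: dict, ram_stats: dict) -> dict:
    """Flatten a job-run record plus load statistics into one row."""
    return {
        "run_id": run["Id"],
        "worker_type": run.get("WorkerType", "Standard"),
        "num_workers": run.get("NumberOfWorkers", 0),
        "duration_s": run.get("ExecutionTime", 0),  # runtime in seconds
        # min/max/average/standard deviation, per the factors listed above
        "cpu": cpu_stats,
        "ram": ram_stats,
    }


# Hypothetical run record for illustration:
sample_run = {
    "Id": "jr_abc123",
    "WorkerType": "G.1X",
    "NumberOfWorkers": 10,
    "ExecutionTime": 14400,  # 4 hours
}
row = extract_run_metadata(
    sample_run,
    cpu_stats={"min": 0.12, "max": 0.91, "avg": 0.55, "std": 0.18},
    ram_stats={"min": 0.20, "max": 0.84, "avg": 0.47, "std": 0.15},
)
```

Once rows like this accumulate per job, the optimisation questions (“Is memory usage getting too high? Are some workers underutilised?”) become simple queries over the stored metadata.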
For example, you might provision 200 workers, but it seems like a lot of them aren’t even being used. So maybe you can cut down your number of workers to about 150 or 100.
The challenge comes in when you notice that the job is starting to throttle again.
“The problem is that the data is not necessarily consistent. Which means there could be a variation in the amount of data you have to process. For reasons like a public holiday or an event like Black Friday. So maybe you have abnormalities or outliers that you need to account for if you have time-sensitive workloads.”
“What you could do is define a rules engine or a model.”
“Maybe you could take in a number of the parameters together. Like if my CPU load and my RAM load is performing under a certain threshold, or above a certain threshold, I’m going to increase or decrease the number of workers. Maybe by 10. Let’s look at how that works out and then adjust it. And you can automate this whole process in a script.”
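A minimal version of the rules engine Eisenberg describes could look like the function below. The thresholds (the 50% floor echoes the earlier quote, the 75% ceiling is an assumption) and the step of 10 workers are illustrative starting points to be tuned against your own job metadata:

```python
# Sketch of a worker-count rules engine: nudge the count up or down by a
# fixed step when average CPU/RAM load crosses a threshold. Thresholds
# and step size are illustrative assumptions, not AWS guidance.

def recommend_workers(cpu_avg: float, ram_avg: float, current: int,
                      low: float = 0.50, high: float = 0.75,
                      step: int = 10, minimum: int = 2) -> int:
    """Return an adjusted worker count based on average CPU/RAM load (0-1)."""
    if cpu_avg > high or ram_avg > high:
        return current + step                 # job is straining: scale up
    if cpu_avg < low and ram_avg < low:
        return max(minimum, current - step)   # underutilised: scale down
    return current                            # in the comfort band: leave it


# Example: both loads well under 50%, so drop from 100 to 90 workers.
print(recommend_workers(cpu_avg=0.35, ram_avg=0.40, current=100))  # 90
```

Run on a schedule against the collected job metadata, a function like this closes the loop: each run’s stats feed the next run’s worker count, which is exactly the adjust-and-observe cycle described in the quote.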
Not Ready For Autoscaling? We Get It! In The Meantime…
Because AWS Glue 3.0 has a list of migration requirements, not all businesses will be ready to reap the benefits of Autoscaling just yet.
So in the meantime, it’s important to look at opportunities to reduce cost and optimise performance wherever possible.
“Because AWS Glue 3.0 is new and because there’s a need to go through a migration effort, maybe you can optimise your existing AWS Glue 2.0, or earlier, workloads.”
“If you’re not ready to migrate, or your company’s not there yet, then what you may want to do is apply a simple optimisation strategy to bring your costs down and increase performance. And if necessary, opt for a cost-performance trade off. Then migrate when you’re ready.”
Get It Right The First Time (With Us)
With the right expertise, you can ensure that your environment remains as efficient as possible. That means reaping the many benefits that come with:
- Time and resource savings.
- Faster data integration.
- Reduced effort on ETL jobs through automation.
Our data engineering teams have all the skills and experience needed to help get you started, or to get your Glue 2.0 environment comfortably optimised.
If you’re looking for the best industry advice on anything data, ML, or cloud related, then check out our blog!