Enterprise data needs change constantly but at an inconsistent rate, and in recent years change has come at an increasing rate. Tools once considered useful for big data applications are no longer sufficient. Development in other IT areas such as IoT, geolocation, etc. have changed the way data was collected, stored, distributed, processed and analyzed. Real-time decision needs complicate this scenario even more and we needed new tools to handle these challenges efficiently.
Prisoft has decades of experience in implementing on premise data warehouse solutions around Oracle, Teradata and Microsoft SQL Server technology stack. On our legacy projects we had mainly used Oracle Answers, PowerBI, Tableau and Qlik as the visualization tools. In the evolution process from Oracle to Teradata we are now witnessing a third wave of innovation in data warehousing technology with the advent of cloud data warehouses. As enterprises move to the cloud, they are abandoning their legacy on premise data warehousing technologies, including Hadoop, for these new cloud data platforms. This transformation is a huge tectonic shift in data management and has profound implications for enterprises.
Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.
As data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data. Companies are looking to access all their data, all the time, by all users and get fast answers, and one of the best solution available for them in market is Amazon Redshift and Amazon Web Services (AWS). Amazon redshift along with other AWS services allows you to store all your structured, semi-structured, and unstructured data at any scale and provides insight whenever required.
Amazon Redshift Spectrum is a feature within Amazon Web Services Redshift data warehousing service that lets us conduct fast, complex analysis on objects stored on S3.
Redshift Spectrum can perform SQL queries on data stored in Amazon S3 buckets. This can save time and money because it eliminates the need to move data from a storage service to a database, and instead directly queries data inside an S3 bucket. Spectrum breaks a query into subsets that are run concurrently. These requests are spread across nodes to maintain query speed and consistent performance.
How to create and access Redshift Spectrum
In the next part, we will use Redshift Spectrum to query user log files stored in S3.
- For this we will be using the click stream logs available in public s3 buckets.
- The next step is to create tables / glue data catalogues that will be pointing to these s3 locations. Once done we can reference these external tables/data catalogues in our SQL query to query these s3 locations.
- We will navigate to AWS Glue and create a new glue crawler.
- Select Crawler in the left pane and then click on “Add Crawler”.
- Add a name for the crawler.
- In the Crawler Type specification window, we will check the “Data Store” type check box, also select “Crawl all folders” and click on “Next”.
- Enter the S3 location and keep all other options default and click on finish.
- Select the check box against your crawler and click on “Run Crawler”.
- Once the crawler runs successfully you should see your tables created.
- In the next step, we will create an external schema pointing to the data catalogue we just created.
- Now we can use/reference this external schema in SQL queries to query S3.
- Let’s run a query and see how Redshift Spectrum tables join with Redshift tables.
Redshift spectrum is a great tool to have in any organization’s bucket using Amazon Web Service or wanting to get maximum value out of their data in least costly way. It is a smart way to add data lake features inside a data warehouse already hosted in Redshift.