Introduction

Overview

Data Lake is a technical term from the field of Big Data. A Data Lake is simply a place to store raw (unprocessed) data until it is processed, analyzed, and turned into insights.

A Data Lake has the following properties:

  • Collect everything – contains all data, raw or processed, over a long period of time.
  • Multi-user – allows multiple users to refine, explore, and enrich the data.
  • Flexible access – supports multiple access patterns on shared infrastructure: batch, interactive, online, search, in-memory, and other processing engines.

AWS Glue is a fully managed ETL (extract, transform, load) service. You can use a Glue crawler to discover your data and store the associated metadata (for example, table and schema definitions) in the Glue Data Catalog. Once cataloged, your data is immediately searchable, queryable, and available for ETL jobs.
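
The crawler step described above can be sketched with the AWS SDK for Python (boto3). This is a minimal illustration rather than the workshop's own setup: the crawler name, IAM role ARN, database name, and S3 path are all hypothetical placeholders, and the Glue client is passed in so the request can be inspected without AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue `create_crawler` call.

    All values are supplied by the caller; nothing here is specific to
    any real account or bucket.
    """
    return {
        "Name": name,
        "Role": role_arn,            # IAM role that Glue assumes to read the data
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client, request):
    """Register the crawler, then start one crawl run."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Example usage (assumes boto3 is installed and AWS credentials are configured;
# every name below is a placeholder):
#
#   import boto3
#   glue = boto3.client("glue")
#   request = build_crawler_request(
#       "raw-data-crawler",
#       "arn:aws:iam::123456789012:role/ExampleGlueRole",
#       "datalake_db",
#       "s3://example-bucket/raw/",
#   )
#   create_and_start_crawler(glue, request)
```

When the crawl finishes, the tables it inferred should appear in the chosen Data Catalog database.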

AWS Glue ETL can generate the code that transforms your data and writes the result to the target bucket. The generated Python code is reusable and customizable.

Once your ETL jobs are ready, you can schedule them to run on a highly scalable Apache Spark environment managed by AWS Glue.
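
As a rough sketch of the scheduling step, a Glue trigger of type `SCHEDULED` can start an existing job on a cron expression. The trigger name, job name, and schedule below are hypothetical; the sketch assumes the ETL job has already been created in Glue.

```python
def build_scheduled_trigger(trigger_name, job_name, cron_expression):
    """Build the keyword arguments for the Glue `create_trigger` call.

    `cron_expression` uses the AWS form
    cron(Minutes Hours Day-of-month Month Day-of-week Year), evaluated in UTC.
    """
    return {
        "Name": trigger_name,
        "Type": "SCHEDULED",
        "Schedule": cron_expression,
        "Actions": [{"JobName": job_name}],  # job(s) started when the trigger fires
        "StartOnCreation": True,             # activate the trigger immediately
    }


def schedule_job(glue_client, trigger_request):
    """Create the trigger so Glue runs the job on the given schedule."""
    glue_client.create_trigger(**trigger_request)


# Example usage (placeholder names; requires boto3 and AWS credentials):
#
#   import boto3
#   glue = boto3.client("glue")
#   trigger = build_scheduled_trigger(
#       "nightly-etl-trigger", "example-etl-job", "cron(0 2 * * ? *)"
#   )
#   schedule_job(glue, trigger)   # runs the job every day at 02:00 UTC
```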

Amazon Athena is an interactive query service used to analyze data in Amazon S3 with standard SQL. You simply point to your data in Amazon S3, define the schema, and start querying with the built-in query editor. Amazon Athena lets you explore all of your data in Amazon S3 without having to set up complex ETL processes. Athena charges per query, based on the amount of data each query scans.

Amazon Athena uses Presto with ANSI SQL support and works with many standard data formats, including CSV, JSON, ORC, Avro, and Parquet. Athena is recommended for fast, ad hoc querying, but it can also handle complex analysis, including large joins, window functions, and arrays.
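
Besides the built-in query editor, queries can also be submitted programmatically. The following is a minimal sketch using the boto3 Athena client, passed in as a parameter; the database name, SQL text, and S3 output location are hypothetical. Athena runs queries asynchronously, so the sketch polls until the query reaches a final state, then reads the first page of results only.

```python
import time


def run_athena_query(athena_client, sql, database, output_location, poll_seconds=1.0):
    """Run one SQL query on Athena and return the first page of rows as lists of strings."""
    started = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query leaves the QUEUED/RUNNING states.
    while True:
        execution = athena_client.get_query_execution(QueryExecutionId=query_id)
        state = execution["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    results = athena_client.get_query_results(QueryExecutionId=query_id)
    # Each row is a list of cells; a NULL cell has no "VarCharValue" key.
    return [
        [cell.get("VarCharValue") for cell in row["Data"]]
        for row in results["ResultSet"]["Rows"]
    ]


# Example usage (placeholder names; requires boto3 and AWS credentials):
#
#   import boto3
#   athena = boto3.client("athena")
#   rows = run_athena_query(
#       athena,
#       "SELECT * FROM example_table LIMIT 10",
#       "datalake_db",
#       "s3://example-bucket/athena-results/",
#   )
```

For a `SELECT` query, the first returned row normally holds the column names. Larger result sets would need pagination, which this sketch leaves out.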

Amazon QuickSight is a data visualization service fully managed by AWS. It is built around the following concepts:

  • Data source is an external data store for which you need to configure access, for example Amazon S3, Amazon Athena, or Salesforce.

  • Dataset defines the specific data in the Data source that you want to use. For example, it can be a table if you are connecting to a database Data source, or a file if you are connecting to an Amazon S3 Data source.

  • Analysis is a container for a collection of related Visuals and stories, for example all those that apply to a given business goal or KPI.

  • Visual is a graphical representation of your data. You can create many Visuals in an Analysis, using different Datasets and Visual types.

  • Dashboard is a page consisting of one or more read-only Analyses that you can share with other Amazon QuickSight users for reporting purposes. The Dashboard preserves the configuration of the Analysis at the time you publish it, including filtering, parameters, controls, and sort order.