Data Lake Instance
A Data Lake Instance is a large, composite, heterogeneous database whose constituent databases are kept in their original data structures.
- Context:
- It can (typically) be intended to increase Data Accessibility.
- It can (typically) be accessed by a Data Lake Querying System, such as a Hadoop-based or Spark-based querying system.
- It can (often) be associated with a Meta-Data Repository.
- It can (often) support Ad Hoc Analysis Tasks.
- It can (often) have a unified Data Format, such as JSON or Parquet.
- It can (often) be a Big Data Dataset.
- It can range from (typically) being a Centralized Data Lake to being a Distributed Data Lake (e.g. a federated data lake).
- It can range from being a High-Quality Data Lake to being a Low-Quality Data Lake (a data swamp with few data quality checks).
- It can be hosted by a Data Lake System, such as a Custom Data Lake System or a Commercial Data Lake Service (e.g. Azure Data Lake).
- Example(s):
- Maine's Data Lake Instance.
- Google's Data Lake of Sept 29th 2016.
- an AWS S3-based Data Lake.
- …
- Counter-Example(s):
- a Federated Database.
- a Data Warehouse (a subject-oriented database with an enterprise schema), such as a financial data warehouse.
- an AWS S3 Log File Repository.
- an Associative Array Database.
- See: Operational Data Store, Data Strategy, Data Management Maturity Model.
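The Context properties listed above (datasets kept in their original data structure, an associated Meta-Data Repository, and support for Ad Hoc Analysis Tasks) can be illustrated with a minimal Python sketch. It is a toy illustration that uses only the standard library and does not describe how any particular Data Lake System works; the file names, the `catalog` dictionary, and the `build_toy_lake`, `read_records`, and `demo_ad_hoc_total` helpers are all hypothetical.

```python
import csv
import json
import tempfile
from pathlib import Path


def build_toy_lake(root: Path) -> dict:
    """Store two datasets in their original formats and return a metadata catalog."""
    # Hypothetical raw datasets, written as-is; no common schema is imposed.
    (root / "orders.csv").write_text("order_id,amount\n1,10.5\n2,4.0\n")
    clicks = [{"user": "a", "page": "/"}, {"user": "b", "page": "/cart"}]
    (root / "clicks.json").write_text("\n".join(json.dumps(c) for c in clicks))
    # The catalog stands in for a metadata repository: it records where each
    # dataset lives and which format it is stored in.
    return {
        "orders": {"path": root / "orders.csv", "format": "csv"},
        "clicks": {"path": root / "clicks.json", "format": "jsonl"},
    }


def read_records(catalog: dict, name: str) -> list:
    """Look a dataset up in the catalog and parse it according to its format."""
    entry = catalog[name]
    lines = entry["path"].read_text().splitlines()
    if entry["format"] == "csv":
        return list(csv.DictReader(lines))
    if entry["format"] == "jsonl":
        return [json.loads(line) for line in lines if line]
    raise ValueError(f"unsupported format: {entry['format']}")


def demo_ad_hoc_total() -> float:
    """Ad hoc analysis: total order amount, computed directly on the raw file."""
    with tempfile.TemporaryDirectory() as tmp:
        catalog = build_toy_lake(Path(tmp))
        return sum(float(r["amount"]) for r in read_records(catalog, "orders"))


if __name__ == "__main__":
    print(demo_ad_hoc_total())
```

The structure is applied only when a dataset is read, which is the main contrast with the Data Warehouse counter-example, where an enterprise schema is imposed before the data is stored.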
References
2016
- (Wikipedia, 2016) ⇒ http://wikipedia.org/wiki/data_lake Retrieved:2016-2-4.
- A data lake is a large storage repository and processing engine. They provide "massive storage for any kind of data, enormous processing power and the ability to handle virtually limitless concurrent tasks or jobs".
2016
- (Halevy et al., 2016) ⇒ Alon Y. Halevy, Flip Korn, Natalya Fridman Noy, Christopher Olston, Neoklis Polyzotis, Sudip Roy, and Steven Euijong Whang. (2016). “Managing Google's Data Lake: An Overview of the Goods System.” In: IEEE Data Eng. Bull., 39(3).
- QUOTE: For most large enterprises today, data constitutes their core asset, along with code and infrastructure. For most enterprises, the amount of data that they produce internally has exploded in recent years. At the same time, in many cases, engineers and data scientists do not use centralized data-management systems and end up creating what became known as a data lake - a collection of datasets that often are not well organized or not organized at all and where one needs to "fish" for useful datasets. In this paper, we describe our experience building and deploying GOODS, a system to manage Google's internal data lake.
2016
- http://blog.zaloni.com/modernizing-your-big-data-architecture-key-considerations
- QUOTE:
- Enable the lake: Build the lake and determine how you will ingest, organize and catalog your data.
- Govern the data: This involves data quality rules, automation workflows, as well as data security.
- Engage the business: Deliver the data to more end users, including business end users, to maximize its value — “democratizing” access to your data. This involves implementing tools that make data discovery, enrichment and provisioning very intuitive for less-technically savvy business users.
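The first two steps quoted above (ingest, organize, and catalog the data, then apply data quality rules) might be sketched as follows. This is a minimal illustration under stated assumptions, not Zaloni's method or any product's API: the `raw/<format>` directory layout, the catalog entry fields, the `RULES` table, and every function name are hypothetical. The "engage the business" step is organizational and is not modeled.

```python
import shutil
import tempfile
from pathlib import Path


def ingest(source: Path, lake_root: Path, catalog: dict) -> dict:
    """Copy a file into the lake unchanged and record a catalog entry for it."""
    fmt = source.suffix.lstrip(".")
    zone = lake_root / "raw" / fmt  # organize: one directory per source format
    zone.mkdir(parents=True, exist_ok=True)
    target = zone / source.name
    shutil.copy2(source, target)  # ingest: the content is not transformed
    entry = {"path": target, "format": fmt, "bytes": target.stat().st_size}
    catalog[source.stem] = entry  # catalog: make the dataset discoverable
    return entry


# Hypothetical governance rules: each maps a catalog entry to pass/fail.
RULES = {
    "non_empty": lambda entry: entry["bytes"] > 0,
    "known_format": lambda entry: entry["format"] in {"csv", "json", "parquet"},
}


def failed_rules(entry: dict, rules: dict) -> list:
    """Return the names of the quality rules that the entry fails."""
    return [name for name, rule in rules.items() if not rule(entry)]


def demo_ingest_and_govern() -> dict:
    """Ingest one well-formed file and one empty file, then check both."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_path = Path(tmp)
        good = tmp_path / "sales.csv"
        good.write_text("id,amount\n1,9.99\n")
        empty = tmp_path / "notes.txt"
        empty.write_text("")
        catalog = {}
        lake = tmp_path / "lake"
        return {
            src.stem: failed_rules(ingest(src, lake, catalog), RULES)
            for src in (good, empty)
        }


if __name__ == "__main__":
    print(demo_ingest_and_govern())
```

A lake that skips the rule check entirely corresponds to the low-quality end of the range, the "data swamp" case.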
2015
- https://azure.microsoft.com/en-us/solutions/data-lake/
- Azure Data Lake includes all the capabilities required to make it easy for developers, data scientists, and analysts to store data of any size, shape and speed, and do all types of processing and analytics across platforms and languages. It removes the complexities of ingesting and storing all of your data while making it faster to get up and running with batch, streaming, and interactive analytics. Azure Data Lake works with existing IT investments for identity, management, and security for simplified data management and governance. It also integrates seamlessly with operational stores and data warehouses so you can extend current data applications. We’ve drawn on the experience of working with enterprise customers and running some of the largest scale processing and analytics in the world for Microsoft businesses like Office 365, Xbox Live, Azure, Windows, Bing and Skype. Azure Data Lake solves many of the productivity and scalability challenges that prevent you from maximizing the value of your data assets with a service that’s ready to meet your current and future business needs.