AWS EMR File System (EMRFS)
An AWS EMR File System (EMRFS) is a library that implements Hadoop's FileSystem API so that Amazon EMR clusters can read and write regular files directly to Amazon S3.
- Context:
- It can make Amazon S3 look like HDFS (or a local file system) to Hadoop ecosystem applications such as Spark and Hive.
- Example(s):
- Counter-Example(s):
- See: HDFS, S3.
References
2019
- https://stackoverflow.com/a/57031006
- QUOTE: ... EMRFS is a library that implements hadoops FileSystem api. EMRFS makes S3 look like hdfs or the local filesystem. This is then used by many of the applications in the hadoop ecosystem such as spark and hive. For example this is how you would use EMRFS to read from S3 in spark
- val df = spark.read.parquet("s3://s3-bucket/path/to/folder/")
- df.write.csv("s3://s3-bucket/path/to/output/")
2018
- https://medium.com/@tawkir/emrfs-consistent-view-what-is-it-and-why-is-it-d06dbde7d405
- QUOTE: ... Amazon came up with EMRFS which keeps a track of all the objects you are writing to s3. EMRFS is basically a dynamo DB storage. For example: You are writing two objects to s3, one is named part-0001, and the other on part-0002. Before writing to s3, EMRFS will insert these two keys in its database and mark them as written to s3. …
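The bookkeeping described in the quote above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not EMRFS's actual implementation: the classes ToyObjectStore and ToyConsistentView, their methods, and the "output/" prefix are all hypothetical, and an in-memory set stands in for the tracking database. It shows the general idea of recording keys before writing them, then comparing a later listing against that record.

```python
class ToyObjectStore:
    """Hypothetical stand-in for an object store whose listings can lag behind writes."""

    def __init__(self):
        self._objects = {}     # key -> bytes that have been written
        self._visible = set()  # keys that a listing currently returns

    def put(self, key, data):
        # The object is stored, but a listing may not show it yet.
        self._objects[key] = data

    def make_visible(self, key):
        # Simulates the listing catching up with an earlier write.
        if key in self._objects:
            self._visible.add(key)

    def list_keys(self, prefix):
        return sorted(k for k in self._visible if k.startswith(prefix))


class ToyConsistentView:
    """Hypothetical tracker: records each key before the object is written."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()  # stand-in for the tracking database

    def write(self, key, data):
        self.metadata.add(key)     # record the key first...
        self.store.put(key, data)  # ...then write the object

    def missing_from_listing(self, prefix):
        # Keys that were recorded as written but that the listing does not show.
        expected = {k for k in self.metadata if k.startswith(prefix)}
        listed = set(self.store.list_keys(prefix))
        return sorted(expected - listed)


store = ToyObjectStore()
view = ToyConsistentView(store)
view.write("output/part-0001", b"a")
view.write("output/part-0002", b"b")

store.make_visible("output/part-0001")  # only the first object is listed so far
missing_before = view.missing_from_listing("output/")

store.make_visible("output/part-0002")  # the listing catches up
missing_after = view.missing_from_listing("output/")

print(missing_before)
print(missing_after)
```

In this toy model a reader that sees a non-empty result from missing_from_listing knows the listing is incomplete and can retry instead of silently processing partial output.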
2018
- https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html
- QUOTE: The EMR File System (EMRFS) is an implementation of HDFS that all Amazon EMR clusters use for reading and writing regular files from Amazon EMR directly to Amazon S3. EMRFS provides the convenience of storing persistent data in Amazon S3 for use with Hadoop while also providing features like consistent view and data encryption.
Consistent view provides consistency checking for list and read-after-write (for new put requests) for objects in Amazon S3. Data encryption allows you to encrypt objects that EMRFS writes to Amazon S3, and enables EMRFS to work with encrypted objects in Amazon S3. If you are using Amazon EMR release version 4.8.0 or later, you can use security configurations to set up encryption for EMRFS objects in Amazon S3, along with other encryption settings. For more information, see Encryption Options. If you use an earlier release version of Amazon EMR, you can manually configure encryption settings. For more information, see Specifying Amazon S3 Encryption Using EMRFS Properties.
When using Amazon EMR release version 5.10.0 or later, you can use EMRFS authorization for Amazon S3 to control access to EMRFS objects in Amazon S3 based on user, group, or the location of EMRFS data in Amazon S3. For more information, see Configure EMRFS Authorization for Data in Amazon S3.