Your department creates regular analytics reports from your company’s log files. All log data is
collected in Amazon S3 and processed by daily Amazon Elastic MapReduce (EMR) jobs that
generate daily PDF reports and aggregated tables in .csv format for an Amazon Redshift data
warehouse.
Your CFO requests that you optimize the cost structure for this system.
Which of the following alternatives will lower costs without compromising average performance of
the system or data integrity for the raw data?
A.
Use reduced redundancy storage (RRS) for all data in S3.
Use a combination of Spot Instances and Reserved Instances for Amazon EMR jobs.
Use Reserved Instances for Amazon Redshift.
B.
Use reduced redundancy storage (RRS) for PDF and .csv data in S3.
Add Spot Instances to EMR jobs.
Use Spot Instances for Amazon Redshift.
C.
Use reduced redundancy storage (RRS) for PDF and .csv data in Amazon S3.
Add Spot Instances to Amazon EMR jobs.
Use Reserved Instances for Amazon Redshift.
D.
Use reduced redundancy storage (RRS) for all data in Amazon S3.
Add Spot Instances to Amazon EMR jobs.
Use Reserved Instances for Amazon Redshift.
Explanation:
Reserved Instances (a.k.a. Reserved Nodes) are appropriate for steady-state production
workloads, and offer significant discounts over On-Demand pricing.
https://aws.amazon.com/redshift
Q: What are some EMR best practices?
If you are running EMR in production you should specify an AMI version, Hive version, Pig
version, etc. to make sure the version does not change unexpectedly (e.g. when EMR later adds
support for a newer version). If your cluster is mission critical, only use Spot instances for task
nodes because if the Spot price increases you may lose the instances. In development, use
logging and enable debugging to spot and correct errors faster. If you are using GZIP, keep your
file size to 1–2 GB because GZIP files cannot be split.
https://aws.amazon.com/elasticmapreduce/faqs/
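The FAQ advice above (Spot only for task nodes) can be sketched as an instance group layout. This is a minimal illustration, not a definitive setup: the dictionary keys follow the shape that boto3's EMR `run_job_flow` call accepts under `Instances.InstanceGroups`, while the function name `build_instance_groups`, the group names, the instance type, and the bid price are all hypothetical placeholders.

```python
def build_instance_groups(core_count, task_count,
                          instance_type="m3.xlarge",  # assumed type, pick your own
                          task_bid_price="0.10"):     # assumed bid in USD, as a string
    """Return EMR instance group definitions that use Spot only for task nodes.

    Master and core nodes stay On-Demand (or are covered by Reserved
    Instances), so losing Spot capacity only removes extra compute.
    """
    groups = [
        {
            "Name": "master",
            "InstanceRole": "MASTER",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": 1,
        },
        {
            "Name": "core",
            "InstanceRole": "CORE",
            "Market": "ON_DEMAND",
            "InstanceType": instance_type,
            "InstanceCount": core_count,
        },
    ]
    if task_count > 0:
        # Task nodes hold no HDFS data, so they are the safe place for Spot.
        groups.append({
            "Name": "task-spot",
            "InstanceRole": "TASK",
            "Market": "SPOT",
            "BidPrice": task_bid_price,
            "InstanceType": instance_type,
            "InstanceCount": task_count,
        })
    return groups


if __name__ == "__main__":
    for group in build_instance_groups(core_count=2, task_count=4):
        print(group["InstanceRole"], group["Market"], group["InstanceCount"])
```

The returned list could be passed as the `InstanceGroups` value when launching a cluster. Keeping the master and core groups off Spot means a price spike slows the daily job down rather than ending it.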
Answer: A
The answer is C. How come it is A?
The question states clearly: “without compromising average performance of the system or data integrity for the raw data?”
Using RRS for all data could compromise the integrity of the raw data. Using RRS only for the PDF and .csv data preserves the integrity of the source data.
C is the correct answer.
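To make the RRS argument concrete, here is a rough sketch of choosing a storage class per object. It assumes the raw logs are never stored as .pdf or .csv files, so the extension alone separates reproducible outputs from source data. The function name `storage_class_for` and the example keys are hypothetical. `STANDARD` and `REDUCED_REDUNDANCY` are the S3 storage class values.

```python
import os

# Derived outputs can be regenerated from the raw logs by rerunning the EMR job,
# so lower durability is acceptable for them (assumption: raw logs never use
# these extensions).
DERIVED_EXTENSIONS = {".pdf", ".csv"}


def storage_class_for(key):
    """Pick an S3 storage class for an object key based on its file extension."""
    extension = os.path.splitext(key)[1].lower()
    if extension in DERIVED_EXTENSIONS:
        return "REDUCED_REDUNDANCY"
    # Raw log data cannot be recreated, so it stays on standard storage.
    return "STANDARD"


if __name__ == "__main__":
    for key in ("logs/2015/01/01/app.log.gz",
                "reports/2015-01-01.pdf",
                "tables/daily_agg.csv"):
        print(key, storage_class_for(key))
```

The point of the split is that a lost PDF or .csv object costs one job rerun, while a lost raw log is gone for good.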
A.
Spot Instance availability is unpredictable, and AWS can shut down running Spot Instances at any time when the Spot price rises above your bid.