AWS Certified Data Engineer - Associate (DEA-C01) Dumps January 2025
Are you tired of looking for a source that keeps you up to date on the AWS Certified Data Engineer - Associate (DEA-C01) exam and also offers affordable, high-quality, easy-to-use Amazon Data-Engineer-Associate practice questions? Then you are in luck, because Salesforcexamdumps.com has just updated its set. Get ready to become an AWS Certified Data Engineer.
PDF: $28 (was $140)
Test Engine: $40 (was $200)
PDF + Test Engine: $48 (was $240)
Here are the available features of the Amazon Data-Engineer-Associate PDF:
You must pass the Amazon Data-Engineer-Associate exam to earn the certification. The AWS Certified Data Engineer certification validates a candidate's expertise in working with Amazon Web Services. In this fast-paced world, a certification is one of the quickest ways to gain your employer's approval. Take the AWS Certified Data Engineer - Associate (DEA-C01) exam and become a certified professional today. Salesforcexamdumps.com is always eager to help by providing Amazon Data-Engineer-Associate practice questions. Passing AWS Certified Data Engineer - Associate (DEA-C01) will be your ticket to a better future!
Pass with Amazon Data-Engineer-Associate Braindumps!
Contrary to the belief that certification exams are generally hard to get through, passing AWS Certified Data Engineer - Associate (DEA-C01) is easy, provided you have access to a reliable resource such as the Salesforcexamdumps.com Amazon Data-Engineer-Associate PDF. We have been in this business long enough to understand where most resources go wrong. Passing the Amazon AWS Certified Data Engineer certification is all about having the right information, so we filled our Amazon Data-Engineer-Associate Dumps with the material you need to pass. These carefully curated sets of AWS Certified Data Engineer - Associate (DEA-C01) practice questions target the most frequently repeated exam questions. Stop waiting around and order your set of Amazon Data-Engineer-Associate Braindumps now!
We aim to provide all AWS Certified Data Engineer certification exam candidates with the best resources at minimum rates. You can check out our free demo before you download to make sure the Amazon Data-Engineer-Associate practice questions are what you want. And do not forget about the discount. We always provide our customers with a little extra.
Why Choose Amazon Data-Engineer-Associate PDF?
Unlike other websites, Salesforcexamdumps.com prioritizes the needs of AWS Certified Data Engineer - Associate (DEA-C01) candidates. Not every Amazon exam candidate has full-time access to the internet, and it is hard to sit in front of a computer screen for hours on end. Are you one of them? We understand, and that is why we offer the AWS Certified Data Engineer solutions in two formats: PDF and Online Test Engine. The Online Test Engine is for customers who like online platforms for realistic exam simulation. The PDF is for those who prefer keeping their material close at hand. Moreover, you can download or print the Amazon Data-Engineer-Associate Dumps with ease.
If you still have questions, our team of experts is available 24/7 to answer them. Just leave us a quick message in the chat box below or email us at support@salesforcexamdumps.com.
Amazon Data-Engineer-Associate Sample Questions
Question # 1
A data engineer needs Amazon Athena queries to finish faster. The data engineer notices that all the files the Athena queries use are currently stored in uncompressed .csv format. The data engineer also notices that users perform most queries by selecting a specific column. Which solution will MOST speed up the Athena query performance?
A. Change the data format from .csv to JSON format. Apply Snappy compression.
B. Compress the .csv files by using Snappy compression.
C. Change the data format from .csv to Apache Parquet. Apply Snappy compression.
D. Compress the .csv files by using gzip compression.
Answer: C

Explanation: Amazon Athena is a serverless interactive query service that lets you analyze data in Amazon S3 by using standard SQL. Athena supports various data formats, such as CSV, JSON, ORC, Avro, and Parquet. However, not all data formats are equally efficient for querying. Some data formats, such as CSV and JSON, are row-oriented, meaning that they store data as a sequence of records, each with the same fields. Row-oriented formats are suitable for loading and exporting data, but they are not optimal for analytical queries that access only a subset of columns, because the engine must read every field of every record to extract a single column. Row-oriented text files can be compressed, but they cannot take advantage of column-level encoding techniques.

On the other hand, some data formats, such as ORC and Parquet, are column-oriented, meaning that they store data as a collection of columns, each with a specific data type. Column-oriented formats are ideal for analytical queries that filter, aggregate, or join data by columns. They also support compression and encoding techniques that reduce the data size and improve query performance. For example, Parquet supports dictionary encoding, which replaces repeated values with numeric codes, and run-length encoding, which replaces consecutive identical values with a single value and a count. Parquet also supports various compression algorithms, such as Snappy, GZIP, and ZSTD.

Therefore, changing the data format from CSV to Parquet and applying Snappy compression will most speed up the Athena query performance. Parquet is a column-oriented format that allows Athena to scan only the relevant columns and skip the rest, reducing the amount of data read from S3. Snappy is a fast compression algorithm that reduces the data size with little decompression overhead. Because Parquet applies compression to the data inside the file rather than to the file as a whole, Snappy-compressed Parquet files remain splittable and can be read in parallel. This solution also reduces the cost of Athena queries, as Athena charges based on the amount of data scanned from S3.

The other options are not as effective. Changing the data format from CSV to JSON and applying Snappy compression (option A) will not improve the query performance significantly, as JSON is also a row-oriented format that does not support columnar access. Compressing the CSV files by using Snappy compression (option B) will reduce the data size, but CSV is still a row-oriented format, so every query must still read all columns. Compressing the CSV files by using gzip compression (option D) will also reduce the data size, but a gzip-compressed text file is not splittable, so Athena cannot read a single file in parallel, and every query must still read all columns.

References:
Amazon Athena
Choosing the Right Data Format
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 5: Data Analysis and Visualization, Section 5.1: Amazon Athena
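One way to perform the conversion described in Question 1 is an Athena CREATE TABLE AS SELECT (CTAS) statement. The sketch below only builds the SQL text; the table names and S3 location are hypothetical placeholders, and it assumes the CSV data is already registered as an Athena table.

```python
def build_ctas_parquet(source_table: str, target_table: str, s3_location: str) -> str:
    """Return an Athena CTAS statement that rewrites a table as Snappy-compressed Parquet."""
    return (
        f"CREATE TABLE {target_table}\n"
        "WITH (\n"
        "  format = 'PARQUET',\n"
        "  write_compression = 'SNAPPY',\n"
        f"  external_location = '{s3_location}'\n"
        ")\n"
        f"AS SELECT * FROM {source_table}"
    )


# Hypothetical names, for illustration only.
ctas_sql = build_ctas_parquet(
    source_table="sales_csv",
    target_table="sales_parquet",
    s3_location="s3://example-bucket/sales_parquet/",
)
print(ctas_sql)
```

The resulting string could be submitted through the Athena console or an SDK call such as boto3's `start_query_execution`; later queries would then target the Parquet table instead of the CSV table.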
Question # 2
A company stores data in a data lake that is in Amazon S3. Some data that the company stores in the data lake contains personally identifiable information (PII). Multiple user groups need to access the raw data. The company must ensure that user groups can access only the PII that they require. Which solution will meet these requirements with the LEAST effort?
A. Use Amazon Athena to query the data. Set up AWS Lake Formation and create data filters to establish levels of access for the company's IAM roles. Assign each user to the IAM role that matches the user's PII access requirements.
B. Use Amazon QuickSight to access the data. Use column-level security features in QuickSight to limit the PII that users can retrieve from Amazon S3 by using Amazon Athena. Define QuickSight access levels based on the PII access requirements of the users.
C. Build a custom query builder UI that will run Athena queries in the background to access the data. Create user groups in Amazon Cognito. Assign access levels to the user groups based on the PII access requirements of the users.
D. Create IAM roles that have different levels of granular access. Assign the IAM roles to IAM user groups. Use an identity-based policy to assign access levels to user groups at the column level.
Answer: A

Explanation: Amazon Athena is a serverless, interactive query service that enables you to analyze data in Amazon S3 by using standard SQL. AWS Lake Formation is a service that helps you build, secure, and manage data lakes on AWS. You can use Lake Formation to create data filters that define the level of access for different IAM roles based on the columns and rows of the data.

The solution is to use Amazon Athena to query the data in the S3 data lake, then set up Lake Formation and create data filters to establish levels of access for the company's IAM roles. For example, a data filter can allow a user group to access only the columns that contain the PII that they need, such as name and email address, and deny access to the columns that contain PII that they do not need, such as phone number and social security number. Finally, assign each user to the IAM role that matches the user's PII access requirements. This way, the user groups can access the data in the data lake securely and with the least effort.

The other options are either not feasible or not optimal. Using Amazon QuickSight to access the data (option B) would require the company to pay for the QuickSight service and to configure the column-level security features for each user. Building a custom query builder UI that runs Athena queries in the background (option C) would require the company to develop and maintain the UI and to integrate it with Amazon Cognito. Creating IAM roles that have different levels of granular access (option D) would require the company to manage multiple IAM roles and policies and to keep them aligned with the data schema; identity-based IAM policies on S3 also operate on objects, not on columns.

References:
Amazon Athena
AWS Lake Formation
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 4: Data Analysis and Visualization, Section 4.3: Amazon Athena
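The column-level data filter from Question 2 can be sketched as a request payload for the Lake Formation `CreateDataCellsFilter` API. This is a minimal illustration, not the company's actual setup: the account ID, database, table, filter name, and column names are all hypothetical.

```python
def build_column_filter(catalog_id, database, table, filter_name, excluded_columns):
    """Build CreateDataCellsFilter parameters that expose all rows but hide some columns."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": filter_name,
            # All rows stay visible; only the column set is restricted.
            "RowFilter": {"AllRowsWildcard": {}},
            "ColumnWildcard": {"ExcludedColumnNames": list(excluded_columns)},
        }
    }


# Hypothetical values: a marketing role may see names and emails, not phone or SSN.
marketing_filter = build_column_filter(
    catalog_id="111122223333",
    database="customer_lake",
    table="customers",
    filter_name="marketing_pii",
    excluded_columns=["phone_number", "ssn"],
)
# With boto3 this could be sent as:
#   boto3.client("lakeformation").create_data_cells_filter(**marketing_filter)
```

One filter per access level, each granted to the matching IAM role, keeps the access rules in Lake Formation rather than in custom application code.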
Question # 3
A company receives call logs as Amazon S3 objects that contain sensitive customer information. The company must protect the S3 objects by using encryption. The company must also use encryption keys that only specific employees can access. Which solution will meet these requirements with the LEAST effort?
A. Use an AWS CloudHSM cluster to store the encryption keys. Configure the process that writes to Amazon S3 to make calls to CloudHSM to encrypt and decrypt the objects. Deploy an IAM policy that restricts access to the CloudHSM cluster.
B. Use server-side encryption with customer-provided keys (SSE-C) to encrypt the objects that contain customer information. Restrict access to the keys that encrypt the objects.
C. Use server-side encryption with AWS KMS keys (SSE-KMS) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the KMS keys that encrypt the objects.
D. Use server-side encryption with Amazon S3 managed keys (SSE-S3) to encrypt the objects that contain customer information. Configure an IAM policy that restricts access to the Amazon S3 managed keys that encrypt the objects.
Answer: C

Explanation: Option C is the best solution to meet the requirements with the least effort because server-side encryption with AWS KMS keys (SSE-KMS) lets you encrypt data at rest in Amazon S3 by using keys managed by AWS Key Management Service (AWS KMS). AWS KMS is a fully managed service that enables you to create and manage encryption keys for your AWS services and applications. AWS KMS also allows you to define granular access policies for your keys, such as who can use them to encrypt and decrypt data, and under what conditions. By using SSE-KMS, you can protect your S3 objects with encryption keys that only specific employees can access, without having to manage the encryption and decryption process yourself.

Option A is not a good solution because it involves AWS CloudHSM, a service that provides hardware security modules (HSMs) in the AWS Cloud. CloudHSM allows you to generate and use your own encryption keys on dedicated hardware that is compliant with various standards and regulations. However, CloudHSM requires you to provision and operate an HSM cluster, which takes more effort to set up and maintain than AWS KMS. Moreover, Amazon S3 server-side encryption does not call CloudHSM directly, so you would have to configure the process that writes to S3 to call CloudHSM to encrypt and decrypt the objects, which adds complexity and latency.

Option B is not a good solution because it involves server-side encryption with customer-provided keys (SSE-C), which encrypts data at rest in Amazon S3 by using keys that you provide and manage yourself. SSE-C requires you to send your encryption key along with each request to upload or retrieve an object. SSE-C does not provide any mechanism to restrict access to the keys, so you would have to implement your own key management and access control system, which adds effort and risk.

Option D is not a good solution because it involves server-side encryption with Amazon S3 managed keys (SSE-S3), in which Amazon S3 creates, manages, and rotates the keys on your behalf. SSE-S3 automatically encrypts and decrypts your objects as they are uploaded and downloaded. However, the keys are not exposed as resources that you can attach policies to, so you cannot control who can use them or under what conditions. Anyone who has permission to read an object receives it decrypted, which means that you cannot restrict key access to specific employees.

References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Protecting Data Using Server-Side Encryption with AWS KMS-Managed Encryption Keys (SSE-KMS) - Amazon Simple Storage Service
What is AWS Key Management Service? - AWS Key Management Service
What is AWS CloudHSM? - AWS CloudHSM
Protecting Data Using Server-Side Encryption with Customer-Provided Encryption Keys (SSE-C) - Amazon Simple Storage Service
Protecting Data Using Server-Side Encryption with Amazon S3-Managed Encryption Keys (SSE-S3) - Amazon Simple Storage Service
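The SSE-KMS approach from Question 3 has two parts: upload parameters that name a KMS key, and an IAM policy that limits who may use that key. The sketch below builds both as plain dictionaries; the bucket, object key, and key ARN are hypothetical, and the policy is a simplified assumption rather than a complete production policy.

```python
import json


def build_sse_kms_put(bucket, key, body, kms_key_arn):
    """Build S3 PutObject parameters that request SSE-KMS with a specific KMS key."""
    return {
        "Bucket": bucket,
        "Key": key,
        "Body": body,
        "ServerSideEncryption": "aws:kms",
        "SSEKMSKeyId": kms_key_arn,
    }


def build_key_use_policy(kms_key_arn):
    """Build an identity-based policy that allows using one KMS key for S3 reads and writes."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                # Uploads need GenerateDataKey; downloads need Decrypt.
                "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
                "Resource": kms_key_arn,
            }
        ],
    }


# Hypothetical ARN and names, for illustration only.
KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
put_params = build_sse_kms_put("call-logs-bucket", "logs/2025/01/call.json", b"{}", KEY_ARN)
policy_json = json.dumps(build_key_use_policy(KEY_ARN), indent=2)
print(policy_json)
```

Attaching the policy only to the roles of the authorized employees means that other principals with S3 read access still cannot decrypt the objects.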
Question # 4
A data engineer needs to maintain a central metadata repository that users access through Amazon EMR and Amazon Athena queries. The repository needs to provide the schema and properties of many tables. Some of the metadata is stored in Apache Hive. The data engineer needs to import the metadata from Hive into the central metadata repository. Which solution will meet these requirements with the LEAST development effort?
A. Use Amazon EMR and Apache Ranger.
B. Use a Hive metastore on an EMR cluster.
C. Use the AWS Glue Data Catalog.
D. Use a metastore on an Amazon RDS for MySQL DB instance.
Answer: C

Explanation: The AWS Glue Data Catalog is an Apache Hive metastore-compatible catalog that provides a central metadata repository for various data sources and formats. You can use the AWS Glue Data Catalog as an external Hive metastore for Amazon EMR and Amazon Athena queries, and import metadata from existing Hive metastores into the Data Catalog. This solution requires the least development effort: the Data Catalog is fully managed, AWS Glue crawlers can discover and catalog table schemas automatically, and you can use the AWS Glue console, AWS CLI, or Amazon EMR API to configure the Data Catalog as the Hive metastore. The other options are either more complex or require additional steps, such as setting up Apache Ranger (which addresses authorization, not metadata storage), managing a Hive metastore on an EMR cluster or an RDS instance, or migrating the metadata manually.

References:
Using the AWS Glue Data Catalog as the metastore for Hive (Section: Specifying AWS Glue Data Catalog as the metastore)
Metadata Management: Hive Metastore vs AWS Glue (Section: AWS Glue Data Catalog)
AWS Glue Data Catalog support for Spark SQL jobs (Section: Importing metadata from an existing Hive metastore)
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide (Chapter 5, page 131)
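Pointing an EMR cluster at the Glue Data Catalog, as described in Question 4, is done through cluster configuration classifications. The following sketch builds that configuration as a Python list that could be passed as the `Configurations` argument when a cluster is created; including the Spark classification alongside the Hive one is an assumption for clusters that run both.

```python
import json

# Client factory class that EMR uses to talk to the Glue Data Catalog.
GLUE_FACTORY = "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory"


def glue_metastore_configurations():
    """Return EMR configuration classifications that use Glue as the Hive metastore."""
    factory_property = {"hive.metastore.client.factory.class": GLUE_FACTORY}
    return [
        {"Classification": "hive-site", "Properties": dict(factory_property)},
        {"Classification": "spark-hive-site", "Properties": dict(factory_property)},
    ]


configurations = glue_metastore_configurations()
print(json.dumps(configurations, indent=2))
```

Athena uses the Glue Data Catalog by default, so once EMR is configured this way both services read the same table definitions.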
Question # 5
A company is planning to use a provisioned Amazon EMR cluster that runs Apache Spark jobs to perform big data analysis. The company requires high reliability. A big data team must follow best practices for running cost-optimized and long-running workloads on Amazon EMR. The team must find a solution that will maintain the company's current level of performance. Which combination of resources will meet these requirements MOST cost-effectively? (Choose two.)
A. Use Hadoop Distributed File System (HDFS) as a persistent data store.
B. Use Amazon S3 as a persistent data store.
C. Use x86-based instances for core nodes and task nodes.
D. Use Graviton instances for core nodes and task nodes.
E. Use Spot Instances for all primary nodes.
Answer: B, D

Explanation: The best combination of resources to meet the requirements of high reliability, cost optimization, and performance for running Apache Spark jobs on Amazon EMR is to use Amazon S3 as a persistent data store and Graviton instances for core nodes and task nodes.

Amazon S3 is a highly durable, scalable, and secure object storage service that can store any amount of data for a variety of use cases, including big data analytics [1]. Amazon S3 is a better choice than HDFS as a persistent data store for Amazon EMR because it decouples storage from compute: data survives cluster termination, and storage and compute can scale independently. Amazon S3 also supports data encryption, versioning, lifecycle management, and cross-Region replication [1]. Amazon EMR integrates with Amazon S3 through the EMR File System (EMRFS) [2]. EMRFS historically offered a consistent view feature to provide read-after-write consistency for S3 objects [2]; Amazon S3 now delivers strong read-after-write consistency natively.

Graviton instances are powered by Arm-based AWS Graviton2 processors that deliver up to 40% better price performance over comparable current generation x86-based instances [3]. Graviton instances are well suited to workloads that are CPU-bound, memory-bound, or network-bound, such as big data analytics, web servers, and open-source databases [3]. Graviton instances are compatible with Amazon EMR and can be used for both core nodes and task nodes. Core nodes run tasks for the data processing frameworks, such as Apache Spark, and store data in HDFS. Task nodes are optional nodes that add processing power and throughput but do not store HDFS data. By using Graviton instances for both, you can maintain performance at a lower cost than with x86-based instances.

Using Spot Instances for all primary nodes is not a good option, as it can compromise the reliability and availability of the cluster. Spot Instances are spare EC2 capacity available at up to a 90% discount compared to On-Demand prices, but EC2 can interrupt them with a two-minute notice when it needs the capacity back. Primary nodes manage the cluster and coordinate the distribution of work across the other nodes, so they are essential for cluster operation. If a primary node is interrupted, the cluster will fail or become unstable. Therefore, it is recommended to use On-Demand Instances or Reserved Instances for primary nodes, and to use Spot Instances only for task nodes that can tolerate interruptions.

References:
[1] Amazon S3 - Cloud Object Storage
[2] EMR File System (EMRFS)
[3] AWS Graviton2 Processor-Powered Amazon EC2 Instances
Plan and Configure EC2 Instances
Amazon EC2 Spot Instances
Best Practices for Amazon EMR
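The node layout recommended in Question 5 can be sketched as the `InstanceGroups` structure that the EMR `RunJobFlow` API accepts. The instance type `m6g.xlarge` and the node counts are illustrative assumptions; the point is Graviton instances throughout, On-Demand capacity for the primary and core nodes, and Spot capacity only for interruption-tolerant task nodes.

```python
def build_instance_groups(core_count, task_count, instance_type="m6g.xlarge"):
    """Build EMR instance groups: On-Demand primary and core nodes, Spot task nodes."""

    def group(name, role, market, count):
        return {
            "Name": name,
            "InstanceRole": role,  # the EMR API uses MASTER for the primary node
            "Market": market,
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return [
        group("Primary", "MASTER", "ON_DEMAND", 1),
        group("Core", "CORE", "ON_DEMAND", core_count),
        group("Task", "TASK", "SPOT", task_count),
    ]


# Hypothetical sizing, for illustration only.
instance_groups = build_instance_groups(core_count=3, task_count=6)
for g in instance_groups:
    print(g["Name"], g["Market"], g["InstanceType"], g["InstanceCount"])
```

Because persistent data lives in Amazon S3, losing a Spot task node costs only the in-flight work on that node, not stored data.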
Question # 6
A company wants to implement real-time analytics capabilities. The company wants to use Amazon Kinesis Data Streams and Amazon Redshift to ingest and process streaming data at the rate of several gigabytes per second. The company wants to derive near real-time insights by using existing business intelligence (BI) and analytics tools. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Kinesis Data Streams to stage data in Amazon S3. Use the COPY command to load data from Amazon S3 directly into Amazon Redshift to make the data immediately available for real-time analysis.
B. Access the data from Kinesis Data Streams by using SQL queries. Create materialized views directly on top of the stream. Refresh the materialized views regularly to query the most recent stream data.
C. Create an external schema in Amazon Redshift to map the data from Kinesis Data Streams to an Amazon Redshift object. Create a materialized view to read data from the stream. Set the materialized view to auto refresh.
D. Connect Kinesis Data Streams to Amazon Kinesis Data Firehose. Use Kinesis Data Firehose to stage the data in Amazon S3. Use the COPY command to load the data from Amazon S3 to a table in Amazon Redshift.
Answer: C

Explanation: This solution meets the requirements of implementing real-time analytics capabilities with the least operational overhead. By creating an external schema in Amazon Redshift, you map the Kinesis data stream to a Redshift object that you can reference in SQL, without staging the data in Amazon S3 first. By creating a materialized view on top of the stream, you store the ingested records in the cluster and make them available for analysis. By setting the materialized view to auto refresh, you ensure that the view is updated with the latest data from the stream without manual intervention. This way, you can derive near real-time insights by using existing BI and analytics tools. The options that stage data in Amazon S3 and then run COPY add pipeline components and latency, and manually refreshing the views adds operational work.

References:
Amazon Redshift streaming ingestion
Creating an external schema for Amazon Kinesis Data Streams
Creating a materialized view for Amazon Kinesis Data Streams
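The two SQL statements behind the Question 6 answer can be sketched as follows. The helper only assembles the statement text; the schema, stream, view, and IAM role names are hypothetical, and the sketch assumes the stream carries JSON payloads.

```python
def build_streaming_ingestion_sql(schema, stream_name, view_name, iam_role_arn):
    """Return the Redshift statements for Kinesis streaming ingestion with auto refresh."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema}\n"
        "FROM KINESIS\n"
        f"IAM_ROLE '{iam_role_arn}';"
    )
    create_view = (
        f"CREATE MATERIALIZED VIEW {view_name} AUTO REFRESH YES AS\n"
        "SELECT approximate_arrival_timestamp,\n"
        "       partition_key,\n"
        "       shard_id,\n"
        "       sequence_number,\n"
        "       JSON_PARSE(kinesis_data) AS payload\n"
        f'FROM {schema}."{stream_name}";'
    )
    return [create_schema, create_view]


# Hypothetical identifiers, for illustration only.
statements = build_streaming_ingestion_sql(
    schema="kds",
    stream_name="clickstream",
    view_name="clickstream_mv",
    iam_role_arn="arn:aws:iam::111122223333:role/example-redshift-streaming-role",
)
for statement in statements:
    print(statement, end="\n\n")
```

BI tools then query the materialized view like any other Redshift relation, so no separate loading job has to be scheduled or monitored.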
Question # 7
A company stores details about transactions in an Amazon S3 bucket. The company wants to log all writes to the S3 bucket into another S3 bucket that is in the same AWS Region. Which solution will meet this requirement with the LEAST operational effort?
A. Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the event to Amazon Kinesis Data Firehose. Configure Kinesis Data Firehose to write the event to the logs S3 bucket.
B. Create a trail of management events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
C. Configure an S3 Event Notifications rule for all activities on the transactions S3 bucket to invoke an AWS Lambda function. Program the Lambda function to write the events to the logs S3 bucket.
D. Create a trail of data events in AWS CloudTrail. Configure the trail to receive data from the transactions S3 bucket. Specify an empty prefix and write-only events. Specify the logs S3 bucket as the destination bucket.
Answer: D

Explanation: This solution meets the requirement of logging all writes to the S3 bucket into another S3 bucket with the least operational effort. AWS CloudTrail is a service that records the API calls made to AWS services, including Amazon S3. Object-level operations such as PutObject are data events, not management events, which is why a trail of management events (option B) would not capture the writes. By creating a trail of data events, you can capture the details of the requests that are made to the transactions S3 bucket, such as the requester, the time, the IP address, and the response elements. By specifying an empty prefix and write-only events, you limit the data events to writes against any object in the bucket. By specifying the logs S3 bucket as the destination bucket, you store the CloudTrail logs in another S3 bucket that is in the same AWS Region. This solution does not require any custom code, and it is more scalable and reliable than using S3 Event Notifications and Lambda functions.

References:
Logging Amazon S3 API calls using AWS CloudTrail
Creating a trail for data events
Enabling Amazon S3 server access logging
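The write-only data event selection from Question 7 can be sketched as the payload for CloudTrail's `PutEventSelectors` API. The trail and bucket names are hypothetical; the destination (logs) bucket is set separately when the trail itself is created.

```python
def build_write_only_selector(trail_name, bucket):
    """Build PutEventSelectors parameters that log only S3 object writes for one bucket."""
    return {
        "TrailName": trail_name,
        "EventSelectors": [
            {
                "ReadWriteType": "WriteOnly",
                "IncludeManagementEvents": False,
                "DataResources": [
                    {
                        "Type": "AWS::S3::Object",
                        # Bucket ARN with a trailing slash = empty prefix (all objects).
                        "Values": [f"arn:aws:s3:::{bucket}/"],
                    }
                ],
            }
        ],
    }


# Hypothetical names, for illustration only.
selector_params = build_write_only_selector("transactions-writes-trail", "transactions-bucket")
# With boto3 this could be sent as:
#   boto3.client("cloudtrail").put_event_selectors(**selector_params)
```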
Question # 8
A data engineer has a one-time task to read data from objects that are in Apache Parquet format in an Amazon S3 bucket. The data engineer needs to query only one column of the data. Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an AWS Lambda function to load data from the S3 bucket into a pandas DataFrame. Write a SQL SELECT statement on the DataFrame to query the required column.
B. Use S3 Select to write a SQL SELECT statement to retrieve the required column from the S3 objects.
C. Prepare an AWS Glue DataBrew project to consume the S3 objects and to query the required column.
D. Run an AWS Glue crawler on the S3 objects. Use a SQL SELECT statement in Amazon Athena to query the required column.
Answer: B

Explanation: Option B is the best solution to meet the requirements with the least operational overhead because S3 Select is a feature that allows you to retrieve only a subset of data from an S3 object by using simple SQL expressions. S3 Select works on objects stored in CSV, JSON, or Parquet format. By using S3 Select, you avoid downloading and processing the entire S3 object, which reduces the amount of data transferred and the computation time. S3 Select is also easy to use and does not require any additional services or resources.

Option A is not a good solution because it involves writing custom code and configuring an AWS Lambda function to load data from the S3 bucket into a pandas DataFrame and query the required column. This option adds complexity and latency to the data retrieval process and requires additional resources and configuration. Moreover, AWS Lambda has limits on execution time, memory, and concurrency, which may affect the performance and reliability of the data retrieval process.

Option C is not a good solution because it involves creating and running an AWS Glue DataBrew project to consume the S3 objects and query the required column. AWS Glue DataBrew is a visual data preparation tool that allows you to clean, normalize, and transform data without writing code. However, in this scenario, the data is already in Parquet format, which is a columnar storage format that is optimized for analytics, so there is no need to prepare the data. Moreover, DataBrew adds extra time and cost and requires additional resources and configuration.

Option D is not a good solution because it involves running an AWS Glue crawler on the S3 objects and using a SQL SELECT statement in Amazon Athena to query the required column. An AWS Glue crawler scans data sources and creates metadata tables in the AWS Glue Data Catalog, a central repository that stores information about the data sources, such as schema, format, and location. Amazon Athena is a serverless interactive query service that allows you to analyze data in S3 by using standard SQL. However, for a one-time query of a single column, cataloging the data first is unnecessary. Running a crawler and using Athena adds extra time and cost and requires additional services and configuration.

References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
S3 Select and Glacier Select - Amazon Simple Storage Service
AWS Lambda - FAQs
What Is AWS Glue DataBrew? - AWS Glue DataBrew
Populating the AWS Glue Data Catalog - AWS Glue
What is Amazon Athena? - Amazon Athena
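An S3 Select request for a single Parquet column, as in Question 8, can be sketched as the parameter set for the `SelectObjectContent` API. The bucket, object key, and column name are hypothetical placeholders.

```python
def build_select_request(bucket, key, column):
    """Build SelectObjectContent parameters that read one column from a Parquet object."""
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": f'SELECT s."{column}" FROM S3Object s',
        "InputSerialization": {"Parquet": {}},
        # Results for Parquet input are returned as CSV or JSON records.
        "OutputSerialization": {"CSV": {}},
    }


# Hypothetical names, for illustration only.
select_params = build_select_request("analytics-bucket", "data/part-0000.parquet", "order_total")
# With boto3 this could be sent as:
#   response = boto3.client("s3").select_object_content(**select_params)
# The matching records then arrive as an event stream in response["Payload"].
print(select_params["Expression"])
```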
Question # 9
A retail company has a customer data hub in an Amazon S3 bucket. Employees from many countries use the data hub to support company-wide analytics. A governance team must ensure that the company's data analysts can access data only for customers who are within the same country as the analysts. Which solution will meet these requirements with the LEAST operational effort?
A. Create a separate table for each country's customer data. Provide access to each analyst based on the country that the analyst serves.
B. Register the S3 bucket as a data lake location in AWS Lake Formation. Use the Lake Formation row-level security features to enforce the company's access policies.
C. Move the data to AWS Regions that are close to the countries where the customers are. Provide access to each analyst based on the country that the analyst serves.
D. Load the data into Amazon Redshift. Create a view for each country. Create separate IAM roles for each country to provide access to data from each country. Assign the appropriate roles to the analysts.
Answer: B

Explanation: AWS Lake Formation is a service that allows you to set up, secure, and manage data lakes. One of the features of Lake Formation is row-level security, which enables you to control access to specific rows of a table based on the identity or role of the user. This feature is useful for scenarios where you need to restrict access to sensitive or regulated data, such as customer data from different countries. By registering the S3 bucket as a data lake location in Lake Formation, you can use the Lake Formation console or APIs to define and apply row-level security policies to the data in the bucket. You can also use Lake Formation blueprints to automate the ingestion and transformation of data from various sources into the data lake. This solution requires the least operational effort compared to the other options, as it does not involve copying or moving data, or managing multiple tables, views, or roles.

References:
AWS Lake Formation
Row-Level Security
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 4: Data Lakes and Data Warehouses, Section 4.2: AWS Lake Formation
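The row-level security setup from Question 9 can be sketched as one row filter per country plus a grant of SELECT on that filter to the matching analyst role. The column name `country`, the table identifiers, and the role ARN are assumptions made for illustration.

```python
def build_country_filter(catalog_id, database, table, country_code):
    """Build CreateDataCellsFilter parameters that keep only one country's rows."""
    return {
        "TableData": {
            "TableCatalogId": catalog_id,
            "DatabaseName": database,
            "TableName": table,
            "Name": f"country_{country_code.lower()}",
            # Assumes the table has a 'country' column holding country codes.
            "RowFilter": {"FilterExpression": f"country = '{country_code}'"},
            "ColumnWildcard": {},  # no excluded columns: all columns visible
        }
    }


def build_filter_grant(role_arn, filter_params):
    """Build GrantPermissions parameters giving a role SELECT through a data filter."""
    table_data = filter_params["TableData"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": {
            "DataCellsFilter": {
                "TableCatalogId": table_data["TableCatalogId"],
                "DatabaseName": table_data["DatabaseName"],
                "TableName": table_data["TableName"],
                "Name": table_data["Name"],
            }
        },
        "Permissions": ["SELECT"],
    }


# Hypothetical identifiers, for illustration only.
de_filter = build_country_filter("111122223333", "customer_hub", "customers", "DE")
de_grant = build_filter_grant("arn:aws:iam::111122223333:role/analysts-de", de_filter)
```

Adding a country then means adding one filter and one grant; the underlying data in S3 stays in a single table.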
Question # 10
A company uses Amazon RDS to store transactional data. The company runs an RDS DB instance in a private subnet. A developer wrote an AWS Lambda function with default settings to insert, update, or delete data in the DB instance. The developer needs to give the Lambda function the ability to connect to the DB instance privately without using the public internet. Which combination of steps will meet this requirement with the LEAST operational overhead? (Choose two.)
A. Turn on the public access setting for the DB instance.
B. Update the security group of the DB instance to allow only Lambda function invocations on the database port.
C. Configure the Lambda function to run in the same subnet that the DB instance uses.
D. Attach the same security group to the Lambda function and the DB instance. Include a self-referencing rule that allows access through the database port.
E. Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port.
Answer: C,D
Explanation: To let the Lambda function connect to the RDS DB instance privately, without using the public internet, configure the Lambda function to run in the same subnet that the DB instance uses, and attach the same security group to both the Lambda function and the DB instance. The function and the DB instance then communicate inside the same private network, and the self-referencing rule in the shared security group allows traffic between them on the database port. This combination has the least operational overhead because it needs only one security group and one rule, and it requires no changes to the public access setting or the network ACL.
The other options are not optimal for the following reasons:
A. Turn on the public access setting for the DB instance. This option is not recommended because it exposes the DB instance to the public internet, which can compromise the security and privacy of the data. It also fails the requirement, because the Lambda function would reach the DB instance over the public internet rather than privately.
B. Update the security group of the DB instance to allow only Lambda function invocations on the database port. This option does not work as stated, because security group rules reference IP address ranges, prefix lists, or other security groups, not Lambda function invocations. On its own, it also does nothing to place the Lambda function inside the VPC.
E. Update the network ACL of the private subnet to include a self-referencing rule that allows access through the database port. This option is not necessary. Network ACL rules are defined by CIDR ranges rather than by self-references, and the default network ACL already allows all inbound and outbound traffic.
References: Connecting to an Amazon RDS DB instance; Configuring a Lambda function to access resources in a VPC; Working with security groups; Network ACLs
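The self-referencing rule from option D can be sketched as the request parameters for the EC2 `authorize_security_group_ingress` call in boto3. This is a minimal sketch, not the exam's reference solution: the security group ID and the port 3306 are hypothetical placeholders, and `apply_rule` is an illustrative helper that needs boto3 and AWS credentials.

```python
def self_referencing_rule(group_id: str, db_port: int) -> dict:
    """Build the ingress rule that lets members of a security group
    reach each other on the database port."""
    return {
        "GroupId": group_id,
        "IpPermissions": [
            {
                "IpProtocol": "tcp",
                "FromPort": db_port,
                "ToPort": db_port,
                # The source of the rule is the security group itself.
                "UserIdGroupPairs": [{"GroupId": group_id}],
            }
        ],
    }


def apply_rule(group_id: str, db_port: int) -> dict:
    """Send the rule to AWS (requires boto3 and valid credentials)."""
    import boto3  # imported lazily so the sketch runs without boto3

    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(
        **self_referencing_rule(group_id, db_port)
    )


# Hypothetical security group ID and MySQL port, for illustration only.
rule = self_referencing_rule("sg-0123456789abcdef0", 3306)
```

Because the rule's source is the group itself, any resource that carries the group (here, the Lambda function and the DB instance) can reach any other member on that port.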
Question # 11
A company has five offices in different AWS Regions. Each office has its own human resources (HR) department that uses a unique IAM role. The company stores employee records in a data lake that is based on Amazon S3 storage. A data engineering team needs to limit access to the records. Each HR department should be able to access records for only employees who are within the HR department's Region. Which combination of steps should the data engineering team take to meet this requirement with the LEAST operational overhead? (Choose two.)
A. Use data filters for each Region to register the S3 paths as data locations.
B. Register the S3 path as an AWS Lake Formation location.
C. Modify the IAM roles of the HR departments to add a data filter for each department's Region.
D. Enable fine-grained access control in AWS Lake Formation. Add a data filter for each Region.
E. Create a separate S3 bucket for each Region. Configure an IAM policy to allow S3 access. Restrict access based on Region.
Answer: B,D
Explanation: AWS Lake Formation is a service that helps you build, secure, and manage data lakes on Amazon S3. Register the S3 path as a Lake Formation data lake location, then enable fine-grained access control to limit access to the records based on each HR department's Region. Data filters specify which rows (and, optionally, which columns) of a table each HR department can access, for example only the rows for that department's Region, and you grant permissions on those filters to the IAM roles of the HR departments. This solution meets the requirement with the least operational overhead because it centralizes data lake security and reuses the existing IAM roles of the HR departments [1][2].
The other options are not optimal for the following reasons:
A. Use data filters for each Region to register the S3 paths as data locations. This option is not possible, because data filters are not used to register S3 paths as data locations. They restrict the rows and columns that a permission grants within a registered location.
C. Modify the IAM roles of the HR departments to add a data filter for each department's Region. This option is not possible, because data filters are not added to IAM roles. They are part of the permissions that AWS Lake Formation grants to those roles.
E. Create a separate S3 bucket for each Region. Configure an IAM policy to allow S3 access. Restrict access based on Region. This option is not recommended, because it requires more operational overhead to create and manage multiple S3 buckets and to configure and maintain IAM policies for each HR department. It also gives up the centralized, fine-grained access control and data governance that AWS Lake Formation provides.
References: [1] AWS Lake Formation; [2] AWS Lake Formation Permissions; AWS Identity and Access Management; Amazon S3
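As a rough illustration of "a data filter for each Region", the sketch below builds the `TableData` payloads that the boto3 Lake Formation `create_data_cells_filter` call accepts. It assumes the employee table has a column named `region`; the account ID, database name, table name, and Region list are all hypothetical.

```python
def region_data_filter(catalog_id: str, database: str, table: str, region: str) -> dict:
    """Build one row-level data filter that exposes only one Region's rows."""
    return {
        "TableCatalogId": catalog_id,
        "DatabaseName": database,
        "TableName": table,
        "Name": "hr_" + region.replace("-", "_"),
        # Assumption: the table has a `region` column to filter on.
        "RowFilter": {"FilterExpression": f"region = '{region}'"},
        # An empty wildcard keeps every column visible.
        "ColumnWildcard": {},
    }


# Hypothetical account, database, table, and Regions.
regions = ["us-east-1", "eu-west-1", "ap-southeast-2"]
filters = [
    region_data_filter("111122223333", "hr_lake", "employee_records", r)
    for r in regions
]

# With boto3 and credentials available, each payload could be submitted as:
#   boto3.client("lakeformation").create_data_cells_filter(TableData=payload)
```

Each filter would then be granted to the matching HR department's IAM role through Lake Formation permissions, so the roles themselves stay unchanged.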
Question # 12
A healthcare company uses Amazon Kinesis Data Streams to stream real-time health data from wearable devices, hospital equipment, and patient records. A data engineer needs to find a solution to process the streaming data. The data engineer needs to store the data in an Amazon Redshift Serverless warehouse. The solution must support near real-time analytics of the streaming data and the previous day's data. Which solution will meet these requirements with the LEAST operational overhead?
A. Load data into Amazon Kinesis Data Firehose. Load the data into Amazon Redshift.
B. Use the streaming ingestion feature of Amazon Redshift.
C. Load the data into Amazon S3. Use the COPY command to load the data into Amazon Redshift.
D. Use the Amazon Aurora zero-ETL integration with Amazon Redshift.
Answer: B
Explanation: The streaming ingestion feature of Amazon Redshift ingests data from streaming sources, such as Amazon Kinesis Data Streams, directly into Amazon Redshift materialized views in near real time. The data engineer can use it to process the streaming data from the wearable devices, hospital equipment, and patient records. Each refresh of the materialized view incrementally adds the records that arrived since the previous refresh, so the Amazon Redshift Serverless warehouse holds both the newest records and the previous day's data for analytics. This solution meets the requirements with the least operational overhead because it requires no additional services or components to ingest and process the streaming data.
The other options are either not feasible or not optimal. Loading data into Amazon Kinesis Data Firehose and then into Amazon Redshift (option A) introduces additional latency and cost and requires additional configuration and management. Loading the data into Amazon S3 and then using the COPY command to load the data into Amazon Redshift (option C) also introduces additional latency and cost and requires additional storage space and ETL logic. Using the Amazon Aurora zero-ETL integration with Amazon Redshift (option D) does not work, because it requires the data to be stored in Amazon Aurora first, which is not the case for the healthcare company's streaming data.
References: Using streaming ingestion with Amazon Redshift; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.5: Amazon Redshift Streaming Ingestion
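A streaming ingestion setup usually comes down to two SQL statements: an external schema that points at Kinesis, and an auto-refreshing materialized view over the stream. The sketch below renders them from Python. The schema, stream, view, and role names are hypothetical, and the statements assume JSON-encoded records.

```python
def streaming_ingestion_sql(schema: str, stream: str, view: str, role_arn: str) -> list:
    """Render the two statements that set up Redshift streaming ingestion."""
    create_schema = (
        f"CREATE EXTERNAL SCHEMA {schema} FROM KINESIS "
        f"IAM_ROLE '{role_arn}';"
    )
    create_view = (
        f"CREATE MATERIALIZED VIEW {view} AUTO REFRESH YES AS "
        "SELECT approximate_arrival_timestamp, "
        "JSON_PARSE(kinesis_data) AS payload "  # assumes JSON records
        f'FROM {schema}."{stream}";'
    )
    return [create_schema, create_view]


# Hypothetical names, for illustration only.
statements = streaming_ingestion_sql(
    schema="kds",
    stream="health-telemetry",
    view="health_events_mv",
    role_arn="arn:aws:iam::111122223333:role/redshift-streaming",
)
for statement in statements:
    print(statement)
```

Queries for both near real-time and previous-day analytics would then read from the materialized view, filtering on `approximate_arrival_timestamp`.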
Question # 13
A company is migrating a legacy application to an Amazon S3 based data lake. A data engineer reviewed data that is associated with the legacy application. The data engineer found that the legacy data contained some duplicate information. The data engineer must identify and remove duplicate information from the legacy application data. Which solution will meet these requirements with the LEAST operational overhead?
A. Write a custom extract, transform, and load (ETL) job in Python. Use the DataFrame drop_duplicates() function by importing the Pandas library to perform data deduplication.
B. Write an AWS Glue extract, transform, and load (ETL) job. Use the FindMatches machine learning (ML) transform to transform the data to perform data deduplication.
C. Write a custom extract, transform, and load (ETL) job in Python. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
D. Write an AWS Glue extract, transform, and load (ETL) job. Import the Python dedupe library. Use the dedupe library to perform data deduplication.
Answer: B
Explanation: AWS Glue is a fully managed, serverless ETL service that can handle data deduplication with minimal operational overhead. AWS Glue provides a built-in ML transform called FindMatches, which identifies and groups records that refer to the same entity, even when the records do not match exactly. FindMatches labels each group of matching records with a shared identifier, which the job can then use to remove the duplicates. FindMatches requires no custom matching code or prior ML experience, because it learns from a sample of labeled data that the user provides. It also scales to large datasets. The options that rely on custom Python code or on the dedupe library (A, C, and D) require the data engineer to write, tune, and maintain the deduplication logic.
References: AWS Glue; FindMatches ML Transform; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 14
A company needs to build a data lake in AWS. The company must provide row-level data access and column-level data access to specific teams. The teams will access the data by using Amazon Athena, Amazon Redshift Spectrum, and Apache Hive from Amazon EMR. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Amazon S3 for data lake storage. Use S3 access policies to restrict data access by rows and columns. Provide data access through Amazon S3.
B. Use Amazon S3 for data lake storage. Use Apache Ranger through Amazon EMR to restrict data access by rows and columns. Provide data access by using Apache Pig.
C. Use Amazon Redshift for data lake storage. Use Redshift security policies to restrict data access by rows and columns. Provide data access by using Apache Spark and Amazon Athena federated queries.
D. Use Amazon S3 for data lake storage. Use AWS Lake Formation to restrict data access by rows and columns. Provide data access through AWS Lake Formation.
Answer: D
Explanation: Option D is the best solution to meet the requirements with the least operational overhead because AWS Lake Formation is a fully managed service that simplifies building, securing, and managing data lakes. Lake Formation lets you define granular data access policies at the row and column level for different users and groups. Lake Formation also integrates with Amazon Athena, Amazon Redshift Spectrum, and Apache Hive on Amazon EMR, so these services access the data in the data lake through Lake Formation permissions.
Option A is not a good solution because S3 access policies cannot restrict data access by rows and columns. S3 access policies operate on buckets, prefixes, and objects, based on the identity and permissions of the requester, the bucket and object ownership, and the object prefix and tags. They cannot enforce fine-grained access control at the row and column level inside an object.
Option B is not a good solution because it relies on Apache Ranger and Apache Pig, which the company must configure and maintain itself. Apache Ranger is a framework that provides centralized security administration for data stored in Hadoop clusters, such as Amazon EMR, and it can enforce row-level and column-level access policies for Apache Hive tables. However, Apache Ranger is not a native AWS service and requires additional installation and configuration for Amazon EMR clusters. Apache Pig is a platform for analyzing large data sets with a high-level scripting language called Pig Latin. Providing access through Apache Pig does not serve the teams, which use Amazon Athena, Amazon Redshift Spectrum, and Apache Hive.
Option C is not a good solution because Amazon Redshift is not a suitable service for data lake storage. Amazon Redshift is a fully managed data warehouse service that runs complex analytical queries using standard SQL. Amazon Redshift can enforce row-level and column-level access policies for different users and groups. However, Amazon Redshift is not designed to store large volumes of unstructured or semi-structured data, which are typical of data lakes, and it is more expensive than Amazon S3 for data lake storage.
References: AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide; What Is AWS Lake Formation? - AWS Lake Formation; Using AWS Lake Formation with Amazon Athena - AWS Lake Formation; Using AWS Lake Formation with Amazon Redshift Spectrum - AWS Lake Formation; Using AWS Lake Formation with Apache Hive on Amazon EMR - AWS Lake Formation; Using Bucket Policies and User Policies - Amazon Simple Storage Service; Apache Ranger; Apache Pig; What Is Amazon Redshift? - Amazon Redshift
Question # 15
A company uses an Amazon Redshift provisioned cluster as its database. The Redshift cluster has five reserved ra3.4xlarge nodes and uses key distribution. A data engineer notices that one of the nodes frequently has a CPU load over 90%. SQL Queries that run on the node are queued. The other four nodes usually have a CPU load under 15% during daily operations. The data engineer wants to maintain the current number of compute nodes. The data engineer also wants to balance the load more evenly across all five compute nodes. Which solution will meet these requirements?
A. Change the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
B. Change the distribution key to the table column that has the largest dimension.
C. Upgrade the reserved node from ra3.4xlarge to ra3.16xlarge.
D. Change the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement.
Answer: B
Explanation: Changing the distribution key to the table column that has the largest dimension balances the load more evenly across all five compute nodes. The distribution key determines how the rows of a table are distributed among the slices of the cluster. A poorly chosen distribution key causes data skew: some slices hold far more rows than others, which results in uneven CPU load and uneven query performance. Choosing the column that has the largest dimension, meaning the column with the most distinct values, as the distribution key spreads the rows more uniformly across the slices, which reduces data skew and improves query performance.
The other options do not meet the requirements. Option A, changing the sort key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement, does not affect the data distribution or the CPU load. The sort key determines the order in which rows are stored on disk, which can speed up range-restricted queries but does not balance load. Option C, upgrading the reserved node from ra3.4xlarge to ra3.16xlarge, increases the cost and capacity of the cluster but leaves the skewed distribution in place, so one node would still do most of the work. Option D, changing the primary key to be the data column that is most often used in a WHERE clause of the SQL SELECT statement, does not affect the data distribution or the CPU load either. In Amazon Redshift, primary key constraints are informational only: they are not enforced and do not determine where rows are stored.
References: Choosing a data distribution style; Choosing a data sort key; Working with primary keys
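The effect of key cardinality on skew can be illustrated with a small simulation. This is only a sketch: it hashes each key to one of five nodes with SHA-256, which is an assumption made for illustration and not Amazon Redshift's actual distribution function, and the sample keys are invented.

```python
import hashlib
from collections import Counter


def rows_per_node(keys, nodes: int = 5) -> list:
    """Count how many rows land on each node when rows are placed
    by a hash of their distribution key (illustrative hash only)."""
    counts = Counter()
    for key in keys:
        digest = hashlib.sha256(str(key).encode()).digest()
        counts[int.from_bytes(digest[:8], "big") % nodes] += 1
    return [counts.get(node, 0) for node in range(nodes)]


# Low-cardinality key: 90% of the rows share a single value.
skewed_keys = ["US"] * 9000 + ["DE", "FR", "JP", "BR"] * 250
# High-cardinality key: every row has a distinct value.
unique_keys = range(10_000)

skewed = rows_per_node(skewed_keys)
balanced = rows_per_node(unique_keys)
print("low-cardinality key :", skewed)
print("high-cardinality key:", balanced)
```

With the low-cardinality key, at least 9,000 of the 10,000 rows land on one node no matter how the hash behaves, which mirrors the single hot node in the question.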
Question # 16
A company is developing an application that runs on Amazon EC2 instances. Currently, the data that the application generates is temporary. However, the company needs to persist the data, even if the EC2 instances are terminated. A data engineer must launch new EC2 instances from an Amazon Machine Image (AMI) and configure the instances to preserve the data. Which solution will meet this requirement?
A. Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume that contains the application data. Apply the default settings to the EC2 instances.
B. Launch new EC2 instances by using an AMI that is backed by a root Amazon Elastic Block Store (Amazon EBS) volume that contains the application data. Apply the default settings to the EC2 instances.
C. Launch new EC2 instances by using an AMI that is backed by an EC2 instance store volume. Attach an Amazon Elastic Block Store (Amazon EBS) volume to contain the application data. Apply the default settings to the EC2 instances.
D. Launch new EC2 instances by using an AMI that is backed by an Amazon Elastic Block Store (Amazon EBS) volume. Attach an additional EC2 instance store volume to contain the application data. Apply the default settings to the EC2 instances.
Answer: C
Explanation: Amazon EC2 instances can use two types of storage volumes: instance store volumes and Amazon EBS volumes. Instance store volumes are ephemeral: they exist only for the life of the instance, and their data is lost if the instance is stopped, terminated, or fails. Amazon EBS volumes are persistent: they can be detached from one instance and attached to another, and the data on the volume is preserved. To persist the data even if the EC2 instances are terminated, the data engineer must store the application data on Amazon EBS volumes. The solution is to launch new EC2 instances from an AMI that is backed by an EC2 instance store volume, attach an Amazon EBS volume to each instance, and configure the application to write its data to the EBS volume. With the default settings, an EBS volume that is attached to a running instance is not deleted when the instance is terminated, so the data survives and can be accessed from another instance.
The other options do not meet the requirement. An AMI that is backed by an EC2 instance store volume that contains the application data (option A) keeps the data on ephemeral storage, which is lost at termination. An AMI that is backed by a root Amazon EBS volume that contains the application data (option B) does not work with the default settings, because the root EBS volume is deleted by default when the instance is terminated. Attaching an additional EC2 instance store volume to contain the application data (option D) does not work, because the data on the instance store volume is lost when the instance is terminated.
References: Amazon EC2 Instance Store; Amazon EBS Volumes; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.1: Amazon EC2
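The setting that decides whether an EBS volume outlives its instance is the `DeleteOnTermination` flag. The sketch below builds a block device mapping, in the shape that the boto3 EC2 `run_instances` call accepts, with the flag set explicitly for a data volume. The device name, size, and volume type are hypothetical choices, and setting the flag explicitly is an illustration rather than a step the question requires.

```python
def data_volume_mapping(device_name: str = "/dev/sdf", size_gib: int = 100) -> dict:
    """Describe an additional EBS data volume that should survive
    termination of the instance it is attached to."""
    return {
        "DeviceName": device_name,
        "Ebs": {
            "VolumeSize": size_gib,
            "VolumeType": "gp3",
            # False keeps the volume, and its data, after termination.
            "DeleteOnTermination": False,
        },
    }


mapping = data_volume_mapping()

# With boto3 and credentials available, the mapping could be passed as:
#   boto3.client("ec2").run_instances(..., BlockDeviceMappings=[mapping])
```

A root EBS volume carries the same flag, but there it defaults to true, which is why option B loses the data under default settings.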
Question # 17
A data engineer must ingest a source of structured data that is in .csv format into an Amazon S3 data lake. The .csv files contain 15 columns. Data analysts need to run Amazon Athena queries on one or two columns of the dataset. The data analysts rarely query the entire file. Which solution will meet these requirements MOST cost-effectively?
A. Use an AWS Glue PySpark job to ingest the source data into the data lake in .csv format.
B. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to ingest the data into the data lake in JSON format.
C. Use an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format.
D. Create an AWS Glue extract, transform, and load (ETL) job to read from the .csv structured data source. Configure the job to write the data into the data lake in Apache Parquet format.
Answer: D
Explanation: Amazon Athena is a serverless interactive query service that analyzes data in Amazon S3 using standard SQL. Athena supports various data formats, such as CSV, JSON, ORC, Avro, and Parquet. However, not all data formats are equally efficient for querying. Some formats, such as CSV, JSON, and Avro, are row-oriented: they store data as a sequence of records, each with the same fields. Row-oriented formats are suitable for loading and exporting data, but they are not optimal for analytical queries that access only a subset of columns, because the query engine must read every record in full to extract the columns it needs.
Other formats, such as ORC and Parquet, are column-oriented: they store data as a collection of columns, each with a specific data type. Column-oriented formats are ideal for analytical queries that filter, aggregate, or join data by columns. They also support per-column compression and encoding techniques that reduce the data size and improve query performance. For example, Parquet supports dictionary encoding, which replaces repeated values with numeric codes, and run-length encoding, which replaces consecutive identical values with a single value and a count. Parquet also supports compression algorithms, such as Snappy, GZIP, and ZSTD, that further reduce the data size.
Therefore, creating an AWS Glue extract, transform, and load (ETL) job that reads from the .csv structured data source and writes the data into the data lake in Apache Parquet format meets the requirements most cost-effectively. AWS Glue is a fully managed, serverless data integration service for data preparation, data cataloging, and data loading. AWS Glue ETL jobs transform and load data from various sources into various targets, and you can author them either visually (AWS Glue Studio) or in code. A Glue ETL job can convert the data from CSV to Parquet with little or no hand-written code. Because Parquet is column-oriented, Athena scans only the one or two columns that a query references and skips the rest, which reduces the amount of data read from S3. This also reduces the cost of Athena queries, because Athena charges based on the amount of data scanned.
The other options are not as cost-effective. Using an AWS Glue PySpark job to ingest the source data into the data lake in .csv format (option A) does not improve query performance or reduce query cost, because .csv is a row-oriented format and every query scans all 15 columns. Creating an AWS Glue ETL job to ingest the data into the data lake in JSON format (option B) has the same drawback, because JSON is also row-oriented. Using an AWS Glue PySpark job to ingest the source data into the data lake in Apache Avro format (option C) does not help either: Avro supports compression, but it is a row-oriented format, so Athena still reads entire records.
References: Amazon Athena; Choosing the Right Data Format; AWS Glue; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 5: Data Analysis and Visualization, Section 5.1: Amazon Athena
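The cost argument can be made concrete with a back-of-the-envelope calculation. The sketch below rests on simplifying assumptions that are not from the source: all 15 columns are the same size, compression is ignored, the dataset is 1 TB, and the price is an illustrative 5 USD per terabyte scanned.

```python
def athena_scan_cost(
    dataset_tb: float,
    columns_total: int,
    columns_queried: int,
    columnar: bool,
    price_per_tb: float = 5.0,  # illustrative price, an assumption
) -> float:
    """Estimate the cost of one query from the share of data scanned.

    Row-oriented files are read in full. Columnar files are read only
    for the columns that the query references.
    """
    fraction = columns_queried / columns_total if columnar else 1.0
    return dataset_tb * fraction * price_per_tb


# A query that touches 2 of the 15 columns in a hypothetical 1 TB dataset.
csv_cost = athena_scan_cost(1.0, 15, 2, columnar=False)
parquet_cost = athena_scan_cost(1.0, 15, 2, columnar=True)
print(f"CSV: {csv_cost:.2f} USD, Parquet: {parquet_cost:.2f} USD")
```

Under these assumptions, the columnar layout scans 2/15 of the data. Real savings are typically larger, because Parquet also compresses each column.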
Question # 18
A data engineer uses Amazon Redshift to run resource-intensive analytics processes once every month. Every month, the data engineer creates a new Redshift provisioned cluster. The data engineer deletes the Redshift provisioned cluster after the analytics processes are complete every month. Before the data engineer deletes the cluster each month, the data engineer unloads backup data from the cluster to an Amazon S3 bucket. The data engineer needs a solution to run the monthly analytics processes that does not require the data engineer to manage the infrastructure manually. Which solution will meet these requirements with the LEAST operational overhead?
A. Use AWS Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month.
B. Use Amazon Redshift Serverless to automatically process the analytics workload.
C. Use the AWS CLI to automatically process the analytics workload.
D. Use AWS CloudFormation templates to automatically process the analytics workload.
Answer: B
Explanation: Amazon Redshift Serverless is a deployment option of Amazon Redshift that runs analytics workloads without requiring you to provision or manage any clusters. The data engineer can use Amazon Redshift Serverless to process the monthly analytics workload automatically, because it scales compute resources up and down based on query demand and charges only for the resources consumed. This solution meets the requirements with the least operational overhead, because the data engineer does not need to create, delete, pause, or resume any Redshift clusters or manage any infrastructure manually. Queries can be run through the Amazon Redshift Data API from the AWS CLI, the AWS SDKs, or AWS Lambda functions [1][2].
The other options are not optimal for the following reasons:
A. Use AWS Step Functions to pause the Redshift cluster when the analytics processes are complete and to resume the cluster to run new processes every month. This option is not recommended, because the data engineer would still own a provisioned cluster that must be sized, maintained, and paid for in storage while it is paused. The data engineer would also have to build and maintain a Step Functions workflow to pause and resume the cluster, which adds complexity and overhead.
C. Use the AWS CLI to automatically process the analytics workload. This option is vague and does not specify how the AWS CLI would process the analytics workload. The AWS CLI is a tool for calling AWS services, not a compute service, so the data engineer would still need to provision and configure the resources that run the queries.
D. Use AWS CloudFormation templates to automatically process the analytics workload. This option is also vague and does not specify how AWS CloudFormation templates would process the analytics workload. AWS CloudFormation is a service that models and provisions AWS resources from templates. The data engineer could use templates to create and delete a Redshift provisioned cluster every month, but would still need to write and maintain the templates and monitor the status and performance of the resources, so the infrastructure is still managed manually.
References: [1] Amazon Redshift Serverless; [2] Amazon Redshift Data API; AWS Step Functions; AWS CLI; AWS CloudFormation
Question # 19
A financial company wants to use Amazon Athena to run on-demand SQL queries on a petabyte-scale dataset to support a business intelligence (BI) application. An AWS Glue job that runs during non-business hours updates the dataset once every day. The BI application has a standard data refresh frequency of 1 hour to comply with company policies. A data engineer wants to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs. Which solution will meet these requirements with the LEAST operational overhead?
A. Configure an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day.
B. Use the query result reuse feature of Amazon Athena for the SQL queries.
C. Add an Amazon ElastiCache cluster between the BI application and Athena.
D. Change the format of the files that are in the dataset to Apache Parquet.
Answer: B
Explanation: The best solution to cost optimize the company's use of Amazon Athena without adding any additional infrastructure costs is to use the query result reuse feature of Amazon Athena for the SQL queries. With this feature, Athena returns the stored result of an earlier, identical query instead of scanning the data again, as long as the stored result is no older than a maximum age that you specify and is still in the query result location in Amazon S3 [1]. Reused results incur no data-scan charges. The feature suits a petabyte-scale dataset that is updated only once a day and a BI application that repeats the same queries every hour: setting the maximum age to match the 1-hour refresh frequency keeps the results within company policy while avoiding repeated scans. You enable the feature per query and set the maximum age for reusable results [1].
Option A is not the best solution, because configuring an Amazon S3 Lifecycle policy to move data to the S3 Glacier Deep Archive storage class after 1 day would not cost optimize the company's use of Amazon Athena. Amazon S3 Lifecycle policies are rules that automatically transition objects between storage classes based on criteria such as the age of the object [2]. S3 Glacier Deep Archive is the lowest-cost storage class in Amazon S3, designed for long-term archives that are accessed once or twice a year [3]. Moving data to S3 Glacier Deep Archive reduces the storage cost, but it adds retrieval cost and latency, because a standard restore takes up to 12 hours [3]. Moreover, Athena cannot query objects in the S3 Glacier or S3 Glacier Deep Archive storage classes that have not been restored [4]. Therefore, this option does not meet the requirement of running on-demand SQL queries on the dataset.
Option C is not the best solution, because adding an Amazon ElastiCache cluster between the BI application and Athena increases cost and complexity. Amazon ElastiCache is a service that offers fully managed in-memory data stores, such as Redis and Memcached, that improve the performance and scalability of applications by caching frequently accessed data. ElastiCache introduces additional infrastructure costs and operational overhead, because the company would have to provision, manage, and scale the cluster and integrate it with the BI application and Athena.
Option D is not the best solution, because changing the format of the files that are in the dataset to Apache Parquet adds cost and complexity. Apache Parquet is a columnar storage format that improves the performance of analytical queries by reducing the amount of data scanned and by providing efficient compression and encoding schemes. However, converting a petabyte-scale dataset requires additional processing steps, such as an AWS Glue or Amazon EMR job that converts the files from their original format to Parquet and stores the converted files in a separate location in Amazon S3. This increases the complexity and operational overhead of the data pipeline and incurs additional costs for AWS Glue or Amazon EMR.
References: [1] Query result reuse; [2] Amazon S3 Lifecycle; [3] S3 Glacier Deep Archive; [4] Storage classes supported by Athena; What is Amazon ElastiCache?; Amazon Athena pricing; Columnar Storage Formats; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
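Query result reuse is requested per query. The sketch below builds the arguments for the boto3 Athena `start_query_execution` call with reuse enabled and a maximum age of 60 minutes, chosen here to match the BI application's 1-hour refresh. The SQL text, database name, and S3 output location are hypothetical.

```python
def reuse_query_kwargs(
    sql: str, database: str, output_location: str, max_age_minutes: int = 60
) -> dict:
    """Build start_query_execution arguments that allow Athena to
    return a stored result instead of scanning the data again."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
        "ResultReuseConfiguration": {
            "ResultReuseByAgeConfiguration": {
                "Enabled": True,
                # Results older than this are recomputed.
                "MaxAgeInMinutes": max_age_minutes,
            }
        },
    }


# Hypothetical query, database, and result bucket.
kwargs = reuse_query_kwargs(
    sql="SELECT region, SUM(amount) FROM trades GROUP BY region",
    database="finance",
    output_location="s3://example-athena-results/bi/",
)

# With boto3 and credentials available:
#   boto3.client("athena").start_query_execution(**kwargs)
```

Reuse applies only when the query string matches an earlier query, so the BI application should submit its queries with stable, identical text.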
Question # 20
A company uses an Amazon Redshift cluster that runs on RA3 nodes. The company wants to scale read and write capacity to meet demand. A data engineer needs to identify a solution that will turn on concurrency scaling. Which solution will meet this requirement?
A. Turn on concurrency scaling in workload management (WLM) for Redshift Serverless workgroups.
B. Turn on concurrency scaling at the workload management (WLM) queue level in the Redshift cluster.
C. Turn on concurrency scaling in the settings during the creation of a new Redshift cluster.
D. Turn on concurrency scaling for the daily usage quota for the Redshift cluster.
Answer: B
Explanation: Concurrency scaling lets Amazon Redshift support thousands of concurrent users and queries with consistently fast query performance. When concurrency scaling is turned on, Amazon Redshift automatically adds query processing capacity within seconds so that queued queries run without waiting. You control which queries are sent to the concurrency-scaling cluster by configuring WLM queues. To turn on concurrency scaling for a queue, set the queue's Concurrency Scaling mode value to auto. The other options are incorrect because they do not turn on concurrency scaling for the existing Redshift cluster on RA3 nodes: option A applies to Redshift Serverless workgroups rather than to the cluster, option C applies only to a new cluster, and option D concerns a usage quota rather than the feature itself.
References: Working with concurrency scaling - Amazon Redshift; Amazon Redshift Concurrency Scaling - Amazon Web Services; Configuring concurrency scaling queues - Amazon Redshift; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide (Chapter 6, page 163)
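Outside the console, the queue-level setting lives in the cluster parameter group's `wlm_json_configuration` parameter. The sketch below builds a manual WLM configuration with one queue set to `auto` and wraps it in the arguments for the boto3 Redshift `modify_cluster_parameter_group` call. The parameter group name, query group, and concurrency values are hypothetical.

```python
import json


def wlm_parameter(queues: list) -> dict:
    """Wrap a list of WLM queue definitions as a parameter-group entry."""
    return {
        "ParameterName": "wlm_json_configuration",
        "ParameterValue": json.dumps(queues),
    }


# Hypothetical manual WLM setup: one reporting queue that may burst to
# concurrency-scaling clusters, followed by the default queue.
queues = [
    {"query_group": ["reports"], "query_concurrency": 5, "concurrency_scaling": "auto"},
    {"query_concurrency": 5},
]

request = {
    "ParameterGroupName": "example-ra3-params",  # hypothetical name
    "Parameters": [wlm_parameter(queues)],
}

# With boto3 and credentials available:
#   boto3.client("redshift").modify_cluster_parameter_group(**request)
```

Only queries routed to a queue whose mode is `auto` are eligible for the concurrency-scaling cluster. Queries in the other queue keep running on the main cluster.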
Question # 21
A company has a production AWS account that runs company workloads. The company's security team created a security AWS account to store and analyze security logs from the production AWS account. The security logs in the production AWS account are stored in Amazon CloudWatch Logs. The company needs to use Amazon Kinesis Data Streams to deliver the security logs to the security AWS account. Which solution will meet these requirements?
A. Create a destination data stream in the production AWS account. In the security AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the production AWS account. B. Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the security AWS account. C. Create a destination data stream in the production AWS account. In the production AWS account, create an IAM role that has cross-account permissions to Kinesis Data Streams in the security AWS account. D. Create a destination data stream in the security AWS account. Create an IAM role and a trust policy to grant CloudWatch Logs the permission to put data into the stream. Create a subscription filter in the production AWS account.
Answer: D Explanation: Amazon Kinesis Data Streams is a service that enables you to collect, process, and analyze real-time streaming data. You can use Kinesis Data Streams to ingest data from various sources, such as Amazon CloudWatch Logs, and deliver it to different destinations, such as Amazon S3 or Amazon Redshift. To use Kinesis Data Streams to deliver the security logs from the production AWS account to the security AWS account, you need to create a destination data stream in the security AWS account. This data stream will receive the log data from the CloudWatch Logs service in the production AWS account. To enable this cross-account data delivery, you need to create an IAM role and a trust policy in the security AWS account. The IAM role defines the permissions that the CloudWatch Logs service needs to put data into the destination data stream. The trust policy allows the CloudWatch Logs service to assume the IAM role. Finally, you need to create a subscription filter in the production AWS account. A subscription filter defines the pattern to match log events and the destination to send the matching events. In this case, the destination is the destination data stream in the security AWS account. This solution meets the requirements of using Kinesis Data Streams to deliver the security logs to the security AWS account. The other options are either not possible or not optimal. A destination data stream in the production AWS account would not deliver the data to the security AWS account. A subscription filter in the security AWS account would not capture the log events from the production AWS account.
References:
Using Amazon Kinesis Data Streams with Amazon CloudWatch Logs
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 3: Data Ingestion and Transformation, Section 3.3: Amazon Kinesis Data Streams
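The IAM role and trust policy mentioned above can be sketched as two policy documents. This is a minimal illustration only: the account ID, Region, and stream name are hypothetical placeholders, and a real setup would typically tighten the trust policy with conditions that restrict which source account may use the role.

```python
import json

SECURITY_ACCOUNT_ID = "999999999999"  # hypothetical security account
REGION = "us-east-1"                  # hypothetical Region
STREAM_NAME = "security-logs"         # hypothetical stream name


def trust_policy():
    """Trust policy that lets the CloudWatch Logs service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "logs.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def permissions_policy():
    """Permissions policy that allows writing records to the destination stream."""
    stream_arn = f"arn:aws:kinesis:{REGION}:{SECURITY_ACCOUNT_ID}:stream/{STREAM_NAME}"
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "kinesis:PutRecord",
            "Resource": stream_arn,
        }],
    }


if __name__ == "__main__":
    print(json.dumps(trust_policy(), indent=2))
    print(json.dumps(permissions_policy(), indent=2))
```

Both documents live in the security account, next to the stream; the production account only holds the subscription filter that points at the destination.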
Question # 22
A company is migrating on-premises workloads to AWS. The company wants to reduce overall operational overhead. The company also wants to explore serverless options. The company's current workloads use Apache Pig, Apache Oozie, Apache Spark, Apache HBase, and Apache Flink. The on-premises workloads process petabytes of data in seconds. The company must maintain similar or better performance after the migration to AWS. Which extract, transform, and load (ETL) service will meet these requirements?
A. AWS Glue B. Amazon EMR C. AWS Lambda D. Amazon Redshift
Answer: A Explanation: AWS Glue is a fully managed serverless ETL service that can handle petabytes of data in seconds. AWS Glue can run Apache Spark and Apache Flink jobs without requiring any infrastructure provisioning or management. AWS Glue can also integrate with Apache Pig, Apache Oozie, and Apache HBase using AWS Glue Data Catalog and AWS Glue workflows. AWS Glue can reduce the overall operational overhead by automating the data discovery, data preparation, and data loading processes. AWS Glue can also optimize the cost and performance of ETL jobs by using AWS Glue Job Bookmarking, AWS Glue Crawlers, and AWS Glue Schema Registry.
References:
AWS Glue
AWS Glue Data Catalog
AWS Glue Workflows
AWS Glue Job Bookmarking
AWS Glue Crawlers
AWS Glue Schema Registry
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 23
A data engineering team is using an Amazon Redshift data warehouse for operational reporting. The team wants to prevent performance issues that might result from long-running queries. A data engineer must choose a system table in Amazon Redshift to record anomalies when a query optimizer identifies conditions that might indicate performance issues. Which table views should the data engineer use to meet this requirement?
A. STL_USAGE_CONTROL B. STL_ALERT_EVENT_LOG C. STL_QUERY_METRICS D. STL_PLAN_INFO
Answer: B Explanation: The STL_ALERT_EVENT_LOG table view records anomalies when the query optimizer identifies conditions that might indicate performance issues. These conditions include skewed data distribution, missing statistics, nested loop joins, and broadcasted data. The STL_ALERT_EVENT_LOG table view can help the data engineer to identify and troubleshoot the root causes of performance issues and optimize the query execution plan. The other table views are not relevant for this requirement. STL_USAGE_CONTROL logs events that occur when a usage limit is reached. STL_QUERY_METRICS records metrics such as rows processed, CPU usage, and I/O for completed queries. STL_PLAN_INFO records the EXPLAIN output for a query as a set of rows.
References:
STL_ALERT_EVENT_LOG
System Tables and Views
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
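To make the use of the alert log concrete, the sketch below builds a SQL query that lists recent optimizer alerts together with the suggested fix. The helper name and the default window are hypothetical; the column names (`query`, `event`, `solution`, `event_time`) are the ones this table is documented to expose, but verify them against your cluster before use.

```python
def recent_alerts_sql(hours=24, limit=50):
    """Build a query that lists optimizer alerts from the last `hours` hours."""
    if hours <= 0 or limit <= 0:
        raise ValueError("hours and limit must be positive")
    return (
        "SELECT query, TRIM(event) AS event, TRIM(solution) AS solution, event_time\n"
        "FROM stl_alert_event_log\n"
        f"WHERE event_time >= DATEADD(hour, -{int(hours)}, GETDATE())\n"
        "ORDER BY event_time DESC\n"
        f"LIMIT {int(limit)};"
    )


if __name__ == "__main__":
    print(recent_alerts_sql(hours=12, limit=20))
```

The generated statement can be run from any SQL client connected to the cluster; the `solution` column is usually the quickest pointer to the remedy, such as running ANALYZE for missing statistics.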
Question # 24
A media company wants to improve a system that recommends media content to customers based on user behavior and preferences. To improve the recommendation system, the company needs to incorporate insights from third-party datasets into the company's existing analytics platform. The company wants to minimize the effort and time required to incorporate third-party datasets. Which solution will meet these requirements with the LEAST operational overhead?
A. Use API calls to access and integrate third-party datasets from AWS Data Exchange. B. Use API calls to access and integrate third-party datasets from AWS C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories. D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR).
Answer: A Explanation: AWS Data Exchange is a service that makes it easy to find, subscribe to, and use third-party data in the cloud. It provides a secure and reliable way to access and integrate data from various sources, such as data providers, public datasets, or AWS services. Using AWS Data Exchange, you can browse and subscribe to data products that suit your needs, and then use API calls or the AWS Management Console to export the data to Amazon S3, where you can use it with your existing analytics platform. This solution minimizes the effort and time required to incorporate third-party datasets, as you do not need to set up and manage data pipelines, storage, or access controls. You also benefit from the data quality and freshness provided by the data providers, who can update their data products as frequently as needed [1][2].
The other options are not optimal for the following reasons:
B. Use API calls to access and integrate third-party datasets from AWS. This option is vague and does not specify which AWS service or feature is used to access and integrate third-party datasets. AWS offers a variety of services and features that can help with data ingestion, processing, and analysis, but not all of them are suitable for the given scenario. For example, AWS Glue is a serverless data integration service that can help you discover, prepare, and combine data from various sources, but it requires you to create and run data extraction, transformation, and loading (ETL) jobs, which can add operational overhead [3].
C. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from AWS CodeCommit repositories. This option is not feasible, as AWS CodeCommit is a source control service that hosts secure Git-based repositories, not a data source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams is a service that enables you to capture, process, and analyze data streams in real time, such as clickstream data, application logs, or IoT telemetry. It does not support accessing and integrating data from AWS CodeCommit repositories, which are meant for storing and managing code, not data.
D. Use Amazon Kinesis Data Streams to access and integrate third-party datasets from Amazon Elastic Container Registry (Amazon ECR). This option is also not feasible, as Amazon ECR is a fully managed container registry service that stores, manages, and deploys container images, not a data source that can be accessed by Amazon Kinesis Data Streams. Amazon Kinesis Data Streams does not support accessing and integrating data from Amazon ECR, which is meant for storing and managing container images, not data.
References:
[1]: AWS Data Exchange User Guide
[2]: AWS Data Exchange FAQs
[3]: AWS Glue Developer Guide
AWS CodeCommit User Guide
Amazon Kinesis Data Streams Developer Guide
Amazon Elastic Container Registry User Guide
Build a Continuous Delivery Pipeline for Your Container Images with Amazon ECR as Source
Question # 25
A company uses an on-premises Microsoft SQL Server database to store financial transaction data. The company migrates the transaction data from the on-premises database to AWS at the end of each month. The company has noticed that the cost to migrate data from the on-premises database to an Amazon RDS for SQL Server database has increased recently. The company requires a cost-effective solution to migrate the data to AWS. The solution must cause minimal downtime for the applications that access the database. Which AWS service should the company use to meet these requirements?
A. AWS Lambda B. AWS Database Migration Service (AWS DMS) C. AWS Direct Connect D. AWS DataSync
Answer: B Explanation: AWS Database Migration Service (AWS DMS) is a cloud service that makes it possible to migrate relational databases, data warehouses, NoSQL databases, and other types of data stores to AWS quickly, securely, and with minimal downtime and zero data loss [1]. AWS DMS supports migration between 20-plus database and analytics engines, such as Microsoft SQL Server to Amazon RDS for SQL Server [2]. AWS DMS takes over many of the difficult or tedious tasks involved in a migration project, such as capacity analysis, hardware and software procurement, installation and administration, testing and debugging, and ongoing replication and monitoring [1]. AWS DMS is a cost-effective solution, as you only pay for the compute resources and additional log storage used during the migration process [2]. AWS DMS is the best solution for the company to migrate the financial transaction data from the on-premises Microsoft SQL Server database to AWS, as it meets the requirements of minimal downtime, zero data loss, and low cost.
Option A is not the best solution, as AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers, but it does not provide any built-in features for database migration. You would have to write your own code to extract, transform, and load the data from the source to the target, which would increase the operational overhead and complexity.
Option C is not the best solution, as AWS Direct Connect is a service that establishes a dedicated network connection from your premises to AWS, but it does not provide any built-in features for database migration. You would still need to use another service or tool to perform the actual data transfer, which would increase the cost and complexity.
Option D is not the best solution, as AWS DataSync is a service that makes it easy to transfer data between on-premises storage systems and AWS storage services, such as Amazon S3, Amazon EFS, and Amazon FSx for Windows File Server, but it does not support Amazon RDS for SQL Server as a target. You would have to use another service or tool to migrate the data from Amazon S3 to Amazon RDS for SQL Server, which would increase the latency and complexity.
References:
Database Migration - AWS Database Migration Service - AWS
What is AWS Database Migration Service?
AWS Database Migration Service Documentation
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 26
A company has used an Amazon Redshift table that is named Orders for 6 months. The company performs weekly updates and deletes on the table. The table has an interleaved sort key on a column that contains AWS Regions. The company wants to reclaim disk space so that the company will not run out of storage space. The company also wants to analyze the sort key column. Which Amazon Redshift command will meet these requirements?
A. VACUUM FULL Orders B. VACUUM DELETE ONLY Orders C. VACUUM REINDEX Orders D. VACUUM SORT ONLY Orders
Answer: C Explanation: Amazon Redshift is a fully managed, petabyte-scale data warehouse service that enables fast and cost-effective analysis of large volumes of data. Amazon Redshift uses columnar storage, compression, and zone maps to optimize the storage and performance of data. However, over time, as data is inserted, updated, or deleted, the physical storage of data can become fragmented, resulting in wasted disk space and degraded query performance. To address this issue, Amazon Redshift provides the VACUUM command, which reclaims disk space and re-sorts rows in either a specified table or all tables in the current schema [1].
The VACUUM options relevant to this question are FULL, DELETE ONLY, SORT ONLY, and REINDEX. The option that best meets the requirements of the question is VACUUM REINDEX, which analyzes the distribution of values in the interleaved sort key columns and then performs a full vacuum, re-sorting the rows and rewriting the table to a new location on disk. An interleaved sort key is a type of sort key that gives equal weight to each column in the sort key, and stores the rows in a way that optimizes the performance of queries that filter by multiple columns in the sort key. However, as data is added or changed, the interleaved sort order can become skewed, resulting in suboptimal query performance. The VACUUM REINDEX option restores the optimal interleaved sort order and reclaims disk space by removing deleted rows. Because it analyzes the sort key column as part of the operation, it covers both requirements in a single command [2][3].
The other options are not optimal for the following reasons:
A. VACUUM FULL Orders. This option reclaims disk space by removing deleted rows and re-sorts the entire table. However, this option is not suitable for tables that have an interleaved sort key, as it does not restore the optimal interleaved sort order. Moreover, this option does not analyze the distribution of values in the sort key column.
B. VACUUM DELETE ONLY Orders. This option reclaims disk space by removing deleted rows, but does not re-sort the table. This option is not suitable for tables that have any sort key, as it does not improve the query performance by restoring the sort order. Moreover, this option does not analyze the sort key column.
D. VACUUM SORT ONLY Orders. This option re-sorts the entire table, but does not reclaim disk space by removing deleted rows. This option is not suitable for tables that have an interleaved sort key, as it does not restore the optimal interleaved sort order. Moreover, this option does not analyze the sort key column.
References:
[1]: Amazon Redshift VACUUM
[2]: Amazon Redshift Interleaved Sorting
[3]: Amazon Redshift ANALYZE
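The comparison of the four VACUUM variants above can be summarized as a small lookup. This is only a sketch of the reasoning in the explanation: the capability flags restate the text, the ordering from lightest to heaviest operation is an assumption made for illustration, and the helper name is hypothetical.

```python
# Capability flags per VACUUM variant, as described in the explanation above.
# Listed from the lightest to the heaviest operation (assumed ordering).
VACUUM_MODES = [
    ("DELETE ONLY", {"reclaims_space": True, "resorts": False, "reindexes": False}),
    ("SORT ONLY", {"reclaims_space": False, "resorts": True, "reindexes": False}),
    ("FULL", {"reclaims_space": True, "resorts": True, "reindexes": False}),
    ("REINDEX", {"reclaims_space": True, "resorts": True, "reindexes": True}),
]


def choose_vacuum(table, need_space=False, need_sort=False, need_reindex=False):
    """Return the lightest VACUUM statement that covers every requested need."""
    for mode, caps in VACUUM_MODES:
        if ((caps["reclaims_space"] or not need_space)
                and (caps["resorts"] or not need_sort)
                and (caps["reindexes"] or not need_reindex)):
            return f"VACUUM {mode} {table};"
    raise ValueError("no VACUUM mode satisfies the request")


if __name__ == "__main__":
    # Reclaim space and analyze the interleaved sort key, as in the question
    print(choose_vacuum("orders", need_space=True, need_reindex=True))
```

With the question's two requirements (reclaim space, analyze the interleaved sort key), the only variant that qualifies is REINDEX.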
Question # 27
A company extracts approximately 1 TB of data every day from data sources such as SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. Some of the data sources have undefined data schemas or data schemas that change. A data engineer must implement a solution that can detect the schema for these data sources. The solution must extract, transform, and load the data to an Amazon S3 bucket. The company has a service level agreement (SLA) to load the data into the S3 bucket within 15 minutes of data creation. Which solution will meet these requirements with the LEAST operational overhead?
A. Use Amazon EMR to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark. B. Use AWS Glue to detect the schema and to extract, transform, and load the data into the S3 bucket. Create a pipeline in Apache Spark. C. Create a PySpark program in AWS Lambda to extract, transform, and load the data into the S3 bucket. D. Create a stored procedure in Amazon Redshift to detect the schema and to extract, transform, and load the data into a Redshift Spectrum table. Access the table from Amazon S3.
Answer: B Explanation: AWS Glue is a fully managed service that provides a serverless data integration platform. It can automatically discover and categorize data from various sources, including SAP HANA, Microsoft SQL Server, MongoDB, Apache Kafka, and Amazon DynamoDB. It can also infer the schema of the data and store it in the AWS Glue Data Catalog, which is a central metadata repository. AWS Glue can then use the schema information to generate and run Apache Spark code to extract, transform, and load the data into an Amazon S3 bucket. AWS Glue can also monitor and optimize the performance and cost of the data pipeline, and handle any schema changes that may occur in the source data. AWS Glue can meet the SLA of loading the data into the S3 bucket within 15 minutes of data creation, as it can trigger the data pipeline based on events, schedules, or on-demand. AWS Glue has the least operational overhead among the options, as it does not require provisioning, configuring, or managing any servers or clusters. It also handles scaling, patching, and security automatically.
References:
AWS Glue
AWS Glue Data Catalog
AWS Glue Developer Guide
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 28
A company stores data from an application in an Amazon DynamoDB table that operates in provisioned capacity mode. The workloads of the application have predictable throughput load on a regular schedule. Every Monday, there is an immediate increase in activity early in the morning. The application has very low usage during weekends. The company must ensure that the application performs consistently during peak usage times. Which solution will meet these requirements in the MOST cost-effective way?
A. Increase the provisioned capacity to the maximum capacity that is currently present during peak load times. B. Divide the table into two tables. Provision each table with half of the provisioned capacity of the original table. Spread queries evenly across both tables. C. Use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times. Schedule lower capacity during off-peak times. D. Change the capacity mode from provisioned to on-demand. Configure the table to scale up and scale down based on the load on the table.
Answer: C Explanation: Amazon DynamoDB is a fully managed NoSQL database service that provides fast and predictable performance with seamless scalability. DynamoDB offers two capacity modes for throughput capacity: provisioned and on-demand. In provisioned capacity mode, you specify the number of read and write capacity units per second that you expect your application to require. DynamoDB reserves the resources to meet your throughput needs with consistent performance. In on-demand capacity mode, you pay per request and DynamoDB scales the resources up and down automatically based on the actual workload. On-demand capacity mode is suitable for unpredictable workloads that can vary significantly over time [1].
The solution that meets the requirements in the most cost-effective way is to use AWS Application Auto Scaling to schedule higher provisioned capacity for peak usage times and lower capacity during off-peak times. This solution has the following advantages:
It allows you to optimize the cost and performance of your DynamoDB table by adjusting the provisioned capacity according to your predictable workload patterns. You can use scheduled scaling to specify the date and time for the scaling actions, and the new minimum and maximum capacity limits. For example, you can schedule higher capacity for every Monday morning and lower capacity for weekends [2].
It enables you to take advantage of the lower cost per unit of provisioned capacity mode compared to on-demand capacity mode. Provisioned capacity mode charges a flat hourly rate for the capacity you reserve, regardless of how much you use. On-demand capacity mode charges for each read and write request you consume, with no minimum capacity required. For predictable workloads, provisioned capacity mode can be more cost-effective than on-demand capacity mode [1].
It ensures that your application performs consistently during peak usage times by having enough capacity to handle the increased load. You can also use auto scaling to automatically adjust the provisioned capacity based on the actual utilization of your table, and set a target utilization percentage for your table or global secondary index. This way, you can avoid under-provisioning or over-provisioning your table [2].
Option A is incorrect because it suggests increasing the provisioned capacity to the maximum capacity that is currently present during peak load times. This solution has the following disadvantages:
It wastes money by paying for unused capacity during off-peak times. If you provision the same high capacity for all times, regardless of the actual workload, you are over-provisioning your table and paying for resources that you don’t need [1].
It does not account for possible changes in the workload patterns over time. If your peak load times increase or decrease in the future, you may need to manually adjust the provisioned capacity to match the new demand. This adds operational overhead and complexity to your application [2].
Option B is incorrect because it suggests dividing the table into two tables and provisioning each table with half of the provisioned capacity of the original table. This solution has the following disadvantages:
It complicates the data model and the application logic by splitting the data into two separate tables. You need to ensure that the queries are evenly distributed across both tables, and that the data is consistent and synchronized between them. This adds extra development and maintenance effort to your application [3].
It does not solve the problem of adjusting the provisioned capacity according to the workload patterns. You still need to manually or automatically scale the capacity of each table based on the actual utilization and demand. This may result in under-provisioning or over-provisioning your tables [2].
Option D is incorrect because it suggests changing the capacity mode from provisioned to on-demand. This solution has the following disadvantages:
It may incur higher costs than provisioned capacity mode for predictable workloads. On-demand capacity mode charges for each read and write request you consume, with no minimum capacity required. For predictable workloads, provisioned capacity mode can be more cost-effective than on-demand capacity mode, as you can reserve the capacity you need at a lower rate [1].
It may not provide consistent performance during peak usage times, as on-demand capacity mode may take some time to scale up the resources to meet the sudden increase in demand. On-demand capacity mode uses adaptive capacity to handle bursts of traffic, but it may not be able to handle very large spikes or sustained high throughput. In such cases, you may experience throttling or increased latency.
References:
[1]: Choosing the right DynamoDB capacity mode - Amazon DynamoDB
[2]: Managing throughput capacity automatically with DynamoDB auto scaling - Amazon DynamoDB
[3]: Best practices for designing and using partition keys effectively - Amazon DynamoDB
[4]: On-demand mode guidelines - Amazon DynamoDB
[5]: How to optimize Amazon DynamoDB costs - AWS Database Blog
[6]: DynamoDB adaptive capacity: How it works and how it helps - AWS Database Blog
[7]: Amazon DynamoDB pricing - Amazon Web Services (AWS)
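The scheduled scaling described for option C can be sketched as two scheduled actions, one that raises capacity before the Monday peak and one that lowers it for the weekend. The table name, capacity numbers, and times are hypothetical. The dictionaries are shaped like the keyword arguments of the Application Auto Scaling `put_scheduled_action` call (for example through boto3), on the assumption that the table has already been registered as a scalable target; no AWS call is made here.

```python
def scheduled_action(name, table, cron, min_capacity, max_capacity,
                     dimension="dynamodb:table:WriteCapacityUnits"):
    """Build the arguments for one scheduled scaling action on a DynamoDB table."""
    if min_capacity > max_capacity:
        raise ValueError("min_capacity cannot exceed max_capacity")
    return {
        "ServiceNamespace": "dynamodb",
        "ScheduledActionName": name,
        "ResourceId": f"table/{table}",
        "ScalableDimension": dimension,
        "Schedule": f"cron({cron})",  # fields: minute hour day-of-month month day-of-week year
        "ScalableTargetAction": {
            "MinCapacity": min_capacity,
            "MaxCapacity": max_capacity,
        },
    }


# Hypothetical schedule: scale up early Monday (UTC), scale down on Saturday
monday_peak = scheduled_action("monday-ramp-up", "AppTable", "0 5 ? * MON *", 500, 1000)
weekend_low = scheduled_action("weekend-scale-down", "AppTable", "0 0 ? * SAT *", 5, 50)

if __name__ == "__main__":
    print(monday_peak)
    print(weekend_low)
```

Raising the minimum capacity shortly before the known spike is what keeps Monday morning performance consistent; a matching pair of actions would be needed for read capacity as well.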
Question # 29
A company is planning to migrate on-premises Apache Hadoop clusters to Amazon EMR. The company also needs to migrate a data catalog into a persistent storage solution. The company currently stores the data catalog in an on-premises Apache Hive metastore on the Hadoop clusters. The company requires a serverless solution to migrate the data catalog. Which solution will meet these requirements MOST cost-effectively?
A. Use AWS Database Migration Service (AWS DMS) to migrate the Hive metastore into Amazon S3. Configure AWS Glue Data Catalog to scan Amazon S3 to produce the data catalog. B. Configure a Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use AWS Glue Data Catalog to store the company's data catalog as an external data catalog. C. Configure an external Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use Amazon Aurora MySQL to store the company's data catalog. D. Configure a new Hive metastore in Amazon EMR. Migrate the existing on-premises Hive metastore into Amazon EMR. Use the new metastore as the company's data catalog.
Answer: A Explanation: AWS Database Migration Service (AWS DMS) is a service that helps you migrate databases to AWS quickly and securely. You can use AWS DMS to migrate the Hive metastore from the on-premises Hadoop clusters into Amazon S3, which is a highly scalable, durable, and cost-effective object storage service. AWS Glue Data Catalog is a serverless, managed service that acts as a central metadata repository for your data assets. You can use AWS Glue Data Catalog to scan the Amazon S3 bucket that contains the migrated Hive metastore and create a data catalog that is compatible with Apache Hive and other AWS services. This solution meets the requirements of migrating the data catalog into a persistent storage solution and using a serverless solution. This solution is also the most cost-effective, as it does not incur any additional charges for running Amazon EMR or Amazon Aurora MySQL clusters. The other options are either not feasible or not optimal. Configuring a Hive metastore in Amazon EMR (option B) or an external Hive metastore in Amazon EMR (option C) would require running and maintaining Amazon EMR clusters, which would incur additional costs and complexity. Using Amazon Aurora MySQL to store the company’s data catalog (option C) would also incur additional costs and complexity, as well as introduce compatibility issues with Apache Hive. Configuring a new Hive metastore in Amazon EMR (option D) would not migrate the existing data catalog, but create a new one, which would result in data loss and inconsistency.
References:
Using AWS Database Migration Service
Populating the AWS Glue Data Catalog
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 4: Data Analysis and Visualization, Section 4.2: AWS Glue Data Catalog
Question # 30
A company loads transaction data for each day into Amazon Redshift tables at the end of each day. The company wants to have the ability to track which tables have been loaded and which tables still need to be loaded. A data engineer wants to store the load statuses of Redshift tables in an Amazon DynamoDB table. The data engineer creates an AWS Lambda function to publish the details of the load statuses to DynamoDB. How should the data engineer invoke the Lambda function to write load statuses to the DynamoDB table?
A. Use a second Lambda function to invoke the first Lambda function based on Amazon CloudWatch events. B. Use the Amazon Redshift Data API to publish an event to Amazon EventBridge. Configure an EventBridge rule to invoke the Lambda function. C. Use the Amazon Redshift Data API to publish a message to an Amazon Simple Queue Service (Amazon SQS) queue. Configure the SQS queue to invoke the Lambda function. D. Use a second Lambda function to invoke the first Lambda function based on AWS CloudTrail events.
Answer: B Explanation: The Amazon Redshift Data API enables you to interact with your Amazon Redshift data warehouse in an easy and secure way. You can use the Data API to run SQL commands, such as loading data into tables, without requiring a persistent connection to the cluster. The Data API also integrates with Amazon EventBridge, which allows you to monitor the execution status of your SQL commands and trigger actions based on events. By using the Data API to publish an event to EventBridge, the data engineer can invoke the Lambda function that writes the load statuses to the DynamoDB table. This solution is scalable, reliable, and cost-effective. The other options are either not possible or not optimal. You cannot use a second Lambda function to invoke the first Lambda function based on CloudWatch or CloudTrail events, as these services do not capture the load status of Redshift tables. You can use the Data API to publish a message to an SQS queue, but this would require additional configuration and polling logic to invoke the Lambda function from the queue. This would also introduce additional latency and cost.
References:
Using the Amazon Redshift Data API
Using Amazon EventBridge with Amazon Redshift
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 2: Data Store Management, Section 2.2: Amazon Redshift
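The flow in the explanation (a Data API statement emits an event, an EventBridge rule matches it, and Lambda records the status) can be sketched as an event pattern plus the mapping step inside the handler. Treat the details as assumptions: the `source` and `detail-type` strings and the `detail` field names reflect how Data API status events are commonly shown, the sample event is invented, and using the statement name to carry the table name is a convention chosen for this sketch. The DynamoDB write itself is omitted.

```python
# EventBridge rule pattern for Data API statement status events (assumed values)
EVENT_PATTERN = {
    "source": ["aws.redshift-data"],
    "detail-type": ["Redshift Data Statement Status Change"],
}


def to_status_item(event):
    """Map a status-change event to the item the Lambda function would store."""
    detail = event["detail"]
    return {
        "table_name": detail["statementName"],  # convention: statement named after the table
        "load_status": detail["state"],         # e.g. FINISHED or FAILED
        "statement_id": detail["statementId"],
        "event_time": event["time"],
    }


# Hypothetical sample event, trimmed to the fields used above
SAMPLE_EVENT = {
    "source": "aws.redshift-data",
    "detail-type": "Redshift Data Statement Status Change",
    "time": "2025-01-31T23:59:00Z",
    "detail": {
        "statementName": "daily_orders",
        "statementId": "example-statement-id",
        "state": "FINISHED",
    },
}

if __name__ == "__main__":
    print(to_status_item(SAMPLE_EVENT))
```

In a deployed function, the returned dictionary would be written to the DynamoDB table, so each load leaves one status record keyed by table name.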
Question # 31
A data engineer needs to build an extract, transform, and load (ETL) job. The ETL job will process daily incoming .csv files that users upload to an Amazon S3 bucket. The size of each S3 object is less than 100 MB. Which solution will meet these requirements MOST cost-effectively?
A. Write a custom Python application. Host the application on an Amazon Elastic Kubernetes Service (Amazon EKS) cluster. B. Write a PySpark ETL script. Host the script on an Amazon EMR cluster. C. Write an AWS Glue PySpark job. Use Apache Spark to transform the data. D. Write an AWS Glue Python shell job. Use pandas to transform the data.
Answer: D Explanation: AWS Glue is a fully managed serverless ETL service that can handle various data sources and formats, including .csv files in Amazon S3. AWS Glue provides two types of jobs: PySpark and Python shell. PySpark jobs use Apache Spark to process large-scale data in parallel, while Python shell jobs use Python scripts to process small-scale data in a single execution environment. For this requirement, a Python shell job is more suitable and cost-effective, as the size of each S3 object is less than 100 MB, which does not require distributed processing. A Python shell job can use pandas, a popular Python library for data analysis, to transform the .csv data as needed. The other solutions are not optimal or relevant for this requirement. Writing a custom Python application and hosting it on an Amazon EKS cluster would require more effort and resources to set up and manage the Kubernetes environment, as well as to handle the data ingestion and transformation logic. Writing a PySpark ETL script and hosting it on an Amazon EMR cluster would also incur more costs and complexity to provision and configure the EMR cluster, as well as to use Apache Spark for processing small data files. Writing an AWS Glue PySpark job would also be less efficient and economical than a Python shell job, as it would involve unnecessary overhead and charges for using Apache Spark for small data files.
References:
AWS Glue
Working with Python Shell Jobs
pandas
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
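To show why a file under 100 MB does not need distributed processing, here is a small single-process transform. The explanation names pandas; this sketch deliberately uses the standard library `csv` module instead so that it runs without dependencies, and the column names and cleaning rules are hypothetical. The shape of the work (read one small file, transform rows in memory, write the result) is the same either way.

```python
import csv
import io


def transform_csv(text):
    """Normalize customer names and amounts; drop rows that have no amount."""
    reader = csv.DictReader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=["customer", "amount_usd"],
                            lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if not row["amount"].strip():
            continue  # skip rows with a missing amount
        writer.writerow({
            "customer": row["customer"].strip().upper(),
            "amount_usd": f"{float(row['amount']):.2f}",
        })
    return out.getvalue()


if __name__ == "__main__":
    print(transform_csv("customer,amount\n alice ,10\nbob,\ncarol,2.5\n"))
```

In a Glue Python shell job the input text would come from the uploaded S3 object and the output would be written back to S3; those I/O steps are left out here.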
Question # 32
A financial company wants to implement a data mesh. The data mesh must support centralized data governance, data analysis, and data access control. The company has decided to use AWS Glue for data catalogs and extract, transform, and load (ETL) operations. Which combination of AWS services will implement a data mesh? (Choose two.)
A. Use Amazon Aurora for data storage. Use an Amazon Redshift provisioned cluster for data analysis.
B. Use Amazon S3 for data storage. Use Amazon Athena for data analysis.
C. Use AWS Glue DataBrew for centralized data governance and access control.
D. Use Amazon RDS for data storage. Use Amazon EMR for data analysis.
E. Use AWS Lake Formation for centralized data governance and access control.
Answer: B,E
Explanation: A data mesh is an architectural framework that organizes data into domains and treats data as products that are owned and offered for consumption by different teams [1]. A data mesh requires a centralized layer for data governance and access control, as well as a distributed layer for data storage and analysis. AWS Glue can provide data catalogs and ETL operations for the data mesh, but it cannot provide data governance and access control by itself [2]. Therefore, the company needs to use another AWS service for this purpose. AWS Lake Formation is a service that allows you to create, secure, and manage data lakes on AWS [3]. It integrates with AWS Glue and other AWS services to provide centralized data governance and access control for the data mesh. Therefore, option E is correct.
For data storage and analysis, the company can choose from different AWS services depending on its needs and preferences. However, one of the benefits of a data mesh is that it enables data to be stored and processed in a decoupled and scalable way [1]. Therefore, serverless or managed services that can handle large volumes and varieties of data are preferable. Amazon S3 is a highly scalable, durable, and secure object storage service that can store any type of data. Amazon Athena is a serverless interactive query service that can analyze data in Amazon S3 using standard SQL. Therefore, option B is a good choice for data storage and analysis in a data mesh.
Options A and D are not optimal because they either use relational databases that are not suitable for storing diverse and unstructured data, or they require more management and provisioning than serverless services. Option C is incorrect because AWS Glue DataBrew is a visual data preparation tool, not a governance or access control service.
References:
[1] What is a Data Mesh? - Data Mesh Architecture Explained - AWS
[2] AWS Glue - Developer Guide
[3] AWS Lake Formation - Features
[4] Design a data mesh architecture using AWS Lake Formation and AWS Glue
[5] Amazon S3 - Features
[6] Amazon Athena - Features
Question # 33
A company has a frontend ReactJS website that uses Amazon API Gateway to invoke REST APIs. The APIs perform the functionality of the website. A data engineer needs to write a Python script that can be occasionally invoked through API Gateway. The code must return results to API Gateway. Which solution will meet these requirements with the LEAST operational overhead?
A. Deploy a custom Python script on an Amazon Elastic Container Service (Amazon ECS) cluster.
B. Create an AWS Lambda Python function with provisioned concurrency.
C. Deploy a custom Python script that can integrate with API Gateway on Amazon Elastic Kubernetes Service (Amazon EKS).
D. Create an AWS Lambda function. Ensure that the function is warm by scheduling an Amazon EventBridge rule to invoke the Lambda function every 5 minutes by using mock events.
Answer: B
Explanation: AWS Lambda is a serverless compute service that lets you run code without provisioning or managing servers. You can use Lambda to create functions that perform custom logic and integrate with other AWS services, such as API Gateway. Lambda automatically scales your application by running code in response to each trigger. You pay only for the compute time you consume [1].
Amazon ECS is a fully managed container orchestration service that allows you to run and scale containerized applications on AWS. You can use ECS to deploy, manage, and scale Docker containers using either Amazon EC2 instances or AWS Fargate, a serverless compute engine for containers [2].
Amazon EKS is a fully managed Kubernetes service that allows you to run Kubernetes
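For the Lambda option, the requirement that "the code must return results to API Gateway" comes down to the shape of the handler's return value. The following is a minimal sketch assuming a Lambda proxy integration; the `name` query parameter and the greeting payload are hypothetical.

```python
import json


def lambda_handler(event, context):
    """Entry point that API Gateway invokes through a Lambda proxy integration."""
    # Query string parameters arrive as a dict, or as None when there are none.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")

    # A proxy integration expects a status code, headers, and a string body.
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```

There is no server, container, or cluster definition anywhere in this sketch, which is the operational-overhead difference from the ECS and EKS options.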
Question # 34
A company uses Amazon Redshift for its data warehouse. The company must automate refresh schedules for Amazon Redshift materialized views. Which solution will meet this requirement with the LEAST effort?
A. Use Apache Airflow to refresh the materialized views.
B. Use an AWS Lambda user-defined function (UDF) within Amazon Redshift to refresh the materialized views.
C. Use the query editor v2 in Amazon Redshift to refresh the materialized views.
D. Use an AWS Glue workflow to refresh the materialized views.
Answer: C
Explanation: The query editor v2 in Amazon Redshift is a web-based tool that allows users to run SQL queries and scripts on Amazon Redshift clusters. The query editor v2 supports creating and managing materialized views, which are precomputed results of a query that can improve the performance of subsequent queries. The query editor v2 also supports scheduling queries to run at specified intervals, which can be used to refresh materialized views automatically. This solution requires the least effort, as it does not involve any additional services, coding, or configuration.
The other solutions are more complex and require more operational overhead. Apache Airflow is an open-source platform for orchestrating workflows, which can be used to refresh materialized views, but it requires setting up and managing an Airflow environment, creating DAGs (directed acyclic graphs) to define the workflows, and integrating with Amazon Redshift. AWS Lambda is a serverless compute service that can run code in response to events, which can be used to refresh materialized views, but it requires creating and deploying Lambda functions, defining UDFs within Amazon Redshift, and triggering the functions using events or schedules. AWS Glue is a fully managed ETL service that can run jobs to transform and load data, which can be used to refresh materialized views, but it requires creating and configuring Glue jobs, defining Glue workflows to orchestrate the jobs, and scheduling the workflows using triggers.
References: Query editor v2; Working with materialized views; Scheduling queries; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
Question # 35
A financial services company stores financial data in Amazon Redshift. A data engineer wants to run real-time queries on the financial data to support a web-based trading application. The data engineer wants to run the queries from within the trading application. Which solution will meet these requirements with the LEAST operational overhead?
A. Establish WebSocket connections to Amazon Redshift.
B. Use the Amazon Redshift Data API.
C. Set up Java Database Connectivity (JDBC) connections to Amazon Redshift.
D. Store frequently accessed data in Amazon S3. Use Amazon S3 Select to run the queries.
Answer: B
Explanation: The Amazon Redshift Data API is a built-in feature that allows you to run SQL queries on Amazon Redshift data from web services-based applications, such as AWS Lambda, Amazon SageMaker notebooks, and AWS Cloud9. The Data API does not require a persistent connection to your database, and it provides a secure HTTP endpoint and integration with AWS SDKs. You can use the endpoint to run SQL statements without managing connections. The Data API also supports both Amazon Redshift provisioned clusters and Redshift Serverless workgroups. The Data API is the best solution for running real-time queries on the financial data from within the trading application, as it has the least operational overhead compared to the other options.
Option A is not the best solution, as establishing WebSocket connections to Amazon Redshift would require more configuration and maintenance than using the Data API. WebSocket connections are also not supported by Amazon Redshift clusters or serverless workgroups.
Option C is not the best solution, as setting up JDBC connections to Amazon Redshift would also require more configuration and maintenance than using the Data API. With JDBC, the application must manage drivers, network access to the database endpoint, and persistent connections.
Option D is not the best solution, as storing frequently accessed data in Amazon S3 and using Amazon S3 Select to run the queries would introduce more latency and complexity than using the Data API. Amazon S3 Select is also not optimized for real-time queries, as it scans the entire object before returning the results.
References: Using the Amazon Redshift Data API; Calling the Data API; Amazon Redshift Data API Reference; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
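The connectionless pattern described for the Data API can be sketched as follows. This is an illustrative outline, not a production client: it assumes `client` is a Boto3 `redshift-data` client (for example `boto3.client("redshift-data")`) targeting a Redshift Serverless workgroup, and the workgroup and database names are placeholders supplied by the caller. Result pagination is left out for brevity.

```python
import time


def run_sql(client, sql, workgroup, database, poll_seconds=0.5):
    """Submit SQL through the Redshift Data API and return the result rows."""
    # execute_statement is asynchronous: it returns a statement Id immediately.
    statement_id = client.execute_statement(
        WorkgroupName=workgroup, Database=database, Sql=sql
    )["Id"]

    # Poll over HTTPS until the statement reaches a terminal status.
    while True:
        status = client.describe_statement(Id=statement_id)["Status"]
        if status == "FINISHED":
            break
        if status in ("FAILED", "ABORTED"):
            raise RuntimeError(f"statement {statement_id} ended as {status}")
        time.sleep(poll_seconds)

    records = client.get_statement_result(Id=statement_id)["Records"]
    # Each field is a one-key dict such as {"stringValue": "AMZN"};
    # SQL NULL is reported as {"isNull": True}.
    return [
        [None if field.get("isNull") else next(iter(field.values())) for field in row]
        for row in records
    ]
```

The application never opens or pools a database connection; every step is a signed API call. For a provisioned cluster, `ClusterIdentifier` would be passed in place of `WorkgroupName`.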
Question # 36
A data engineer must orchestrate a series of Amazon Athena queries that will run every day. Each query can run for more than 15 minutes. Which combination of steps will meet these requirements MOST cost-effectively? (Choose two.)
A. Use an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
B. Create an AWS Step Functions workflow and add two states. Add the first state before the Lambda function. Configure the second state as a Wait state to periodically check whether the Athena query has finished using the Athena Boto3 get_query_execution API call. Configure the workflow to invoke the next query when the current query has finished running.
C. Use an AWS Glue Python shell job and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically.
D. Use an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully. Configure the Python shell script to invoke the next query when the current query has finished running.
E. Use Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch.
Answer: A,B
Explanation: Options A and B are the correct answers because they meet the requirements most cost-effectively. Using an AWS Lambda function and the Athena Boto3 client start_query_execution API call to invoke the Athena queries programmatically is a simple and scalable way to orchestrate the queries. Creating an AWS Step Functions workflow and adding two states to check the query status and invoke the next query is a reliable and efficient way to handle the long-running queries.
Option C is incorrect because using an AWS Glue Python shell job to invoke the Athena queries programmatically is more expensive than using a Lambda function, as it requires provisioning and running a Glue job for each query.
Option D is incorrect because using an AWS Glue Python shell script to run a sleep timer that checks every 5 minutes to determine whether the current Athena query has finished running successfully is not a cost-effective or reliable way to orchestrate the queries, as it wastes resources and time.
Option E is incorrect because using Amazon Managed Workflows for Apache Airflow (Amazon MWAA) to orchestrate the Athena queries in AWS Batch is excessive and introduces unnecessary complexity and cost, as it requires setting up and managing an Airflow environment and an AWS Batch compute environment.
References:
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 5: Data Orchestration, Section 5.2: AWS Lambda, Section 5.3: AWS Step Functions, Pages 125-135
Building Batch Data Analytics Solutions on AWS, Module 5: Data Orchestration, Lesson 5.1: AWS Lambda, Lesson 5.2: AWS Step Functions, Pages 1-15
AWS Documentation Overview, AWS Lambda Developer Guide, Working with AWS Lambda Functions, Configuring Function Triggers, Using AWS Lambda with Amazon Athena, Pages 1-4
AWS Documentation Overview, AWS Step Functions Developer Guide, Getting Started, Tutorial: Create a Hello World Workflow, Pages 1-8
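The split between options A and B can be sketched as two short Lambda-sized functions. Lambda caps a single invocation at 15 minutes, so neither function waits for the query; the waiting happens in the Step Functions Wait state between them. In this sketch `athena` is assumed to be a Boto3 Athena client (for example `boto3.client("athena")`), and the `Done` flag is a hypothetical field added so that a Choice state has something simple to branch on.

```python
def start_query(athena, sql, database, output_location):
    """First task: submit the query and hand its execution id to the workflow."""
    response = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return {"QueryExecutionId": response["QueryExecutionId"]}


def check_query(athena, query_execution_id):
    """Second task: report the current state without blocking on the query."""
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
    state = execution["QueryExecution"]["Status"]["State"]
    # QUEUED and RUNNING mean the workflow should loop back to the Wait state.
    return {
        "QueryExecutionId": query_execution_id,
        "State": state,
        "Done": state in ("SUCCEEDED", "FAILED", "CANCELLED"),
    }
```

Each invocation finishes in well under a second, so the cost is a handful of short Lambda calls plus state transitions, rather than compute that sits idle while Athena works.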
Question # 37
A data engineer must use AWS services to ingest a dataset into an Amazon S3 data lake. The data engineer profiles the dataset and discovers that the dataset contains personally identifiable information (PII). The data engineer must implement a solution to profile the dataset and obfuscate the PII. Which solution will meet this requirement with the LEAST operational effort?
A. Use an Amazon Kinesis Data Firehose delivery stream to process the dataset. Create an AWS Lambda transform function to identify the PII. Use an AWS SDK to obfuscate the PII. Set the S3 data lake as the target for the delivery stream.
B. Use the Detect PII transform in AWS Glue Studio to identify the PII. Obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
C. Use the Detect PII transform in AWS Glue Studio to identify the PII. Create a rule in AWS Glue Data Quality to obfuscate the PII. Use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake.
D. Ingest the dataset into Amazon DynamoDB. Create an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data. Use the same Lambda function to ingest the data into the S3 data lake.
Answer: C
Explanation: AWS Glue is a fully managed service that provides a serverless data integration platform for data preparation, data cataloging, and data loading. AWS Glue Studio is a graphical interface that allows you to easily author, run, and monitor AWS Glue ETL jobs. AWS Glue Data Quality is a feature that enables you to validate, cleanse, and enrich your data using predefined or custom rules. AWS Step Functions is a service that allows you to coordinate multiple AWS services into serverless workflows.
Using the Detect PII transform in AWS Glue Studio, you can automatically identify and label the PII in your dataset, such as names, addresses, phone numbers, and email addresses. You can then create a rule in AWS Glue Data Quality to obfuscate the PII, such as masking, hashing, or replacing the values with dummy data. You can also use other rules to validate and cleanse your data, such as checking for null values, duplicates, and outliers. You can then use an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake. You can use AWS Glue DataBrew to visually explore and transform the data, AWS Glue crawlers to discover and catalog the data, and AWS Glue jobs to load the data into the S3 data lake.
This solution will meet the requirement with the least operational effort, as it leverages the serverless and managed capabilities of AWS Glue, AWS Glue Studio, AWS Glue Data Quality, and AWS Step Functions. You do not need to write any code to identify or obfuscate the PII, as you can use the built-in transforms and rules in AWS Glue Studio and AWS Glue Data Quality. You also do not need to provision or manage any servers or clusters, as AWS Glue and AWS Step Functions scale automatically based on demand.
The other options are not as efficient as using the Detect PII transform in AWS Glue Studio, creating a rule in AWS Glue Data Quality, and using an AWS Step Functions state machine.
Option A (using an Amazon Kinesis Data Firehose delivery stream to process the dataset, creating an AWS Lambda transform function to identify the PII, using an AWS SDK to obfuscate the PII, and setting the S3 data lake as the target for the delivery stream) will require more operational effort, as you will need to write and maintain code to identify and obfuscate the PII, as well as manage the Lambda function and its resources.
Option B (using the Detect PII transform in AWS Glue Studio to identify the PII, obfuscating the PII, and using an AWS Step Functions state machine to orchestrate a data pipeline to ingest the data into the S3 data lake) will not be as effective as creating a rule in AWS Glue Data Quality to obfuscate the PII, as you will need to manually obfuscate the PII after identifying it, which can be error-prone and time-consuming.
Option D (ingesting the dataset into Amazon DynamoDB, creating an AWS Lambda function to identify and obfuscate the PII in the DynamoDB table and to transform the data, and using the same Lambda function to ingest the data into the S3 data lake) will require more operational effort, as you will need to write and maintain code to identify and obfuscate the PII, as well as manage the Lambda function and its resources. You will also incur additional cost and complexity by using DynamoDB as an intermediate data store, which may not be necessary for your use case.
References: AWS Glue; AWS Glue Studio; AWS Glue Data Quality; AWS Step Functions; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 6: Data Integration and Transformation, Section 6.1: AWS Glue
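The explanation names masking and hashing as obfuscation techniques. The sketch below shows the difference between the two on email addresses using only the Python standard library. It is not the AWS Glue Detect PII transform or any Glue API; the regular expression, the salt, and the function names are hypothetical and far simpler than a managed PII detector.

```python
import hashlib
import re

# Deliberately simple pattern; real PII detection covers many more entity types.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask_email(match):
    """Keep the first character and the domain; star out the rest of the local part."""
    local, domain = match.group(0).split("@", 1)
    return local[0] + "*" * (len(local) - 1) + "@" + domain


def hash_value(value, salt="example-salt"):
    """Replace a value with a salted SHA-256 digest (stable, so joins still work)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


def obfuscate_emails(text, mode="mask"):
    """Obfuscate every email address found in text by masking or hashing."""
    if mode == "mask":
        return EMAIL_PATTERN.sub(mask_email, text)
    return EMAIL_PATTERN.sub(lambda match: hash_value(match.group(0)), text)


print(obfuscate_emails("contact alice@example.com now"))
```

Masking keeps the value human-readable but partially hidden, while hashing removes readability and preserves equality, which matters if downstream jobs still need to group or join on the column.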
Question # 38
During a security review, a company identified a vulnerability in an AWS Glue job. The company discovered that credentials to access an Amazon Redshift cluster were hard coded in the job script. A data engineer must remediate the security vulnerability in the AWS Glue job. The solution must securely store the credentials. Which combination of steps should the data engineer take to meet these requirements? (Choose two.)
A. Store the credentials in the AWS Glue job parameters.
B. Store the credentials in a configuration file that is in an Amazon S3 bucket.
C. Access the credentials from a configuration file that is in an Amazon S3 bucket by using the AWS Glue job.
D. Store the credentials in AWS Secrets Manager.
E. Grant the AWS Glue job IAM role access to the stored credentials.
Answer: D,E
Explanation: AWS Secrets Manager is a service that allows you to securely store and manage secrets, such as database credentials, API keys, and passwords. You can use Secrets Manager to encrypt, rotate, and audit your secrets, as well as to control access to them using fine-grained policies. AWS Glue is a fully managed service that provides a serverless data integration platform for data preparation, data cataloging, and data loading. AWS Glue jobs allow you to transform and load data from various sources into various targets, using either a graphical interface (AWS Glue Studio) or a code-based interface (AWS Glue console or AWS Glue API).
Storing the credentials in AWS Secrets Manager and granting the AWS Glue job IAM role access to the stored credentials will meet the requirements, as it will remediate the security vulnerability in the AWS Glue job and securely store the credentials. By using AWS Secrets Manager, you can avoid hard coding the credentials in the job script, which is a bad practice that exposes the credentials to unauthorized access or leakage. Instead, you can store the credentials as a secret in Secrets Manager and reference the secret name or ARN in the job script. You can also use Secrets Manager to encrypt the credentials using AWS Key Management Service (AWS KMS), rotate the credentials automatically or on demand, and monitor access to the credentials using AWS CloudTrail. By granting the AWS Glue job IAM role access to the stored credentials, you can apply the principle of least privilege to ensure that only the AWS Glue job can retrieve the credentials from Secrets Manager. You can also use resource-based or tag-based policies to further restrict access to the credentials.
The other options are not as secure as storing the credentials in AWS Secrets Manager and granting the AWS Glue job IAM role access to the stored credentials. Storing the credentials in the AWS Glue job parameters will not remediate the security vulnerability, as the job parameters are still visible in the AWS Glue console and API. Storing the credentials in a configuration file that is in an Amazon S3 bucket and accessing the credentials from the configuration file by using the AWS Glue job will not be as secure as using Secrets Manager, as the configuration file may not be encrypted or rotated, and access to the file may not be audited or controlled.
References: AWS Secrets Manager; AWS Glue; AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide, Chapter 6: Data Integration and Transformation, Section 6.1: AWS Glue
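As an illustration of "reference the secret name or ARN in the job script", the sketch below fetches credentials at run time instead of embedding them. It assumes `secrets_client` is a Boto3 Secrets Manager client (for example `boto3.client("secretsmanager")`), that the job's IAM role is allowed to call `secretsmanager:GetSecretValue` on the secret, and that the secret is stored as JSON with `username`, `password`, `host`, and `port` keys. The secret name in the usage comment is hypothetical.

```python
import json


def load_redshift_credentials(secrets_client, secret_id):
    """Fetch a JSON secret at run time and return only the fields the job needs."""
    response = secrets_client.get_secret_value(SecretId=secret_id)
    secret = json.loads(response["SecretString"])
    return {
        "username": secret["username"],
        "password": secret["password"],
        "host": secret["host"],
        "port": int(secret["port"]),
    }


# Hypothetical usage inside the Glue job script:
#   creds = load_redshift_credentials(boto3.client("secretsmanager"), "prod/redshift/etl")
```

The script now contains only a secret identifier, so rotating the password in Secrets Manager requires no change to the job code.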
Question # 39
A company needs to partition the Amazon S3 storage that the company uses for a data lake. The partitioning will use a path of the S3 object keys in the following format: s3://bucket/prefix/year=2023/month=01/day=01. A data engineer must ensure that the AWS Glue Data Catalog synchronizes with the S3 storage when the company adds new partitions to the bucket. Which solution will meet these requirements with the LEAST latency?
A. Schedule an AWS Glue crawler to run every morning.
B. Manually run the AWS Glue CreatePartition API twice each day.
C. Use code that writes data to Amazon S3 to invoke the Boto3 AWS Glue create partition API call.
D. Run the MSCK REPAIR TABLE command from the AWS Glue console.
Answer: C
Explanation: The best solution to ensure that the AWS Glue Data Catalog synchronizes with the S3 storage when the company adds new partitions to the bucket with the least latency is to use code that writes data to Amazon S3 to invoke the Boto3 AWS Glue create partition API call. This way, the Data Catalog is updated as soon as new data is written to S3, and the partition information is immediately available for querying by other services. The Boto3 AWS Glue create partition API call allows you to create a new partition in the Data Catalog by specifying the table name, the database name, and the partition values [1]. You can use this API call in your code that writes data to S3, such as a Python script or an AWS Glue ETL job, to create a partition for each new S3 object key that matches the partitioning scheme.
Option A is not the best solution, as scheduling an AWS Glue crawler to run every morning would introduce a significant latency between the time new data is written to S3 and the time the Data Catalog is updated. AWS Glue crawlers are processes that connect to a data store, progress through a prioritized list of classifiers to determine the schema for your data, and then create metadata tables in the Data Catalog [2]. Crawlers can be scheduled to run periodically, such as daily or hourly, but they cannot run continuously or in real time. Therefore, using a crawler to synchronize the Data Catalog with the S3 storage would not meet the requirement of the least latency.
Option B is not the best solution, as manually running the AWS Glue CreatePartition API twice each day would also introduce a significant latency between the time new data is written to S3 and the time the Data Catalog is updated. Moreover, manually running the API would require more operational overhead and human intervention than using code that writes data to S3 to invoke the API automatically.
Option D is not the best solution, as running the MSCK REPAIR TABLE command from the AWS Glue console would also introduce a significant latency between the time new data is written to S3 and the time the Data Catalog is updated. The MSCK REPAIR TABLE command is a SQL command that you can run in the AWS Glue console to add partitions to the Data Catalog based on the S3 object keys that match the partitioning scheme [3]. However, this command is not meant to be run frequently or in real time, as it can take a long time to scan the entire S3 bucket and add the partitions. Therefore, using this command to synchronize the Data Catalog with the S3 storage would not meet the requirement of the least latency.
References:
[1] AWS Glue CreatePartition API
[2] Populating the AWS Glue Data Catalog
[3] MSCK REPAIR TABLE Command
AWS Certified Data Engineer - Associate DEA-C01 Complete Study Guide
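The write-then-register approach in option C could look roughly like the sketch below. It assumes `glue` is a Boto3 Glue client (for example `boto3.client("glue")`), that the table already exists in the Data Catalog with `year`, `month`, and `day` partition keys in that order, and that the table's location is the S3 prefix above the `year=` folders. The function names are hypothetical.

```python
import re

# Matches the year=YYYY/month=MM/day=DD layout given in the question.
PARTITION_PATTERN = re.compile(r"year=(\d{4})/month=(\d{2})/day=(\d{2})")


def partition_values(s3_key):
    """Extract [year, month, day] from an object key that follows the layout."""
    match = PARTITION_PATTERN.search(s3_key)
    if match is None:
        raise ValueError(f"key does not follow the partition layout: {s3_key}")
    return list(match.groups())


def register_partition(glue, database, table, s3_key):
    """Register the partition for a newly written object in the Data Catalog."""
    values = partition_values(s3_key)
    year, month, day = values
    # Reuse the table's format settings and point them at the partition folder.
    table_def = glue.get_table(DatabaseName=database, Name=table)["Table"]
    descriptor = dict(table_def["StorageDescriptor"])
    base = descriptor["Location"].rstrip("/")
    descriptor["Location"] = f"{base}/year={year}/month={month}/day={day}/"
    glue.create_partition(
        DatabaseName=database,
        TableName=table,
        PartitionInput={"Values": values, "StorageDescriptor": descriptor},
    )
    return values
```

A writer would call `register_partition` right after a successful upload, so the catalog lags the data by one API call. Glue rejects a second registration of the same partition values, so a production writer would also need to handle that error when several objects land in the same day's folder.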
Are there pre-requisites for the AWS Certified Data Engineer - Associate exam?
No, there are no pre-requisites. The recommended experience prior to taking this exam is the equivalent of 2-3 years in data engineering or data architecture and a minimum of 1-2 years of hands-on experience with AWS services.
How will the AWS Certified Data Engineer - Associate help my career?
This is an in-demand role with a low supply of skilled professionals. AWS Certified Data Engineer - Associate and accompanying prep resources offer you a means to build your confidence and credibility in data engineer, data architect, and other data-related roles.
What certification(s) should I earn next after AWS Certified Data Engineer - Associate?
The AWS Certified Security - Specialty certification is a recommended next step for cloud data professionals to validate their expertise in cloud data security and governance.
How long is this certification valid for?
This certification is valid for 3 years. Before your certification expires, you can recertify by passing the latest version of this exam.
Are there any savings on exams if I already hold an active AWS Certification?
Yes. Once you earn one AWS Certification, you get a 50% discount on your next AWS Certification exam. You can sign in and access this discount in your AWS Certification Account.
Customers Feedback
What our clients say about Data-Engineer-Associate Exam Materials
Lucas Mitchell
Jan 02, 2025
The Data-Engineer-Associate braindumps provided by Salesforcexamdumps are nothing short of excellent for AWS Data-Engineer-Associate Exam Preparation. I had complete trust in their originality and relevance throughout the Data-Engineer-Associate Certification Guide. Covering all necessary topics with detailed explanations, I wholeheartedly recommend these to anyone serious about passing.
Faraz Kashyap
Jan 01, 2025
A heartfelt thank you to Salesforcexamdumps! My success story stands as undeniable proof of the credibility and effectiveness of their Data-Engineer-Associate Exam Tips. These dumps guided me seamlessly through every topic, making the exam appear effortlessly conquerable.
Shobha Sibal
Jan 01, 2025
Salesforcexamdumps is a hidden gem for Data-Engineer-Associate exam preparation. The offers they have are unbeatable. I love that they offer the study materials in both PDF and test engine formats. Plus, their money-back guarantee shows they stand behind their product. I couldn't be happier with my purchase!
Nilima Mody
Dec 31, 2024
I stumbled upon Salesforcexamdumps, and it's been a revelation. The Data Engineer Associate content is top-notch. The 80% discount is unbelievable, and the money-back guarantee gives you peace of mind. Don't miss out!
Owen King
Dec 31, 2024
The AWS Certified Data Engineer - Associate (DEA-C01) Learning Path offered by Salesforcexamdumps is your ultimate roadmap to success. With comprehensive coverage of AWS Fundamental Knowledge, passing the exam becomes inevitable. I personally tried and endorsed Salesforcexamdumps's Data-Engineer-Associate question answers. You should definitely give them a try!
Jaxson Jackson
Dec 30, 2024
I couldn't resist sharing this glowing AWS Certified Data Engineer - Associate (DEA-C01) Review. Salesforcexamdumps has turned my dreams into reality, and today, I stand as a proud AWS-certified professional. This achievement is all thanks to their Data-Engineer-Associate braindumps, along with their invaluable insights and tips, which gave me the unwavering confidence to succeed.
Damian Brown
Dec 30, 2024
Salesforcexamdumps's AWS Certified Data Engineer - Associate (DEA-C01) Certification Questions are an essential resource to explore. The invaluable AWS Certified Data Engineer - Associate (DEA-C01) Exam Insights were a welcomed bonus to their well-structured Data-Engineer-Associate practice test. I am deeply appreciative of the invaluable help and support I received.
Ragini Loke
Dec 29, 2024
I am incredibly impressed with Salesforcexamdumps! The amazing offers they provide make it extremely affordable to access Data-Engineer-Associate exam materials. The PDF and test engine formats are a game-changer, allowing me to study in the way that suits me best. And the icing on the cake is their money-back guarantee, which gave me the confidence to try it out. Highly recommended!
Ishat Gopal
Dec 29, 2024
Salesforcexamdumps has made my exam preparation a breeze. Their Data Engineer Associate content is comprehensive and well-structured. The 80% discount and money-back guarantee . Don't look elsewhere!
Roman Martin
Dec 28, 2024
A big thanks to Salesforcexamdumps for covering AWS Cloud Compliance and Security in their Data-Engineer-Associate Test Prep. It's thanks to them that I sailed through the exam effortlessly. Their challenging Data-Engineer-Associate dumps were a true reflection of the actual exam, leading me to pass with flying colors.