Final Exam: Data Analyst
Beginner
- 1 video | 32s
- Includes Assessment
- Earns a Badge
Final Exam: Data Analyst will test your knowledge and application of the topics presented throughout the Data Analyst track of the Skillsoft Aspire Data Analyst to Data Scientist Journey.
WHAT YOU WILL LEARN
- List the six phases of the data lifecycle
- Use the ggplot2 library to visualize data using R
- Create vectors in R
- Crawl data stored in a DynamoDB table
- Compare and contrast SQL and NoSQL database solutions
- Standardize a distribution to express its values as z-scores and use pandas to generate a correlation and covariance matrix for your dataset
- Specify the configurations of the MapReduce applications in the driver program and the project's pom.xml file
- Explain the concept of hierarchical index or multi-index and why it can be useful
- Use the NumPy library to manipulate arrays and the pandas library to load and analyze a dataset
- Delete a Google Cloud Dataproc cluster and all of its associated resources
- Configure and view permissions for individual files and directories using the getfacl and chmod commands
- Recall how Apache ZooKeeper enables the HDFS NameNode and YARN ResourceManager to run in high-availability mode
- Load data into a Redshift cluster from S3 buckets
- Describe the options available when iterating over 1-dimensional and multi-dimensional arrays
- Run the application and examine the outputs generated to get the word frequencies in the input text document
- Use the get and getmerge functions to retrieve one or multiple files from HDFS
- Export the contents of a DataFrame into files of various formats
- Using the independent t-test and with a related sample using a paired t-test using the SciPy library
- Create data frames in R
- Create and configure simple graphs with lines and markers using the Matplotlib data visualization library
- Configure HDFS using the hdfs-site.xml file and identify the properties which can be set from it
- Work with the YARN Cluster Manager and HDFS NameNode web applications that come packaged with Hadoop
- Install pandas and create a pandas Series
- Recognize and deal with missing data in R
- Use NumPy to compute the correlation and covariance of two distributions and visualize their relationship with scatterplots
- Recognize the challenges involved in processing big data and the options available to address them, such as vertical and horizontal scaling
- Create and configure a Hadoop cluster on the Google Cloud Platform using its Cloud Dataproc service
- Define the inter-quartile range of a dataset and enumerate its properties
- Draw the shape of a Gaussian distribution and enumerate its defining properties
- Execute the application and verify that the filtering has worked correctly; examine the job and the output files using the YARN Cluster Manager and HDFS NameNode web UIs
- Set up a JDBC connection on Glue to the Redshift cluster
- Import and export data in R
- Deploy DynamoDB in the Amazon Web Services cloud
- Using the mutate method
- Create matrices in R
- Identify different tools available for data management
- Describe the ETL process and different tools available
- Initialize a Spark DataFrame from the contents of an RDD
- Configure a JDBC connection on Glue to the Redshift cluster
- Describe NoSQL stores and how they are used
- Create and load data into an RDD
- Read data from an Excel spreadsheet
- Run ETL scripts using Glue
- Use fancy indexing with arrays using an index mask
- Write a simple Bash script
- Identify the various GCP services used by Dataproc when provisioning a cluster
- Read data from files and write data to files using the Python pandas library
- Define linear regression
- Edit individual cells and entire rows and columns in a pandas DataFrame
- Define the mean of a dataset and enumerate its properties
- Retrieve specific parts of an array using row and column indices
- Recall the steps involved in building a MapReduce application and the specific workings of the map phase in processing each row of data in the input file
- Use NumPy to compute statistics such as the mean and median on your data
- Describe the concept of hierarchical index or multi-index and why it can be useful
- Define the contents of a DataFrame using the SQLContext
- Describe and apply the different techniques involved in handling datasets where some information is missing
- Use the dplyr library to load data frames
- Build and run the application and confirm the output using HDFS from both the command line and the web application
- Transfer files from your local file system to HDFS using the copyFromLocal command
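As a rough illustration of the statistics objectives listed above (the mean, z-scores, and the inter-quartile range), here is a minimal sketch. It is not taken from the course material: it assumes Python's standard `statistics` module rather than pandas or NumPy so that it runs without extra installs, and the names `z_scores`, `interquartile_range`, and `sample` are hypothetical.

```python
import statistics


def z_scores(values):
    """Express each value as a number of standard deviations from the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(v - mean) / std for v in values]


def interquartile_range(values):
    """Return Q3 - Q1, using the 'inclusive' quartile method."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical sample data, chosen only for illustration.
sample = [2, 4, 4, 4, 5, 5, 7, 9]

standardized = z_scores(sample)
iqr = interquartile_range(sample)

print("mean:", statistics.fmean(sample))
print("z-scores:", standardized)
print("IQR:", iqr)
```

A standardized distribution has a mean of 0 and a standard deviation of 1, which is a quick way to check the result. Other quartile conventions exist, so a different library or method may report a slightly different inter-quartile range for the same data.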
EARN A DIGITAL BADGE WHEN YOU COMPLETE THIS COURSE
Skillsoft is providing you the opportunity to earn a digital badge upon successful completion of some of our courses, which can be shared on any social network or business platform.
Digital badges are yours to keep, forever.