Starting a big data project inherently comes with questions. What are the goals of the project? What should you know about your data? And where do you begin? Northeastern alumna Paula Muñoz joined us at a recent event to talk through her process and outline the steps data analysts take when working through data analysis projects.
Step 1: Understanding the Business Issues
When presented with a data project, you will be given a brief outline of the expectations. From that outline, you should identify the key objectives that the business is trying to uncover. You should examine the overall scope of the work, business objectives, information the stakeholders are seeking, the type of analysis they want you to use, and the deliverables (the outputs of the project) they want.
You need to have these elements clearly defined prior to beginning your data analysis project to provide the best deliverable you can. Additionally, it’s important to ask as many questions as you can at the outset of the project because, often, you may not have another chance before the completion of the project.
Step 2: Understanding Your Data Set
There are a variety of tools you can use to organize your data. When presented with a small dataset, you can use Excel, but for heftier jobs, you’ll likely want to use more robust tools to explore and prepare your data. Muñoz suggests R, Python, Alteryx, Tableau Prep, or Tableau Desktop to help prepare your data for its cleaning.
Within these programs, you should identify key variables to help categorize the data. As you go through the data sets, look for errors in the data. These can include omitted data, data that doesn’t logically make sense, duplicate data, and even spelling errors. These errors and missing values need to be identified so you can properly clean your data.
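As a rough illustration of this kind of audit, here is a minimal Python sketch that counts the error types listed above. The sample records, the field names (`id`, `city`, `age`), the 0 to 120 age range, and the `audit` helper are all hypothetical and chosen only for illustration; a real project would adapt the checks to its own variables.

```python
from collections import Counter

# Hypothetical sample records; field names and values are illustrative only.
records = [
    {"id": 1, "city": "Boston", "age": 34},
    {"id": 2, "city": "Bostn", "age": 29},      # likely spelling error
    {"id": 3, "city": "Chicago", "age": None},  # omitted value
    {"id": 4, "city": "Chicago", "age": -5},    # value that makes no logical sense
    {"id": 1, "city": "Boston", "age": 34},     # exact duplicate of the first row
]


def audit(rows, known_cities):
    """Count common data problems without modifying the rows."""
    report = Counter()
    seen = set()
    for row in rows:
        key = (row["id"], row["city"], row["age"])
        if key in seen:
            # Count the repeat once and skip its other checks.
            report["duplicates"] += 1
            continue
        seen.add(key)
        if row["age"] is None:
            report["missing_age"] += 1
        elif not 0 <= row["age"] <= 120:  # assumed plausible range
            report["illogical_age"] += 1
        if row["city"] not in known_cities:
            report["unknown_city"] += 1
    return dict(report)


report = audit(records, known_cities={"Boston", "Chicago"})
print(report)
```

The sketch only reports problems. Deciding how to fix each one is left to the preparation step.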
Step 3: Data Preparation
Once you have organized and identified all the variables in your dataset, you can begin cleaning. In this step, you will impute missing values, create new broad categories for data that doesn’t have a proper place, and remove any duplicates in your data. Imputing the average value for a category where values are missing lets the data be processed more easily without shifting that category’s average.
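The three cleaning actions in this step can be sketched in plain Python. This is only one possible approach under stated assumptions: the rows, the `segment` and `score` fields, the `KNOWN_SEGMENTS` set, the "Other" label, and the `clean` helper are hypothetical names invented for the example.

```python
from statistics import mean

# Hypothetical rows; field names and values are illustrative only.
rows = [
    {"id": 1, "segment": "Retail", "score": 80.0},
    {"id": 2, "segment": "Retail", "score": None},       # missing value
    {"id": 3, "segment": "Wholesale", "score": 60.0},
    {"id": 3, "segment": "Wholesale", "score": 60.0},    # duplicate row
    {"id": 4, "segment": "wholsale??", "score": 70.0},   # no proper category
]
KNOWN_SEGMENTS = {"Retail", "Wholesale"}  # assumed list of valid categories


def clean(rows):
    """Return a cleaned copy of rows; the input list is left untouched."""
    # 1. Remove exact duplicates while preserving the original order.
    unique, seen = [], set()
    for row in rows:
        key = (row["id"], row["segment"], row["score"])
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))

    # 2. Put values without a proper place into a broad "Other" category.
    for row in unique:
        if row["segment"] not in KNOWN_SEGMENTS:
            row["segment"] = "Other"

    # 3. Impute the mean of the observed scores where a score is missing.
    fill = mean(r["score"] for r in unique if r["score"] is not None)
    for row in unique:
        if row["score"] is None:
            row["score"] = fill
    return unique


cleaned = clean(rows)
for row in cleaned:
    print(row)
```

Mean imputation leaves the average of the column unchanged, but it does reduce the spread of the values, so it is worth noting how many values were filled in.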
Step 4: Exploratory Analysis/Modeling
In this step, you will begin building models to test your data and seek out answers to the objectives given. Using different modeling methods, you can determine which is the best for your data. Common models include linear regressions, decision trees, and random forest modeling, among others.
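To make the idea of comparing modeling methods concrete, here is a small sketch that fits a simple linear regression by least squares and compares it with a naive baseline that always predicts the training average. The data, the variable names, and the choice of mean absolute error on the last two points are assumptions made for illustration; decision trees and random forests would normally come from a dedicated modeling library and are not shown here.

```python
from statistics import mean

# Hypothetical data, e.g. ad spend (x) against sales (y); values are illustrative.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]

# Hold out the last two points to compare the candidate models on unseen data.
# (A random split is more common; a fixed split keeps the sketch reproducible.)
train_x, test_x = x[:6], x[6:]
train_y, test_y = y[:6], y[6:]


def fit_line(xs, ys):
    """Ordinary least squares for one predictor; returns (slope, intercept)."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def mae(predictions, actual):
    """Mean absolute error between predictions and actual values."""
    return mean(abs(p - a) for p, a in zip(predictions, actual))


slope, intercept = fit_line(train_x, train_y)
baseline = mean(train_y)  # naive model: always predict the training average

linear_error = mae([slope * v + intercept for v in test_x], test_y)
baseline_error = mae([baseline] * len(test_x), test_y)

print(f"linear regression MAE: {linear_error:.2f}")
print(f"mean baseline MAE:     {baseline_error:.2f}")
```

Whichever candidate has the lower error on the held-out points is the better fit for this particular data, which is the comparison the step describes.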
Step 5: Validation
Once you have crafted your models, you’ll need to assess the results and determine if you have the correct information for your deliverable. Did the models work properly? Does the data need more cleaning? Did you answer the question the client was asking? If not, you may need to go over the previous steps again. You should expect a lot of trial and error!
Step 6: Visualization and Presentation
Once you have met all your deliverables, you can begin your data visualization. In many cases, data visualization will be crucial in communicating your findings to the client. Not all clients are data-savvy, and interactive visualization tools like Tableau are tremendously useful in illustrating your conclusions to them. Being able to tell a story with your data is essential, because a story helps explain the value of your findings to the client.
As with any project, you need to identify your objectives clearly. Outlining your work will ensure you get the best deliverables for your clients. While all of these steps are important, if you start the project without all the data you need, you are likely to have to backtrack.
Follow Paula on Twitter @paulisdataviz