Starting a big data project inherently comes with questions. What are the goals of the project? What should you know about your data? And where do you begin? You’ll want answers to these questions before you dive in. To break the process down, Level Alum Paula Muñoz joined us at our recent event, Understanding the Lifecycle of a Data Analysis Project, and outlined each step.
Step 1: Understanding the Business Issues
When presented with a data project, you will be given a brief outline of the expectations. From that outline, identify the key points the business is trying to address: the overall scope of the work, the business objective, the information the business is seeking, the type of analysis it wants you to use, and the deliverables (the results of the project).
You need to have these points clearly defined to provide the best deliverable you can. Level Alum Lantz Wagner stresses the importance of asking as many questions as you can at the beginning, because you often may not have another chance before the project is complete.
Step 2: Understanding Your Data Set
There are a variety of tools you can use to organize your data. When presented with a small dataset, you can use Excel, but for heftier jobs, you may want to use more robust tools to explore and prepare your data. Paula suggests R, Python, Alteryx, Tableau Prep or Tableau Desktop to help prepare your data for cleaning.
Within these programs, you should identify key variables to help categorize the data. As you go through the dataset, look for errors. These can be anything from omitted data and data that doesn’t logically make sense to duplicate data or even spelling errors. These problems need to be flagged so you can properly clean your data.
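As a rough illustration of that kind of check, here is a minimal Python sketch that uses only the standard library. The records, the field names (`id`, `city`, `age`), the list of known cities and the 0 to 120 age range are all made up for this example and are not from Paula’s presentation.

```python
# Hypothetical sample records; every field name and value is invented.
records = [
    {"id": 1, "city": "Chicago", "age": 34},
    {"id": 2, "city": "Chicgo", "age": 29},    # likely spelling error
    {"id": 3, "city": "Boston", "age": None},  # omitted value
    {"id": 4, "city": "Boston", "age": -5},    # doesn't logically make sense
    {"id": 1, "city": "Chicago", "age": 34},   # exact duplicate of the first row
]

def audit(rows, known_cities):
    """Return the row indices affected by each kind of problem."""
    issues = {"missing": [], "illogical": [], "duplicate": [], "unknown_city": []}
    seen = set()
    for i, row in enumerate(rows):
        # Omitted data: any field left empty.
        if any(value is None for value in row.values()):
            issues["missing"].append(i)
        # Illogical data: an age outside an assumed plausible range.
        age = row["age"]
        if age is not None and not 0 <= age <= 120:
            issues["illogical"].append(i)
        # Duplicate data: the same field/value pairs were already seen.
        key = tuple(sorted(row.items(), key=lambda pair: pair[0]))
        if key in seen:
            issues["duplicate"].append(i)
        seen.add(key)
        # Possible spelling error: a category value not in the known list.
        if row["city"] not in known_cities:
            issues["unknown_city"].append(i)
    return issues

report = audit(records, known_cities={"Chicago", "Boston"})
print(report)
```

The report only flags rows. Deciding how to fix them is left to the cleaning step.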
Step 3: Data Preparation
Once you have organized and identified all the variables in your dataset, you can begin cleaning. In this step, you will impute missing values, create new broad categories for data that doesn’t have a proper place and remove any duplicates in your data. Imputing the average value for categories where values are missing fills the gaps so the data can be processed, without shifting that category’s average.
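The three cleaning tasks can be sketched in a few lines of standard-library Python. The rows, the field names (`segment`, `score`) and the catch-all label "Other" are assumptions made for this example. Keep in mind that filling gaps with the mean keeps the average unchanged but makes the data look less spread out than it really is, so it is one option rather than a universal fix.

```python
from statistics import mean

# Hypothetical rows; field names and values are invented for illustration.
rows = [
    {"id": 1, "segment": "Retail", "score": 80.0},
    {"id": 2, "segment": "Retail", "score": None},     # missing value
    {"id": 3, "segment": "Wholesale", "score": 60.0},
    {"id": 4, "segment": "Kiosk", "score": 70.0},      # category with no proper place
    {"id": 1, "segment": "Retail", "score": 80.0},     # duplicate row
]

def drop_duplicates(rows):
    """Keep the first occurrence of each identical row."""
    seen, unique = set(), []
    for row in rows:
        key = tuple(sorted(row.items(), key=lambda pair: pair[0]))
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique

def impute_mean(rows, field):
    """Replace missing values in `field` with the mean of the observed ones."""
    fill = mean(r[field] for r in rows if r[field] is not None)
    return [{**r, field: fill if r[field] is None else r[field]} for r in rows]

def group_rare(rows, field, allowed, other="Other"):
    """Move any category outside `allowed` into one broad catch-all category."""
    return [{**r, field: r[field] if r[field] in allowed else other} for r in rows]

cleaned = drop_duplicates(rows)
cleaned = impute_mean(cleaned, "score")
cleaned = group_rare(cleaned, "segment", allowed={"Retail", "Wholesale"})
print(cleaned)
```

Duplicates are removed before imputing so that a repeated row does not count twice toward the average.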
Step 4: Exploratory Analysis/Modeling
Here, you will begin building models to test your data and seek out answers to the objectives given. By trying different modeling methods, you can determine which works best for your data. Models include linear regression, decision trees, random forests and more!
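To make "trying different modeling methods" concrete, here is a small standard-library Python sketch that fits two of the model types mentioned, a linear regression and a one-split decision tree (often called a stump), and compares their error on rows held back from training. The numbers are invented, and in practice you would more likely reach for a library in R or Python than hand-written fitting code.

```python
from statistics import mean

# Made-up data: xs could be ad spend, ys could be sales.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2]

def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    return lambda x: intercept + slope * x

def fit_stump(xs, ys):
    """Depth-1 regression tree: choose the split with the lowest squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [y for x, y in zip(xs, ys) if x <= threshold]
        right = [y for x, y in zip(xs, ys) if x > threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def mse(model, xs, ys):
    """Mean squared error of a fitted model on the given rows."""
    return mean((model(x) - y) ** 2 for x, y in zip(xs, ys))

# Train on every other row and hold the rest back for comparison.
train_x, train_y = xs[::2], ys[::2]
test_x, test_y = xs[1::2], ys[1::2]

linear_error = mse(fit_linear(train_x, train_y), test_x, test_y)
stump_error = mse(fit_stump(train_x, train_y), test_x, test_y)
print("linear:", linear_error, "stump:", stump_error)
```

On this nearly straight-line data the linear model has the lower held-out error. On data with sharp jumps or thresholds, a tree-based model could come out ahead, which is why it is worth comparing.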
Step 5: Validation
Once you have crafted your models, you’ll need to assess the results and determine if you have the correct information for your deliverable. Did the models work properly? Does the data need more cleaning? Did you answer the question the client was asking? If not, you may need to go over the previous steps again. You should expect a lot of trial and error!
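One common way to check whether a model "worked properly" is to compare it against a trivial baseline using k-fold cross-validation. The sketch below is a standard-library Python illustration on invented numbers. The helper names (`fit_mean`, `fit_linear`, `k_fold_mse`) are hypothetical, and the event recap does not say which validation method Paula recommended.

```python
from statistics import mean

# Made-up data for illustration only.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2]

def fit_mean(xs, ys):
    """Baseline that always predicts the average of the training targets."""
    y_bar = mean(ys)
    return lambda x: y_bar

def fit_linear(xs, ys):
    """Ordinary least squares with a single predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return lambda x: y_bar + slope * (x - x_bar)

def k_fold_mse(fit, xs, ys, k=4):
    """Average held-out mean squared error across k folds."""
    pairs = list(zip(xs, ys))
    scores = []
    for fold in range(k):
        # Every k-th row goes to the test fold; the rest are used for training.
        train = [p for i, p in enumerate(pairs) if i % k != fold]
        test = [p for i, p in enumerate(pairs) if i % k == fold]
        model = fit([x for x, _ in train], [y for _, y in train])
        scores.append(mean((model(x) - y) ** 2 for x, y in test))
    return mean(scores)

baseline_error = k_fold_mse(fit_mean, xs, ys)
linear_error = k_fold_mse(fit_linear, xs, ys)
print("baseline:", baseline_error, "linear:", linear_error)
```

If a model cannot beat the baseline on held-out rows, that is a signal to revisit the earlier steps, whether that means more cleaning or a different modeling method.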
Step 6: Visualization and Presentation
Once you have met all your deliverables, you can begin your data visualization. Data visualization will be crucial in some projects for communicating your findings to the client. Not all clients are data savvy, and interactive visualization tools like Tableau are a tremendous help in showing clients your conclusions. Being able to tell a story with your data is essential because it helps explain the value of your findings to the client.
As with any project, you need to identify your objectives clearly. Outlining your work will ensure you get the best deliverables for your clients. While all of these steps are important, if you start the project without all the data you need, you are likely to have to backtrack. Special shout out to Paula for putting together a fantastic presentation for everyone. If you are interested in more events like this, be sure to check our website for upcoming events in your area!
Check out Paula’s presentation and follow her on Twitter @paulisdataviz