3 Steps to Creating Any Data Project

Zaid Alissa Almaliki
5 min read · Oct 31, 2022

How to Gather Requirements

This is the first step to creating and building your data pipeline in any company. Relax and believe in the process I present to you right here, right now. If you have a huge ego, try to control it or leave it outside the company, because you are dealing here with end users, and end users are your customers.

Who are your end users?

You need to understand who your end users are going to be. You prepare the data, and your end users cook it: you are the farmer, and they are the chefs or bakers. To make this job easy for you, let me introduce you to some of the end users you will find in any company.

  • Data analysts use SQL and files.
  • Data scientists use SQL and files.
  • Software engineers use SQL and APIs.
  • Business analysts use reports, dashboards, and Excel files.
  • Project managers use dashboards.
  • External users use S3 objects, SFTP/FTPS, and APIs.

How to Help End Users Define Requirements?

You don’t get all your requirements on the first day. You need to have patience. You have to empathise with your end users’ pain. You should believe in the process. I can’t do it for you, but I can help you achieve it by asking these magic questions (a sketch of capturing the answers follows the list):

  1. How will this data improve the business? For example, solving the churn-rate problem.
  2. What does the data represent?
  3. What is the business process used to collect this data?
  4. What is the origin of this data? (S3 files, SFTP, APIs, an external database, an internal database, manual upload of files)
  5. How regularly is the data produced?
  6. How fresh does the data need to be: seconds, minutes, hours, days, weeks, or months?
  7. Do you need to store historical data?
  8. What is the seasonality of the data?
  9. Does the data volume vary significantly over time (size skew)?
  10. How will the end users access the data: SQL, dashboards, or APIs?
  11. What are the company’s data quality metrics?
  12. How do you know the data passed the business logic-based checks?
  13. What checks do you run on numeric fields?
  14. Do you follow a naming convention for files, schemas, tables, columns, or API fields?
  15. What is the standard file size in the company?
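
None of these answers should live only in someone’s head. Here is a minimal sketch of recording them as a structured spec in Python; the field names are illustrative, not a standard:

    from dataclasses import dataclass, field

    @dataclass
    class PipelineRequirements:
        """Illustrative record of the answers to the questions above."""
        business_goal: str             # Q1: e.g. "reduce the churn rate"
        data_origin: str               # Q4: S3, SFTP, API, database, manual upload
        freshness: str                 # Q5-6: seconds, minutes, hours, days...
        keep_history: bool             # Q7
        access_pattern: str            # Q10: SQL, dashboard, or API
        quality_checks: list[str] = field(default_factory=list)  # Q11-13

    requirements = PipelineRequirements(
        business_goal="reduce the churn rate",
        data_origin="Google Ads API",
        freshness="daily",
        keep_history=True,
        access_pattern="SQL",
        quality_checks=["clicks >= 0", "customer_id is not null"],
    )

A spec like this also gives you something concrete to attach to the Jira ticket when the requirement is signed off.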

What is the End User Validation Process?

  1. Provide samples of the data to the end users (see the sketch after this list).
  2. Take their feedback on the validation of the data seriously, because they understand the data better than you.
  3. Observe their access patterns: which schemas, tables, and columns they query most often in a database, or which filters they apply in a dashboard.
  4. Write every new requirement as a ticket in Jira, and don’t start working on a new transformation layer until you have the green light from the end users, which means the sign-off of the ticket.
  5. Forget about your ego when you work with end users.
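
For step 1, here is a minimal sketch of pulling a sample for end users with pandas and SQLAlchemy; the connection string, schema, and table name are made up for illustration:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical warehouse connection -- replace with your own.
    engine = create_engine("postgresql://user:password@host:5432/warehouse")

    # Pull a small random sample so end users can validate the data early.
    sample = pd.read_sql(
        "SELECT * FROM staging.google_ads_clicks ORDER BY RANDOM() LIMIT 100",
        engine,
    )

    # Hand the sample over in a format every end user can open.
    sample.to_csv("google_ads_sample.csv", index=False)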

What is Your Delivery Process?

  1. Deliver the work in small chunks, piece by piece.
  2. Keep the end users in the loop by asking them to review your tickets; that way you can easily spot any new requirements.
  3. Record your work in Jira tickets. These tickets need to have clear acceptance criteria.
  4. Document every big change in Confluence pages.

Let me give you a clear example. Imagine you need to ingest data from a huge number of sources into your data lake, then transform it and load it into your data warehouse. To do that, you need to follow this process (a code sketch follows the list):

  1. Model the data from only one source.
  2. Pull the data from the source you chose in the first step. For example, that data source is the Google Ads API.
  3. Put the data in the data lake.
  4. Apply a simple transformation to the data, and put it in your data warehouse.
  5. Build the dashboard based on the data that you have in the data warehouse.
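
A minimal sketch of steps 2 to 4, assuming an S3 data lake; fetch_google_ads_report stands in for the real Google Ads API client, and the bucket name is hypothetical:

    import json
    import boto3

    def fetch_google_ads_report() -> list[dict]:
        """Placeholder for the real Google Ads API client call."""
        return [{"campaign_id": 1, "clicks": 42, "cost_micros": 1_500_000}]

    # Steps 2-3: pull the raw data and land it in the data lake unchanged.
    rows = fetch_google_ads_report()
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-data-lake",
        Key="raw/google_ads/2022-10-31.json",
        Body=json.dumps(rows),
    )

    # Step 4: a simple transformation before loading into the warehouse.
    transformed = [
        {
            "campaign_id": r["campaign_id"],
            "clicks": r["clicks"],
            "cost_usd": r["cost_micros"] / 1_000_000,  # micros -> dollars
        }
        for r in rows
    ]

Loading the transformed rows into the warehouse and building the dashboard (step 5) depend on your specific tools, so they are left out here.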

How to Create a Process for Managing Change Requests?

  1. Don’t accept ad hoc requests (unexpected requests).
  2. Educate the end users on the process for requesting a change.
  3. Allow end users to request changes in an easy way.
  4. Communicate delivery times to the end users.
  5. If some requests are more important than others, work with the stakeholders to decide which one has top priority.

How to Add Testing to Your Project?

What is system testing?

This type of testing can only be done in the development environment.

  1. Take any data sample and pass it through the data pipeline.
  2. Generate the output from this data sample.
  3. Compare the output from the second step with the expected output, as in the sketch below.
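
A minimal sketch of these three steps as a pytest test, with a trivial transform standing in for your real pipeline:

    # test_pipeline.py -- run with: pytest test_pipeline.py

    def transform(rows: list[dict]) -> list[dict]:
        """Stand-in for your real pipeline transformation."""
        return [{**r, "cost_usd": r["cost_micros"] / 1_000_000} for r in rows]

    def test_pipeline_on_sample():
        # Step 1: take a small, known data sample.
        sample = [{"campaign_id": 1, "cost_micros": 2_000_000}]
        # Step 2: generate the output from the sample.
        output = transform(sample)
        # Step 3: compare with the expected output.
        assert output == [
            {"campaign_id": 1, "cost_micros": 2_000_000, "cost_usd": 2.0}
        ]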

What is Data Quality Testing?

This data quality testing can only run as a step in the production environment. You can follow these steps to make it happen:

  1. dbt and Great Expectations will be your weapons in this test.
  2. Load the data into a staging table, and after that start applying constraint checks, business logic-based checks, and outlier checks (see the sketch after these steps).
  3. Anything that doesn’t pass the checks should trigger an alarm.
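
Here is a minimal sketch of the three kinds of checks from step 2, written in plain pandas for illustration; in practice you would express them as dbt tests or Great Expectations suites, and the column names here are made up:

    import pandas as pd

    # Pretend this is the staging table loaded in step 2.
    staging = pd.DataFrame(
        {"customer_id": [1, 2, None], "clicks": [10, -5, 3]}
    )

    failures = []

    # Constraint check: the key column must never be null.
    if staging["customer_id"].isnull().any():
        failures.append("customer_id contains nulls")

    # Business logic-based check: click counts can never be negative.
    if (staging["clicks"] < 0).any():
        failures.append("negative click counts found")

    # Outlier check: flag values more than 3 standard deviations from the mean.
    mean, std = staging["clicks"].mean(), staging["clicks"].std()
    if ((staging["clicks"] - mean).abs() > 3 * std).any():
        failures.append("click count outliers found")

    # Step 3: anything that didn't pass the checks triggers an alarm.
    if failures:
        raise RuntimeError(f"Data quality checks failed: {failures}")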

What are Monitoring and Alerting Systems?

  1. You use monitoring to catch changes in your data pipeline.
  2. Create logs from your applications, or, if you are working in the AWS cloud, create them using AWS CloudWatch.
  3. You can send the logs to Datadog, or you can send them to a table in the database.
  4. Program an alarm, through Datadog or your database, that fires when something bad happens in your pipeline (see the sketch after this list).
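
A minimal sketch of steps 2 and 4 using Python’s standard logging module; send_alert is a placeholder for a real Datadog event or a database insert:

    import logging

    logging.basicConfig(level=logging.INFO)
    logger = logging.getLogger("pipeline")

    def send_alert(message: str) -> None:
        """Placeholder: post an event to Datadog or insert into an alerts table."""
        logger.error("ALERT: %s", message)

    def run_pipeline_step(rows_loaded: int, rows_expected: int) -> None:
        # Step 2: emit a log line for every run (on AWS, application logs are
        # typically picked up by CloudWatch Logs).
        logger.info("rows_loaded=%d rows_expected=%d", rows_loaded, rows_expected)
        # Step 4: alarm when the pipeline behaves unexpectedly.
        if rows_loaded < rows_expected:
            send_alert(f"only {rows_loaded} of {rows_expected} rows loaded")

    run_pipeline_step(rows_loaded=90, rows_expected=100)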

What is Your Offboarding Process?

  1. Create video tutorials of all the big changes in the project.
  2. Create Confluence pages for any new data pipeline project that you implement.
  3. During the last weeks of your project at the company, support your colleagues in case they have problems following your tutorials.


Zaid Alissa Almaliki

Founder, Principal Data Engineer, and Cloud Architect Consultant at DataAkkadian.