Data ingestion is a multi-step process that comes with a variety of complex challenges, especially when data comes from different sources, is in different formats, or is mixed with irrelevant data. Ingestion requires putting datasets through an extract, transform, and load (ETL) process, and the first task is to conduct data profiling. Flexwind helps teams expedite the data profiling process by implementing custom front-end solutions using Python and Neo4J.
What is data profiling?
The first task in the ETL process is to qualify the data. This can be done manually, but manual review is time-consuming and prone to errors and oversights. Using Python and Neo4J, Flexwind clients can quickly run graph analysis on their datasets before ingestion to:
- Understand data structure
- Identify relationships
- Determine content quality
Essentially, data profiling provides insights about the quality of your data: accuracy, validity, and completeness.
Learn your process and business needs
Flexwind’s first step is to gather information about the business’ needs to understand the business logic that we will need to build into the solution. We also review your existing process to determine how best to integrate the automation into the existing workflow. For example, when a client needs to analyze incoming datasets to determine what data qualifies to be ingested, the solution will likely need to be implemented right after the data is moved off of a landing zone and into an area where it can be manipulated.
Set up mappings
Next, we apply business logic to create mappings of the relevant correlations using Python. Below is an example where the intent is to ingest Person and Travel data.
- Create 2 nodes
Person node: should connect to metadata such as First Name, Last Name, Date of Birth, etc.
Travel node: should connect to metadata such as Arrival Time, Departure Location, etc.
- Ingest data headers
First Name, Last Name, Date of Birth, Place of Birth, Arrival Time, Departure Location, Flight Number, Favorite Book.
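The mapping described above could be expressed in Python as a plain dictionary from node label to expected headers. This is a minimal sketch, not Flexwind's actual implementation: the names `NODE_MAPPINGS`, `INCOMING_HEADERS`, and `known_headers` are hypothetical, and because the example abbreviates each node's metadata with "etc.", placing Place of Birth under Person and Flight Number under Travel is an assumption.

```python
# Hypothetical mapping of node labels to the headers each node should connect to.
# Assumption: "Place of Birth" and "Flight Number" are covered by the "etc." in
# the node descriptions; adjust these sets to match the real business logic.
NODE_MAPPINGS = {
    "Person": {"First Name", "Last Name", "Date of Birth", "Place of Birth"},
    "Travel": {"Arrival Time", "Departure Location", "Flight Number"},
}

# Headers read from the incoming dataset (taken from the example above).
INCOMING_HEADERS = [
    "First Name", "Last Name", "Date of Birth", "Place of Birth",
    "Arrival Time", "Departure Location", "Flight Number", "Favorite Book",
]


def known_headers(mappings):
    """Return every header that at least one node expects."""
    return set().union(*mappings.values())
```

Keeping the mapping as data rather than code means the business logic can be adjusted without touching the comparison step.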
Run the program
Neo4J lets you view a graph of the nodes and edges to determine whether the data has any relationships (edges). The graphs are compared, and the outcome shows which headers connected to which node, which did not connect at all, and whether the data qualified based on specific business rules, such as “must have 3 person data and 1 travel data.” In this case, the following connections would be seen:
Person – First Name, Last Name, Date of Birth, Place of Birth
Travel – Arrival Time, Departure Location, Flight Number
Unmatched – Favorite Book
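The outcome of this comparison can be imitated in plain Python, without a database, to show the logic involved. The sketch below is an illustration only: in the solution described here the comparison is made on Neo4J graphs, and the function names `match_headers` and `qualifies`, the `mappings` contents, and the dictionary form of the business rule are all assumptions.

```python
def match_headers(headers, mappings):
    """Group headers by the node they connect to; the rest go under 'Unmatched'."""
    result = {label: [] for label in mappings}
    result["Unmatched"] = []
    for header in headers:
        for label, expected in mappings.items():
            if header in expected:
                result[label].append(header)
                break
        else:
            # No node expects this header.
            result["Unmatched"].append(header)
    return result


def qualifies(matches, minimums):
    """Check a rule such as 'must have 3 person data and 1 travel data'."""
    return all(len(matches.get(label, [])) >= n for label, n in minimums.items())


# Hypothetical mappings and headers based on the example in this article.
mappings = {
    "Person": {"First Name", "Last Name", "Date of Birth", "Place of Birth"},
    "Travel": {"Arrival Time", "Departure Location", "Flight Number"},
}
headers = [
    "First Name", "Last Name", "Date of Birth", "Place of Birth",
    "Arrival Time", "Departure Location", "Flight Number", "Favorite Book",
]

matches = match_headers(headers, mappings)
rule = {"Person": 3, "Travel": 1}  # assumed encoding of the business rule
dataset_qualifies = qualifies(matches, rule)
```

With these inputs, Favorite Book is the only header left unmatched, and the dataset satisfies the rule.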
If common failures are identified, such as a naming discrepancy between the nodes and the data (e.g., the node expects DOB but the header reads Date of Birth), this is when we make those adjustments.
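One way to handle a naming discrepancy like this is to normalize headers through an alias table before they are compared with the nodes. The following is a minimal sketch under that assumption; the `ALIASES` table and the `normalize_header` function are hypothetical names, not part of any library.

```python
# Hypothetical alias table: lowercase variants mapped to the canonical header name.
ALIASES = {
    "dob": "Date of Birth",
    "date of birth": "Date of Birth",
}


def normalize_header(header, aliases):
    """Return the canonical name for a header, or the trimmed header if no alias is known."""
    # Lowercase and collapse runs of whitespace so " DOB " and "dob" look the same.
    key = " ".join(header.lower().split())
    return aliases.get(key, header.strip())


canonical = [normalize_header(h, ALIASES) for h in ["DOB", " date  of birth ", "Favorite Book"]]
```

Headers without a known alias pass through unchanged, so they can still surface as unmatched during the comparison.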
Test the solution
To ensure the solution is effective, we run pre-deployment integration tests to check the accuracy of the correlations and ensure that only desired data is identified and retained, based on the business logic. If any issues are identified, the front-end program is refined and retested. Ultimately, the Neo4J graph output provides the ETL personnel with enough information to conclude if the data correlates enough to be ingested.
Deploy the solution
Upon successful testing, the solution is deployed and integrated into the existing ETL workflow. With the automated data profiling solution successfully in place, the first step in the ETL process is significantly more efficient, increasing the ETL team’s velocity and ability to ingest the right data.
Automate your data profiling workflow with Flexwind
Manual data profiling is time-consuming and can be fraught with errors and oversights. Automating this process not only increases the efficiency and accuracy of your ETL process but can also provide your business with valuable metrics about the data, including how much data was analyzed, the percentage of data that qualified for ingestion, relationship counts, and duration.
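As an illustration of how such metrics might be derived, the sketch below summarizes a batch of profiling results. The shape of each result (a `qualified` flag, a `matches` dictionary, and a `seconds` duration), the `summarize` function, and the sample values are assumptions made for this example.

```python
from collections import Counter


def summarize(results):
    """Aggregate per-dataset profiling results into batch-level metrics."""
    total = len(results)
    qualified = sum(1 for r in results if r["qualified"])
    relationships = Counter()
    for r in results:
        for label, headers in r["matches"].items():
            relationships[label] += len(headers)
    return {
        "datasets_analyzed": total,
        "percent_qualified": 100.0 * qualified / total if total else 0.0,
        "relationship_counts": dict(relationships),
        "total_seconds": sum(r["seconds"] for r in results),
    }


# Hypothetical results for two profiled datasets.
sample_results = [
    {"qualified": True, "seconds": 2,
     "matches": {"Person": ["First Name", "Last Name", "Date of Birth"], "Travel": ["Arrival Time"]}},
    {"qualified": False, "seconds": 1,
     "matches": {"Person": ["First Name"], "Travel": []}},
]
summary = summarize(sample_results)
```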
Get in touch and let our engineers work with you to understand your business’ data challenges and customize an automated solution that will improve the accuracy and efficiency of your data analysis, so you can ingest the right data with confidence.
About the Author
Erin Hilton, Sr. Software Engineer
Erin joined Flexwind in 2019 and has experience with the full data life cycle and software development life cycle. Erin holds a Bachelor’s degree in Mathematics from the University of North Carolina at Wilmington.