On September 23, 1999, at 09:00:46 UT, NASA's Mars Climate Orbiter lost contact with Earth as it passed behind Mars. The anticipated reconnection, 27 minutes later, never occurred – by that time, the spacecraft had been destroyed in the Martian atmosphere. Subsequent investigations revealed that the incident was caused by navigation data not being converted from English units to the metric standard.1
On April 8, 2018, a Samsung Securities employee inadvertently entered “shares” instead of the Korean currency “won” due to a keyboard error. This led to the accidental distribution of “ghost” shares worth over 100 billion dollars, ultimately causing a significant decline in Samsung Securities stock, not to mention a loss of credibility.2
What ties these incidents together? Data quality.
Data quality matters
Ensuring data quality involves tasks such as checking whether values are within range or have the correct format, and it has been at the center of many discussions since the early 1990s.3
Data quality issues may originate in the realm of data, but they are certainly not confined to it: they can significantly impact business efficiency, incur higher costs, and even jeopardize the success of entire projects.4
To tackle the intricacies of data quality problems, organizations of all kinds are constantly looking for effective solutions that combine industry expertise with data knowledge.
Will a spreadsheet cut it?
While spreadsheets may suffice for small datasets with simple rules, they prove inadequate as data volume and rule complexity increase. Suppose you have only one or two data sources that you can merge into a small, unified dataset. If the data quality can be assured with simple rules, such as verifying payment amounts within an expected range, a basic spreadsheet formula might suffice.
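For instance, assuming the payment amount sits in cell B2 and the acceptable range is 10,000 to 50,000, a spreadsheet formula along the lines of =IF(AND(B2>=10000, B2<=50000), "VALID", "INVALID"), copied down the column, would be enough to flag every out-of-range payment at a glance.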
However, as business requirements grow more intricate, data volume expands, or the need arises to integrate new sources, spreadsheets really start to feel the strain, along with the people trying to use them. Similar to training wheels for a novice rider in a park, a spreadsheet is of little use to an experienced rider navigating a steep downhill track.
Find the expert, spell it out, iterate
So with increased complexity, you’ll need a data quality tool that can handle it. Unless you have a very technical background, you’ll also have to bring an expert on board who can implement your rules, such as a Python programmer.
The sort of script your programmer could produce to perform a simple validation check, such as ensuring an amount lies within the 10,000 to 50,000 range, applicable only to projects categorized as “small” or “medium” in size, would look something like this:
import pandas as pd

# Toy payment data: each payment has an amount and a project size category.
data = {
    'amount': [12000, 30000, 60000, 15000],
    'project': ['Small', 'Medium', 'Large', 'Small']
}
payments = pd.DataFrame(data)

# Every payment starts as INVALID; rows that pass the check are then marked VALID.
payments['PaymentStatus'] = 'INVALID'

# Valid: amount between 10,000 and 50,000 and project categorized as Small or Medium.
mask = (payments['amount'].between(10000, 50000)
        & payments['project'].isin(['Small', 'Medium']))
payments.loc[mask, 'PaymentStatus'] = 'VALID'

# Prints VALID for the first, second and fourth payments, INVALID for the 60,000 'Large' one.
print(payments[['PaymentStatus']])
Bringing in an expert to implement the solution is a viable approach, as it allows you to handle volume and complexity. However, it has some important drawbacks:
- Since the implementation is not in your hands, adapting to changes in requirements can be a slow process, involving scheduled meetings to coordinate with programmers and/or tool specialists.
- Despite investing time in clarifying these changes, there’s always a chance that not every detail will be fully grasped or smoothly executed.
- And when it comes to integrating new data sources and ensuring they align seamlessly with existing datasets, things get even more intricate. The effort required can escalate quickly, calling for a diverse set of skills to merge and harmonize everything effectively, as the sketch below suggests.
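To give a rough sense of that effort, here is a minimal pandas sketch of harmonizing and merging a second source into the toy payment data used above. The projects table, its column names, and its labels are invented purely for illustration:

import pandas as pd

# The existing payment data (same toy dataset as in the script above).
payments = pd.DataFrame({
    'amount': [12000, 30000, 60000, 15000],
    'project': ['Small', 'Medium', 'Large', 'Small']
})

# A hypothetical second source: project records exported from another system,
# with a different column name and inconsistent category labels.
projects = pd.DataFrame({
    'proj_size': ['small', 'MEDIUM', 'Large'],
    'owner': ['Alice', 'Bob', 'Carol']
})

# Harmonize the labels before joining, otherwise the merge silently finds no matches.
projects['project'] = projects['proj_size'].str.capitalize()
merged = payments.merge(projects[['project', 'owner']], on='project', how='left')
print(merged)

Even in this tiny example, someone has to know which columns correspond, how the labels differ, and which kind of join to use; with real data, the reconciliation work multiplies quickly.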
The perfect solution would be a tool that can handle high data volumes and varying rule complexities while remaining accessible to a citizen developer.
Meet the Rule Engine
At Rulex, we address data validation challenges with a task called the “Rule Engine”.
This specially designed tool allows users to write business rules in a simple Excel file using an intuitive syntax. The rules can be applied to datasets, and the outputs can be exported to various destinations, such as a database or a local file, or sent via API to an Advanced Planning System.
To assess the validity of our payment data with the Rule Engine, instead of writing a script, it’s sufficient to write a straightforward rule like the following:
IF "amount" > 10000 AND "amount" < 50000 AND "project" in {'Small', 'Medium'} THEN "PaymentStatus" in {'VALID'}
As these rules are written in an external spreadsheet, business users can add and modify them independently, without delving into the intricacies of the workflow or even needing to know how the software works.
Managing business rules becomes seamless. If the complexity grows, it can be easily addressed thanks to the Rule Engine’s support for formulas within the rule syntax, prioritization of rules (executing fundamental rules first), and the ability to manage rule dependencies.
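To make rule prioritization and dependencies a little more concrete, here is a purely illustrative Python sketch (not how the Rule Engine itself is implemented) in which rules run in priority order and a later rule reads the status set by an earlier one. The NeedsReview column and the second rule are invented for the example:

import pandas as pd

payments = pd.DataFrame({
    'amount': [12000, 30000, 60000, 15000],
    'project': ['Small', 'Medium', 'Large', 'Small']
})

# Each rule: (priority, target column, value to set, condition).
# Lower numbers run first, so fundamental checks come before rules that depend on them.
rules = [
    (1, 'PaymentStatus', 'VALID',
     lambda df: df['amount'].between(10000, 50000) & df['project'].isin(['Small', 'Medium'])),
    # This rule reads PaymentStatus, so it must run after rule 1.
    (2, 'NeedsReview', True,
     lambda df: df['PaymentStatus'].eq('INVALID') & (df['amount'] > 50000)),
]

payments['PaymentStatus'] = 'INVALID'
payments['NeedsReview'] = False
for _, column, value, condition in sorted(rules, key=lambda rule: rule[0]):
    payments.loc[condition(payments), column] = value

print(payments)

With the Rule Engine, this ordering is expressed through the rules themselves in the spreadsheet, rather than through hand-written loops like this one.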
And if new data sources come into play, they can be imported and merged into the existing flow through a user-friendly drag-and-drop interface.
3 main benefits of the Rule Engine:
- SIMPLE: You won’t need to onboard programmers to write complex scripts.
- FAST: You can independently modify and test rules and check results in minutes.
- FLEXIBLE: You can quickly add new data sources, prioritize rules, and change output, adapting easily to changing needs.
Whether preventing a space exploration mishap or simply ensuring your business is not losing money, data quality is crucial. The Rule Engine is designed to give citizen developers complete control over the rule management process, enhancing efficiency and helping you keep data quality consistently high.
Now is the right time to cast aside those training wheels and confidently navigate your own path along the data trail!