Weeding the Data Garden: How Rulex Platform Cultivates Quality
https://www.rulex.ai/weeding-the-data-garden-how-rulex-platform-cultivates-quality/ (Tue, 04 Jun 2024)

Whether it's an out-of-range value or an incorrect format, the quality of data is fundamental to any data-reliant process and significantly impacts its results, despite often being overlooked. Imagine an enterprise undertaking a large end-to-end business transformation program whose final aim is to switch to the newest Advanced Planning System (APS), with all its powerful capabilities. If the data fed to that software is not consistent and accurate, the results will inevitably suffer.

So, we need to monitor quality in order to ensure adequate accuracy; and to monitor quality we first need to define what Data Quality is. In fact, it is rather an "umbrella term" covering different kinds of issues in the data: Accuracy, Completeness, Consistency, Timeliness, Validity, and Uniqueness are some of the Data Quality dimensions. You can find more information (and more dimensions!) by googling around, so here we focus on the solution, which, like the problem, is multi-faceted.

Rulex Platform provides different capabilities and approaches to solve different Data Quality issues, just as a gardener uses various tools and practices to uproot weeds and make the garden bloom.


When it comes to harmonizing fragmented data, handling missing values and duplicates, or correcting formatting errors and outliers, Rulex Platform can quickly spot and fix the issue.

A typical Rulex flow includes these cleansing activities among the first operations performed on the raw, dirty dataset. Specific tasks make it simple for any citizen developer to cleanse their data, such as:

  • Fill & Clean: which imputes missing data with fixed or dynamic values
  • Data Manager: which spots and dismisses duplicate rows with a single click

And if these are not enough, you can leverage various other features (a generic sketch of these cleansing operations follows the list below), such as:

  • An advanced join capability to merge different datasets based, for example, on string similarity
  • Statistical or textual Data Manager functions, which deal with outliers or incorrect formats
  • …and much more!
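To make these operations concrete, here is a minimal pandas sketch of the same kinds of cleansing steps: fixed-value imputation, duplicate removal, and a simple statistical outlier flag. It is a generic illustration rather than Rulex code, and the dataset and column names are invented for the example.

import pandas as pd

# Toy dataset with a missing value, a duplicate row, and a suspicious amount
orders = pd.DataFrame({
    "customer": ["Acme", "Acme", "Blue Co", "Crate Ltd"],
    "country":  ["IT",   "IT",   None,      "FR"],
    "amount":   [120.0,  120.0,  95.0,      9500.0],
})

# Fixed-value imputation (the kind of thing Fill & Clean does with a constant value)
orders["country"] = orders["country"].fillna("UNKNOWN")

# Duplicate removal (the kind of thing Data Manager's one-click de-duplication does)
orders = orders.drop_duplicates()

# Simple statistical outlier flag: mark amounts far from the mean
# (the threshold is deliberately low for this tiny example)
z = (orders["amount"] - orders["amount"].mean()) / orders["amount"].std()
orders["amount_outlier"] = z.abs() > 1

print(orders)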

Although the above approaches have proven useful with many basic issues, there are some cases where a data value looks perfectly normal and yet hides an inconsistency.

Unmask inconsistency with eXplainable AI

Among all these dimensions, consistency is one of the most difficult to deal with. A target attribute is considered “consistent” if it changes in accordance with its related attributes; i.e., its values change consistently when the context changes.

The table below illustrates an example of inconsistency (guess why!):

Name   | Age | Married
John   | 28  | Yes
Mike   | 32  | No
Paul   | 5   | Yes
Brenda | 54  | Yes
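To make the inconsistency concrete: assuming the implicit rule that only adults can be married, a quick check like the sketch below singles out Paul's record (age 5, married) as the problem row. Again, this is a generic pandas illustration, not how Rulex performs the check.

import pandas as pd

people = pd.DataFrame({
    "Name":    ["John", "Mike", "Paul", "Brenda"],
    "Age":     [28, 32, 5, 54],
    "Married": ["Yes", "No", "Yes", "Yes"],
})

# Assumed consistency rule: a married person must be an adult (Age >= 18)
inconsistent = people[(people["Married"] == "Yes") & (people["Age"] < 18)]
print(inconsistent)   # -> Paul, Age 5, Married Yes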

Also, sometimes you know that a subset of your data is inconsistent, but you don’t have the proper rules to correct it.

Or you have some basic rules, but there are so many exceptions that the final correct values can hardly be identified.

Rulex approaches all the above scenarios with a disruptive solution called Robotic Data Corrections (RDC), which seamlessly provides correction proposals for inconsistent data.

Behind the magic is a proprietary eXplainable AI algorithm called the Logic Learning Machine, capable of inferring a ruleset from which correction proposals are devised. With this approach, the user simply accepts or rejects the recommendations according to their domain knowledge, and the algorithm integrates this new knowledge into successive iterations. After four to five iterations, accuracy is usually close to 99%.
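The Logic Learning Machine itself is proprietary, so the sketch below only mimics the workflow just described: a toy "rule inducer" (here, simply the majority value per group) proposes corrections, the user accepts or rejects each one, and the next pass re-learns from the partially corrected data. Every function and column name in the sketch is an invented stand-in, not a Rulex API.

import pandas as pd

# Toy data: the unit of measure should be consistent within each material group
records = pd.DataFrame({
    "material_group": ["BOLT", "BOLT", "BOLT", "NUT", "NUT"],
    "uom":            ["EA",   "EA",   "KG",   "EA",  "EA"],
})

def propose_corrections(df):
    # Toy stand-in for rule induction: take the majority unit of measure per
    # group as the "rule" and flag the rows that deviate from it.
    majority = df.groupby("material_group")["uom"].agg(lambda s: s.mode()[0])
    return [(idx, "uom", majority[row["material_group"]])
            for idx, row in df.iterrows()
            if row["uom"] != majority[row["material_group"]]]

def run_rdc_loop(df, accept, max_rounds=5):
    # Conceptual accept/reject loop: accepted proposals are applied, and the
    # next round re-learns from the partially corrected data.
    for _ in range(max_rounds):
        proposals = propose_corrections(df)
        if not proposals:
            break
        for idx, column, value in proposals:
            if accept(df.loc[idx], column, value):   # the domain expert decides
                df.loc[idx, column] = value
    return df

# Accept every proposal for the demo; a real user would judge each one
print(run_rdc_loop(records, accept=lambda row, col, val: True))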

In addition, RDC catches any new data quality issues associated with material "phasing in": at a steady state, minimal effort is required to attain the highest levels of accuracy.

But as we mentioned, the realm of Data Quality is complex and the types of issue are diverse: sometimes the dependencies on driving attributes involve mathematical formulations; sometimes, even if you do have an established ruleset, it is not easy to update; or perhaps the dependencies between rules are simply too complex to manage.

Luckily, the realm of solutions provided by Rulex is just as diverse.

Ignite your rules with the Rule Engine

Rulex provides a task which allows any citizen developer to write their rules with a simple syntax in a simple spreadsheet, import this rule file, and apply the rules to a dataset. This empowering task is called the “Rule Engine”.

The beauty of this approach is that any existing rules can be coded in the task: from the simplest rules to rules involving complex conditions or output values computed with mathematical or logical functions. The whole process of ensuring data quality also remains completely in the hands of the citizen developer, with no need to call on specialist expertise to modify the rules or create new ones (definitely shortening the time-to-value).
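As a flavour of what such rules look like, here is a single illustrative rule written in the IF ... THEN ... pattern used by the Rule Engine (the same pattern appears in the validation example later on this page). It encodes the consistency check from the table above, assuming that only adults can be married; the full set of operators and functions available is described in the task's documentation.

IF "Age" < 18 THEN "Married" in {'No'}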

Finally, our Product Team is working on a solution for those unsure whether all their rules are properly configured.

Sharpen your rules with the Rule Enhancer

The Rule Enhancer is an innovative task which refines existing rules: think of it as a tuning tool. It requires a data (sub)set containing clean, accurate values (the so-called "ground truth"), used to adjust the rules, together with a performance criterion (such as the F1 score); as a result, fine-tuned parameters are provided for each rule. If you are interested, just hang in there a little longer: the task will be released shortly!
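While the task itself is not yet released, the tuning idea can be illustrated generically: take a rule with a free parameter (say, a threshold), score each candidate value against the ground-truth labels with the F1 score, and keep the best one. The sketch below does exactly that with scikit-learn; the data and column names are invented, and it illustrates the principle rather than the Rule Enhancer itself.

import pandas as pd
from sklearn.metrics import f1_score

# Ground-truth subset: payments labelled as valid/invalid by a domain expert
truth = pd.DataFrame({
    "amount":   [8000, 12000, 30000, 49000, 60000],
    "is_valid": [0, 1, 1, 1, 0],
})

# Rule with a tunable parameter: "amount must exceed <threshold> to be valid"
def apply_rule(df, threshold):
    return ((df["amount"] > threshold) & (df["amount"] < 50000)).astype(int)

# Grid-search the threshold against the ground truth, keeping the best F1 score
best_f1, best_threshold = max(
    (f1_score(truth["is_valid"], apply_rule(truth, t)), t)
    for t in range(5000, 20001, 1000)
)
print(f"best F1 = {best_f1:.2f} at threshold = {best_threshold}")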


Let your data bloom

These multiple approaches together form the basis of a 360-degree solution that reaches top accuracy levels and can be applied in a comprehensive Data Quality pipeline, so that any kind of issue can be tackled and solved. What's more, the implementation can be proficiently managed by any citizen developer who understands the underlying data well.
Rulex Platform provides all the solutions needed to make your knowledge blossom into colourful, accurate data.

Discover Rulex Platform’s data quality solutions

Rule-based Validation: 3 Reasons Why Rulex Does It Better
https://www.rulex.ai/rule-based-validation-3-reasons-why-rulex-does-it-better/ (Wed, 28 Feb 2024)

On September 23, 1999, at 09:00:46 UT, NASA's Mars Climate Orbiter spacecraft lost contact with Earth as it passed behind Mars. The anticipated reconnection, 27 minutes later, never occurred: by that time, the spacecraft had crashed onto the red planet. Subsequent investigations revealed that the incident was caused by commands not being converted from English units to the metric standard.1

On April 8, 2018, a Samsung Securities employee inadvertently entered "shares" instead of the Korean currency "won" due to a keyboard error. This led to the accidental distribution of "ghost" shares worth over 100 billion dollars, ultimately causing a significant decline in Samsung Securities stock, not to mention a loss of credibility.2

What ties these incidents together? Data quality.

Data quality matters

Ensuring data quality involves tasks such as checking whether values are within range or have the correct format, and it has been at the center of many discussions since the early 1990s.3

Data quality issues may originate in the realm of data, but they are certainly not confined to it: they significantly impact business efficiency, incur higher costs, and can even jeopardize the success of projects.4

To tackle the intricacies of data quality problems, organizations of all kinds are constantly looking for effective solutions that can combine both industry expertise and data knowledge.

Will a spreadsheet cut it?

While spreadsheets may suffice for small datasets with simple rules, they prove inadequate as data volume and rule complexity increase. Suppose you have only one or two data sources that you can merge into a small, unified dataset. If the data quality can be assured with simple rules, such as verifying payment amounts within an expected range, a basic spreadsheet formula might suffice.

However, as business requirements grow more intricate, data volume expands, or the need arises to integrate new sources, spreadsheets really start to feel the strain, along with the people trying to use them. Similar to training wheels for a novice rider in a park, a spreadsheet is of little use to an experienced rider navigating a steep downhill track.


Find the expert, spell it out, iterate

So with increased complexity, you'll need a data quality tool that can handle it. Unless you have a very technical background, you'll also have to get an expert on board who can implement your rules, such as a Python programmer.

The sort of script your programmer could produce to perform a simple validation check, such as ensuring an amount lies within the 10,000 to 50,000 range, applicable only to projects categorized as “small” or “medium” in size, would look something like this:

import pandas as pd

# Sample payment data
data = {
    'amount': [12000, 30000, 60000, 15000],
    'project': ['Small', 'Medium', 'Large', 'Small']
}
payments = pd.DataFrame(data)

# Mark every payment as invalid by default
payments['PaymentStatus'] = 'INVALID'

# Valid: amount between 10,000 and 50,000 and project size Small or Medium
mask = payments['amount'].between(10000, 50000) & payments['project'].isin(['Small', 'Medium'])
payments.loc[mask, 'PaymentStatus'] = 'VALID'

print(payments[['PaymentStatus']])

Using an expert to implement the solution is certainly a viable approach, as it allows you to handle volume and complexity. However, it has some important drawbacks:

  • As the execution of any implementation is not within your control, adapting to changes in requirements can be a bit of a journey, involving scheduling meetings to coordinate with programmers and/or tool specialists.
  • Despite investing time in clarifying these changes, there’s always a chance that not every detail will be fully grasped or smoothly executed.
  • And when it comes to integrating new data sources and ensuring they seamlessly align with existing datasets, things can get even more intricate. This can lead to a quick escalation in the effort required, calling for a diverse set of skills to merge and harmonize everything effectively.

The perfect solution would be a tool that can handle high data volumes and varying rule complexities while remaining accessible to a citizen developer.

Meet the Rule Engine

At Rulex, we address data validation challenges with a task called the "Rule Engine".

This specially designed tool allows users to write business rules in a simple Excel file using an intuitive syntax. The rules can be applied to datasets, and the outputs can be exported to various destinations, such as a database, a local file, or an Advanced Planning System via API.

To assess the validity of our payment data with the Rule Engine, instead of writing a script, it’s sufficient to write a straightforward rule like the following:

IF "amount" > 10000 AND "amount" < 50000 AND "project" in {'Small', 'Medium'} THEN "PaymentStatus" in {'VALID'}

As these rules are written in an external spreadsheet, business users can independently add and modify them, without delving into the intricacies of the workflow, or even needing to know how the software works.

Managing business rules becomes seamless. If the complexity grows, it can be easily addressed thanks to the Rule Engine’s support for formulas within the rule syntax, prioritization of rules (executing fundamental rules first), and the ability to manage rule dependencies.

And if new data sources come into play, they can be imported and merged into the existing flow through a user-friendly drag-and-drop interface.

3 main benefits of the Rule Engine:

  1. SIMPLE: You won’t need to onboard programmers to write complex scripts.
  2. FAST: You can independently modify and test rules and check results in minutes.
  3. FLEXIBLE: You can quickly add new data sources, prioritize rules, and change output, adapting easily to changing needs.

Whether mitigating a space exploration mishap or simply ensuring your business is not losing money, data quality is crucial. The Rule Engine is designed to give citizen developers complete control over the rule management process, enhancing efficiency and contributing to the vigilant maintenance of optimal data quality.

Now is the right time to cast aside those training wheels and confidently navigate your own path along the data trail!
