The Web Robots platform has a built-in system for data validity management (Data QC). This feature is available only to paid users. Data QC compares statistics about the current run with known good runs from the past. As a result, each run gets one of two Data QC results: OK or QC_FAIL. Data QC reliably alerts about these typical issues:
- Website layout changed.
- Website layout changed slightly, and only one or a few fields disappeared.
- Amount of data changed.
Data QC for each run consists of 4 global metrics (duration, emits, row_size, rows), which are calculated regardless of which data fields are emitted, and 3 metrics (fieldName_avg, fieldName_cnt, fieldName_dist) for each Table/Field found in the data. Min and max values for each metric are calculated during training, and the actual value of each metric is then calculated for every run.
- duration – duration of a robot run in seconds;
- emits – number of data emits in a robot run;
- row_size – average size of a single row of data in bytes;
- rows – number of data rows in a robot run;
(Image: example global metrics)
Dependent QC metrics:
- fieldName_avg – average value of a field in a robot run. The value of this metric depends on the field data type. For numbers it is a simple arithmetic mean. For strings, arrays and objects it is the average of their length as strings;
- fieldName_cnt – number of rows that have some data present in this field;
- fieldName_dist – number of distinct values among all the data rows in this field;
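The three per-field metrics described above can be sketched roughly as follows. This is only an illustration, not the platform's implementation: the function name `field_metrics`, the rule for what counts as "data present" (not null and not empty), and the use of JSON serialization to measure the length of arrays and objects are all assumptions.

```python
import json


def _is_present(value):
    # Assumption: null and empty values do not count as "data present".
    return value is not None and value != "" and value != [] and value != {}


def _magnitude(value):
    # Numbers contribute their own value to the average.
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        return value
    # Strings contribute their length.
    if isinstance(value, str):
        return len(value)
    # Assumption: arrays, objects and other values are measured
    # by the length of their JSON text.
    return len(json.dumps(value))


def field_metrics(rows, field):
    """Return the _avg, _cnt and _dist metrics for one field."""
    values = [row[field] for row in rows if _is_present(row.get(field))]
    # Serialize values so that unhashable lists and dicts can be compared.
    distinct = {json.dumps(v, sort_keys=True) for v in values}
    avg = sum(_magnitude(v) for v in values) / len(values) if values else 0
    return {
        f"{field}_avg": avg,
        f"{field}_cnt": len(values),
        f"{field}_dist": len(distinct),
    }
```

For a field named `price`, the returned keys would be `price_avg`, `price_cnt` and `price_dist`, matching the naming pattern of the dependent metrics.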
(Image: example dependent metrics)
Training QC Metrics
When a new robot is created, its QC parameters are blank and the Data QC process will mark all runs as OK. The robot writer must verify the data manually, leave good reference runs as OK, mark bad runs as QC_FAIL, and then click the “Train QC params” button.
Statistical significance thresholds are calculated during QC training from the most recent runs marked OK. In all subsequent runs, QC metrics are calculated and compared to the thresholds determined during training. If a value falls between its thresholds, it is marked OK; otherwise it is marked QC_FAIL.
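The comparison step can be illustrated with a minimal sketch. The function name `evaluate_run`, the dictionary layout, inclusive bounds, treating a missing metric as a failure, and failing the whole run when any single metric is out of range are all assumptions made for the example. How the platform derives the thresholds during training is not shown here.

```python
def evaluate_run(metrics, thresholds):
    """Compare a run's metric values against trained min/max thresholds.

    metrics:    {"rows": 1000, "price_cnt": 900, ...}
    thresholds: {"rows": (900, 1100), "price_cnt": (850, 1100), ...}
    Returns the run result and the names of the metrics that failed.
    """
    failing = []
    for name, (low, high) in thresholds.items():
        value = metrics.get(name)
        # Assumption: a metric missing from the run counts as a failure,
        # and the bounds are inclusive.
        if value is None or not (low <= value <= high):
            failing.append(name)
    return ("QC_FAIL" if failing else "OK"), failing
```

Returning the list of failing metrics alongside the result makes it easy to see which field or global metric moved outside its trained range.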
The “Train QC params” button can be used at any time to re-train the QC metric thresholds.
Manually Editing QC Metrics
Sometimes it is necessary to edit QC thresholds by hand. To edit the robot QC manually, click the “Edit QC” button and wait for the new page to load. A table lists all QC metrics with their min/max values, along with the QC values of the latest run. The min/max threshold values of any field can be edited. To exclude a field from Data QC, check the “ignore” checkbox. After all changes are made, click the “Save” button in the lower left corner to save and update the QC metrics.
(Image: example of Edit QC page)
Deleting QC Metrics
Sometimes you need to delete and reset the QC data. To do that, press the “Edit QC” button to open the Edit QC page, then click the “Turn off QC” button.
To notify the user of issues with robot runs, our platform sends out QC Alert emails. An email is sent if a robot run finishes with one of these three statuses:
- QC_FAIL – The data of the run did not pass Quality Control.
- FAILED – The robot run failed because it reached the retry limit.
- DEFUNCT – The robot stopped responding for a significant amount of time and was marked as defunct.
QC alerts are sent to the client, the owner of the robot, and the last person who edited the robot. The email contains basic information about the run: start date, run/fork id, owner email, and robot name. If it is a QC_FAIL alert, it also includes a table with all the QC metrics for that run as well as a link to the robot.
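The alerting rule can be summarized in a short sketch. It is illustrative only: the function name `alert_recipients`, the email addresses, and the de-duplication of recipients (for example when the owner is also the last editor) are assumptions, not documented platform behavior.

```python
# Run statuses that trigger a QC Alert email, as listed above.
ALERT_STATUSES = {"QC_FAIL", "FAILED", "DEFUNCT"}


def alert_recipients(status, client, owner, last_editor):
    """Return the addresses that should receive an alert for a finished run."""
    if status not in ALERT_STATUSES:
        return []
    recipients = []
    for address in (client, owner, last_editor):
        # Assumption: skip missing addresses and send one email per person.
        if address and address not in recipients:
            recipients.append(address)
    return recipients


# Hypothetical addresses, for illustration only.
print(alert_recipients("QC_FAIL", "client@example.com",
                       "owner@example.com", "owner@example.com"))
```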
(Image: example QC Alert email)