There are many available methods to integrate information source reliability in an uncertainty representation, but there are only a few works focusing on the problem of evaluating this reliability. However, data reliability and confidence are essential components of a data warehousing system, as they influence subsequent retrieval and analysis.
In this paper, we propose a generic method to assess data reliability from a set of criteria using the theory of belief functions. Customizable criteria and insightful decisions are provided. The chosen illustrative example comes from real-world data issued from the Sym’Previus predictive microbiology oriented data warehouse. Evaluating Data Reliability An Evidential Answer with Application to a Web-Enabled Data Warehouse
These data are used in further inferences. During collection, data reliability is mostly ensured by measurement device calibration, by adapted experimental design and by statistical repetition. However, full traceability is no longer ensured when data are reused at a later time by other scientists. This estimation is especially important in areas where data are scarce and difficult to obtain as it is the case, for example, in Life Sciences. The growth of the web and the emergence of dedicated data warehouses offer great opportunities to collect additional data, be it to build models or to make decisions. The reliability of these data depends on many different aspects Meta information data source, experimental protocol, developing generic tools to evaluate this reliability represents a true challenge for the proper use of distributed data.
The evaluation of their reliability, it is natural to be interested in the reasons explaining why some particular data were assessed as (un)reliable. We now show how maximal coherent subsets of criteria, i.e., groups of agreeing criteria, may provide some insight as to which reasons have led to a particular assessment.
We present an application of the method a web-enabled data warehouse. Indeed, the framework developed in this paper was originally motivated by the need to estimate the reliability of scientific experimental results collected in open data warehouses. To lighten the burden laid upon domain experts when selecting data for a particular application, it is necessary to give them indicative reliability estimations.
We will hopefully be a better asset for them to justify their choices and to capitalize knowledge than the use of an ad hoc estimation. Tools development was carefully done using Semantic Web recommended languages, so that created tools would be generic and reusable in other data warehouses. This required an advanced design step, which is important to ensure modularity and to foresee future evolutions.