BEGIN:VCALENDAR
VERSION:2.0
PRODID:-//hacksw/handcal//NONSGML v1.0//EN
METHOD:PUBLISH
BEGIN:VEVENT
DTSTAMP:20240328T093301Z
DESCRIPTION:Click for Latest Location Information: http://dgiq2020.dataversity.net/sessionPop.cfm?confid=139&proposalid=11748\n
Operational and transactional data are collected in large volumes, in different formats, from multiple sources, and flow through multiple platforms. To validate even a mere 1,000 tables, organizations typically have to write close to 100,000 Data Quality (DQ) rules. DQ validation using conventional approaches such as rule writing is costly, error-prone, and not scalable. Organizations are left to constantly firefight, writing new DQ validation rules whenever data errors are discovered, which leaves management nervous and distrustful of the data. Even when data has just 1-3% errors, the resulting analytical models are inaccurate and their predictions carry significant errors. It is therefore paramount to detect and correct poor data before it propagates throughout the organization.
\nThis speaker has used a Machine Learning-based framework extensively to discover DQ issues within cloud data lakes at many organizations. The framework helps organizations validate data assets through the lens of five DQ dimensions: Completeness, Conformity, Consistency, Reasonability, and Validity. This ML-based approach has discovered complex and unexpected DQ errors at several leading organizations. The talk will also outline different strategies for lake-level and application-level DQ validation using AI/ML.\n