Becoming Data Fluent: Know your Data Ecosystem

I published a short blog post on how analysts can upskill in the age of AI to remain relevant and lead with data. I want to dive a little deeper into each of the ideas with practical ways you can train on the skills you need for the future.

In the era of AI-driven decision-making, good data matters. Before you can even think about leveraging AI, you need to master your data ecosystem. Here’s how analysts can build a strong foundation by understanding where data comes from, how to evaluate it, and how to ensure it’s ready for AI.

1. Know Where Your Data Comes From—and How It’s Collected

It’s a dream to have good, clean data. However, in my experience, data is almost never clean, complete, or even usable when you first sit down to analyze it. This poses problems when we get excited about implementing AI projects or even doing a manual analysis. Before you get started, you need to ask yourself some key questions:

  • How is the data sourced? Is it created manually by humans? Is it generated by processes? If so, how are those processes biased or lacking? Can you trust the data if it’s input by hands on keyboards? Knowing how your data is created is one of the keys to understanding its impact. Reach out to the data and system owners to understand their confidence in the data and any of its limitations before you get started on your project.

  • What tools or systems are collecting the data? Whether it’s IoT sensors, web scraping tools, or manual entry, the method of collection affects accuracy and timeliness. Ensure you can actually gather and use the data. Does the system allow for direct connections to analysis and AI tools? Do you have to extract the data before using it? How does the security around the system impact access to the data? Is there anything you need to know about privacy or data law that could impact your analysis?

  • Are there any biases introduced during collection? For example, survey data might be skewed by how questions are phrased or who is surveyed. Have you accounted for those biases and are you able to explain them to others? Do you have the most unbiased data available for use? Should you create unbiased dummy data to train your model? Are you aware of your own biases as an analyst and are you able to mitigate them?

2. Identify the Most Reliable and Relevant Data Sources

Not all data is created equal. Analysts must learn to distinguish between relevant and irrelevant sources and choose the best data for projects. Leverage data dictionaries, talk to data stewards and get to know these key indicators of data quality:

  • Reliability: Is the source consistent and trustworthy (a.k.a. accurate, consistent, documented, timely and as objective as possible)? Historical performance, update frequency, and documentation quality are good indicators. Look into the metadata and documentation for your sources before choosing to use them. You should be able to explain the data reliability to anyone who asks.

  • Relevance: Does the data align with your business goals or analytical objectives? Even high-quality data is useless if it doesn’t serve your purpose. Irrelevant data leads to irrelevant conclusions. If the data you need doesn’t exist, don’t settle for the next best thing. Again, as an analyst, your integrity is critical: if you can’t trust the data, people can’t trust your analyses.

  • Timeliness: Is the data current enough to support real-time or near-real-time decisions? If not, how are you mitigating the risk of stale data or improving processes to get access to better data? How are you communicating the timeline of the data collection to your stakeholders?
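Timeliness checks like this can even be automated. Here’s a minimal sketch of a staleness check; the 24-hour window is an illustrative assumption, and in practice you would use whatever freshness threshold you agree on with your stakeholders:

```python
from datetime import datetime, timedelta, timezone

def is_stale(last_updated, max_age_hours=24):
    """Flag data older than the agreed freshness window (threshold is illustrative)."""
    age = datetime.now(timezone.utc) - last_updated
    return age > timedelta(hours=max_age_hours)

# Example: a dataset refreshed 2 hours ago vs. one refreshed 3 days ago
fresh = datetime.now(timezone.utc) - timedelta(hours=2)
old = datetime.now(timezone.utc) - timedelta(days=3)

is_stale(fresh)  # False -- within the window
is_stale(old)    # True  -- stale; mitigate or communicate the gap
```

A check like this can run on a schedule and alert you before a stakeholder ever sees a dashboard built on stale data.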

3. Assess Data Quality and Completeness

Before feeding data into AI models or analytical processes, it’s crucial to evaluate whether it’s of sufficient quality to support decision-making. Check in on:

  • Completeness: Are there missing values or gaps in the dataset? Missing data can skew results or cause models to fail.

  • Accuracy: Does the data reflect reality? Cross-check with known benchmarks or use validation rules.

  • Consistency: Are formats and units standardized across datasets? Inconsistent data can lead to integration issues.

  • Uniqueness: Are there duplicate records? Redundancy can inflate metrics and distort insights.

Final Thought:
In the field of analytics, data matters above all.

Post conceived of by Justeen Gales and written with the support of Microsoft Copilot
