Analytics Engineering
- Analytics engineering is the work of organizing an organization's information.
- Real impact comes from continuous decision-making and implementing actions with feedback.
- Accept that analytics is a mess.
- Explicitly create separate workspaces for curated (production) and messy (experimental) work.
- Reports are rarely read and often forgotten. Decision-making involves getting data, summarizing and predicting, and then taking action.
- One of the best ways to communicate data is to tell stories. Stories are more captivating and present a coherent view of a topic.
- An analytics engineer's job is a lot like being a data librarian.
- If you run a library, books keep coming in and people keep looking for books on specific topics, so you have to organize the books in a way that lets everyone find what they need. There are many ways to organize books, not one perfect solution. A librarian helps people find the books they are looking for, and also discover books they didn't realize they wanted.
- Analytics code should be version controlled, tested, modular and maintainable.
- Define all resources (Dashboards in YAML, Cohorts in SQL, …) as code.
- Treat data the same way engineers treat code. That means CI/CD, tests, frequent PRs, …
- Use the Data Request Template (see Data Practices) when questions come in.
- Analytics work can be roughly split into two buckets:
  - Building automated systems, from metrics to Dashboards, that enable self-service use cases for business users. This is what we now typically call analytics engineering.
  - Doing ad-hoc analyses to answer specific questions directly.
- Make your modeling approach explicit (e.g. Dimensional Modeling).
- Modeling reality gets complex quickly. There are small nuances, special conditions, things that changed, edge cases and, of course, errors.
- Imagine your company today as a human society where only half the population can read (understand the data), one tenth can write (SQL queries), half a dozen languages are spoken, and most of the books (Dashboards/insight reports) in the library contain things that were once true but have since become outdated (and you don't know which ones). Not a highly productive information ecosystem.
- Domain knowledge is more important than your coding skills.
- Ground truth isn't a single place. Start by joining on common unique keys and counting things, then figure out what's different and why.
- Collaborate with your team and break down complex models into reusable pieces.
- Working with data is like exploring the horizon. It changes as soon as you look at it from a higher place (more data).
- Find out what decisions your stakeholders need to make, repeatedly, and help with those.
- Attach a date to your team's output resources (Dashboards, analysis, …) so they exist as artifacts that were true at a certain point in time.
- Reduce the areas where business logic can be injected, create "time to live" policies on last-mile transforms, and build a culture of standardizing and celebrating access to cross-functional codebases.
- People default to writing business logic in the tool they are most comfortable with. The best way for data teams to prevent sprawling business logic is to limit last mile transforms in other tools and invite others into their tools. The logic will be written, and if the data team gate-keeps, it will be written outside of their visibility! If a data team can educate and encourage contributions to their codebase, they invite code to be written where it most belongs.
- Modern data warehouses might need new model design paradigms.
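
The note above about ground truth (join on common unique keys, count things, then work out what differs and why) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `reconcile` helper, the `billing` and `warehouse` row sets, and the `order_id` key are all hypothetical names made up for this example, and real sources would usually be compared in SQL or a dataframe library rather than in plain dictionaries.

```python
from collections import Counter


def reconcile(left_rows, right_rows, key):
    """Compare two row sets on a shared key and report where they disagree."""
    # The key is only useful if it is unique in each source, so check first.
    for name, rows in (("left", left_rows), ("right", right_rows)):
        counts = Counter(row[key] for row in rows)
        dupes = sorted(k for k, n in counts.items() if n > 1)
        if dupes:
            raise ValueError(f"{name} has duplicate keys: {dupes}")

    left = {row[key]: row for row in left_rows}
    right = {row[key]: row for row in right_rows}

    return {
        # Step 1: count things.
        "left_count": len(left),
        "right_count": len(right),
        # Step 2: find keys that exist on one side only.
        "only_left": sorted(left.keys() - right.keys()),
        "only_right": sorted(right.keys() - left.keys()),
        # Step 3: find keys present on both sides whose values differ.
        "mismatched": sorted(
            k for k in left.keys() & right.keys() if left[k] != right[k]
        ),
    }


# Hypothetical data: the same orders as seen by two different systems.
billing = [
    {"order_id": 1, "amount": 100},
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 75},
]
warehouse = [
    {"order_id": 2, "amount": 250},
    {"order_id": 3, "amount": 80},
    {"order_id": 4, "amount": 40},
]

report = reconcile(billing, warehouse, key="order_id")
print(report)
```

The report only says where the sources differ. Figuring out why (late-arriving rows, a changed definition, a plain error) is still the domain-knowledge part of the job.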
Resources
Communities
Public Data Projects
- GitLab
- Mattermost
- Mozilla
- Dagster Open Platform
- LLM Support Bot
- MIT Open Learning
- Our World in Data
- Catalyst Cooperative PUDL
- Ecosyste.ms
- Spellbook
- Flipside Crypto
- Artemis
- MetricsDAO
- RA Analytics
- Anomstack
- Skrimmage
- Ibis
- OSO
- Tuba
- Department of Education for New South Wales
- OP Analytics
Dagster Resources
- dagster-io/hooli-data-eng-pipelines
- zsvoboda/ngods-stocks
- westmarindata/dagster-integration-demo
- b-gar/dagster-cfb
- jonathanneo/data-aware-orchestration
- mitodl/ol-data-platform
- fremantle-industries/tabletop
- franloza/coches-net-dashboard
- westmarindata/dagster-pypi-github
- slopp/dagster_s3_clickhouse_demo
- dagster-io/mdsfest-opensource-mds