High-level infrastructure
The DataBank's architecture is designed to aggregate a rich and diverse collection of data sources and datasets from various providers, each first cataloged by data analysts. Leveraging Microsoft Azure and Azure Data Factory, the DataBank extracts and ingests data from thousands of publicly available repositories.
Data is sourced in multiple formats and from multiple source types: structured data such as CSV, JSON, and Parquet files, SQL databases, and APIs, as well as unstructured data such as PDF documents. Connectors bring these diverse data types into Azure Data Factory for processing. Once processed, data is stored in Azure Storage in a tiered structure that indicates its processing stage.
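The split between structured and unstructured sources can be illustrated with a minimal Python sketch. It is not the DataBank's actual connector logic: the function name `classify_source` and the extension sets are assumptions made for illustration, and only file-based formats named above are covered (SQL databases and APIs have no file extension to inspect).

```python
from pathlib import PurePosixPath

# Hypothetical extension sets, taken from the formats listed in the text.
STRUCTURED_SUFFIXES = {".csv", ".json", ".parquet"}
UNSTRUCTURED_SUFFIXES = {".pdf"}


def classify_source(name: str) -> str:
    """Return 'structured', 'unstructured', or 'unknown' for a file name."""
    # Compare on the lower-cased extension so 'Report.PDF' matches '.pdf'.
    suffix = PurePosixPath(name).suffix.lower()
    if suffix in STRUCTURED_SUFFIXES:
        return "structured"
    if suffix in UNSTRUCTURED_SUFFIXES:
        return "unstructured"
    return "unknown"
```

A real pipeline would more likely rely on the connector type configured for each source than on file extensions; the extension check simply keeps the example self-contained.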
Data stored in the DataBank are classified into three stages:
- Bronze: Raw data
- Silver: Processed data after Extract, Transform, Load (ETL) processes
- Gold: Enriched data post-Machine Learning (ML)
Most of the DataBank's data is in the silver stage, ready for descriptive and/or real-time analytics, while a few datasets remain in the bronze stage. Datasets enriched through ML at a client's request are classified as gold.
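One way to picture the three stages is as an ordered enumeration in which a dataset moves forward one stage at a time. The Python sketch below is illustrative only: the names `Stage`, `promote`, and `storage_path`, and the `<stage>/<dataset>` path layout, are assumptions and do not describe the DataBank's real storage layout.

```python
from enum import Enum


class Stage(Enum):
    BRONZE = "bronze"  # raw data
    SILVER = "silver"  # processed data after ETL
    GOLD = "gold"      # enriched data after ML


# Each stage's successor; gold is the final stage and has none.
_NEXT_STAGE = {Stage.BRONZE: Stage.SILVER, Stage.SILVER: Stage.GOLD}


def promote(stage: Stage) -> Stage:
    """Return the stage that follows `stage`, or raise if it is already gold."""
    if stage not in _NEXT_STAGE:
        raise ValueError(f"{stage.name} is the final stage")
    return _NEXT_STAGE[stage]


def storage_path(stage: Stage, dataset: str) -> str:
    """Build a hypothetical '<stage>/<dataset>' prefix for tiered storage."""
    return f"{stage.value}/{dataset}"
```

For example, `storage_path(promote(Stage.BRONZE), "census")` yields a path under the silver tier, mirroring a dataset that has completed ETL.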
What We Do
- Seamless Data Ingestion
Harness the power of Microsoft Azure to effortlessly pull data from thousands of public sources. Our Azure Data Factory pipelines handle vast volumes and diverse data types with ease.
- Advanced Data Processing
Utilize the extensive Python data ecosystem for in-depth analysis, manipulation, and visualization. Our tools bridge the gap between raw data and advanced data science.
- AI Integration
Once your data is enriched, it’s ready to be fed into Azure Machine Learning or Azure AI Studio, setting the stage for innovation and progress.