How the DataBank has been built

High level infrastructure

The DataBank's architecture is designed to aggregate a diverse collection of data sources and datasets from various providers, each first cataloged by data analysts. Using Microsoft Azure and Azure Data Factory, the DataBank extracts and ingests data from thousands of publicly available repositories.

Data is sourced in multiple formats, including structured data like CSV, JSON, SQL databases, APIs, and Parquet files, as well as unstructured data such as PDF documents. These diverse data types are processed through connectors into Azure Data Factory. Once data is processed, it is stored in Azure Storage with a tiered structure signifying its processing stage.
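The split between structured and unstructured sources described above can be sketched as a small lookup. This is only an illustration: the function name `classify_source` and the lowercase labels are hypothetical, and the actual routing in the DataBank is done by Azure Data Factory connectors, not by this code.

```python
# Source kinds named in the text, grouped as the text groups them.
# The lowercase labels are an assumption made for this sketch.
STRUCTURED = {"csv", "json", "sql", "api", "parquet"}
UNSTRUCTURED = {"pdf"}


def classify_source(kind: str) -> str:
    """Return 'structured' or 'unstructured' for a source kind.

    Raises ValueError for kinds the text does not mention, so that an
    unknown format is not silently routed somewhere.
    """
    normalized = kind.strip().lower()
    if normalized in STRUCTURED:
        return "structured"
    if normalized in UNSTRUCTURED:
        return "unstructured"
    raise ValueError(f"unknown source kind: {kind!r}")
```

Raising on unknown kinds is a design choice for the sketch; a real pipeline might instead park such sources for manual cataloging.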

Data stored in the DataBank are classified into three stages:

Bronze: Raw data

Silver: Processed data after Extract, Transform, Load (ETL) processes

Gold: Enriched data post-Machine Learning (ML)

The DataBank holds the majority of its data in the silver stage, ready for descriptive and/or real-time analytics, with a few datasets in the bronze stage. Datasets processed through ML at a client's request are considered gold data.
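As a minimal sketch of the three-stage model, the stages can be represented as an enumeration with a helper that builds a storage path per stage. The path layout (`<stage>/<provider>/<dataset>`) and the names `storage_path` and `next_stage` are assumptions for illustration; the source states only that Azure Storage uses a tiered structure, not how it is laid out.

```python
from enum import Enum
from pathlib import PurePosixPath
from typing import Optional


class Stage(Enum):
    """Processing stages named in the text, from least to most refined."""
    BRONZE = "bronze"  # raw data
    SILVER = "silver"  # processed data after ETL
    GOLD = "gold"      # enriched data after ML


# Order of refinement, bronze -> silver -> gold.
_ORDER = [Stage.BRONZE, Stage.SILVER, Stage.GOLD]


def storage_path(stage: Stage, provider: str, dataset: str) -> str:
    """Build an illustrative path; this layout is a hypothetical convention."""
    return str(PurePosixPath(stage.value) / provider / dataset)


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None if it is already gold."""
    position = _ORDER.index(stage)
    if position + 1 < len(_ORDER):
        return _ORDER[position + 1]
    return None
```

For example, `storage_path(Stage.SILVER, "example-provider", "population.parquet")` gives `silver/example-provider/population.parquet`, and `next_stage(Stage.SILVER)` gives `Stage.GOLD`, which matches the case of a silver dataset being enriched through ML on request.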

What We Do

Seamless Data Ingestion
Harness the power of Microsoft Azure to effortlessly pull data from thousands of public sources. Our Azure Data Factory pipelines handle vast volumes and diverse data types with ease.
Advanced Data Processing
Utilize the extensive Python data ecosystem for in-depth analysis, manipulation, and visualization. Our tools bridge the gap between raw data and advanced data science.
AI Integration
Once your data is enriched, it's ready to be fed into Azure Machine Learning or Azure AI Studio, setting the stage for innovation and progress.
