And in the blink of an eye, summer is gone and it is time to return to blogging. Recently I had the idea to start a short blog series, and in the next few posts I will cover all the basics (and more) that you need to start using Pandas.

First and foremost – what is Pandas?

Pandas is a popular Python library that allows users to easily analyse and manipulate data. It offers powerful and flexible data structures and is hugely popular among data scientists and analysts. As with any other library, to use Pandas you first have to import it.

import pandas as pd

Before you start manipulating data with Pandas, you should understand how data is shaped into a readable form – the topic I will cover in Pandas#1. The library provides two data structures: Series and DataFrame.


A Pandas Series is a one-dimensional labelled array: each element is stored against an index label, much like a key-value pair. You can think of a Series as a column in a table. Series are similar to Python dictionaries, but they make data manipulation much easier.

To create a Series we use pd.Series(). There are lots of optional arguments, but the most commonly used one is data, which specifies the elements of the Series.

sample_series = pd.Series(['September', 'October', 'November'])

On the left-hand side of the Series elements you will see integers – these are known as the index of the Series. By default the index is a sequence of integers starting at 0, but you can replace it with a custom index. The custom index must have the same length as the number of Series elements.

sample_series_index = pd.Series(['September', 'October', 'November'], index=['first', 'second', 'third'])
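To see the two kinds of index side by side (a quick sketch using the same month values as above), you can inspect the default index and then look an element up by its custom label:

```python
import pandas as pd

# Default integer index: 0, 1, 2
sample_series = pd.Series(['September', 'October', 'November'])
print(sample_series.index.tolist())   # [0, 1, 2]

# Custom string index of the same length as the data
sample_series_index = pd.Series(['September', 'October', 'November'],
                                index=['first', 'second', 'third'])
print(sample_series_index['second'])  # October
```

With a custom index you can access elements by label rather than by position, which makes the Series behave a lot like a dictionary.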


A DataFrame is a 2-dimensional data structure, similar to a SQL table or a spreadsheet. You can also think of it as a combination of two or more Series. Each column of a DataFrame can contain a different data type. As with Series, when you create a DataFrame you can specify a custom index.

We can create a DataFrame using pd.DataFrame().

sample_DataFrame = pd.DataFrame([['2021', 'London', 'GBP'], ['2019', 'Paris', 'EUR'], ['2021', 'Berlin', 'EUR']], index=['row_1', 'row_2', 'row_3'], columns=['year', 'city', 'currency'])
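Here is the same example written out as runnable code (a sketch; the row-by-row arrangement of the values is my own), which also shows why a DataFrame can be thought of as a combination of Series – selecting a single column gives you back a Series:

```python
import pandas as pd

sample_DataFrame = pd.DataFrame(
    [['2021', 'London', 'GBP'],
     ['2019', 'Paris', 'EUR'],
     ['2021', 'Berlin', 'EUR']],
    index=['row_1', 'row_2', 'row_3'],
    columns=['year', 'city', 'currency'])

city_column = sample_DataFrame['city']   # one column of the DataFrame...
print(type(city_column))                 # ...is a pandas Series
print(city_column['row_2'])              # Paris
```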

Now you know how Pandas shapes data into a readable format. In the next post I will cover some data manipulation techniques, like adding a new row, dropping a row, updating a row and more. I hope this mini blog series will be beneficial to other folks, and I would love it if people who use Pandas share their experience in the comments below. Pandas#2 coming soon 🙂

T-SQL Tuesday #137: Using Notebooks Every Day

Another month, another blog party! Shout out to Steve (b|t) for hosting and choosing a great topic – this month we will be chatting about Jupyter notebooks and how we use them.

I discovered Jupyter notebooks not that long ago, but the more I use them, the more I see how powerful they can be. For those of you who are not familiar with Jupyter Notebook: it is an open-source web application where you can combine code, output, visualizations and explanatory text all in one document, allowing you to write code that tells a story. Now that you have an idea of what a Jupyter notebook is, I will walk you through how you can use one in Azure Machine Learning Studio.

But first things first: to be able to use Jupyter notebooks in Azure Machine Learning Studio you need an Azure Machine Learning workspace. You can easily create one from the Azure portal – just search for Machine Learning in the “Create new resource” section and fill in the required information.

Now that you have your workspace created you will need to start Azure Machine Learning Studio. Go to the workspace overview page and click on the “Launch Studio” button.

Once you have Azure Machine Learning Studio open on the menu on the left you will see “Notebooks” under the Author section.

In the Notebooks window, you have two subsections. Under “File” you can create your own notebook by clicking on the “+” symbol and selecting “Create New File”, which opens a window where you fill in the file name and choose the file type – one of the options being Notebook (*.ipynb).

To be able to run a notebook you have to point it to a Compute – if you already have a compute instance you can select it otherwise you can click on the plus and create a new Compute resource.

You can also use one of the sample notebooks – go to “Samples” and choose a sample notebook, but note that to work with it you will have to clone it first.

To edit an existing notebook, just click on the cell you want to update. You don’t need to be connected to a compute instance to edit a notebook; you do, however, need one to run it.

When you create an ipynb file, Azure Machine Learning also creates a checkpoint file. Every notebook is autosaved every 30 seconds; however, that autosave updates only the initial ipynb file. You can also save the notebook manually by going to the menu and, under “File”, selecting “Save and checkpoint”.
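Under the hood, an ipynb file is just a JSON document, which is what both the autosave and the checkpoint mechanism are writing out. As a rough sketch (the cell content here is invented for illustration), a minimal notebook looks like this:

```python
import json

# A minimal notebook document: a list of cells plus format metadata
notebook = {
    "cells": [
        {"cell_type": "code", "source": ["print('hello')"],
         "metadata": {}, "outputs": [], "execution_count": None},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

# Saving a notebook (or a checkpoint copy of it) is just serialising this JSON
text = json.dumps(notebook, indent=1)
restored = json.loads(text)
print(restored["nbformat"], len(restored["cells"]))
```

Knowing it is plain JSON also explains why notebooks diff and version-control awkwardly compared to .py files – the outputs and metadata live in the same document as the code.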

Now that you know how to start using notebooks in Azure Machine Learning Studio I strongly recommend going and playing with them as it can really help you put your code together and collaborate easily with others.

T-SQL Tuesday #136: Blog About Your Favourite Data Type

It is time for the March T-SQL Tuesday blog party. For some time now I have been planning to blog about data types, so you can imagine my excitement when I saw that this month Brent Ozar (t|b) is asking us to blog about our favourite data types. Shout out to Brent Ozar for hosting and choosing such an interesting and useful topic.

Some of you may already know that I have dedicated a large part of my life to studying maths and I LOVE numbers; the fact that I have chosen to talk about numeric data types in SQL Server will probably not surprise anyone.

There is no one-size-fits-all data type, and I want to highlight the importance of choosing the right one – doing so can improve not only performance but also data integrity. I will walk you through the various numeric data types in SQL Server, indicating when each should be used.


Imagine you are working with a column that has two possible values – this may prompt you to reach for Boolean data. A Boolean allows only two possible values: “True” or “False”. SQL Server has no Boolean type; instead we have the BIT data type, which can be used to store Boolean values.

BIT is an integer data type that can take a value of 0, 1 or NULL. With regards to storage, if we have 8 or fewer BIT columns in a table, the columns are stored as 1 byte. If we have between 9 and 16 BIT columns, they are stored as 2 bytes, and so on. Additionally, the string values TRUE and FALSE can be converted to the BIT values 1 and 0.
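The storage rule above boils down to one byte per group of up to 8 BIT columns. A quick sketch of the arithmetic:

```python
import math

def bit_columns_storage_bytes(bit_column_count: int) -> int:
    """Bytes used to store a row's BIT columns: 1 byte per 8 columns, rounded up."""
    return math.ceil(bit_column_count / 8)

print(bit_columns_storage_bytes(8))   # 8 or fewer BIT columns -> 1 byte
print(bit_columns_storage_bytes(9))   # 9 to 16 BIT columns    -> 2 bytes
print(bit_columns_storage_bytes(17))  # 17 to 24 BIT columns   -> 3 bytes
```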

A good example of when to use BIT is a column containing “Gender” information – Male (M), Female (F) and unknown (NULL). However, BIT may not be the best choice for a column storing “Status” information: although initially you may think that the status can only be “on” or “off”, later on you may decide to add other classifications such as “in progress”, “archived”, etc. A better option for the “Status” column would therefore be TINYINT, which we will cover next.


SQL Server integer data types represent whole numbers (both negative and positive). SQL Server supports four variations of integer data types with different ranges and storage requirements – TINYINT, SMALLINT, INT and BIGINT.

Some practical uses of integer data types include counting values, storing a person’s age, or serving as an ID key for a table. It is good practice to use the smallest integer data type that will reliably suit your data. For example, to store people’s ages we can use TINYINT, as it is not likely that someone will be older than 255 years.
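For reference, here is a sketch of the four integer types with their storage sizes and ranges (TINYINT is unsigned; the other three are signed):

```python
# SQL Server integer types: (storage bytes, minimum value, maximum value)
integer_types = {
    'TINYINT':  (1, 0, 2**8 - 1),          # 0 to 255
    'SMALLINT': (2, -(2**15), 2**15 - 1),  # -32,768 to 32,767
    'INT':      (4, -(2**31), 2**31 - 1),  # roughly +/- 2.1 billion
    'BIGINT':   (8, -(2**63), 2**63 - 1),  # roughly +/- 9.2 quintillion
}

for name, (storage, lo, hi) in integer_types.items():
    print(f'{name}: {storage} byte(s), {lo:,} to {hi:,}')
```

Picking the smallest type that fits saves storage on every row and every index that includes the column.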


NUMERIC(p,s) and DECIMAL(p,s) are exact data types defined by precision (p) and scale (s). Precision is the maximum number of total digits allowed, and scale is the number of digits to the right of the decimal point. The default value of p is 18 and of s is 0; for both, the minimum is 1 and the maximum is 38. To illustrate, say we have a column defined as DECIMAL(10,4) – precision 10 and scale 4. The column can safely store the number 123456.1234.
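The precision/scale bookkeeping can be sketched in code. This hypothetical helper (my own, not part of any library) checks whether a value fits in DECIMAL(precision, scale):

```python
from decimal import Decimal

def fits_decimal(value: str, precision: int, scale: int) -> bool:
    """Check whether value fits in DECIMAL(precision, scale):
    at most `scale` fractional digits and `precision - scale` integer digits."""
    d = Decimal(value)
    sign, digits, exponent = d.as_tuple()
    fractional_digits = max(0, -exponent)
    integer_digits = max(0, len(digits) + exponent)
    return fractional_digits <= scale and integer_digits <= precision - scale

print(fits_decimal('123456.1234', 10, 4))   # True: 6 + 4 = 10 digits in total
print(fits_decimal('1234567.1234', 10, 4))  # False: 7 integer digits need precision 11
```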

According to the Microsoft documentation, NUMERIC and DECIMAL are synonyms and can therefore be used interchangeably. So why are there two data types for the same purpose? DECIMAL and NUMERIC are almost the same, but after some research I found that the small difference between them lies in the fact that NUMERIC determines the exact precision and scale, while DECIMAL specifies only the exact scale, and the precision is equal to or greater than the specified value.

DECIMAL is one of the most used numeric data types because of its accuracy, but you should always ensure the precision of the DECIMAL variable specified is enough to accommodate the values assigned to it.


SQL Server provides two dedicated data types for storing monetary values. You can think of MONEY and SMALLMONEY as DECIMAL – MONEY is effectively the same as DECIMAL(19,4), while SMALLMONEY is effectively the same as DECIMAL(10,4). If you are planning to use MONEY, keep in mind that performing division and multiplication can cause rounding errors that result in an unintentional loss of precision. The cause of the problem is that MONEY only stores information up to the 4th decimal place; if your multiplication or division produces a value that extends to the 5th decimal place or beyond, MONEY will round it off, causing an accuracy error.
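You can reproduce the effect outside SQL Server with Python’s decimal module by forcing every intermediate result to 4 decimal places, the way MONEY stores values (a sketch, not an exact emulation of SQL Server’s rounding):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def as_money(value: Decimal) -> Decimal:
    """Round to 4 decimal places, like a MONEY intermediate result."""
    return value.quantize(Decimal('0.0001'), rounding=ROUND_HALF_EVEN)

# Divide 1.00 by 345, then multiply back by 345
money_result = as_money(Decimal('1')) / Decimal('345')   # exact: 0.0028985507...
money_result = as_money(money_result) * Decimal('345')   # stored as 0.0029 first!

decimal_result = (Decimal('1') / Decimal('345')) * Decimal('345')

print(money_result)                                  # 1.0005 - rounding leaked into the answer
print(decimal_result.quantize(Decimal('0.0001')))    # 1.0000
```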

To prevent the accuracy error, the recommendation (almost without exception) is to use DECIMAL instead of MONEY. DECIMAL doesn’t cut numbers short with a rounding error, so you’re multiplying and dividing the whole number, and thereby maintaining the accuracy of your calculations. MONEY is a good option only if you are a non-enterprise user of SQL Server and choosing that data type will optimise storage.

It is also worth mentioning that MONEY is the only data type that can accept formatted numbers (1,234.1234) and can also accept currency symbol prefixes with a number ($1,234.1234).


The FLOAT data type holds approximate numeric values, meaning that not all values in the data type’s range can be represented exactly. The float data type is declared as FLOAT[(n)], where n is the number of bits used to store the mantissa of the float number – in other words, n dictates the precision. If no precision is specified during the declaration, the default is 53. Attempting to assign a value larger than the declared precision will cause an error to be raised. REAL is similar to FLOAT, but it is an IEEE-standard floating point value equivalent to FLOAT(24).

The main difference between DECIMAL and FLOAT is that DECIMAL data types are good for storing exact numbers but can have at most 38 digits in total, while FLOAT data types are really good for long numeric values where the stored number is an extremely close approximation. Note that using FLOAT and REAL carelessly can lead to accuracy issues. They are the right choice if you are working with scientific data that requires the properties of floating point numbers, but in most cases the better choice will be DECIMAL.
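A quick illustration of the difference, sketched in Python, where float is binary floating point (like FLOAT) and Decimal is an exact type (like DECIMAL):

```python
from decimal import Decimal

# Binary floating point: an extremely close approximation, not the exact value
print(0.1 + 0.2)         # 0.30000000000000004
print(0.1 + 0.2 == 0.3)  # False

# Exact decimal arithmetic
print(Decimal('0.1') + Decimal('0.2'))                    # 0.3
print(Decimal('0.1') + Decimal('0.2') == Decimal('0.3'))  # True
```

Neither 0.1 nor 0.2 has an exact binary representation, so the tiny errors accumulate – exactly the kind of surprise you avoid by using DECIMAL for money and other exact quantities.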


I hope this helped you understand the numeric data types in SQL Server a bit better. The original topic for this month was to blog about our favourite data type; however, I cannot choose only one magical universal data type, as I believe each of them can be magical when used in the right circumstances. My formula for choosing the right data type: understand your data – how it is used and how it may be used – and have a good understanding of the different data types so you can select the best option when developing your databases.

#5 Succinct Introduction to Azure Synapse Analytics

Think of Azure Synapse Analytics as an Azure SQL Data Warehouse on steroids. It is a limitless analytics service that brings together data integration, enterprise data warehousing and big data analytics, all into a single service.

To better understand Azure Synapse, let’s take a step back and start by briefly covering what Azure SQL Data Warehouse (SQL DW) is. SQL DW was a massively parallel processing (MPP) cloud-based, scale-out relational database, designed to process large volumes of data within the Microsoft Azure cloud platform. Over the last couple of years, Microsoft has added features that have evolved SQL DW into a more powerful and unique data analytics solution known as Azure Synapse Analytics. 

One of the key capabilities of Azure Synapse Analytics is that it combines data warehousing and big data analytics into a single service. Formerly we used SQL DW to access structured data and Data Lake for big data analysis; now Synapse merges both under a single bracket. Azure Synapse Analytics is integrated with numerous other Azure services, such as Azure Data Catalog, Azure Databricks, Azure HDInsight, Azure Machine Learning and Power BI.

The central data storage of the Synapse Workspace is based on an Azure Data Lake Storage Gen2. On top of the blob storage, you can choose between two different kinds of analytic runtimes:

  • SQL-based runtime using a dedicated or serverless approach
  • Apache Spark runtime 

Azure Synapse Analytics comes with an integrated orchestration engine that is identical to Azure Data Factory to create data pipelines and rich data transformation capabilities within the Synapse workspace itself.

Another key aspect is the security features in Azure Synapse Analytics:

  • Already compliant with industry-leading standards such as ISO, DISA, HIPAA, etc.
  • Supports Azure AD authentication, SQL authentication and multi-factor authentication
  • Supports data encryption at rest and in transit as well as data classification for sensitive data 
  • Supports row-level, column-level, as well as object-level security along with dynamic data masking. 
  • Supports network-level security with a virtual network as well as firewalls 

Create an Azure Synapse Analytics Workspace

Navigate to the Azure Portal and search for Azure Synapse Analytics. You will land on the homepage of the service. Click on the ‘Add’ button (or, if this is your first workspace, you can also use “Create Synapse workspace”).

Once the wizard opens, in the Basics section enter your preferred Subscription, Resource Group, Workspace name and Region. Next is the “Select Data Lake Storage Gen2” section. Data in Azure Synapse can be accessed from different repositories, and Synapse can natively read data from Azure Data Lake Storage Gen2 – for that purpose we need an account as well as a file system.

If you already have an account you can manually specify the URL. To create a new Azure Data Lake Storage Gen2 account, click on the Create New button under Account Name. Check the box titled “Assign myself the Storage Blob Data Contributor role”, as Contributor-level access is required by various Synapse features to access the data.

In the Security tab, under the SQL administrator credentials, you need to provide the administration credential that would be used to connect to SQL pools.

Next is the Network tab. You can enable a Synapse-managed virtual network by selecting “Enable managed virtual network” – this will ensure that the traffic uses the Azure internal network only (Note: This feature has an additional cost). Here you can also specify which IP addresses can connect to the workspace and you have the option to select “Allow connection to all IP addresses”.

Optionally you can add any Tags to add metadata to your workspace and, voila, you have provided all the required information to create your workspace. Review all the details and before clicking ‘Create’, note that when creating a Synapse workspace a SQL On-demand pool is created by default and the estimated cost for it is £3.73/TB of data scanned (cost as of February 2021).

Creating the Azure Synapse workspace should take a couple of minutes. Click on the ‘Go to resource’ button to open the workspace and you will land on the dashboard page, where you can find various properties and endpoint information, create new pools, reset your credentials, change firewall settings, start Synapse Studio and more.

Azure Synapse Studio

Now that you have a workspace, let’s jump into Azure Synapse Studio. There are two ways to open the tool: you can follow the link from your workspace dashboard page, or you can go to the Azure Synapse web portal and sign in to your workspace.

On the dashboard page, you can see four categories: Ingest, Explore & Analyse, Visualize and Learn.

If you are new to Synapse, I strongly recommend you check the Learn section. It provides samples and tutorials that will help you kick start your Azure Synapse adventure.

Go back to  the home dashboard; by clicking ‘New’, you will see a dropdown menu that gives you the option to create various artefacts like SQL Scripts, Notebooks, Data Flow, Spark Job or Pipelines. You can also select ‘Import’ if you want to copy data from an external data repository.

In the menu on the left, you will see six buttons. Below the Home icon is the Data section, where you can create databases, tables and other database objects, as well as linked databases to access data from external data repositories.

Next is the Develop tab, which allows you to create new artefacts such as SQL scripts, notebooks, etc.

The fourth tab is the Integration section. You can create data pipelines, jump directly to the Copy tool which allows you to create data pipelines step by step using a wizard, or browse a gallery of samples or previously created data pipelines.

Next in the list is the Monitoring tab. In addition to everything so far, Azure Synapse Studio also works as an administrative console. Here you can see a history of all the activities taking place in the workspace and identify which ones are active now. You can monitor pipelines, triggers, and integration runtimes as well as Spark and SQL activities.

Finally, we have the Tools tab.

And finally…

Without a doubt, Azure Synapse Analytics is a game-changer in Data processing and Analytics. I hope that with this post I have sparked your interest and have covered the basics you need to go and explore the service with limitless possibilities.

#4 Getting Started with Dbatools

Being efficient is an integral part of being an effective DBA. In this post, I will introduce you to a tool that can significantly improve your work as a SQL DBA – dbatools.

Dbatools is an open-source PowerShell module for managing SQL Server. Started by Chrissy LeMaire (b|t) back in 2014 as a project to migrate SharePoint SQL Server instances, the module nowadays comes with almost 600 commands for automating SQL Server from the command line. One of the things I love most about dbatools is that the module is community-driven, with more than 190 contributors – if you get stuck or have any questions, the amazing folks from the community are always happy to help.

1. Installing Dbatools

Getting started with dbatools is quick and easy. First, you need to install the module- open PowerShell as Administrator and run the following command:

Install-Module dbatools 

If you don’t have administrative rights on the machine, or you want to install the module only for your current user, you can use:

Install-Module dbatools -Scope CurrentUser 

Or, if you don’t have internet access, you can download the module and then copy it to the machine without internet. More information about offline installs is available on the dbatools website.

2. Sp_Configure with dbatools

Almost every DBA uses sp_configure to view server settings or to make config changes – thanks to dbatools we can use sp_configure capabilities via the command line across multiple servers simultaneously.

To get information about server-level system configuration you can use Get-DbaSpConfigure. The command returns all system configuration information. You can also add parameters to narrow the view; for example, specifying -Name will return only that specific configuration.

#Get all Configuration properties
Get-DbaSpConfigure -SqlInstance $Instance1 
#Get the value for XPCmdShellEnabled
Get-DbaSpConfigure -SqlInstance $Instance1 -Name XPCmdShellEnabled

The Set-DbaSpConfigure command helps you make changes to the server level system configuration.

#Set the value for XPCmdShellEnabled
Set-DbaSpConfigure -SqlInstance $Instance1 -Name XPCmdShellEnabled -Value 1

Once you have made some configuration changes you may want to export the advanced sp_configure global configuration to a SQL file:

#Export the configuration to a .sql file (useful for documentation purposes)
Export-DbaSpConfigure  -SqlInstance $Instance1 -Path C:\temp

The real power of PowerShell is that you can work with multiple instances. Here is an example of how you can compare the configuration information of two instances:

#Compare configurations of two servers 
$SPConfigure_Instance1 = Get-DbaSpConfigure -SqlInstance $Instance1
$SPConfigure_Instance2 = Get-DbaSpConfigure -SqlInstance $Instance2
$propcompare = foreach ($prop in $SPConfigure_Instance1) {
    [pscustomobject]@{
        Config                  = $prop.DisplayName
        'Windows Node1 setting' = $prop.RunningValue
        'Windows Node2 Setting' = $SPConfigure_Instance2 |
            Where-Object DisplayName -eq $prop.DisplayName |
            Select-Object -ExpandProperty RunningValue
    }
}

Copying sp_configure settings from one instance to another is also easily achievable with dbatools:

#Copy configuration from one Instance to the other
Import-DbaSpConfigure -Source $Instance1 -Destination $Instance2

You can also view and change the sp_configure settings of multiple servers simultaneously.

#Configure multiple Instances simultaneously
$AllInstances = $Instance1 , $Instance2
Get-DbaSpConfigure -SqlInstance $AllInstances -Name XPCmdShellEnabled
Set-DbaSpConfigure -SqlInstance $AllInstances -Name XPCmdShellEnabled -Value 0

3. Backups with Dbatools

You are only as good as your last backup, so making regular backups is a keystone of any business continuity and disaster recovery strategy. Backups are among the tasks most frequently performed by every DBA – luckily, dbatools can help you quickly and easily perform and test backups.

The command for backing up one or multiple databases is Backup-DbaDatabase. If you want to do a full backup on all databases on an instance and place the backup files to a certain location, you can run:

#Full backup of all databases
Backup-DbaDatabase -SqlInstance $Instance1 -Path C:\temp\Backup -Type Full

To review how your backups have been performing, use Measure-DbaBackupThroughput. The command uses the information stored in MSDB and returns minimum and maximum throughput, average backup duration, first and last backup date, database backup count and average backup size.

Measure-DbaBackupThroughput -SqlInstance $Instance1

Get-DbaDbBackupHistory is used to get details about all backups belonging to a particular database or instance with minimal effort. The output includes the backup type, size, backup device type, duration, and start and end time.

#Backup history details for all databases on Instance1:
Get-DbaDbBackupHistory -SqlInstance $Instance1

Before restoring a database, it is a good practice to check that we have enough space on the target instance. Once we know we have the required space we can perform the restore. Here is an example of how you can do this with dbatools:

#Check we have enough space to restore AdventureWorks2019 from Instance1 to Instance2
Measure-DbaDiskSpaceRequirement -Source $Instance1 -Destination $Instance2 -Database AdventureWorks2019

#Restore databases on Instance2 from file location
Restore-DbaDatabase -SqlInstance $Instance2 -Path C:\temp\Backup -WithReplace

Having a backup that cannot be restored is like having lifeboats that don’t float- so testing your backup is essential. With dbatools you can do this easily – Test-DbaLastBackup tests the last set of full backups in a SQL instance via performing the following steps:

  • It gathers the last full backup information for the specified database. If we do not specify a particular database, it gathers information for all databases
  • It restores the last full backup of the database under a new name on the specified server. If we do not specify an instance, it restores the database on the source instance. By default, the new database name is in the format dbatools-testrestore-$databaseName; we can specify a custom database name as well
  • It renames the logical and physical files of the new database, ensuring no conflict with existing database filenames
  • It performs DBCC CHECKTABLE to check the integrity of all the pages and structures for all tables in the new database
  • In the last step, it drops the newly restored database
Test-DbaLastBackup -SqlInstance $Instance2

4. Logins with Dbatools

Ensuring that users and applications can successfully connect to the databases they require is essential – with dbatools you can simplify the work that is required to administer instance logins and database users.

You can create a new SQL (or Windows authentication) login. You can also check all logins that already exist on an instance.

#Returns info for all the logins on Instance1
Get-DbaLogin -SqlInstance $Instance1
#Create a new SQL login named Lina1 on Instance1
New-DbaLogin -SqlInstance $Instance1 -Login Lina1

You may also want to get the users of a Database and then add a new user. 

#Gets the users for AdventureWorks2019 
Get-DbaDbUser -SqlInstance $Instance1 -Database AdventureWorks2019 | Format-Table

#New sql user with login named Lina1 in AdventureWorks2019
New-DbaDbUser -SqlInstance $Instance1 -Database AdventureWorks2019 -Login Lina1

In addition, you can check database role members and easily assign a role to a user.

#Returns all role members of AdventureWorks2019 on Instance1
Get-DbaDbRoleMember -SqlInstance $Instance1 -Database AdventureWorks2019

#Add user Lina1 to the roles db_datareader and db_datawriter in AdventureWorks2019
Add-DbaDbRoleMember -SqlInstance $Instance1 -Database AdventureWorks2019 -Role db_datareader, db_datawriter  -User Lina1

Covering all 500+ SQL Server best practice, administration, development and migration commands included in dbatools is mission impossible, but I strongly recommend checking out the module if you are looking for automation or an easier and quicker way to perform tasks. Although I am still new to dbatools, I was lucky to be introduced to the module by my amazing colleagues; I now use it daily and it has helped me significantly improve my efficiency. This blog was a short summary of just a few of the many brilliant things you can do with the module – do check out the dbatools website and explore it further. I would love to hear what you think of dbatools – please share your experience in the comments below.

T-SQL Tuesday #135: The outstanding tools of the trade that make your job awesome

Welcome to my first T-SQL Tuesday blog, and thanks to Michael Bronowski (b|t) for hosting and picking a great topic. Not familiar with T-SQL Tuesday? Started by Adam Machanic (b|t) back in 2009, the idea is that each month a different host selects the theme, and on the second Tuesday of each month bloggers from around the world share their knowledge on the topic. Let the blog party start!

I still remember the first time a user contacted me with the question: “The database is running slow. Can you help?”. Wouldn’t it be amazing to have a tool that captures detailed information about what’s running on your instance in situations like that? There is one! Today I want to share one of my favourite tools: the stored procedure sp_whoisactive, developed by Adam Machanic. Sp_whoisactive gives you a lot of information about the active sessions on a SQL Server instance, such as the query being executed, the user executing it, the wait event, runtime, CPU usage, Tempdb usage and more. You can get even more information by passing parameters.

Getting started with sp_WhoIsActive is easy. To download the procedure, go to the website’s downloads page and select the relevant release. Open the who_is_active.sql file using SQL Server Management Studio and execute the script. You will see that, by default, sp_WhoIsActive only returns running user processes.

The procedure provides the following information for each process:

dd hh:mm:ss.mss - Process elapsed time
session_id - The process session id
sql_text - The currently running SQL command
login_name - The login name associated with the process
wait_info - The process wait information (aggregated)
CPU - The CPU time
tempdb_allocations - Number of Tempdb writes done
tempdb_current - Number of Tempdb pages currently allocated
blocking_session_id - The blocking session id
reads - Number of reads done
writes - Number of writes done
physical_reads - Number of physical reads done
used_memory - The amount of memory used
status - The process status
open_tran_count - The number of open transactions
percent_complete - The query completion percentage
host_name - The host machine name
database_name - The database name where the query is executed
program_name - The application that executed the query
start_time - The process start time
login_time - The login time
request_id - The request id
collection_time - The time this last select was run

Looking for some additional information? You can obtain more information from the procedure by passing additional parameters. For example, if you want to find information not only for the running user processes but for the system processes as well you can use the following command:

Exec sp_whoisactive @show_system_spids = 1

One of my favourite things about sp_whoisactive is how well it is documented- so much useful information plus all of the additional parameters and their description. Use this command to get the information:

Exec sp_whoisactive @help = 1

Using Azure Data Studio? Then you should check out the sp_whoisactive extension. Search for sp_whoisactive in the marketplace and install it. Connect to your SQL instance in Azure Data Studio, right-click on the instance and select Manage. The sp_whoisactive output is in a graphical format, and by clicking on any of the blocks you will automatically get a query that includes all the required parameters.

sp_whoisactive has helped me so many times dealing with performance issues. I strongly recommend adding it to your collection of tools – it will make your life easier!

#3 Getting started with AWS

A couple of weeks ago I decided that I want to get a better understanding of AWS. The question was- what is the best way to do that? In my opinion, when you want to start learning about any of the public cloud providers there are three main points you should cover:

  • Basics of cloud computing
  • Services provided
  • Pricing

I saw that all of these are covered in the AWS Cloud Practitioner exam and decided to start preparing for CLF-C01. Whether you are preparing for the exam or just want a better understanding of how AWS works, I will share the resources that I used and my overall experience.

Step One:

If you are preparing for AWS Cloud Practitioner I would suggest starting with the AWS website. You will find the Exam Guide and some resources that will be useful. According to the Exam Guide, the exam consists of four domains:

  • Cloud Concepts (26%)
  • Security and Compliance (25%)
  • Technology (33%)
  • Billing and Pricing (16%)

The “Prepare for the exam” section provides a couple of things I would recommend.

  1. The AWS Cloud Practitioner Essentials training. Is it enough to prepare for the exam? In my opinion, no – but if you have absolutely no idea about cloud computing you may find it beneficial, as it is super simplified and gives a good overall summary of what AWS is.
  2. AWS whitepapers. You have to read those – there you can find pretty much everything. However, I wouldn’t recommend starting your exam preparation with the whitepapers. I found it useful to read them a bit later in the learning process, so we will come back to the whitepapers later in this post.

Step Two

Choose the platform and the learning path that best fits your style. I have three suggestions.

  1. A Cloud Guru. This is a great learning platform – it is well structured, easy to understand, and covers about 80% of what you will need to know for the exam. The course is about 14 hours long; it has more than 70 lessons (in the form of short videos), 3 lab exercises and a practice exam at the end. In my opinion, the course covers enough to pass the exam; however, if you want to get the full picture of AWS services you have to combine it with some self-learning. The only minus I would point out for this course is that, although they regularly update it, most of the content was recorded back in 2018, and even though the concept of AWS hasn’t drastically changed since then, some services have been added. A Cloud Guru is a paid platform – it costs £33 a month, but it has a 7-day free trial, which is enough to cover the AWS Cloud Practitioner materials.
  2. Pluralsight. There is a great AWS Cloud Practitioner path on Pluralsight. The Pluralsight course is not as detailed as the A Cloud Guru one, but they cover pretty much the same material. Again, the lessons are in the form of videos – the whole path takes 8 hours to complete. There are no lab exercises, but there are great demos that you can follow along with. Again, I would say that this course covers enough to pass the exam; however, to fully understand AWS you have to combine it with self-learning. Pluralsight is a paid platform as well, but it has a one-month free trial.
  3. YouTube. If you are looking for a free platform, there are many courses on YouTube. Like demos? Then I would recommend a four-hour video course of demos of the most important services offered by AWS.

Step Three:

Go and play with AWS. AWS offers the so-called Free Tier, which allows you to try some AWS services free of charge within certain usage limits. When you create an AWS account, you’re automatically signed up for the AWS Free Tier for 12 months. You can find more information about the Free Tier limits per service here, and I strongly recommend going and launching an EC2 instance, setting up S3, creating a bucket – you can even create a static website using an S3 bucket – and much more, for free.

Step Four: 

Read the whitepapers. Remember that I mentioned them back in Step One – now is the time to read and understand them. The whitepapers cover everything – they are the official AWS documentation. At the time of writing this blog, there are a couple of whitepapers recommended for the Cloud Practitioner exam: Overview of Amazon Web Services and How AWS Pricing Works.

  1. Overview of Amazon Web Services. When was the last time you solved a 1000-piece puzzle? The feeling of reading and understanding this whitepaper is similar. AWS offers more than 160 services, and in this paper you will read about all of them. I would say it takes some time to understand each service and put the picture together, but once you do, you feel the satisfaction. You don’t need to know all of the services for the Cloud Practitioner exam, and I think it is impossible to become an expert in all of them; however, if you are interested in AWS and are planning to use it, I recommend getting an overall idea of all the services (or at least as many as possible).
  2. How AWS Pricing Works. In Step Three we discussed that you can use some AWS services for free; however, depending on your configuration or the services used, AWS can be expensive as well – you should know your services and how pricing works! This whitepaper contains all the pricing information you need to know. It gives you an overview of the AWS pricing model and then explains the pricing of each service individually. For the exam, you don’t need to know the pricing for each service, but you should make sure you cover at least the most important ones.

These were all of the resources that I used to broaden my understanding of AWS, and if you go through them you will have no problem passing the AWS Cloud Practitioner exam. It took me about 10 days, spending 3-4 hours a day studying. The exam itself is 90 minutes long and has 65 questions. You get a pass or fail straight after the exam, and within 5 business days you get your detailed report with your score and your certificate (in my case it was the day after the exam). I hope this will be useful to anyone looking for AWS learning sources, and if you have decided to take the exam – good luck!

#2 Azure DP-900 Exam Preparation Guide

Passing DP-900 was on my New Year’s resolutions list. Just ten days after 2021 started, I can happily check that off as completed, because today I passed the Microsoft Azure Data Fundamentals exam. In this post, I want to share with you how I prepared and my overall experience.

DP-900 tests foundational knowledge of core data concepts and how they are implemented using Microsoft Azure data services. As someone interested in working with data in the cloud, I found the exam beneficial. You can find detailed information about the exam on the Microsoft website. A few things that I didn’t see on the site, but think are good to know:

  • Number of questions: Between 40 and 60 (in my case, 49 questions)
  • Duration: You have 60 minutes for the actual exam
  • Result: You get your result straight after answering all the questions – something that I particularly like. You also get some analysis of the areas where you did best and the areas you need to improve

Personally, it took me around a week, studying 3-4 hours a day to get ready for the exam.

The only source that I used to prepare was the Microsoft Learn platform. This is a free learning platform that covers all of the information needed to pass any Azure certification. 95% of the materials are in text format and there are only a few short videos – a format that I prefer, as I find it easier to take notes while reading. It took me about 12 hours, including taking notes, to complete it.

You can find the skills measured on the Microsoft website and they are quite accurate. The exam is composed of four main parts- Core Data Concepts; Relational Data on Azure; Non-relational Data on Azure; and Analytics workload on Azure. 

At the end of the blog I am adding my notes to speed up your learning or you can use them as revision material.

You are ready for DP-900! 

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

DP-900 Summary Notes

Describe core data concepts (15–20%)

1. Data types

1.1 Structured Data:

Structured data is typically tabular data that is represented by rows and columns in a database. Databases that hold tables in this form are called relational databases. Each row in a table has the same set of columns.

1.2 Semi-structured Data:

Semi-structured data is information that doesn’t reside in a relational database but still has some structure to it. Examples include documents held in JavaScript Object Notation (JSON) format.  There are other types of semi-structured data as well. Examples include key-value stores and graph databases. A key-value store is similar to a relational table, except that each row can have any number of columns. You can use a graph database to store and query information about complex relationships. A graph contains nodes (information about objects), and edges (information about the relationships between objects).

1.3 Unstructured Data:

Not all data is structured or even semi-structured. For example, audio and video files, and binary data files might not have a specific structure. They’re referred to as unstructured data.

2. Storage service based on data type:

Depending on the type of data such as structured, semi-structured, or unstructured, data will be stored differently.

2.1 Structured data – Structured Data is typically stored in a relational database such as SQL Server or Azure SQL Database.

2.2 Unstructured data – If you want to store unstructured data such as video or audio files, you can use Azure Blob storage.

2.3 Semi-structured data –  If you want to store semi-structured data such as documents, you can use a service such as Azure Cosmos DB.

3. Provisioning – The act of setting up the database server is called provisioning.

4. Levels of access to your data in Azure. 

4.1 Read-only access means the users can read data but can’t modify any existing data or create new data.

4.2 Read/write access gives users the ability to view and modify existing data.

4.3 Owner privilege gives full access to the data including managing the security like adding new users and removing access to existing users.

*You can also define which users should be allowed to access the data in the first place.

5. Data processing solutions

5.1 Transactional system (OLTP) – A transactional system records transactions. A transaction could be financial, such as the movement of money between accounts in a banking system, or it might be part of a retail system, tracking payments for goods and services from customers. Think of a transaction as a small, discrete, unit of work.

5.2 Analytical system (OLAP) – An analytical system is designed to support business users who need to query data and gain a big picture view of the information held in a database. Analytical systems are concerned with capturing raw data, and using it to generate insights. An organization can use these insights to make business decisions. For example, detailed insights for a manufacturing company might indicate trends enabling them to determine which product lines to focus on, for profitability.

6. Tasks involved in the analytical system 

6.1 Data Ingestion: Data ingestion is the process of capturing the raw data. This data could be taken from control devices measuring environmental information such as temperature and pressure, point-of-sale devices recording the items purchased by a customer in a supermarket, financial data recording the movement of money between bank accounts, and weather data from weather stations. Some of this data might come from a separate OLTP system. To process and analyze this data, you must first store the data in a repository of some sort. The repository could be a file store, a document database, or even a relational database.

6.2 Data Transformation/Data Processing: The raw data might not be in a format that is suitable for querying. The data might contain anomalies that should be filtered out, or it may require transforming in some way. For example, dates or addresses might need to be converted into a standard format. After data is ingested into a data repository, you may want to do some cleaning operations and remove any questionable or invalid data, or perform some aggregations such as calculating profit, margin, and other Key Performance Indicators (KPIs). KPIs are how businesses are measured for growth and performance.

6.3 Data Querying: After data is ingested and transformed, you can query the data to analyze it. You may be looking for trends, or attempting to determine the cause of problems in your systems. Many database management systems provide tools to enable you to perform ad-hoc queries against your data and generate regular reports.

6.4 Data Visualization: Data represented in tables such as rows and columns, or as documents, aren’t always intuitive. Visualizing the data can often be useful as a tool for examining data. You can generate charts such as bar charts, line charts, plot results on geographical maps, pie charts, or illustrate how data changes over time. Microsoft offers visualization tools like Power BI to provide rich graphical representation of your data.

7. Normalization – The process of splitting data into a large number of narrow, well-defined tables (a narrow table is a table with few columns), with references from one table to another. However, querying the data often requires reassembling information from multiple tables by joining the data back together at run-time.

*Disadvantage: the information is split across tables, so reading it requires reassembling it at run-time using joins, and these queries can sometimes be expensive.

8. ACID principles:

8.1 Atomicity guarantees that each transaction is treated as a single unit, which either succeeds completely, or fails completely. If any of the statements constituting a transaction fails to complete, the entire transaction fails and the database is left unchanged. An atomic system must guarantee atomicity in each and every situation, including power failures, errors, and crashes. 

8.2 Consistency ensures that a transaction can only take the data in the database from one valid state to another. A consistent database should never lose or create data in a manner that can’t be accounted for. In the bank transfer example described earlier, if you add funds to an account, there must be a corresponding deduction of funds somewhere, or a record that describes where the funds have come from if they have been received externally. You can’t suddenly create (or lose) money.

8.3 Isolation ensures that concurrent execution of transactions leaves the database in the same state that would have been obtained if the transactions were executed sequentially. A concurrent process can’t see the data in an inconsistent state (for example, the funds have been deducted from one account, but not yet credited to another.)

8.4 Durability guarantees that once a transaction has been committed, it will remain committed even if there’s a system failure such as a power outage or crash.

*A transactional database must adhere to the ACID properties to ensure that the database remains consistent while processing transactions.
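As a rough illustration of atomicity, here is a Python sketch using the built-in sqlite3 module (the account names, amounts, and the CHECK constraint are my own invention): a transfer either applies both updates, or, when a constraint fails mid-transaction, leaves the database exactly as it was.

```python
import sqlite3

# In-memory database with two hypothetical accounts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(amount):
    """Move funds between accounts as a single, discrete unit of work."""
    try:
        with conn:  # opens a transaction; commits on success, rolls back on error
            conn.execute("UPDATE accounts SET balance = balance - ? WHERE name = 'alice'", (amount,))
            conn.execute("UPDATE accounts SET balance = balance + ? WHERE name = 'bob'", (amount,))
    except sqlite3.IntegrityError:
        pass  # the CHECK constraint fired; the whole transaction was undone

transfer(500)  # would overdraw alice, so neither update is applied
print(conn.execute("SELECT name, balance FROM accounts ORDER BY name").fetchall())
# [('alice', 100), ('bob', 50)]
```

Note that the failed transfer leaves both balances untouched – the deduction is never committed without the matching credit.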

9. Eventual consistency

Many systems implement relational consistency and isolation by applying locks to data when it is updated. The lock prevents another process from reading the data until the lock is released.

A distributed database is a database in which data is stored across different physical locations. It may be held in multiple computers located in the same physical location (for example, a datacenter), or may be dispersed over a network of interconnected computers.

If you require transactional consistency in this scenario, locks may be retained for a very long time, especially if there’s a network failure between databases at a critical point in time. To counter this problem, many distributed database management systems relax the strict isolation requirements of transactions and implement “eventual consistency.” In this form of consistency, as an application writes data, each change is recorded by one server and then propagated to the other servers in the distributed database system asynchronously. While this strategy helps to minimize latency, it can lead to temporary inconsistencies in the data. Eventual consistency is ideal where the application doesn’t require any ordering guarantees.
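A toy Python sketch of this idea (the class and method names are invented for illustration): a write is acknowledged by the primary immediately and only propagated to the replicas later, so a replica read can return stale data until propagation completes.

```python
# Toy model: one primary plus replicas; writes propagate asynchronously
# via a pending-change queue rather than inside the write itself.
class EventuallyConsistentStore:
    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # changes not yet applied to the replicas

    def write(self, key, value):
        self.primary[key] = value          # acknowledged immediately (low latency)
        self.pending.append((key, value))  # replication happens later

    def read_replica(self, i, key):
        return self.replicas[i].get(key)   # may return stale data

    def propagate(self):
        """Apply all pending changes; afterwards, the replicas have converged."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()

store = EventuallyConsistentStore()
store.write("stock", 41)
print(store.read_replica(0, "stock"))  # None: the replica hasn't caught up yet
store.propagate()
print(store.read_replica(0, "stock"))  # 41: eventually consistent
```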

10. Data processing

Data processing is simply the conversion of raw data to meaningful information through a process.

10.1 Stream processing – Processing data as it arrives is called streaming.

10.2 Batch processing – Buffering and processing the data in groups is called batch processing.

10.2.1 Advantages and disadvantages of batch processing:

Advantages:

* Large volumes of data can be processed at a convenient time.

* It can be scheduled to run at a time when computers or systems might otherwise be idle, such as overnight, or during off-peak hours.

Disadvantages:

* The time delay between ingesting the data and getting the results.

* All of a batch job’s input data must be ready before a batch can be processed. Even minor data errors, such as typographical errors in dates, can prevent a batch job from running.

10.3 Differences between streaming and batch processing

10.3.1 Data Scope: Batch processing can process all the data in the dataset. Stream processing typically only has access to the most recent data received, or to data within a rolling time window (the last 30 seconds, for example).

10.3.2 Data Size: Batch processing is suitable for handling large datasets efficiently. Stream processing is intended for individual records or micro-batches consisting of a few records.

10.3.3 Performance: The latency for batch processing is typically a few hours. Stream processing typically occurs immediately, with latency in the order of seconds or milliseconds. Latency is the time taken for the data to be received and processed.

10.3.4 Analysis: You typically use batch processing for performing complex analytics. Stream processing is used for simple response functions, aggregates, or calculations such as rolling averages.
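These differences can be sketched in a few lines of Python (the readings are made-up sensor values): the batch result is computed once after all the data is available, while the streaming side updates a rolling average as each record "arrives".

```python
from collections import deque

readings = [21.0, 21.5, 22.0, 25.5, 23.0]  # hypothetical temperature readings

# Batch: wait until all the data is available, then process it in one go.
batch_average = sum(readings) / len(readings)

# Streaming: process each reading as it arrives, keeping only a rolling
# window of the most recent values (here, the last 3).
window = deque(maxlen=3)
rolling_averages = []
for value in readings:
    window.append(value)
    rolling_averages.append(round(sum(window) / len(window), 2))

print(batch_average)         # 22.6 – one result, after all data arrived
print(rolling_averages[-1])  # 23.5 – latest rolling average, updated per record
```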

11. Roles in the world of data

11.1 Azure Database Administrator role – An Azure database administrator is responsible for the design, implementation, maintenance, and operational aspects of on-premises and cloud-based database solutions built on Azure data services and SQL Server. They are responsible for the overall availability and consistent performance and optimizations of the database solutions. They work with stakeholders to implement policies, tools, and processes for backup and recovery plans to recover following a natural disaster or human-made error.

11.2 Data Engineer role – A data engineer collaborates with stakeholders to design and implement data-related assets that include data ingestion pipelines, cleansing and transformation activities, and data stores for analytical workloads. They use a wide range of data platform technologies, including relational and nonrelational databases, file stores, and data streams.

11.3 Data Analyst role – A data analyst enables businesses to maximize the value of their data assets. They are responsible for designing and building scalable models, cleaning and transforming data, and enabling advanced analytics capabilities through reports and visualizations. A data analyst processes raw data into relevant insights based on identified business requirements.

12. Azure Data Studio – Azure Data Studio provides a graphical user interface for managing many different database systems. It currently provides connections to on-premises SQL Server databases, Azure SQL Database, PostgreSQL, Azure SQL Data Warehouse, and SQL Server Big Data Clusters, amongst others. It’s an extensible tool, and you can download and install extensions from third-party developers that connect to other systems, or provide wizards that help to automate many administrative tasks.

13. SQL Server Management Studio – SQL Server Management Studio provides a graphical interface, enabling you to query data, perform general database administration tasks, and generate scripts for automating database maintenance and support operations.

14. Characteristics of relational data

– All data is tabular. Entities are modeled as tables, each instance of an entity is a row in the table, and each property is defined as a column.

– All rows in the same table have the same set of columns.

– A table can contain any number of rows. 

– A primary key uniquely identifies each row in a table. No two rows can share the same primary key. 

– A foreign key references rows in another, related table. For each value in the foreign key column, there should be a row with the same value in the corresponding primary key column in the other table.

15. Primary key and foreign key

15.1 Primary Key –  The primary key indicates the column (or combination of columns) that uniquely identifies each row. Every table should have a primary key.

15.2 Foreign Key – The columns marked FK are Foreign Key columns. They reference, or link to, the primary key of another table, and are used to maintain the relationships between tables. A foreign key also helps to identify and prevent anomalies, such as orders for customers that don’t exist in the Customers table.
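A quick way to see a foreign key preventing such anomalies is with Python's built-in sqlite3 module (the table names follow the Customers/Orders example; the sample data is made up). Inserting an order for a customer that doesn't exist fails:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Orders (
        OrderID INTEGER PRIMARY KEY,
        CustomerID INTEGER REFERENCES Customers(CustomerID)
    );
    INSERT INTO Customers VALUES (1, 'Contoso');
""")

conn.execute("INSERT INTO Orders VALUES (100, 1)")  # fine: customer 1 exists
try:
    conn.execute("INSERT INTO Orders VALUES (101, 99)")  # no such customer
except sqlite3.IntegrityError as e:
    print(e)  # the orphan order is rejected
```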

16. Querying relational data: 

Most relational databases support Structured Query Language (SQL). You use SQL to create tables, insert, update, and delete rows in tables, and to query data.

16.1 Use cases of relational databases:

Examples of OLTP applications that use relational databases are banking solutions, online retail applications, flight reservation systems, and many online purchasing applications.

16.2 JOINS in SQL queries:

You can combine the data from multiple tables in a query using a join operation. A join operation spans the relationships between tables, enabling you to retrieve the data from more than one table at a time. The following query retrieves the name of every customer, together with the product name and quantity for every order they’ve placed. Notice that each column is qualified with the table it belongs to:

SELECT Customers.CustomerName, Orders.QuantityOrdered, Products.ProductName
FROM Customers
JOIN Orders ON Customers.CustomerID = Orders.CustomerID
JOIN Products ON Orders.ProductID = Products.ProductID
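If you want to try this query without a SQL Server instance to hand, a minimal sketch using Python's built-in sqlite3 module works too (the sample rows are made up, and an ORDER BY is added to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, CustomerName TEXT);
    CREATE TABLE Products  (ProductID  INTEGER PRIMARY KEY, ProductName  TEXT);
    CREATE TABLE Orders    (CustomerID INTEGER, ProductID INTEGER, QuantityOrdered INTEGER);
    INSERT INTO Customers VALUES (1, 'Ana'), (2, 'Ben');
    INSERT INTO Products  VALUES (10, 'Widget'), (11, 'Gadget');
    INSERT INTO Orders    VALUES (1, 10, 3), (2, 11, 1);
""")

# The join spans the relationships: each order row links a customer to a product.
rows = conn.execute("""
    SELECT Customers.CustomerName, Orders.QuantityOrdered, Products.ProductName
    FROM Customers
    JOIN Orders ON Customers.CustomerID = Orders.CustomerID
    JOIN Products ON Orders.ProductID = Products.ProductID
    ORDER BY Customers.CustomerName
""").fetchall()
print(rows)  # [('Ana', 3, 'Widget'), ('Ben', 1, 'Gadget')]
```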

17. Index – When you create an index in a database, you specify a column from the table, and the index contains a copy of this data in a sorted order, with pointers to the corresponding rows in the table. When the user runs a query that specifies this column in the WHERE clause, the database management system can use this index to fetch the data more quickly than if it had to scan through the entire table row by row.

*An index might consume additional storage space, and each time you insert, update, or delete data in a table, the indexes for that table must be maintained. This additional work can slow down insert, update, and delete operations, and incur additional processing charges.
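The idea of an index as a sorted copy of one column with pointers back to the rows can be sketched in plain Python (the toy table and the find helper are hypothetical): a binary search over the sorted index replaces a row-by-row scan of the table.

```python
import bisect

# A toy "table": each row is (id, city); suppose queries filter on city.
rows = [(1, "Oslo"), (2, "Lima"), (3, "Cairo"), (4, "Lima")]

# The "index": the indexed column's values in sorted order, each carrying a
# pointer (the row position) back to the corresponding row in the table.
index = sorted((city, pos) for pos, (_, city) in enumerate(rows))

def find(city):
    """Binary-search the index instead of scanning the whole table."""
    i = bisect.bisect_left(index, (city, -1))
    matches = []
    while i < len(index) and index[i][0] == city:
        matches.append(rows[index[i][1]])  # follow the pointer to the row
        i += 1
    return matches

print(find("Lima"))  # [(2, 'Lima'), (4, 'Lima')]
```

Keeping `index` sorted is exactly the maintenance cost mentioned above: every insert, update, or delete on the table must also update the index.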

18. View – A view is a virtual table based on the result set of a query. In the simplest case, you can think of a view as a window on specific rows in an underlying table.

19. IaaS 

19.1 IaaS is an acronym for Infrastructure-as-a-Service. Azure enables you to create a virtual infrastructure in the cloud that mirrors the way an on-premises data center might work.

19.2 You’re still responsible for many of the day-to-day operations, such as installing and configuring the software, patching, taking backups, and restoring data when needed

19.3 The IaaS approach is best for migrations and applications requiring operating system-level access. SQL virtual machines are lift-and-shift. That is, you can copy your on-premises solution directly to a virtual machine in the cloud. The system should work more or less exactly as before in its new location, except for some small configuration changes (changes in network addresses, for example) to take account of the change in environment.


20. PaaS

20.1 PaaS stands for Platform-as-a-Service. Rather than creating a virtual infrastructure, and installing and managing the database software yourself, a PaaS solution does this for you. You specify the resources that you require (based on how large you think your databases will be, the number of users, and the performance you require), and Azure automatically creates the necessary virtual machines, networks, and other devices for you.

*PaaS solutions enable you to scale up and out without having to procure your own hardware.

21. Non-relational data

21.1 A key aspect of non-relational databases is that they enable you to store data in a very flexible manner. Non-relational databases don’t impose a schema on data. Instead, they focus on the data itself rather than how to structure it. This approach means that you can store information in a natural format that mirrors the way in which you would consume, query and use it.

22. Use cases of non-relational databases

22.1 IoT and telematics: These systems typically ingest large amounts of data in frequent bursts of activity. Non-relational databases can store this information very quickly. The data can then be used by analytics services such as Azure Machine Learning, Azure HDInsight, and Microsoft Power BI. Additionally, you can process the data in real-time using Azure Functions that are triggered as data arrives in the database.

22.2 Retail and marketing: Microsoft uses Azure Cosmos DB for its own e-commerce platforms that run as part of Windows Store and Xbox Live. It’s also used in the retail industry for storing catalog data and for event sourcing in order processing pipelines.

22.3 Gaming: The database tier is a crucial component of gaming applications. Modern games perform graphical processing on mobile/console clients, but rely on the cloud to deliver customized and personalized content like in-game stats, social media integration, and high-score leaderboards. Games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience. A game database needs to be fast and able to handle massive spikes in request rates during new game launches and feature updates.

22.4 Web and mobile applications: A non-relational database such as Azure Cosmos DB is commonly used within web and mobile applications, and is well suited for modeling social interactions, integrating with third-party services, and building rich personalized experiences. The Cosmos DB SDKs (software development kits) can be used to build rich iOS and Android applications using the popular Xamarin framework.

23. Formats of semi-structured data

23.1 JSON – A JSON document is enclosed in curly brackets ({ and }). Each field has a name (a label), followed by a colon, and then the value of the field. Fields can contain simple values, or subdocuments (each starting and ending with curly brackets). Fields can also have multiple values, held as arrays and surrounded with square brackets ([ and ]). Literals in a field are enclosed in quotes, and fields are separated with commas.
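For example, a hypothetical customer document showing simple fields, a subdocument, and an array, parsed with Python's standard json module:

```python
import json

# A made-up customer document: simple fields, a subdocument, and an array.
doc = """
{
  "customerID": 103,
  "name": "Ana",
  "address": { "city": "Sofia", "country": "BG" },
  "orders": [ { "product": "Widget", "qty": 3 }, { "product": "Gadget", "qty": 1 } ]
}
"""

customer = json.loads(doc)          # parse the document into Python objects
print(customer["address"]["city"])  # Sofia – a field inside the subdocument
print(len(customer["orders"]))      # 2 – number of elements in the array
```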

23.2 Avro – Avro is a row-based format. It was created by Apache. Each record contains a header that describes the structure of the data in the record. This header is stored as JSON. The data is stored as binary information. An application uses the information in the header to parse the binary data and extract the fields it contains. Avro is a very good format for compressing data and minimizing storage and network bandwidth requirements.

23.3 ORC – ORC (Optimized Row Columnar format) organizes data into columns rather than rows. It was developed by Hortonworks for optimizing read and write operations in Apache Hive. Hive is a data warehouse system that supports fast data summarization and querying over very large datasets. Hive supports SQL-like queries over unstructured data. An ORC file contains stripes of data. Each stripe holds the data for a column or set of columns. A stripe contains an index into the rows in the stripe, the data for each row, and a footer that holds statistical information (count, sum, max, min, and so on) for each column.

23.4 Parquet – Parquet is another columnar data format. It was created by Cloudera and Twitter. A Parquet file contains row groups. Data for each column is stored together in the same row group. Each row group contains one or more chunks of data. A Parquet file includes metadata that describes the set of rows found in each chunk. An application can use this metadata to quickly locate the correct chunk for a given set of rows, and retrieve the data in the specified columns for these rows. Parquet specializes in storing and processing nested data types efficiently. It supports very efficient compression and encoding schemes.

24. NoSQL databases

NoSQL (non-relational) databases generally fall into four categories: key-value stores, document databases, column family databases, and graph databases.

24.1 Key-value store – A key-value store is the simplest (and often quickest) type of NoSQL database for inserting and querying data. Each data item in a key-value store has two elements, a key and a value. The key uniquely identifies the item, and the value holds the data for the item. The value is opaque to the database management system. Items are stored in key order.

24.2 Document database – A document database represents the opposite end of the NoSQL spectrum from a key-value store. In a document database, each document has a unique ID, but the fields in the documents are transparent to the database management system. Document databases typically store data in JSON format, although documents could also be encoded using other formats such as XML, YAML, or BSON.

24.3 Column family database –  A column family database organizes data into rows and columns. Examples of this structure include ORC and Parquet files

In its simplest form, a column family database can appear very similar to a relational database, at least conceptually. The real power of a column family database lies in its denormalized approach to structuring sparse data.

24.4 Graph database – Graph databases enable you to store entities, but the main focus is on the relationships that these entities have with each other. A graph database stores two types of information: nodes that you can think of as instances of entities, and edges, which specify the relationships between nodes. Nodes and edges can both have properties that provide information about that node or edge (like columns in a table). Additionally, edges can have a direction indicating the nature of the relationship.
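A minimal sketch of the node/edge model in Python (the entities, the relationship name, and the related helper are all invented for illustration): nodes and edges each carry properties, and edges are directed.

```python
# Nodes: instances of entities, with properties (like columns in a table).
nodes = {
    "ana":   {"type": "Employee"},
    "sales": {"type": "Department"},
}

# Edges: directed relationships between nodes, also with properties.
edges = [
    # (from, to, relationship, properties)
    ("ana", "sales", "WORKS_IN", {"since": 2019}),
]

def related(node, relationship):
    """Follow outgoing edges of a given kind from a node."""
    return [dst for src, dst, rel, _ in edges if src == node and rel == relationship]

print(related("ana", "WORKS_IN"))  # ['ana' works in 'sales']
```

A real graph database additionally optimizes traversals across many such hops, which is where this model pays off.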

25. Key-value store

25.1 Characteristics:

* A query specifies the keys to identify the items to be retrieved.

* You can’t search on values. An application that retrieves data from a key-value store is responsible for parsing the contents of the values returned.

* The value is opaque to the database management system.

* Write operations are restricted to inserts and deletes.

* If you need to update an item, you must retrieve the item, modify it in memory (in the application), and then write it back to the database, overwriting the original (effectively a delete and an insert).
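The read-modify-write cycle from the last point can be sketched with a Python dict standing in for the store (the key layout and the JSON encoding of the value are assumptions; a real key-value store never parses the value itself):

```python
import json

# A dict stands in for the key-value store: keys identify items, and the
# values are opaque blobs that the store itself never inspects.
store = {}

store["user:42"] = b'{"name": "Ana", "visits": 1}'  # insert

# There is no in-place update: the application reads the value out, parses
# and modifies it in memory, then writes the whole item back (effectively
# a delete plus an insert).
item = json.loads(store["user:42"])
item["visits"] += 1
store["user:42"] = json.dumps(item).encode()

print(json.loads(store["user:42"])["visits"])  # 2
```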

25.2 Use case for Key-value store:

The focus of a key-value store is the ability to read and write data very quickly. Search capabilities are secondary. A key-value store is an excellent choice for data ingestion, when a large volume of data arrives as a continual stream and must be stored immediately.

26. Data Wrangling – Wrangling is the process by which you transform and map raw data into a more useful format for analysis. It can involve writing code to capture, filter, clean, combine, and aggregate data from many sources.

27. Stages of data analysis

27.1 Data Ingestion

Data ingestion is the process of obtaining and importing data for immediate use or storage in a database. The data can arrive as a continuous stream, or it may come in batches, depending on the source. The purpose of the ingestion process is to capture this data and store it. This raw data can be held in a repository such as a database management system, a set of files, or some other type of fast, easily accessible storage.

The ingestion process might also perform filtering and transformation at this stage.

27.2 Data Processing

The data processing stage occurs after the data has been ingested and collected. Data processing takes the data in its raw form, cleans it, and converts it into a more meaningful format (tables, graphs, documents, and so on). The result is a set of data that you can use to perform queries and generate visualizations, giving it the form and context necessary for it to be interpreted by computers and used by employees throughout an organization.

28. ETL and ELT 

28.1 ETL stands for Extract, Transform, and Load. The raw data is retrieved and transformed before being saved. The extract, transform, and load steps can be performed as a continuous pipeline of operations.

It is suitable for systems that only require simple models, with little dependency between items.

28.2 ELT is an abbreviation of Extract, Load, and Transform. The process differs from ETL in that the data is stored before being transformed. The data processing engine can take an iterative approach, retrieving and processing the data from storage, before writing the transformed data and models back to storage.

ELT is more suitable for constructing complex models that depend on multiple items in the database, often using periodic batch processing.
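The difference in ordering between the two patterns can be sketched with toy functions, where a plain Python list stands in for storage:

```python
# Contrasting ETL and ELT with toy functions. "storage" is just a list
# standing in for a database; the transform uppercases each value.
def extract():
    return ["alpha", "beta"]

def transform(items):
    return [s.upper() for s in items]

storage = []

def load(items):
    storage.extend(items)

# ETL: transform happens in the pipeline, before data reaches storage.
storage.clear()
load(transform(extract()))
etl_result = list(storage)        # ['ALPHA', 'BETA']

# ELT: raw data is stored first; the transformation runs later,
# iteratively, against what is already in storage.
storage.clear()
load(extract())
storage[:] = transform(storage)   # reprocess in place
elt_result = list(storage)        # ['ALPHA', 'BETA']
```

Both arrive at the same transformed data; what differs is whether storage holds raw or transformed data in the meantime.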

29. Reporting – Reporting is the process of organizing data into informational summaries to monitor how different areas of an organization are performing. Reporting helps companies monitor their online business, and know when data falls outside of expected ranges. Good reporting should raise questions about the business from its end users. Reporting shows you what has happened, while analysis focuses on explaining why it happened and what you can do about it.

30. Business Intelligence – The term Business Intelligence (BI) refers to technologies, applications, and practices for the collection, integration, analysis, and presentation of business information. The purpose of business intelligence is to support better decision making.

31. Data Visualization – Data visualization is the graphical representation of information and data. By using visual elements like charts, graphs, and maps, data visualization tools provide an accessible way to spot and understand trends, outliers, and patterns in data.

31.1 The most common types of visualization: 

* Bar and column charts: Bar and column charts enable you to see how a set of variables changes across different categories.

* Line charts: Line charts emphasize the overall shape of an entire series of values, usually over time.

* Matrix: A matrix visual is a tabular structure that summarizes data. Often, report designers include matrixes in reports and dashboards to allow users to select one or more elements (rows, columns, cells) in the matrix to cross-highlight other visuals on a report page.

* Key influencers: A key influencer chart displays the major contributors to a selected result or value. Key influencers are a great choice to help you understand the factors that influence a key metric.

* Treemap: Treemaps are charts of colored rectangles, with size representing the relative value of each item. They can be hierarchical, with rectangles nested within the main rectangles.

* Scatter: A scatter chart shows the relationship between two numerical values. A bubble chart is a scatter chart that replaces data points with bubbles, with the bubble size representing an additional third data dimension.

* Filled map: If you have geographical data, you can use a filled map to display how a value differs in proportion across a geography or region.

32. Categories of data analytics:

32.1 Descriptive analytics  – Descriptive analytics helps answer questions about what has happened, based on historical data. Descriptive analytics techniques summarize large datasets to describe outcomes to stakeholders.

32.2 Diagnostic analytics – Diagnostic analytics helps answer questions about why things happened. Diagnostic analytics techniques supplement more basic descriptive analytics. They take the findings from descriptive analytics and dig deeper to find the cause.

32.3 Predictive analytics – Predictive analytics helps answer questions about what will happen in the future. Predictive analytics techniques use historical data to identify trends and determine if they’re likely to recur. Predictive analytical tools provide valuable insight into what may happen in the future.

32.4 Prescriptive analytics – Prescriptive analytics helps answer questions about what actions should be taken to achieve a goal or target. By using insights from predictive analytics, data-driven decisions can be made. This technique allows businesses to make informed decisions in the face of uncertainty.

32.5 Cognitive analytics – Cognitive analytics attempts to draw inferences from existing data and patterns, derive conclusions based on existing knowledge bases, and then add these findings back into the knowledge base for future inferences–a self-learning feedback loop. Cognitive analytics helps you to learn what might happen if circumstances change, and how you might handle these situations.

Describe how to work with relational data on Azure (25–30%)

1. Azure Data Services

1.1 Azure Data Services fall into the PaaS category. These services are a series of DBMSs managed by Microsoft in the cloud. Each data service takes care of the configuration, day-to-day management, software updates, and security of the databases that it hosts.

1.2 Available relational databases on Azure:

* Azure SQL Database

* Azure Database for MySQL servers

* Azure Database for MariaDB servers

* Azure Database for PostgreSQL servers

1.3 Using Azure Data Services reduces the amount of time that you need to invest to administer a DBMS

1.4 Azure Data Services ensure that your databases are available for at least 99.99% of the time

1.5 The base price of each service covers underlying infrastructure and licensing, together with the administration charges. Additionally, these services are designed to be always on. This means that you can’t shut down a database and restart it later.

1.6 Limitations: Not all features of a database management system are available in Azure Data Services. This is because Azure Data Services take on the task of managing the system and keeping it running using hardware situated in an Azure datacenter. Exposing some administrative functions might make the underlying platform vulnerable to misuse, and even open up some security concerns. You have no direct control over the platform on which the services run. 

* If you need more control than Azure Data Services allow, you can install your database management system on a virtual machine that runs in Azure. Migrating from a system running on-premises to an Azure virtual machine is no different from moving the databases from one on-premises server to another. When running SQL Server on a virtual machine, the customer takes care of maintaining the SQL Server software and performing the various administrative tasks needed to keep the database running from day to day.

1.6.1 When should you run Microsoft SQL Server on a virtual machine?

This approach is optimized for migrating existing applications to Azure, or extending existing on-premises applications to the cloud in hybrid deployments.

A hybrid deployment is a system where part of the operation runs on-premises, and part in the cloud. Your database might be part of a larger system that runs on-premises, although the database elements might be hosted in the cloud.

1.6.2 Capabilities of running SQL Server on a virtual machine:

* Create rapid development and test scenarios when you do not want to buy on-premises non-production SQL Server hardware.

* Become lift-and-shift ready for existing applications that require fast migration to the cloud with minimal changes or no changes.

* Scale up the platform on which SQL Server is running, by allocating more memory, CPU power, and disk space to the virtual machine. You can quickly resize an Azure virtual machine without the requirement that you reinstall the software that is running on it.

2. Azure SQL Database

2.1 Azure SQL Database is a PaaS offering from Microsoft. Azure SQL Database is available with several options: Single Database, Elastic Pool, and Managed Instance.

2.1.1 Single Database – This option enables you to quickly set up and run a single SQL Server database. You create and run a database server in the cloud, and you access your database through this server. Microsoft manages the server, so all you have to do is configure the database, create your tables, and populate them with your data. You can scale the database if you need additional storage space, memory, or processing power.

2.1.2 Elastic Pool – This option is similar to Single Database, except that by default multiple databases can share the same resources, such as memory, data storage space, and processing power. The resources are referred to as a pool. You create the pool, and only your databases can use the pool. This model is useful if you have databases with resource requirements that vary over time, and can help you to reduce costs.

2.1.3 Managed Instance – Managed instance effectively runs a fully controllable instance of SQL Server in the cloud. You can install multiple databases on the same instance. You have complete control over this instance, much as you would for an on-premises server.

2.2 Use cases of Azure SQL Database:

Modern cloud applications that need to use the latest stable SQL Server features.

Applications that require high availability.

Systems with a variable load, that need the database server to scale up and down quickly.

2.3 Features of Azure SQL Database:

* Azure SQL Database automatically updates and patches the SQL Server software to ensure that you are always running the latest and most secure version of the service.

* The scalability features of Azure SQL Database ensure that you can increase the resources available to store and process data without having to perform a costly manual upgrade.

* The service provides high availability guarantees, to ensure that your databases are available at least 99.99% of the time.

* Azure SQL Database supports point-in-time restore, enabling you to recover a database to the state it was in at any point in the past.

* Databases can be replicated to different regions to provide additional assurance and disaster recovery

* Advanced threat protection provides advanced security capabilities, such as vulnerability assessments, to help detect and remediate potential security problems with your databases.

* It continuously monitors your database for suspicious activities, and provides immediate security alerts on potential vulnerabilities, SQL injection attacks, and anomalous database access patterns.

* SQL Database helps secure your data by providing encryption. For data in motion, it uses transport layer security. For data at rest, it uses transparent data encryption.

2.4 Use cases of Azure SQL Database managed instance:

Consider Azure SQL Database managed instance if you want to lift-and-shift an on-premises SQL Server instance and all its databases to the cloud, without incurring the management overhead of running SQL Server on a virtual machine.

3. MySQL, MariaDB, PostgreSQL

3.1 MySQL – MySQL started life as a simple-to-use open-source database management system. It is available in several editions: Community, Standard, and Enterprise. The Community edition is available free of charge, and has historically been popular as a database management system for web applications running under Linux. Versions are also available for Windows.

3.2 MariaDB – MariaDB is a newer database management system, created by the original developers of MySQL. The database engine has since been rewritten and optimized to improve performance. MariaDB offers compatibility with Oracle Database (another popular commercial database management system). One notable feature of MariaDB is its built-in support for temporal data. A table can hold several versions of data, enabling an application to query the data as it appeared at some point in the past.

3.3 PostgreSQL – PostgreSQL is a hybrid relational-object database. You can store data in relational tables, but a PostgreSQL database also enables you to store custom data types, with their own non-relational properties. The database management system is extensible; you can add code modules to the database, which can be run by queries. Another key feature is the ability to store and manipulate geometric data, such as lines, circles, and polygons.

4. Deployment options for Azure Database for PostgreSQL

4.1 Azure Database for PostgreSQL single-server – The single-server deployment option for PostgreSQL provides similar benefits as Azure Database for MySQL. You choose from three pricing tiers: Basic, General Purpose, and Memory Optimized. Each tier supports different numbers of CPUs, memory, and storage sizes—you select one based on the load you expect to support.

4.2 Azure Database for PostgreSQL Hyperscale (Citus) – Hyperscale (Citus) is a deployment option that scales queries across multiple server nodes to support large database loads. Your database is split across nodes. Data is split into chunks based on the value of a partition key or sharding key. Consider using this deployment option for the largest database PostgreSQL deployments in the Azure Cloud.
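How a sharding key routes rows to nodes can be sketched as follows; the node count and hashing scheme here are illustrative, not what Citus actually uses:

```python
import hashlib

# A sketch of how a sharding (partition) key maps rows to server nodes,
# in the spirit of Hyperscale (Citus). NUM_NODES is an invented example value.
NUM_NODES = 4

def node_for(shard_key: str) -> int:
    # A stable hash, so the same key always lands on the same node.
    digest = hashlib.sha256(shard_key.encode()).digest()
    return digest[0] % NUM_NODES

# Rows that share a shard key are co-located on one node, so queries
# filtered on that key touch only one node.
print(node_for("customer-42") == node_for("customer-42"))  # True
```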

5. In Azure, scaling a SQL database up or out takes effect without restarting the database.


6. DML (Data Manipulation Language) – You use DML statements to manipulate the rows in a relational table. These statements enable you to retrieve (query) data, insert new rows, or edit existing rows. You can also delete rows if you don’t need them anymore.
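The four DML operations look like this against an in-memory SQLite database (chosen purely so the example is self-contained; any relational engine accepts the same statements):

```python
import sqlite3

# INSERT, UPDATE, SELECT, and DELETE demonstrated on a throwaway table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, price REAL)")

conn.execute("INSERT INTO products (name, price) VALUES ('boots', 89.99)")  # insert a row
conn.execute("UPDATE products SET price = 79.99 WHERE name = 'boots'")      # edit a row
rows = conn.execute("SELECT name, price FROM products").fetchall()          # query rows
conn.execute("DELETE FROM products WHERE name = 'boots'")                   # delete rows

print(rows)  # [('boots', 79.99)]
```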


Describe how to work with non-relational data on Azure (25–30%)

1. Azure Table Storage – Azure Table Storage is a scalable key-value store held in the cloud. You create a table using an Azure storage account. In an Azure Table Storage table, items are referred to as rows, and fields are known as columns.

* Azure Table Storage tables have no concept of relationships, stored procedures, secondary indexes, or foreign keys.

* Azure Table Storage provides much faster access to the data because the data is available in a single row, without requiring that you perform joins across relationships. To help ensure fast access, Azure Table Storage splits a table into partitions.

1.1 Partition in Azure Table Storage:

Partitioning is a mechanism for grouping related rows, based on a common property or partition key.

* Rows that share the same partition key will be stored together.

* Partitioning not only helps to organize data, it can also improve scalability and performance

* Partitions are independent from each other, and can grow or shrink as rows are added to, or removed from, a partition. A table can contain any number of partitions.

* When you search for data, you can include the partition key in the search criteria. This helps to narrow down the volume of data to be examined, and improves performance by reducing the amount of I/O (reads and writes) needed to locate the data.

1.2 Azure Table Storage keys

The key in an Azure Table Storage table comprises two elements:

1.2.1 Partition Key – The partition key that identifies the partition containing the row

1.2.2 Row Key – The row key that is unique to each row in the same partition.

Items in the same partition are stored in row key order. If an application adds a new row to a table, Azure ensures that the row is placed in the correct position in the table.

1.3 Point queries and range queries in Azure Table Storage

1.3.1 Point query – In a point query, an application retrieves a single row. The partition key enables Azure to quickly home in on the correct partition, and the row key lets Azure identify the row in that partition.

1.3.2 Range query – In a range query, the application searches for a set of rows in a partition, specifying the start and end point of the set as row keys. This type of query is also very quick, as long as you have designed your row keys according to the requirements of the queries performed by your application.
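The two-part key and both query types can be modeled locally; the table contents below are hypothetical:

```python
# An Azure Table Storage table modeled as partitions of row-key-sorted
# entities. The partition key is the product category, the row key the product.
table = {
    "footwear": {                                 # partition key
        "boot-01": {"name": "climbing boots"},    # row keys within the partition
        "shoe-07": {"name": "running shoes"},
        "shoe-12": {"name": "trail shoes"},
    },
}

def point_query(pk, rk):
    # Partition key narrows the search to one partition;
    # row key picks out exactly one row.
    return table[pk][rk]

def range_query(pk, start_rk, end_rk):
    # Scan one partition between a start and end row key, in row key order.
    return [rk for rk in sorted(table[pk]) if start_rk <= rk <= end_rk]

print(point_query("footwear", "boot-01"))             # {'name': 'climbing boots'}
print(range_query("footwear", "shoe-00", "shoe-99"))  # ['shoe-07', 'shoe-12']
```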

1.4 Advantages of using Azure Table Storage

* It’s simpler to scale. It takes the same time to insert data in an empty table, or a table with billions of entries. An Azure storage account can hold up to 500 TB of data.

* A table can hold semi-structured data.

* There’s no need to map and maintain the complex relationships typically required by a normalized relational database.

* Row insertion is fast.

* Data retrieval is fast, if you specify the partition and row keys as query criteria.

1.5 Disadvantages of using Azure Table Storage

* Consistency needs to be given consideration as transactional updates across multiple entities aren’t guaranteed

* There’s no referential integrity; any relationships between rows need to be maintained externally to the table

* It’s difficult to filter and sort on non-key data. Queries that search based on non-key fields could result in full table scans

1.6 Use cases of Azure Table Storage:

1.6.1 Storing TBs of structured data capable of serving web scale applications. Examples include product catalogs for eCommerce applications, and customer information, where the data can be quickly identified and ordered by a composite key. In the case of a product catalog, the partition key could be the product category (such as footwear), and the row key identifies the specific product in that category (such as climbing boots).

1.6.2 Storing datasets that don’t require complex joins, foreign keys, or stored procedures, and that can be denormalized for fast access. In an IoT system, you might use Azure Table Storage to capture device sensor data. Each device could have its own partition, and the data could be ordered by the date and time each measurement was captured.

1.6.3 Capturing event logging and performance monitoring data. Event log and performance information typically contain data that is structured according to the type of event or performance measure being recorded. The data could be partitioned by event or performance measurement type, and ordered by the date and time it was recorded. Alternatively, you could partition data by date, if you need to analyze an ordered series of events and performance measures chronologically. If you want to analyze data by type and date/time, then consider storing the data twice, partitioned by type, and again by date. Writing data is fast, and the data is static once it has been recorded.

** You need to create a storage account before you can create an Azure Table Storage table.

2. Azure Blob Storage – Azure Blob storage is a service that enables you to store massive amounts of unstructured data, or blobs, in the cloud.

** You need to create a storage account before you can create Azure Blob Storage.

2.2 Types of blob that Azure Blob Storage supports

2.2.1 Block blobs: A block blob is handled as a set of blocks. Each block can vary in size, up to 100 MB. A block blob can contain up to 50,000 blocks, giving a maximum size of over 4.7 TB. The block is the smallest amount of data that can be read or written as an individual unit. Block blobs are best used to store discrete, large, binary objects that change infrequently.

2.2.2 Page blobs: A page blob is organized as a collection of fixed size 512-byte pages. A page blob is optimized to support random read and write operations; you can fetch and store data for a single page if necessary. A page blob can hold up to 8 TB of data. Azure uses page blobs to implement virtual disk storage for virtual machines.

2.2.3 Append blobs: An append blob is a block blob optimized to support append operations. You can only add blocks to the end of an append blob; updating or deleting existing blocks isn’t supported. Each block can vary in size, up to 4 MB. The maximum size of an append blob is just over 195 GB.

** Use a page blob for blobs that require random read and write access. Use a block blob for discrete objects that change infrequently.
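As a quick check of the block blob limit quoted above (50,000 blocks of up to 100 MB each):

```python
# Verifying the "over 4.7 TB" maximum block blob size from the limits above.
blocks = 50_000
block_mib = 100                              # max block size in MiB
max_bytes = blocks * block_mib * 1024 * 1024
max_tib = max_bytes / 1024**4
print(round(max_tib, 2))  # 4.77
```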

2.3 Access tiers that Blob Storage provides

2.3.1 Hot –  The Hot tier is the default. You use this tier for blobs that are accessed frequently. The blob data is stored on high-performance media.

2.3.2 Cool – This tier has lower performance and incurs reduced storage charges compared to the Hot tier. Use the Cool tier for data that is accessed infrequently. It’s common for newly created blobs to be accessed frequently initially, but less so as time passes. In these situations, you can create the blob in the Hot tier, but migrate it to the Cool tier later. You can migrate a blob from the Cool tier back to the Hot tier.

2.3.3  Archive – The Archive tier. This tier provides the lowest storage cost, but with increased latency. The Archive tier is intended for historical data that mustn’t be lost, but is required only rarely. Blobs in the Archive tier are effectively stored in an offline state. Typical reading latency for the Hot and Cool tiers is a few milliseconds, but for the Archive tier, it can take hours for the data to become available. To retrieve a blob from the Archive tier, you must change the access tier to Hot or Cool. The blob will then be rehydrated. You can read the blob only when the rehydration process is complete.

2.4 Lifecycle management policies – You can create lifecycle management policies for blobs in a storage account. A lifecycle management policy can automatically move a blob from Hot to Cool, and then to the Archive tier, as it ages and is used less frequently (policy is based on the number of days since modification).
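A lifecycle rule can be thought of as a function from blob age to tier; the 30- and 180-day thresholds below are illustrative, not Azure defaults:

```python
# A sketch of a lifecycle management policy: the tier is chosen from the
# number of days since the blob was last modified. Thresholds are invented.
def tier_for(days_since_modified: int) -> str:
    if days_since_modified < 30:
        return "Hot"
    if days_since_modified < 180:
        return "Cool"
    return "Archive"

print([tier_for(d) for d in (5, 90, 400)])  # ['Hot', 'Cool', 'Archive']
```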

2.5 Use cases of Azure Blob Storage

2.5.1 Serving images or documents directly to a browser, in the form of a static website.

2.5.2 Storing files for distributed access

2.5.3 Streaming video and audio

2.5.4 Storing data for backup and restore, disaster recovery, and archiving

2.5.5 Storing data for analysis by an on-premises or Azure-hosted service


2.6 Additional features for protecting and tracking blob data:

2.6.1 Versioning – if you want to maintain and restore earlier versions of a blob

2.6.2  Soft Delete – if you want to recover a blob that has been removed or overwritten by accident

2.6.3  Snapshot – a read-only version of a blob at a particular point in time.

2.6.4  Change Feed option – provides an ordered, read-only, record of the updates made to a blob.

3. Azure File Storage – Azure File Storage enables you to create file shares in the cloud, and access these file shares from anywhere with an internet connection.

3.2 Features:

* Azure File Storage enables you to share up to 100 TB of data in a single storage account.

* The maximum size of a single file is 1 TiB

* Azure File Storage supports up to 2000 concurrent connections per shared file.

3.3 How do you upload files into Azure File Storage?

You can upload files to Azure File Storage using the Azure portal, or tools such as the AzCopy utility. You can also use the Azure File Sync service to synchronize locally cached copies of shared files with the data in Azure File Storage.

3.4 Tiers that Azure File Storage offers:

3.4.1 Standard – The Standard tier uses hard disk-based hardware in a datacenter

3.4.2 Premium –  The Premium tier uses solid-state disks. The Premium tier offers greater throughput, but is charged at a higher rate.

3.5 Use cases of Azure File Storage:

3.5.1 Migrate existing applications to the cloud.

3.5.2  Share server data across on-premises and cloud.

3.5.3 Integrate modern applications with Azure File Storage.

3.5.4 Simplify hosting High Availability (HA) workload data.


Don’t use Azure File Storage for files that can be written by multiple concurrent processes simultaneously. Multiple writers require careful synchronization, otherwise the changes made by one process can be overwritten by another. The alternative solution is to lock the file as it is written, and then release the lock when the write operation is complete. However, this approach can severely impact concurrency and limit performance.

4. Azure Cosmos DB –  Azure Cosmos DB is a multi-model NoSQL database management system. Cosmos DB manages data as a partitioned set of documents. A document is a collection of fields, identified by a key. The fields in each document can vary, and a field can contain child documents. Many document databases use JSON (JavaScript Object Notation) to represent the document structure. In this format, the fields in a document are enclosed between braces, { and }, and each field is prefixed with its name.

Cosmos DB is a highly scalable database management system. Cosmos DB automatically allocates space in a container for your partitions, and each partition can grow up to 10 GB in size. Indexes are created and maintained automatically.
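A document in the JSON shape described above, including a nested child document (the fields and values are invented for the example):

```python
import json

# Each field sits between braces and is prefixed with its name;
# the "customer" field contains a child document.
doc_text = """
{
  "id": "order-1001",
  "customer": { "name": "A. Smith", "city": "Leeds" },
  "items": ["boots", "socks"]
}
"""
doc = json.loads(doc_text)
print(doc["customer"]["city"])  # Leeds
```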

4.2 Different APIs that Cosmos DB supports:

4.2.1 SQL API. This interface provides a SQL-like query language over documents, enabling you to identify and retrieve documents using SELECT statements.

4.2.2 Table API. This interface enables you to use the Azure Table Storage API to store and retrieve documents. The purpose of this interface is to enable you to switch from Table Storage to Cosmos DB without requiring that you modify your existing applications.

4.2.3 MongoDB API. MongoDB is another well-known document database, with its own programmatic interface. Many organizations run MongoDB on-premises. You can use the MongoDB API for Cosmos DB to enable a MongoDB application to run unchanged against a Cosmos DB database.

4.2.4 Cassandra API. Cassandra is a column family database management system. This is another database management system that many organizations run on-premises. The Cassandra API for Cosmos DB provides a Cassandra-like programmatic interface for Cosmos DB. Cassandra API requests are mapped to Cosmos DB document requests.

4.2.5 Gremlin API. The Gremlin API implements a graph database interface to Cosmos DB. A graph is a collection of data objects and directed relationships. Data is still held as a set of documents in Cosmos DB, but the Gremlin API enables you to perform graph queries over data. Using the Gremlin API you can walk through the objects and relationships in the graph to discover all manner of complex relationships.

*The primary purpose of the Table, MongoDB, Cassandra, and Gremlin APIs is to support existing applications. If you are building a new application and database, you should use the SQL API.

4.3 Cosmos DB guarantees less than 10-ms latencies for both reads (indexed) and writes at the 99th percentile, all around the world

4.4 Some of the use cases where Cosmos DB is suitable:

4.4.1 IoT and telematics. These systems typically ingest large amounts of data in frequent bursts of activity. Cosmos DB can accept and store this information very quickly. The data can then be used by analytics services, such as Azure Machine Learning, Azure HDInsight, and Power BI. Additionally, you can process the data in real-time using Azure Functions that are triggered as data arrives in the database.

4.4.2 Retail and marketing. Microsoft uses Cosmos DB for its own e-commerce platforms that run as part of Windows Store and Xbox Live. It’s also used in the retail industry for storing catalog data and for event sourcing in order processing pipelines.

4.4.3 Gaming. The database tier is a crucial component of gaming applications. Modern games perform graphical processing on mobile/console clients, but rely on the cloud to deliver customized and personalized content like in-game stats, social media integration, and high-score leaderboards. Games often require single-millisecond latencies for reads and writes to provide an engaging in-game experience. A game database needs to be fast and be able to handle massive spikes in request rates during new game launches and feature updates.

4.4.4 Web and mobile applications. Azure Cosmos DB is commonly used within web and mobile applications, and is well suited for modeling social interactions, integrating with third-party services, and for building rich personalized experiences. The Cosmos DB SDKs can be used to build rich iOS and Android applications using the popular Xamarin framework.

5. Tools you can use to provision services:

* The Azure portal.

* The Azure command-line interface (CLI)

* Azure PowerShell

* Azure Resource Manager templates

6. Replication – Data in an Azure Storage account is always replicated three times in the region you specify as the primary location for the account. Azure Storage offers two options for how your data is replicated in the primary region:

6.1 Locally redundant storage (LRS) copies your data synchronously three times within a single physical location in the region. LRS is the least expensive replication option, but isn’t recommended for applications requiring high availability.

6.2 Geo-redundant storage (GRS) copies your data synchronously three times within a single physical location in the primary region using LRS. It then copies your data asynchronously to a single physical location in the secondary region. This form of replication protects you against regional outages.

6.3 Read-access geo-redundant storage (RA-GRS) replication is an extension of GRS that provides direct read-only access to the data in the secondary location. In contrast, the GRS option doesn’t expose the data in the secondary location, and it’s only used to recover from a failure in the primary location. RA-GRS replication enables you to store a read-only copy of the data close to users that are located in a geographically distant location, helping to reduce read latency times.

7. Options for protecting Azure resources (Storage account, Azure CosmosDB, etc) 

* Azure Private Endpoint

* Firewalls and virtual networks

* Configure authentication

* Configure access control

* Configure advanced security

8. Consistency 

8.1 Eventual Consistency – This option is the least consistent. Changes won’t be lost, and they’ll appear eventually, but they might not appear immediately. Additionally, if an application makes several changes, some of those changes might be immediately visible, but others might be delayed; changes could appear out of order. Eventual Consistency provides the lowest latency and the least consistency. 

8.2 Consistent Prefix Option – This option ensures that changes will appear in order, although there may be a delay before they become visible. In this period, applications may see old data.

8.3 Session option – If an application makes a number of changes, they’ll all be visible to that application, and in order. Other applications may see old data, although any changes will appear in order, as they did for the Consistent Prefix option. This form of consistency is sometimes known as read your own writes.

8.4 Bounded Staleness option – There’s a lag between writing and then reading the updated data. You specify this staleness either as a period of time, or as a number of versions by which reads are allowed to lag behind writes.

8.5 Strong Consistency option – all writes are only visible to clients after the changes are confirmed as written successfully to all replicas. This option is unavailable if you need to distribute your data across multiple global regions.

9. Shared Access Signature (SAS) – You can use shared access signatures (SAS) to grant limited rights to resources in an Azure storage account for a specified time period.

This feature enables applications to access resources such as blobs and files, without requiring that they’re authenticated first.
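Conceptually, a SAS is a description of the grant (resource, permissions, expiry) plus an HMAC signature computed with the account key. The sketch below uses simplified field names and string-to-sign, not the exact Azure token format:

```python
import base64
import hashlib
import hmac

# An invented demo key; real storage account keys are base64-encoded secrets.
account_key = base64.b64encode(b"demo-account-key").decode()

def make_sas(resource: str, permissions: str, expiry: str) -> str:
    # Sign the grant so the server can verify it without authenticating the caller.
    string_to_sign = "\n".join([resource, permissions, expiry])
    sig = hmac.new(base64.b64decode(account_key),
                   string_to_sign.encode(), hashlib.sha256).digest()
    return (f"sr={resource}&sp={permissions}&se={expiry}"
            f"&sig={base64.b64encode(sig).decode()}")

token = make_sas("container/blob.txt", "r", "2030-01-01T00:00:00Z")
# The service recomputes the signature on each request; altering any part
# of the grant (say, "r" to "rw") invalidates the token.
```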

10. Security Principal – An object that represents a user, group, service, or managed identity that is requesting access to Azure resources.

Describe an analytics workload on Azure (25–30%)

1. Data warehousing – A data warehouse gathers data from many different sources within an organization. This data is then used as the source for analysis, reporting, and online analytical processing (OLAP). The focus of a data warehouse is to provide answers to complex queries, unlike a traditional relational database, which is focused on transactional performance.

Data warehouses have to handle big data. Big data is the term used for large quantities of data collected in escalating volumes, at higher velocities, and in a greater variety of formats than ever before. It can be historical (meaning stored) or real time (meaning streamed from the source). Businesses typically depend on their big data to help make critical business decisions.

2. Azure Data Factory – Azure Data Factory is described as a data integration service. The purpose of Azure Data Factory is to retrieve data from one or more data sources, and convert it into a format that you can process.

3. Azure Data Lake Storage – A data lake is a repository for large quantities of raw data. Because the data is raw and unprocessed, it’s very fast to load and update, but the data hasn’t been put into a structure suitable for efficient analysis. You can think of a data lake as a staging point for your ingested data, before it’s massaged and converted into a format suitable for performing analytics.

*Azure Data Lake Storage is essentially an extension of Azure Blob Storage that is organized as a near-infinite file system. Data Lake Storage organizes your files into directories and subdirectories for improved file organization. 

4. Azure Databricks – Azure Databricks is an Apache Spark environment running on Azure to provide big data processing, streaming, and machine learning.

5. Azure Synapse Analytics – Azure Synapse Analytics is an analytics engine. It’s designed to process large amounts of data very quickly.

Using Synapse Analytics, you can ingest data from external sources, such as flat files, Azure Data Lake, or other database management systems, and then transform and aggregate this data into a format suitable for analytics processing. You can perform complex queries over this data and generate reports, graphs, and charts.

5.1 MPP – Azure Synapse Analytics leverages a massively parallel processing (MPP) architecture. It includes a control node and a pool of compute nodes:

5.1.1  Control Node – The Control node is the brain of the architecture. It’s the front end that interacts with all applications. The MPP engine runs on the Control node to optimize and coordinate parallel queries. When you submit a processing request, the Control node transforms it into smaller requests that run against distinct subsets of the data in parallel.

5.1.2 Compute Nodes – The Compute nodes provide the computational power. The data to be processed is distributed evenly across the nodes. Users and applications send processing requests to the control node. The control node sends the queries to compute nodes, which run the queries over the portion of the data that they each hold. When each node has finished its processing, the results are sent back to the control node where they’re combined into an overall result.
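The control/compute split described above can be sketched in a few lines of Python. This is an illustration of the general MPP pattern, not Synapse code: the "control node" partitions the data, the "compute nodes" each aggregate their own subset in parallel, and the control node combines the partial results.

```python
from concurrent.futures import ThreadPoolExecutor

def compute_node(partition):
    # each compute node aggregates only the rows it holds
    return sum(partition)

def control_node(data, node_count):
    # control node: split the work into evenly distributed partitions
    partitions = [data[i::node_count] for i in range(node_count)]
    # run the per-partition queries in parallel
    with ThreadPoolExecutor(max_workers=node_count) as pool:
        partials = list(pool.map(compute_node, partitions))
    # combine the partial results into the overall answer
    return sum(partials)

data = list(range(1, 101))
# the parallel answer matches what a single node would compute
assert control_node(data, node_count=4) == sum(data)
```

The key property is in the last assertion: distributing the query changes how the work is done, not the answer.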

5.2 Computational models – Azure Synapse Analytics supports two computational models – SQL pools and Spark pools. *In a Spark pool, the nodes are replaced with a Spark cluster.

5.3 PolyBase – Azure Synapse Analytics uses a technology named PolyBase. PolyBase enables you to retrieve data from relational and non-relational sources, such as delimited text files, Azure Blob Storage, and Azure Data Lake Storage. You can save the data read in as SQL tables within the Synapse Analytics service.
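PolyBase itself is configured with T-SQL (external data sources and external tables), but the idea, reading delimited text and exposing it as a queryable SQL table, can be shown with a small Python/sqlite3 analogy. The file contents and names here are made up for illustration.

```python
import csv
import io
import sqlite3

# Analogy for the PolyBase idea (the real feature uses T-SQL external
# tables): take delimited text and make it queryable as a SQL table.
delimited_text = io.StringIO("name|city\nAlice|Sofia\nBob|London\n")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, city TEXT)")

# parse the pipe-delimited file and load it row by row
reader = csv.DictReader(delimited_text, delimiter="|")
conn.executemany("INSERT INTO people VALUES (:name, :city)", list(reader))

# the text file can now be queried like any other SQL table
rows = conn.execute("SELECT name FROM people WHERE city = 'Sofia'").fetchall()
print(rows)  # → [('Alice',)]
```

In Synapse the equivalent step is saving the external data in as SQL tables inside the service, after which it can be joined and aggregated like native data.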

6. Azure Analysis Services – Azure Analysis Services enables you to build tabular models to support online analytical processing (OLAP) queries. You can combine data from multiple sources, including Azure SQL Database, Azure Synapse Analytics, Azure Data Lake Storage, Azure Cosmos DB, and many others.

You use these data sources to build models that incorporate your business knowledge. A model is essentially a set of queries and expressions that retrieve data from the various data sources and generate results. The results can be cached in-memory for later use, or they can be calculated dynamically, directly from the underlying data sources.

7. Azure Synapse Analytics vs Azure Analysis Services

7.1 Differences:

Azure Analysis Services has significant functional overlap with Azure Synapse Analytics, but it’s more suited for processing on a smaller scale. Use Azure Synapse Analytics for very high volumes of data (multi-terabyte to petabyte sized databases) and very complex queries and aggregations. 

7.2 Use Analysis Services and Synapse Analytics together:

Many scenarios can benefit from using Synapse Analytics and Analysis Services together. If you have large amounts of ingested data that require preprocessing, you can use Synapse Analytics to read this data and manipulate it into a model that contains business information rather than a large amount of raw data. The scalability of Synapse Analytics gives it the ability to process and reduce many terabytes of data down into a smaller, succinct dataset that summarizes and aggregates much of this data. You can then use Analysis Services to perform detailed interrogation of this information, and visualize the results of these inquiries with Power BI.

8. Azure HDInsight – Azure HDInsight is a big data processing service that provides the platform for technologies such as Spark in an Azure environment. HDInsight implements a clustered model that distributes processing across a set of computers. This model is similar to that used by Synapse Analytics, except that the nodes are running the Spark processing engine rather than Azure SQL Database.


9. Power BI – Power BI is a collection of software services, apps, and connectors that work together to turn your data into interactive insights.

9.1 Common flow of activity:

Bring data into Power BI Desktop and create a report; share it to the Power BI service; then view and interact with reports and dashboards in the service and in Power BI Mobile.

9.2 Building blocks of Power BI:

9.2.1 Visualizations — A visual representation of data, sometimes just called visuals

9.2.2 Datasets — A collection of data that Power BI uses to create visualizations

9.2.3 Reports — A collection of visuals from a dataset, spanning one or more pages

9.2.4 Dashboards — A single-page collection of visuals built from a report

9.2.5 Tiles — A single visualization on a report or dashboard

9.3 Elements of Power BI

9.3.1 Power BI Desktop

9.3.2 Power BI Service

9.3.3 Power BI Mobile

9.4 The common flow when using Power BI:

9.4.1 First: Bring data into Power BI Desktop, and create a report.

9.4.2 Second: Publish to the Power BI service, where you can create new visualizations or build dashboards.

9.4.3 Third: Share dashboards with others, especially people who are on the go.

9.4.4 Fourth: View and interact with shared dashboards and reports in Power BI Mobile apps.

#1 Let’s get started


Welcome to my first blog post! I am starting this blog with the idea of sharing my experience and knowledge as a DBA who is still at the beginning of their career. And what better way to start this journey than by introducing myself.

Originally from Bulgaria, I was born and raised in a small town in the centre of the country – Kazanlak (definitely one to add to your list of destinations to visit). For as long as I can remember I have been interested in mathematics and physics, and in 2015 that brought me to the UK to study mathematics. Three years later I graduated from the University of Southampton with a Bachelor of Science and started my career as an analyst on a graduate programme for a wealth management company in Southampton. Then in 2019 I had the opportunity to join the DBA team at the same company, and quite by accident I became a DBA. Back then I had no experience and limited knowledge, but I was lucky to join an incredible team of people who share their knowledge and have supported my development – people who inspired me to start blogging.

You may wonder what to expect from this blog. As 2021 is just around the corner, one of the things on my resolution list is to take the Microsoft DP-900 exam at the beginning of the year, and I am planning to share more about how I am preparing for it. I am also planning to share the best ways to get started with SQL, SSMS and PowerShell based on my experience, along with some of the tips and tricks I have found useful. Blog #2 coming soon.