Create your first ML model at scale

Machine learning applied to personalization, recommendations, and predictive analytics is becoming increasingly important as companies build ever more diverse, user-focused digital products and services. Rather than wrestling with the complications of disparate datasets, data engineers can use the Apache Spark machine learning library (MLlib) to concentrate on their specific data challenges and algorithms.

Linear regression is a technique for modeling the relationship between a dependent variable and one or more independent variables with a linear function. It is one of the most fundamental and widely used forms of predictive modeling.
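To make the idea concrete, here is a minimal sketch of one-variable linear regression fit by ordinary least squares in plain Python (the numbers are made up for illustration; Spark MLlib applies the same technique at scale):

```python
# Ordinary least squares for y = slope * x + intercept, one feature.
# A tiny illustration of linear regression, independent of Spark.

def fit_linear(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return slope, intercept

slope, intercept = fit_linear([1, 2, 3, 4], [3, 5, 7, 9])  # data follows y = 2x + 1
print(slope, intercept)  # → 2.0 1.0
```

MLlib's `LinearRegression` estimator fits the same kind of model, but over a distributed DataFrame rather than two in-memory lists.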

What is Spark MLlib?

An extensive guide to setting up PySpark

Stick around if you’re looking for a complete guide to setting up a PySpark environment for data science applications, an overview of PySpark functionality, and the best platforms to explore.

What is PySpark?

PySpark is a robust tool worth learning if you’re interested in building more scalable pipelines and analyses. According to Chris Min, a data engineer, PySpark essentially lets you write Spark applications in Python and process data efficiently in a distributed fashion. Python is not just a great language but an all-in-one ecosystem for performing exploratory data analysis, creating ETLs for data platforms, and building ML pipelines. …

How do Non-Fungible Tokens Work?

NFTs are digital records powered by a blockchain, the same infrastructure that underpins common cryptocurrencies. However, an NFT is a unique kind of crypto asset, and the blockchain database on which it is stored authenticates who its legitimate holder is.

NFTs are typically part of the Ethereum blockchain, which also underpins a cryptocurrency like any other. Before we get into what an NFT is, let’s look at the underlying technology behind NFTs: Ethereum.

What is the Ethereum blockchain?

Ethereum is the second-largest blockchain after Bitcoin by trading volume. However, it wasn’t designed to serve only as an electronic…

An Introduction To AWS SageMaker

Many data scientists develop, train, and deploy ML models within a hosted environment. Unfortunately, such environments often lack the ability to scale resources up or down on demand as their models require.

This is where AWS SageMaker comes into the picture! It solves the issue by enabling developers to build and train models and get to production faster, with minimal effort and at an economical cost.

But first… what is AWS, you ask?

Will it be Amazon? Or will Google take the cake? Let’s find out

For enterprises in the big data domain, it is imperative to have data warehouses that are agile, scalable, and at the same time cost-effective. Given how modern businesses increasingly look to big data to improve everything from customer support to production pace, analytical data warehouses have become critical to most business needs.

While the world of data analytics is still blooming, the big fish have already established their hold on the market with their own data warehouses. Industry giants Amazon and Google, companies at the core of the big data boom, offer their…

Exploring the uber-cool tool that helps build data apps

A user-friendly tool that helps deploy any machine learning model and any Python project with ease by turning data scripts into shareable apps in minutes? Yep, it is true. And it’s here!

Video on Getting Started with Streamlit

Getting started with Streamlit | Build your first Data/ML application by Anuj Syal

Decoding Streamlit

Created by Adrien Treuille, Thiago Teixeira, and Amanda Kelly, Streamlit is a free, open-source Python library that lets you effortlessly build beautiful, custom web apps for machine learning and data science without worrying about the front end. Thoughtfully designed with data scientists and ML engineers in mind, this tool allows them to…

To The Cloud and Beyond! I got you, fam!

Choosing the right GCP database depends on many factors, including your workload and the architecture involved. Today, I’m going to give you an overview of popular Google Cloud database services, including key considerations when assessing and choosing one.

Know Thy Database

Google Cloud Platform (GCP) was built to provide an array of computing resources, database services being one of them. Capable of handling modern data with efficiency, flexibility, and strong performance, GCP is a hosted platform for data distributed across geographies.

When choosing a Google database service, one should consider a lot of things…

Exploring the ‘data elite’ company and what solutions they have to offer

Breaking (Data)brick By Brick!

Founded in 2013 by the real OGs… the creators of Apache Spark, Delta Lake, and MLflow, Databricks is a single platform for all your data needs. It is a software (Data + AI) company that offers a Unified Data Analytics Platform (UDAP) built on a modern lakehouse architecture in the cloud.

At present, Databricks is one of the fastest-growing data services on AWS and Azure with its headquarters in San Francisco and offices around the world serving over 5000 customers and over 450 partners worldwide. …

Bye-bye Pandas, hello Dask!

For data scientists, big data is an ever-growing pool of information, and robust systems for ingesting and processing it are always a work in progress. To deal with the large inflow of data, we can either buy faster servers, which adds to costs, or work smarter with parallel-computing libraries like Dask.

Before I go over Dask as a solution for parallel computing, let us first understand what this type of computing means in the big data world. By the very definition, parallel computing is a type of computation where many calculations or…

In my previous blog, I introduced Ansible as a tool for IT automation that eliminates repetitive tasks so teams can focus on more strategic work. As promised, in this part, I will elaborate on the deployment of Ansible. However, before we dig into how Ansible is the go-to multi-utility automation tool, let us rewind to what it is all about and why it is so important in automation.

Ansible allows you to write configuration files in YAML in a certain format, and they work cohesively to start a server, build a network, deploy an application, add configuration files, and restart…
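A minimal sketch of such a playbook (the `web` host group and the choice of nginx are hypothetical, just to show the YAML shape):

```yaml
# playbook.yml: install and start nginx on hosts in the "web" group
- name: Configure web servers
  hosts: web
  become: true          # escalate privileges for package/service work
  tasks:
    - name: Install nginx
      ansible.builtin.apt:
        name: nginx
        state: present

    - name: Ensure nginx is running and enabled at boot
      ansible.builtin.service:
        name: nginx
        state: started
        enabled: true
```

You would run it with `ansible-playbook -i inventory playbook.yml`; Ansible connects over SSH and applies each task idempotently, so re-running the playbook changes nothing already in the desired state.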

Anuj Syal

Data Engineering | Full Stack Engineering |
