Machine Learning (ML) for Beginners (Part 1)
Why Machine Learning?
Trending theme:
- Artificial Intelligence, "deep learning", "big data"...
Epistemological reason:
- We don't know how to model complex problems, but we have many examples representing the variety of situations.
- "Data-driven" vs. "Model-based".
Scientific reason:
- Learning is an essential faculty of life.
Economic reason:
- Collecting data is easier than developing expertise.
Technical areas using ML
ML as a design tool:
- Vision & Pattern Recognition
- Language processing
- Speech processing
- Robotics
- "Data Mining"
- Search in databases
- Recommendations
- Marketing...
ML as an explanatory tool:
- Neuroscience
- Psychology
- Cognitive Science
Data = ML fuel (BIG DATA)
The Large Hadron Collider | CERN: ~70 PB/year
NOTE: A petabyte is equal to 1,000 terabytes, or 1 million gigabytes. To give an idea of the scale, 70 petabytes per year corresponds to:
- Approximately 140 million high-definition (1080p) movies
- Approximately 280 billion photos
- Approximately 1.2 billion hours of compressed music
This amount of data is enormous and poses a real challenge for the storage and processing of information.
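As a sanity check, here is a minimal Python sketch of the arithmetic behind these equivalences; the per-item sizes (about 0.5 GB per 1080p movie, 250 KB per photo, and 60 MB per hour of compressed audio) are assumptions for illustration, not official CERN figures.

```python
# Back-of-the-envelope check of the equivalences above.
# The per-item sizes are assumptions, not CERN figures.
PB = 10**15                    # bytes in a petabyte (1 PB = 1,000 TB = 1e6 GB)
data = 70 * PB                 # one year of LHC data, in bytes

movie = 0.5 * 10**9            # ~0.5 GB per compressed 1080p movie (assumed)
photo = 250 * 10**3            # ~250 KB per compressed photo (assumed)
music_hour = 60 * 10**6        # ~60 MB per hour of ~128 kbps audio (assumed)

print(f"Movies:      {data / movie:,.0f}")       # ~140,000,000
print(f"Photos:      {data / photo:,.0f}")       # ~280,000,000,000
print(f"Music hours: {data / music_hour:,.0f}")  # ~1,200,000,000
```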
Google: ~24 petabytes/day
NOTE: 24 petabytes is 24 million gigabytes, or 24 billion megabytes.
At modern high-speed Internet rates, downloading this amount of data would take years.
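To make that concrete, a rough Python estimate of the download time; the 1 Gbit/s link speed is an assumed figure for a fast consumer connection.

```python
# How long would one day of Google's data (~24 PB) take to download
# over an assumed 1 Gbit/s connection?
PB = 10**15                          # bytes in a petabyte
bits = 24 * PB * 8                   # one day of data, in bits
link = 10**9                         # 1 Gbit/s (assumed link speed)

seconds = bits / link
years = seconds / (3600 * 24 * 365)
print(f"{years:.1f} years")          # ~6.1 years for a single day of data
```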
Google's ability to process such volumes of data is critical to its services, for example:
- Search engine: Google processes billions of search queries every day and relies on its vast data centers to deliver relevant results.
- YouTube: the world's largest video-sharing service handles billions of hours of video uploads and views per day.
- Gmail: one of the most popular email services, it processes millions of emails every day.
- Google Maps: a widely used mapping service that relies on a wealth of data to provide accurate directions and information.
Copernicus
NOTE: Copernicus's data processing capacity is estimated to exceed 1 petabyte per year, which means the programme processes and analyzes a very large volume of Earth-observation data each year.
Several factors contribute to the volume of data Copernicus handles:
- Satellite data: Copernicus operates a series of Earth-observation satellites that collect a wealth of information about the planet's weather, oceans, land, and ice.
- In-situ data: Copernicus also collects data from ground-based sensors, buoys, and other sources.
- Complex analysis: Copernicus uses sophisticated systems and models to analyze its data and extract valuable insights.
German Climate Computing Center (DKRZ)
NOTE: The storage capacity of the German Climate Computing Center (DKRZ) is approximately 500 petabytes. This large repository is necessary to accommodate the large datasets generated by climate models and simulations.
The DKRZ supercomputers process these datasets to improve climate models, predict future climate conditions, and develop climate adaptation strategies.
Square Kilometre Array
NOTE: The Square Kilometre Array (SKA) is expected to generate approximately 1,376 petabytes of data per year. SKA's large network of radio telescopes will produce this vast amount of data, making it the largest and most sensitive radio telescope array ever built.
SKA will be used to study a wide range of astronomical phenomena, including the origin of the universe, the nature of dark matter and dark energy, and the search for extraterrestrial life.