Big Data Denmark
Big data and data science community

About


Big Data Denmark is a community of data scientists, software developers and analysts that meets regularly to discuss big data and data science concepts, ideas, tools, methods, models and technologies used for analyzing and processing large-scale data, extracting meaning and gaining insight into data.

 

The goal is to bring together people from various industries in Denmark who are interested in dealing with large amounts of data and offer interesting talks, hands-on advice and a forum for exchange and networking.

The community is not solely technical: it also strives to discuss, discover and communicate what kinds of problems and business areas Big Data and Data Science can support, solve and improve.

 

Our members like to discuss topics such as: big data analytics, Hadoop, Pig, data mining, machine learning, Mahout, predictive analytics, Spark, neural networks, heuristics, Storm, statistical computing, R language, Python, mass text mining, data science opportunities & challenges

Meetups


DATA SCIENCE IN BIG DATA

When:
Thursday, September 11, 2014
From 17:00 To 20:00
Where:
Vesterbro Torv 1-3, 3rd floor, 8000 Aarhus C

Making sense of data is the first step towards extracting value from it, and is often more important than being able to scale the processing to large amounts. Thus, it is vital to have the right skills and tools for understanding your data and doing exploratory analysis. This talk will look at the hidden potential of big data analysis and how data science relates to big data. We will walk through the processes of data analysis and show practical examples of using data analysis in Denmark.

Speaker:
Rasmus Nygaard Andersen and Vladimir Smida, Comiit ApS
Register here: Registration Link

REAL-TIME PROCESSING & ANALYSIS

When:
Thursday, October 9, 2014
From 17:00 To 20:00
Where:
Vesterbro Torv 1-3, 3rd floor, 8000 Aarhus C

Some decisions cannot wait days, hours or even minutes; instead, the analysis needs to happen as soon as data arrives. Many business areas benefit from being real-time instead of delayed, including financial trading, health-care monitoring, fraud detection, monitoring of electrical grids, windmills, industrial machinery and much more. This talk will look at processing and analyzing large volumes of data in near real time: what kinds of problems, complexity and solutions you run into when doing things in real time, and finally the technologies that exist for doing big data in real time.

Speaker:
Rasmus Nygaard Andersen and Vladimir Smida, Comiit ApS
Register here: Registration Link

Hadoop: What is it... why does it matter?

When:
Tuesday, January 27, 2015
From 17:00 To 20:00
Where:
Vesterbro Torv 1-3, 3rd floor, 8000 Aarhus C

Big web companies such as Google, Amazon and Yahoo pioneered Hadoop and similar technologies over a decade ago, building their value offerings on inexpensive, resilient distributed data processing. While the data processing capabilities of these companies are impressive, most businesses operate on a much smaller volume of data. What benefits can companies with these more modest data sets harvest from Hadoop? This talk will take a deep dive into the Hadoop platform and illustrate several real-life examples of how this technology can benefit businesses in general.

Speaker:
Vladimir Smida, Rasmus Nygaard Andersen
Register here: Registration Link

R in Azure HDInsight and Azure Machine Learning

When:
Tuesday, March 3, 2015
From 5:00 PM To -
Where:
Møllegade 1/Moellegade 1, Aarhus

Meetup in cooperation with Rhus.club, the new R user group in Aarhus

We R happy to announce the first event in the Rhus.club! We will have a visit by Sebastian Brandes, Tech Evangelist @ Microsoft!

Description: Executing R jobs in the cloud is the new “hot drug”, and Microsoft is your pusher! In the last few years Microsoft has worked intensively on developing new services in the cloud like HDInsight and Azure Machine Learning, and R is fully supported. In this session, Sebastian Brandes will describe how to take advantage of Azure and R and show us how to do it in practice. Expect a demo-intensive presentation with many tips and tricks to get started with Azure!

Speaker:
Sebastian Brandes
Register here: Registration Link

Recommendation Engines in the cloud with HDInsight/Hadoop in Azure

When:
Monday, April 20, 2015
From 17:00 To 20:00
Where:
Vesterbro Torv 1-3, 3rd floor, 8000 Aarhus C

Have you ever thought about how Spotify generates suggestions for new songs you should listen to? Or how YouTube can generate automatic playlists? Or how Stack Overflow comes up with suggestions for new forum threads that might be of interest to you? The answer is Recommendation Engines, and in this session we will talk about how you can build your own engine using HDInsight/Hadoop in Azure.

The session is for everyone, regardless of prior experience with the subject, or lack thereof. The main focus will be on how Recommendation Engines (and Machine Learning in the cloud in general) work; there will then be a short explanation of the mathematics behind them, and lastly we will cover how to get started training your own engine.


SPEAKER:

Sebastian Brandes is a Tech Evangelist in Microsoft's Developer Experience department. He works with Apps, Azure and developer tools such as Visual Studio, and has worked with the treatment and processing of large data volumes in the cloud since 2011. He works closely with Microsoft's internal Center of Excellence for Data Insights and has given several internal and external presentations on data processing and statistics using a wide range of Microsoft products.

Speaker:
Sebastian Brandes
Register here: Registration Link

Participating in the Strata conference in London 2015

When:
Tuesday, May 5, 2015
From 09:00 To 18:00
Where:
London

Keeping up with modern trends in Big Data, Comiit is sending representatives to the Strata Conference in London. The conference is one of the largest of its kind in the world, and is a three-day safari in the field of Big Data. Some of the world’s leading speakers on Big Data will be attending the conference, so if you want to learn what the movers and shakers are up to, then this is the conference for you.

 

You can read more about the conference here: http://strataconf.com/big-data-conference-uk-2015

 

To be clear: this conference is pay-to-participate, and we are simply participants. However, we wanted to see how many of our colleagues will be joining us this year. Therefore, we started this meetup so that we could find out beforehand who will be joining us in London.

If any of our colleagues want to join us later in the evening to discuss the presentations of the day over a cold beer, then that is simply an added bonus.

 

The extensive list of over 100 speakers for the conference can be seen here:

http://strataconf.com/big-data-conference-uk-2015/public/schedule/speakers

Speaker:
Over 100 speakers
Register here: Registration Link

Time series data crunching with kdb+

When:
Tuesday, June 9, 2015
From 17:00 To 20:00
Where:
Vesterbro Torv 1-3, 3rd floor, 8000 Aarhus C

kdb+ is a high-performance, high-volume database designed from the outset in anticipation of vast increases in data volumes. The database incorporates its own powerful query language, q, so that analytics can be run directly on the data, supporting real-time analysis of billions of records and fast access to terabytes of historical data. The main focus will be exploring the key features of kdb+, how it compares to other technologies and some practical use cases. kdb+ is widely used in the finance sector for market data servers, back-ends for trading applications and investment funds, and many other applications.

 

Krishan Subherwal is based in the First Derivatives headquarters in Newry. Krishan is a kdb+ developer with 2 years' experience, working as part of the Quantitative and Derivative Strategies Group within a major US investment bank. The team is tasked with developing an equity derivatives analytics portal, to be used globally by clients to provide visualization and analysis of equity volatility data and derivative trading strategies.

Speaker:
Krishan Subherwal
Register here: Registration Link

To Hadoop or Not to Hadoop

When:
Tuesday, June 16, 2015
From 17:00 To 20:00
Where:
Visma, Nørgaardsvej 32, 2800 Kongens Lyngby
“Hadoop” has become synonymous with big data processing, in the same way that the word “Colt” was used interchangeably to describe any handgun for a couple of decades. Hadoop is often described as the one framework your business needs to solve nearly all your big data problems. Hadoop was, however, purpose-built to solve a clear set of problems and definitely not all of them.
 
In this first Big Data Denmark presentation in Copenhagen, Vladimir Smida will walk you through the core concepts of Hadoop, its components and its strengths, as well as look at alternatives, each with key strengths in specific areas.
 
 
Vladimir Smida is a founder of and an active speaker in the Big Data Denmark community, but foremost a software developer slash data scientist consultant at Comiit ApS. With the mindset that technology is here to solve complex problems for people, not the other way around, he is pushing models and bytes to clients in the energy sector, the finance sector and any place with an exciting problem at hand.
Speaker:
Vladimir Smida
Register here: Registration Link

A tour of Hive

When:
Monday, August 31, 2015
From 17:00 To 20:00
Where:
Vesterbro Torv 1-3, 3rd floor, 8000 Aarhus C

Apache Hive has become a great way for people to start using Hadoop, as it features a SQL-like query language. But Hive is not a regular database, and Hive’s query language is not SQL. In this talk, I will give you a tour of Hive, and try to focus on the areas where Hive is different, and where you might get a few surprises when you start using Hive in practice. The talk will include some live demos, and some of the topics that will be touched upon include Tez vs. MR, the ORC format, window functions and transforms.

 

Speaker:

Martin Qvist is an HPC Specialist at Vestas Wind Systems. He studied mathematics at Aalborg University, and worked on traffic modeling in computer networks as well as some approximation theoretic topics in subdivision surfaces. Since then he has worked as a software developer, and subsequently started his own consultancy company, which he ran until he started at Vestas.

 

As always, the presentation is free and refreshments will be provided for all who sign up.

 

There are a maximum of 40 seats available for this presentation.

Speaker:
Martin Qvist
Register here: Registration Link

Data Science with Apache Spark

When:
Tuesday, October 13, 2015
From 17:00 To 20:00
Where:
Visma, Nørgaardsvej 32, Lyngby

Are you interested in applying machine learning to current research and industry problems?

Then join us at our second Big Data Denmark meetup in Copenhagen, where, together with Data Scientist Peter Sergio Larsen, we will take a deep dive into Machine Learning with Apache Spark.

 

We will present and discuss use cases where Peter and his team applied Data Discovery techniques and training of Machine Learning models using Spark and Python libraries. 

 

About the speaker
Peter Sergio Larsen is the Chief Data Scientist at Visma Consulting. His experience ranges from software development to the implementation of advanced machine learning models to business understanding and the preparation of business cases. 

Speaker:
Peter Sergio Larsen
Register here: Registration Link

Spark After Dark 1.5: Real-time Analytics with Spark, Kafka, Cassandra

When:
Wednesday, November 25, 2015
From 16:30 To 19:30
Where:
Lille UP1, Department of Computer Science, University of Copenhagen, Universitetsparken 1, Copenhagen

In this highly technical talk you can expect us to explore the following:

 

  1. Building a Scalable and Performant Spark SQL/DataFrames Data Source Connector such as Spark-CSV, Spark-Cassandra, Spark-ElasticSearch, and Spark-Redshift
  2. Speeding Up Spark SQL Queries using Partition Pruning and Predicate Pushdowns with CSV, JSON, Parquet, Avro, and ORC
  3. Tuning Spark Streaming Performance and Fault Tolerance with Kafka
  4. Maintaining Stability during High Scale Streaming Ingestion using Approximations and Probabilistic Data Structures from Spark, Redis, and Twitter's Algebird
  5. Building Effective Machine Learning Models using Feature Engineering, Dimension Reduction, and Natural Language Processing with MLlib/GraphX, ML Pipelines, DIMSUM, Locality Sensitive Hashing, and Stanford's CoreNLP
  6. Tuning Core Spark Performance by Acknowledging Mechanical Sympathy for the Physical Limitations of OS and Hardware Resources such as CPU, Memory, Network, and Disk with Project Tungsten, Asynchronous Netty, and Linux epoll

 

Demos

This talk features many interesting and audience-interactive demos - as well as code-level deep dives into many of the projects listed above.  

All demo code is available on GitHub at the following link: https://github.com/fluxcapacitor/pipeline/wiki

In addition, the entire demo environment has been Dockerized and made available for download on Docker Hub at the following link: https://hub.docker.com/r/fluxcapacitor/pipeline/

 

Speaker

Chris Fregly is a Principal Data Solutions Engineer for the newly-formed IBM Spark Technology Center, an Apache Spark Contributor, a Netflix Open Source Committer, as well as the Organizer of the global Advanced Apache Spark Meetup and Author of the Upcoming Book, Advanced Spark.  

Speaker:
Chris Fregly
Register here: Registration Link

"Extreme" Apache Spark: How in 3 months we created a pipeline for processing 2.5 billion records a day

When:
Tuesday, March 15, 2016
From 17:00 To 21:00
Where:
ITU, IT University of Copenhagen, Auditorium 1
"Apache Spark is simply awesome," says our next speaker, Josef Habdank. In this talk he will give you a crash course in how to design an extremely scalable data processing pipeline on Apache Spark using tech such as Spark Streaming, Scala, Kinesis, Snappy, Avro, Parquet, S3 and Zeppelin. It will be a story of 3 crazy developers who in 3 months managed to develop and put into production a Spark data pipeline which can crunch through 2.5 billion airfares a day without breaking a sweat. It was an amazing journey in which they had to do everything themselves: take care of hardware and deploy the platform, research technologies, hack out all the code in Spark/Scala, test scalability, build the monitoring tools and deliver the complete business intelligence product to the customer. Josef says: "Yes it is possible, and it is possible in 3 months. If you come to the talk I will share with you the DOs and DON'Ts of such a process, and I will explain which technologies turned out to be right and what was a mistake." You will learn how to use correct message compression and serialization (Avro + Snappy), best practices for in-stream error handling, how to build a successful 50TB+ data warehouse (Parquet with metadata splitting) and more, with code samples provided.
Speaker:
Josef Habdank is a Lead Data Scientist and Data Platform Architect at Infare Solutions, with previous experience from Big Data and Data Science practitioners such as Thomson Reuters and Adform, as well as the Department of Defence. He is an expert in Apache Spark and Spark-enabled technologies such as Kafka, Kinesis, Cassandra, Tachyon and others.
Register here: Registration Link

High performance data flow with a GUI, and guts

When:
Wednesday, April 20, 2016
From 18:00 To 20:00
Apache NiFi has seen it all. It worked for the NSA after all. What it brings to the Hadoop ecosystem is a series of data flow and ingest patterns, a GUI, and a lot of security and record-level data provenance.
This is a look under the covers of Apache NiFi and its innovations around content and provenance repositories. The focus is on how NiFi achieves what it does in terms of throughput and performance, and a deep dive into the internal data structures and code that allow you to make trade-offs between latency and throughput, or resilience and speed, in real time.

We will also pull apart some of the key processors that make up NiFi data flows, and examine the clues they offer for writing high-performance data flows on top of the NiFi framework.
Speaker:
Simon Ball is a Principal Solutions Engineer at Hortonworks, where he helps clients do Hadoop. He is a certified Spark and Hadoop developer. Previously he has worked in the data-intensive worlds of hedge funds and financial trading, ERP and e-commerce, as well as designing and running nationwide networks and websites.
Register here: Registration Link

Forecasting with open source in multitenant cloud

When:
Wednesday, June 15, 2016
From 18:00 To 21:00
Where:
Microsoft, Kanalvej 7, Kongens Lyngby

Today, when storage has become cheap in multi-tenant cloud environments, it is possible to load and store ever-increasing amounts of data. This has driven the generation and collection of IoT data, such as sensor measurements, for analysis and forecasting.

Collecting petabytes of measurements from billions of sensors and running forecasts on them in near real time, i.e. within seconds of the actual data generation, poses particular architectural considerations for the system design: how to organize data in the storage layer and how to implement the machine learning algorithm that does the forecasting, while at the same time supporting high-rate data ingest and highly concurrent data access.

In this talk Helen will present a case study of a utility company which collects measurements from water and heat controllers installed in a wide base of households. The company collects the measurements to plan for peak usage time, to detect leakage and to predict maintenance costs. In this case study the open source Hadoop ecosystem is chosen, running in the commercial multi-tenant Azure cloud. The case study covers data organization on the Hadoop NoSQL database HBase to support near real-time forecasting and to serve access to a wide base of public consumers.

 

Speaker:
Helen Priisalu works as a Big Data and Analytics architect in the Azure cloud team at Microsoft. She designs and builds Big Data platforms for IoT with data volumes up to petabytes, which for economics and efficiency require multi-tenant cloud solutions. Helen has worked in the field of Data Warehouse (DW) and Big Data since 1995, at the largest companies in Scandinavia, and in recent years she has been designing Big Data/Hadoop platform adoption at enterprise scale.
Register here: Registration Link

Contact


If you wish to receive updates about upcoming events you can sign up here:

Big Data Denmark was initiated by Comiit ApS. If you would like to hear about a specific topic, present one, or just know more about the group, feel free to contact Vladimir Smida.