The full video of my talk from Hadoop Summit (San Jose, June 28, 2016) is now available. In this talk, I cover performance considerations when moving analytic workloads into production. I even give away the game-changing secret sauce for extreme performance in Actian’s Vector in Hadoop product for SQL analytics. VID: Solving Performance Problems […]
You are browsing archives for
Category: Hadoop
Spark Analysis of Global Place Names (GeoNames)
Spark Analysis on a Large File: GeoNames.org has free gazetteer data by country or for the world, provided in tab-separated text files. In this post I show you how to do some simple analysis using DataFrames in Spark. The global file is 280 MB compressed and 1.2 GB uncompressed. This size of file makes it difficult to […]
Serverspec checks settings on a Hadoop cluster
Serverspec is a Ruby-based system that can run RSpec-formatted tests against a host. It can check for a long list of system details and status such as total memory, CPU count, running services, and more, including custom shell callouts. It compares the results of the checks with predefined thresholds and reports a pass or […]
Hadoop Options for SQL Databases
Drowning while trying to understand your options for SQL-based database management in Hadoop? This graphic is a simplified comparison of the various features of several popular products being used today. I outline some of my biggest differentiators in this post. While this is a marketing slide for Actian’s SQL in Hadoop enterprise solution, I wish I had seen it earlier so I could […]
“Big Data” off 2015 Hype Cycle?
See this official 2015 hype cycle video here to get it straight from Gartner. In the video she says, first, that big data has passed over the hump and is no longer just hype. Second, it’s embedded within other items throughout the cycle now. I can understand how this can get confusing to track and qualify, but isn’t […]
Zeppelin Notebook Tutorial Walkthrough
This is my short video (14 min) showing how to build and launch the Apache Zeppelin notebook platform – a web UI for interactive query and analysis. This is all done running locally via OS X on a MacBook. In this video we focus on using the tutorial notebook that comes with Zeppelin and discuss each step – including interactive querying and charting – […]
Partitioned Data & Why It Matters
As data volumes grow, so does your need to understand how to partition your data. Until you understand this distributed storage concept, you will be unable to choose the best approach for the job. This post gives an introductory explanation of partitioning and you will see why it is integral to the Hadoop Distributed File System (HDFS) increasingly […]
Web console for Kafka messaging system
Running Kafka for a streaming collection service can feel somewhat opaque at times, which is why I was thrilled to find the Kafka Web Console project on GitHub yesterday. This Scala application can be easily downloaded and installed with a couple of steps. An included web server can then be launched to serve it up quickly. Here’s […]
Drinking from the (data) Firehose of Terror
Between classic business transactions and social interactions and machine-generated observations, the digital data tap has been turned on and it will never be turned off. The flow of data is everlasting. Which is why you see a lot of things in the loop around real time frameworks and streaming frameworks. – Mike Hoskins, CTO Actian […]
Kafka Consumer – Simple Python Script and Tips
[UPDATE: Check out the Kafka Web Console that allows you to manage topics and see traffic going through your topics – all in a browser!] When you’re pushing data into a Kafka topic, it’s always helpful to monitor the traffic using a simple Kafka consumer script. Here’s a simple script I’ve been using that […]
SPARQL Query for Graph Density Analysis
I’ve been spending a lot of time this past year running queries against the open source SPARQLverse graph analytic engine. It’s amazing how simple some queries can look and yet how much work is being done behind the scenes. My current project requires building up a set of query examples that allow typical kinds of graph/network […]
Kafka Topic Clearing after Producing Messages
[UPDATE: Check out the Kafka Web Console to more easily administer your Kafka topics] This week I’ve been working with the Kafka messaging system in a project. Basic C# Methods for Kafka Producer: To publish to Kafka I built a C# app that uses the Kafka4n libraries – it doesn’t get much simpler than this: using Kafka.Client; Connector […]
From zero to HDFS in 60 min.
(Okay, so you can be up and running quicker if you have a better internet connection than me.) Want to get your hands dirty with Hadoop-related technologies but don’t have time to waste? I’ve spent way too much time trying to get HBase, for example, running on my MacBook with Brew and wish I had […]