One of the biggest challenges with data lakes in general, and Hadoop in particular, is speed. How do you get real-time analytics performance out of a technology like Hadoop, which was designed to trade off performance for scalability? While query engines like Hive and Presto and columnar file formats like Parquet and ORC have delivered improvements, none of them provides near real-time, sub-second performance at scale.
Technologies like Apache Druid are used today alongside Hadoop to deliver real-time queries on data from the data lake. Druid has also helped companies implement end-to-end real-time analytics using message buses like Kafka or Kinesis.
This whitepaper from Imply Data Inc. explains why delivering real-time analytics on a data lake is so hard, which approaches companies have taken to accelerate their data lakes, and how they have used the same technology to build end-to-end real-time analytics architectures.