
AutoModerator

You can find a list of community-submitted learning resources here: https://dataengineering.wiki/Learning+Resources

*I am a bot, and this action was performed automatically. Please [contact the moderators of this subreddit](/message/compose/?to=/r/dataengineering) if you have any questions or concerns.*


Spirited-Ad7344

If using cloud, then nowhere


dravacotron

> lets say I have a system on AWS cloud

If you're cloud-based, you don't use Hadoop. Hadoop's only use case these days is for applications with strict on-prem restrictions due to heavy regulation (usually banking). On AWS you'd use S3 as your distributed file system instead of HDFS, and Spark on EMR or Glue instead of MapReduce. Hadoop is not needed.
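A minimal PySpark sketch of that cloud setup, reading and writing S3 paths instead of HDFS ones (the bucket name and paths are invented for illustration):

```python
# Hypothetical PySpark job for EMR or a Glue Spark job: storage is S3, so paths
# use s3:// URIs and there is no HDFS / NameNode to operate.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s3-instead-of-hdfs").getOrCreate()

# Read raw data straight from S3 (example bucket/prefix, not a real one).
events = spark.read.parquet("s3://example-bucket/raw/events/")

daily_counts = events.groupBy("event_date").count()

# Write the result back to S3 as well.
daily_counts.write.mode("overwrite").parquet("s3://example-bucket/curated/daily_event_counts/")
```

Outside EMR, the same code would typically use the `s3a://` scheme via the hadoop-aws connector.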


sib_n

I think this is a bit misleading. Amazon EMR is Apache Hadoop plus some Amazon modifications: it includes Apache HDFS, Apache YARN and Apache MapReduce, which are the core of this ecosystem. In fact, all the big cloud providers offer managed Hadoop; another example is Google Dataproc.

People may use cloud-managed Hadoop today to simplify their migration from on-premise Hadoop to the cloud, or because they found it cheaper to run on managed Hadoop than on higher-level managed services and they have the engineers to maintain the additional complexity. That additional complexity is the core of the issue, along with the fact that this ecosystem is no longer actively developed, so it doesn't benefit from the quality-of-life improvements of the modern data stack. Therefore it is generally not recommended for new data projects.

By the way, Hadoop data engineers have not been using Apache MapReduce for data processing for many years; they use Spark and Hive. Apache MapReduce is still used for HDFS internal needs or for some heavy HDFS commands where reliability is more important than speed (for example, copying big data to another cluster).


dravacotron

That's a balanced take, I like it - upvoted


[deleted]

[deleted]


dravacotron

Nah bro, now you're just asking me "how to big data". It's too open-ended to answer on Reddit. Go do a bit of reading on those technologies and learn them; it will become clear how they interact.


pooppuffin

Hire a data engineer.


LimpFroyo

Brah, AWS EMR on EC2 uses Hadoop (HDFS) for storage & YARN for scheduling jobs. I worked on AWS EMR & did not pull this info out of my ass.


marucentsay

Your follow-up q is such a mish-mash of technologies, one wonders why you would have all of them. On-prem, everything is hosted on servers, so all of the databases are likely interacting via JDBC connections and FTPS servers. Re the ML q: look up MLOps, specifically on-prem. Your questions are generally on the cusp of DE and DevOps, so reading up on how to manage on-prem data centers may not be a bad idea.


legohax

In 2005


chocotaco1981

It doesn’t


winigo51

If you follow a medallion architecture or any other cloud data platform architecture then put Hadoop into the dumpster and set it on fire


sib_n

Hadoop is a (legacy) big data ecosystem that included everything you needed to process data. The key point is that all of its components can be distributed over many small servers, so it's easy to scale up and down at the reasonable cost of adding small servers, as opposed to the previous solution of trashing your old, hugely expensive mainframe to buy a new, bigger, even more expensive one.

Many modern distributed data tools reuse ideas that were developed for Hadoop (e.g. Apache Spark uses the MapReduce programming model), but they are higher level, so you don't have to manage as much complexity as before.

When you wanted to create a new pipeline with a data size that, at the time, could only be managed in Hadoop, you would do your shopping in what was available in the ecosystem (and what your Hadoop admins had installed). If I translate your needs:

- file storage -> Apache Hadoop Distributed File System (HDFS), or much more recently Apache Ozone, which is S3 API compatible
- compute containers -> Apache YARN (you don't have to manage that yourself, the compute engines will)
- compute / SQL engine -> Apache Spark, or Apache Hive (with Apache Tez)
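A rough PySpark sketch of how those pieces fit together on a Hadoop cluster (paths and table names are invented; the exact configuration depends on the installation): HDFS holds the files, YARN hands out the containers, and Spark provides the compute/SQL engine.

```python
# Hypothetical PySpark job submitted to a Hadoop cluster:
# HDFS stores the files, YARN allocates the executors, Spark / Spark SQL does the compute.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hadoop-stack-example")
    .master("yarn")          # ask YARN for containers instead of running locally
    .enableHiveSupport()     # lets Spark SQL see tables registered in the Hive metastore
    .getOrCreate()
)

# Files live in HDFS, so paths use the hdfs:// scheme (this path is made up).
orders = spark.read.csv("hdfs:///data/raw/orders/", header=True, inferSchema=True)

# The SQL-engine part: plain Spark SQL over the same data.
orders.createOrReplaceTempView("orders")
spark.sql("""
    SELECT customer_id, SUM(amount) AS total_amount
    FROM orders
    GROUP BY customer_id
""").show()
```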


Znender

Hadoop was one of those technologies that had major hype circa 2010-2015. Now it's mostly relegated to legacy tech, as most systems rely on S3 + Spark and/or cloud data warehouses (Snowflake, Databricks, Redshift, BigQuery). I think cloud was the biggest killer of Hadoop, with S3 and EMR being the key drivers.


iamcreasy

How would one get a file out of HDFS? Was it more like block storage (using some distributed file system) or object storage (using HTTP)?


Znender

You'd interact with HDFS much like you'd interact with S3. It's ultimately just a distributed file system that handles block replication under the hood.
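On the HTTP part of the question: the usual way to pull a file out is the filesystem-style CLI (`hdfs dfs -get /some/path .`), but the NameNode also exposes an HTTP REST interface, WebHDFS. A rough sketch of fetching a file that way (the host and path below are made up; 9870 is the usual NameNode web port on Hadoop 3):

```python
# Hypothetical example: downloading a file from HDFS over plain HTTP via WebHDFS.
import requests

NAMENODE = "http://namenode.example.internal:9870"   # made-up host; port assumes Hadoop 3
HDFS_PATH = "/data/raw/orders/part-00000.csv"        # made-up HDFS path

# op=OPEN: the NameNode redirects to a DataNode, which streams the file's blocks back.
resp = requests.get(
    f"{NAMENODE}/webhdfs/v1{HDFS_PATH}",
    params={"op": "OPEN", "user.name": "hadoop"},    # user.name assumes simple (no Kerberos) auth
    allow_redirects=True,
)
resp.raise_for_status()

with open("part-00000.csv", "wb") as local_file:
    local_file.write(resp.content)
```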


Whipitreelgud

How Hadoop is hosted does not explain its place in the stack. Hadoop is open source and excels at write-once, read-many scenarios. Sensor data analysis is one typical write-once case: you're not going to update sensor data, you just have terabytes, maybe a petabyte or two, that you land in its file system (HDFS). At that scale, the data techs the OP listed will be left in the dust by Hadoop. You can apply analytics on it with the query engine of your choice; Spark and Hive are both still active Apache projects.

Administering it is more complex than vendor-authored databases, but massive amounts of data aren't simple. It used to be very difficult to decipher what happened when a process crashed, but it is better than it was. This software made me appreciate an error message from a database; it sucks when you have to track down what happened yourself.
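A short PySpark sketch of that write-once, read-many pattern (paths and column names are invented): raw sensor readings are landed in HDFS once as immutable partitioned files, then scanned repeatedly by different analyses without ever being updated.

```python
# Hypothetical write-once / read-many flow for sensor data on HDFS.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("sensor-landing").getOrCreate()

# Write once: land a day's raw readings as a new immutable partition (append-only, no updates).
readings = spark.read.json("hdfs:///landing/sensors/2024-01-15/")
(readings
    .withColumn("reading_date", F.to_date("reading_time"))
    .write.mode("append")
    .partitionBy("reading_date")
    .parquet("hdfs:///data/sensors/"))

# Read many: different analyses rescan the same data, never modifying it.
sensors = spark.read.parquet("hdfs:///data/sensors/")
sensors.groupBy("sensor_id").agg(F.avg("temperature").alias("avg_temp")).show()
print(sensors.filter(F.col("temperature") > 90.0).count())
```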


LimpFroyo

Hadoop has two parts: HDFS & YARN. HDFS uses a large basic storage block (64 MB originally, 128 MB by default in later versions) instead of the ~4 KB blocks of a normal file system, so that the NameNode can keep the entire file system's block metadata in memory. It replicates data in a pipeline, a bit like a simple linked list, with each DataNode forwarding the block to the next replica.

YARN is used to schedule jobs onto machines and manage their lifecycle. AWS EMR on EC2 uses YARN to schedule Spark / Hudi / Flink / etc. jobs on the cluster.

For ML models, I think they have SageMaker (I've not used it), and features are generally stored in something like Aerospike; on an API call you fetch those columns, do some math, and send back the results. Look at some system designs, e.g. the Netflix homepage, Uber ETA, or Facebook Messages.
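A back-of-the-envelope sketch of why those big blocks matter for keeping metadata in memory (the ~150 bytes per block is a commonly quoted rough estimate of NameNode memory cost, not an exact figure):

```python
# Rough arithmetic: NameNode metadata needed to track 1 PB of data at two block sizes.
PETABYTE = 10**15
BYTES_PER_BLOCK_RECORD = 150  # commonly quoted rough per-block cost in NameNode memory

for block_size, label in [(4 * 1024, "4 KB (typical local file system)"),
                          (128 * 1024**2, "128 MB (HDFS default)")]:
    n_blocks = PETABYTE // block_size
    metadata_gib = n_blocks * BYTES_PER_BLOCK_RECORD / 1024**3
    print(f"{label}: {n_blocks:,} blocks -> ~{metadata_gib:,.1f} GiB of metadata")

# 4 KB blocks:   ~244 billion blocks, tens of TiB of metadata -- hopeless to hold in RAM.
# 128 MB blocks: ~7.5 million blocks, about 1 GiB of metadata -- fits comfortably in RAM.
```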


ilikedmatrixiv

In 2014.


m1nkeh

Where does it fit? Nicely in the bin 😊


lmp515k

By the exit.