StreamSets Data Collector
It is a lightweight, powerful engine for ingesting data in real time. To define a data flow for Data Collector, you configure a pipeline. A pipeline consists of stages that represent the origin and the destination of the pipeline, plus any additional processing that needs to be performed.
This is our primary tool for data engineering, and we have written custom code for it.
Here are some examples:
- Visualize Apache Logs in Minecraft using SDC and Kafka
- Dockerizing SDC tutorials
- Custom Origin to pull data from Google Analytics
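To make the pipeline idea concrete, here is a minimal Python sketch of the origin, processor and destination stages described above. It is not the StreamSets API: the names `origin`, `parse_status` and `run_pipeline`, and the sample log lines, are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Record = Dict[str, object]


def origin(lines: Iterable[str]) -> Iterator[Record]:
    """Origin stage: turn raw input lines into records."""
    for line in lines:
        yield {"raw": line}


def parse_status(record: Record) -> Record:
    """Processor stage: extract the trailing status code (assumed format)."""
    parts = str(record["raw"]).split()
    return {**record, "status": int(parts[-1])}


def run_pipeline(
    lines: Iterable[str],
    processors: List[Callable[[Record], Record]],
    destination: List[Record],
) -> None:
    """Pass every record from the origin through each processor in order,
    then write it to the destination (here, a plain list)."""
    for record in origin(lines):
        for process in processors:
            record = process(record)
        destination.append(record)


# Hypothetical sample input: two simplified access-log lines.
sink: List[Record] = []
run_pipeline(["GET /index.html 200", "GET /missing 404"], [parse_status], sink)
```

After the run, `sink` holds one record per input line, each carrying the original text plus the parsed `status` field. A real Data Collector pipeline is configured rather than hand-coded, but the flow of records between stages follows the same shape.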
Django is a high-level Python Web framework that encourages rapid development and clean, pragmatic design. Built by experienced developers, it takes care of much of the hassle of Web development, so you can focus on writing your app without needing to reinvent the wheel. It’s free and open source.
Elasticsearch is a search engine based on Lucene. It provides a distributed, multitenant-capable full-text search engine with an HTTP web interface and schema-free JSON documents. Elasticsearch is developed in Java and is released as open source under the terms of the Apache License.
Drupal is a scalable, open platform for web content management and digital experiences. Drupal provides deep capabilities and endless flexibility on the web. We have experience with both Drupal 7 and Drupal 8.
Hadoop is a software framework, released under a free license, that supports distributed applications. It allows applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by Google's papers on MapReduce and the Google File System (GFS).
Kafka is a distributed publish-subscribe messaging system that is designed to be fast, scalable, and durable.
Like many publish-subscribe messaging systems, Kafka maintains feeds of messages in topics. Producers write data to topics and consumers read from topics. Since Kafka is a distributed system, topics are partitioned and replicated across multiple nodes.
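The relationship between topics, partitions, producers and consumers can be sketched with a small in-memory model. This is a toy illustration, not a Kafka client: the `Topic` class and its methods are hypothetical, the CRC32-based partition choice is only a stand-in for Kafka's own partitioner, and replication across nodes is left out.

```python
import zlib
from typing import List, Tuple

Message = Tuple[str, str]  # (key, value)


class Topic:
    """Toy model of a partitioned topic; each partition is an append-only log."""

    def __init__(self, name: str, num_partitions: int) -> None:
        self.name = name
        self.partitions: List[List[Message]] = [[] for _ in range(num_partitions)]

    def partition_for(self, key: str) -> int:
        # Stand-in partitioner: a stable hash of the key, modulo partition count.
        return zlib.crc32(key.encode("utf-8")) % len(self.partitions)

    def produce(self, key: str, value: str) -> Tuple[int, int]:
        """Append a message and return its (partition, offset)."""
        partition = self.partition_for(key)
        self.partitions[partition].append((key, value))
        return partition, len(self.partitions[partition]) - 1

    def consume(self, partition: int, offset: int) -> List[Message]:
        """Read every message in a partition starting at the given offset."""
        return self.partitions[partition][offset:]


# Hypothetical usage: two messages with the same key land in the same partition.
topic = Topic("apache-logs", 3)
p1, o1 = topic.produce("host-a", "GET /")
p2, o2 = topic.produce("host-a", "GET /about")
```

Because both messages share a key, they are appended to the same partition in order, and a consumer reading that partition from offset 0 sees them in the order they were produced.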
Kibana is a tool for visually exploring information stored in Elasticsearch. The information is arranged in dashboards and/or individual visualizations. Visualizations are built from Elasticsearch queries.
We believe that using Elasticsearch and Kibana to present data is fast and reliable.
Feel free to take a look at an integration example of Elasticsearch, StreamSets Data Collector and Kibana.
Superset is a data exploration platform designed to be visual, intuitive and interactive. Superset's main goal is to make it easy to slice, dice and visualize data. It provides:
- A quick way to intuitively visualize datasets by allowing users to create and share interactive dashboards
- A rich set of visualizations to analyze your data, as well as a flexible way to extend the capabilities
- An extensible, high granularity security model allowing intricate rules on who can access which features, and integration with major authentication providers (database, OpenID, LDAP, OAuth & REMOTE_USER through Flask AppBuilder)
- A simple semantic layer, allowing users to control how data sources are displayed in the UI, by defining which fields should show up in which dropdown and which aggregations and functions (metrics) are made available to the user
- Deep integration with Druid, which allows Superset to stay blazing fast while slicing and dicing large, realtime datasets
- Fast loading dashboards with configurable caching