This Month in Data: May 2015
Do data science teams lack courage? Are pop music lyrics getting dumber? Is Facebook limiting our access to opposing political viewpoints? These are just some of the fascinating questions posed by data science practitioners during the month of May. Read on for our roundup of the top data news of the month, both from within Pivotal and in the wider industry, as well as details on upcoming Pivotal events in June.
As elsewhere, messy or problematic data sets are taking a significant toll on data scientists’ productivity within the advertising industry. Mike Driscoll at Advertising Age quotes an unnamed CIO as stating that his agency spends “50% of our time acquiring and cleaning data.” He argues that practitioners’ sophisticated skillsets are largely going to waste. To address this issue, Driscoll makes the case for a dedicated team of data engineers who perform common data munging tasks so that the data scientists can go to work quickly and maximize their skills and time.
Data munging isn’t the only challenge facing data scientists, according to John Weathington at Tech Republic. Risk-averse corporate attitudes can reduce the effectiveness of data scientists’ efforts, since the practice depends so heavily on experimentation and challenging conventional wisdom. If a company’s data team plays it safe, the result will be subpar products and limited insight, Weathington states. He calls for a courageous approach to data science in which practitioners are encouraged to make bold moves rather than avoid risk.
The deep web isn’t only a place for illegal and unsavory activities, according to NASA’s Elizabeth Landau at Scientific Computing. Taking a cue from the Memex deep web searching and data mining software developed by DARPA, NASA is investigating how these sophisticated tools could be leveraged to better catalog the vast stores of scientific data that the agency’s probes are collecting on a daily basis.
It’s hard to make the case that the majority of pop music lyrics are high art, but according to lyric analysis by Andrew Powell-Morse, the “intelligence” of pop music lyrics has declined in the past decade. By performing text analysis upon top Billboard hits from the last ten years using the Readability Score, Powell-Morse finds that the majority of hit pop songs score at a third grade reading level.
Though much is made of the wide-ranging and sophisticated skillsets of data scientists, the field is not only the domain of Ivy League elites, according to DJ Patil, the nation’s Chief Data Scientist. Making the case for the value of community college, Patil details his own education path, which began at De Anza Jr. College in Cupertino. While Patil now holds a Ph.D. in applied mathematics, he got his start at community college, which should stand as encouragement to would-be data scientists from any background.
The threat of the filter bubble (people only getting news and opinions from a self-selected group of sources that agree with their preexisting biases and perspectives) has been discussed in journalism circles for nearly as long as the debates over the death of print media. As social media sites like Facebook become the primary portals through which people get their news, debates over the phenomenon have only grown more pitched. A pair of academic studies suggest that such fears may be overstated, however. While political discourse has grown more polarized, according to studies undertaken by Carnegie Mellon University, Stanford and Microsoft Research, Northeastern University, and Facebook’s own research team, social media appears to play only a limited role in reducing exposure to opposing viewpoints.
This Month in Pivotal Data Science
In this post, we share recent research on the business outcomes of big data efforts. First, key statistics put out by Accenture, GE, and IBM are highlighted, along with supporting information on the cost factors for technologies like Apache Hadoop compared to legacy systems. The post then summarizes and provides reference information for 20 examples of where big data is providing results for companies, all of which discussed these results in the past year.
In last week’s episode of the Pivotal podcast, we cover a couple of important updates in the world of Big Data and Fast Data. First, we discuss the new Apache Software Foundation Incubator project called Geode, which releases the core of Pivotal GemFire as true open-source software. Second, Simon explores the Pivotal Query Optimizer, the new cost-based optimizer that is part of the new Pivotal Greenplum Database release.
Training on big data, in-memory, and data science tools has never been more popular. With over 25 years of development and training in his background, Pivotal’s Mark Secrist, the first U.S.-based trainer for the Spring Framework at VMware, joins us for a question-and-answer session about Pivotal training (including the free training). He explains why he is passionate about training, where Spring overlaps with Pivotal Big Data Suite, and the differences between open and closed source product training. He also explains that we are hiring in his department.
A technical white paper recently published by GE Power & Water and GE Global Research highlights the benefits of running Pivotal products for storing large amounts of data, while simultaneously taking in new data and deploying real-time analysis. The paper focuses on a series of performance, stability and load tests that GE did on Pivotal GemFire. The results of the tests were impressive, with the final round running for 5 days, ingesting 100K points per second and storing three days’ worth of replicated high-velocity data in memory across a large cluster of 46 machines.
Less than 90 days after helping establish and joining the Open Data Platform initiative, Pivotal has announced the availability of Pivotal HD 3.0, now aligned with the standardized Open Data Platform (ODP) core consisting of Apache Hadoop® and Apache Ambari. This post details the progress made, and the new features added to Pivotal HD as a result of harmonizing with the ODP.
Pivotal Query Optimizer (PQO) is the industry’s first cost-based query optimizer for big data workloads. This post explores the feature differences between traditional Enterprise Data Warehouses (EDWs) and Pivotal Greenplum database and how the addition of Pivotal Query Optimizer enables the Greenplum database to deliver a complete analytics feature set in the data warehousing space.
Pivotal For Good (P4G) has partnered with Crisis Text Line, whose trained specialists assist hundreds of at-risk teens every day, to use data science to understand and predict teens’ crisis needs. In this series, Pivotal Data Scientist Noelle Sio continues to share some insights on Natural Language Processing (NLP). Specifically, this post looks at how to properly cleanse and prepare the text data, enabling the team to create topic models and word-based features that can be reused for additional text analytics tasks that feed into both counselor training and product direction.
This is the first post in a series on why Spring XD is a better option, and it provides a text version of Fred’s road show presentation. The first post’s goal is to share the business case, the challenges people are facing now, and how Spring XD answers them. Later posts will dive deeper and explain why it is better than other options.
Today, we are releasing Pivotal Greenplum Database 4.3.5 with the new Pivotal Query Optimizer, the world’s first cost-based optimizer for Big Data. This new cost-based query optimizer represents a radical improvement over the legacy optimizer that has served the Greenplum Database well over the past 10 years. This post reviews how its modularity, extensibility and performance differ from those of other optimizers, and drills into its new features for query optimization.
When Pivotal first announced HAWQ would be available on HDP, some of our first thoughts were about how nice it would be to let customers install HAWQ directly onto the Hortonworks Sandbox, giving them a place to take the software for a spin. This post includes a step-by-step guide for installing HAWQ on the Hortonworks Sandbox.
Upcoming Pivotal Data Science Events
Hadoop Summit 2015
Jun 9 – 11, 2015, San Jose, CA
Pivotal is proud to be a Platinum Sponsor of the 8th Annual Hadoop Summit, the leading conference for the Apache Hadoop community. Come visit the Pivotal booth and speak with Pivotal experts.
Sarah Aerni: “IoT: How Data Science-Driven Software is Eating the Connected World”
June 10, 11:15-11:55 a.m.
Pivotal Big Data Roadshow
Jun 16, 2015, Rocky Hill, CT
Jun 18, 2015, Burlington, MA
Join data technology experts from Pivotal to get the latest perspective on how big data analytics and applications are transforming organizations across industries.
In-Memory Computing Summit 2015
Jun 29 – 30, 2015, San Francisco, CA
Pivotal is proud to be a sponsor of The In-Memory Computing Summit 2015, the first and only industry-wide event of its kind, tailored to in-memory computing related technologies and solutions.