- Data Science – Part I – Building Predictive Analytics Capabilities
- Data Science – Part II – Working with R & R Studio
- Data Science – Part III – EDA & Model Selection
- Data Science – Part IV – Regression Analysis and ANOVA Concepts
- Data Science – Part V – Decision Trees & Random Forests
- Data Science – Part VI – Market Basket and Product Recommendation Engines
- Data Science – Part VII – Cluster Analysis
R for Everyone: Advanced Analytics and Graphics
Using the open source R language, you can build powerful statistical models to answer many of your most challenging questions. R has traditionally been difficult for non-statisticians to learn, and most R books assume far too much knowledge to be of help. R for Everyone, Second Edition, is the solution.
Drawing on his unsurpassed experience teaching new users, professional data scientist Jared P. Lander has written the perfect tutorial for anyone new to statistical programming and modeling. Organized to make learning easy and intuitive, this guide focuses on the 20 percent of R functionality you’ll need to accomplish 80 percent of modern data tasks.
Lander’s self-contained chapters start with the absolute basics, offering extensive hands-on practice and sample code. You’ll download and install R; navigate and use the R environment; master basic program control, data import, manipulation, and visualization; and walk through several essential tests. Then, building on this foundation, you’ll construct several complete models, both linear and nonlinear, and use some data mining techniques. After all this you’ll make your code reproducible with LaTeX, RMarkdown, and Shiny.
By the time you’re done, you won’t just know how to write R programs; you’ll be ready to tackle the statistical problems you care about most.
- Explore R, RStudio, and R packages
- Use R for math: variable types, vectors, calling functions, and more
- Exploit data structures, including data.frames, matrices, and lists
- Read many different types of data
- Create attractive, intuitive statistical graphics
- Write user-defined functions
- Control program flow with if, ifelse, and complex checks
- Improve program efficiency with group manipulations
- Combine and reshape multiple datasets
- Manipulate strings using R’s facilities and regular expressions
- Create normal, binomial, and Poisson probability distributions
- Build linear, generalized linear, and nonlinear models
- Program basic statistics: mean, standard deviation, and t-tests
- Train machine learning models
- Assess the quality of models and variable selection
- Prevent overfitting and perform variable selection, using the Elastic Net and Bayesian methods
- Analyze univariate and multivariate time series data
- Group data via K-means and hierarchical clustering
- Prepare reports, slideshows, and web pages with knitr
- Display interactive data with RMarkdown and htmlwidgets
- Implement dashboards with Shiny
- Build reusable R packages with devtools and Rcpp
Paperback: 560 pages
Publisher: Addison Wesley; 2nd edition (8 Jun. 2017)
Product Dimensions: 17.8 x 2 x 23.1 cm
Data Just Right: Introduction to Large-Scale Data & Analytics
Large-scale data analysis is now vitally important to virtually every business. Mobile and social technologies are generating massive datasets; distributed cloud computing offers the resources to store and analyze them; and professionals have radically new technologies at their command, including NoSQL databases. Until now, however, most books on “Big Data” have been little more than business polemics or product catalogs. Data Just Right is different: It’s a completely practical and indispensable guide for every Big Data decision-maker, implementer, and strategist.
Michael Manoochehri, a former Google engineer and data hacker, writes for professionals who need practical solutions that can be implemented with limited resources and time. Drawing on his extensive experience, he helps you focus on building applications, rather than infrastructure, because that’s where you can derive the most value.
Manoochehri shows how to address each of today’s key Big Data use cases in a cost-effective way by combining technologies in hybrid solutions. You’ll find expert approaches to managing massive datasets, visualizing data, building data pipelines and dashboards, choosing tools for statistical analysis, and more. Throughout, the author demonstrates techniques using many of today’s leading data analysis tools, including Hadoop, Hive, Shark, R, Apache Pig, Mahout, and Google BigQuery.
- Mastering the four guiding principles of Big Data success—and avoiding common pitfalls
- Emphasizing collaboration and avoiding problems with siloed data
- Hosting and sharing multi-terabyte datasets efficiently and economically
- “Building for infinity” to support rapid growth
- Developing a NoSQL Web app with Redis to collect crowd-sourced data
- Running distributed queries over massive datasets with Hadoop, Hive, and Shark
- Building a data dashboard with Google BigQuery
- Exploring large datasets with advanced visualization
- Implementing efficient pipelines for transforming immense amounts of data
- Automating complex processing with Apache Pig and the Cascading Java library
- Applying machine learning to classify, recommend, and predict incoming information
- Using R to perform statistical analysis on massive datasets
- Building highly efficient analytics workflows with Python and Pandas
- Establishing sensible purchasing strategies: when to build, buy, or outsource
- Previewing emerging trends and convergences in scalable data technologies and the evolving role of the Data Scientist
Paperback: 256 pages
Publisher: Addison Wesley; 1st edition (19 Dec. 2013)
Practical Data Science with Hadoop and Spark: Designing and Building Effective Analytics at Scale
Demand is soaring for professionals who can solve real data science problems with Hadoop and Spark. Practical Data Science with Hadoop® and Spark is your complete guide to doing just that. Drawing on immense experience with Hadoop and big data, three leading experts bring together everything you need: high-level concepts, deep-dive techniques, real-world use cases, practical applications, and hands-on tutorials.
The authors introduce the essentials of data science and the modern Hadoop ecosystem, explaining how Hadoop and Spark have evolved into an effective platform for solving data science problems at scale. In addition to comprehensive application coverage, the authors also provide useful guidance on the important steps of data ingestion, data munging, and visualization.
Once the groundwork is in place, the authors focus on specific applications, including machine learning, predictive modeling for sentiment analysis, clustering for document analysis, anomaly detection, and natural language processing (NLP).
This guide provides a strong technical foundation for those who want to do practical data science, and also presents business-driven guidance on how to apply Hadoop and Spark to optimize ROI of data science initiatives.
- What data science is, how it has evolved, and how to plan a data science career
- How data volume, variety, and velocity shape data science use cases
- Hadoop and its ecosystem, including HDFS, MapReduce, YARN, and Spark
- Data importation with Hive and Spark
- Data quality, preprocessing, preparation, and modeling
- Visualization: surfacing insights from huge data sets
- Machine learning: classification, regression, clustering, and anomaly detection
- Algorithms and Hadoop tools for predictive modeling
- Cluster analysis and similarity functions
- Large-scale anomaly detection
- NLP: applying data science to human language
Paperback: 256 pages
Publisher: Addison Wesley; 1st edition (12 Dec. 2016)
Product Dimensions: 17.8 x 1.8 x 23.1 cm
Expert Hadoop Administration: Managing, Tuning, and Securing Spark, YARN, and HDFS
In Expert Hadoop® Administration, leading Hadoop administrator Sam R. Alapati brings together authoritative knowledge for creating, configuring, securing, managing, and optimizing production Hadoop clusters in any environment. Drawing on his experience with large-scale Hadoop administration, Alapati integrates action-oriented advice with carefully researched explanations of both problems and solutions. He covers an unmatched range of topics and offers an unparalleled collection of realistic examples.
Alapati demystifies complex Hadoop environments, helping you understand exactly what happens behind the scenes when you administer your cluster. You’ll gain unprecedented insight as you walk through building clusters from scratch and configuring high availability, performance, security, encryption, and other key attributes. The high-value administration skills you learn here will be indispensable no matter what Hadoop distribution you use or what Hadoop applications you run.
- Understand Hadoop’s architecture from an administrator’s standpoint
- Create simple and fully distributed clusters
- Run MapReduce and Spark applications in a Hadoop cluster
- Manage and protect Hadoop data and high availability
- Work with HDFS commands, file permissions, and storage management
- Move data, and use YARN to allocate resources and schedule jobs
- Manage job workflows with Oozie and Hue
- Secure, monitor, log, and optimize Hadoop
- Benchmark and troubleshoot Hadoop
Paperback: 848 pages
Publisher: Addison Wesley (6 Dec. 2016)
Product Dimensions: 17.8 x 4.8 x 23.1 cm
Apache Hadoop YARN: Moving beyond MapReduce and Batch Processing with Apache Hadoop 2
Apache Hadoop is helping drive the Big Data revolution. Now, its data processing has been completely overhauled: Apache Hadoop YARN provides resource management at data center scale and easier ways to create distributed applications that process petabytes of data. And now in Apache Hadoop™ YARN, two Hadoop technical leaders show you how to develop new applications and adapt existing code to fully leverage these revolutionary advances.
YARN project founder Arun Murthy and project lead Vinod Kumar Vavilapalli demonstrate how YARN increases scalability and cluster utilization, enables new programming models and services, and opens new options beyond Java and batch processing. They walk you through the entire YARN project lifecycle, from installation through deployment.
You’ll find many examples drawn from the authors’ cutting-edge experience—first as Hadoop’s earliest developers and implementers at Yahoo! and now as Hortonworks developers moving the platform forward and helping customers succeed with it.
- YARN’s goals, design, architecture, and components—how it expands the Apache Hadoop ecosystem
- Exploring YARN on a single node
- Administering YARN clusters and Capacity Scheduler
- Running existing MapReduce applications
- Developing a large-scale clustered YARN application
- Discovering new open source frameworks that run under YARN
Paperback: 336 pages
Publisher: Addison Wesley Professional; 1st edition (19 Mar. 2014)
Product Dimensions: 17.8 x 2 x 22.6 cm
Visual Storytelling with D3: An Introduction to Data Visualization in JavaScript
Drawing on his extensive experience as a professional graphic artist, writer, and programmer, Ritchie S. King walks you through a complete sample project—from conception through data selection and design. Step by step, you’ll build your skills, mastering increasingly sophisticated graphical forms and techniques. If you know a little HTML and CSS, you have all the technical background you’ll need to master D3.
This tutorial is for web designers creating graphics-driven sites, services, tools, or dashboards; online journalists who want to visualize their content; researchers seeking to communicate their results more intuitively; marketers aiming to deepen their connections with customers; and for any data visualization enthusiast.
- Identifying a data-driven story and telling it visually
- Creating and manipulating beautiful graphical elements with SVG
- Shaping web pages with D3
- Structuring data so D3 can easily visualize it
- Using D3’s data joins to connect your data to the graphical elements on a web page
- Sizing and scaling charts, and adding axes to them
- Loading and filtering data from external standalone datasets
- Animating your charts with D3’s transitions
- Adding interactivity to visualizations, including a play button that cycles through different views of your data
- Finding D3 resources and getting involved in the thriving online D3 community
Paperback: 288 pages
Publisher: Addison Wesley; 1st edition (27 Aug. 2014)
Product Dimensions: 18.8 x 1.6 x 23.3 cm