The Top 15 Python Libraries Every Leading Data Analyst Should Use (Part 1)

Source: 灯塔大数据 Time: 2017-07-05 11:33:06

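　　Because all of the libraries mentioned here are open source, we have also noted each library's commit count, number of contributors, and other metrics, which serve as a rough indicator of how popular each Python library is.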

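　　1. NumPy

　　(Commits: 15980; Contributors: 522)

　　When first starting out with Python for scientific work, we inevitably turn to Python's SciPy Stack, a collection of software designed specifically for scientific computing in Python. Any discussion of Python libraries therefore has to begin with it. The SciPy Stack is very broad, however, comprising more than a dozen libraries, so what we need to do is pick out the most important packages.

　　NumPy (short for Numerical Python) is the most fundamental package on which the scientific computation stack is built. It offers a rich set of features for operating on n-dimensional arrays and matrices in Python. The library provides vectorized mathematical operations on the NumPy array type, which improves performance and so speeds up execution.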

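　　2. SciPy

　　(Commits: 17213; Contributors: 489)

　　SciPy is a library of software for engineering and science. You also need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built on NumPy, so its arrays make heavy use of NumPy. It provides efficient numerical routines, such as numerical integration and optimization, through its specific submodules. The functions in all SciPy submodules are well documented, which is another of its strengths.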

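　　3. Pandas

　　(Commits: 15089; Contributors: 762)

　　Pandas is a Python package that makes working with "labeled" and "relational" data simple and intuitive. Pandas is an ideal tool for data wrangling. It lets users carry out data manipulation, aggregation, and visualization quickly and easily.

　　Pandas has two main data structures:

　　"Series": one-dimensional

　　"DataFrame": two-dimensional

　　For example, if you append a row to a DataFrame by passing a Series, you obtain a new DataFrame from these two data structures.

　　Here are some of the things you can do with Pandas:

　　Easily delete or add DataFrame columns

　　Convert data structures to DataFrame objects

　　Handle missing data, represented as NaNs

　　Powerful group-by functionality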

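　　4. Matplotlib

　　(Commits: 21754; Contributors: 588)

　　Matplotlib is another core package of the SciPy Stack and another Python library, one that makes it easy to produce simple yet powerful visualizations. This top-notch package lets Python (with some help from NumPy, SciPy, and Pandas) compete with scientific tools such as MATLAB or Mathematica.

　　However, the library is fairly low-level, which means you need to write more code to reach advanced visualizations and will generally put in more effort than with higher-level tools, but on the whole it is worth trying.

　　You can use it to create all kinds of visualizations:

　　Line plots;

　　Scatter plots;

　　Bar charts and histograms;

　　Pie charts;

　　Stem plots;

　　Contour plots;

　　Quiver plots;

　　Spectrograms

　　Matplotlib can also create labels, grids, legends, and many other formatting elements. Basically, everything is customizable.

　　The library is supported on many platforms and uses different GUI toolkits to render the resulting visualizations. Many development environments (such as the IPython interactive shell) support Matplotlib.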

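　　5. Seaborn

　　(Commits: 1699; Contributors: 71)

　　Seaborn focuses mainly on visualizing statistical models, with heat maps as one example; these visualizations summarize the data while still depicting its overall distribution. Seaborn is built on Matplotlib and depends heavily on it.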

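　　6. Bokeh

　　(Commits: 15724; Contributors: 223)

　　Bokeh is another powerful visualization library, aimed at interactive visualization. Unlike the previous library, it is independent of Matplotlib. Bokeh's main focus is interactivity, so it presents its output in modern browsers in the style of Data-Driven Documents (d3.js).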

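　　7. Plotly

　　(Commits: 2486; Contributors: 33)

　　Plotly is a web-based toolbox for building visualizations that exposes APIs to several programming languages (Python among them). The plot.ly website offers a number of robust, out-of-the-box graphics. Before using Plotly, you need to set up your API key. The graphics are processed on the server side and then published on the internet, although you can also choose not to publish them.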

　　English original

　　Top 15 Python Libraries for Data Science in 2017

As Python has gained a lot of traction in the data science industry in recent years, I wanted to outline some of its most useful libraries for data scientists and engineers, based on recent experience.

And since all of the libraries are open source, we have added commit counts, contributor counts, and other metrics from GitHub, which can serve as proxy metrics for library popularity.

Core Libraries.

1. NumPy (Commits: 15980, Contributors: 522)

When starting on scientific tasks in Python, one inevitably turns for help to Python’s SciPy Stack, a collection of software specifically designed for scientific computing in Python (not to be confused with the SciPy library, which is part of this stack, or with the community around the stack). We therefore start by looking at it. However, the stack is pretty vast: it contains more than a dozen libraries, so we focus on the core packages (particularly the most essential ones).

The most fundamental package, around which the scientific computation stack is built, is NumPy (short for Numerical Python). It provides an abundance of useful features for operations on n-dimensional arrays and matrices in Python. The library provides vectorization of mathematical operations on the NumPy array type, which improves performance and accordingly speeds up execution.
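To make the idea of vectorization concrete, here is a minimal sketch that applies the same arithmetic once with an explicit Python loop and once as a single whole-array expression. The data and the function names (`scale_with_loop`, `scale_vectorized`) are made up for illustration and do not come from the original article.

```python
import numpy as np

# Hypothetical sample data: one million evenly spaced values in [0, 1].
values = np.linspace(0.0, 1.0, 1_000_000)


def scale_with_loop(xs):
    """Compute 2*x + 1 element by element with a plain Python loop."""
    out = np.empty_like(xs)
    for i in range(len(xs)):
        out[i] = 2.0 * xs[i] + 1.0
    return out


def scale_vectorized(xs):
    """Compute 2*x + 1 as one whole-array (vectorized) expression."""
    return 2.0 * xs + 1.0


# Both versions agree; the loop is checked on a small slice only,
# because it is the slow path.
small = values[:1000]
assert np.allclose(scale_with_loop(small), scale_vectorized(small))

result = scale_vectorized(values)
```

The vectorized form hands the loop to NumPy's compiled code, which is where the speed-up described above comes from.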

2. SciPy (Commits: 17213, Contributors: 489)

SciPy is a library of software for engineering and science. Again, you need to understand the difference between the SciPy Stack and the SciPy library. SciPy contains modules for linear algebra, optimization, integration, and statistics. The main functionality of the SciPy library is built upon NumPy, and its arrays thus make substantial use of NumPy. It provides efficient numerical routines, such as numerical integration, optimization, and many others, via its specific submodules. The functions in all submodules of SciPy are well documented, which is another point in its favor.
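As a small illustration of the submodule layout, the sketch below uses `scipy.integrate` for numerical integration and `scipy.optimize` for a one-dimensional minimization. The integrand and the objective function are arbitrary examples chosen here because their exact answers are known; they are not taken from the article.

```python
import numpy as np
from scipy import integrate, optimize

# Numerical integration: the integral of sin(x) over [0, pi] is exactly 2.
# quad returns the estimate together with an estimate of the absolute error.
area, abs_error = integrate.quad(np.sin, 0.0, np.pi)


# Optimization: minimize a simple quadratic whose minimum is at x = 3.
def objective(x):
    return (x - 3.0) ** 2


solution = optimize.minimize_scalar(objective)

print(f"integral ~ {area:.6f}")
print(f"minimizer ~ {solution.x:.6f}")
```

Note that both routines accept ordinary Python callables and work with NumPy functions and arrays, which reflects how the library builds on NumPy.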

3. Pandas (Commits: 15089, Contributors: 762)

Pandas is a Python package designed to make work with “labeled” and “relational” data simple and intuitive. Pandas is a perfect tool for data wrangling. It is designed for quick and easy data manipulation, aggregation, and visualization.

There are two main data structures in the library:

“Series”: one-dimensional

“DataFrames”: two-dimensional

For example, appending a single row to a DataFrame by passing a Series gives you a new DataFrame built from these two types of structures.
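The row-appending case might look like the sketch below. It assumes a recent pandas release: the `DataFrame.append` method that was available when the article was written was removed in pandas 2.0, so the sketch uses `pd.concat` instead. The column names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: a two-row DataFrame and one extra row held in a Series.
df = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})
new_row = pd.Series({"name": "Cid", "score": 78})

# Turn the Series into a one-row DataFrame, then concatenate.
# This returns a new DataFrame; the original df is left unchanged.
extended = pd.concat([df, new_row.to_frame().T], ignore_index=True)

print(extended)
```

The Series index supplies the column labels, which is why the new row lines up with the existing columns.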

Here is just a small list of things that you can do with Pandas:

Easily delete and add columns in a DataFrame

Convert data structures to DataFrame objects

Handle missing data, represented as NaNs

Powerful group-by functionality
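The four capabilities listed above can be sketched in a few lines. The city and temperature data are made up for illustration, and filling missing values with the column mean is just one possible strategy, not something the article prescribes.

```python
import numpy as np
import pandas as pd

# Convert a plain dict of lists into a DataFrame.
records = {
    "city": ["Oslo", "Oslo", "Rome", "Rome"],
    "temp": [3.0, np.nan, 14.0, 16.0],
}
df = pd.DataFrame(records)

# Add a derived column, then delete it again.
df["temp_f"] = df["temp"] * 9 / 5 + 32
df = df.drop(columns=["temp_f"])

# Missing data is represented as NaN: count it, then fill it
# with the column mean (one possible strategy among many).
missing = int(df["temp"].isna().sum())
filled = df.fillna({"temp": df["temp"].mean()})

# Group by city and aggregate; NaN values are skipped by default.
mean_by_city = df.groupby("city")["temp"].mean()

print(mean_by_city)
```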

4. Matplotlib (Commits: 21754, Contributors: 588)

Another SciPy Stack core package, and another Python library tailored for generating simple and powerful visualizations with ease, is Matplotlib. It is a top-notch piece of software that makes Python (with some help from NumPy, SciPy, and Pandas) a serious competitor to scientific tools such as MATLAB or Mathematica.

However, the library is pretty low-level, meaning that you will need to write more code to reach advanced levels of visualization and will generally put in more effort than with higher-level tools, but the overall effort is worth it.

With a bit of effort you can make just about any visualization:

Line plots;

Scatter plots;

Bar charts and Histograms;

Pie charts;

Stem plots;

Contour plots;

Quiver plots;

Spectrograms

There are also facilities for creating labels, grids, legends, and many other formatting entities with Matplotlib. Basically, everything is customizable.
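A minimal sketch of a customized figure follows: a line plot and a scatter plot on the same axes, with labels, a grid, and a legend. It selects the non-interactive `Agg` backend and writes the image to an in-memory buffer, both of which are choices made here so the example runs without a display; in an interactive session you would more likely call `plt.show()`.

```python
import io

import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no GUI toolkit is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 100)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")  # line plot
ax.scatter(x[::10], np.cos(x[::10]), color="tab:orange",
           label="cos(x) samples")  # scatter plot

# Formatting: axis labels, title, grid, and legend.
ax.set_xlabel("x (radians)")
ax.set_ylabel("value")
ax.set_title("Line and scatter plot")
ax.grid(True)
ax.legend()

# Render to PNG in memory instead of opening a window.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```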

The library is supported on different platforms and makes use of different GUI kits for displaying the resulting visualizations. Various environments (such as the IPython interactive shell) support Matplotlib's functionality.

There are also some additional libraries that can make visualization even easier.

5. Seaborn (Commits: 1699, Contributors: 71)

Seaborn is mostly focused on the visualization of statistical models; such visualizations include heat maps, which summarize the data but still depict the overall distributions. Seaborn is based on Matplotlib and highly dependent on it.
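As an illustration of the heat map case, the sketch below draws a correlation matrix with `seaborn.heatmap`. The random data, the column names, and the colour map are arbitrary choices made for this example. The function returns a Matplotlib `Axes` object, which shows how closely Seaborn is tied to Matplotlib.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so no display is required
import numpy as np
import pandas as pd
import seaborn as sns

# Hypothetical data: 200 samples of three random variables.
rng = np.random.default_rng(0)
data = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])

# Summarize the data as a correlation matrix and draw it as a heat map.
corr = data.corr()
ax = sns.heatmap(corr, annot=True, vmin=-1.0, vmax=1.0, cmap="coolwarm")
ax.set_title("Correlation heat map")

# The return value is a regular Matplotlib Axes, so it can be saved
# or customized further with ordinary Matplotlib calls.
ax.figure.savefig("heatmap.png")
```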

6. Bokeh (Commits: 15724, Contributors: 223)

Another great visualization library is Bokeh, which is aimed at interactive visualizations. In contrast to the previous library, this one is independent of Matplotlib. The main focus of Bokeh, as already mentioned, is interactivity, and it presents its output in modern browsers in the style of Data-Driven Documents (d3.js).
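A browser-oriented workflow could be sketched as follows: build a figure with a few interactive tools and render it to a standalone HTML page. The data points and the tool selection are assumptions made for illustration. The sketch uses `bokeh.embed.file_html` to get the page as a string; writing it to disk and opening it in a browser is left to the reader.

```python
from bokeh.embed import file_html
from bokeh.plotting import figure
from bokeh.resources import CDN

# Hypothetical sample data.
x = [1, 2, 3, 4, 5]
y = [6, 7, 2, 4, 5]

# A figure with pan and zoom tools enabled for interactivity.
plot = figure(
    title="Interactive line plot",
    x_axis_label="x",
    y_axis_label="y",
    tools="pan,wheel_zoom,box_zoom,reset",
)
plot.line(x, y, legend_label="samples", line_width=2)

# Render a standalone HTML document that loads BokehJS from the CDN.
html = file_html(plot, CDN, "Bokeh example")
```

No Matplotlib import appears anywhere in the example, consistent with Bokeh being independent of it.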

7. Plotly (Commits: 2486, Contributors: 33)

Finally, a word about Plotly. It is a web-based toolbox for building visualizations that exposes APIs to several programming languages (Python among them). There are a number of robust, out-of-the-box graphics on the plot.ly website. In order to use Plotly, you will need to set up your API key. The graphics are processed server-side and posted on the internet, but there is a way to avoid that.