@@ -1,6 +1,7 @@
---
layout: post
title: How to do simple data analysis with pandas

date: 2021-03-27
author: LZY
categories:
32 changes: 32 additions & 0 deletions docs/views/data/2021-11-08-week1学习内容.md
@@ -0,0 +1,32 @@
---
layout: post
title: Week 1 study notes
date: 2021-11-08
author: 饶翰宇
categories:
- Data Analysis Department
tags:
- Data analysis
- Python
---

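## Python:

1. pandas

Algorithms and data structures:

1. Decision trees
2. Random forests
3. Stacks and queues

Math:

1. Distributions of two-dimensional random variables

Other:

1. Learned basic Markdown syntax
2. Wrote Markdown with Typora
3. Installed pandoc and set up the basic tooling for uploading blog posts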

212 changes: 212 additions & 0 deletions docs/views/data/2021-11-25-MySQL和数据可视化.md
@@ -0,0 +1,212 @@
---
layout: post
title: MySQL and data visualization
date: 2021-11-25
author: 饶翰宇
categories:
- Data Analysis Department
tags:
- MySQL
- Python
- Function plotting
---

## Week 3

### Advanced MySQL, part 1

1. Queries 🐱‍🐉

- Sorted queries

```mysql
select * from person order by age desc,id asc;
```

2. Functions 🐱‍🚀

- Single-row functions

1. String functions

- concat

```mysql
select concat('这瓜','保熟','吗');
```

- length (returns the length in bytes, not in characters)

```mysql
select length('张三a123');
```

- substr/substring

```mysql
SELECT SUBSTR('今天希望你开心',5,3);
```

- upper & lower

```mysql
select upper('abAb1');
select lower('abAb1');
```

- instr

```mysql
select instr('泊松分布','分布');
```

- trim

```mysql
select trim(' abcd ');
select trim('ab' from 'ab abcc b');
```

- lpad & rpad

```mysql
SELECT LPAD('哥谭市',8,'*');
SELECT RPAD('哥谭市',8,'*');
```

- replace

```mysql
SELECT REPLACE('想登上高山欲穷千里目','想','不想');
```



2. Math functions

- round

```mysql
select round(1.22,1);
```

- ceil & floor (round up and round down to an integer)

```mysql
select ceil(1.9);
select floor(1.9);
```

- truncate (keeps the given number of decimal places without rounding)

```mysql
select truncate(1.231313,3);
```

- mod

```mysql
select mod(10,3);
```

3. Date functions

- now
- curdate
- curtime

4. Other functions

5. Flow-control functions

- Aggregate functions (used for statistics)
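
For readers who think in Python, the sorted query and the string and math functions above have rough analogues. The following is a minimal sketch, not MySQL itself: the `people` rows are made-up sample data, the helper names `substr` and `instr` are hypothetical, and the byte length assumes the string is stored as UTF-8 (for example `utf8mb4`), where each Chinese character takes 3 bytes.

```python
import math

# Hypothetical sample rows standing in for the `person` table.
people = [
    {"id": 2, "age": 30},
    {"id": 1, "age": 30},
    {"id": 3, "age": 25},
]
# ORDER BY age DESC, id ASC: negate age for the descending part.
ordered = sorted(people, key=lambda row: (-row["age"], row["id"]))
ordered_ids = [row["id"] for row in ordered]  # [1, 2, 3]

text = "张三a123"
byte_length = len(text.encode("utf-8"))  # LENGTH counts bytes: 3 + 3 + 4 = 10
char_length = len(text)                  # CHAR_LENGTH counts characters: 6


def substr(s, pos, length):
    """1-based substring like SUBSTR(s, pos, length), for pos >= 1 only."""
    return s[pos - 1:pos - 1 + length]


def instr(s, sub):
    """1-based position of sub in s, or 0 if absent, like INSTR."""
    return s.find(sub) + 1


piece = substr("今天希望你开心", 5, 3)   # "你开心"
position = instr("泊松分布", "分布")      # 3
padded = "哥谭市".rjust(8, "*")          # "*****哥谭市", like LPAD for short inputs

rounded_up = math.ceil(1.9)                     # 2
rounded_down = math.floor(1.9)                  # 1
truncated = math.trunc(1.231313 * 1000) / 1000  # 1.231, like TRUNCATE(x, 3)
remainder = 10 % 3                              # 1
half = round(2.5)  # 2 in Python (ties go to the even neighbour); MySQL ROUND(2.5) gives 3
```

Two differences are worth keeping in mind when moving between the two. Python's `%` takes the sign of the divisor while MySQL's `MOD` takes the sign of the dividend, so they agree only for non-negative operands. And `str.rjust` never shortens a string, whereas `LPAD` truncates input that is longer than the target length.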



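### Plotting

#### Plotting the normal distribution :jack_o_lantern:

1. Plotting from random numbers :baby_chick:

- First, use NumPy to generate an array of random samples from the standard normal distribution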

```python
import numpy as np
np.random.seed(0)
data = np.random.standard_normal(100000000)
data
```

```python
array([ 1.76405235, 0.40015721, 0.97873798, ..., 0.32191089,
0.25199669, -1.22612391])
```

- Then draw the histogram with matplotlib

```python
import matplotlib.pyplot as plt
%matplotlib inline
plt.hist(data,1000)
```

![屏幕截图 2021-11-26 121445.png](https://i.loli.net/2021/11/26/2yPKNiYHb6kuaQR.png)



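2. Plotting with SymPy :label:

- ```python
from sympy import *
from sympy.stats import Normal,density
```

- ```python
y = symbols('y')
x = symbols('x')
# y becomes a standard normal random variable (mean 0, standard deviation 1)
y = Normal(y,0,1)
# density(y) is the density function; evaluate it at the symbol x and plot it
plot(density(y)(x))
```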

- ```python
density(y)(x)
```

- ![屏幕截图 2021-11-26 135103.png](https://i.loli.net/2021/11/26/tYrEBCmaT67iFWX.png)

- ![屏幕截图 2021-11-26 133204.png](https://i.loli.net/2021/11/28/EdlYUr84F1ceCXi.png)
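
The histogram and the SymPy curve describe the same function, the standard normal density $f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$. As a rough check that does not need a plot, the sketch below draws samples with Python's standard library and compares the empirical density in a narrow bin around 0 with the formula. The sample size, the bin width, and the helper name `normal_pdf` are illustrative choices, not taken from the post.

```python
import math
import random


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


random.seed(0)
n_samples = 200_000
samples = [random.gauss(0.0, 1.0) for _ in range(n_samples)]

# Empirical density in the bin [-0.1, 0.1): fraction of samples / bin width.
half_width = 0.1
in_bin = sum(1 for s in samples if -half_width <= s < half_width)
empirical_density = in_bin / n_samples / (2 * half_width)

print(f"formula at 0:     {normal_pdf(0.0):.4f}")
print(f"empirical near 0: {empirical_density:.4f}")
```

The two numbers should be close but not identical: the bin has a finite width and the sample is random, which is also why the histogram looks slightly ragged while the SymPy curve is smooth.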



#### Plotting other functions

1. sympy

- ```python
plot(x,pow(x,2))
```

- ![屏幕截图 2021-11-26 141341.png](https://i.loli.net/2021/11/26/LMZstOnfU2JHKF9.png)

2. matplotlib

- ```python
x = np.arange(1,10,0.01)
y = np.log10(x)
u = np.arange(1,10,0.01)
w = np.exp(u)
```

- ```python
plt.style.use('ggplot')
fig,ax = plt.subplots(1,2,figsize=(8,4))
ax[0].plot(x,y,label='log10',color='r')
ax[0].legend(loc='best')
ax[1].plot(u,w,label='e^x',color='b')
ax[1].legend(loc='best')
```

- ![屏幕截图 2021-11-26 143624.png](https://i.loli.net/2021/11/26/wZLR4r2SmQ1G8XW.png)
119 changes: 119 additions & 0 deletions docs/views/data/2021-12-3-朴素贝叶斯算法实现文本分类.md
@@ -0,0 +1,119 @@
---
layout: post
title: Text classification with the naive Bayes algorithm
date: 2021-12-3
author: 饶翰宇
categories:
- Data Analysis Department
tags:
- Python
- Text classification
---

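## Text classification

Real-world text is complex and varied, and text classification and sentiment analysis are important parts of our machine learning work.

The example below walks through classifying text step by step.

- First, load the raw data

Here we use a dataset of restaurant reviews.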

```python
import pandas as pd
data = pd.read_csv('./restaurant.csv',encoding='gb18030')
data
```

![A5SF1_K_AV1CUQ9__9_Z8M7.png](https://s2.loli.net/2021/12/04/1cTFRozlOeU2W7I.png)

- Next, label each row: rows with `star` greater than 3 get label 1, and all other rows get label 0

```python
import numpy as np
star = np.array(data.star)
star[star <= 3] = 0
star[star > 3] = 1
data['label'] = star
data
```

![98ST`_R_XMSE__CTL_YN_GV.png](https://s2.loli.net/2021/12/04/RlbGJV6ZQmYShsa.png)
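
The labelling rule is a single threshold, so it can also be written as one pure function. Below is a minimal sketch in plain Python, assuming the ratings are numbers; `label_from_star` and the sample ratings are hypothetical names and data used only for illustration.

```python
def label_from_star(star, threshold=3):
    """Return 1 for ratings above the threshold, otherwise 0."""
    return 1 if star > threshold else 0


sample_stars = [5, 3, 4, 1, 2]  # made-up ratings
labels = [label_from_star(s) for s in sample_stars]
print(labels)  # [1, 0, 1, 0, 0]
```

With pandas, the same rule can be applied to the whole column at once as `(data['star'] > 3).astype(int)`, which does not depend on the order of two in-place assignments.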

- Then we segment each review into words and add a new column, `words`

```python
import jieba
data['words'] = data['comment'].apply(lambda x:' '.join(jieba.lcut(x,cut_all=True)))
data
```

![8BI6_7E8FI7864I_CVVY1_T.png](https://s2.loli.net/2021/12/04/Z8QMfj7LEolWqrX.png)

- Split the dataset into a training set and a test set

```python
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(data.words,data.label,test_size=0.2,random_state=42)
```
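`train_test_split` shuffles the rows and holds out a fraction of them for testing, and `random_state` fixes the shuffle so the split is reproducible. The sketch below shows that idea with the standard library only. It is an illustration rather than scikit-learn's implementation, and `simple_split` and the sample data are hypothetical.

```python
import random


def simple_split(items, labels, test_size=0.2, seed=42):
    """Shuffle indices with a fixed seed and hold out a test_size share."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return (
        [items[i] for i in train_idx],
        [items[i] for i in test_idx],
        [labels[i] for i in train_idx],
        [labels[i] for i in test_idx],
    )


docs = [f"review {i}" for i in range(10)]  # made-up texts
marks = [i % 2 for i in range(10)]         # made-up labels
x_tr, x_te, y_tr, y_te = simple_split(docs, marks)
print(len(x_tr), len(x_te))  # 8 2
```

Texts and labels are picked with the same indices, so each review keeps its own label after the shuffle.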

- Import the text feature extraction class

```python
from sklearn.feature_extraction.text import CountVectorizer
```

- Count how often each word occurs

```python
counter = CountVectorizer()
x_train = counter.fit_transform(x_train)
x_test = counter.transform(x_test)
```
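Two details of this step are easy to miss. The vocabulary is learned from the training texts only, so `transform` on the test texts ignores words that never appeared in training. And with its default settings `CountVectorizer` keeps only tokens of two or more word characters, so single-character words produced by the segmentation are dropped. The sketch below imitates both behaviours with the standard library; the sample sentences and helper names are made up for illustration, and lowercasing and the other vectorizer options are left out.

```python
import re
from collections import Counter

# Same regular expression as CountVectorizer's default token_pattern.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokenize(text):
    """Split text into tokens of at least two word characters."""
    return TOKEN_PATTERN.findall(text)


def fit_vocabulary(texts):
    """Map each token seen in the training texts to a column index."""
    tokens = sorted({tok for text in texts for tok in tokenize(text)})
    return {tok: col for col, tok in enumerate(tokens)}


def transform(texts, vocabulary):
    """Return one count row per text; tokens outside the vocabulary are skipped."""
    rows = []
    for text in texts:
        counts = Counter(tok for tok in tokenize(text) if tok in vocabulary)
        rows.append([counts[tok] for tok in vocabulary])
    return rows


train_texts = ["味道 很 好 味道", "服务 很 差"]  # already segmented and space-joined
test_texts = ["味道 一般"]

vocab = fit_vocabulary(train_texts)         # {'味道': 0, '服务': 1}
train_rows = transform(train_texts, vocab)  # [[2, 0], [0, 1]]
test_rows = transform(test_texts, vocab)    # [[1, 0]]
```

If single-character words matter for the task, passing `token_pattern=r"(?u)\b\w+\b"` to `CountVectorizer` keeps them.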

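- Show the counts as a table

```python
# Dense copy of the sparse count matrix (fine for small data, memory-hungry for large data).
amount = x_train.toarray()
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it.
name = counter.get_feature_names_out()
result = pd.DataFrame(data=amount,columns=name)
result
```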

![屏幕截图 2021-12-05 164011.png](https://s2.loli.net/2021/12/05/ABOXHVwYGRyFZhM.png)

- Build the model

```python
from sklearn.naive_bayes import MultinomialNB
estimator = MultinomialNB()
estimator.fit(x_train,y_train)
```

```python
y_predict = estimator.predict(x_test)
```

![屏幕截图 2021-12-05 164841.png](https://s2.loli.net/2021/12/05/gsTkdQZNf3vCY6o.png)
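
`MultinomialNB` scores each class by adding the log prior of the class to the log likelihood of the observed word counts, with additive (Laplace) smoothing so that unseen words do not get zero probability. The following is a compact sketch of that rule on a toy count matrix, assuming the smoothing parameter `alpha = 1` (the scikit-learn default). The function names and the data are hypothetical, and this is not the library's implementation.

```python
import math


def fit_nb(rows, labels, alpha=1.0):
    """Estimate log priors and smoothed log word probabilities per class."""
    n_features = len(rows[0])
    model = {}
    for cls in sorted(set(labels)):
        cls_rows = [r for r, y in zip(rows, labels) if y == cls]
        word_totals = [sum(col) for col in zip(*cls_rows)]
        denom = sum(word_totals) + alpha * n_features
        log_prior = math.log(len(cls_rows) / len(rows))               # log P(class)
        log_probs = [math.log((t + alpha) / denom) for t in word_totals]  # log P(word | class)
        model[cls] = (log_prior, log_probs)
    return model


def predict_nb(model, row):
    """Pick the class with the highest log posterior score."""
    def score(cls):
        log_prior, log_probs = model[cls]
        return log_prior + sum(c * lp for c, lp in zip(row, log_probs))
    return max(model, key=score)


# Toy counts: columns are [good, bad]; label 1 = positive, 0 = negative.
toy_rows = [[2, 0], [1, 0], [0, 2], [0, 1]]
toy_labels = [1, 1, 0, 0]
toy_model = fit_nb(toy_rows, toy_labels)
print(predict_nb(toy_model, [3, 0]), predict_nb(toy_model, [0, 1]))  # 1 0
```

Working in log space turns the product of many small probabilities into a sum, which avoids numeric underflow on long reviews.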

- Compute the accuracy

```python
estimator.score(x_test,y_test)
```

$$
0.8475
$$



- Check, element by element, whether each prediction matches the test-set label

```python
np.array(y_test == y_predict)
```

![屏幕截图 2021-12-05 165057.png](https://s2.loli.net/2021/12/05/DWs2x9eSQcEFirm.png)
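
The score and the element-wise comparison are two views of the same thing: the accuracy is the share of `True` values in that comparison. Here is a small sketch with made-up labels (not the post's data), using a hypothetical helper `accuracy`.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the prediction equals the true label."""
    matches = [t == p for t, p in zip(y_true, y_pred)]
    return sum(matches) / len(matches)


y_true_demo = [1, 0, 1, 1, 0]  # made-up labels
y_pred_demo = [1, 0, 0, 1, 0]  # made-up predictions
print(accuracy(y_true_demo, y_pred_demo))  # 0.8
```

With the objects from the post, `(y_test == y_predict).mean()` should give the same number as `estimator.score(x_test,y_test)`.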
