[Web Scraping] Scraping Gene Data from oncomine.org

Author: 风君子 (posted in: Software)

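This time the target is the well-known gene database www.oncomine.org, and the data to collect is the gene correlation values and the corresponding gene (locus) names. Because the source data is displayed as images, an ordinary crawler is not enough: we need to combine a conventional crawler, image cropping, and OCR table recognition.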

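First, a look at an image from the source:

The part framed in red in the image is what we want to capture. It looks like ordinary text, but just like the matrix plot on its right it is part of one complete image, and the data cannot be obtained from anywhere else on the page. We therefore proceed as follows: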

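1. A crawler tool pages through the site automatically and downloads all of the images, renaming them by date so that they can be processed in order later.

2. Load the images from the directory and crop them, keeping only the left part. This improves both the accuracy and the speed of OCR table recognition.

In addition, the left part of the image has no table lines, which makes it hard to recognize as a table, so grid lines are added artificially, as follows: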

            string source = _hostingEnvironment.WebRootPath + "\\source\\";
            string[] allGene = FileHelper.GetFileNames(source + "geneSPINT2\\");
            List<VM_Oncomine> finalR = new List<VM_Oncomine>();
            Bitmap background = (Bitmap)Bitmap.FromFile(source + "bg.jpg"); // white background
            // pic is the path of the current image taken from allGene (see the loop in the full listing)
            System.Drawing.Image iSource = System.Drawing.Image.FromFile(pic);
            Bitmap newbm = new Bitmap(133, 460);
            Graphics gh = Graphics.FromImage(newbm);
            gh.DrawImage(background, 0, 0, 133, 460);
            // copy the 133x460 region starting at (20, 50) of the source image onto the new bitmap
            gh.DrawImage(iSource, new Rectangle(0, 0, 133, 460), new Rectangle(20, 50, 133, 460), GraphicsUnit.Pixel);
            newbm.Save(pic.Replace("geneSPINT2", "temp_geneSPINT2"), ImageFormat.Bmp);
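
The snippet above composites the cropped region onto bg.jpg. If the grid lines are not already part of that background image, they could be drawn explicitly. The following is only a minimal sketch under that assumption: `GridHelper`, `AddGridLines`, and the `rowHeight` and `columnX` parameters are hypothetical names that do not appear in the original code, and the values passed in would have to be tuned against the real images.

```csharp
using System;
using System.Drawing;

public static class GridHelper
{
    // Draws an outer border, one horizontal line per row boundary, and one
    // vertical column separator. rowHeight and columnX are hypothetical
    // layout values, not taken from the original post.
    public static void AddGridLines(Bitmap bmp, int rowHeight, int columnX)
    {
        if (rowHeight <= 0)
        {
            throw new ArgumentOutOfRangeException(nameof(rowHeight));
        }

        using (Graphics g = Graphics.FromImage(bmp))
        using (Pen pen = new Pen(Color.Black, 1))
        {
            // Outer border, kept inside the bitmap bounds.
            g.DrawRectangle(pen, 0, 0, bmp.Width - 1, bmp.Height - 1);

            // Horizontal separators between rows.
            for (int y = rowHeight; y < bmp.Height; y += rowHeight)
            {
                g.DrawLine(pen, 0, y, bmp.Width - 1, y);
            }

            // Vertical separator between the two columns.
            g.DrawLine(pen, columnX, 0, columnX, bmp.Height - 1);
        }
    }
}
```

In the snippet above this would be called as, for example, `GridHelper.AddGridLines(newbm, 20, 45);` just before `newbm.Save(...)`, where 20 and 45 are placeholder values.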

3. Run the images through the Baidu table recognition API, which returns the result as JSON.

4. Merge the results and export them to Excel.
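
Step 3 is asynchronous: a recognition request is submitted first, and the result has to be polled until it is ready. A bounded polling loop can be sketched as below. `Poller` and `WaitUntil` are hypothetical names introduced for illustration; the helper takes plain delegates, so it makes no assumption about the Baidu SDK's signatures.

```csharp
using System;
using System.Threading;

public static class Poller
{
    // Calls fetch() until isDone(result) returns true or maxAttempts is
    // reached. Returns the finished result, or throws TimeoutException
    // so that the caller never loops forever.
    public static T WaitUntil<T>(Func<T> fetch, Func<T, bool> isDone,
                                 int maxAttempts, TimeSpan delay)
    {
        if (maxAttempts < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(maxAttempts));
        }

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            T result = fetch();
            if (isDone(result))
            {
                return result;
            }

            // Wait between attempts, but not after the last one.
            if (attempt < maxAttempts)
            {
                Thread.Sleep(delay);
            }
        }

        throw new TimeoutException("Not finished after " + maxAttempts + " attempts.");
    }
}
```

With the calls used in the full listing, the wait could then look like `Poller.WaitUntil(() => client.TableRecognitionGetResult(requestId, options), r => r["result"]["ret_msg"].ToString() == "已完成", 30, TimeSpan.FromSeconds(2))`, where `requestId` and the limit of 30 attempts are illustrative.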

The complete code is as follows:

        /// <summary>
        /// Extracts the table data from the oncomine.org images and exports it to Excel.
        /// </summary>
        /// <returns></returns>
        [HttpGet("pictoexcel")]
        public IActionResult PicToExcel()
        {
            var client = new Baidu.Aip.Ocr.Ocr("QhgOZpPMfNYlUVd3kUR。。。。", "5Gw6nrqrkYkMoCqc59hlB8s。。。。。。");
            client.Timeout = 40000;  // change the timeout

            string source = _hostingEnvironment.WebRootPath + "\\source\\";
            string[] allGene = FileHelper.GetFileNames(source + "geneSPINT2\\");
            List<VM_Oncomine> finalR = new List<VM_Oncomine>();
            Bitmap background = (Bitmap)Bitmap.FromFile(source + "bg.jpg"); // white background
            foreach (var pic in allGene)
            {
                // crop the left part of the image onto the background and save it to the temp directory
                System.Drawing.Image iSource = System.Drawing.Image.FromFile(pic);
                Bitmap newbm = new Bitmap(133, 460);
                Graphics gh = Graphics.FromImage(newbm);
                gh.DrawImage(background, 0, 0, 133, 460);
                gh.DrawImage(iSource, new Rectangle(0, 0, 133, 460), new Rectangle(20, 50, 133, 460), GraphicsUnit.Pixel);
                newbm.Save(pic.Replace("geneSPINT2", "temp_geneSPINT2"), ImageFormat.Bmp);
                // text recognition

                byte[] imageBytes = FileHelper.ReadFileToByte(pic.Replace("geneSPINT2", "temp_geneSPINT2"));
                var result = client.TableRecognitionRequest(imageBytes).ToString();
                JObject jo = (JObject)JsonConvert.DeserializeObject(result);
                var options = new Dictionary<string, object>{
                    {"result_type", "json"}
                };
            retry:
                // poll every 2 seconds until the recognition job is finished
                var tableR = client.TableRecognitionGetResult(jo["result"][0]["request_id"].ToString(), options);
                if (tableR["result"]["ret_msg"].ToString() != "已完成") // "已完成" means "completed"
                {
                    Thread.Sleep(2000);
                    goto retry;
                }
                else
                {
                    var formResult = (JObject)JsonConvert.DeserializeObject(tableR["result"]["result_data"].ToString());
                    var bodyRows = formResult["forms"][0]["body"];
                    foreach (var r in bodyRows)
                    {
                        string word = r["word"].ToString();
                        // a cell containing "0." is treated as a correlation value
                        if (word.IndexOf("0.") > -1)
                        {
                            int coRow = (int)r["row"][0];

                            // find the gene cell(s) on the same row
                            foreach (var f in bodyRows)
                            {
                                if (f["word"].ToString().IndexOf("0.") == -1)
                                {
                                    int genRow = (int)f["row"][0];
                                    if (genRow == coRow)
                                    {
                                        if (f["word"].ToString() == "")
                                        {
                                            f["word"] = "-";
                                        }
                                        finalR.Add(new VM_Oncomine { Correlation = word, Gene = f["word"].ToString() });
                                    }
                                }
                            }
                        }
                    }

                }
            }
            // export to Excel

            XlsGenerator.createXlsFile(finalR, "SPINT2", source, out string filePath);
            var dt = DateTime.Now.ToString("yyyyMMdd");
            string fileName = WebUtility.UrlEncode("Gene_") + dt + ".xls";
            filePath = source + filePath;
            Response.Headers.Add("content-disposition", "attachment;filename=" + fileName);
            return File(new FileStream(filePath, FileMode.Open), "application/excel", fileName);
        }
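
The nested loops in the listing rescan every cell for each correlation value. The same pairing can also be done by grouping the cells by row index once. The following is a hedged sketch rather than a drop-in replacement: `RowPairer` and `PairByRow` are hypothetical names, and plain tuples stand in for the JSON body entries and for `VM_Oncomine`. It keeps the listing's heuristic (a cell containing "0." is a correlation value) and its "-" placeholder for empty gene cells, but the output order follows the order in which rows first appear, which may differ from the listing.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RowPairer
{
    // Same heuristic as the listing: text containing "0." is a correlation
    // value, everything else is treated as a gene name.
    private static bool IsCorrelation((int Row, string Word) cell)
    {
        return cell.Word.Contains("0.");
    }

    public static List<(string Correlation, string Gene)> PairByRow(
        IEnumerable<(int Row, string Word)> cells)
    {
        var result = new List<(string Correlation, string Gene)>();

        // Group once by row index instead of rescanning all cells per value.
        foreach (var row in cells.GroupBy(c => c.Row))
        {
            var genes = row.Where(c => !IsCorrelation(c)).ToList();
            foreach (var corr in row.Where(IsCorrelation))
            {
                foreach (var gene in genes)
                {
                    // An empty gene cell becomes "-", as in the listing.
                    string name = gene.Word == "" ? "-" : gene.Word;
                    result.Add((corr.Word, name));
                }
            }
        }

        return result;
    }
}
```

The input tuples would be built from each entry's `row[0]` and `word` fields of `bodyRows`, and each returned pair would map to one `VM_Oncomine` item.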

If you need web data collected with a crawler, you can contact me by email at cdipbsxf@qq.com
