Using Jupyter to handle massive data is much better than Excel!

Recently, an insightful article attracted a lot of attention. Many professionals prefer Excel for large data analysis tasks, but its author, Semi Koen, argues that Jupyter notebooks are more effective for many "big data" problems.

If you are proficient in a Jupyter-compatible language, you will know what she is talking about, and you would not really plan to use Excel to process large data sets.

Many of the obstacles to sharing notebooks come down to one simple but devastating problem: it is difficult to hide code cells. The management team certainly does not want to see the Python code that generated the tables and charts in the files it receives.

A Jupyter extension can hide all inputs and toggle code cells between hidden and visible inside Jupyter, but this does not carry over to most export or static sharing schemes.

A Stack Overflow post contains a suggested code fragment that can be pasted at the top of the notebook. It works in nbviewer and in exported HTML, but in few other export options.

First, look at the best ways to get things out of what looks like a hosted Jupyter notebook and back into simple files that can be attached to an email. The HTML option is the best way to represent the original notebook accurately in a single standalone file. It can also keep code cells hidden through the Stack Overflow code fragment mentioned above, but most widgets will not carry over. In any case, HTML is not a format managers are used to receiving by email, and it may even be blocked by security software.

PDF sounds more like it. But it is hard to produce directly, and it does not hide your code cells. In fact, the best way to generate a PDF is to use the browser's own print function and have it output the notebook, as currently displayed in the browser, to a PDF. This includes everything you see in the browser at the time.
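
For the HTML route described above, the export can also be scripted instead of clicked. The following is a minimal sketch, assuming a recent version of Jupyter's command-line converter nbconvert, whose `--no-input` flag leaves code cell inputs out of the export. The notebook name `report.ipynb` and the helper `nbconvert_html_command` are hypothetical and only serve as illustration. The sketch builds the command without running it, so it can be inspected first.

```python
import shlex


def nbconvert_html_command(notebook, hide_input=True):
    """Build (but do not run) an nbconvert command for a static HTML export."""
    command = ["jupyter", "nbconvert", "--to", "html"]
    if hide_input:
        # --no-input asks nbconvert to omit code cell inputs from the export.
        command.append("--no-input")
    command.append(notebook)
    return command


# Hypothetical notebook name, used only for illustration.
command = nbconvert_html_command("report.ipynb")
print(shlex.join(command))

# To actually run the export (requires Jupyter and nbconvert to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Whether widgets and other rich outputs survive this export still depends on the notebook, so it is worth opening the resulting HTML file before sending it.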

Most of the other available formats are highly technical and not really suited to what we are trying to achieve here. So I started the nb2xls project, an easy-to-install additional export path that gives users an Excel spreadsheet file based on the output cells of the notebook.

The advantage is that tables are still tables in a spreadsheet: you can drag the mouse over the numbers to see the total and average! Graphical output, such as charts, is included too. There is no need to worry about page size, because a long or wide notebook simply fills as many rows and columns as it needs.

All of these "Save As" options are really just wrappers around the command-line utility nbconvert, invoking the conversion process with default options. If you do need to specify conversion options, you can use the command-line equivalent.

For example, the documentation for the hide-all-inputs extension recommends a specific nbconvert command for obtaining an exported HTML file with the input code cells removed.

The Save As options provide some useful infrastructure for exporting the skeleton of the notebook to a separate file. However, it is worth remembering that you are in a familiar coding environment, so it can make sense to generate some files directly from code cells. For example, if you have a large pandas DataFrame, it is best to save it as a CSV so that recipients can load it in full, with something like df.to_csv. In practice, you would first import it into Excel yourself so that you can format the columns, underline the headings, and so on.

Although the export options above let you get only the core of the notebook into a separate file format, sometimes you need a way to keep the existing Jupyter format.

The Jupyter project organization has some auxiliary projects. If you need to share notebooks frequently, you can consider investing in some infrastructure.
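
Before turning to that infrastructure, here is a minimal sketch of the CSV export mentioned above, assuming pandas is installed. The DataFrame contents, the file name `sales.csv`, and the helper `export_for_spreadsheet` are hypothetical. The choice of `utf-8-sig` is an assumption aimed at recipients who open the file in Excel, since the byte-order mark helps Excel recognize UTF-8 text.

```python
import tempfile
from pathlib import Path

import pandas as pd


def export_for_spreadsheet(df, path):
    """Write a DataFrame to a CSV file intended to be opened in a spreadsheet."""
    # index=False drops the pandas row index, which rarely means anything
    # to the recipient. utf-8-sig writes a byte-order mark at the start.
    df.to_csv(path, index=False, encoding="utf-8-sig")
    return Path(path)


# Hypothetical stand-in for a much larger result set.
sales = pd.DataFrame({"region": ["North", "South"], "revenue": [1200.5, 950.0]})

csv_path = export_for_spreadsheet(sales, Path(tempfile.mkdtemp()) / "sales.csv")
print(csv_path.read_text(encoding="utf-8-sig"))
```

Formatting such as column widths or underlined headings is not part of a CSV, which is why the file would still be imported into Excel for a final pass.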

JupyterHub is a way to create Jupyter workspaces centrally on shared resources, so that at least other users do not need to run their own Jupyter servers. Unless you have a small organization that can run JupyterHub on its intranet, you need to work out how to add authentication for the right users.

BinderHub extends JupyterHub, allowing users to launch a Jupyter workspace based on a specific computing environment defined in a Git repository, along with project-related data files and notebooks. Users can start the workspace simply by visiting a URL. This provides a more formal and accessible showcase for your work. To see an example on a public BinderHub instance, check the "launch binder" link on the README page of my nb2xls repo on GitHub.

In practice, neither of these projects suits the task at hand: management teams do not want to press Shift+Enter their way through a notebook. Moreover, storing work in a Git repo or workspace adds administrative overhead.

nbviewer is a more suitable lightweight service that can easily host notebooks through a URL. Think of it as a hosted web page version of the HTML export option discussed above: JavaScript works, but there is no active kernel behind it, so users only see the output from the last time the notebook was run.

Like Binder, nbviewer has a free hosted version that you can try. You can give it the URL of a notebook on GitHub, but a Dropbox link works too. For example, choose "Copy Dropbox link" on an ipynb file in Dropbox and paste the URL into nbviewer's input box. You can share the URL of the resulting viewer page with colleagues, but of course it is not secure.

Sharing a URL is more natural than sending an HTML file to the management team, but in reality nbviewer does not give you much that an HTML export cannot.
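
The viewer link for a notebook on GitHub can also be assembled by hand. The following is a minimal sketch, assuming the URL pattern used by the public nbviewer service at nbviewer.org (`/github/<owner>/<repo>/blob/<ref>/<path>`). The owner, repository, branch, and notebook path are hypothetical, and the pattern should be checked against the service before relying on it.

```python
from urllib.parse import quote

# Assumed host of the public nbviewer service.
NBVIEWER_BASE = "https://nbviewer.org"


def nbviewer_github_url(owner, repo, ref, notebook_path):
    """Return a viewer URL for a notebook stored in a public GitHub repository."""
    # Quote each path segment so spaces in notebook names stay valid in a URL.
    path = "/".join(quote(part) for part in notebook_path.split("/"))
    return f"{NBVIEWER_BASE}/github/{owner}/{repo}/blob/{ref}/{path}"


# Hypothetical repository and notebook, used only for illustration.
url = nbviewer_github_url("example-user", "example-repo", "main", "reports/Q1 summary.ipynb")
print(url)
```

Anyone who has the resulting link can open the page, which matches the caveat above that this kind of sharing is not secure.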

So far, then, none of the main Jupyter projects seems to have helped much. A recent development from the Jupyter project may be exactly what you are looking for: Voilà lets you host a notebook with an active kernel without requiring any Shift+Enter input. By default, code cells are hidden and execution requests from the front end are disabled, so users cannot break anything even if they try!

For the case under discussion, this could be a very good way to share notebooks, but it still needs a lot of work. At the time of writing, you can only share a single-user link to the notebook application. Multiple users may conflict with one another if their independent actions confuse the flow of data in the notebook.

There are plans to integrate Voilà with JupyterHub, which would allow multiple users to access your Voilà-hosted notebook. Of course, the Voilà server still has to be running whenever a colleague chooses to view your notebook, so this is not something you would normally keep running on your local machine.

Kyso is a third-party service for "blogging data science". Some public notebooks are listed on its home page to show how sharing works. On a paid plan, you can restrict collaboration to your team. By default, code input cells are hidden!

Another service, Saturn Cloud, is a complete cloud hosting environment. As an alternative to Google Cloud or Amazon cloud services, it has a built-in "publish" function. While it is easy for your colleagues to launch your notebook, it seems impossible to publish completely privately.

It turns out that sharing the results of Jupyter notebook experiments with non-developers is much harder than we would like! The features we want are hidden code, easily accessible results, interactive display on demand, and data that stays safe.

Although some export functions already exist, they bring challenges or limitations when sharing complex data. Looking at the more substantial hosting options, you either need to run your own infrastructure or move your data science workflow to a new third-party cloud hosting service.

Therefore, the best solution depends largely on the data being shared and on how often similar reports need to be delivered. It is worth noting any patterns in the presentations made across your organization. For one-off sharing of a big data analysis, you may still need to compile a spreadsheet from carefully exported CSVs; but if this happens frequently, it may be time to invest in some of the available infrastructure to solve the problem.

Author: zmhuaxia