Converting a text document with special format to pandas data frame












9















I am new to pandas:
I have a text file with the following format:



1: frack 0.733, shale 0.700, 
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345


I need to convert this text to a pandas DataFrame with the following format:



Id   Term    weight
1 frack 0.733
1 shale 0.700
10 space 0.645
10 station 0.327
10 nasa 0.258
4 celebr 0.262
4 bahar 0.345


How can I do it?

































  • I can only think of regex helping here.

    – amanb
    1 hour ago








  • Depending on how large/long your file is, you can loop through the file without pandas to format it properly first.

    – Quang Hoang
    1 hour ago











  • It can be done with explode and split

    – Wen-Ben
    1 hour ago











  • Also, when you read the text into pandas, what is the format of the df?

    – Wen-Ben
    1 hour ago













  • The data is in text format.

    – Mary
    1 hour ago
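
The explode-and-split idea from the comments above could be sketched roughly as follows. This is a minimal sketch, not code from the thread: it assumes pandas 0.25 or later (where DataFrame.explode is available) and that the file contents have already been read into a string. The names raw, lines, parts and pair_cols are illustrative.

```python
import pandas as pd

# Sample data from the question, assumed to be already read into a string.
raw = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

# Split every line once on ":" to separate the Id from the term/weight pairs.
lines = pd.Series(raw.splitlines())
parts = lines.str.split(":", n=1, expand=True)

df = pd.DataFrame({
    "Id": parts[0].astype(int),
    # Drop surrounding spaces/commas, then turn each line into a list of pairs.
    "pair": parts[1].str.strip(" ,").str.split(","),
})

# explode() gives every list element its own row and repeats the Id.
df = df.explode("pair").reset_index(drop=True)

# Split each "term weight" pair on whitespace into two columns.
pair_cols = df["pair"].str.split(expand=True)
df = pd.DataFrame({
    "Id": df["Id"],
    "Term": pair_cols[0],
    "weight": pair_cols[1].astype(float),
})
print(df)
```

Splitting on whitespace in the last step (no explicit separator) sidesteps the leading space that remains in front of every pair after the first.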

























python pandas
















asked 1 hour ago









Mary

























8 Answers


















6














Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



import re
import pandas as pd

# ":" followed by whitespace separates the Id from the rest of the line
SEP_RE = re.compile(r":\s+")
# each pair is a run of letters, whitespace, then a decimal number
DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


def parse(filepath: str):
    def _parse(filepath):
        with open(filepath) as f:
            for line in f:
                id, rest = SEP_RE.split(line, maxsplit=1)
                for match in DATA_RE.finditer(rest):
                    yield [int(id), match["term"], float(match["weight"])]
    return list(_parse(filepath))


Example:



>>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
...                   columns=["Id", "Term", "weight"])
>>>
>>> df
   Id     Term  weight
0   1    frack   0.733
1   1    shale   0.700
2  10    space   0.645
3  10  station   0.327
4  10     nasa   0.258
5   4   celebr   0.262
6   4    bahar   0.345

>>> df.dtypes
Id          int64
Term       object
weight    float64
dtype: object




Walkthrough



SEP_RE looks for an initial separator: a literal : followed by one or more whitespace characters. The call to .split() passes maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that your entire dataset consistently follows the example format laid out in your question.
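
If the file turns out to be less regular than the sample, note that unpacking the result of SEP_RE.split() into two names raises a ValueError for any line without the separator. A more defensive variant might look like the sketch below; the helper name split_id is hypothetical and not part of the answer.

```python
import re

SEP_RE = re.compile(r":\s+")


def split_id(line):
    """Return (id, rest) for a well-formed line, or None if no separator is found."""
    parts = SEP_RE.split(line, maxsplit=1)
    if len(parts) != 2:
        return None  # e.g. a blank line or a line without ": "
    return parts[0], parts[1]


print(split_id("1: frack 0.733, shale 0.700,\n"))
print(split_id("no separator here\n"))
```

A caller could then skip (or log) lines for which the helper returns None instead of letting the whole parse fail.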



After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



An easy way to visualize this is to use an example line from your file as a string:



>>> line = "1: frack 0.733, shale 0.700,\n"
>>> SEP_RE.split(line, maxsplit=1)
['1', 'frack 0.733, shale 0.700,\n']


Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



>>> id, rest = SEP_RE.split(line, maxsplit=1)
>>> it = DATA_RE.finditer(rest)
>>> match = next(it)
>>> match
<re.Match object; span=(0, 11), match='frack 0.733'>
>>> match["term"]
'frack'
>>> match["weight"]
'0.733'


The better way to visualize it is with pdb. Give it a try if you dare ;)



Disclaimer



This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



For instance, it assumes that each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re character classes such as \w.
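
As one possible way to relax the ASCII-only restriction, the term pattern could be built on \w. The sketch below is an assumption-laden variant, not the answer's code: the names UNICODE_DATA_RE and parse_pairs are hypothetical, and the character class [^\W\d_] is used so that digits and underscores (which \w alone would accept) stay excluded from terms.

```python
import re

# Hypothetical Unicode-aware variant of DATA_RE: [^\W\d_]+ matches word
# characters that are neither digits nor underscores (roughly, letters).
UNICODE_DATA_RE = re.compile(r"(?P<term>[^\W\d_]+)\s+(?P<weight>\d+\.\d+)")


def parse_pairs(rest):
    """Extract (term, weight) tuples from the part of a line after the Id."""
    return [(m["term"], float(m["weight"])) for m in UNICODE_DATA_RE.finditer(rest)]


print(parse_pairs("straße 0.5, данные 0.25, nasa 0.258,"))
```

In Python 3, str patterns are Unicode-aware by default, so no extra flag is needed for \w to match non-ASCII letters.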



























  • Brilliant answer, I must say.

    – amanb
    50 mins ago











  • @amanb Thank you!

    – Brad Solomon
    46 mins ago



















3














You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:



import pandas as pd
from itertools import chain

text="""1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

df = pd.DataFrame(
    list(
        chain.from_iterable(
            map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
            map(lambda x: x.strip(" ,").split(":"), text.splitlines())
        )
    ),
    columns=["Id", "Term", "weight"]
)

print(df)
#   Id     Term weight
#0   1    frack  0.733
#1   1    shale  0.700
#2  10    space  0.645
#3  10  station  0.327
#4  10     nasa  0.258
#5   4   celebr  0.262
#6   4    bahar  0.345


Explanation



I assume that you've read your file into the string text. The first thing you want to do is strip leading/trailing commas and whitespace before splitting on :



print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
#[['1', ' frack 0.733, shale 0.700'],
# ['10', ' space 0.645, station 0.327, nasa 0.258'],
# ['4', ' celebr 0.262, bahar 0.345']]


The next step is to split on the comma to separate the values, and assign the Id to each set of values:



print(
    [
        list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
    ]
)
#[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
# [('10', 'space', '0.645'),
#  ('10', 'station', '0.327'),
#  ('10', 'nasa', '0.258')],
# [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]


Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.



Note: The * unpacking inside a tuple display requires Python 3.5 or later.
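
The same flattening can also be written as a single nested comprehension, which some readers may find easier to follow than the map/lambda/chain pipeline. This is only a sketch under the same assumption as above (the file is already in the string text); the names id_, rest, pair and rows are illustrative.

```python
text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

rows = [
    (id_, *pair.split())
    # Outer loop: strip stray commas/spaces, then split each line into Id and the rest.
    for id_, rest in (line.strip(" ,").split(":", 1) for line in text.splitlines())
    # Inner loop: one (Id, Term, weight) tuple per comma-separated pair.
    for pair in rest.split(",")
]

print(rows[0])
print(len(rows))
```

The resulting list can be passed to the DataFrame constructor with columns=["Id", "Term", "weight"] in the same way as before; all three columns are still strings at that point.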







































    3














    Assuming your file looks like the sample given in the question:



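    import pandas as pd

    # a separator longer than one character is handled by the Python parsing engine
    df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
    df.set_index(0, inplace=True)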

    # split the `,`
    df = df[1].str.strip().str.split(',', expand=True)

    #    0             1              2           3
    #--  ------------  -------------  ----------  ---
    # 1  frack 0.733   shale 0.700
    #10  space 0.645   station 0.327  nasa 0.258
    # 4  celebr 0.262  bahar 0.345

    # stack and drop empty
    df = df.stack()
    df = df[~df.eq('')]

    # split ' '
    df = df.str.strip().str.split(' ', expand=True)

    # edit to give final expected output:

    # rename index and columns for reset_index
    df.index.names = ['Id', 'to_drop']
    df.columns = ['Term', 'weight']

    # final df
    final_df = df.reset_index().drop('to_drop', axis=1)































    • How do you not get an error from sep=': ', which is a two-character separator?

      – Rebin
      36 mins ago






    • @Rebin add engine='python'

      – pault
      33 mins ago











    • @pault weird, 'cause I already split by ' '. It yields correct data on my computer.

      – Quang Hoang
      30 mins ago











    • I don't know how to add engine='python'. What is the command?

      – Rebin
      29 mins ago






    • @Rebin add it as a parameter to pd.read_csv: df = pd.read_csv(..., engine='python')

      – pault
      27 mins ago
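
Putting the comments together, passing engine='python' alongside the two-character separator might look like the sketch below. It assumes the sample data from the question and reads it from an in-memory buffer rather than from untitled.txt so that it is self-contained; the variable raw is illustrative.

```python
import io

import pandas as pd

raw = (
    "1: frack 0.733, shale 0.700,\n"
    "10: space 0.645, station 0.327, nasa 0.258,\n"
    "4: celebr 0.262, bahar 0.345\n"
)

# A separator longer than one character is treated as a regular expression,
# which only the Python parsing engine supports.
df = pd.read_csv(io.StringIO(raw), sep=": ", header=None, engine="python")
print(df)
```

Without the engine argument pandas falls back to the Python engine on its own but emits a ParserWarning, so naming the engine explicitly mainly keeps the output clean.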



















    0














    Here is another take on your question: build a list containing one [Id, Term, weight] list per term, and then produce the DataFrame from it.



    import pandas as pd
    file=r"give_your_path".replace('\\', '/')
    my_list_of_lists=[]#creating an empty list which will contain lists of [Id Term Weight]
    with open(file,"r") as f:
        for line in f.readlines():#looping every line
            my_id=[line.split(":")[0]]#storing the Id in order to use it in every term
            #strip spaces, the trailing comma and the newline so that the last pair is kept even without a trailing comma
            for term in [s.split() for s in line.split(":",1)[1].strip(" ,\n").split(",")]:
                my_list_of_lists.append(my_id+term)
    df=pd.DataFrame.from_records(my_list_of_lists)#turning the lists to dataframe
    df.columns=["Id","Term","weight"]#giving columns their names




































      0














      It is possible to do this entirely with pandas:



      from io import StringIO
      import pandas as pd

      df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
      10: space 0.645, station 0.327, nasa 0.258,
      4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

      #df:
      #    0  1
      #0   1   frack 0.733, shale 0.700,
      #1  10   space 0.645, station 0.327, nasa 0.258,
      #2   4   celebr 0.262, bahar 0.345


      Turn the column 1 into a list and then expand:



      df[1] = df[1].str.split(",", expand=False)

      dfs = []
      for idx, rows in df.iterrows():
          print(rows)
          dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
          dfs.append(dfslice)
      newdf = pd.concat(dfs, ignore_index=True)

      # this creates newdf:
      #   Id           terms
      #0   1     frack 0.733
      #1   1     shale 0.700
      #2   1
      #3  10     space 0.645
      #4  10   station 0.327
      #5  10      nasa 0.258
      #6  10
      #7   4    celebr 0.262
      #8   4     bahar 0.345


      Now we need to split the terms column on the space and drop the empty rows:



      newdf["terms"] = newdf["terms"].str.strip()
      newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
      newdf.columns = ["Id", "terms", "Term", "Weights"]
      newdf = newdf.drop("terms", axis=1).dropna()


      Resulting newdf:



         Id     Term Weights
      0   1    frack   0.733
      1   1    shale   0.700
      3  10    space   0.645
      4  10  station   0.327
      5  10     nasa   0.258
      7   4   celebr   0.262
      8   4    bahar   0.345




































        0














        Can I assume that there is exactly one space before each term?



        import pandas as pd

        df=pd.DataFrame(columns=['ID','Term','Weight'])
        with open('C:/random/d1','r') as readObject:
            for line in readObject:
                line=line.rstrip('\n')
                tempList1=line.split(':')
                tempList2=tempList1[1]
                tempList2=tempList2.strip(' ,')  # drop surrounding spaces and the trailing comma
                tempList2=tempList2.split(',')
                for item in tempList2:
                    e=item.split()  # split on whitespace, ignoring the leading space
                    tempRow=[tempList1[0], e[0],e[1]]
                    df.loc[len(df)]=tempRow
        print(df)




































          0














          Just to put my two cents in: you could write yourself a parser and feed the result into pandas:



          import pandas as pd
          from parsimonious.grammar import Grammar
          from parsimonious.nodes import NodeVisitor

          file = """1: frack 0.733, shale 0.700,
          10: space 0.645, station 0.327, nasa 0.258,
          4: celebr 0.262, bahar 0.345
          """

          grammar = Grammar(
              r"""
              expr = line+

              line = id colon pair*
              pair = term ws weight sep? ws?

              id = ~"\d+"
              colon = ws? ":" ws?
              sep = ws? "," ws?

              term = ~"[a-zA-Z]+"
              weight = ~"\d+(?:\.\d+)?"

              ws = ~"\s+"
              """
          )

          tree = grammar.parse(file)

          class PandasVisitor(NodeVisitor):
              def generic_visit(self, node, visited_children):
                  return visited_children or node

              def visit_pair(self, node, visited_children):
                  term, _, weight, *_ = visited_children
                  return (term.text, weight.text)

              def visit_line(self, node, visited_children):
                  id, _, pairs = visited_children
                  return [(id.text, *pair) for pair in pairs]

              def visit_expr(self, node, visited_children):
                  return [item for lst in visited_children for item in lst]

          pv = PandasVisitor()
          result = pv.visit(tree)

          df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
          print(df)


          This yields



             Id     Term weight
          0   1    frack  0.733
          1   1    shale  0.700
          2  10    space  0.645
          3  10  station  0.327
          4  10     nasa  0.258
          5   4   celebr  0.262
          6   4    bahar  0.345



































            -3














            1) You can read row by row.



            2) Then you can separate by ':' for your index and ',' for the values



            1)



            with open('path/filename.txt','r') as filename:
                content = filename.readlines()


            2)
            # strip the newline first, then split the Id from the values
            content = [x.strip().split(': ') for x in content]



            This will give you the following result:



            content =[
                ['1','frack 0.733, shale 0.700,'],
                ['10', 'space 0.645, station 0.327, nasa 0.258,'],
                ['4','celebr 0.262, bahar 0.345']]
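
To get from this intermediate list to the one-row-per-term layout asked for in the question, one further pass over the value strings is needed. The sketch below is one possible way to do it, assuming content has the shape shown above; the names rows, id_, values and pair are illustrative.

```python
content = [
    ['1', 'frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4', 'celebr 0.262, bahar 0.345'],
]

rows = []
for id_, values in content:
    for pair in values.split(','):
        if not pair.strip():
            continue  # skip the empty piece left by a trailing comma
        term, weight = pair.split()
        rows.append((int(id_), term, float(weight)))

print(rows)
```

The rows list can then be handed to the pandas DataFrame constructor with columns=["Id", "Term", "weight"].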
























            • Your result is not the result asked for in the question.

              – GiraffeMan91
              1 hour ago












            Your Answer






            StackExchange.ifUsing("editor", function () {
            StackExchange.using("externalEditor", function () {
            StackExchange.using("snippets", function () {
            StackExchange.snippets.init();
            });
            });
            }, "code-snippets");

            StackExchange.ready(function() {
            var channelOptions = {
            tags: "".split(" "),
            id: "1"
            };
            initTagRenderer("".split(" "), "".split(" "), channelOptions);

            StackExchange.using("externalEditor", function() {
            // Have to fire editor after snippets, if snippets enabled
            if (StackExchange.settings.snippets.snippetsEnabled) {
            StackExchange.using("snippets", function() {
            createEditor();
            });
            }
            else {
            createEditor();
            }
            });

            function createEditor() {
            StackExchange.prepareEditor({
            heartbeatType: 'answer',
            autoActivateHeartbeat: false,
            convertImagesToLinks: true,
            noModals: true,
            showLowRepImageUploadWarning: true,
            reputationToPostImages: 10,
            bindNavPrevention: true,
            postfix: "",
            imageUploader: {
            brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
            contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
            allowUrls: true
            },
            onDemand: true,
            discardSelector: ".discard-answer"
            ,immediatelyShowMarkdownHelp:true
            });


            }
            });














            draft saved

            draft discarded


















            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f55799784%2fconverting-a-text-document-with-special-format-to-pandas-data-frame%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown

























            8 Answers
            8






            active

            oldest

            votes








            8 Answers
            8






            active

            oldest

            votes









            active

            oldest

            votes






            active

            oldest

            votes









            6














            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            SEP_RE = re.compile(r":s+")
            DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


            def parse(filepath: str):
            def _parse(filepath):
            with open(filepath) as f:
            for line in f:
            id, rest = SEP_RE.split(line, maxsplit=1)
            for match in DATA_RE.finditer(rest):
            yield [int(id), match["term"], float(match["weight"])]
            return list(_parse(filepath))


            Example:



            >>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
            ... columns=["Id", "Term", "weight"])
            >>>
            >>> df
            Id Term weight
            0 1 frack 0.733
            1 1 shale 0.700
            2 10 space 0.645
            3 10 station 0.327
            4 10 nasa 0.258
            5 4 celebr 0.262
            6 4 bahar 0.345

            >>> df.dtypes
            Id int64
            Term object
            weight float64
            dtype: object




            Walkthrough



            SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.



            After that, DATA_RE.finditer() deals with each (term, weight) pair extraxted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



            An easy way to visualize this is to use an example line from your file as a string:



            >>> line = "1: frack 0.733, shale 0.700,n"
            >>> SEP_RE.split(line, maxsplit=1)
            ['1', 'frack 0.733, shale 0.700,n']


            Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



            >>> id, rest = SEP_RE.split(line, maxsplit=1)
            >>> it = DATA_RE.finditer(rest)
            >>> match = next(it)
            >>> match
            <re.Match object; span=(0, 11), match='frack 0.733'>
            >>> match["term"]
            'frack'
            >>> match["weight"]
            '0.733'


            The better way to visualize it is with pdb. Give it a try if you dare ;)



            Disclaimer



            This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



            For instance, it assumes that each each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re characters such as w.






            share|improve this answer





















            • 2





              Brilliant answer, I must say.

              – amanb
              50 mins ago











            • @amanb Thank you!

              – Brad Solomon
              46 mins ago
















            6














            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            SEP_RE = re.compile(r":s+")
            DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


            def parse(filepath: str):
            def _parse(filepath):
            with open(filepath) as f:
            for line in f:
            id, rest = SEP_RE.split(line, maxsplit=1)
            for match in DATA_RE.finditer(rest):
            yield [int(id), match["term"], float(match["weight"])]
            return list(_parse(filepath))


            Example:



            >>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
            ... columns=["Id", "Term", "weight"])
            >>>
            >>> df
            Id Term weight
            0 1 frack 0.733
            1 1 shale 0.700
            2 10 space 0.645
            3 10 station 0.327
            4 10 nasa 0.258
            5 4 celebr 0.262
            6 4 bahar 0.345

            >>> df.dtypes
            Id int64
            Term object
            weight float64
            dtype: object




            Walkthrough



            SEP_RE looks for an initial separator: a literal : followed by one or more spaces. It uses maxsplit=1 to stop once the first split is found. Granted, this assumes your data is strictly formatted: that the format of your entire dataset consistently follows the example format laid out in your question.



            After that, DATA_RE.finditer() deals with each (term, weight) pair extraxted from rest. The string rest itself will look like frack 0.733, shale 0.700,. .finditer() gives you multiple match objects, where you can use ["key"] notation to access the element from a given named capture group, such as (?P<term>[a-z]+).



            An easy way to visualize this is to use an example line from your file as a string:



            >>> line = "1: frack 0.733, shale 0.700,n"
            >>> SEP_RE.split(line, maxsplit=1)
            ['1', 'frack 0.733, shale 0.700,n']


            Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



            >>> id, rest = SEP_RE.split(line, maxsplit=1)
            >>> it = DATA_RE.finditer(rest)
            >>> match = next(it)
            >>> match
            <re.Match object; span=(0, 11), match='frack 0.733'>
            >>> match["term"]
            'frack'
            >>> match["weight"]
            '0.733'


            The better way to visualize it is with pdb. Give it a try if you dare ;)



            Disclaimer



            This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



            For instance, it assumes that each each Term can only take upper or lowercase ASCII letters, nothing else. If you have other Unicode characters as identifiers, you would want to look into other re characters such as w.






            share|improve this answer





















            • 2





              Brilliant answer, I must say.

              – amanb
              50 mins ago











            • @amanb Thank you!

              – Brad Solomon
              46 mins ago














            6












            6








            6







            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            SEP_RE = re.compile(r":s+")
            DATA_RE = re.compile(r"(?P<term>[a-z]+)s+(?P<weight>d+.d+)", re.I)


            def parse(filepath: str):
            def _parse(filepath):
            with open(filepath) as f:
            for line in f:
            Here's an optimized way to parse the file with re, first taking the ID and then parsing the data tuples. This takes advantage of the fact that file objects are iterable. When you iterate over an open file, you get the individual lines as strings, from which you can extract the meaningful data elements.



            import re
            import pandas as pd

            # a literal colon followed by one or more whitespace characters
            SEP_RE = re.compile(r":\s+")
            # one "term weight" pair, e.g. "frack 0.733"
            DATA_RE = re.compile(r"(?P<term>[a-z]+)\s+(?P<weight>\d+\.\d+)", re.I)


            def parse(filepath: str):
                def _parse(filepath):
                    with open(filepath) as f:
                        for line in f:
                            # split off the ID, then yield one row per (term, weight) pair
                            id, rest = SEP_RE.split(line, maxsplit=1)
                            for match in DATA_RE.finditer(rest):
                                yield [int(id), match["term"], float(match["weight"])]
                return list(_parse(filepath))


            Example:



            >>> df = pd.DataFrame(parse("/Users/bradsolomon/Downloads/doc.txt"),
            ...                   columns=["Id", "Term", "weight"])
            >>>
            >>> df
               Id     Term  weight
            0   1    frack   0.733
            1   1    shale   0.700
            2  10    space   0.645
            3  10  station   0.327
            4  10     nasa   0.258
            5   4   celebr   0.262
            6   4    bahar   0.345

            >>> df.dtypes
            Id          int64
            Term       object
            weight    float64
            dtype: object




            Walkthrough



            SEP_RE looks for an initial separator: a literal : followed by one or more whitespace characters. Calling .split() with maxsplit=1 stops after the first split. Granted, this assumes your data is strictly formatted: that your entire dataset consistently follows the example format laid out in your question.



            After that, DATA_RE.finditer() deals with each (term, weight) pair extracted from rest. The string rest itself will look like "frack 0.733, shale 0.700,". .finditer() yields one match object per pair, and you can use ["key"] notation to access the text captured by a given named capture group, such as (?P<term>[a-z]+).



            An easy way to visualize this is to use an example line from your file as a string:



            >>> line = "1: frack 0.733, shale 0.700,\n"
            >>> SEP_RE.split(line, maxsplit=1)
            ['1', 'frack 0.733, shale 0.700,\n']


            Now you have the initial ID and rest of the components, which you can unpack into two identifiers.



            >>> id, rest = SEP_RE.split(line, maxsplit=1)
            >>> it = DATA_RE.finditer(rest)
            >>> match = next(it)
            >>> match
            <re.Match object; span=(0, 11), match='frack 0.733'>
            >>> match["term"]
            'frack'
            >>> match["weight"]
            '0.733'


            The better way to visualize it is with pdb. Give it a try if you dare ;)



            Disclaimer



            This is one of those questions that demands a particular type of solution that may not generalize well if you ease up restrictions on your data format.



            For instance, it assumes that each Term consists only of uppercase or lowercase ASCII letters, nothing else. If your terms contain other Unicode characters, you would want to look into other re character classes such as \w.
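As a rough sketch of that last point, the pattern below swaps [a-z] for [^\W\d_], which in Python 3 matches word characters other than digits and the underscore (roughly: Unicode letters). The names UNICODE_DATA_RE and find_pairs are hypothetical and not part of the answer above, and the sample string is made up for illustration.

```python
import re

# Hypothetical variant of DATA_RE: [^\W\d_]+ matches word characters
# that are neither digits nor underscores, so accented letters are accepted.
UNICODE_DATA_RE = re.compile(r"(?P<term>[^\W\d_]+)\s+(?P<weight>\d+\.\d+)")


def find_pairs(rest):
    """Return (term, weight) tuples found in the part of a line after the Id."""
    return [(m["term"], float(m["weight"])) for m in UNICODE_DATA_RE.finditer(rest)]


pairs = find_pairs("café 0.733, naïve 0.700,")
print(pairs)
```

Whether this class is broad enough depends on what your terms actually contain; terms with hyphens or apostrophes would still need a wider pattern.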















            edited 47 mins ago

            answered 56 mins ago

            Brad Solomon








            • Brilliant answer, I must say. – amanb, 50 mins ago

            • @amanb Thank you! – Brad Solomon, 46 mins ago




























            You can use the DataFrame constructor if you massage your input to the appropriate format. Here is one way:



            import pandas as pd
            from itertools import chain

            text="""1: frack 0.733, shale 0.700,
            10: space 0.645, station 0.327, nasa 0.258,
            4: celebr 0.262, bahar 0.345 """

            df = pd.DataFrame(
                list(
                    chain.from_iterable(
                        # inner map: one (Id, Term, weight) tuple per comma-separated pair
                        map(lambda z: (y[0], *z.strip().split()), y[1].split(",")) for y in
                        # outer map: one [Id, rest-of-line] list per line
                        map(lambda x: x.strip(" ,").split(":"), text.splitlines())
                    )
                ),
                columns=["Id", "Term", "weight"]
            )

            print(df)
            #   Id     Term weight
            #0   1    frack  0.733
            #1   1    shale  0.700
            #2  10    space  0.645
            #3  10  station  0.327
            #4  10     nasa  0.258
            #5   4   celebr  0.262
            #6   4    bahar  0.345


            Explanation



            I assume that you've read your file into the string text. The first step is to strip leading/trailing commas and whitespace from each line, and then split it on the colon:



            print(list(map(lambda x: x.strip(" ,").split(":"), text.splitlines())))
            #[['1', ' frack 0.733, shale 0.700'],
            # ['10', ' space 0.645, station 0.327, nasa 0.258'],
            # ['4', ' celebr 0.262, bahar 0.345']]


            The next step is to split on the comma to separate the values, and assign the Id to each set of values:



            print(
                [
                    list(map(lambda z: (y[0], *z.strip().split()), y[1].split(","))) for y in
                    map(lambda x: x.strip(" ,").split(":"), text.splitlines())
                ]
            )
            #[[('1', 'frack', '0.733'), ('1', 'shale', '0.700')],
            # [('10', 'space', '0.645'),
            #  ('10', 'station', '0.327'),
            #  ('10', 'nasa', '0.258')],
            # [('4', 'celebr', '0.262'), ('4', 'bahar', '0.345')]]


            Finally, we use itertools.chain.from_iterable to flatten this output, which can then be passed straight to the DataFrame constructor.



            Note: The * unpacking inside a tuple literal requires Python 3.5 or later.
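For comparison, here is a minimal sketch of the same idea written as explicit loops instead of nested map calls. It is not the code from the answer above: it uses str.partition rather than split(":"), skips empty chunks, and converts Id and weight to numbers along the way. The variable names are my own.

```python
import pandas as pd

text = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345 """

rows = []
for line in text.splitlines():
    # "1: frack 0.733, shale 0.700," -> Id "1" and the comma-separated pairs
    id_, _, pairs = line.partition(":")
    for pair in pairs.split(","):
        if pair.strip():  # skip the empty chunk left by a trailing comma
            term, weight = pair.split()
            rows.append((int(id_), term, float(weight)))

df = pd.DataFrame(rows, columns=["Id", "Term", "weight"])
print(df)
```

The loop version is longer, but each step can be inspected on its own, and the numeric columns come out as int64 and float64 rather than strings.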














                edited 47 mins ago

                answered 52 mins ago

                pault





































                    Assuming your data file looks like the sample given:



                    import pandas as pd

                    # a separator longer than one character needs the Python parsing engine
                    df = pd.read_csv('untitled.txt', sep=': ', header=None, engine='python')
                    df.set_index(0, inplace=True)

                    # split the `,`
                    df = df[1].str.strip().str.split(',', expand=True)

                    #    0             1              2           3
                    #--  ------------  -------------  ----------  ---
                    # 1  frack 0.733   shale 0.700
                    #10  space 0.645   station 0.327  nasa 0.258
                    # 4  celebr 0.262  bahar 0.345

                    # stack and drop empty
                    df = df.stack()
                    df = df[~df.eq('')]

                    # split ' '
                    df = df.str.strip().str.split(' ', expand=True)

                    # edit to give final expected output:

                    # rename index and columns for reset_index
                    df.index.names = ['Id', 'to_drop']
                    df.columns = ['Term', 'weight']

                    # final df
                    final_df = df.reset_index().drop('to_drop', axis=1)































                    • How are you not getting an error from sep=': ', which is a two-character separator? – Rebin, 36 mins ago

                    • @Rebin add engine='python' – pault, 33 mins ago

                    • @pault weird, 'cause I already split by ' '. It yields correct data on my computer. – Quang Hoang, 30 mins ago

                    • I don't know how to add engine python. What is the command? – Rebin, 29 mins ago

                    • @Rebin add it as a param to pd.read_csv - df = pd.read_csv(..., engine='python') – pault, 27 mins ago
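To make the engine='python' suggestion concrete, here is a small self-contained sketch. It reads from an in-memory StringIO instead of 'untitled.txt' (an assumption so the snippet runs anywhere), using the sample lines that appear elsewhere on this page.

```python
from io import StringIO

import pandas as pd

# In-memory stand-in for 'untitled.txt'.
data = StringIO(
    "1: frack 0.733, shale 0.700,\n"
    "10: space 0.645, station 0.327, nasa 0.258,\n"
    "4: celebr 0.262, bahar 0.345 \n"
)

# A separator longer than one character is treated as a regular expression,
# which the Python parsing engine supports, so request that engine explicitly.
df = pd.read_csv(data, sep=": ", header=None, engine="python")
print(df)
```

Without the engine argument, pandas should still fall back to the Python engine for such a separator, but it emits a ParserWarning while doing so.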
























                    edited 35 mins ago

                    answered 48 mins ago

                    Quang Hoang



























                    Here is another take on your question: build a list that holds one [Id, Term, weight] list per term, and then produce the dataframe from it.



                    import pandas as pd

                    file = r"give_your_path".replace('\\', '/')
                    my_list_of_lists = []  # will contain one [Id, Term, weight] list per term
                    with open(file, "r") as f:
                        for line in f:  # loop over every line
                            my_id = [line.split(":")[0]]  # store the Id to reuse it for every term
                            # skip empty chunks (e.g. after a trailing comma) instead of dropping the last one
                            for term in [s.split() for s in line[line.find(":") + 1:].split(",") if s.strip()]:
                                my_list_of_lists.append(my_id + term)
                    df = pd.DataFrame.from_records(my_list_of_lists)  # turn the lists into a dataframe
                    df.columns = ["Id", "Term", "weight"]  # give the columns their names
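Note that this approach leaves every column as strings. If numeric dtypes are wanted, one possible follow-up is DataFrame.astype; the small frame below is a made-up stand-in for the result above so that the snippet runs on its own.

```python
import pandas as pd

# Stand-in for the frame built above: every column holds strings.
df = pd.DataFrame(
    [["1", "frack", "0.733"], ["10", "space", "0.645"], ["4", "bahar", "0.345"]],
    columns=["Id", "Term", "weight"],
)

# Cast the numeric columns; Term stays as text.
df = df.astype({"Id": "int64", "weight": "float64"})
print(df.dtypes)
```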















                        answered 36 mins ago

                        JoPapou13





































                            It is possible to do this entirely in pandas:



                            from io import StringIO
                            import pandas as pd

                            df = pd.read_csv(StringIO(u"""1: frack 0.733, shale 0.700, 
                            10: space 0.645, station 0.327, nasa 0.258,
                            4: celebr 0.262, bahar 0.345 """), sep=":", header=None)

                            #df:
                            #    0   1
                            #0   1   frack 0.733, shale 0.700,
                            #1  10   space 0.645, station 0.327, nasa 0.258,
                            #2   4   celebr 0.262, bahar 0.345


                            Turn column 1 into lists and then expand each list into rows:



                            df[1] = df[1].str.split(",", expand=False)

                            dfs = []
                            for idx, rows in df.iterrows():
                                # repeat the Id once per term found on this line
                                dfslice = pd.DataFrame({"Id": [rows[0]]*len(rows[1]), "terms": rows[1]})
                                dfs.append(dfslice)
                            newdf = pd.concat(dfs, ignore_index=True)

                            # this creates newdf:
                            #   Id          terms
                            #0   1    frack 0.733
                            #1   1    shale 0.700
                            #2   1
                            #3  10    space 0.645
                            #4  10  station 0.327
                            #5  10     nasa 0.258
                            #6  10
                            #7   4   celebr 0.262
                            #8   4    bahar 0.345


                            Now we need to split the terms column on the space and drop the empty rows:



                            newdf["terms"] = newdf["terms"].str.strip()
                            newdf = newdf.join(newdf["terms"].str.split(" ", expand=True))
                            newdf.columns = ["Id", "terms", "Term", "Weights"]
                            newdf = newdf.drop("terms", axis=1).dropna()


                            Resulting newdf:



                               Id     Term Weights
                            0   1    frack   0.733
                            1   1    shale   0.700
                            3  10    space   0.645
                            4  10  station   0.327
                            5  10     nasa   0.258
                            7   4   celebr   0.262
                            8   4    bahar   0.345
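The explicit `iterrows` loop can also be avoided. The following is a minimal sketch, not part of the original answer, that assumes pandas 0.25 or newer (for `DataFrame.explode`) and uses the same sample text; the names `raw` and `long_df` are only illustrative. It also converts the weights to floats, which the steps above leave as strings.

```python
from io import StringIO

import pandas as pd

raw = """1: frack 0.733, shale 0.700,
10: space 0.645, station 0.327, nasa 0.258,
4: celebr 0.262, bahar 0.345
"""

# Same first step as above: split each line into an Id and the rest.
df = pd.read_csv(StringIO(raw), sep=":", header=None, names=["Id", "terms"])

# Turn "a 1, b 2," into a list, then give every list element its own row.
long_df = df.assign(terms=df["terms"].str.split(",")).explode("terms")
long_df["terms"] = long_df["terms"].str.strip()

# Drop the empty pieces left behind by trailing commas.
long_df = long_df[long_df["terms"] != ""].reset_index(drop=True)

# Split "frack 0.733" into two columns and make the weight numeric.
long_df[["Term", "Weights"]] = long_df["terms"].str.split(" ", expand=True)
long_df["Weights"] = long_df["Weights"].astype(float)
long_df = long_df.drop(columns="terms")
print(long_df)
```

Because the index is reset after filtering, the rows are numbered 0 to 6 instead of keeping the gaps at 2 and 6.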





                                answered 33 mins ago









                                Rocky Li























                                    0














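                                    Could I assume that there is just one space before each term? The code below splits on whitespace, so it also works if there are more:



                                    import pandas as pd

                                    df = pd.DataFrame(columns=['ID', 'Term', 'Weight'])
                                    with open('C:/random/d1', 'r') as readObject:
                                        for line in readObject:
                                            # drop the newline, surrounding spaces and the trailing comma
                                            line = line.strip().rstrip(',')
                                            if not line:
                                                continue
                                            tempList1 = line.split(':')
                                            tempList2 = tempList1[1].split(',')
                                            for item in tempList2:
                                                # split() without arguments ignores the space(s) before the term
                                                e = item.split()
                                                tempRow = [tempList1[0], e[0], e[1]]
                                                df.loc[len(df)] = tempRow
                                    print(df)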
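Appending with `df.loc[len(df)]` grows the frame one row at a time, which gets slow for large files. A common alternative is to collect plain tuples first and build the frame once at the end. The sketch below shows only the parsing part in plain Python; `parse_lines` and `sample` are hypothetical names, and the sample lines are the ones from the question.

```python
def parse_lines(lines):
    """Return (id, term, weight) tuples from lines like '1: frack 0.733, shale 0.700,'."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        doc_id, _, rest = line.partition(":")
        for item in rest.split(","):
            parts = item.split()
            if len(parts) != 2:
                continue  # empty piece after a trailing comma, or a malformed pair
            term, weight = parts
            rows.append((doc_id.strip(), term, float(weight)))
    return rows


sample = [
    "1: frack 0.733, shale 0.700, ",
    "10: space 0.645, station 0.327, nasa 0.258,",
    "4: celebr 0.262, bahar 0.345 ",
]
rows = parse_lines(sample)
print(rows[:2])
```

The list can then be handed to `pd.DataFrame(rows, columns=['ID', 'Term', 'Weight'])`, and an open file object can be passed in place of `sample`.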





                                        answered 27 mins ago









                                        Rebin























                                            0














                                            Just to put my two cents in: you could write yourself a parser and feed the result into pandas:



                                            import pandas as pd
                                            from parsimonious.grammar import Grammar
                                            from parsimonious.nodes import NodeVisitor

                                            file = """1: frack 0.733, shale 0.700,
                                            10: space 0.645, station 0.327, nasa 0.258,
                                            4: celebr 0.262, bahar 0.345
                                            """

                                            grammar = Grammar(
                                                r"""
                                                expr    = line+

                                                line    = id colon pair*
                                                pair    = term ws weight sep? ws?

                                                id      = ~"\d+"
                                                colon   = ws? ":" ws?
                                                sep     = ws? "," ws?

                                                term    = ~"[a-zA-Z]+"
                                                weight  = ~"\d+(?:\.\d+)?"

                                                ws      = ~"\s+"
                                                """
                                            )

                                            tree = grammar.parse(file)

                                            class PandasVisitor(NodeVisitor):
                                                def generic_visit(self, node, visited_children):
                                                    return visited_children or node

                                                def visit_pair(self, node, visited_children):
                                                    term, _, weight, *_ = visited_children
                                                    return (term.text, weight.text)

                                                def visit_line(self, node, visited_children):
                                                    id, _, pairs = visited_children
                                                    return [(id.text, *pair) for pair in pairs]

                                                def visit_expr(self, node, visited_children):
                                                    return [item for lst in visited_children for item in lst]

                                            pv = PandasVisitor()
                                            result = pv.visit(tree)

                                            df = pd.DataFrame(result, columns=["Id", "Term", "weight"])
                                            print(df)


                                            This yields:



                                               Id     Term weight
                                            0   1    frack  0.733
                                            1   1    shale  0.700
                                            2  10    space  0.645
                                            3  10  station  0.327
                                            4  10     nasa  0.258
                                            5   4   celebr  0.262
                                            6   4    bahar  0.345
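To see what the grammar's terminals match on their own, they can be tried with the standard `re` module. This is only an illustrative sketch, assuming the usual patterns `\d+` for the id and `\d+(?:\.\d+)?` for the weight; `ID`, `PAIR` and `line` are names made up for the example, and it is not a replacement for the parser above.

```python
import re

# The id terminal: one or more digits at the start of the line.
ID = re.compile(r"\d+")

# term, whitespace, weight: the core of the `pair` rule in one pattern.
PAIR = re.compile(r"([a-zA-Z]+)\s+(\d+(?:\.\d+)?)")

line = "10: space 0.645, station 0.327, nasa 0.258,"

doc_id = ID.match(line).group()
pairs = PAIR.findall(line)

print(doc_id)
print(pairs)
```

The grammar adds what the bare regexes lack: it rejects input that does not fit the expected line structure instead of silently skipping it.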




                                                answered 3 mins ago









                                                Jan























                                                    -3














                                                    1) You can read the file row by row:



                                                    with open('path/filename.txt', 'r') as filename:
                                                        content = filename.readlines()



                                                    2) Then you can separate by ':' for your index and ',' for the values. The ':' split, with the surrounding whitespace stripped, looks like this:



                                                    content = [[part.strip() for part in x.split(':')] for x in content]



                                                    This will give you the following result:



                                                    content = [
                                                        ['1', 'frack 0.733, shale 0.700,'],
                                                        ['10', 'space 0.645, station 0.327, nasa 0.258,'],
                                                        ['4', 'celebr 0.262, bahar 0.345']]
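To get from this intermediate list to an Id/Term/Weight table, the ',' split mentioned in step 2 still has to be applied. A minimal sketch, assuming `content` looks like the list shown here (the values are copied from this answer; `rows` and the column names are illustrative):

```python
import pandas as pd

# Result of the ':' split (values copied from the answer above).
content = [
    ['1', 'frack 0.733, shale 0.700,'],
    ['10', 'space 0.645, station 0.327, nasa 0.258,'],
    ['4', 'celebr 0.262, bahar 0.345 '],
]

# One (id, term, weight) tuple per comma-separated pair; the empty piece
# after a trailing comma is skipped.
rows = [
    (doc_id, *pair.split())
    for doc_id, values in content
    for pair in values.split(',')
    if pair.strip()
]

df = pd.DataFrame(rows, columns=['Id', 'Term', 'Weight'])
df['Weight'] = df['Weight'].astype(float)
print(df)
```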
























                                                    • 2





                                                      Your result is not the result asked for in the question.

                                                      – GiraffeMan91
                                                      1 hour ago
















                                                    answered 1 hour ago









                                                    CedricLy

                                                    11

































